Messaging Layer Security (MLS) with Raphael Robert

Messaging Layer Security (MLS) 1.0 is (basically) here! We invited Raphael Robert, coauthor of the MLS specification, to explain it to us and answer our annoying questions (read: why does this exist?).

This transcript has been edited for length and clarity.

Deirdre: Hello, welcome to Security Cryptography Whatever. I’m Deirdre.

David: I’m David.

Thomas: Hi I’m Tom.

Deirdre: And then we have a special guest today, Raphael Robert from Phoenix R&D. Hi Raphael.

Raphael: Hi everybody. Thanks for having me.

Deirdre: Yay. He’s our special guest today because we’re talking about MLS, Messaging Layer Security, a new, almost-done IETF standard. Like, what is the status?

Raphael: So the status is, we are almost done because MLS has received a positive review from the IESG, which is the steering group at IETF. So that means the protocol is essentially done. What’s left is what’s called the RFC editor phase: so during that time, essentially some polishing happens, some wordsmithing, I think there is a style guide.

And that takes, it can take a few weeks, it can take a few months. I think the current guesstimate from the chairs of the working group was like a couple of months from now, but it also means that the authors of the document cannot really change anything at this point. Nothing fundamental anymore, so, yeah.

Deirdre: Very good. Congratulations. I am a co-author on an IRTF document that we hope will be reaching a similar phase soon. So kudos on making it all the way through for a much larger protocol.

Before we get into the weeds too much, can you tell us a little bit more about the protocol, messaging layer security? MLS sounds similar to TLS, but it’s about a messaging protocol.

Raphael: That is true. And the similarity is very much on purpose, I think. But I’ll get into the details in a bit. So, on a high level, MLS is a protocol for end-to-end encryption, and specifically in groups. So it can be in very small groups, in groups of two, but it can also be much larger groups with thousands of members, essentially.

Deirdre: Ooh. When you say thousands, is that like several thousand, or is that like exactly 1024 like in some apps?

Raphael: No, definitely several. So there is no like practical upper limit. I think there’s a theoretical limit of 1 billion, but before you reach that, you run into a lot of other problems that have nothing to do with MLS.

Deirdre: Uh huh.

Raphael: In theory, it can grow quite large; we haven’t seen really large deployments yet.

I think Cisco has rolled it out a while back, from an earlier draft, and they’re running it on calls with I think, more than a thousand participants. So I think this is the largest, you know, real life deployment we have so far. But who knows? I expect to see larger groups than that.

Deirdre: Awesome. And so can you tell us more about where MLS came from? We have Signal protocol. Signal protocol is, a bunch of things, all wearing a hat that says, ‘I am Signal.’ It’s not necessarily just one thing. It’s several things together, I would say. And the kind of general understanding, rather than like a formally modeled protocol understanding.

Can you tell us a little bit more about how MLS came to be?

Raphael: Yeah, sure. So around like five or six years back, like across the industry, there was a problem that encrypting in groups was not very efficient. So what we had at that point were essentially one-to-one protocols. So we call them pairwise protocols: there was OTR v3, then there was Signal, which at the time was called the Double Ratchet, plus X3DH.

Deirdre: It still has that under the hood, right?

Raphael: It’s a good point you raise. I’m not sure what exactly the Signal protocol is. But essentially it’s the sum of all these, you know, technical sub-components. So the more comparable ones would be Double Ratchet and X3DH essentially, if you want to compare it to MLS.

Yeah. So long story short, that was, and still is a great protocol, but it doesn’t scale very well in groups. So what we’ve seen, for example, WhatsApp, that has implemented the protocol, used something on top of that called sender keys, which essentially uses a Signal protocol or double ratchet to bootstrap another much smaller protocol where keys are being distributed among all the members of a group.

And then that key material is reused to encrypt messages exactly once, so, way more efficient than using a pairwise protocol. But on the security side, it’s pretty bad, because all you get is forward secrecy, if you want, but having post-compromise security is really expensive and then you end up not doing it very frequently.

So this is, on a technical level, this is what we wanted to solve.

Deirdre: Got it. And post-compromise security, as a refresher, is after a secret has been exposed, continually running the protocol with honest participants will heal basically your shared secret.

Raphael: Yes, yes.

Deirdre: Yeah. Cool. Very good.

Thomas: We’re talking about, when we talk about the WhatsApp group problem, in my head I have sort of a division between, the group that we’re talking about when we’re making plans to go to dinner, and then the group that we’re talking about when we’re thinking something on the scale of a Telegram group. Right? Or a Slack channel, right? Where Signal is probably fine scalability-wise for organizing a dinner, but not a good way to run a Slack channel. Is that, that’s roughly the way to think about it?

Raphael: Yeah, probably. I mean, phones and computers get faster every year, so whereas, the size of a social group doesn’t necessarily increase in that sense. Signal is, you know, increasingly apt to encrypt for smaller groups. But what I heard several times is that, if you do have larger groups, then battery life is a concern, even if it feels snappy enough from users’ perspective.

So, vendors of popular messaging apps, let’s say they really care about the performance there. Of course, Telegram-sized groups and that’s the other end, and there you could ask if that is still a messenger even, or if it’s some sort of social network at that point. And, to what extent, end-to-end encryption makes sense in a really, really large group.

But there is something in between; also for business communication where you could have hundreds or thousands of employees in one group so that you run into limits.

Deirdre: I can imagine the Apples, who are over a hundred thousand strong, definitely would like all their communications, even if they’re having all their employees on one Apple Slack, to have usable end-to-end encryption, and, you know, efficient, not just usable.

Okay, so we had originally OTR, Signal pairwise; if you have a thousand people in the Signal group, which I think you can do, and you’ve been able to do for a while, you have to do pairwise ratcheting and key agreement between all of these parties, and then they layer some other stuff on top.

Thomas: It’s kind of straightforward to kind of get a sense of how that would work with Signal, where the whole protocol all comes down to essentially this Diffie-Hellman ratchet: every time we’re sending a message or whatever, we’re gonna, you know, renegotiate a new ratcheting key forward. Right. And you see how that would work pair-wise.

And then, I’m a little familiar with the earlier specs for MLS, but the general approach of how you would do that for a group without having to run the protocol one for one between every pair of participants, right. Like how does that roughly work?

Raphael: Yeah, excellent question. So on a very high level, MLS uses a binary tree to solve exactly that problem. So a binary tree is essentially a mechanism that converts a linear problem into a logarithmic one, if you’re lucky, it doesn’t work with all problems, unfortunately, but. So what happens there is that there is a key agreement among all the members of a group.

And so the group needs to warm up a little bit before it becomes efficient, in the sense that every participant needs to update their key material once, but after that is done, you can essentially rotate key material or remove people from the group, in O(log n), N being the number of members in the group.

Deirdre: And members of the group is keys: N device keys—

Raphael: That’s a good point. So MLS is fairly abstract in that sense. It could be users, it could be their devices. The recommendation is to have unique key material per device, and not copy keys from one device to another for obvious reasons.

Deirdre: It’s not prohibitive against that scenario if you had some well motivated scenario.

Thomas: That’s an interesting point about MLS, right? Because that problem has motivated other kind of bespoke method protocols, you know, come up with ways to, if I add a device, sync up the keys or whatever. And if you get to the point where it’s efficient to fan out to huge numbers of people, then you might as well just, then your problem is just group membership at that point, right. It’s just, is it okay for this person to join the group, but it’s no longer, how do we sync keys up?

When we talk about it being O(log n) with respect to the number of keys, the log n there is the number of key update operations we’re talking about. Is it the number of scalar mults or whatever in the elliptic curve algorithm? Like, what’s the actual figure of merit we’re talking about?

Raphael: Yeah, that’s pretty accurate. So essentially, it’s based on KEMs and KEMs is DH under the hood if it’s not post quantum resistant. So essentially, yeah.

Deirdre: But MLS supports post quantum?

Raphael: Yes, I dropped that on purpose. So yeah, but essentially it is, on an abstract level, iterations of, whatever the crypto primitive is going to be, including also then the payload size of the resulting message.

Deirdre: Awesome. So, instead of pairwise, we have this group tree, binary tree structure, based on KEMs, so that when we have a little bit of bootstrapping, but we create the group, we add keys, devices / users to the group tree structure. The tree KEM—, is it still called tree KEM? Did it get another name?

Raphael: TreeKEM is correct. Yes.

Deirdre: Good. And then every add, key rotation, or update— is it update for any one member? Or any one key, or is it just like a global update or is there a difference?

Raphael: So members can update their key material, and what that means is, they sample some random new encryption keys, and that gives them post-compromise security. So the threat model here is that we have a notion of a session in a way.

Deirdre: Okay.

Raphael: This session is super long, meaning it starts when you enter the group and it ends when you leave the group.

And that can be years essentially. So the underlying assumption is that during that time, your device could get compromised, one way or another, and post-compromise security means that you can actually get out of that situation if the compromise is not long term. So you should actually regularly update your key material, at which point you send what’s called a commit message in the MLS jargon to everybody.

And that message has a number of KEMs in it, log N, but it’s still one message. And that is being distributed to everybody. So everybody receives the same message and they pick whatever KEM is appropriate for them. So if you picture a binary tree, all the members are at the leaves, and then these KEMs, essentially they’re on the path of the updating member all the way to root, and the length of that path is log N.
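As a rough illustration of why the path from a leaf to the root has length log N, here is a minimal Python sketch. It assumes a complete binary tree with a power-of-two number of leaves and a heap-style numbering (root is node 1); the function name `direct_path` and that numbering are illustrative assumptions, not the node layout the MLS specification itself uses.

```python
def direct_path(leaf_index: int, num_leaves: int) -> list:
    """Return the nodes from a leaf's parent up to the root.

    Assumed layout (illustrative only): heap numbering where the root is 1,
    the children of node i are 2*i and 2*i + 1, and the leaves occupy
    indices num_leaves .. 2*num_leaves - 1.
    """
    if num_leaves < 1 or num_leaves & (num_leaves - 1):
        raise ValueError("num_leaves must be a power of two")
    if not 0 <= leaf_index < num_leaves:
        raise ValueError("leaf_index out of range")
    node = num_leaves + leaf_index
    path = []
    while node > 1:
        node //= 2          # step to the parent
        path.append(node)
    return path


# One entry per level of the tree, so the length is log2(num_leaves).
print(direct_path(5, 8))            # nodes above leaf 5 in an 8-member group
print(len(direct_path(5, 1024)))    # path length in a 1024-member group
```

Under these assumptions a group of 1024 members needs a path of only 10 nodes, which is where the log N count of KEMs in a commit comes from.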

Deirdre: Excellent. And then of course you have your delete operation so you can leave and delete yourself or the “admin” of the group or any, I think this goes into a higher layer of MLS, which is sort of like the application level of like who is in charge or who maintains the edit bits of who can add people, remove people from the group.

So that might get us to, what things MLS the protocol is not solving in this first iteration—

Thomas: Before we, yeah, before we get there, the one place where I feel like I don’t have a great intuition for how this works is, yeah, I see how we build the group membership and I see how we key a group, Right? So like, assume that we’re in some steady state of group membership. I under, I, I have an intuition for how we can have a binary tree of, you know, key agreements or whatever to, you know, make the whole group addressable from a single key or a subgroup addressable from some sub key or whatever.

And then do that by individual pair wise, between just a couple of members and then kind of aggregate that up. That makes sense to me. Where I’m not clear is how the ratcheting part of this works, right? Obviously it does ratchet, right? But like the post-compromise security that we’re talking about, I see how we update the tree when group members get added or removed or when somebody adds a device, right?

But as the protocol runs and people are sending thousands of messages, what’s updating?

Raphael: So, what’s updating, when you send an update is that, first and foremost you update your own encryption key pair, and that is what gives you post-compromise security. So, if you start using a new key pair and I KEM a message to that, that’s essentially PCS, as simple as that.

But in order for that to work in the whole group, and in order for you not to distribute your key, like individually to everybody, and then everybody having to, you know, individually also send messages, that’s where the binary tree comes in. So instead of encrypting it individually to everybody, you encrypt it to this tree.

So yeah, I understand that that’s a bit abstract at this point. Let’s say that every member of the group has access to a certain part of the tree. And so one of these KEMs that go from your leaf to the root is gonna be interesting for me because I will have the key material to at least decrypt one of them.

Thomas: Got it. Yeah, and essentially what we’re doing is like, in the sense of designing a ratchet or whatever, all we’re really we’re doing is we’re abstracting away the single person that we were talking to before and replacing it with kind of like a set of possible keys that we’re sending it to. And then everything else kind of runs the same way.

So it’s just that in the abstract when we’re sending that message or when we’re running the messaging protocol, we’re talking to a key that, you know, resolves to a group of people as opposed to just one person.

Good. Okay. I’m not totally off on this.

David: So is it reliant then on each user effectively kind of updating their key after they send a message?

Raphael: So users should do that regularly, or devices should do it like fully automatically, regularly. And so the more often you do it, the better. Obviously you need to find some sweet spot of what works and what doesn’t. Of course, if the groups are really large, you cannot do it as often anymore. Intuitively, that would be too noisy.

And it depends on what your threat model is exactly, but you should definitely do it when you join the group and then with some regularity afterwards. And going back to what Deirdre said earlier, at the protocol level, everybody’s equal in the group: it’s full democracy or full anarchy depending on how you wanna call it. There’s no notion of admins and non-admins, and that is something you can implement essentially on the application layer above. And it’s fairly simple because, all you need is agreement on who’s an admin and who’s not.

Deirdre: Mm-hmm.

Raphael: MLS is actually quite interesting in the sense that, the whole thing chronologically is divided into epochs and since messages have to be ordered, which sounds like a downside at first (and to some extent it is) but it has a huge upside. And that means that everybody can agree on state. So it’s a great mechanism to, synchronize state among, different devices, asynchronously. So if you have some sort of policy on who’s an admin and who’s not, that’s very easy to sync that among all the clients that participate in a group.

Then everybody completely agrees and you even have a cryptographic guarantee because you know this arbitrary data can get hashed into the key material. So you have a cryptographic guarantee that everybody sees the same list of admins, sees the same list of, of non-admins and general members and whatnot.

And then it becomes very easy because if a non-admin tries to add someone to a group, you just drop that message, just reject it. If the server doesn’t do that for you, you can still do that at the client level.
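The client-side check described here can be sketched in a few lines of Python. This is only an illustration of an application-layer policy: the `AddRequest` type, the `apply_add` function, and the member names are hypothetical and are not part of MLS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddRequest:
    sender: str       # hypothetical identifier of the member asking for the add
    new_member: str   # hypothetical identifier of the member to be added


def apply_add(members: frozenset, admins: frozenset, request: AddRequest) -> frozenset:
    """Return the new member set, or the unchanged set if the request is rejected.

    Policy assumed for this sketch: only senders on the agreed admin list
    may add someone; anything else is simply dropped by the client.
    """
    if request.sender not in admins:
        return members
    return members | {request.new_member}


members = frozenset({"alice", "bob"})
admins = frozenset({"alice"})

# Bob is not an admin, so every client ignores his request.
after_bob = apply_add(members, admins, AddRequest("bob", "mallory"))
# Alice is an admin, so her request is honored.
after_alice = apply_add(members, admins, AddRequest("alice", "carol"))
```

Because every client holds the same agreed admin list, each one reaches the same decision independently, even if the server forwards the message anyway.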

Deirdre: That’s cool. That sounds like it makes encrypting and syncing stateful things in a private way, very doable as well, which is like, I probably will throw these links in the show notes, but Signal didn’t have this sort of, nice syncable group state from the start because they basically built groups out of these pairwise, double ratchets and triple Diffie-Hellmans, and they built a thing that basically allowed the group membership and other group metadata to remain private from Signal the service itself with anonymous credentials and encrypting the thing and you being able to disassociate who is updating the Signal group metadata from who is authenticating against the Signal server with anonymous credentials and things like this.

It sounds like MLS, because of this cryptographically secured group metadata, because of the tree KEM, allows this to be kind of easier to do with cryptographic guarantees. But, to be clear, that’s not in MLS.

It’s a thing that MLS may enable service providers to do in a nice way.

Thomas: But that’s super interesting. It’s like a thing I didn’t realize that— it should have occurred to me, right? But, on the spectrum of ways to—, so first of all, like in a group, in a secure group messaging system, group membership is key distribution, right? Like it’s the core security decision that you’re making for the protocols: who’s gonna be allowed to, once you’re allowed in the group, can read messages to the group, you have the keys for the group, right?

So, there’s a spectrum of ways to approach that. The original Signal way of doing it was essentially that if you wanted to create a group, you would come up with an unguessable secret, just like a 128 bit number or something like that. Broadcast that to your friends, the people you wanted to be in the group.

And then knowing that secret ID was the thing that enabled you to link up to the group. The other end of that spectrum looks more like Matrix, which is a modern protocol with kind of this surprising design feature, which is that the server kind of controls who’s a member of a group, and so, you have end-to-end security between group members, but then you have a server in the middle that’s determining who gets the keys, which kind of breaks the model. Right?

And like, this sounds closer to the original Signal, like the, the thing we’re describing with using authenticated data and, nominated group members that are kind of confirmed in the protocol, kind of pushes things back towards the model where, individual participants or clients or endpoints are actively involved in the security decision of who’s in the group.

So it sounds to me like you’re saying that one thing that MLS does is make it possible to design secure group membership protocols that don’t depend on a server making sane decisions about who’s in the group.

Raphael: Oh yeah, absolutely. I mean, essentially everything is signed. It’s very easy. So, the list of members is hashed and then fed into the key schedule. So that’s how you have agreement on who’s in the group and who’s not. So, it’s very subtle, but obviously on Signal when you are with a Signal protocol, when you send a message, you know exactly who’s going to receive it, because you encrypt it individually for everybody, so logically you know who you’re encrypting it to.

You know, the same is true for MLS, but the one addition here is that, when you receive a message, you also know who the sender thought they were sending it to. So it’s very subtle, but that’s essentially the, the guarantee you get.
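To make the idea of hashing group state into key material concrete, here is a minimal Python sketch. It is not the MLS key schedule: the function names, the use of plain HMAC-SHA-256, and the sorted-name encoding are all assumptions chosen for illustration. The point is only that two clients derive the same key if and only if they see the same members and admins.

```python
import hashlib
import hmac


def _length_prefixed(data: bytes) -> bytes:
    """Prefix a field with its length so concatenated fields stay unambiguous."""
    return len(data).to_bytes(4, "big") + data


def group_state_hash(members, admins) -> bytes:
    """Hash the agreed group state (illustrative encoding, not the MLS one)."""
    h = hashlib.sha256()
    for label, names in (("members", members), ("admins", admins)):
        h.update(_length_prefixed(label.encode()))
        h.update(len(names).to_bytes(4, "big"))
        for name in sorted(names):
            h.update(_length_prefixed(name.encode()))
    return h.digest()


def epoch_key(group_secret: bytes, state_hash: bytes) -> bytes:
    """Mix the state hash into the key (a stand-in for a real key schedule)."""
    return hmac.new(group_secret, state_hash, hashlib.sha256).digest()


secret = b"\x01" * 32  # placeholder for the secret the group agreed on
key_alice = epoch_key(secret, group_state_hash(["alice", "bob", "carol"], ["alice"]))
key_bob = epoch_key(secret, group_state_hash(["carol", "alice", "bob"], ["alice"]))
key_odd = epoch_key(secret, group_state_hash(["alice", "bob", "carol"], ["alice", "bob"]))
```

In this sketch `key_alice` and `key_bob` match because both clients hold the same view of the group, while `key_odd` differs because that client believes Bob is also an admin, so its messages would fail to decrypt for everyone else.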

Deirdre: That’s very cool because that’s the thing you don’t get in Signal groups: you don’t know, everyone else that this person was trying to send to, because it’s all pairwise. And if you receive a message and you’re like, why is Bobby trying to send this to some key that I’ve never seen, that does not match what my device sees as the membership of the group?

That’s pretty cool. That does rely on, the server just kind of forwarding messages to you or whatever, the delivery service in the, in the architecture document. And then if you can have your own client that does whatever it wants, that can detect or reject or whatever it wants. But that’s cool. That’s interesting.

Raphael: The server can absolutely not inject participants because the server is not a member. So, there is this add operation, that can only be performed by an existing member. However, there is also a way for a server, or let’s say generally an outside party to suggest, uh, other members.

But that requires the outside party, you know, to have a well-defined credential and to sign that request. And then that can be honored and everybody will see that that was a suggestion from the server. And that’s a controlled way, how you can add people to a group, but you can never do that, you know, stealthily.

So, it’s fully transparent among the group members, to avoid any problems with, you know, ghost users.

Deirdre: Yeah, I was gonna say this is like a known or at least worried-about issue of just someone inserting someone into a group. And it not being, if not cryptographically detectable, it’s not easy to detect that someone is just getting looped into all your messages silently or adding a new device to an existing account or something like that.

One last thing before we kind of talk, a little bit more protocol document sort of stuff is, one thing about the KEM design as opposed to strictly Diffie Hellman design is it doesn’t give you the nice associativity and commutativity niceness of Diffie Hellman-ish primitives. Which is a, it’s a whole thing with post-quantum, is that we don’t have a nice Diffie-Hellman-like primitive, you know, RIP SIDH. Which was the, one of the closest things that we had, that’s now dead—

Thomas: You can’t drop ‘associativity’ and ‘commutativity’ into a conversation and just assume that I understand what the hell you’re talking about.

Deirdre: Okay. Basically, if you could use Diffie-Hellman things or primitives that, instead of a KEM that don’t have those properties, you could, apply updates to the tree in “any order”, and it would all compute out to the same values. But because they’re using KEMs, the ordering matters because you can’t just swap the order around.

Sort of like with multiplication, like if you do two times three, it’s the same value as three times two. It doesn’t matter what order you go, you’ll still get six.
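The order-independence of Diffie-Hellman can be checked with a small Python sketch using modular exponentiation. The modulus, generator, and secrets below are toy values picked for illustration and are far too weak for real use; the helper names are assumptions.

```python
P = 2**127 - 1   # toy prime modulus, illustrative only
G = 3            # toy generator


def public_value(secret: int) -> int:
    """Compute G^secret mod P."""
    return pow(G, secret, P)


def combine(value: int, secret: int) -> int:
    """Raise someone else's value to our secret: value^secret mod P."""
    return pow(value, secret, P)


a, b, c = 0x1234567, 0x89ABCDE, 0xF012345

# Two parties: (G^a)^b equals (G^b)^a.
shared_ab = combine(public_value(a), b)
shared_ba = combine(public_value(b), a)

# Three contributions applied in different orders still agree.
order_abc = combine(combine(public_value(a), b), c)
order_cba = combine(combine(public_value(c), b), a)
```

Every ordering ends at the same value because exponents multiply, and multiplication commutes. A KEM has no equivalent of this: an encapsulation is tied to one specific public key, which is why the order of updates matters in a KEM-based tree.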

Thomas: KEMs is more like matrix multiplication: the side I do it on matters. So I know enough to write Diffie Hellman, right? What’s the intuition for— KEMs is based on, at root, for a, a conventional KEM algorithm or whatever, it’s still gonna be based on something like Diffie Hellman, right? So what, what, what’s breaking associativity and commutativity there?

Raphael: Is that a question for me?

Deirdre: Sure

Thomas: One of you knows—

Deirdre: because I just started asking the question and then answering the question instead of asking you.

Raphael: Yeah, so I’m not an expert on, like key exchange primitives, but essentially KEMs, are a different primitive than Diffie Hellman, and it’s what’s being mandated by NIST, for example, and for the competition of the post quantum resistant, primitives, simply because that works better.

And I think you were mentioning it earlier: there were some Diffie Hellman attempts at post-quantum resistant primitives that didn’t really hold up so well. But who knows? Maybe going forward we’ll have them again. But, from a very pragmatic standpoint, if you want to have an easy way to upgrade to post-quantum resistance for a protocol, you use KEMs.

And that’s essentially one of the reasons why MLS is using KEMs. Another one, maybe this is interesting for you, originally there was something else before treeKEM, called ART: asynchronous ratcheting trees, which was completely Diffie Hellman based, and that was more flexible in the sense that, the updates of members when they update the key material, to a certain degree, you could do that in a random order.

Deirdre: Mm-hmm.

Raphael: So that was nice because you didn’t have to have this clock that synchronizes everything. It did also have some downsides; it was more of an academic effort initially, it was a great idea because it actually introduced a binary tree for messaging. Which was fairly new and sort of kickstarted the whole thing.

It had some downsides because also, it was really hard to, have dynamic groups where, you know, people join and leave. So it worked really well when you had a group with a static set of users, which is not very realistic in a messenger, obviously. So, maybe we could have worked around those problems eventually, but we didn’t at the time.

And so KEMs were a lot easier in that sense because it meant that we could have blank nodes, like empty nodes in the binary tree, which made it, you know, easy to construct them, when you bootstrap a group. It made it easy also when people were leaving because you could just delete the set of nodes between their leaf and the root.

Deirdre: Oh.

Raphael: So you would essentially punch holes in that binary tree. So that’s how you make sure that they don’t have access to private key material anymore, cuz they only knew the private keys of those nodes. So if you remove them from the tree and everybody now KEMs to other nodes, those that are left, then the evicted member cannot decrypt any of those KEMs.
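Punching holes in the tree can be pictured with a short Python sketch. It assumes a complete binary tree stored in a dict with heap-style numbering (root is node 1) and uses placeholder strings in place of real key pairs; the function name and this layout are illustrative assumptions, not the representation used by MLS.

```python
from typing import Dict, Optional


def remove_member(nodes: Dict[int, Optional[str]], leaf_index: int, num_leaves: int) -> None:
    """Blank a member's leaf and every node between that leaf and the root.

    Assumed layout (illustrative only): root is node 1, children of node i
    are 2*i and 2*i + 1, leaves sit at num_leaves .. 2*num_leaves - 1.
    A value of None marks a blank node.
    """
    node = num_leaves + leaf_index
    nodes[node] = None
    while node > 1:
        node //= 2
        nodes[node] = None


# A 4-member group: nodes 1..3 are internal, nodes 4..7 are the leaves.
tree = {i: f"key-{i}" for i in range(1, 8)}
remove_member(tree, leaf_index=2, num_leaves=4)   # evict the member at node 6

remaining = sorted(i for i, key in tree.items() if key is not None)
```

After the eviction only nodes 2, 4, 5 and 7 still hold keys, and the evicted member never knew the private keys for any of them, which is the property described above.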

Deirdre: Mm-hmm. But we can’t do that anymore with the updated one.

Raphael: No, no, that, that’s what we can do with treeKEM essentially—

Deirdre: Oh.

Raphael: —because the tree does not need to have full nodes everywhere. In theory you can randomly delete nodes in the context of treeKEM, and it still works. It’s just that the efficiency is not gonna be as good anymore.

Deirdre: Right. Okay. This does, as you said, basically mean you have to maintain a little bit more state on all of your ends. You have to maintain a bit of a clock, you have to rely on ordering from your server or your delivery service. That sounds a little bit more annoying, but being able to delete whole swaths of the tree because of treeKEM does sound really nice.

Raphael: That’s exactly it. So the downside, if you will, is that the delivery service is essentially a clock and orders messages, so there are two ways of doing that. The more intuitive one is that that happens in real time. So whenever you try and send a commit message, the server’s gonna check, you know, what epoch is it, does your commit message have the right epoch number? And if somebody sent a message just before, then yours is gonna get rejected. And then you have to try again. So it’s really just a counter that runs on the server. In terms of state keeping it’s super easy, but it does require having this real time ordering mechanism.
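The "counter that runs on the server" can be sketched as a tiny Python class. This is a simplified model under stated assumptions: the class name `DeliveryService`, the `submit_commit` method, and the in-memory log are hypothetical, and a real service would also handle storage, authentication, and fan-out.

```python
class DeliveryService:
    """Toy model of a delivery service that orders commits by epoch."""

    def __init__(self) -> None:
        self.epoch = 0       # the current epoch of the group
        self.accepted = []   # (epoch, sender) pairs in the order they were accepted

    def submit_commit(self, sender: str, epoch: int) -> bool:
        """Accept a commit only if it targets the current epoch."""
        if epoch != self.epoch:
            return False     # somebody else got there first; the sender must retry
        self.accepted.append((epoch, sender))
        self.epoch += 1      # every accepted commit starts a new epoch
        return True


service = DeliveryService()
alice_ok = service.submit_commit("alice", epoch=0)   # first commit for epoch 0
bob_ok = service.submit_commit("bob", epoch=0)       # stale: epoch 0 is already over
bob_retry = service.submit_commit("bob", epoch=1)    # retried against the new epoch
```

In this model Alice's commit is accepted, Bob's first attempt is rejected because it raced with hers, and his retry against epoch 1 succeeds, so every client sees the same total order of commits.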

The alternative would be to not do it in real time, like in completely asynchronous systems where, you know, not even the server is guaranteed to be online. You could do it differently. You could actually fork the groups, and then reconcile them later. So if you have some mechanism to determine the order after the fact, you can still do that, but what you need to do on the clients is essentially fork groups. So you have more state.

Deirdre: That’s fascinating. Has anyone written about— is that in the architecture document?

Raphael: It has a very fancy name: there is the strongly consistent delivery service, I believe it’s called, and the eventually consistent one. It’s not super intuitive, but that’s what it is. I think this is also what Matrix has been looking at. There’s also a small and very early deployment of MLS in a protocol called P2Panda, which is some sort of alternative to Secure Scuttlebutt, if I’m not mistaken.

So it is peer-to-peer, it is decentralized, it does work. But most systems are not peer-to-peer and decentralized.

Deirdre: This is cool. We have links to the architecture document in our show notes. I’m gonna go read more about the different consistency models, cuz I’m curious about the different solutions. Awesome. Thank you very much. All right, Thomas, let’s pivot to some of your questions.

Thomas: We’re talking as if I have lots of questions, right. So I mean, I have some low level technical and standardization questions about how MLS came about, but I wanna make sure I have my head totally around the model here, right? So, we’re kind of like cackling to each other as we’re talking to you, which has to be a weird off putting experience, but we all have some like preconceived notions about IETF cryptography, which are based in like, mid two thousands ideas of how IETF cryptography design works and not, you know, more modern IETF processes, right?

And to start with, just the idea of a large, ambitious, new crypto system being designed like, afresh from IETF, not already done in the industry somewhere and then standardized out, or kind of percolating the way the Signal model kind of became for a while, the de facto standard for all new messaging crypto, right? But here, it’s like a brand new idea. It’s a new service model: it’s designed from scratch from IETF, so I’m still, this all makes sense, right?

I can see now why, what the motivation would be to do something like this, right? So, I guess the last service model question I have about this right, you’re calling this MLS, right? Which is kind of like obviously a reference to TLS, right?

It’s like multi-point TLS, is essentially what we’re talking about here. The way that I think about, like secure messaging is almost entirely motivated by privacy technology. It’s Signal, right? And before Signal it was Sigma and OTR and all that stuff. And then it’s like, commercial messaging applications.

And in those kinds of applications you do a lot of work on like a message-by-message basis to keep post-compromise security, to maximize privacy; if you imagine like a dial of like cryptographic overhead versus privacy, in Signal that dial’s all the way set over to 11: you’re assuming that, state level adversaries are trying to get all of your individual messages.

Am I right to kind of see MLS as more of a low-level protocol, that it’s not inherently privacy technology? Obviously it’s a basis for privacy technology, but, while we’re talking about this, I’m just imagining running a routing protocol on top of it, right? Lots of messages all the time.

I care a lot less about the integrity of individual messages. I have a group membership system. I want secure messaging. I want, all my points, I have individual keys and some kind of cryptographic notion of who’s in the group or not. But I don’t wanna spend a lot of overhead trying to make sure that state level adversaries can’t read a message once they’ve compromised their router for like, five minutes.

Am I right to kind of assume that, MLS gives us enough knobs where we can dial overhead down so that we can run it literally kind of as a group TLS system or as a group message— as a group DTLS system really? Is that right? Or, is there kind of intrinsically a lot of overhead in the protocol?

Raphael: Yeah, excellent question. And just to go back to what you said initially, it is definitely a greenfield project at IETF. There hasn’t been any precedent to that specifically. There has been TLS obviously, but TLS is more of an evolution of an older protocol. And has some, you know, some baggage obviously.

It mimics TLS in the sense that, you know, we try to have a super strong and tight collaboration with academia from day one, essentially to have formal verification, to have proofs because obviously when you roll your own, and that’s what we did here, you want to do that in the best possible way.

And so, they also received a lot of input from the industry. In my mind, I think of MLS as some sort of a Swiss Army knife of end-to-end encryption in a way because, it could be used in different scenarios. Like it could be used for enterprise messaging, because, you know, it is scalable, because it has all the security features, because if you don’t care about privacy, it’s fairly easy to deploy on a server, and you don’t need a lot of overhead on the architecture, et cetera.

But one thing that, has always been super important to me, was that you should be able to run MLS in an architecture that really cares about privacy as well, in the same vein as Signal does. There shouldn’t be anything prohibitive in the protocol that you couldn’t do that.

So the protocol itself is not enough to give you a completely private system, because it’s really just one component, and to a degree it is agnostic. Like, if you take Double Ratchet and X3DH: when it’s run inside of the Signal app, it’s super private. If you run that inside of WhatsApp, there’s tons of metadata, but the protocol as such is agnostic to that. And the same is true for MLS.

So running MLS in a really private environment, or in a really private way, we think that that’s possible and that’s in fact what we are working on right now.

But that’s really more on the architecture side, and not on the protocol side.

Deirdre: Mm-hmm. That ties into basically exactly what you said. You have WhatsApp, which, as a full end-to-end messaging service, has more metadata about its membership and its groups than, say, Signal. Signal looks very, very similar, but it has added other things to its full end-to-end service, such as group membership privacy from the server and from anyone else looking at what’s happening, plus other privacy features. If you put Signal and WhatsApp side by side, you’re like, oh, they both use the Signal Protocol, they both use Double Ratchet and, you know, X3DH. But it’s the next layer up, at more of the application level, that provides more of the privacy. And you’re basically saying that MLS, as the kind of swap-out for Double Ratchet and Triple Diffie-Hellman, plays a similar role, but it’s purposely trying to be efficient and to enable the more privacy-oriented applications, like a Signal, which sounds really great.

I think you mentioned earlier that Cisco is using MLS. Are they using it for, like, a messaging application, or are they using it for calling?

Raphael: I’m not affiliated with Cisco in any way; I think they’re using it for conference calling.

Deirdre: Oh, that’s interesting. So they’re using MLS to manage their keying material for a large end-to-end encrypted conference call.

Raphael: Yeah.

David: They also use it for large WebEx rooms, which is their like quasi-Slack competitor, but I’m not sure.

Raphael: That might be the case.

Deirdre: Well, I’ll stop asking you about somebody else’s product, but okay. That’s interesting. Sounds like they are using MLS to do group key agreement, but they’re not using it for Signal or Slack or WhatsApp. They’re using it (well, ignore what David just said) for the video calling part; they’re using it in a completely different setting.

They’re using it to do group key agreement, but for video calling. So that’s kind of speaking to what Thomas was basically saying, which is like, yeah, you can use it for a, new-gen Signal, you know, very large groups, end to end encrypted Slack. But you can also use it for, I just need a key that’s efficiently updateable for people coming and going from my large application, such as a big end-to-end encrypted call, web call.

David: Yeah, you don’t want pairwise keys for a video call. That’s not gonna be fun for anybody.

Deirdre: Yeah.

Raphael: Yeah, that’s true. I mean, MLS was designed for asynchronous communication, to start with. But you do have this problem for conference calling that you do have many participants and you want end to end encryption. And so at this point, MLS is the only protocol that can really give you that, more or less, so that’s why MLS is being used in that context, but it wasn’t, like specifically tuned for real time communication.
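Editor's note: David's point about pairwise keys can be made concrete with a little arithmetic. The sketch below is a rough count only, assuming a balanced binary tree of the kind used by the tree-based key agreement mentioned in this episode; the function names are hypothetical and nothing here measures a real MLS implementation.

```python
def pairwise_sessions(n: int) -> int:
    """Distinct two-party sessions needed if every pair of members keeps its own key."""
    return n * (n - 1) // 2


def tree_path_length(n: int) -> int:
    """Depth of a balanced binary tree with n leaves, i.e. ceil(log2(n)).

    Used here as a rough proxy for how many tree nodes one member touches
    when refreshing its key material in a tree-based scheme.
    """
    return (n - 1).bit_length() if n > 1 else 0


if __name__ == "__main__":
    for members in (2, 10, 100, 1000):
        print(members, pairwise_sessions(members), tree_path_length(members))
```

For a call with a thousand participants, the pairwise count is in the hundreds of thousands, while the tree depth stays around ten, which is the intuition behind not wanting pairwise keys for large groups.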

David: So why did MLS start in the IETF, though? Like, who started that? What was the origin story? Because I mean, it sounds like it went well and like clearly, Cisco’s using it for some things, Wire’s using it for some things. I’m sure other people are using it. How did that come about?

Raphael: Yeah, I think there were multiple starting points, to how it came about. So one is, the motivation to have a more efficient protocol. And so technically I think this asynchronous ratcheting tree paper is what started it on the protocol level. I think Cisco was looking into that problem.

Then later on, Mozilla and Wire and others sort of came together and thought, we should, we have the same goals here essentially, and we should bring that to IETF.

Another motivation was that, there was no, you know, full specification of the Signal protocol at the time, down to the wire format, meaning you didn’t have a document you could use to do your own implementation.

And there weren’t many implementations around, [aside from] those from Signal themselves. They were published under the GPL, which didn’t work for a lot of folks; you could license it commercially. But let’s say, end-to-end encryption was not as accessible as it is now. So now we have a full specification that academia can easily analyze, that people can use to do implementations.

We have several implementations now, some of them open source, and they interoperate. This has been happening in the past weeks.

Deirdre: Yes

Raphael: So in a way, this is a bit of a new era in terms of accessibility of end-to-end encryption technology.

Deirdre: Fantastic. To jump off that, I’ve linked several papers of, as far as I can tell, independent academic groups analyzing different properties of MLS based on the spec that’s available, based on these implementations that are available, which is very nice because everyone kind of being like, well, I’m going to look at, a very small part of the Signal service or the Signal “protocol” because you have to only agree— you can only talk about the thing that has like a well understood name of what you’re talking about to do formal analysis, modeling, you know, looking at security properties of such and such a thing.

Having that for MLS, and especially for treeKEM, makes more academic analysis possible because everyone understands what we’re talking about. They can point to a document and say, this thing, I’m going to analyze this thing. As opposed to a kind of wibbly, amorphous “the thing implemented by libsignal”, which is not everyone’s favorite.

Can you just describe what formal academic analysis has been done for MLS, versus treeKEM or I think continuous group key agreement, is like the property that has gotten the most analysis out of it? Basically, has there been a high level symbolic analysis of MLS, the protocol, versus treeKEM?

Raphael: So other aspects besides treeKEM have been looked at, but this is also, you know, where you reach the limits of what you can do today. It is a complex protocol, and when you analyze something, you always have to focus, you know, on some aspects of it. So, I don’t think every aspect of MLS has been fully covered, simply because that would be really hard.

And I’m not a trained cryptographer, with an academic background. So, I might not give you all the answers here. But I think it has received a lot of attention. I remember maybe 10 years ago, you know, there was like one paper on the Signal protocol at the time, and that was it.

Deirdre: Uh huh.

Raphael: Academia did not look at secure messaging too much, and then there was, all of a sudden, this explosion of academic interest in all of that. And obviously Signal was super interesting. But then, MLS could also get the interest from academia and have people look at it.

And that’s great because that’s, frankly, that’s how it should be done. Because if just a bunch of engineers, and I’m including myself here specifically, just have a good feeling about how a protocol should be designed, that doesn’t end well typically. So, that’s why you want the formal proofs.

Deirdre: Mm-hmm.

David: Is there any algorithm negotiation in MLS or is that, it’s built around a KEM and we mentioned there’s post-quantum KEMs and there’s like kind of DH-ish KEMs; is that something that’s chosen kind of at deploy time or is that something that’s negotiated by clients, or?

Raphael: Yeah, that’s one of the perks of IETF. Essentially you have to have some crypto agility there. You have to have ciphersuites. Some people like it, some don’t. What it means in practice is, the protocol document specifies I think five or six— six, yeah— that are all not post quantum yet, and they can get negotiated.

However, in practice it’s gonna be quite different from TLS, because you also don’t— so first of all, you don’t have these downgrade attacks, at this time at least, cuz there’s only one version. The other thing is you will probably pick the one that is sort of recommended, like it’s called the mandatory to implement (MTI) ciphersuite, which is curve25519 and AES, more or less.

And yeah, there are the NIST curves for those who need that for compliance reasons.

Deirdre: Mm.

Raphael: And then at some point we will have ciphersuites with post-quantum primitives obviously.
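Editor's note: the ciphersuite discussion above can be sketched as a simple selection rule: take the first suite in a preference list that every member supports. This is an illustration, not the MLS negotiation procedure or any library's API. The function name is hypothetical, and the two suite identifiers follow the naming style of the MLS specification; treat the numeric values as assumptions to check against the spec.

```python
from typing import Iterable, List, Optional, Set

# Assumed identifiers; verify against the MLS specification before relying on them.
X25519_AES128 = 0x0001  # MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519 (the MTI suite)
P256_AES128 = 0x0002    # MLS_128_DHKEMP256_AES128GCM_SHA256_P256 (a NIST-curve suite)


def choose_suite(preference: List[int], members: Iterable[Set[int]]) -> Optional[int]:
    """Return the first preferred suite that every member supports, else None."""
    member_sets = list(members)
    for suite in preference:
        if all(suite in supported for supported in member_sets):
            return suite
    return None


if __name__ == "__main__":
    members = [
        {X25519_AES128, P256_AES128},
        {X25519_AES128},              # this client only ships the MTI suite
        {X25519_AES128, P256_AES128},
    ]
    # The deployment would prefer P-256, but falls back to what everyone supports.
    print(hex(choose_suite([P256_AES128, X25519_AES128], members)))
```

A deployment that controls all of its clients can shrink the preference list to a single entry, at which point there is effectively nothing left to negotiate.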

Deirdre: Mm-hmm. And I see that the GREASE, that random values in the key agreement fields that we kind of learned about in TLS negotiation, that’s already in there.

Raphael: Yeah, that was a very last minute addition. Yeah.

Deirdre: Someone ran in and was like, ‘Hey, we should probably just not take a step back in this new version negotiation we’re introducing here.’
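Editor's note: for readers who have not met GREASE, here is a minimal sketch of the idea as it is known from TLS: senders mix reserved, meaningless values into the lists they advertise, and receivers must skip anything they do not recognize. The 0x?A?A pattern below is the one reserved for TLS GREASE; whether MLS reserves exactly the same set is an assumption to check against the spec, and the function names are hypothetical.

```python
import random

# 0x0A0A, 0x1A1A, ..., 0xFAFA: the two-byte pattern reserved for TLS GREASE.
GREASE_VALUES = [(hi << 12) | 0x0A00 | (hi << 4) | 0x0A for hi in range(16)]


def is_grease(value: int) -> bool:
    """True if a 16-bit value has the form 0xXAXA with both X nibbles equal."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 12) == ((value >> 4) & 0x0F)


def advertise(real_suites, rng=random):
    """Return the real suites with one random GREASE value mixed in."""
    suites = list(real_suites)
    suites.insert(rng.randrange(len(suites) + 1), rng.choice(GREASE_VALUES))
    return suites


def usable(advertised, known):
    """A tolerant receiver keeps the values it knows and silently skips the rest."""
    return [s for s in advertised if s in known]


if __name__ == "__main__":
    offered = advertise([0x0001, 0x0002])
    print([hex(s) for s in offered])
    print([hex(s) for s in usable(offered, known={0x0001, 0x0002})])
```

The point of the exercise is that an implementation which chokes on unknown values fails immediately in testing, rather than years later when a genuinely new value is introduced.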

Thomas: I’m shocked that towards the end of the process somebody went in and IETF’d the whole protocol up. So, we found exactly one thing for me not to like about MLS. So, good.

Raphael: The crypto agility?

Thomas: Yeah, just in general, the idea of, it’s also like if you’re an implementer now, then you have some pressure, but like, okay, I have to implement the NIST part of this too.

Like it’s so much cleaner if you have the WireGuard model, where it’s just like, look, here’s the protocol. Just go implement this. Noise kind of works the same way, where you kind of read it top to bottom, implement it and you’re done. And here it’s like, I need to indirect everything because somebody else might need to add a ciphersuite to it.

But it’s fine.

Raphael: Yes and no. I mean, the big difference is whether or not you consider your system to be interoperable. Like TLS, it’s, it’s different because you have a client that is gonna talk to random servers, so you have to be compatible with all of them in a way. With MLS, if you have your own deployment, you pick one and then you stop having crypto agility and that’s it.

Deirdre: Mm-hmm.

Raphael: It doesn’t preclude you from, like, you do not need to implement all of them. You can do whatever you want.

Deirdre: If you deploy MLS, if you’re Cisco and you’re like, I’m gonna pick the P-256 ciphersuite, because I know that most of my users don’t care so much about whatever the IETF-ish, crypto ciphersuite is, I care about NIST-compatible ciphersuites because I service these kinds of customers. So I am just going to be like, P-256, AES, that’s my only ciphersuite and I control all of my clients and yada, yada, yada. That helps.

I have a question about federation. It seems like MLS might be open to federation in a way that some other protocols are not. Would that put more pressure on keeping open the set of ciphersuites that implementation supports or negotiates?

If you’re trying to interop with other federation instances.

Raphael: Yeah, good question. And I think that’s very much an open question. So first of all, regarding federation. So there were two things during the design process where MLS, you know, should not solve a problem but also not be prohibitive in any way. And that is privacy and federation. So it should be able to work in an environment that is similar to Signal, but also in an environment that is similar to Matrix in terms of federation.

So this is, again, very much an open question. To some extent, interoperability would be a lot easier if there was just one ciphersuite, because there’s no negotiation. Obviously we will see over time how that’s gonna work. There’s this other really early-days effort at IETF called MIMI: More Instant Messaging Interoperability.

So this is more about federation and agreeing and stuff, et cetera, but it’s so early, it’s, you know, it’s hard to say how that’s gonna go and in which direction. So again, MLS is sort of agnostic of federation as such, but it doesn’t prevent you from using it.

Deirdre: That’s nice.

David: Is there interest in using MLS as, like, a path forward? Because I think they passed the EU regulations saying, like, all these messaging platforms have to interop. I’m not sure what the status of that is, but.

Raphael: Yeah, I’m not sure anyone knows what the real status of that is, but, that’s essentially what sort of motivated Mimi, this new working group to look for potential technological solutions to that. But the whole effort is political obviously. So we have to see. Obviously MLS is some sort of a candidate in that context, that vendors could potentially agree on.

Deirdre: But we still have to deploy it first.

Raphael: Yeah, I mean, it took a while to, to get to this stage, like it took five years all-in-all.

Deirdre: Hmm.

Raphael: Really long time, obviously, but that’s also because it’s never been done before at IETF. And now I think we will see some deployments, and then we will gather some experience, and see how it works.

It’s kind of hard to, you know, predict every aspect of it.

David: Still faster than TLS 1.3 was, so you beat them.

Raphael: Well, there were far fewer stakeholders as well, and that was the advantage. And there was not a lot of disagreement in the working group, which was really nice. I’m really not an IETF expert, so this was my first rodeo, really. From what I hear, in other groups there’s much more drama and disagreement.

So in that sense, it was a pretty smooth ride.

Deirdre: Nice. I’m happy for you, that you had a nice IETF experience. I’m not jealous at all.

Raphael: What you’re not saying right now. Yeah.

Deirdre: Yeah, I’ve had a fine IETF, IRTF experience, there’s just a lot of bureaucracy.

You’ve reached the barely grazing your fingers on the finish line of MLS… 1.0? Is this 1.0? Are you versioning it?

Raphael: That’s a good question, whether we actually call it 1.0. I think we just call it MLS at this point. But if there’s gonna be another version, then for sure, it’s gonna be versioned.

Thomas: Were there things that got bounced from the scope for this? Do you have a thought about like what a next version would be? It’d be great if the answer was no, right, but.

Do you think there will be a 2.0, like in the near future?

Raphael: Well, I don’t know about the near future, but I mean, there will be evolution because messaging protocols evolve. We had OTR, which was fantastic in a way because it was really modern, had modern properties that didn’t exist before, but then it didn’t work on mobile phones. And then Signal came along, which took all these properties and actually made them work, and has been fantastic ever since, but didn’t scale well. This is where MLS fits in, and there’s gonna be other things. So yeah, things were bounced of course, and there’s a million things you can do: you can try to be more efficient, you can have better PCS, but in a way, when you design these systems, it’s a bit like having a balloon filled with water: if you grab it on one hand and you squeeze it, for a split second you think you’ve made it shrink, but what you really did is you pushed the water somewhere else. So when you make it faster on one end, then you probably compromised on some security aspect. So, it’s relatively rare that you have these leaps where you really just innovate without compromising too much.

But yeah, to answer your question, we could have done this in different ways, and over the past five years, many people, came up with a lot of different ideas on how it could have been done, et cetera. So all of these things would be candidates for future versions, but what we really need is some experience with this version, see how it works because on an academic level, you know, you can be really creative, and that’s fantastic, but it actually needs to address real problems.

Deirdre: Mm-hmm. All right. Do we have anything else?

David: I was just gonna say congratulations to yourself and everyone else that worked on MLS. I think I remember first hearing about this at, like, RWC in, I wanna say, 2019, maybe 2018, when Richard Barnes gave a talk. So it’s been quite the journey since then.

Deirdre: Yes, it’s very, very big accomplishment, as you said, a greenfield-like, complex protocol in the IETF, first end-to-end secure messaging protocol that I’m aware of. I’m very excited that all the interoperable implementations are up and running, including openmls, which is a Rust implementation, that I pushed some commits to, little ones. I’m paying close attention to that one, to play with. Very exciting. Congratulations. Hopefully it’ll get deployed in more places, it can keep evolving and we can keep learning about it and publishing analysis of it. Anything else we wanna touch on before we wrap up?

Raphael: Well, I wanna say thank you very much on behalf of all the people who worked on MLS. It’s really been a lot of different people, and great people who brought in a lot of different ideas and made it what it is today.

Deirdre: Yeah. Raphael, thank you very much for coming on our show.

Raphael: Well, thank you for having me.

David: Security Cryptography Whatever is a side project from Deirdre Connolly, Thomas Ptacek, and David Adrian. Our editor is Netty Smith. You can find the podcast on Twitter @scwpod and the hosts on Twitter @durumcrustulum @tqbf and @davidcadrian. You can buy merchandise at merch.securitycryptographywhatever.com.

Thank you for listening.
