Having fun with Insertable Streams and E2EE (and SFrame!)

May 28, 2020 Lorenzo Miniero

About a month ago, we shared a post on several things that were happening in Janus. One of the most exciting sections was definitely the one related to end-to-end encryption, and how Insertable Streams could be used for the purpose. Considering this is something a lot of people have been looking forward to for a long time, I thought it made sense to write a new post, to dig a bit deeper into what we’ve been doing with that so far, and what’s next. In particular, I’ll share some details on how I started using Sergio Garcia Murillo’s excellent JavaScript library to play with SFrame (learn more in this very informative blog post by Sergio himself), something you won’t find in the available pull request yet.

Before doing that, though…

Hey, I thought WebRTC was encrypted already!

This is a common source of confusion. Quick answer: yes, WebRTC is encrypted by design, using DTLS to exchange keys (and encrypt data channel messages), and SRTP to exchange real-time audio and video streams. As such, each PeerConnection established between two peers is secure, and if things are done properly, no one can have a look at what’s inside.

There’s a caveat, though: WebRTC was conceived at the outset as a peer-to-peer solution, which means it will protect any conversation between two individual peers. The moment you add a server to the mix, media is not peer-to-peer anymore: you’re sending media to a server, and the server (at the very least) sends it to other people for you. As such, while WebRTC is still used, and the peer-to-peer concept is still very much alive, it’s the peers in the conversation that change: even if we’re in a 1-1 call, if a server is handling the media we set up WebRTC PeerConnections with the server, and not with the other person directly. This means that, in the simple 1-1 call case, two separate and independent PeerConnections are established:

one between caller and server;
one between server and callee.

As such, both connections are indeed secure, but only up to the server: since the server terminates the DTLS connectivity, it does have access to the unencrypted media.

This is not necessarily a bad thing, not at all. There are many reasons why a server may want, or need, to have access to the unencrypted media: for instance, the streams may need to be transcoded (e.g., to be used in an MCU, or translated for a CDN) or recorded to a playable file, both only possible if the server has access to the plain packets to work with. Besides, even when not caring about the media packets themselves, the server may need to be able to peek at part of the payload, e.g., to detect if a specific packet marks the beginning of a keyframe for switching purposes (think simulcast or SVC).

That said, being able to have proper end-to-end encryption in a server-based conversation (e.g., a multiparty conference) is still very much a requirement many people are interested in. Falling back to pure peer-to-peer (e.g., via a full mesh) just because of the issues mentioned above may be seen as a solution, but it’s rarely a viable one: it may work fine with very few participants, but it’s just not doable when many more join the conversation. This is one of the reasons why SFUs have gained so much popularity in the past few years, so making them somehow E2EE-compliant, or at the very least aware, would definitely bring the best of both worlds.

A bit of history

All of the above reasoning is what eventually led the IETF to start working on this, a few years ago, in a working group called PERC (Privacy Enhanced RTP Conferencing). The main aim was indeed to make sure that the component distributing the media (e.g., an SFU) wouldn’t necessarily have to be trusted with the keys to decrypt the media. As such, the objective was coming up with a framework, possibly backwards compatible with legacy SIP-based infrastructures, that would make that possible.

While this sounded promising and exciting at first, it unfortunately soon ended up becoming a specification that more than one person in the WebRTC world agreed would not be usable, or even implementable, given the way WebRTC works. It would take more than a simple blog post to go deep on this, but if you’re interested you can read this interesting (and long!) thread on the PERC mailing list that summarizes what everyone thought about it.

Long story short, WebRTC developers soon lost interest in the “official” PERC efforts. An alternative proposal came up at about the same time, which tried to overcome what were seen as the main PERC framework limitations, and come up with a more usable solution: this proposal was called PERC Lite, and if you’ve followed this blog or the Janus development in the past few years, you may remember we actually worked a lot on this, in a joint effort with our friends at CoSMo Software. In a nutshell, PERC Lite envisaged some sort of double SRTP encryption just like PERC (a hop-by-hop one like the one WebRTC already implements, and one at the media layer), with a simplified key management that made it easier to configure keys in a JavaScript application. The following figure shows a simplified view of how this would work in an SFU scenario, with the VideoRoom plugin.

The integration in Janus was very lightweight, touching as little code as possible, and mostly to just make Janus aware of when end-to-end encryption via PERC Lite was in use. This made it possible to re-use many of the Janus features out of the box. For instance, we mentioned how recordings are normally made impossible by end-to-end encryption, as the server wouldn’t have access to the plain media packets, thus preventing it from either transcoding the media, or putting it in a playable file (which is what our post-processing tool does). That said, considering how recordings work in Janus (where RTP packets are basically just stored in a structured way), it was very simple to replay a Janus recording via WebRTC itself, and so use PERC Lite for this new session as well (assuming the recording viewer had access to the keying information used during the original session). A simple diagram that sketches how that worked can be seen below.

That said, that proposal didn’t live long either (even though it was actually deployed with success in many scenarios), which meant a different solution would be needed to scratch the E2EE itch.

Insertable Streams

A very interesting approach to this was introduced recently, when Chrome started implementing support for the so-called Insertable Streams. Quoting from the documentation, Insertable Streams are basically an “API [that] enables the insertion of user-defined processing steps in the encoding and decoding of a WebRTC MediaStreamTrack”. Dr Alex and Fippo did a much better job than I could at introducing how they work and what they can do, so I suggest you read their posts for a good overview. For the purpose of this blog post, suffice it to say that they provide web applications with the means to add custom functions/callbacks with the ability to manipulate a media frame (i) after it is encoded and before it is sent, and (ii) after it is received and before it’s decoded.

As such, while out of the box they cannot help with modifying the content of the packets (since they work with encoded media, you’d need to be aware of the encoding to do anything with them), a very good use case for them is indeed end-to-end encryption. In fact, let’s assume I implement a simple function that performs some basic modulo operation: if the person I want to talk to and I agree on how to use that function, and hook it up via Insertable Streams for our WebRTC session, the media packets on the wire will not be decodable out of the box (even without the WebRTC hop-by-hop encryption applied later, packets would still be “scrambled” by my modulo function), and the recipient would need the inverse of the modulo function to recover the media packet as it was originally encoded, before it can be decoded and rendered.
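To make the idea a bit more concrete, here is a minimal sketch of such a reversible "modulo" function, working on plain byte arrays. The names scramble and unscramble and the fixed offset are assumptions made up for illustration; this is obviously not real encryption, just a transform that both sides need to agree on.

```javascript
// Toy "scrambler": shifts every byte by a fixed offset, modulo 256.
// NOT encryption: it only shows a reversible transform both peers agree on.
function scramble(bytes, offset) {
    return bytes.map(b => (b + offset) % 256);
}

// Inverse operation: shift back, keeping the result in the 0..255 range.
function unscramble(bytes, offset) {
    return bytes.map(b => (b - offset + 256) % 256);
}

// A fake "encoded frame", what would travel on the wire, and what the
// recipient gets back after applying the inverse function.
const encoded = Uint8Array.from([0, 1, 200, 255]);
const onTheWire = scramble(encoded, 100);
const recovered = unscramble(onTheWire, 100);
```

Anyone who doesn't know the function (or the offset) would be handed bytes the decoder can't make sense of, which is exactly the property we're after.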

This very simple example shows how flexible and powerful Insertable Streams actually are: rather than being stuck with a hardcoded solution for end-to-end encryption (as PERC or PERC Lite were, since they operated with packets their own way and out of the reach of applications), you have access to a completely open mechanism to implement the feature any way you want, and possibly replace/improve whatever was done later on.

At the time of writing, there have already been several efforts in the WebRTC space to take advantage of Insertable Streams for the purpose. Of course, we were interested in them even before they became available on an experimental basis in Chrome, so as soon as that happened we rolled up our sleeves and made sure Janus could use them accordingly.

Adding Janus to the mix

As anticipated a few weeks ago, we started working on Insertable Streams pretty soon, and came up with the related changes in a dedicated branch.

If you have a look at the code, you’ll notice how, even if we updated many files, the required changes were actually quite minimal. This is because, exactly as it happened when we worked on PERC Lite at the time (effort on which this new PR was based), most of the work was focused on making Janus simply aware when end-to-end encryption takes place. Since Janus rarely touches the media itself, this simply meant two things:

implementing different behaviours in different plugins, depending on whether or not they’d be fine with E2EE media;
adding the ability to flag our MJR recordings, so that we would know when they contained E2EE media and when they didn’t.

On the first point, we configured very few plugins to reject attempts to set up an E2EE session: namely, AudioBridge (which is an audio MCU and so needs to be able to transcode the packets), VoiceMail (which saves audio packets to an .opus file, and so needs access to the unencrypted media), SIP and NoSIP (both because they’re supposed to interact with legacy infrastructure, which likely won’t support this for a while). All the other plugins supporting audio and/or video were configured to support the feature one way or the other: as such, e.g., the VideoRoom can be used to host E2EE conferences, while the Record&Play plugin can replay E2EE recordings if the viewer has access to the right decrypt functions.

For what concerns the Janus API, this basically had no impact at all. In fact, the only thing we had to do was add something that could “flag” a new PeerConnection as end-to-end encrypted, when Insertable Streams were to be used: this was needed because nothing in the SDP changes, when using the feature, meaning that the SDP itself cannot be used to signal E2EE functionality. The way we ended up implementing this was to basically just “enrich” the existing jsep object we use in the Janus API to negotiate a PeerConnection with an additional (and optional) e2ee boolean attribute, e.g.:

[..]
"jsep": {
    "type": "offer",
    "sdp": "v=0\r\n...",
    "e2ee": true
},
[..]

This works in both directions, of course; meaning that a client would definitely need to add the property if they’re going to offer a session with E2EE support, but the same could be done by Janus as well. Let’s say, for instance, that we’re hosting a videoconference using the VideoRoom plugin, and we want all streams to be E2EE: the publishers would definitely send an offer with the e2ee: true flag in there, and the VideoRoom would react accordingly by taking note of this and including the same flag in the answer as well; at the same time, when someone subscribes to one of those streams, it’s Janus that offers, meaning it’s up to Janus itself to signal that the PeerConnection will be end-to-end encrypted, and the offer recipient should be prepared to handle the media accordingly.

On the client side, apart from adding support for the flag above we actually didn’t do much, if not provide a way in janus.js to specify which transforms to use. In fact, with Insertable Streams that’s really the only thing that matters: all the complexity resides in how those transforms are implemented, while hooking them up to a PeerConnection is relatively easy. In order to make them fit seamlessly in the usual janus.js flow, we added two new properties that can be set when setting up a new PeerConnection, so when doing either a createOffer or a createAnswer, e.g.:

echotest.createOffer({
    media: {audio: true, video: true },
    senderTransforms: {
        audio: new TransformStream({ .. }),
        video: new TransformStream({ .. })
    },
    receiverTransforms: {
        audio: new TransformStream({ .. }),
        video: new TransformStream({ .. })
    },
    [..]
    success: function(jsep) {
        // Send offer to Janus
    }
});

As you can see, it’s quite straightforward: you can specify which transforms to use for audio and/or video, on the way out and/or in. Of course, you’re not forced to specify everything: if it’s an audio only session, you’ll only need audio transforms; if the media just goes in one direction, you’ll only need either the sender or the receiver transform, and not both. What’s important to point out is that the janus.js stack will automatically add the e2ee: true flag to the jsep object you get back, in case transforms were provided, and will also automatically configure the PeerConnection to use them. At the same time, it’s up to the web application to intercept the e2ee property on incoming offers, and be prepared to specify the right transforms when calling createAnswer to process the incoming packets.
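As a rough sketch of that last point, an application could decide which options to pass to createAnswer depending on whether the incoming jsep carries the flag. The buildAnswerOptions helper and the makeTransforms factory below are hypothetical names introduced for illustration, not part of janus.js; only the jsep, media, senderTransforms and receiverTransforms properties come from the snippets above.

```javascript
// Hypothetical helper: prepares the createAnswer() options for an incoming
// offer, based on the optional "e2ee" attribute of the jsep object.
// makeTransforms is an application-provided factory (an assumption of this
// sketch) returning { sender: {audio, video}, receiver: {audio, video} }.
function buildAnswerOptions(jsep, makeTransforms) {
    const options = { jsep: jsep, media: { audio: true, video: true } };
    if(jsep && jsep.e2ee === true) {
        // The offerer flagged the PeerConnection as end-to-end encrypted:
        // we need matching transforms, or we won't be able to decode anything
        const transforms = makeTransforms();
        options.senderTransforms = transforms.sender;
        options.receiverTransforms = transforms.receiver;
    }
    return options;
}
```

The returned object would then be completed with the usual success and error callbacks before being passed to createAnswer; for a receive-only subscription, the sender transforms could simply be left out.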

To showcase all this, we only added a simple E2EE demo based on the EchoTest to use them, and nothing else. The rationale behind that was simple. First of all, as we’ve seen Insertable Streams give a whole lot of flexibility when it comes to which transform functions to use: as such, we decided to keep it simple, and re-use the same basic XOR-based implementation the official WebRTC demo takes advantage of in this demo as well (thanks to Fippo for the feedback he provided on that part, both publicly and privately!). It’s of course just an example, and not really a bullet-proof implementation (as the authors themselves admit, since it’s mostly a proof-of-concept for demo purposes), but it’s quite helpful to understand how to use them in a Janus web application. As you can see from the screenshot below, the demo just asks you for a string to use as the key, and that is then used to configure the crypto context the transforms will implement.
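To give an idea of what such a transform might look like, here is a simplified sketch of an XOR-based one. It is not the code from the demo: the xorBytes and makeXorTransform names are made up for illustration, and it scrambles the whole frame without any of the extra care a real implementation may need. The only assumptions about the API are that the standard TransformStream is available, and that each encoded frame exposes its payload as an ArrayBuffer in a writable data property.

```javascript
// Toy XOR "cipher", keyed by the characters of a string: NOT secure.
// Applying it twice with the same key gives back the original bytes.
function xorBytes(buffer, key) {
    const input = new Uint8Array(buffer);
    const output = new Uint8Array(input.length);
    for(let i = 0; i < input.length; i++) {
        output[i] = input[i] ^ key.charCodeAt(i % key.length);
    }
    return output.buffer;
}

// Builds a transform that can be passed as, e.g., senderTransforms.video
// or receiverTransforms.video: since XOR is its own inverse, the same
// function works in both directions.
function makeXorTransform(key) {
    return new TransformStream({
        transform(frame, controller) {
            // Replace the payload of the encoded frame and pass it along
            frame.data = xorBytes(frame.data, key);
            controller.enqueue(frame);
        }
    });
}
```

Note that a separate TransformStream instance is needed for each track and direction, which is why this is written as a factory rather than a single shared object.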

That said, while the PR out of the box only ships a demo based on the EchoTest plugin (which means the demo encrypts media using the XOR transforms on the way out, and then decrypts them again on the way in, and everything is still supposed to work), we did also play a bit with the VideoRoom demo, as you can see in the (hopefully funny, definitely ugly) screenshot below. The code for that effort is not provided, but if you have a look at how the modified E2EE EchoTest demo works, it should be easy enough, and a useful homework for you, to tinker with the VideoRoom demo to get it working there as well.

A step forward: Secure Frames (SFrame)

As we’ve discussed, one of the Insertable Streams’ strongest points is definitely the flexibility they give in terms of what to do with the packets. Everyone is basically free to come up with their own transforms for the media, which means there’s plenty of room for experimentation and, more importantly, research on the best solutions to the WebRTC E2EE problem. The example we’ve seen before, for instance, showed how even a simple XOR-based mechanism can already provide some protection to the streams.

At the same time, though, encryption is a serious matter, and how to do it properly can be quite the challenge. This is even more true when you have constraints like the ones WebRTC imposes, in terms of keeping the functionality lightweight enough to be used in real-time for a potentially large number of participants, and at the same time provide the right level of protection on the streams.

One of the most interesting proposals in that sense has been Secure Frames (SFrame), which you can read about in this good overview by Dr Alex. While I won’t even try to describe how it works in terms of encryption (you can refer to the blog post above or the recent IETF draft for that), it can be seen as the “spiritual” successor to PERC Lite, in terms of the functionality it provides but also, hopefully, ease of use in web applications. SFrame was originally co-developed by Google, and has been used in DUO for a long time, which makes it a very good candidate if you’re looking for an option to provide E2EE in WebRTC as well.

In a nutshell, as the diagram below (which comes from the draft) describes, SFrame works on media frames rather than RTP packets, which is why it fits so well with Insertable Streams: any time a media frame is ready, the “SFrame encoder” encrypts it, and only after that the result is packetized in RTP as usual and sent over the wire; the recipient would do the inverse, so depacketize the RTP packets to a frame, and decrypt that via the “SFrame decoder” before passing it to the actual media decoder. The crypto settings used by the SFrame encoder/decoder are exchanged out of band, e.g., via a dedicated E2EE channel on a messaging server of some sort: this will of course be the most complex part, since it basically means implementing a KMS (Key Management System), which is never trivial.
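The ordering of those steps is the important part, and it can be sketched in a few lines. Everything below is an illustration under stated assumptions: sendFrame and receiveFrame are made-up names, the protect/unprotect callbacks stand in for the SFrame encoder/decoder (whose actual format is not reproduced here), and the naive chunking is just a placeholder for real RTP packetization, which is quite a bit more involved.

```javascript
// Sender side: protect the whole encoded frame first, then split the result
// into chunks. "protect" is an application-provided placeholder for the
// frame-level encryption step.
function sendFrame(frame, protect, maxPayload) {
    const protectedFrame = protect(frame);
    const packets = [];
    for(let offset = 0; offset < protectedFrame.length; offset += maxPayload) {
        packets.push(protectedFrame.slice(offset, offset + maxPayload));
    }
    return packets;
}

// Receiver side: put the chunks back together into a frame, and only then
// unprotect it, before it would be handed to the media decoder.
function receiveFrame(packets, unprotect) {
    const total = packets.reduce((sum, packet) => sum + packet.length, 0);
    const protectedFrame = new Uint8Array(total);
    let offset = 0;
    for(const packet of packets) {
        protectedFrame.set(packet, offset);
        offset += packet.length;
    }
    return unprotect(protectedFrame);
}
```

Since whatever sits in the middle only ever sees the chunks, it can forward them without having any way to look at the frame they carry.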

Now, looking at this from a developer perspective, a solution like SFrame can be either good news or bad news:

It’s incredibly powerful (and Google wouldn’t be using it for DUO otherwise), meaning that it can definitely be used to provide strong encryption in WebRTC applications (good thing!).
At the same time, it also is quite complex, especially if you think about the fact you may have to do bit-wise operations in JavaScript (bad thing…) and possibly involve a KMS.

Luckily for us common mortals, this is where the open source community once more comes to the rescue! Specifically, as I was anticipating at the beginning, Sergio Garcia Murillo started writing an excellent SFrame library in JavaScript, that can be used in any WebRTC application. Even better, he made it completely open source, which was really good news! As such, when he asked if I could help him test it, I jumped at the opportunity (and bothered him constantly with my stupid questions, so kudos to him for his great patience and support!), and started hacking at our demos to see how that might work.

Playing with Sergio’s SFrame library

First of all, if you want to learn more about how SFrame actually works, and how the specification got translated to an actual library, make sure to read this very interesting blog post by Sergio, where he goes very much in detail on both. For the purpose of this section, I’ll just focus on how I integrated the work he did in janus.js, in order to test if and how it could work in different existing demos: as such, I will not go very deep in the library APIs themselves, which I encourage you to learn more about yourself.

It’s worth mentioning that Sergio’s SFrame library is provided as a JavaScript module. This makes sense since encryption can be a heavy task, and so leveraging workers for the purpose helps make sure the UI is not affected. That said, this is not how janus.js works by default, or the demos for that matter: in fact, janus.js is loaded as a regular script, and that’s true for the code of all the demos too. As such, the first step to take was to make both work as modules as well, as otherwise it would be impossible to import the SFrame library. As a fairly weak JavaScript developer, I had actually never used modules in web applications, so this initially scared me a bit… luckily, this ended up being quite easy, as all I needed to do was first of all load those scripts as modules, e.g.:

<script type="module" src="janus.js" ></script>
<script type="module" src="e2etest.js"></script>

Then make sure janus.js would export the Janus object:

export { Janus };

And finally import that object in the demo script:

import {Janus} from './janus.js';

After that, everything in the regular demo worked pretty much out of the box, meaning I could start looking at how to hack at the code to get SFrame working as well.

Since I wanted to make SFrame support quite transparent to the user, the plan was to integrate it in janus.js somehow. I explained above how I added a generic support for sender and receiver transforms to the library, and how that made it easy to just pass some random transform functions to be used in a PeerConnection. That said, the SFrame library actually takes care of creating the transforms itself, which makes sense considering how complex they effectively are. Besides, the SFrame library exposes a simple Client object that can be used for all the SFrame functionality, in order to facilitate using the several different features the specification makes available. This meant I couldn’t reuse the same simple approach I had already implemented, and had to do something like this instead:

Integrate the SFrame client in janus.js itself, and…
Expose some way to provide SFrame-specific settings when doing a createOffer or createAnswer (exactly as I did for the generic transforms).

In order to make the SFrame code available to my library, I made sure I’d import the SFrame object from the client module:

import {SFrame} from './sframe/Client.js';

To keep things simple, I started modifying the E2EE EchoTest demo, by basically removing the transforms provided there and using SFrame instead. As anticipated, this first of all meant exposing a new configurable property (which in an impetus of creativity I called sframe) when creating an offer and an answer: at the time of writing, the only properties you can configure are the sender ID (outgoingId), the receiver ID (incomingId), a shared key to use for encrypting (sharedKey), and a public/private key pair used for signing/verifying instead (keyPair).

echotest.createOffer({
    media: {audio: true, video: true },
    sframe: {
        outgoingId: 0,
        incomingId: 0,
        sharedKey: cryptoKey,
        keyPair: keyPair
    },
    [..]
    success: function(jsep) {
        // Send offer to Janus
    }
});

I won’t go in detail on the keys part (you can refer to Sergio’s post for more info and a few examples on how to generate and/or share those), while for the IDs you’ll notice they’re the same in the snippet above: this is expected in the EchoTest demo, since the entity encrypting the content (the sender) is also the one decrypting it, which means that after creating a client with a specific sender ID, we’ll need to add a receiver to be ready to decrypt the same content. Receivers can be added and removed dynamically, in SFrame, which is why Sergio’s library makes it possible as well, and requires you to specify in advance what you’re expecting to receive, and possibly the related keying information.

In janus.js, passing such an object when creating an offer or answer does indeed translate to invoking the SFrame library behind the curtains, and specifically: (i) creating a new SFrame client instance, using the provided sender ID; (ii) adding the keying information for the sender side; (iii) optionally adding a receiver, in case the PeerConnection will receive media. This is what the code in janus.js actually looks like:

if(callbacks.sframe) {
    Janus.log("Using SFrame to encrypt media end-to-end:", callbacks.sframe);
    config.sframe = callbacks.sframe;
    config.sframeClient = await SFrame.createClient(config.sframe.outgoingId, {});
    // Sender part
    await config.sframeClient.setSenderEncryptionKey(config.sframe.sharedKey);
    if(config.sframe.keyPair && config.sframe.keyPair.privateKey) {
        await config.sframeClient.setSenderSigningKey(config.sframe.keyPair.privateKey);
    }
    // Receiver part
    if(config.sframe.incomingId !== undefined && config.sframe.incomingId !== null) {
        await config.sframeClient.addReceiver(config.sframe.incomingId);
        await config.sframeClient.setReceiverEncryptionKey(config.sframe.incomingId, config.sframe.sharedKey);
        if(config.sframe.keyPair && config.sframe.keyPair.publicKey) {
            await config.sframeClient.setReceiverVerifyKey(config.sframe.incomingId, config.sframe.keyPair.publicKey);
        }
    }
}

As you can see, the setup is relatively simple:

First of all, we create a client instance using SFrame.createClient(), and specifying the sender ID (outgoingId); notice that a sender ID is still required even if you’re just going to receive media.
Then we configure the encryption context via setSenderEncryptionKey() (and optionally a signing context via setSenderSigningKey(), if a key pair has been provided).
Finally, if a receiver ID (incomingId) has been provided, we add it to the client via addReceiver() and configure how to decrypt it via setReceiverEncryptionKey() (and, again, only optionally specify how to do verification via setReceiverVerifyKey() if a key pair was provided).

Once that’s done, the library is ready to work, and the only steps missing are passing it the streams to actually process. This can be done quite easily using the encrypt() and decrypt() functions on the client: per the Insertable Streams specification, the former expects an RTCRtpSender instance (as that’s what you’d apply the sender transform on), and the latter an RTCRtpReceiver instance instead, which means that you just need to use those functions when you have access to those values, e.g.:

var sender = config.pc.addTrack(track, stream);
config.sframeClient.encrypt(sender.track.id, sender);
[..]

config.pc.ontrack = function(event) {
    [..]
    config.sframeClient.decrypt(event.track.id, event.receiver);
    [..]
}

Once I took care of all that, the EchoTest worked as expected, which was quite exciting!

Of course, I was also interested in checking how it could work in a conferencing environment, and so in the VideoRoom demo. To keep things simple, and avoid setting up a complete KMS or key exchange, I limited the integration to just use the shared key (prompted when opening the web page), but no key pair for signing/verification: this is definitely something that will take some more time and effort, and that goes beyond the simple proof of concept implementation I worked on here. Besides, I decided to use the unique VideoRoom participant IDs as the SFrame sender and receiver IDs: this basically meant (i) just configuring outgoingId: participantID for publishers, and (ii) configuring outgoingId: participantID, incomingId: feedID for subscribers instead (where the sender ID is unused but here just means this subscription was originated by that particular participant, while the receiver ID refers to the unique ID of the participant we’re subscribing to). With these small tweaks, the VideoRoom demo worked too, as you can see in the screenshot below! (where as usual I’m talking to myself, and making fun of a “Bad Guy” with the wrong key)
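The ID mapping described above can be summarized with a couple of tiny helpers. These are hypothetical functions written for illustration (they're not in the demo code); they just build the sframe object introduced earlier, with a shared key and no key pair, as in this proof of concept.

```javascript
// Publishers only send media: the participant ID is used as the sender ID.
function sframeForPublisher(participantId, sharedKey) {
    return { outgoingId: participantId, sharedKey: sharedKey };
}

// Subscribers only receive media: the sender ID is unused and just identifies
// who originated the subscription, while the receiver ID is the unique ID of
// the participant (feed) being subscribed to.
function sframeForSubscriber(participantId, feedId, sharedKey) {
    return { outgoingId: participantId, incomingId: feedId, sharedKey: sharedKey };
}
```

The resulting objects would then be passed as the sframe property when calling createOffer (publishers) or createAnswer (subscribers).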

That’s all, folks!

I hope you enjoyed this overview on our end-to-end encrypted journey. There’s still quite a lot to do: first of all, I should make this SFrame integration available (I need to clean up the code a bit, first); besides, Insertable Streams are still only available behind an experimental flag in Chrome anyway, and so things may change further in the specification as well.

That said, I feel like we’re definitely living in exciting times, and I’m looking forward to you guys playing with all this, and hopefully having as much fun as I did so far!
