It’s been a few weeks since we started having a look at QUIC and how it can be used for real-time media. We started a couple of months ago with an overview on QUIC itself, and my prototype stack implementation of it. After that, we started looking at some practical applications, focusing specifically on RTP Over QUIC (RoQ) with some interop tests performed at the IETF Hackathon in Vancouver.
Now is the time to go one step further, and have a look at what is probably the most interesting (and ambitious) effort related to how to realize multimedia applications on top of QUIC: Media Over QUIC (or MoQ for short). I’ve been working on this a lot, lately, especially in terms of getting MoQ and WebRTC to talk to (and like) each other, so without further ado, let’s have a look at what came out of it!
As a side note, I’ll talk about this (and more) at the upcoming (at the time of writing) RTC.ON event in Krakow. Meetecho is one of the sponsors of the event, so if you want to learn more about all this, or just chat with me about it in person, see you there!
A quick recap
As anticipated, in the previous posts we introduced QUIC in general first, and the steps I followed to try and implement a basic stack I could build some scenarios upon. The stack is not complete yet, but it supports enough features to be able to use it as a foundation for QUIC-based protocols and applications, at least in a local and controlled environment.
Considering my interest is and remains real-time multimedia applications, I started looking into the existing alternatives, and the new mechanisms that are currently being devised in the standardization activities to make them possible using QUIC as well. This is how I started focusing on two efforts in particular:
- RTP Over QUIC (RoQ)
- Media Over QUIC (MoQ)
I’m already very familiar with RTP, so looking into RoQ first made sense, which is exactly what I did and covered in my previous blog post. That said, while RTP Over QUIC “works”, it’s still an attempt to bend an existing protocol to a different transport: it can definitely be done (and the demos I shared in my blog post show that), but at the same time there’s a lot of overlap between RTP/RTCP and what QUIC provides (something the draft covers in a lot of detail). Besides, considering the unique nature of QUIC as a transport protocol, it may make more sense to try and come up with something new and more specific, and ideally more flexible: something that could map more seamlessly to the features QUIC provides out of the box, and that could in theory even extend QUIC itself should there be a need for it.
This was one of the main rationales behind the proposal of Media Over QUIC, which is what we’ll talk about in this blog post.
What is MoQ, and what is it for?
As the name suggests, Media Over QUIC (MoQ) is a proposal for a media delivery solution that leverages QUIC connections. Its main use cases are not that different from what we’ve been using WebRTC for so far, namely live streaming, real-time collaboration, gaming and much more. A good high level overview is provided in this article on the IETF blog, and this blog post by Luke Curley is also very interesting and informative, as it provides good insights on what may be some of MoQ’s selling points over other technologies (including HLS/DASH). Please note that I may say something partially or completely incorrect in the MoQ overview that follows: I’m new to all this myself, so some of what I understood so far or some of my assumptions may be wrong. Please do let me know if that’s the case, and I’ll correct it!
I’ll start by pointing out that, as media technologies, some see WebRTC and MoQ as actually competing with each other, with one necessarily better than the other. I do see some overlap, especially considering MoQ is trying to address many use cases we use WebRTC for every day, but in practice I don’t see that as much of an issue, nor a clear winner across all use cases. First of all, MoQ is still in its early stages, which means that, until it becomes more widespread, IMHO WebRTC is there to stay for quite some time in many production environments. Besides, there will still be scenarios where one will be better than the other, and others where it will be the opposite; for some, they’ll pretty much be interchangeable, meaning it will be up to you to choose which one you prefer. This may change as the specification (and adoption) evolves, but in general I definitely see opportunities for both of them to coexist, and possibly even interoperate: for instance, think of a WHIP ingestion that not only feeds a pool of WHEP servers (and maybe an HLS CDN as well), but also a battery of MoQ relays. That’s what intrigued me and got me to start looking into MoQ, and into possible ways to get it to talk to WebRTC somehow, which is what this blog post is all about.
As it happened for WebRTC, the standardization effort on MoQ is happening within the context of the IETF, where MoQ has been a Working Group for a couple of years. As the charter says, the idea behind MoQ was to try and come up with “a simple low-latency media delivery solution for ingest and distribution of media”, with one of the main requirements being that it “will scale efficiently”. Again, rather than using RTP for the job, the idea was to define different roles that could map to different nodes in a configurable, dynamic and extensible topology: the protocol used by these nodes would then be generic enough to allow for different kinds of media to be transported, thus allowing for different formats, rate adaptation strategies and so on. All this mapping media “onto underlying QUIC mechanisms” (raw QUIC or WebTransport).
Even from this very high level and generic description, it’s clear that MoQ is most definitely a very interesting technology to investigate and keep track of, especially for those familiar with WebRTC and media distribution technologies. The work on the actual transport is currently being done in this internet draft. From a topology perspective, we mentioned that there are different roles that MoQ applications may have, which are basically three:
- Publishers, that is, applications sending media to be distributed;
- Subscribers, that is, applications receiving media that has been contributed;
- PubSub, that is, applications that can do both (typically relays).
This means that the simplest media topology that we can imagine is the one depicted below, where a publisher advertises media to be distributed to a relay, which can in turn distribute it to a number of interested subscribers. All done using the MoQ Transport as the underlying protocol for the job.
If you’re familiar with WebRTC SFUs, this won’t be a particularly surprising topology, as it basically maps what we already know and do every day with WebRTC itself. That said, one of the cool parts of MoQ is that, with no change to the protocol and just using the roles we introduced, a topology like the above can be extended to become much more complex than that, for instance like the basic CDN depicted below:
In this diagram we actually have relays feeding other relays, and each of them having their own subscribers. And while my diagram shows a simple distribution tree, this could be even more complex and flexible than that, e.g., like the diagram Erik Herz shared for one of his MoQ proof-of-concepts. And while I’ve focused on broadcasting as an example, as we mentioned MoQ can be used for conferencing as well, which simply means similar topologies where the same endpoint can both send and receive media. Again, a very flexible approach.
At the basis of MoQ are the concepts of MoQ objects, groups and tracks. An object is basically the “basic data element” of the MoQ transport, a group is a collection of objects, while a track is a sequence of groups (and so of objects). Thinking of this in terms of video encoding, you can see an object as being a frame, a group the list of all frames starting from, and dependent on, a specific keyframe, and a track the sequence of all those groups (the full video sequence from start to end). This is depicted visually in the diagram below.
The main idea behind groups is that objects belonging to a specific group should never depend on objects in other groups, which maps well to the video frames example we made above: if a group contains frames starting from a keyframe and the delta frames that follow, once a new keyframe arrives, all frames (and so objects) that follow will only depend on that keyframe, and not the one (and the group) before. This is a reasonable assumption and basis for the data model of MoQ, and straightforward enough to understand: there have been discussions lately to extend/change this, e.g., for SVC purposes, but there has been no agreement yet at the time of writing.
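To make this a bit more concrete, here’s a minimal sketch (in C, and not tied to any real MoQ stack: names and structures are purely illustrative) of how encoded video frames could be mapped to objects and groups, with every keyframe starting a new group:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative only: a MoQ object carries an opaque payload (e.g., a video
 * frame), addressed by track / group / object IDs */
typedef struct moq_object {
	uint64_t track_id;		/* which track this object belongs to */
	uint64_t group_id;		/* group = keyframe + the delta frames that depend on it */
	uint64_t object_id;		/* position within the group, starting from 0 */
	const uint8_t *payload;
	size_t payload_len;
} moq_object;

/* Map an encoded frame to an object: every keyframe starts a new group
 * (so objects never depend on objects in other groups), and the object ID
 * restarts from 0 within each group */
static moq_object map_frame(uint64_t track_id, bool keyframe,
		const uint8_t *frame, size_t len,
		uint64_t *group_id, uint64_t *object_id) {
	if(keyframe) {
		(*group_id)++;
		*object_id = 0;
	}
	moq_object obj = {
		.track_id = track_id,
		.group_id = *group_id,
		.object_id = (*object_id)++,
		.payload = frame,
		.payload_len = len
	};
	return obj;
}

int main(void) {
	uint64_t group = 0, object = 0;
	uint8_t fake_frame[1] = { 0 };
	/* Pretend we got keyframe, delta, delta, keyframe, delta */
	bool keyframes[] = { true, false, false, true, false };
	for(int i = 0; i < 5; i++) {
		moq_object o = map_frame(1, keyframes[i], fake_frame, sizeof(fake_frame), &group, &object);
		printf("track=%llu group=%llu object=%llu\n",
			(unsigned long long)o.track_id,
			(unsigned long long)o.group_id,
			(unsigned long long)o.object_id);
	}
	return 0;
}
```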
More in general, when it comes to publishing and subscribing, the main concepts are track namespaces and track names. The idea is that a publisher can advertise a specific namespace (e.g., “Lorenzo”), and within the context of this namespace it may publish, at the same time or at different times, multiple tracks (e.g., “my microphone”, “my cool webcam”, “this application”, etc.). It’s worth mentioning that, through the MoQ transport itself, only namespaces are advertised, e.g., to a relay: this means that track names and their availability are supposed to be known to subscribers already. While this may seem counterintuitive, the main reasoning behind this is that this sort of signalling and mapping is assumed to be out of scope for the transport of the media itself, and left instead to something like the Common Catalog Format. This is admittedly one of the areas I know less about, so I won’t delve too much into this for the time being, especially considering I wanted to focus mostly on media delivery here.
That said, we mentioned how each track is a sequence of groups of objects, which means that, once a publisher starts publishing a track because someone subscribed to it, it will indeed start pushing objects on the wire, each carrying identifiers addressing its unique object ID, along with the IDs of the parent concepts (group, track, etc.). These are the same objects that a subscriber will receive, and the hierarchical nature of the media will allow the subscriber to handle them accordingly. Subscriptions themselves can be done in different ways: while you may be interested in only receiving the latest objects as they arrive, you may also want to start receiving something from before, e.g., starting from the keyframe preceding the latest objects (keeping the same video example as before), in order to be able to decode and render something immediately. You may even want to receive everything starting from the very beginning, rather than only the current stuff. This is one of the interesting features of MoQ, and a key difference from WebRTC: the ability to configure the scope of the distribution, for instance for the sake of having configurable latency for the stream. The fact each object has a unique ID means relays can be configured to cache them accordingly, in order to make this dynamic distribution approach possible. That said, we won’t focus too much on this aspect for the moment, especially considering there are a lot of discussions happening right now around how this should happen in practice (e.g., in terms of whether it makes sense to have different APIs for real-time delivery and fetch-based distribution instead).
Coming back to what goes on the wire, without going into much detail, there are different ways these objects may be transported in practice: just as we saw happening in RoQ, multiplexing may happen in different ways, e.g., using a separate DATAGRAM/STREAM per object, a STREAM per group, or a STREAM per track. Not everyone in the standardization activities agrees on which should be the supported multiplexing modes, which explains why different versions of the draft currently define different mechanisms, and slight changes in the available attributes (in particular on the scoping of the distribution). That said, for now suffice it to say that these objects could be transferred in different ways, taking advantage of different properties of the QUIC transport.
In a nutshell, we can summarize a simple MoQ scenario as depicted in the sequence diagram below:
In this simple scenario, a publisher connects to a relay and announces their namespace, which needs to be unique. As we mentioned, the tracks the publisher will actually publish within the context of this namespace are not advertised here. We just assume an interested subscriber knows what they’ll be, which is what leads them to send a subscribe to a “video” track for that namespace via the relay: notice how a subscribe ID is provided, which is meant as a “shortcut” to immediately address that namespace and track name (and in fact is used in the objects that are sent for demultiplexing purposes). In this case, the subscription is relayed by the relay to the publisher: this will not always happen, due to the caching we mentioned relays can do; if the relay is receiving the objects a subscriber is interested in already, it can simply fan them out to the new subscriber too. Once the publisher accepts the subscription, objects start to actually flow on the wire, and the relay will take care of sending them to the interested subscriber(s). Again, we don’t really care at this stage how (as in via what multiplexing mode) these objects are delivered: we simply assume they are, and that they get to their destination.
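Just to visualize the relay behaviour we just described, here’s a hypothetical sketch of the decision a relay makes when a SUBSCRIBE comes in; none of these types or helpers come from an actual implementation, they just outline the logic:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified relay state: one entry per track we set up */
typedef struct subscription {
	uint64_t subscribe_id;		/* "shortcut" ID used when forwarding objects */
	char track_namespace[256];
	char track_name[256];
	bool active;				/* did we already set this subscription up? */
} subscription;

#define MAX_SUBS 16
static subscription subs[MAX_SUBS];

/* Stub: forward the SUBSCRIBE upstream, towards the publisher that announced
 * the namespace (in a real relay this would go on the wire) */
static void forward_subscribe_upstream(subscription *s) {
	printf("Forwarding SUBSCRIBE %llu for %s/%s to the publisher\n",
		(unsigned long long)s->subscribe_id, s->track_namespace, s->track_name);
}

/* A new subscriber asked for a track: if we're already receiving those
 * objects (e.g., someone else subscribed before us), we can just fan them
 * out to the new subscriber too; otherwise we have to involve the publisher */
static void handle_subscribe(uint64_t subscribe_id, const char *ns, const char *name) {
	for(int i = 0; i < MAX_SUBS; i++) {
		if(subs[i].active && !strcmp(subs[i].track_namespace, ns) && !strcmp(subs[i].track_name, name)) {
			printf("Already receiving %s/%s, fanning out to the new subscriber\n", ns, name);
			return;
		}
	}
	/* Not receiving it yet: record the subscription and go upstream */
	for(int i = 0; i < MAX_SUBS; i++) {
		if(!subs[i].active) {
			subs[i].subscribe_id = subscribe_id;
			snprintf(subs[i].track_namespace, sizeof(subs[i].track_namespace), "%s", ns);
			snprintf(subs[i].track_name, sizeof(subs[i].track_name), "%s", name);
			subs[i].active = true;
			forward_subscribe_upstream(&subs[i]);
			return;
		}
	}
}

int main(void) {
	/* Two subscribers asking for the same track: only the first SUBSCRIBE goes upstream */
	handle_subscribe(1, "pippo", "video");
	handle_subscribe(2, "pippo", "video");
	return 0;
}
```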
As a first high level intro to the protocol, this is probably enough at this stage, so let’s see how this could be implemented in practice.
Sounds fun, let’s start working on it!
The summary you read in the previous section actually came out of weeks of studying the specification and some other sources, which means that I started implementing stuff before having a complete (or even good) understanding of all the concepts.
An excellent source of information, besides the draft itself, was the quic.video website, which Luke Curley (one of the driving forces behind the MoQ standardization efforts) set up to provide details on the specification, besides demos and open source code. Luke was also incredibly gracious and helpful when I got in touch with him privately, and helped me get a better understanding of the basics, answering all the (often dumb) questions I had. He’s also the main author of moq-rs, a MoQ stack written in Rust that I used a lot for studying the protocol and doing some initial interop tests.
Coming back to the different MoQ roles we introduced in the previous section, his project actually includes several different components that can be used for talking MoQ: there’s a video publisher (moq-pub), a subscriber (moq-sub), a relay (moq-relay), and even a tool that can publish/subscribe a ticking clock that emulates video encoding for the purpose of showcasing groups (moq-clock), that I found very helpful as a starting point. There’s also code that allows you to create a MoQ subscription in the browser (the same that is used in his public demo), but I admittedly didn’t focus too much on that (more on that later).
That said, I spent a lot of time playing with his demos, and trying to write code that would allow me to interact with his implementation in the different roles. Tinkering with the different multiplexing modes, I even managed to contribute a small fix, since there was a typo preventing a couple of them from working as expected.
The first demo I started studying was the one that powers his online demo, that is the delivery of a video stream sourced from a prerecorded file. In this demo, his moq-pub publisher is fed by an ffmpeg instance, and then serves the video frames using MP4 as the container format when subscriptions come in. Studying the demo and looking at traffic on the wire, I figured out that, within the context of the advertised namespace, the demo would always send, in sequence, two separate tracks:
- 0.mp4, which would serve a single object containing the metadata of the video file;
- 1.m4s, which would instead serve the actual video frames, grouped as we discussed previously.
If you’re curious where those track names come from, they’re actually specified in the catalog that’s printed when you first start moq-pub to announce your namespace:
```
[2024-08-29T14:49:34Z INFO moq_pub::media] catalog: {
  "tracks": [
    {
      "codec": "avc1.42C01E",
      "container": "mp4",
      "data_track": "1.m4s",
      "height": 270,
      "init_track": "0.mp4",
      "kind": "video",
      "width": 480
    }
  ]
}
```
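As a side note, pulling those two track names out of a JSON object like that would be simple enough: just to give an idea, here’s roughly what it could look like using the Jansson library (which is what Janus relies on for JSON). Keep in mind this is only an illustration, not how any of the existing stacks actually handle catalogs:

```c
#include <stdio.h>
#include <jansson.h>

/* Sketch: extract the init/data track names from a catalog like the one above */
static int parse_catalog(const char *catalog_json) {
	json_error_t error;
	json_t *root = json_loads(catalog_json, 0, &error);
	if(root == NULL) {
		fprintf(stderr, "Invalid JSON: %s\n", error.text);
		return -1;
	}
	json_t *tracks = json_object_get(root, "tracks");
	if(!json_is_array(tracks) || json_array_size(tracks) == 0) {
		fprintf(stderr, "No tracks in catalog\n");
		json_decref(root);
		return -1;
	}
	/* Only look at the first track, for simplicity */
	json_t *track = json_array_get(tracks, 0);
	const char *kind = json_string_value(json_object_get(track, "kind"));
	const char *codec = json_string_value(json_object_get(track, "codec"));
	const char *init_track = json_string_value(json_object_get(track, "init_track"));
	const char *data_track = json_string_value(json_object_get(track, "data_track"));
	if(kind == NULL || codec == NULL || init_track == NULL || data_track == NULL) {
		fprintf(stderr, "Missing track properties\n");
		json_decref(root);
		return -1;
	}
	printf("Found %s track (%s): subscribe to '%s' first, then '%s'\n",
		kind, codec, init_track, data_track);
	json_decref(root);
	return 0;
}

int main(void) {
	const char *catalog = "{ \"tracks\": [ { \"codec\": \"avc1.42C01E\", \"container\": \"mp4\","
		" \"data_track\": \"1.m4s\", \"height\": 270, \"init_track\": \"0.mp4\","
		" \"kind\": \"video\", \"width\": 480 } ] }";
	return parse_catalog(catalog);
}
```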
As anticipated, though, I haven’t started looking into catalogs yet, so I first discovered those track names just from observing the debug generated by Luke’s applications in action. That said, after working on the MoQ stack itself on top of my library, I could finally prototype my own simple subscriber, that in this specific case would save all the received frames (extracted from incoming objects) to an MP4 file. A simple (and probably not very useful) example is presented below, where Luke’s moq-pub is advertising a pippo namespace to a local relay out of a video file I provided, and my MoQ subscriber then subscribes to that same namespace and the 0.mp4 track (being configured to automatically subscribe to 1.m4s after that).
You can see how, after subscribing to both tracks, objects start coming in. For the video frames, specifically, you can see how group IDs grow monotonically, as each new keyframe comes in, and object IDs reset any time a new group appears, since they’re scoped to the group they belong to. My subscriber was configured to save all those frames to an MP4 file, so I just had to try and play it in order to see if that worked and it did: eureka!
Next step was to try and publish something myself, that I (or moq-rs) could subscribe to. I didn’t want to start looking into video publishing right away, though, mostly because there are actually different ways to handle that, especially for real-time media (we’ll get back to that later). I mentioned how moq-rs comes with a cool demo called moq-clock, which is a complete MoQ publisher/subscriber in all senses, with the exception that the media it distributes is not audio or video, but just plain text, and more precisely a representation of the current date and time. This is yet another demonstration of how flexible, and media agnostic, MoQ actually is, but more importantly it gave me the opportunity to experiment with the protocol using a more manageable media stream.
Luke conceived the demo in quite a clever way, admittedly to try and mimic how video encoding works. This means that, rather than sending a complete string for the current date and time, the demo only sends (almost) the full thing when the time is at 00 seconds, and then for the other seconds in the same minute it only sends the value of the seconds themselves. This is a cool way to mimic video encoding, because we can see that first full string as the “keyframe”, and the other ones as the delta frames, thus allowing us to group them accordingly.
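A rough sketch of that segmentation logic could look like the following; this is just my own reading of the idea, not the actual moq-clock code:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Sketch of the moq-clock idea: one object per second, where a new group
 * starts at :00 with (almost) the full date/time (the "keyframe"), and the
 * other objects only carry the seconds themselves (the "delta frames") */
static void clock_tick(time_t now, uint64_t *group_id, uint64_t *object_id,
		char *payload, size_t size) {
	struct tm tm;
	gmtime_r(&now, &tm);
	if(tm.tm_sec == 0 || *group_id == 0) {
		/* New group: object 0 carries the full date/time prefix */
		(*group_id)++;
		*object_id = 0;
		strftime(payload, size, "%Y-%m-%d %H:%M:", &tm);
	} else {
		/* Same group: only send the seconds */
		(*object_id)++;
		snprintf(payload, size, "%02d", tm.tm_sec);
	}
}

int main(void) {
	uint64_t group = 0, object = 0;
	char payload[64];
	time_t start = time(NULL);
	/* Emit a few "objects", one per second */
	for(int i = 0; i < 5; i++) {
		clock_tick(start + i, &group, &object, payload, sizeof(payload));
		printf("group=%llu object=%llu payload='%s'\n",
			(unsigned long long)group, (unsigned long long)object, payload);
	}
	return 0;
}
```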
The video below shows this in practice, with moq-clock publishing the current time (which by default uses the clock namespace and now as a track name), and my MoQ subscriber subscribing to that and displaying the content of the objects as they’re received.
You can see how the full date/time as printed in moq-clock is actually sent using the differential algorithm we introduced previously. Pretty cool, and more importantly, easy enough for me to try and publish myself! So my next step was trying to do exactly that, with my own stack acting as a publisher and moq-clock now acting as a subscriber instead.
As you can see, that worked too, which I was quite happy about!
At this stage, I was about to leave for Vancouver, where I had to provide remote participation support at the IETF 120 meeting as part of the Meetecho team. I mentioned some of the things that happened at that meeting already, since I explained how I had been working on RoQ at the same time, and how that led me to cool interop sessions during the Hackathon. Unfortunately I didn’t have any opportunity for interop tests with MoQ as well, but during the hackathon I got to meet Jordi Cenzano Ferret, the main force behind Meta’s efforts on MoQ: more specifically, he’s one of the authors of both moxygen, a MoQ relay, and moq-encoder-player, a very cool project implementing a MoQ publisher and subscriber via a browser, using WebWorkers and WebCodecs. Jordi anticipated he wanted to try and do a live MoQ demo during the two sessions that would take place over the course of the IETF week, which of course piqued my interest: I started working on some interop tests with his code as well, and looked forward to attending both MoQ sessions as now I had at least some understanding of what it was all about!
A very active IETF Working Group
One thing that I had noticed even before attending was how active the MoQ WG was. As in all IETF WGs, there’s a mailing list where most discussions happen, but the MoQ WG also has a Github presence, with different repositories for the different activities (e.g., transport, catalog, etc.). Issues and pull requests are used a lot on those repositories to dig deeper into the different issues that are encountered, or to propose changes, and a digest of all repo activities is also published on a regular basis on the official mailing list, to allow everyone to keep up to date.
Besides, there are frequent meetings as well: I mentioned how MoQ did indeed meet twice in Vancouver (which only happens for WGs with a lot of activity), but MoQ contributors also meet regularly and frequently in interim meetings in between in-person meetings. This definitely helped MoQ get into shape quite quickly, and the regular discussions and feedback helped address some potential issues in the protocol earlier than it would have normally happened. There are also many implementations of MoQ available, even though not all implement the same version of the protocol: the screenshot below comes from the recording of the first MoQ session in Vancouver, and depicts a basic matrix showing the interop tests that were documented so far. It’s worth mentioning that this slide didn’t reference my stack or Kota‘s, since we had only advertised them shortly before the meeting.
With two 1h30m meetings, as you can guess there was a lot to discuss and present, so I won’t go through the proposed changes in this blog post: you can check the recordings online if you want to learn more. I thought I’d focus more on Jordi’s demo instead, since it’s more of interest from an implementation perspective.
Jordi’s idea was to basically stream the whole meeting using MoQ. Considering that for Meetecho we work a lot on the room setup for the purpose of remote participation, we helped Jordi get in touch with the AV team, so that he could get access to audio from the room mixer, while he already had a webcam he could use for video. During the meeting, he then used moq-encoder-player to publish the stream via MoQ on a public moxygen instance, and gave everyone a link to watch the session via MoQ. This was indeed pretty cool, and seemed to work nicely for what was arguably the very first public MoQ stream (at least at this scale). You can see a couple of pictures of that demo below (thanks to Jordi for the pics!).
In the days between the hackathon and the first MoQ session, I had started tinkering with both moxygen and moq-encoder-player, to see if I could come up with some interop tests pretty much like I had done with Luke’s moq-rs. In this case, I was doubly interested, as Jordi’s code used real-time media, rather than pre-recorded files, and the use of WebCodecs was quite interesting too. From a visual perspective, it looked a lot like many of the WebRTC demos we all know, so I knew what I wanted to do was also figure out ways to get those MoQ streams to WebRTC, and vice versa.
During the course of the meeting, and looking at Jordi’s repos, I noticed that the format that was used to transport audio and video was different from the one Luke had employed. Specifically, where Luke had used MP4 as a container for his video streaming demo, Jordi had instead leveraged something called the Low Overhead Media Container (LOC), even though slightly tweaked for his needs. As the name suggests, LOC is meant as a lightweight (low overhead) container format for media, with some properties that can help act as metadata to associate to media frames as they’re delivered. Studying the specification, I thought it could be seen, twisting reality a bit, as something not that far from what RTP is to WebRTC: a way to provide, e.g., timing information or info on the content of the media frames. This further encouraged me to start digging deeper in this, to see if a WebRTC-to-MoQ translation was indeed possible, and how hard it would be.
If you visit the repo for the moq-encoder-player, the actual LOC format used in the demos is documented quite well, and is depicted in the diagram below:
As you can see, most info is indeed similar to what we’re used to seeing when we look at RTP. It has a sequence ID and a timestamp for timing information, for instance, and the metadata contains information that WebCodecs produce/need to be able to do their work. I started by just using moq-encoder-player in conjunction with moxygen as it was, to have a look at what LOC looked like in practice for both audio and video, until I felt confident enough to try and start messing with it, to produce my own.
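To give an idea of the kind of information the header carries, here’s an illustrative (and simplified) structure summarizing the fields a bridge to RTP cares about; the actual wire format, field sizes and encoding are documented in the moq-encoder-player repo, so take this as a conceptual summary rather than a parser for the real thing:

```c
#include <stdint.h>
#include <stddef.h>

/* Conceptual summary of the kind of info a LOC-style header carries in
 * Jordi's demos: this is NOT the actual wire format (field sizes and
 * encoding differ), just the pieces a WebRTC bridge cares about */
typedef struct loc_header {
	uint8_t media_type;		/* audio or video (illustrative values) */
	uint8_t chunk_type;		/* key vs delta frame, needed by WebCodecs */
	uint64_t seq_id;		/* monotonically increasing, maps nicely to an RTP-like sequence */
	uint64_t timestamp;		/* capture timestamp, maps to an RTP timestamp after rescaling */
	uint64_t duration;		/* duration of the chunk */
	size_t metadata_len;	/* WebCodecs decoder config (e.g., avcC for H.264), when present */
	const uint8_t *metadata;
	size_t payload_len;		/* the actual encoded frame */
	const uint8_t *payload;
} loc_header;
```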
When you use the demos on their own, they look like this:
As you can see, it does indeed feel familiar to those experimenting with WebRTC SFUs every day, since it’s basically a publish/subscribe demo. In this specific instance, I was testing everything locally: the reason why latency seems a bit off is probably because the demo defaults to a 100ms buffer for video. Besides, WebCodecs are used both to encode (page on the left) and decode (page on the right), and canvas elements are used to render each frame on a regular basis: both may cause the code to struggle a bit on some less powerful machines (and I can’t rule out it was the case for my laptop as well, considering the 854×480 video resolution).
At any rate, time to bring WebRTC in the picture!
Enter WebRTC (and Janus)
Checking Jordi’s demos, I noticed some things that helped me kickstart the effort:
- Unlike moq-rs, which used version-03 of the MoQ transport, moxygen used version-04: this forced me to make some changes to my stack, in order to be able to dynamically detect the negotiated version (which is part of the MoQ transport client/server setup), and take the negotiated version into account when parsing/crafting messages (due to slight differences that exist between the two versions).
- Both the publisher page (src-encoder.html) and the subscriber one (src-player.html) had namespace and track names as configurable properties in text boxes: this made the discoverability of those properties not an issue, as I could simply copy and pass them along in my demos (as those two demos do indeed expect you to do).
- For audio, Opus was used, packaged in chunks of 10ms: this meant that bridging audio was probably going to be relatively easy, as browsers support Opus as a codec out of the box, and 10ms is a ptime all browsers support without problems as well.
- For video, H.264 was used, and the metadata information produced by WebCodecs was probably going to be quite important: whether this would end up being H.264 I could just bring to WebRTC (and back) without problems remained to be seen, though. An important distinction that I knew I’d have to take care of was that on MoQ video frames would be sent in their entirety, while on RTP they’d have to be split across multiple packets most of the time, due to MTU constraints (see the sketch right after this list). This meant splitting frames across multiple RTP packets when doing MoQ-to-WebRTC, and reconstructing frames out of multiple packets when doing WebRTC-to-MoQ. As such, definitely more work than for audio, but not impossible.
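To give an idea of what the video splitting involves, here’s a rough sketch of RFC 6184 FU-A fragmentation, i.e., one of the ways a single H.264 NAL unit that doesn’t fit in one packet can be spread across multiple RTP payloads. It’s a simplification (no STAP-A, no interleaving, fixed buffer size), not what my plugin does verbatim:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Rough sketch of RFC 6184 FU-A fragmentation: split a single H.264 NAL unit
 * (no Annex B start code) into RTP payloads of at most mtu bytes each.
 * The callback receives each payload, plus a flag telling whether it's the
 * last one for this NAL (where the RTP marker bit would typically be set). */
typedef void (*rtp_payload_cb)(const uint8_t *payload, size_t len, int last);

static void fragment_nal(const uint8_t *nal, size_t len, size_t mtu, rtp_payload_cb cb) {
	uint8_t packet[1500];
	if(len == 0 || mtu < 3 || mtu > sizeof(packet))
		return;
	if(len <= mtu) {
		/* Small enough to fit: send the NAL as a single RTP payload */
		cb(nal, len, 1);
		return;
	}
	uint8_t nal_header = nal[0];
	uint8_t fu_indicator = (nal_header & 0xE0) | 28;	/* keep F/NRI bits, type 28 = FU-A */
	uint8_t fu_header = nal_header & 0x1F;				/* original NAL type */
	const uint8_t *data = nal + 1;						/* the NAL header byte is not repeated */
	size_t remaining = len - 1, chunk = mtu - 2;		/* 2 bytes for FU indicator + FU header */
	int first = 1;
	while(remaining > 0) {
		size_t plen = remaining > chunk ? chunk : remaining;
		int last = (plen == remaining);
		packet[0] = fu_indicator;
		packet[1] = fu_header | (first ? 0x80 : 0x00) | (last ? 0x40 : 0x00);	/* Start/End bits */
		memcpy(packet + 2, data, plen);
		cb(packet, plen + 2, last);
		data += plen;
		remaining -= plen;
		first = 0;
	}
}

static void print_payload(const uint8_t *payload, size_t len, int last) {
	printf("RTP payload of %zu bytes%s\n", len, last ? " (last, marker bit set)" : "");
}

int main(void) {
	/* A fake 3000-byte IDR slice, split with a 1200-byte MTU */
	uint8_t nal[3000] = { 0x65 };
	fragment_nal(nal, sizeof(nal), 1200, print_payload);
	return 0;
}
```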
With these assumptions, I could start working on the code. As I had done when implementing my WebRTC/RoQ integrations, I modified the same Janus QUIC plugin I was using as a testbed to add MoQ support as well. To mimic what happens with WHIP and WHEP in the WebRTC world, I configured the plugin to expect an SDP offer no matter the expected MoQ role: this meant that for a WebRTC user to become a MoQ publisher, they’d have to provide a sendrecv/sendonly SDP offer, while to act as a MoQ subscriber they’d have to provide a recvonly SDP offer. Additional attributes passed via the plugin API would provide the additional context, e.g., QUIC server to connect to, track namespace and names (for audio and video), etc.
This way, just as I had done for RoQ, I could react to new PeerConnections by creating new MoQ endpoints using my library accordingly, and then take care of announcing/subscribing depending on the role. Specifically, for publishers this meant:
- negotiating a WebRTC PeerConnection with RTP on the way in;
- creating a MoQ publisher using my library;
- announcing the provided namespace, and waiting for incoming subscriptions;
- when a subscription arrived, handling incoming RTP packets, and preparing MoQ objects to send accordingly (a rough sketch of this loop follows below).
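The sketch below tries to convey that last step for video; the helpers are hypothetical (they don’t exist with these names in my plugin or anywhere else), and the frame reconstruction is oversimplified by trusting the RTP marker bit, but it shows how complete frames become objects and keyframes start new groups:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_FRAME (512*1024)
static uint8_t frame[MAX_FRAME];
static size_t frame_len = 0;
static uint64_t group_id = 0, object_id = 0;

/* Stub: in a real plugin this would craft the object and push it on the wire */
static void moq_send_object(uint64_t group, uint64_t object, const uint8_t *data, size_t len) {
	printf("Sending object (group=%llu, object=%llu, %zu bytes)\n",
		(unsigned long long)group, (unsigned long long)object, len);
}

/* Called for every depacketized chunk of video coming from RTP (real H.264
 * depacketization is more involved than this, e.g., FU-A reassembly); the
 * marker flag tells us this was the last packet of the frame */
static void on_rtp_video(const uint8_t *data, size_t len, bool keyframe, bool marker) {
	if(frame_len + len <= MAX_FRAME) {
		memcpy(frame + frame_len, data, len);
		frame_len += len;
	}
	if(!marker)
		return;		/* Frame not complete yet */
	/* Frame complete: keyframes start a new group */
	if(keyframe) {
		group_id++;
		object_id = 0;
	}
	moq_send_object(group_id, object_id++, frame, frame_len);
	frame_len = 0;
}

int main(void) {
	uint8_t chunk[100] = { 0 };
	/* Pretend a keyframe split in two packets arrives, then a single-packet delta frame */
	on_rtp_video(chunk, sizeof(chunk), true, false);
	on_rtp_video(chunk, sizeof(chunk), true, true);
	on_rtp_video(chunk, sizeof(chunk), false, true);
	return 0;
}
```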
Likewise, for subscribers the process was meant to be similar:
- negotiating a WebRTC PeerConnection with RTP on the way out;
- creating a MoQ subscriber using my library;
- subscribing to the provided namespace/track names;
- for each incoming MoQ object, translating it to RTP packets to send back via WebRTC.
This was the idea in a nutshell, and what I started doing. You can find a (slightly spoiler-y, since it anticipates a few of the points I’ll make soon) diagram below of how this translation process was sketched and implemented from a high level perspective.
To make things easier, I started by implementing subscribers, so the MoQ-to-WebRTC translation, using src-encoder.html as the application publishing media via MoQ, and the Janus plugin subscribing to it.
Getting audio to work was pretty straightforward. As mentioned, LOC provided enough info for me to craft my own RTP headers, and besides each MoQ object contained a payload that I could place in an RTP payload without further manipulation. Doing that, audio worked pretty much out of the box, which was quite exciting!
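For reference, that operation boils down to something like the sketch below: a plain 12-byte RTP header, with the sequence number and timestamp derived from the LOC info (here I’m assuming 10ms Opus chunks at 48kHz, i.e., 480 timestamp ticks per object; payload type and SSRC are arbitrary values for the example), followed by the Opus payload as it is:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/* Sketch: turn an Opus payload received in a MoQ object into an RTP packet.
 * The values here (payload type 111, a fixed SSRC, 10ms per chunk) are just
 * assumptions for the example. Returns the RTP packet size, or 0 on error. */
static size_t opus_to_rtp(uint64_t seq_id, const uint8_t *opus, size_t opus_len,
		uint8_t *rtp, size_t rtp_size) {
	if(rtp_size < 12 + opus_len)
		return 0;
	uint16_t seq = (uint16_t)(seq_id & 0xFFFF);		/* RTP sequence numbers are 16 bits */
	uint32_t timestamp = (uint32_t)(seq_id * 480);	/* 10ms at 48kHz = 480 ticks per object */
	uint32_t ssrc = 0x12345678;						/* arbitrary for the sketch */
	rtp[0] = 0x80;									/* version 2, no padding/extension/CSRC */
	rtp[1] = 111;									/* dynamic payload type (e.g., Opus) */
	uint16_t seq_n = htons(seq);
	memcpy(rtp + 2, &seq_n, 2);
	uint32_t ts_n = htonl(timestamp);
	memcpy(rtp + 4, &ts_n, 4);
	uint32_t ssrc_n = htonl(ssrc);
	memcpy(rtp + 8, &ssrc_n, 4);
	/* The Opus payload can be copied as it is after the header */
	memcpy(rtp + 12, opus, opus_len);
	return 12 + opus_len;
}

int main(void) {
	uint8_t opus[160] = { 0 }, rtp[1500];
	size_t len = opus_to_rtp(0, opus, sizeof(opus), rtp, sizeof(rtp));
	printf("Built an RTP packet of %zu bytes\n", len);
	return 0;
}
```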
Video, on the other hand, proved immediately to be trickier. I anticipated how I would have to deal with the splitting of incoming video frames to multiple RTP packets, but that wasn’t a big deal: it’s what I’ve done countless times already, e.g., in my videomixer code. Just copying code I had written in the past would probably help address that part. What was much trickier to handle was the actual H.264 bitstream. In fact, as you may or may not know, the RTP packetization rules for H.264 define custom ways to transport NALs in RTP packets: so far, I had always worked with Annex B as a format for H.264 bitstreams (where each NAL is preceded by a start code), but it turned out that WebCodecs in Jordi’s demo were using a different bitstream format instead, called AVCC. I won’t go too much into detail on the differences, also because I’m not really much of an expert on the subject, but this blog post by our friends at Software Mansion provides a good enough overview if you’re curious.
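Just to make the difference concrete, a simplified conversion could look like the sketch below, assuming 4-byte NAL length prefixes in the AVCC bitstream (the actual length size is signalled in the avcC decoder configuration, which a complete implementation should check):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified AVCC to Annex B conversion: AVCC prefixes each NAL with its
 * length (4 bytes assumed here), while Annex B prefixes each NAL with a
 * 00 00 00 01 start code instead. Returns the Annex B size, or 0 on error. */
static size_t avcc_to_annexb(const uint8_t *avcc, size_t len, uint8_t *annexb, size_t size) {
	static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };
	size_t offset = 0, written = 0;
	while(offset + 4 <= len) {
		/* Read the 4-byte big-endian NAL length */
		uint32_t nal_len = ((uint32_t)avcc[offset] << 24) | ((uint32_t)avcc[offset+1] << 16) |
			((uint32_t)avcc[offset+2] << 8) | (uint32_t)avcc[offset+3];
		offset += 4;
		if(nal_len == 0 || offset + nal_len > len || written + 4 + nal_len > size)
			return 0;	/* Broken bitstream, or output buffer too small */
		/* Replace the length prefix with a start code, and copy the NAL */
		memcpy(annexb + written, start_code, sizeof(start_code));
		memcpy(annexb + written + 4, avcc + offset, nal_len);
		written += 4 + nal_len;
		offset += nal_len;
	}
	return written;
}

int main(void) {
	/* A fake AVCC buffer containing a single 5-byte NAL */
	uint8_t avcc[] = { 0x00, 0x00, 0x00, 0x05, 0x65, 0x01, 0x02, 0x03, 0x04 };
	uint8_t annexb[64];
	size_t len = avcc_to_annexb(avcc, sizeof(avcc), annexb, sizeof(annexb));
	printf("Converted to %zu bytes of Annex B\n", len);
	return 0;
}
```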
Long story short, before I could put those frames in RTP, I had to take care of translating the AVCC bitstream to Annex B, as that’s what I was more familiar with. Not a titanic job, but still something that had to be done with care, as otherwise a broken H.264 stream could end up in the RTP packets, thus resulting in completely broken video on WebRTC. Making sure the SPS/PPS that AVCC carries out of band ended up as in-band NALs was particularly important, in this context, since that’s fundamental for WebRTC to be able to decode H.264 streams. After a bit of work, I managed to get it working, as you can see in the video below:
As you can see, the video is slightly more “jerky” on the WebRTC side, but that’s probably mainly caused by the fact Jordi’s page is configured to generate keyframes every two seconds. If you’re familiar with WebRTC, you’ll know that such frequent keyframes are common in broadcasting applications (e.g., HLS), but are actually overkill and discouraged when using WebRTC instead, since they make the stream more “fragile” (a loss is more likely to impact a keyframe, since there’s more of them), and force the decoder to work more. At any rate, video was working, so the first step was done: success! Latency seemed a bit better than the original demo as well, but that may have to do with the native decoding and rendering of video that browsers perform in the WebRTC case.
Now it was time to work on the reverse chain instead: a WebRTC publisher being turned into a MoQ publisher I could subscribe to using src-player.html. This meant crafting my own LOC headers as well, besides preparing the frames to be sent as payloads, so arguably a more complex endeavour than before, where I only had to parse LOC headers and translate them to a familiar format like RTP.
I once more started with audio, and again I managed to get something working pretty quickly. Assuming 20ms as a ptime in use on the WebRTC side, it was easy to craft LOC headers accordingly, and once more I could just take the payload from RTP packets and put it exactly as it was in MoQ objects (after the LOC header). I only had to craft a metadata object, since WebCodecs rely on that to know how to interpret the frames they’ll receive, but just mimicking what I had seen src-encoder.html originate was enough to make it happy, and I could hear what I was capturing via WebRTC on the MoQ player page.
Again, it was video that got me scratching my head for quite a while. I knew that, just as before, I’d have to take care of the Annex B to AVCC translation, but in this case I’d also have to reconstruct frames out of incoming RTP packets. Long story short, I couldn’t get it done in time before starting my summer holiday break: after three long and relaxing weeks, I got back to it, and finally managed to “crack the code” and get it partially working. When I say “partially”, I mean that it worked for a few seconds, and then it froze, with errors being displayed on the src-player.html page console about the codec being closed. Debugging the issue, I figured out the freeze happened as soon as the second keyframe arrived, and I understood it had to be related to the video resolution changing in the WebRTC video stream during the ramp-up phase. In fact, the way Jordi’s code was conceived, it would only create a WebCodecs decoder the first time it saw metadata in the LOC header: this is not an issue when using src-encoder.html as a media source, as the encoded resolution never changes there, but it was definitely an issue with a WebRTC source, as the video resolution can change often there, as a consequence of BWE (and so particularly at the beginning of a PeerConnection).
Considering my lack of experience with WebCodecs, I mentioned my progress to Jordi in a mail and mentioned the problem, and he was gracious enough to solve the issue pretty much immediately! He ended up preparing a patch that would re-create the decoder any time the metadata differed, and that did indeed solve my video freezing problem. The result is what you can see in the video below:
In the animation you can also see the video changing resolution, after a few seconds, which confirms this was indeed what was causing the freeze before Jordi’s fix. We also see latency being a little worse than the MoQ-to-WebRTC demo, which again is probably due to a combination of the default 100ms buffering the MoQ viewer page employs, the use of WebCodecs, and the canvas rendering.
But what’s important: success! I actually managed to get WebRTC and MoQ to like each other, if only for a little while!
What’s next?
There’s definitely a lot to do next, but the first step will be talking about it at RTC.ON in Krakow: this whole effort (starting from the study of QUIC itself) was born out of me submitting a talk on this topic, partly to force myself to work on it and get results in time. 😀 Should you attend the event (you should!), see you there!
Apart from this, there’s indeed a lot of work that still needs to be done. As we discussed in this blog post, MoQ is a very active effort in the IETF standardization activities, with frequent interim meetings and many iterations on the state of the MoQ transport specification. The availability of so many different implementations has proven very beneficial so far in gathering precious feedback, which is helping iron issues out and come up with specification changes where needed. This also means that the protocol changes often, which will indeed require frequently catching up with those changes in order to keep the implementations up to date.
As for my own implementation, the MoQ stack is at a good point, but it is by no means complete. Not all delivery methods have been properly implemented (or tested), for instance, and while I have publisher and subscriber implementations for most functionality, some are still missing. A relay implementation is also entirely missing, at the moment, since I only started working on some stubs (intercepting the relevant events), but without adding the required “meat” to them. Catalog support is also sadly missing at the moment, and that will indeed require some effort in the future.
Besides the state of the MoQ stack, my QUIC stack needs a lot of love too. To get to a point where I could quickly start prototyping the protocols I was interested in, I did indeed cut some corners in the QUIC stack itself, meaning there are some parts that could see a lot of improvement (e.g., the event loop management, and performance in general), while there are others that are missing entirely (basically the whole of RFC 9002, so retransmissions, flow/congestion control, etc.).
Long story short, there will be enough work to keep me busy for a while, that’s for sure! I obviously plan to release all this as open source, which means that hopefully in the next few weeks I won’t be the only one hacking at the code, as there may be other people interested in experimenting with these new technologies, and contributing testing and enhancements.
That’s all, folks!
It took a while to start working on this blog post, as the summer holidays started right in the middle of my efforts on this, and shortly after the IETF meeting ended, but I finally managed to get to a point where I had something meaningful to share. It was fun revisiting the different steps that led me here, and I’m definitely excited to see where MoQ will bring me next. While MoQ and WebRTC are apparently competing technologies, I personally don’t see them as such: there’s definitely some overlap, but also room for both to coexist, at least for some years to come, and that’s a space I think will open many interesting opportunities. I hope you enjoyed reading this too, so if you have any thoughts or want to get in touch for some interop tests, please don’t hesitate to let me know!