WHIP-ing WebRTC to Janus!

September 30, 2020 Lorenzo Miniero

Broadcasting is a huge industry, and historically there have been different protocols used for the purpose. A common choice, and basically a de-facto standard today, is RTMP (Real Time Messaging Protocol), which is used in almost all broadcasting applications as a way to inject media that can then be distributed, e.g., via HLS (HTTP Live Streaming).

While this is a simple and effective setup, it has a few drawbacks, especially when latency is an issue or a challenge. In fact, even with the tightest of settings, you can rarely get below a few seconds of delay from producer to consumer. For this and other reasons (you can learn more in this interesting blog post by Dr.Alex, which summarizes the current state of the broadcasting industry), more and more efforts have recently been devoted to using WebRTC as a broadcasting technology instead, in order to take advantage of the truly real-time media delivery it implements, and at the same time benefit from all the additional functionality WebRTC supports today (e.g., in terms of bandwidth adaptation, encryption, and so on).

Using WebRTC for broadcasting introduces a new challenge, though. As anticipated, most broadcasting tools (e.g., OBS) normally support RTMP, and in some cases RTSP, as a way to inject media to a media server, but not much else: no WebRTC for sure. This means that, in case WebRTC distribution is required, the media server would likely have to take care of a transcoding process for media, or at the very least a protocol translation process from RTMP to ICE/DTLS/SRTP. While possible, this is clearly suboptimal, and not an ideal approach: a much more efficient way of handling things would be to use WebRTC for the injection process as well. Considering how SFUs typically work, this way media servers would never have to touch the media at all from producer to consumers, unless required (e.g., for re-broadcasting to other technologies, mixing or recording). Large scale broadcasting would be possible using different approaches, e.g., using our SOLEIL tree-based distribution, which was the subject of my Ph.D thesis several years ago, and was briefly discussed in a CommCon presentation a couple of years ago as well.

This is exactly what the WebRTC-HTTP ingestion protocol (WHIP), a recent IETF contribution by CoSMo Software masterminds Sergio Garcia Murillo and Alex Gouaillard, aims at addressing: that is, providing a simple, and most importantly media server-agnostic, way of injecting WebRTC streams that could be integrated in existing broadcasting tools. You can learn more about the effort in this blog post by Sergio and this other blog post by Dr.Alex, both very interesting and informative (and yes, guys, I agree we could all have been more creative in picking an image for the blog post, but it HAD to be that one!).

WebRTC-HTTP ingestion protocol (WHIP)

As anticipated, a first specification of WHIP was recently submitted as an individual draft at the IETF by CoSMo, in order to foster discussion about this much-needed functionality, and possibly come up with an actual open standard all companies in the industry can refer to.

The way it was conceived is quite simple, and starts from a few assumptions:

Almost all broadcasting ingestion protocols only require a URI (and possibly some credential token) to publish, nothing else;
There is no standard signalling protocol in the WebRTC specification, but due to the dynamic nature of the protocol (ICE and DTLS, mostly), all signalling protocols are usually a bit more complex than a single URI;
For broadcasting purposes, you really only need to send an offer, and expect an answer back, which means signalling could be reduced to a request/response mechanism;
Assuming a public media server, and possibly the use of ice-lite as well, trickling of candidates can be avoided thanks to the automated discovery of peer-reflexive (prflx) candidates media servers can use for connectivity.

All these assumptions are quite reasonable, and mean that the whole WebRTC negotiation process in WHIP can be reduced to an HTTP POST to send the SDP offer, and a 200/202 answer from the media server to return the SDP answer instead.

That’s all! Everything else can be done pretty much in an automated way using the existing WebRTC tools today, meaning that the HTTP/202 exchange is the only thing that the media producer and the media server need to exchange at a signalling level: once that’s taken care of, the media producer can start the connectivity checks to the media server, which will eventually lead to the DTLS handshake and a delivery of the media via SRTP. This is exemplified in the diagram below, which comes from the draft itself:

This is quite a simple and effective way of exchanging information, that would be simple to integrate (WebRTC stack aside) in existing broadcasting tools. Configuring a broadcast would be easy too, as all you’d need to set would be the HTTP URL of the WHIP backend. For authorization purposes, the draft currently assumes Bearer Tokens should be used, which means the authorization process can fit nicely in the simple HTTP/202 request/response exchange, without adding any additional overhead to the signalling. An interesting observation is that there’s no requirement that the WHIP endpoint and the media server be the same component, or actually be colocated at all: this provides the right flexibility to scale the two components independently.
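To make the exchange described above more concrete, here is a minimal sketch of what a WHIP producer has to do at the signalling level, using only the Python standard library. It is an illustration under assumptions, not the code used by OBS or by the draft authors: the function names are hypothetical, and the only facts taken from the text are the POST method, the application/sdp content type, the Bearer token, and the 200/202 response carrying the SDP answer.

```python
import urllib.request


def build_whip_request(endpoint_url: str, sdp_offer: str, bearer_token: str) -> urllib.request.Request:
    """Build (without sending) the single HTTP POST that carries the SDP offer."""
    return urllib.request.Request(
        endpoint_url,
        data=sdp_offer.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/sdp",
            "Accept": "application/sdp",
            "Authorization": f"Bearer {bearer_token}",
        },
    )


def send_whip_offer(request: urllib.request.Request, timeout: float = 5.0) -> str:
    """Send the offer and return the SDP answer found in the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The text mentions a 200/202 answer; anything else is treated as a failure here.
        if response.status not in (200, 202):
            raise RuntimeError(f"unexpected WHIP response status: {response.status}")
        return response.read().decode("utf-8")
```

A producer would call `build_whip_request("http://192.168.1.218:7080/api/whip", offer, token)` and pass the result to `send_whip_offer`; the address is the one used later in this post, while `offer` and `token` are placeholders. Everything after that (ICE, DTLS, SRTP) is handled by the WebRTC stack, not by signalling.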

I was obviously curious to start experimenting with this, especially considering CoSMo Software already integrated the producer functionality in a fork of OBS Studio, a very popular broadcasting tool. This led me to write a thin layer to take care of the WHIP support, in order to allow WHIP media producers to send media to Janus, as I’ll explain in the next section.

WHIP-ing Janus

In order to interact with Janus, a user or application needs to implement the Janus API. While Janus supports different protocols (e.g., HTTP, WebSockets, MQTT and others), and more can be added any time thanks to its modular nature, they all act as “transports” for the Janus API. This means that adding support for WHIP wasn’t something I could simply add to Janus itself as a new transport plugin: in fact, the Janus core would have expected Janus API messages anyway. Besides, while the VideoRoom plugin makes the most sense for simple ingestion purposes, the plugin to interact with may actually be different for some use cases, even a custom and/or proprietary plugin implemented by a third-party.

For this reason, for my proof-of-concept I decided to actually implement WHIP as a thin layer in front of Janus instead. This allowed me to completely decouple the WHIP semantics from the existing Janus API, and actually keep Janus completely unaware of the fact a different ingestion mechanism is used. This is a common approach when designing and developing Janus-based applications that need to expose a custom API towards end users, and there actually are several Janus API stacks in different languages that can be used for the purpose, some of which are listed in our documentation.

This thin layer implements a very simple behaviour:

On one end, it implements an HTTP web server, to be able to receive SDP offers from WebRTC producers via WHIP;
On the other end, it implements an internal Janus API stack, to interact with Janus on behalf of WebRTC producers (that is, creating a session, attaching to the VideoRoom, and publishing in the VideoRoom relaying the SDP offer/answer exchange).
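The HTTP half of this behaviour can be sketched as a small, transport-independent function. This is a hypothetical illustration in Python rather than the actual proof-of-concept: `handle_whip_post` and `publish_to_janus` are made-up names, the 415 and 400 error codes are my own assumption, and token validation is left out on purpose.

```python
from typing import Callable, Dict, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Dict[str, str], str]


def handle_whip_post(
    headers: Dict[str, str],
    body: str,
    publish_to_janus: Callable[[str], str],
) -> Response:
    """Turn a WHIP POST into an SDP answer by delegating to the Janus side.

    publish_to_janus is a placeholder for the Janus API half of the layer:
    it receives the SDP offer and returns the SDP answer.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type != "application/sdp":
        return 415, {}, ""  # assumption: reject anything that is not an SDP
    if not body.startswith("v=0"):
        return 400, {}, ""  # assumption: minimal sanity check on the offer
    sdp_answer = publish_to_janus(body)
    return 202, {"Content-Type": "application/sdp"}, sdp_answer
```

Keeping the handler free of any web framework makes it easy to plug into whatever HTTP server is used, and to test it with a fake Janus callback.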

The diagram below shows this simple mechanism from a visual perspective:

This is actually quite simple to implement. In fact, while the Janus API is not that complex to implement in the first place, it’s even simpler to implement when limited to the requirements imposed by WHIP. All that is needed, in fact, is the ability to create sessions, attach to the VideoRoom plugin, and publish in a room, which actually only involves a single request and a couple of events. Anything else (e.g., managing trickle candidates, VideoRoom events related to room management, etc.) can safely be ignored, since it’s not relevant to the application itself.
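As a rough idea of how little of the Janus API is involved, the three steps above could be expressed as the following message builders. This is a sketch, not the code of the proof-of-concept: the helper names are hypothetical, the messages are shaped as they would be sent over a transport like WebSockets (where `session_id` and `handle_id` travel in the message itself), and the use of the VideoRoom `joinandconfigure` request to join and publish in a single step is my assumption about what "a single request" maps to.

```python
import uuid
from typing import Optional

VIDEOROOM_PLUGIN = "janus.plugin.videoroom"


def _transaction() -> str:
    """Random identifier used to match Janus responses to requests."""
    return uuid.uuid4().hex


def create_session() -> dict:
    return {"janus": "create", "transaction": _transaction()}


def attach_videoroom(session_id: int) -> dict:
    return {
        "janus": "attach",
        "plugin": VIDEOROOM_PLUGIN,
        "session_id": session_id,
        "transaction": _transaction(),
    }


def publish(session_id: int, handle_id: int, room: int, sdp_offer: str) -> dict:
    """Join the room as a publisher and send the producer's SDP offer."""
    return {
        "janus": "message",
        "session_id": session_id,
        "handle_id": handle_id,
        "transaction": _transaction(),
        "body": {"request": "joinandconfigure", "ptype": "publisher", "room": room},
        "jsep": {"type": "offer", "sdp": sdp_offer},
    }


def extract_answer(event: dict) -> Optional[str]:
    """Return the SDP answer if this Janus event carries one, else None."""
    jsep = event.get("jsep") or {}
    if event.get("janus") == "event" and jsep.get("type") == "answer":
        return jsep.get("sdp")
    return None
```

Every other event Janus may send for the handle can simply be dropped by a layer like this one, which only waits for the event that `extract_answer` recognizes.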

To keep things simple, my POC currently has some properties hardcoded, e.g., the room to publish to, but it wouldn’t be hard to extend the code to also automatically create a new room for each new producer. At the same time, the POC doesn’t perform any RTP forwarding functionality either (which, as explained in this FOSDEM presentation, would be at the foundation of any Janus-based large scale broadcast), but that would be trivial to add as well. For the sake of simplicity, I just validated that this thin WHIP layer would do the job by joining, via a browser, the same room a producer would ingest to. This will be presented in the next section.

Before doing that, though, it’s worth spending a few words on the lack of trickling in WHIP, and its potential impact on unprepared media servers. As anticipated, to keep the signalling process tight and constrained to a single request/response, candidates are never trickled, as that would require additional out-of-band message exchanges; at the same time, though, in the vast majority of cases the SDP presented by producers will not contain any candidate at all, since waiting for some to be gathered would delay the delivery of the SDP, which most client stacks prefer to avoid. This means that, in case of WHIP WebRTC producers, it’s very likely the media server would never receive any ICE candidate at all.

In the past, with Janus this was a problem. In fact, Janus uses the well known and reliable libnice library for ICE, which works great but has a known constraint: while it will always respond to incoming connectivity checks, it won’t start sending any of its own until at least a remote candidate has been explicitly set via its programmatic APIs. This is currently captured in an issue on their repository, and can be a problem if the peer never presents any candidate, since it means the ICE process may never complete, especially if full-ice is used. A few years ago we did come up with a workaround for that, though, which does indeed solve this problem. Specifically, since even when no remote candidates are known libnice will accept and reply to incoming connectivity checks anyway, this means we can be notified about unknown (at runtime) peer-reflexive candidates: as such, the trick here is to re-inject the just notified prflx candidate back to libnice using the dedicated API. This will “wake up” the stack (as it will be passed a new remote candidate it is actually already aware of), which will then start performing connectivity checks to complete the ICE process on both ends. This simple trick is depicted in the diagram below:

As such, if you’re planning to add WHIP support to your WebRTC server infrastructure, whether it’s based on libnice or not, make sure you’re prepared to handle sessions where producers may (and often will) not explicitly advertise any candidate at all. In that case, you will indeed have to rely on prflx candidates for the job, so it might be a good idea to look into those if you’ve never dealt with them before.
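The logic of the prflx re-injection trick can be illustrated with a toy model. To be clear, this is not libnice and none of the names below are libnice APIs: `ToyIceAgent` is a made-up class that only mimics the behaviour described above (it always answers incoming checks, but only starts its own checks towards candidates that were explicitly set), so that the workaround itself is easier to follow.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Address = Tuple[str, int]  # (IP address, port)


@dataclass
class ToyIceAgent:
    """Toy stand-in for an ICE agent with the constraint described in the text."""

    remote_candidates: Set[Address] = field(default_factory=set)
    checks_sent: List[Address] = field(default_factory=list)

    def set_remote_candidate(self, address: Address) -> None:
        # Only explicitly set candidates make the agent start its own checks.
        self.remote_candidates.add(address)
        self.checks_sent.append(address)

    def answer_incoming_check(self, source: Address) -> str:
        # Incoming checks are always answered, even from unknown sources.
        return "binding-success"


def on_incoming_check(agent: ToyIceAgent, source: Address) -> str:
    """Answer the check and, if the source is new, re-inject it as a remote candidate."""
    response = agent.answer_incoming_check(source)
    if source not in agent.remote_candidates:
        # The source of the check is a peer-reflexive candidate we were just
        # notified about: passing it back "wakes up" the agent.
        agent.set_remote_candidate(source)
    return response
```

In this model, an agent that received no candidates via signalling sends no checks at all until the first incoming check arrives, at which point the re-injection makes it start checking towards the producer.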

Enter OBS-WebRTC

OBS Studio is a very well known, and widely used, open source broadcasting tool, with support for many features. Out of the box, though, it doesn’t support WebRTC, which is something the amazing folks at CoSMo Software added themselves some time ago already, in a fork called OBS Studio WebRTC. Thanks to their efforts, Janus is indeed a first class citizen in that integration, as it is listed as one of the streaming targets you can configure there: just insert the WebSockets API address of Janus and the VideoRoom ID to join, and the plugin in OBS takes care of the whole Janus API exchanges to allow OBS to publish there. Works great!

In order to facilitate the deployment of WHIP by providing an easy to use compliant WHIP producer, CoSMo Software recently expanded the WebRTC support in their OBS fork to support the WHIP specification too. More precisely, they added a new “Common WebRTC Streaming Platform” streaming target, that only requires a couple of parameters:

The address of the WHIP endpoint;
The Bearer token to use.

As we discussed in the previous sections, that’s really all you need in order to ingest media via WHIP, which is one of its strongest points. The screenshot below shows how one of my colleagues filled the UI fields with the data required to interact with my WebRTC thin layer: we were in the same LAN, so that’s why he used a private address there.

Once the data was filled in, he started to prepare a simple scene in OBS. As you can see in the screenshot below, he captured his screen and webcam at the same time, and organized them as a picture-in-picture layout. Nothing fancy (you can do much more exciting things with OBS!), but enough for our demo purposes. When ready, he started streaming, which had CoSMo’s OBS start the WHIP negotiation that would eventually lead to a PeerConnection being created with Janus.

To validate everything was working correctly, I simply joined the same VideoRoom session OBS was streaming to from my browser. In fact, while in principle that’s not what WHIP was conceived to do, we’re using the VideoRoom for ingestion here, which means we can see those streams as “participants” in a room anyway. As you can see from the screenshot below, this worked beautifully! Despite OBS not having presented any candidate, ICE succeeded without issues (thanks to the trick explained before), and so did DTLS, which allowed OBS to effectively set up a PeerConnection with Janus to start streaming. While the remote video is rendered as a small element in our demo, you can see from the screenshot that it was actually a 720p video stream at ~2500kbps, which was the bitrate configured in the OBS settings.

The following screenshot shows the received WebRTC video in a bit more detail, as it’s displayed full screen. It’s also helpful as it shows the Wireshark capture my colleague Alessandro was performing when starting the WHIP session: you can see the HTTP request OBS sent to my custom WHIP endpoint, and the 202 it received back, both exchanging SDPs with an application/sdp content type.

Speaking of the HTTP messages, let’s have a quick look at the ones OBS and my WHIP endpoint exchanged, starting from the POST. Notice that, for the sake of simplicity, I’m displaying the payload right after the header, but the payload was actually only sent after OBS received a 100 Continue from the server.

POST /api/whip HTTP/1.1
Host: 192.168.1.218:7080
User-Agent: restclient-cpp/OBS
Accept: application/sdp
Authorization: Bearer 
Content-Type: application/sdp
Content-Length: 2360
Expect: 100-continue

v=0
o=- 6154193995027460356 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS obs
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
a=ice-ufrag:7mhG
a=ice-pwd:L87j5omkuhYKCuHoMWTTBDU2
a=ice-options:trickle
a=fingerprint:sha-256 6C:B9:D7:08:D7:6C:CB:CF:F2:EF:AD:14:85:BA:A5:59:97:E0:C3:2B:67:7A:3B:E8:3A:92:38:3D:F4:36:70:41
a=setup:actpass
a=mid:audio
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=sendrecv
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1;stereo=1;sprop-stereo=1;maxplaybackrate=48000;sprop-maxcapturerate=48000;maxaveragebitrate=131072;x-google-min-bitrate=128;x-google-max-bitrate=128
a=ssrc:1377417080 cname:c8/Gok4Fl7lFwciG
a=ssrc:1377417080 msid:obs audio
a=ssrc:1377417080 mslabel:obs
a=ssrc:1377417080 label:audio
m=video 9 UDP/TLS/RTP/SAVPF 127 120
c=IN IP4 0.0.0.0
b=AS:2500
a=ice-ufrag:7mhG
a=ice-pwd:L87j5omkuhYKCuHoMWTTBDU2
a=ice-options:trickle
a=fingerprint:sha-256 6C:B9:D7:08:D7:6C:CB:CF:F2:EF:AD:14:85:BA:A5:59:97:E0:C3:2B:67:7A:3B:E8:3A:92:38:3D:F4:36:70:41
a=setup:actpass
a=mid:video
a=extmap:14 urn:ietf:params:rtp-hdrext:toffset
a=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=extmap:13 urn:3gpp:video-orientation
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=extmap:5 http://www.webrtc.org/experiments/rtp-hdrext/playout-delay
a=extmap:6 http://www.webrtc.org/experiments/rtp-hdrext/video-content-type
a=extmap:7 http://www.webrtc.org/experiments/rtp-hdrext/video-timing
a=sendrecv
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:127 VP8/90000
a=fmtp:127 x-google-min-bitrate=2500;x-google-max-bitrate=2500
a=rtcp-fb:127 goog-remb
a=rtcp-fb:127 transport-cc
a=rtcp-fb:127 ccm fir
a=rtcp-fb:127 nack
a=rtcp-fb:127 nack pli
a=rtpmap:120 rtx/90000
a=fmtp:120 apt=127
a=ssrc-group:FID 55344841 1248059976
a=ssrc:55344841 cname:c8/Gok4Fl7lFwciG
a=ssrc:55344841 msid:obs video
a=ssrc:55344841 mslabel:obs
a=ssrc:55344841 label:video
a=ssrc:1248059976 cname:c8/Gok4Fl7lFwciG
a=ssrc:1248059976 msid:obs video
a=ssrc:1248059976 mslabel:obs
a=ssrc:1248059976 label:video

As you can see, the WHIP POST is quite simple: the payload, as anticipated, is of type application/sdp, and the payload is indeed pretty much the same SDP we’d expect from a JSEP object in a browser; VP8 was offered as that’s how Alessandro configured the streaming target in OBS, but other codecs are supported too; you can also see how, just as anticipated, the offer doesn’t contain any candidate at all. One thing worth mentioning is that OBS was sending an Authorization: Bearer header, but with no token in there: this is probably a very simple thing to fix in CoSMo’s prototype, and it didn’t impact my test since I wasn’t doing any validation on tokens anyway.
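The observations above (which media and codecs are offered, whether any candidate is present, whether the Bearer header actually carries a token) are easy to check programmatically. The helpers below are a hypothetical sketch of such checks, written for illustration only: they do simple line matching and are in no way a complete SDP or HTTP header parser.

```python
from typing import Optional


def summarize_sdp(sdp: str) -> dict:
    """Collect media types, codec names and the number of ICE candidates in an SDP."""
    summary = {"media": [], "codecs": [], "candidates": 0}
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <payload types...>
            summary["media"].append(line[2:].split()[0])
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]
            summary["codecs"].append(line.split(" ", 1)[1].split("/")[0])
        elif line.startswith("a=candidate:"):
            summary["candidates"] += 1
    return summary


def bearer_token(authorization: str) -> Optional[str]:
    """Return the token of an 'Authorization: Bearer <token>' value, or None if absent."""
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return None
    return token.strip() or None
```

Run against the offer shown above, `summarize_sdp` would report audio and video sections, the opus, VP8 and rtx codecs, and zero candidates, while `bearer_token("Bearer ")` returns `None` for the empty token OBS was sending.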

The response from the WHIP endpoint, provided in a 202, was just as straightforward:

HTTP/1.1 202 Accepted
X-Powered-By: Express
Content-Type: application/sdp; charset=utf-8
Content-Length: 2095
ETag: W/"82f-hP/FKs7oeD0+8y5WGeXpysvSGAk"
Date: Wed, 23 Sep 2020 09:42:15 GMT
Connection: keep-alive

v=0
o=- 6154193995027460356 2 IN IP4 192.168.1.218
s=VideoRoom 1234
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS janus
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 192.168.1.218
a=recvonly
a=mid:audio
a=rtcp-mux
a=ice-ufrag:53/p
a=ice-pwd:g9ljo8RUJ7joUwWm0VCnKy
a=ice-options:trickle
a=fingerprint:sha-256 E0:12:FF:B8:54:82:74:58:7D:88:B9:7F:92:91:32:BC:98:69:45:27:35:34:A8:23:6B:B8:91:49:E8:94:04:91
a=setup:active
a=rtpmap:111 opus/48000/2
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=msid:janus janusa0
a=ssrc:1474604096 cname:janus
a=ssrc:1474604096 msid:janus janusa0
a=ssrc:1474604096 mslabel:janus
a=ssrc:1474604096 label:janusa0
a=candidate:1 1 udp 2015363583 192.168.1.218 49207 typ host
a=end-of-candidates
m=video 9 UDP/TLS/RTP/SAVPF 127 120
c=IN IP4 192.168.1.218
a=recvonly
a=mid:video
a=rtcp-mux
a=ice-ufrag:53/p
a=ice-pwd:g9ljo8RUJ7joUwWm0VCnKy
a=ice-options:trickle
a=fingerprint:sha-256 E0:12:FF:B8:54:82:74:58:7D:88:B9:7F:92:91:32:BC:98:69:45:27:35:34:A8:23:6B:B8:91:49:E8:94:04:91
a=setup:active
a=rtpmap:127 VP8/90000
a=rtcp-fb:127 ccm fir
a=rtcp-fb:127 nack
a=rtcp-fb:127 nack pli
a=rtcp-fb:127 goog-remb
a=rtcp-fb:127 transport-cc
a=extmap:13 urn:3gpp:video-orientation
a=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01
a=extmap:5 http://www.webrtc.org/experiments/rtp-hdrext/playout-delay
a=fmtp:127 x-google-min-bitrate=2500;x-google-max-bitrate=2500
a=rtpmap:120 rtx/90000
a=fmtp:120 apt=127
a=msid:janus janusv0
a=ssrc:1835684612 cname:janus
a=ssrc:1835684612 msid:janus janusv0
a=ssrc:1835684612 mslabel:janus
a=ssrc:1835684612 label:janusv0
a=ssrc:843037510 cname:janus
a=ssrc:843037510 msid:janus janusv0
a=ssrc:843037510 mslabel:janus
a=ssrc:843037510 label:janusv0
a=candidate:1 1 udp 2015363583 192.168.1.218 49207 typ host
a=end-of-candidates

Just as in the request, the content is of type application/sdp, and the response actually contains the SDP answer from the media server, in this case Janus. A quick glance at the SDP immediately confirms this does indeed come from Janus, as there are several well-known keywords in there (e.g., the VideoRoom 1234 in the session name, the s= line). Unlike the request, the response does contain candidates; this is required, because while we can live with the producer not advertising candidates, the media server has to, otherwise neither endpoint would know where to send connectivity checks to.
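Since the candidate in the answer is the only piece of addressing information the producer gets, it may help to see what it contains. The sketch below parses a host candidate line like the one Janus returned; the names are hypothetical, only the mandatory fields are handled, and the priority breakdown follows the formula recommended by the ICE specification (RFC 8445).

```python
from typing import NamedTuple, Tuple


class IceCandidate(NamedTuple):
    foundation: str
    component: int
    transport: str
    priority: int
    address: str
    port: int
    kind: str  # host, srflx, prflx or relay


def parse_candidate(line: str) -> IceCandidate:
    """Parse the mandatory fields of an 'a=candidate:' SDP attribute."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError("not a candidate attribute")
    fields = line[len(prefix):].split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError("malformed candidate attribute")
    return IceCandidate(
        foundation=fields[0],
        component=int(fields[1]),
        transport=fields[2].lower(),
        priority=int(fields[3]),
        address=fields[4],
        port=int(fields[5]),
        kind=fields[7],
    )


def split_priority(priority: int) -> Tuple[int, int, int]:
    """Undo priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component)."""
    type_preference = priority >> 24
    local_preference = (priority >> 8) & 0xFFFF
    component = 256 - (priority & 0xFF)
    return type_preference, local_preference, component
```

For the candidate in the answer above, this yields a UDP host candidate at 192.168.1.218:49207 for component 1, which is exactly where OBS has to send its connectivity checks.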

What’s next?

The WHIP specification is surprisingly simple, and just as effective. Writing the thin layer for the WHIP endpoint (which I wrote using nodejs) to act as a frontend to Janus was very straightforward (I doubt it took me more than 15-20 minutes), and as you can see from the screenshots above, it worked nicely for my tests. Of course, implementing something more robust and flexible might take more than that, but in principle WHIP is already in a state where it can be used and integrated in more complex setups as well.

While for the server side things may be very easy to handle (it was for Janus, at least), the matter may be a bit different for the client side instead. OBS-WebRTC was probably not a trivial task to accomplish for CoSMo, and there are already discussions and efforts on how to add WHIP support to other widely deployed and open source media applications, like ffmpeg and GStreamer, in order to make it even easier for media producers to involve WebRTC in their toolkits. It’s definitely exciting to think about how many opportunities might open here, in terms of media tooling and WebRTC support.

There may be some other things and areas to investigate, though. The first draft of WHIP was only published a few weeks ago, and there already are several suggestions on potential fixes and enhancements: you can see some (and participate in the discussion) by visiting the GitHub repo Sergio created for the draft. One enhancement I personally believe WHIP would benefit from, for instance, is an explicit URL to tear down a stream created via WHIP: at the moment, in fact, there is no such API (the only request/response exchange that exists is the one for the offer/answer dance), which means that a media server can only detect a user stopped (willingly or not) their stream by looking at the media traffic (e.g., ICE consent freshness and/or DTLS alerts), which may not always be reliable and could result in orphaned sessions. On the other hand, having a way for the producer to explicitly signal their intention to stop streaming may make the process easier for media servers.

That’s all, folks!

I hope you enjoyed this overview on this new exciting protocol! I know I’ll be keeping track of it while it evolves, to make sure Janus will be able to support it for anyone interested in using it.
