About a year ago, I first introduced WHIP and how it could be used with Janus in a blog post right here. At the time, we had quite some fun experimenting with this new signalling approach our friends at CoSMo (and our good old friend Dr. Alex in particular, may he rest in peace ❤️) had come up with, using a custom fork of OBS Studio called OBS-WebRTC.

In the meantime, WHIP has been adopted by the IETF in a new Working Group called WISH (WebRTC Ingest Signalling over HTTPS), which led us and other developers to start working on different prototypes. As such, it made sense to summarize where we are now, what’s available (especially with respect to Janus), and what’s next.

Wait a moment, is it WISH or WHIP?

The original proposal was contributed as an individual draft more than a year ago, and was called WHIP, as in “WebRTC-HTTP ingestion protocol”. As we’ve just seen, a new WG was then created in the IETF to work on this in a more structured way from a standardization perspective: unfortunately, it looks like the name “WHIP” had already been used in the past for a different effort, which led to the “WISH” name for the Working Group instead. That said, the original draft was adopted in the WG with no title changes, which means the name of the protocol is indeed still “WHIP”. That’s good news, as I really liked using that Indiana Jones picture in the original blog post! (but the picture I used for this new post isn’t that bad either 😉 )

What is WHIP for?

In a nutshell, WHIP tries to provide a standard way to perform WebRTC ingest, e.g., for broadcasting purposes. In traditional broadcasting, RTMP is very often used for the job, whereas distribution is then performed by a CDN using other technologies, e.g., HLS/DASH.

The main problem with the traditional broadcasting technologies is that, while they’re quite efficient and up to the task, they can’t do much when latency is particularly important. Even when pushed to their limits, you can never get a stream to be broadcast with a latency below a few seconds, which doesn’t really work when latency needs to be much lower than that (one second at the very most, possibly even less).

That’s where a technology like WebRTC can help. While originally conceived for conversational audio/video/data, and so bidirectional media, it has from the very beginning also been used quite extensively for monodirectional media distribution. Its conversational nature means it was designed to be as low-latency as possible, which makes it a very good fit as a broadcasting technology as well when latency is a concern.

This application of WebRTC was actually at the basis of my Ph.D. thesis, 5-6 years ago, and one of the main reasons why I started working on Janus in the first place, at the time. More specifically, the thesis introduced a WebRTC-based broadcasting architecture we designed called SOLEIL (Streaming Of Large scale Events over Internet cLouds), that could be used for large scale broadcasting using WebRTC instead of traditional technologies. While the content of that thesis may now be a bit dated, the main concepts we worked on there still apply, and we use them regularly in many of our own applications and consulting services.

Unfortunately, the industry seemed to fight the idea of using WebRTC for the job for quite some time, sometimes with good reason, but quite often not. Dr. Alex did quite a good job debunking most of the points that were made at the time in a blog post and several presentations, helped by the several enhancements WebRTC as a technology had received over time. This started to shift the perception of WebRTC in the broadcasting industry, and eventually led to the first attempts to implement this in a production environment, like Millicast.

That said, one of the main arguments against WebRTC for broadcasting often came down to its complexity, and a relative lack of tooling. While with RTMP you can typically just open any media production tool (e.g., OBS Studio), insert an RTMP URL, and you’re done, with WebRTC it’s not that easy, since:

  1. there is no standard signalling protocol for WebRTC, which means that even if you wanted, you wouldn’t know how to add WebRTC support to a generic tool;
  2. WebRTC as a protocol suite is quite complex to implement (something we tend to forget when we simply use the JavaScript APIs in a browser).

The first point is indeed the main objective of WHIP: providing a simple-to-implement and standard signalling protocol, based on HTTP, to negotiate a sendonly WebRTC PeerConnection, and thus allow for a much easier integration of WebRTC as a viable alternative for broadcasting ingest.

How does WHIP work?

In a nutshell, WHIP is a very straightforward protocol. To make everything as simple as possible, it constrains itself to a specific scenario (media ingestion) and uses HTTP requests to exchange all the information needed to establish a WebRTC PeerConnection for the purpose. More specifically:

  1. you use an HTTP POST request to send your SDP offer, and get an SDP answer from the server in the HTTP response;
  2. you can optionally trickle candidates via HTTP PATCH requests (which allow for ICE restarts too, when needed);
  3. you tear down the session via an HTTP DELETE request.
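
To make this more concrete, here’s a minimal sketch of what the client side of that exchange could look like in browser JavaScript (the names here are mine and purely illustrative; trickling via PATCH and all error handling are omitted for brevity):

async function whipPublish(endpointUrl, token, stream) {
    // Create a sendonly PeerConnection with the tracks we want to ingest
    const pc = new RTCPeerConnection();
    stream.getTracks().forEach(t => pc.addTransceiver(t, { direction: 'sendonly' }));
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    // 1. POST the SDP offer to the WHIP endpoint, and get the answer back
    const res = await fetch(endpointUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/sdp', 'Authorization': 'Bearer ' + token },
        body: offer.sdp
    });
    // The Location header points to the WHIP resource for this session
    const resourceUrl = new URL(res.headers.get('Location'), endpointUrl).href;
    await pc.setRemoteDescription({ type: 'answer', sdp: await res.text() });
    // 3. A DELETE on the WHIP resource tears the session down
    const stop = () => fetch(resourceUrl, {
        method: 'DELETE',
        headers: { 'Authorization': 'Bearer ' + token }
    });
    return { pc, stop };
}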

That’s it! Everything else is your usual WebRTC “dance”, which means the usual ICE/DTLS setup followed by SRTP packets sent to the media server (and SRTCP sent back and forth for feedback and control). The following diagram describes the process in a visual way:

You start by sending an HTTP POST with your offer (and possibly a Bearer token, which is what WHIP can use for authorization and authentication purposes) to an HTTP URI that identifies a “WHIP endpoint”, which will result in an SDP answer being sent back. The WHIP specification explains that a separate URL is returned as well in the Location header, to point to a so-called “WHIP resource”: this address identifies the new ingest session, and is what you need to refer to from that point forward, whether it is to send trickle candidates (as shown in the diagram above), trigger an ICE restart, or tear down the session (as displayed in the next couple of diagrams instead).

Notice how the first diagram presents “WHIP Endpoint”, “WHIP Resource” and “Media Server” as separate components: while logically they are, nothing prevents implementations from conflating them in the same application instead. What’s important is that “endpoint” and “resource” should be separate URLs (whether they live in the same web server or not), while the main purpose of the “media server” is obviously terminating the WebRTC PeerConnection itself.

This is an example (captured from an open source implementation of WHIP I’ll introduce later in this post) of what a POST to start a WHIP session could look like:

POST /whip/endpoint/test HTTP/1.1
Host: localhost:7080
Content-Type: application/sdp
Authorization: Bearer verysecret
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Content-Length: 1162

v=0
[..]

In this case, we’re trying to set up a new WHIP session by sending an offer to the /whip/endpoint/test WHIP endpoint, and providing a Bearer token (verysecret) as part of the authorization process; the SDP offer itself is omitted for the sake of brevity. The response from the WHIP endpoint will look like this:

HTTP/1.1 201 Created
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location
Location: /whip/resource/test
Content-Type: application/sdp
Date: Wed, 13 Oct 2021 13:40:07 GMT
Connection: keep-alive
Keep-Alive: timeout=5
Transfer-Encoding: chunked

569
v=0
[..]

which will include the SDP answer from the server, meaning the WHIP client now has access to both local and remote SDPs. Notice how the response includes a Location header pointing to /whip/resource/test, which indicates where the WHIP resource for this session is. This means we can now also trickle candidates to help set up the PeerConnection, e.g.:

PATCH /whip/resource/test HTTP/1.1
Host: localhost:7080
Content-Type: application/trickle-ice-sdpfrag
Authorization: Bearer verysecret
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Content-Length: 335

a=ice-ufrag:ikGWqbVurOLryICjxi0l/wwEBrHcR8xe
a=ice-pwd:zD2krlPRTVfm/sKWhRFSxDMjYokC3DK5
m=audio 9 RTP/AVP 0
a=mid:video0
a=candidate:1 1 UDP 2015363327 192.168.1.232 44008 typ host
a=candidate:2 1 TCP 1015021823 192.168.1.232 9 typ host tcptype active
a=candidate:3 1 TCP 1010827519 192.168.1.232 56489 typ host tcptype passive

Notice the format used to exchange trickle candidates dynamically, which is basically an SDP fragment (of type application/trickle-ice-sdpfrag) just including the current ICE credentials, a fake m-line, and one or more of the candidates we want to trickle. This format is documented in RFC 8840, and while it was originally conceived to allow SIP endpoints to take advantage of trickle ICE too, it’s currently what WHIP specifies should be used for the job as well.
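
In code, crafting such a fragment is trivial; a hypothetical JavaScript helper (again, the names are made up for illustration) could look like this:

function trickleFragment(ufrag, pwd, mid, candidate) {
    // Current ICE credentials, a fake m-line, and the candidate to trickle
    return 'a=ice-ufrag:' + ufrag + '\r\n' +
        'a=ice-pwd:' + pwd + '\r\n' +
        'm=audio 9 RTP/AVP 0\r\n' +
        'a=mid:' + mid + '\r\n' +
        'a=' + candidate + '\r\n';
}

// Sent to the WHIP resource via HTTP PATCH, e.g.:
// fetch(resourceUrl, { method: 'PATCH',
//     headers: { 'Content-Type': 'application/trickle-ice-sdpfrag' },
//     body: trickleFragment(ufrag, pwd, 'video0', 'candidate:1 1 UDP ...') });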

The WHIP resource is also what we contact when we want to tear down the session, e.g.:

DELETE /whip/resource/test HTTP/1.1
Host: localhost:7080
Authorization: Bearer verysecret
Accept-Encoding: gzip, deflate
Connection: Keep-Alive

Of course, an explicit teardown is not the only way by which a WHIP session can be terminated. There are many cases, for instance, where a client may not be able to inform the server via signalling, meaning an HTTP DELETE may never take place. In that case, the WHIP specification explains how servers should monitor the ICE and DTLS state as well, e.g., to destroy a session when a DTLS alert is received, or when ICE detects a failure.

An open source WHIP server (based on Janus)

Of course, the first thing I thought of when I heard about WHIP for the first time was “that’s cool! How do I make this work with Janus?”. That was indeed the main objective of the blog post I wrote a year ago, when I prototyped a WHIP server that implemented the syntax the draft specified at the time.

I ended up following pretty much the same approach this time around as well, by implementing a thin WHIP/REST API layer in front of a Janus instance. In fact, while the WebRTC part remains the same, Janus uses its own API (the Janus API) to exchange messages with clients/applications and negotiate PeerConnections. This means that, in order to be able to use WHIP to start sending media to Janus, you need to either implement WHIP itself in the control plane of Janus, or put something in front of it to take care of the translation. The latter seemed much easier to implement and made much more sense as well, which is exactly what I did, as an Express-based Node.js REST server implementing the WHIP API. This server, which in a burst of creativity I called “Simple WHIP Server”, is completely open source and can be found on GitHub.

Starting from the WHIP sequence diagrams we’ve seen before, the process of translating those exchanges to Janus API interactions is depicted in the next few diagrams. The first diagram shows how we can deal with the negotiation process:

The obvious plugin of choice for WebRTC ingestion is the VideoRoom, which implements an SFU: this means that any attempt to establish a new WebRTC PeerConnection via WHIP should involve the VideoRoom somehow. As such, whenever we receive a new POST request with a new SDP offer, this is translated to the creation of a new connection to the VideoRoom (a handle) on behalf of the WHIP client, which we then use to create a fake participant in a room there. Sending an SDP offer to Janus that way will eventually get us an SDP answer back, which we can then put in the response to the POST we received.
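
To give an idea of what that translation could look like in practice, here’s a rough JavaScript sketch of the Janus API calls involved (this is not the actual Simple WHIP Server code: randomString() and waitForEvent(), which would hide the Janus long-poll mechanism, are hypothetical helpers):

async function whipToJanus(janusUrl, room, offerSdp) {
    const post = (url, msg) => fetch(url, {
        method: 'POST',
        body: JSON.stringify({ transaction: randomString(), ...msg })
    }).then(res => res.json());
    // Create a Janus session, and attach a handle to the VideoRoom plugin
    const session = (await post(janusUrl, { janus: 'create' })).data.id;
    const handle = (await post(janusUrl + '/' + session,
        { janus: 'attach', plugin: 'janus.plugin.videoroom' })).data.id;
    // Join the room as a "fake" publisher, passing the WHIP client's offer
    await post(janusUrl + '/' + session + '/' + handle, {
        janus: 'message',
        body: { request: 'joinandconfigure', ptype: 'publisher', room: room },
        jsep: { type: 'offer', sdp: offerSdp }
    });
    // The SDP answer comes back asynchronously in a VideoRoom event
    const event = await waitForEvent(janusUrl, session, handle);
    return event.jsep.sdp;    // this becomes the body of the "201 Created"
}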

Dealing with trickle candidates and ICE restarts works in a similar way, where we use the handle we created previously to update the information we now have on this connection:

For trickle it’s simply a matter of translating the RFC 8840 format to whatever Janus expects for trickling candidates, while ICE restarts require some more work. In fact, Janus works on full SDPs, which means that an ICE restart is performed by detecting new ICE credentials in a new SDP offer. WHIP simply exchanges the new ICE credentials instead, which means it’s up to the WHIP server to “craft” a new SDP out of the one it received before, by inserting the new credentials and sending it to Janus: this will cause Janus to detect that an ICE restart is in progress, and result in an SDP answer with new ICE credentials being sent back. The WHIP server can then extract those new ICE credentials from the complete SDP, and just send those back in response to the HTTP PATCH request.
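
In JavaScript terms, “crafting” that new offer could be as simple as this hypothetical helper:

function craftRestartOffer(previousOffer, newUfrag, newPwd) {
    // Replace the old ICE credentials in the SDP we stored for this session:
    // new credentials in a new offer are what makes Janus trigger a restart
    return previousOffer
        .replace(/a=ice-ufrag:.*/g, 'a=ice-ufrag:' + newUfrag)
        .replace(/a=ice-pwd:.*/g, 'a=ice-pwd:' + newPwd);
}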

Finally, tearing down a session is relatively straightforward as well:

Handling an HTTP DELETE means we just need to tell Janus to get rid of the PeerConnection, and the easiest way to do that is simply detaching the handle we created, which will remove the fake participant from the VideoRoom, and destroy the PeerConnection as part of the process. We’ve explained before how media servers need to be ready to detect PeerConnections being closed without signalling telling them about it (e.g., WHIP clients crashing), which is why the second diagram, for instance, shows how a DTLS alert detected by Janus is notified to the application layer via an event, eventually leading the WHIP server to get rid of the WHIP resource automatically, just as if it had received an explicit DELETE.
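
As a sketch of that last part, the event handling in the WHIP server could look more or less like this (sessions and destroyResource() being hypothetical internal names, not the actual implementation):

function onJanusEvent(event) {
    // A "hangup" event means Janus closed the PeerConnection, e.g.,
    // because of a DTLS alert or an ICE failure
    if(event.janus === 'hangup') {
        const session = sessions.get(event.sender);    // sender = handle ID
        if(session) {
            // Same cleanup we'd perform on an explicit HTTP DELETE
            destroyResource(session.resourceId);
        }
    }
}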

All those considerations were brought into the open source WHIP server implementation I introduced, where the JavaScript code dealing with incoming HTTP requests (to take care of WHIP) acts as a trigger for interactions with Janus. To make testing easier, I created a very basic UI to create new endpoints, where I could easily create named endpoints and specify:

  1. the VideoRoom room we should be sending media to (basically where we’ll create the “fake” participant);
  2. optionally, the token to require for authorization.
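
Under the hood, creating such an endpoint might boil down to a simple REST call like the following (note that both the path and the payload here are purely illustrative, and not necessarily the actual Simple WHIP Server API):

fetch('http://localhost:7080/whip/create', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ id: 'test', room: 1234, token: 'verysecret' })
});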

Now that we had a WHIP server, we only needed a client to test it against!

An open source WHIP client (based on GStreamer)

When we worked on WHIP last year, we could take advantage of OBS-WebRTC for the purpose, which the CoSMo team had expanded to support the flavour of WHIP available at the time. This made for easy testing from a native tool. Unfortunately, this wasn’t something we could use this time around, as that old version of OBS-WebRTC used a deprecated version of WHIP (which is not 100% compliant with the new one), and more recent versions don’t have any support for WHIP yet.

As such, I decided to create a new WHIP client myself for testing. The “easy” way forward would have been just using the browser for the purpose, as it supports both HTTP and WebRTC, that is, the building blocks of WHIP itself: that said, I really wanted to experiment with a native implementation instead, with the main objective being an attempt to foster a better and easier integration with native applications in the media production space.

I eventually decided to write a command line application based on GStreamer’s webrtcbin. The main reason was that I had actually already used that module in a WebRTC application before (JamRTC, my attempt to write a jam session application based on WebRTC and Janus), and so I already had some familiarity with it. Besides, GStreamer has a very powerful and modular architecture, which makes it very easy to use different codecs, or capture media in a ton of heterogeneous ways. As an HTTP stack for the WHIP exchanges, I simply relied on libsoup, which, just like GStreamer, is based on GLib and so felt like the obvious choice.

Just as with the WHIP server, I called this project “Simple WHIP Client” and released the code as open source on GitHub. The end result was a command line application, with the following potential arguments:

Usage:
  whip-client [OPTION?] -- Simple WHIP client

Help Options:
  -h, --help            Show help options

Application Options:
  -u, --url             Address of the WHIP endpoint (required)
  -t, --token           Authentication Bearer token to use (optional)
  -A, --audio           GStreamer pipeline to use for audio (optional, required if audio-only)
  -V, --video           GStreamer pipeline to use for video (optional, required if video-only)
  -S, --stun-server     STUN server to use, if any (hostname:port)
  -T, --turn-server     TURN server to use, if any (username:password@host:port)
  -l, --log-level       Logging level (0=disable logging, 7=maximum log level; default: 4)

As such, it’s easy to just tell it where to connect (WHIP endpoint) and what to stream (audio/video pipelines), and it will just do its job. The customizable audio and video pipelines are particularly interesting, as they’re where the flexibility of GStreamer as a framework really shines: in fact, they allow you to have complete control over what to capture (e.g., a local device like a microphone or webcam, a file, a network resource, or whatever GStreamer supports), what codecs to use (e.g., Opus/VP8 or other WebRTC compliant codecs) with the related encoding properties, up to the packetization process. After that, the WebRTC stack in the WHIP client takes care of the rest automatically.

The following is a simple example of how you can capture some test audio/video patterns, and encode them as Opus/VP8 streams:

./whip-client -u http://localhost:7080/whip/endpoint/ciao \
	-t verysecret \
	-A "audiotestsrc is-live=true wave=red-noise ! audioconvert ! audioresample ! queue ! opusenc ! rtpopuspay pt=100 ssrc=1 ! queue ! application/x-rtp,media=audio,encoding-name=OPUS,payload=100" \
	-V "videotestsrc is-live=true pattern=ball ! videoconvert ! queue ! vp8enc deadline=1 ! rtpvp8pay pt=96 ssrc=2 ! queue ! application/x-rtp,media=video,encoding-name=VP8,payload=96"

In this example, we’re trying to contact the WHIP endpoint http://localhost:7080/whip/endpoint/ciao, using the token verysecret for authorization, and we’re capturing those test patterns to encode. The end result will look like the screenshot below, where the WHIP client will set up the complete GStreamer pipeline, and use the WHIP protocol to exchange SDP offer and answer and trickle its candidates: eventually, ICE and DTLS will be established, and media will start streaming.

An easy way to test this with our WHIP server would be to join the same VideoRoom room as passive participants. In fact, since the WHIP server is basically configured to create a fake participant in a specific room and have media sent there, joining the same room and subscribing to the publisher is the simplest way to ensure media is being ingested properly, as depicted in the screenshot below.

Of course, a more interesting use case for this would be to start rebroadcasting the ingested media in a WebRTC distribution network (e.g., using the SOLEIL architecture briefly introduced before), so that the stream originated by the WHIP client can be distributed to a wider audience than a single Janus instance can accommodate. That said, WHIP stops at the ingest, and as such this test is more than enough to guarantee it’s actually doing its job: what you do with the media after that is entirely up to you.

It’s also quite interesting to experiment more with the capture process in the first place: the picture below, for instance, shows how we can use NDI (which we’ve talked about more than once on this blog) to produce a stream using an external tool (e.g., OBS), possibly on a separate machine, and then use the WHIP client (again, taking advantage of GStreamer’s modularity) to consume that remote NDI feed as the stream to be sent via WebRTC to a WHIP server instead.

That’s actually a demo I plan to try and demonstrate at the upcoming edition of ClueCon, during the Dangerous Demo session, so if this is something you’re interested in checking out, I’ll see you there!

Other implementations and interoperability tests

Of course I wasn’t the only one experimenting with WHIP as it came out. There were many other developers working on alternative WHIP servers and/or clients, which helped perform some initial interoperability tests already. More precisely:

  • Juliusz Chroboczek added WHIP support to his open source SFU Galene, thus providing a WHIP server implementation;
  • Sergio Garcia Murillo, as main author of the WHIP specification, provided a web-based WHIP client, while at the same time implementing a WHIP server layer in Millicast;
  • finally, Gustavo Garcia implemented a simple Go-based WHIP client, currently hardcoded to capture the screen of the machine it is executed on.

This provided an interesting opportunity to perform multiple interoperability tests, as we had access to three separate WHIP client applications (mine, Sergio’s and Gustavo’s), and three separate WHIP server implementations (mine, Galene and Millicast). As you can see in the gallery below, all interoperability tests were successful, which was surprisingly nice to find out.

More precisely, the first row shows my WHIP client, the second row shows Sergio’s, and the last row shows Gustavo’s client in action; the first column presents my Janus-based WHIP server, the second is Galene, and the last is Millicast, thus displaying the interoperability results in a visual matrix.

The fact that everything worked was proof that, as simple as it is, even in these early stages WHIP provides an effective way of allowing different WebRTC applications to interact with each other for the purpose of media ingestion. That said, we did identify some things that might need to be addressed in new versions of the specification, such as:

  • the need to clarify that web-based WHIP clients will be subject to CORS (something native clients don’t really need to worry about), which became apparent while testing Sergio’s client;
  • RFC 8840’s format for candidates, which we glanced at before, may be a bit too convoluted, and some simplifications might help;
  • it’s not always clear what to return in case of WHIP request errors;
  • it’s also not exactly clear what to return in response to candidates being trickled via HTTP PATCH (i.e., 200 vs 204);
  • there may be race conditions between PATCH requests when doing an ICE restart, due to the fact that HTTP requests may arrive out of order, and so a previous, pre-restart ICE credentials combination may be incorrectly detected as a new restart attempt.

At the time of writing, Sergio has already started working on a new version of the draft that includes the feedback we collected, so most of the issues we identified should be fixed soon in the specification.

What’s next?

At this point, with so many implementations already available, the next immediate step is definitely finding an opportunity to test even more. Luckily, this opportunity might present itself quite soon, as in just a couple of weeks a new edition of the IETF Hackathon will take place, right before the upcoming IETF 112 meeting. Since that’s an occasion we’ve often taken advantage of for other WebRTC-related testing, hopefully we’ll have the chance to do some more interoperability tests with other implementations that have not been publicly disclosed yet. Considering how helpful the first round was in terms of shaping the specification (in the true spirit of the IETF’s “rough consensus, running code” motto), this might be an excellent chance to iron out some other sharp edges in the WHIP protocol we haven’t addressed yet. Since it can be attended free of charge, if you’re planning to write a WHIP implementation (or have one ready) and are interested in participating in the tests, please don’t hesitate to join us!

I’m personally also interested in experimenting more with the WHIP client I worked on. I already mentioned the NDI-based test I plan to perform shortly, and the idea is to find out how easy it might be to integrate the WHIP client in the workflow media producers typically follow, without touching the tools they like to use every day. Of course, as a counterpart I also plan to work more on automating the internal distribution of an ingested WHIP stream, so that it can be seamlessly rebroadcasted via SOLEIL or our own Virtual Event Platform.

Needless to say, though, the main objective of WHIP was and remains making it easier to work with WebRTC in the broadcasting industry in the first place, and as a consequence facilitating the integration of WebRTC as a technology in commonly used media production tools like OBS and others. Hopefully the progress we’re making with the specification and the related prototypes will convince tool implementors to add WebRTC as a publishing option via WHIP as well, so that RTMP is used less and less for the purpose.

That’s all, folks!

I hope you enjoyed this read! As usual, I planned a much shorter blog post, and ended up with a few Divine Comedy canti instead… Hopefully this was informative and helpful nevertheless, and will encourage you to experiment more with this exciting new technology yourself. In case you do, and the tools we made available help you with the task, please don’t hesitate to share it with the world!

I'm getting older but, unlike whisky, I'm not getting any better