A couple of years ago I wrote a blog post explaining how I used some existing features in the AudioBridge plugin to allow SIP endpoints to dial-in and join a conversation that would normally be limited to WebRTC participants alone. That worked nicely and, at the time, it also was an interesting opportunity to play a bit with Drachtio as the framework to mediate between the Janus API and SIP. That said, it also showed some limitations: in particular, it only allowed SIP endpoints to dial-in, but there was no way to have the conference bridge “dial-out” instead.
As such, I decided to explore this possibility instead; not only for the “dial-out” part, but also to experiment a bit with cascaded mixing in the AudioBridge. How did that go? Buckle up and find out!
Where were we?
As explained in a few other blog posts, and as you probably know already if you’re familiar with Janus, the AudioBridge basically implements an audio-only MCU, meaning it allows you to create audio conference rooms that multiple participants can join in order to have a conversation. Audio from all participants is mixed, and each participant gets a mix of all the other participants except themselves (N-1 mixing). It supports Opus and G.711 as codecs, and implements features like PLC (packet loss concealment) to improve the audio quality where needed. We recently enhanced it with a libspeex-powered jitter buffer, and optional denoising via RNNoise. This real-time mix can then be distributed to a wider audience via RTP forwarders, if needed, as explained in my blog post on social audio applications (or as we do in our Virtual Event Platform at all IETF and RIPE meetings).
By default, the only way to join an AudioBridge room is via WebRTC, meaning you use the Janus API to establish an audio-only WebRTC PeerConnection with the plugin, in order to exchange audio frames in real-time. That said, the AudioBridge also supports a feature called “plain RTP participation”, which as the name suggests allows you to use plain RTP, outside of the context of a WebRTC PeerConnection or a signalling session, to join a room and exchange RTP packets with the plugin. You still use the Janus API to orchestrate the setup of such a plain RTP session, but you don’t need the overhead of WebRTC, and this opens the door to interesting interoperability opportunities with applications that may not support WebRTC at all. This is indeed exactly what I leveraged to implement the “dial-in” feature I mentioned in the introduction, and which I documented extensively in a dedicated blog post. The diagram below is a recap of that SIP-to-AudioBridge demo, via Drachtio.
Now, as anticipated, this worked nicely, but it had a hardcoded limitation. In fact, the way the AudioBridge API worked, you could only set up plain RTP participation in what we could call “user-offer mode”. This means that the plugin always expected the end user to provide their own RTP connectivity details first (user IP address and port), and only after that would the plugin provide its own. This is equivalent to, e.g., expecting your peer to provide an SDP offer before you can provide an SDP answer in a WebRTC session.
In SIP terms, that’s the same as expecting a SIP INVITE with an SDP offer, and then sending a SIP 200 OK with your own SDP answer back, which is exactly what I implemented via Drachtio in the demo above. I orchestrated the various APIs, so that:
- I’d use Drachtio to wait for a SIP INVITE from a SIP endpoint;
- when an INVITE arrived, I’d extract the connectivity information from the associated SDP offer, and I’d pass that to the AudioBridge plugin using the plain RTP participation API;
- the plugin would answer back with its own RTP details, that I’d use to craft an SDP answer to send back in a 200 OK back to the SIP endpoint;
- media would flow.
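To make that flow a bit more concrete, here’s a minimal and heavily simplified sketch of what the UAS side might look like with drachtio-srf and Janode. Note that this is just an illustration under a few assumptions: audiobridgeHandle is supposed to be an already attached Janode AudioBridge handle, and sdpToRtp()/rtpToSdp() are hypothetical helpers for the SDP parsing/crafting logic, not part of any of the libraries involved:

// Hypothetical sketch: react to incoming INVITEs and bridge the caller
// to an AudioBridge room via plain RTP participation
srf.invite(async (req, res) => {
	// Extract the RTP connectivity details (IP, port, payload type) from the SDP offer
	const remoteRtp = sdpToRtp(req.body);	// hypothetical helper
	// Join the AudioBridge room as a plain RTP participant, passing the caller's RTP info
	const joined = await audiobridgeHandle.join({
		room: 1234,
		display: 'SIP user',
		rtp_participant: remoteRtp
	});
	// Craft an SDP answer out of the RTP details the plugin sent back
	const localSdp = rtpToSdp(joined.rtp_participant || joined.rtp);	// hypothetical helper
	// Send a 200 OK with that SDP answer: media will then flow
	await srf.createUAS(req, res, { localSdp });
});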
What if I wanted to do the other way around, though? What if I somehow wanted to have a way for the AudioBridge plugin to “dial-out” and invite a SIP endpoint in, rather than wait for them to call? Due to the constraints imposed by the plain RTP participation API, that wasn’t possible, which was a bummer.
This also had another side effect, namely a partial inability to do cascaded mixing. In fact, while in this instance I used plain RTP participation to implement SIP bridging, there’s much more we could do with this feature. One thing that comes to mind is simply bridging two or more different AudioBridge instances to each other, in what is usually referred to as “cascaded mixing”. This is a feature that has often been used in MCU deployments, mostly because of its benefits in terms of resource usage, especially when you have to mix a lot of participants in the same session. If you have a room with 50 participants, decoding/encoding/mixing all those streams can take a toll on resources: but if you split that room in two, where each mixer is only in charge of 25 participants, and the two then exchange their respective mixes with each other, resource usage can be greatly reduced, as each instance only has to mix 26 streams (its own 25 participants, plus the mix coming from the other instance), which is way less than 50.
This is a simplification, of course, and there’s more to say about that in general, but it was motivation enough for me to start experimenting more with the feature, in order to figure out how to make that possible. For the sake of completeness, it’s worth pointing out that this is actually already possible in the AudioBridge as it is, if you use WebRTC as a way to bridge two rooms to each other: but as I mentioned, I wanted to figure out a way to do that using plain RTP participation instead, which would allow me to avoid the overhead of WebRTC in the first place, for what would arguably be a server-side orchestration of resources that could be done in a lighter way.
Enhancing plain RTP participation
Plain RTP participants were introduced in great detail in the previous blog post, so I won’t spend too many words on them here. In a nutshell, as explained in a previous paragraph, they’re mostly a way to orchestrate the exchange of out-of-context RTP connectivity information, for the purpose of exchanging RTP packets bidirectionally between two remote endpoints, with no need for SIP or other standard signalling protocols (even though, as we’ve seen, they can be leveraged to enable those protocols where they couldn’t be used before).
A simple example might be this join request, where we’re telling the AudioBridge plugin where they can reach us:
{
	request: "join",
	room: 1234,
	display: "Plain RTP participant",
	rtp: {
		ip: "192.168.1.10",
		port: 2468,
		audiolevel_ext: 1,
		payload_type: 111,
		fec: true
	}
}
whereas the plugin would answer back with its own connectivity information:
{
	"audiobridge" : "joined",
	"room" : 1234,
	"id" : 9375621113,
	"display" : "Plain RTP participant",
	"participants" : [
		// Array of existing participants in the room
	],
	"rtp": {
		"ip": "192.168.1.232",
		"port": 10000,
		"payload_type": 111
	}
}
That’s it in a nutshell: the end result of that exchange would be the user sending RTP packets (with payload type 111) from 192.168.1.10:2468 to 192.168.1.232:10000, and the plugin doing the same the other way around. This also explains why mapping this to SIP is easy, since the end result of a SIP call would basically be the same.
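This is also why turning an rtp object like the one above into an SDP (and vice versa) is mostly a mechanical exercise. Just as an illustration, a hypothetical rtpToSdp() helper like the one hinted at in the earlier sketch might look like this (the hardcoded Opus rtpmap and session fields are assumptions of the sketch, not something the AudioBridge mandates):

// Hypothetical helper: build a minimal audio-only SDP from an AudioBridge "rtp" object
function rtpToSdp(rtp) {
	const pt = rtp.payload_type || 111;
	return [
		'v=0',
		'o=ab2sip 123456 123456 IN IP4 ' + rtp.ip,
		's=ab2sip',
		'c=IN IP4 ' + rtp.ip,
		't=0 0',
		'm=audio ' + rtp.port + ' RTP/AVP ' + pt,
		'a=rtpmap:' + pt + ' opus/48000/2',
		'a=sendrecv'
	].join('\r\n') + '\r\n';
}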
Now, what we want is a way to have the plugin send us their connectivity info first, so that we have access to information we can use before providing our own side of things. This would allow us to implement a scenario like the following:
- we decide we want to invite a remote SIP endpoint to a room;
- we use the plain RTP participation API in the AudioBridge to ask for their RTP connectivity info;
- we use that information to craft an SDP offer, and send a SIP INVITE to the endpoint;
- the endpoint accepts the INVITE, and sends back a 200 OK with an SDP answer;
- we extract their RTP connectivity info from the SDP answer, and pass it back to the AudioBridge plugin;
- media flows!
In order to be able to do that, I had to revisit a bit the way the plain RTP participation API worked. This is something a Janus user initially tried to contribute a few months ago, but I felt their approach was a bit too convoluted. As such, I decided to have a look at the code myself, and came up with a simpler approach in a new pull request instead.
The approach I followed was to basically piggyback on the already existing “have the plugin generate an offer” feature we had for WebRTC usage. More precisely, normally the AudioBridge plugin works the same way as plain RTP participation does, meaning it expects joining participants to provide a WebRTC SDP offer, which it uses to craft an SDP answer to send back. If you provide a generate_offer: true property when joining, though, and don’t provide any SDP offer of your own, then the plugin is instructed to take the initiative and prepare an SDP offer of its own first: you then close the deal in a subsequent configure request, where you provide the SDP answer needed to finalize the PeerConnection establishment. The difference between the two approaches is summarized in the diagrams below.


As you can see, the second diagram is where we ask the plugin to send an offer first, to which we can then respond later on. Considering that the join request is what we use to join as plain RTP participants too, I decided to try and play with it a bit, in order to follow the same pattern.
All that’s needed to tell the plugin we’re interested in plain RTP participation is providing an rtp object, as shown in the snippets at the beginning of this section. As such, in order to enable a “plugin offers” mode for plain RTP participants, I modified the code so that, in case generate_offer was set to true, the ip and port properties in the rtp object would be ignored, and we’d only start creating the local socket and network resources, in order to send that information back. At the same time, I extended the configure request so that it would support passing an rtp object as well, for the sole purpose of providing the “answer” from the user back to the plugin and finalizing the RTP connection establishment. The approach is summarized in the diagram below.
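In terms of raw Janus API messages, the new exchange might look roughly like the following; this is just a sketch based on the behaviour described above, so take the field values as illustrative:

// 1. We join as a plain RTP participant, but ask the plugin to provide its info first
{
	request: "join",
	room: 1234,
	display: "Plain RTP participant",
	generate_offer: true,
	rtp: {}
}
// 2. The plugin replies with a "joined" event containing its own rtp object (ip/port)
// 3. Once we know where our peer can be reached, we finalize with a configure request
{
	request: "configure",
	rtp: {
		ip: "192.168.1.10",
		port: 2468,
		payload_type: 111
	}
}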
This is indeed what the above-mentioned pull request in Janus currently implements, so let’s give it a try!
Inviting a SIP endpoint to an AudioBridge room
To test this, I decided to try and use the same approach as I did last time, that is, create a small Node.js application based on Drachtio and Janode (our Janus API SDK) to orchestrate SIP dialogs and the Janus API, in order to bridge a SIP call to an AudioBridge room via plain RTP participation.
The first step to make that happen, besides removing some layers of dust from the demo I wrote at the time, was updating the AudioBridge code in Janode to support the rtp object in configure requests too, as that was indeed a new feature. My colleague Alessandro Toppi, the Janode author and maintainer, took care of that in a dedicated experimental branch, which will be merged as soon as the associated feature is merged in Janus too.
After that, I needed to have a look at the Drachtio SRF documentation to see how I could implement a User Agent Client (UAC), since in this case I’d be originating the call myself, rather than reacting to incoming INVITEs as a User Agent Server (UAS), as I did last time. Luckily, the documentation and the examples are very easy to follow, so I soon found out how to do that. The plan was to implement the following workflow:
- something triggers our intention to invite a SIP endpoint to an AudioBridge room;
- we attach a handle to the AudioBridge plugin via Janode, and join the room enabling plugin-generated offers and plain RTP participation, waiting for RTP connectivity details;
- the plugin sends us their RTP info, which we use to craft an SDP offer;
- we use the Drachtio SRF to create a UAC and send an INVITE to the SIP endpoint with the crafted SDP offer;
- the SIP endpoint replies back with an SDP answer;
- we parse the SDP answer to extract the provided RTP connectivity information, which we send back to the AudioBridge via a configure request;
- at this point, the SIP endpoint should be able to interact via audio with the room.
The whole workflow is summarized in the diagram below, to make it easier to visualize. For the sake of simplicity we’re assuming the application server (App) contacts Bob (the SIP endpoint) directly, while obviously most of the time we’d go through a proxy or a PBX.
For what concerns the code, most of it is very similar to what I wrote in the previous blog post: in fact, turning an AudioBridge rtp object into an SDP and vice versa works exactly the same way, and the only thing that was really different was basically the switch from a UAS to a UAC. In Drachtio, we can basically do it this way:
[..]
await srf.createUAC(to, {
	localSdp: mySdp,
	callingNumber: 'ab2sip',
	headers: {
		'Call-ID': callId
	}
})
.then(async (uac) => {
	[..]
This snippet should be self-explanatory, but we’re basically creating a new UAC that will call the provided SIP URI (the to variable), using ab2sip as our display name, and including the SDP offer we crafted from the info we got from the AudioBridge. As you can see, I’m also manually passing a Call-ID to use, mostly for a lazy reason: I needed a way to map the outgoing call to the associated AudioBridge participant handle, and knowing the Call-ID in advance (rather than figuring it out later) made it easier in the demo.
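Just to illustrate what I mean by that mapping (this is an assumption on how one might do it, not necessarily how the demo is structured), something as simple as a Map keyed by Call-ID would do:

// Hypothetical mapping between outgoing SIP Call-IDs and AudioBridge handles
const calls = new Map();
calls.set(callId, audiobridgeHandle);
// Later on, e.g. when a BYE for that Call-ID comes in, we can look up the
// associated handle and leave the room:
//   const handle = calls.get(callId);
//   await handle.leave();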
As to what to use for the trigger to send the INVITE, I decided to keep it simple as well, and basically used Express to tie it to a REST endpoint:
app.get('/', async function (hreq, hres) {
	// FIXME
	let room = 1234;
	let to = 'sip:bob@192.168.1.74';
	let callId = 'call-' + new Date().getTime();
	console.log('Starting call ' + callId);
	[..]
In a nutshell, sending a GET request to the root of the application server web backend triggers the code that gets the whole ball running. Awful if we need to create a real application out of this, but really simple and effective for a demo!
So, assuming I’m using the Janus SIP demo as my SIP endpoint, and I’m already in the AudioBridge room as a regular WebRTC participant in another tab, let’s see what happens when I invoke curl http://localhost:3000 in a shell…
So far, so good! The application server is indeed doing what it should: my GET did trigger the code to implement the workflow, and got Drachtio to send an INVITE to my SIP endpoint. At this point, when I accept, this should result in my SIP user being added to the room:
Eureka, that worked too! And while you’ll have to take my word for it (or re-create the same demo locally, it’s not that hard), I did have bidirectional audio between the SIP user and the AudioBridge participant, which was exactly what I wanted. Mission accomplished!
Hey, I was promised cascaded mixing!
Yeah, yeah, I was getting to that.
I explained how refactoring plain RTP participation to allow for a reversed role in their establishment would open the door to cascaded mixing as well, as it would allow, e.g., AudioBridge room1 somewhere to start exchanging audio packets with AudioBridge room2 somewhere else. Now that I got SIP dial-out working, I could start experimenting with cascading as well.
The idea for the workflow was basically the same as the SIP example, but without SIP, that is, only involving the Janus and AudioBridge APIs. As such, something like the following:
- rooms 1234 and 4321 need to be bridged/cascaded;
- we create an AudioBridge handle on the Janus instance hosting room1, and join as plain RTP participants but with generate_offer enabled (we want to be “invited”);
- using the rtp object we get back, we create a different AudioBridge handle on the Janus instance hosting room2, and there join as plain RTP participants providing the RTP info we obtained from room1;
- room2 sends us an rtp object back, which we pass to room1 via a configure on the associated handle;
- profit!
Again, just to make the process more straightforward to understand, here is how it works visually in a nutshell.
To test whether or not this would work, I updated the application server I had used for the Drachtio-based SIP integration, and used a different trigger to bridge two rooms using the Janode SDK. The main bulk of the code, skipping some debugging lines that are not really relevant, looks like this:
[..]
let details = {
	room: room1,
	display: 'Room ' + room2,
	generate_offer: true,
	rtp_participant: {}
};
let data = await handle1.join(details);
details = {
	room: room2,
	display: 'Room ' + room1,
	rtp_participant: data.rtp_participant ? data.rtp_participant : data.rtp
};
data = await handle2.join(details);
details = {
	rtp_participant: data.rtp_participant ? data.rtp_participant : data.rtp
};
data = await handle1.configure(details);
[..]
As you can see, nothing fancy: assuming we created two different handles (handle1 and handle2) to the different AudioBridge instances hosting the two rooms we want to cascade on their respective Janus servers, we’re basically doing what we did before already, that is asking the first instance to give us their connectivity info, passing it to the second instance, and then passing the connectivity info we get from the second instance back to the first.
Now, I didn’t want to test this just with “regular” getUserMedia audio, as it would be hard to figure out whether I was indeed getting the right audio from both instances with just me contributing with my mic. That’s always a problem when you test stuff on your own: there really is such a thing as the “Loneliness of the WebRTC developer“, you know! As such, I decided to leverage a little known feature of the AudioBridge plugin, that is its ability to play pre-recorded Opus files within existing rooms (e.g., for background music or announcements). The idea basically being that, if I played different audio files in the two different rooms, a working cascaded setup would allow me to hear both audio files no matter which room I joined. Genius, won’t you agree?
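For reference, that playback feature is exposed by the AudioBridge via a play_file request: a rough sketch of what such a request might look like is below (the file path is obviously just a placeholder, and additional properties such as a room secret or a file_id may apply depending on the configuration):

{
	request: "play_file",
	room: 1234,
	filename: "/path/to/announcement.opus",
	loop: true
}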
I soon found out that was a little problematic, though, as the Janode SDK didn’t have support for using that feature yet. As such, I brushed up my poor JavaScript skills and, with Alessandro’s blessings, I added support for it myself, in a dedicated pull request. That now allowed me to also programmatically start playing audio files in both AudioBridge rooms from the same application server I used to “cascade” them.
Long story short, I triggered the cascading, I triggered the audio files playback in both rooms, I joined one of the rooms using the Janus AudioBridge demo page and then… nothing! No audio at all. Why was that?!
A quick look at Wireshark confirmed my suspicions. Despite no errors using the plain RTP participation, there was actually no trace of any RTP packet being exchanged on the ports that the two AudioBridge instances had advertised to each other via my application server. After a little bit of digging, I remembered one additional constraint I had added to the plugin back when I had first added support for plain RTP participants, specifically the fact that even after “negotiating” plain RTP connectivity, the plugin would only start sending RTP packets from the mixer after receiving the first packet from the remote user. That was indeed the issue: considering that, in this case, both endpoints were actually AudioBridge instances, both of them were waiting for the first RTP packet from the other before starting to send one of their own, thus leading to an impasse.
Of course, in retrospect, that constraint made little sense, so I changed that as part of the RTP connectivity refactoring I had already performed in the new PR. Specifically, I changed it so that the channel would be marked as active as soon as the plugin had access to connectivity information from the peer: right away, in case RTP connectivity was provided in a join request, or later on, should that info arrive via a configure request instead. And lo and behold…
Yay! That worked as expected! Joining room 1234 on the first instance as a “regular” WebRTC participant using the AudioBridge demo, I saw a participant acting as the avatar for the other room (room 4321) and, most importantly, I could hear both audio files being played at the same time, which confirmed cascading was actually working as expected.
Great! What now?
Well, as usual this was mostly a proof-of-concept demo: I wanted to provide ways for an AudioBridge to “dial out”, e.g., to a SIP endpoint, and for different AudioBridge instances to cascade in order to share the mixing load, and these tests proved that it is now indeed possible, with the changes we implemented. Whether or not people will actually start using this functionality for exactly those reasons, or for different usages of plain RTP participation, is up to them! I’m definitely looking forward to someone playing with it.
There definitely are some considerations to make on cascaded mixing, though. Now that we’ve shown that it’s indeed possible to connect different AudioBridge rooms (possibly from completely different Janus instances) to each other, some may be tempted to connect them with little consideration for media topologies, which would be unwise. In fact, it’s important to point out that cascaded mixing works in a simple way: I mix all media I get here, you mix all media you get there, and we exchange the mixes, so that people here will be able to hear what people over there are saying and vice versa. But, exactly because of how simple this is, it’s of paramount importance to avoid potential mixing loops. Creating a full mesh of, e.g., four different AudioBridge instances, where each instance is cascaded to the other three, sounds like a great idea, but it really isn’t, because you’ll soon figure out that the media you mixed yourself will get back to you from one of the other instances, as they’ll receive it via the mix of one of the other instances they’re cascaded to. As such, it’s important to only cascade in a hierarchical way, e.g., using a tree, in order to avoid any chance that media loops like the one we described may occur.
There definitely are other considerations that could be made on cascaded mixing: ages ago we spent a long time on it, even proposing a sadly unsuccessful Distributed Conferencing (DCON) BoF at the IETF, based on our efforts at the time cascading Asterisk MeetMe instances empowered by the BFCP moderation protocol. That dates back to almost 15 years ago, but you can be assured that the same considerations will apply even now!
That’s all, folks!
I hope that, after all the QUIC-related material I shared in the last few months on this blog, you appreciated this return to WebRTC, and in particular on the nuts and bolts of SIP, audio and cascaded mixing. I definitely see all this as an interesting opportunity, and I hope you’ll share that sentiment: and that, why not, you’ll be encouraged to set up a little testbed of your own to give it a try!
As usual, should you need any help building something on this or other features we provide as part of our offerings, don’t hesitate to get in touch!