Real-time text, SIP and WebRTC

December 18, 2019 Lorenzo Miniero

Real-time text (RTT) has been around for quite some time. Many people actually confuse RTT with Instant Messaging (IM), while the two are quite different, not only in terms of user experience but also from a technological perspective. In fact, where IM envisages people exchanging more or less complete messages among each other only after a trigger of some sort (e.g., when the enter key or a send button is pressed), RTT has text exchanged while it’s being typed. This means that any party in the conversation can, at any time, see the text while the person is typing it, without waiting for the whole message to be delivered.

There are several important use cases and scenarios for such a technology, that go way beyond the purpose of simple entertainment and communication. RTT, in fact, is very useful as a means of communication for some people with disabilities (e.g., for live captioning), and will be at the foundation of the upcoming Next-Generation emergency services (e.g., NG-112 and NG-911), due to its real-time nature.

As we’ll see in a minute, there is a well known and deployed specification to use RTT in SIP applications. This made me want to investigate how hard it would be to integrate such a functionality in WebRTC as well, obviously with the help of Janus and the SIP plugin, which led to (spoiler alert!) this branch.

T.140

Several implementations exist for real-time text as a technology. From a standards perspective, a relevant specification was provided in ITU’s T.140 (Protocol for multimedia application text conversation).

Without delving too much into the details, the protocol is based on the concept of so called T140blocks, which contain a set of characters or special codes that one party is delivering to one or more others. In a nutshell, any time a participant types some text, one or more of the typed characters can be bundled together in a T.140 block, and sent. Characters are supposed to be UTF-8 encoded, and while most of them correspond to actual text being typed in, some are actually meant to be special codes for special actions: this includes, e.g., a Byte Order Mark (BOM), a code for backspaces, one to act as a line separator and more.
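To make the role of those special codes a bit more concrete, here is a minimal Python sketch of how a receiver might apply incoming T140blocks to the text it displays. The function name and the "list of lines" model are assumptions made for illustration only; the code points used are the T.140 ones for erasing the last character (U+0008), starting a new line (U+2028) and the BOM (U+FEFF). For simplicity, a backspace on an empty line does nothing here.

```python
from __future__ import annotations

BACKSPACE = "\u0008"       # T.140 "erase last character"
LINE_SEPARATOR = "\u2028"  # T.140 "new line"
BOM = "\ufeff"             # Byte Order Mark, carries no visible text


def apply_t140block(lines: list[str], block: bytes) -> list[str]:
    """Apply one UTF-8 encoded T140block to a list of text lines.

    The last entry of `lines` is the line currently being typed.
    Hypothetical helper: a simplified model, not a full T.140 renderer.
    """
    if not lines:
        lines.append("")
    for ch in block.decode("utf-8"):
        if ch == BOM:
            continue                    # nothing to render
        elif ch == BACKSPACE:
            lines[-1] = lines[-1][:-1]  # erase the last character, if any
        elif ch == LINE_SEPARATOR:
            lines.append("")            # the current line is complete
        else:
            lines[-1] += ch             # regular text
    return lines
```

Feeding it the blocks in the order they arrive is enough to reconstruct both the completed lines and the one still being typed.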

The backspace one is quite important, as it puts a boundary on how the conversation can actually be encoded: specifically, once you’ve typed something, the only way to correct that is using the backspace key. This means there’s no way, for instance, to set the terminal cursor on a specific point of the already introduced text to fix a single character: or, to be more precise, this can be done from a user experience perspective, but this would then need to be translated in the protocol as a series of backspace codes to erase all characters up to the one to replace, and a reintroduction of those same characters after the correction. While this might seem an unnecessary complication, it actually helps keeping the protocol simple and tight.
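As a rough illustration of that translation, the following Python sketch computes what would have to be sent when a user fixes a character in the middle of an already typed line. The function name is hypothetical, and it assumes the edit happens within a single line that has no line separators in it.

```python
BACKSPACE = "\u0008"  # T.140 "erase last character"


def correction_as_t140(old: str, new: str) -> str:
    """Return the characters to send so that a remote view showing `old`
    ends up showing `new`, using only backspaces and regular text."""
    # Length of the prefix the two versions have in common
    common = 0
    for a, b in zip(old, new):
        if a != b:
            break
        common += 1
    # Erase everything after the common prefix, then type it again
    return BACKSPACE * (len(old) - common) + new[common:]
```

For instance, fixing "helo world" into "hello world" results in seven backspaces followed by "lo world": everything after the first difference is erased and typed again.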

Buffering of characters to put in a T140block may be involved in order to avoid the excessive overhead that might come out of sending one character at a time: how much to buffer is of course a matter of trade-off, as a small buffer may result in increased overhead but quicker delivery of the text, while buffering too much will indeed reduce the overhead but also introduce an annoying latency in the conversation.

SIP and RTT

In order to be able to use such a protocol in SIP-based applications, an RTP payload was then designed to allow text to be carried over RTP (RFC4103), together with information on how to negotiate it within the context of SDP. The choice of RTP instead of, let’s say, a regular IM protocol based on TCP was due to the strict real-time nature of the protocol itself: in fact, while RTP is usually only considered as an option for live audio and video streams, it’s a quite flexible specification instead, that was initially designed for live streams of different kinds. As such, it proved quite suited for the live and conversational nature of T.140 as well.

A T.140 stream is negotiated using “text” as the media to negotiate in an SDP m-line. The RTP payload format to negotiate T.140 itself, instead, is “text/t140”, with a clock frequency of 1000 Hz. The following is an example of what such an SDP section might look like:

m=text 11000 RTP/AVP 98
a=rtpmap:98 t140/1000

As the snippet shows, it’s pretty much a “regular” SDP. Apart from that, everything works as “regular” RTP as well: each RTP packet must have a sequentially increasing sequence number, and timestamps must reflect the moment the first block of text in the packet was introduced. The payload is then composed of T.140 encoded data, so basically a T140block.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC=0  |M|   T140 PT   |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      timestamp (1000Hz)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      T.140 encoded data                       |
+                                               +---------------+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
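The layout above can be sketched in a few lines of Python. This is only an illustration, not the code used in the Janus SIP plugin: the function names are made up, and the parser assumes exactly what the diagram shows (no CSRC entries, no header extension, no padding).

```python
import struct


def build_t140_rtp(payload_type, seq, timestamp_ms, ssrc, t140block, marker=False):
    """Prefix a T140block with a 12-byte RTP header (1000 Hz clock, so the
    timestamp is expressed in milliseconds)."""
    first = 2 << 6  # V=2, P=0, X=0, CC=0
    second = (0x80 if marker else 0) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", first, second, seq & 0xFFFF,
                         timestamp_ms & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + t140block


def parse_t140_rtp(packet):
    """Split a plain T.140 RTP packet into header fields and T140block.
    Assumes CC=0 and no extension or padding, as in the diagram."""
    if len(packet) < 12:
        raise ValueError("too short for an RTP header")
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    return {
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "t140block": packet[12:],
    }
```

The payload type (98 in the SDP example) is whatever was negotiated for “t140” in the m-line.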

One problem when it comes to using RTP as a protocol, though, is packet loss. In fact, RTP is most of the time used on top of UDP in order to guarantee a speedy delivery of packets, which means packet loss and out-of-order delivery of packets may occur in a conversation. While this is only relatively a problem for audio and video (which have their own solutions), it’s more problematic with real-time text, especially since each T140block will contain portions of the text to render: missing blocks will mean missing text, which is a problem of its own, made worse by the fact that backspaces applied to the wrong portion of text may actually result in completely broken conversations.

While RTP itself provides some means to detect packet loss (e.g., sequence numbers), one solution that was found to mitigate the problem was the introduction of redundancy, with the help of RED (Redundant Audio Data). While RED (RFC2198/RFC4102) was initially conceived for adding redundancy to audio packets, it’s actually flexible enough to be used in other contexts, which made it a perfect choice for real-time text as well. Without delving into the details of how it works specifically, suffice it to say that it allows each RTT RTP packet to be enriched with info on previous T140blocks, so that missing text due to packet loss can still be recovered with the help of redundancy, e.g.:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC=0  |M|  "RED" PT   |   sequence number of primary  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               timestamp of primary encoding "P"               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|   T140 PT   |  timestamp offset of "R"  | "R" block length  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|   T140 PT   | "R" T.140 encoded redundant data              |
+-+-+-+-+-+-+-+-+                               +---------------+
+                                               |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+     +-+-+-+-+-+
|                "P" T.140 encoded primary data       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
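For those who prefer code to bit diagrams, here is a hedged Python sketch of how the RED part of such a packet (everything after the 12-byte RTP header) could be built and parsed. Function names are hypothetical. Each redundant block gets a 4-byte header (a bit set to 1, the payload type, a 14-bit timestamp offset and a 10-bit length), the primary block gets a 1-byte header, and the actual data follows in the same order, oldest first.

```python
import struct


def build_red_payload(t140_pt, redundant, primary):
    """`redundant` is a list of (timestamp_offset, T140block) tuples,
    oldest first; `primary` is the new T140block."""
    headers = b""
    body = b""
    for offset, data in redundant:
        if not 0 <= offset < (1 << 14) or len(data) >= (1 << 10):
            raise ValueError("offset or block length does not fit in the header")
        word = (1 << 31) | ((t140_pt & 0x7F) << 24) | (offset << 10) | len(data)
        headers += struct.pack("!I", word)
        body += data
    headers += bytes([t140_pt & 0x7F])  # final header: F=0, primary PT
    return headers + body + primary


def parse_red_payload(payload):
    """Return a list of (payload_type, timestamp_offset, data) tuples,
    redundant generations first (oldest first), primary last."""
    meta = []
    pos = 0
    while payload[pos] & 0x80:  # F=1: another redundant header follows
        (word,) = struct.unpack_from("!I", payload, pos)
        meta.append(((word >> 24) & 0x7F, (word >> 10) & 0x3FFF, word & 0x3FF))
        pos += 4
    primary_pt = payload[pos] & 0x7F
    pos += 1
    blocks = []
    for pt, offset, length in meta:
        blocks.append((pt, offset, payload[pos:pos + length]))
        pos += length
    blocks.append((primary_pt, 0, payload[pos:]))  # primary takes the rest
    return blocks
```

Note that a redundant generation may well be empty (length 0), e.g., at the very beginning of a session when there's no previous text to repeat yet.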

Enter WebRTC

I’ve been fascinated for a while by RTT and, in particular, its integration in SIP deployments (some form of live captioning is already available in the VideoRoom and Streaming plugins, using data channels along audio and video streams), and so I wanted to investigate how hard it might be to include WebRTC in the picture, using Janus. As you know, Janus does have a SIP plugin that acts as a gateway between the WebRTC and SIP world, and so RTT would simply be an additional block to introduce in that already existing implementation. As such, I started working exactly on that in this branch.

The motivations for such an effort were multiple. First of all, there was obviously the curiosity about an objectively interesting technology. But a way more important aspect related to RTT is that, while the protocol itself is not incredibly complicated, the implementation of proper user interfaces on top of that can be. In fact, there are different ways the text can be captured (we’ve discussed one possible complication before, when introducing the backspace control code), and even more ways to render the real-time text coming from remote participants; besides, things can be way more complex if the real-time session goes beyond a one-to-one conversation. This has been a considerable drawback on a more widespread adoption of RTT in SIP deployments in the past: while some implementations exist, most are commercial only, and there are but a handful of open source ones around. Those you can find are either quite obsolete, or tightly coupled to specific platforms.

Most of these drawbacks disappear when working with web applications, though: in fact, it’s much easier to write an HTML/JavaScript frontend application that takes into account the RTT UX requirements. This makes it clear that the ability to establish a WebRTC communication to join a SIP/RTT session would be incredibly useful to make such an important technology more widespread and adopted.

It’s worth mentioning that the IETF has indeed started working on a way to get RTT working with WebRTC, in a currently active MMUSIC draft. The draft specifies how T.140 blocks can be exchanged over data channels, and how the SDP negotiation should change for that to happen accordingly. I should clarify that, while I did take the draft into account, as will be explained later the current implementation doesn’t implement all that’s described there. The plan is to better align to the specification in the future, though.

SIP + RTT + Janus = ♥

When I started studying the technology, and how it could be mapped with what WebRTC provides today, I realized that, as a starting point, I had to take care of the following steps:

Negotiate “text” m-lines and the “text/t140” format on the SDP side;
Negotiate datachannels on the WebRTC side;
Translate one m-line format to the other, when bridging SDPs between the WebRTC and SIP peers;
Decapsulate T140blocks from RTP packets sent by the SIP participant, and relay them (with or without translation to a different format) via data channels towards the WebRTC peer;
Craft RTP packets to send to the SIP participant for every data sent via data channels by the WebRTC peer (possibly with translation to T140blocks);
Test a live conversation via Janus between a browser and an RTT-compliant SIP endpoint.

This is exactly what I ended up implementing in this pull request, which you can refer to if you’re interested in giving it a try or improving its current state.

In order to keep things simple, I decided to avoid any translation on the data itself, and keep T.140 blocks on the data channels side as well. There were many reasons for that: first of all, it’s what the above-mentioned MMUSIC draft specifies; besides, T.140 blocks are quite simple to handle in JavaScript as well, especially when exchanged as an ArrayBuffer rather than a plain string. This made the process more straightforward, as it allowed us to simply relay the exchanged data as it was, and only worry about RTP packetization/depacketization with respect to media.

As anticipated in a previous section, though, plain T.140 is rarely used in current deployments, due to the considerations we made on packet loss, and RED is favoured as a way to add redundancy. The draft clarifies how that is actually not needed for data channels, as using an ordered and reliable profile in SCTP ensures this packet loss cannot occur, as long as the SCTP association holds: this means that the Janus SIP plugin is supposed to still only relay T.140 blocks on the WebRTC side, whether RED is negotiated on the SIP side or not. This is exactly what we ended up doing, even though in an incomplete way for now: specifically, for outgoing data (messages sent via data channels by the WebRTC endpoint) we keep track of the last two T.140 blocks we sent, so that we can craft the redundant packet containing the new data to send and the redundant generations; for incoming data, instead (RED RTP packets sent by the SIP endpoint), we parse the RED header and go through the different payloads, but only relay the latest payload via data channels. This means that we’re not doing everything we should, at the moment: at the very least, we should keep track of the last received packets as well, and implement some form of buffering to accommodate occasional packet loss. That said, the protocol implementation is there, so in the future it will just be a matter of building on top of that.
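Just to give an idea of what using the redundant generations on the receiving side might look like, here is a small Python sketch. It is not what the plugin does today (as explained, only the latest payload is relayed), and it relies on an assumption worth spelling out: that each redundant generation carries the primary data of the packet sent right before, so that generation N corresponds to the sequence number of the packet minus N. It also doesn't reorder packets, which would need the buffering mentioned above. The function name is hypothetical.

```python
def recover_blocks(last_seq, seq, generations):
    """Decide which T140blocks of a RED packet should be delivered.

    last_seq:    sequence number of the last packet delivered (None at start)
    seq:         sequence number of the packet just received
    generations: T140blocks in the packet, oldest first, primary last
    Returns (blocks_to_deliver, new_last_seq).
    """
    if last_seq is None:
        return [generations[-1]], seq
    # Number of packets missing between the last delivered one and this one,
    # computed modulo 2^16 since RTP sequence numbers wrap around
    missing = (seq - last_seq - 1) & 0xFFFF
    if missing >= 0x8000:
        return [], last_seq  # duplicate or late packet: ignore it
    redundant = generations[:-1]
    usable = min(missing, len(redundant))
    recovered = redundant[len(redundant) - usable:]
    return recovered + [generations[-1]], seq
```

If more packets were lost than there are redundant generations available, some text is gone for good, and the application has to deal with that in some other way.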

SDP management was a bit more complex. In fact, while as anticipated SIP endpoints use “m=text” m-lines to negotiate RTT sessions, WebRTC browsers don’t support that media type. They do support data channels, though, via “m=application” m-lines, which meant a translation had to happen for any SDP offer/answer exchange in the SIP plugin. This is the main point where the current implementation differs from the MMUSIC draft, though. In fact, the draft currently mandates that additional attributes should be exchanged to use RTT over data channels, namely “dcmap” and “dcsa”, like in the following SDP snippet:

m=application 911 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP6 2001:db8::3
a=max-message-size:1000
a=sctp-port:5000
a=dcmap:2 label="ACME customer service";subprotocol="t140"
a=dcsa:2 fmtp:- cps=20
a=dcsa:2 hlang-send:es eo
a=dcsa:2 hlang-recv:es eo

Unfortunately, browsers currently don’t provide support for any of those attributes, and as such negotiating them anyway may result, at the moment, in undefined behaviour, or even broken sessions. The “dcmap” attribute would be particularly helpful as it allows the advertisement of different data channel labels to associate to different real-time text participants: since it cannot be taken advantage of, though, even though we do support using different labels for different data exchanges over the same PeerConnection in Janus, we decided to keep things simple for now, and negotiate data channels the “regular” way. While this currently limits the Janus integration to 1-1 sessions, we believe it’s still a good starting point. As a result, any attempt to negotiate a “text” session with the SIP plugin, e.g.:

v=0
o=Lorenzo_Miniero 1 1 IN IP4 192.168.1.108
s=Omnitor_SDP_v1.1
c=IN IP4 192.168.1.108
t=0 0
m=text 1024 RTP/AVP 99 98
a=rtpmap:99 red/1000
a=fmtp:99 98/98/98
a=rtpmap:98 t140/1000

is translated to such a WebRTC negotiation, and vice versa:

v=0
o=Lorenzo_Miniero 1 1 IN IP4 192.168.1.108
s=Omnitor_SDP_v1.1
t=0 0
a=group:BUNDLE data
a=msid-semantic: WMS janus
m=application 9 UDP/DTLS/SCTP webrtc-datachannel
c=IN IP4 192.168.1.108
a=sendrecv
a=sctp-port:5000
a=mid:data
[.. ICE/DTLS details follow..]
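The first step of such a translation is figuring out what the SIP side is actually negotiating. As a minimal sketch (in Python, and not the code the SIP plugin uses, which relies on its own SDP utilities), the following hypothetical helper looks for the first “m=text” section and extracts the T.140 and RED payload types, along with the number of redundant generations advertised in the RED “fmtp” line.

```python
def describe_text_media(sdp):
    """Return info on the first m=text section of an SDP, or None if there
    is no such section or it doesn't advertise t140."""
    in_text = False
    codecs = {}  # codec name -> payload type
    fmtp = {}    # payload type -> format parameters
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            if in_text:
                break  # the text section is over
            in_text = line.startswith("m=text ")
        elif in_text and line.startswith("a=rtpmap:"):
            pt, name = line[len("a=rtpmap:"):].split(" ", 1)
            codecs[name.split("/")[0].lower()] = int(pt)
        elif in_text and line.startswith("a=fmtp:"):
            pt, params = line[len("a=fmtp:"):].split(" ", 1)
            fmtp[int(pt)] = params
    if "t140" not in codecs:
        return None
    info = {"t140_pt": codecs["t140"], "red_pt": codecs.get("red"),
            "redundant_generations": 0}
    if info["red_pt"] is not None:
        # e.g. "98/98/98": primary plus two redundant generations
        levels = fmtp.get(info["red_pt"], "").split("/")
        info["redundant_generations"] = max(len(levels) - 1, 0)
    return info
```

With the SIP offer shown above, this would report payload type 98 for T.140, 99 for RED, and two redundant generations, which matches the “last two T.140 blocks” the plugin keeps track of when sending.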

Implementing the UI/UX part of the proof-of-concept wasn’t that straightforward either. In fact, we had to extend the existing SIP plugin demo with basic real-time text support, which included:

Negotiating data channels when offered (or offering data channels when needed for real-time text);
Adding new UI elements to host the chat part (chat box and input area);
Properly rendering the remote real-time text (i.e., adding the remote characters to the chatbox, and intercepting codes like backspace and line separator);
Properly handling user input, meaning intercepting keypresses to detect when to send custom codes (e.g., backspace or line separator) and when to send regular text instead, besides preventing cursor repositioning (to force users to use backspaces to correct text).

To keep things simple, we implemented a common window as the main UI (local and remote text in the same window). This resulted in a very basic and dumb UI implementation (which would probably make the skin of RTT experts crawl), that still managed to act as a simple proof-of-concept for testing nevertheless.

Testing proved to be a bit of a headache, though. In fact, as anticipated in the previous sections, there aren’t many implementations freely available: most are commercial only, or more or less tied to specific platforms; in the open source space, Asterisk does support real-time text via SIP, but only in passthrough mode, meaning it couldn’t be used as an end user to talk to.

Eventually the choice fell on an old, Java open source implementation called TIPcon1. While quite old and not recently updated, it proved to be the only solution easily available to test against. Getting it to work proved to be a bit of a challenge, since even though it’s Java based, it apparently was conceived to only work on Windows; as such, it wouldn’t work on my Fedora machine, neither when I tried to recompile it nor when I tried to launch it via Wine. This forced me to launch a Windows VM to host it, which if you know me well enough is something I try to do as rarely as possible 🙂

That said, this eventually allowed me to test the Janus integration against an actual RTT implementation. If you have a look at the screenshots below, you’ll see that, despite the limitations explained before, it seemed to be working just as expected, which was exciting!

… and same chat from the (incredibly ugly) Janus SIP demo in a browser!

Basically both the SIP and the WebRTC user are able to see what the other is typing in real-time, which is exactly what the purpose of RTT is in the first place: “completed” messages are prefixed by the time the line separator was sent, while text being typed in right now is identified by a “typing” label. It’s pretty obvious how we tried to indeed mimic the TIPcon1 UI for the demo: again, this is just because it was the only application we could put our hands on, and so the only “reference” we had for something that made sense.

This image, instead, shows the RTP packets containing real-time text being exchanged between the two parties in the conversation:

You can see how the Janus SIP plugin (192.168.1.108 in the picture) is correctly crafting the RTP packets out of the data it gets from the WebRTC users via data channels, setting the proper timestamp and sequence number accordingly (besides a random SSRC generated at the session start), and setting the Marker Bit after what is considered an idle period (1s at the time of writing). Notice that, in this session, plain T.140 is being used on the wire, rather than redundancy via RED: this is because this test was made before support for RED was added.
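The bookkeeping described above is simple enough to be summarized in a short Python sketch. Again, this is not the plugin code: the class and its names are made up for illustration, the clock is passed in explicitly to keep the example deterministic, and setting the Marker Bit on the very first packet too is an assumption of the sketch. A real sender would typically also start timestamps from a random offset, which is skipped here.

```python
import random


class T140Sender:
    """Tracks the RTP header fields for an outgoing T.140 stream."""

    IDLE_MS = 1000  # idle period after which the Marker Bit is set

    def __init__(self, ssrc=None, first_seq=None):
        # Random SSRC generated at session start, unless one is provided
        self.ssrc = random.getrandbits(32) if ssrc is None else ssrc
        self.seq = random.getrandbits(16) if first_seq is None else first_seq
        self.last_sent_ms = None

    def next_header_fields(self, now_ms):
        """Return the header fields for a packet sent at `now_ms`
        (milliseconds, which is also the RTP timestamp unit at 1000 Hz)."""
        idle = (self.last_sent_ms is None
                or now_ms - self.last_sent_ms >= self.IDLE_MS)
        fields = {"marker": idle, "seq": self.seq,
                  "timestamp": now_ms & 0xFFFFFFFF, "ssrc": self.ssrc}
        self.seq = (self.seq + 1) & 0xFFFF  # sequence numbers wrap at 2^16
        self.last_sent_ms = now_ms
        return fields
```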

While it’s not immediately apparent from the screenshot, it also hides one more simplification we made in the demo, though, that we’ll need to sort out soon. More specifically, at the moment we’re sending a T140block on data channels any time a key is pressed: this means we’re not doing any buffering at all in the application (as TIPcon1 partially does instead), and a new RTP packet is sent by Janus for each character. This is of course quite suboptimal for the considerations made in a previous section on the tradeoff between latency and overhead: as a next fix, we plan to implement some form of buffering, in order not to send something right away but only, e.g., every 100ms. Whether that will happen in the browser (e.g., with a setInterval) or in the plugin (e.g., plugin not sending incoming data channels to RTP right away, but buffering them instead) is something we’ll have to evaluate.
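Wherever it ends up living, the buffering logic itself could be as simple as the following Python sketch. It’s only meant to illustrate the idea: the class name is hypothetical, the 100ms interval is just the example value mentioned above, and time is passed in explicitly so that the behaviour is easy to follow.

```python
class T140Buffer:
    """Collects typed characters and releases them as a single T140block
    at most once per interval."""

    def __init__(self, interval_ms=100):
        self.interval_ms = interval_ms
        self.pending = ""
        self.last_flush_ms = None

    def push(self, text, now_ms):
        """Queue typed characters; return a T140block (bytes) if one is
        due right now, or None otherwise."""
        self.pending += text
        return self.flush_if_due(now_ms)

    def flush_if_due(self, now_ms):
        """To be called by a periodic timer as well, so that buffered
        text is not stuck waiting for the next keypress."""
        if not self.pending:
            return None
        if (self.last_flush_ms is not None
                and now_ms - self.last_flush_ms < self.interval_ms):
            return None  # too early, keep buffering
        block = self.pending.encode("utf-8")
        self.pending = ""
        self.last_flush_ms = now_ms
        return block
```

The first character after a pause is sent right away, while characters typed in quick succession are bundled together, which keeps the perceived latency low while still cutting the number of packets.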

What’s next?

I was really excited to work on the effort I described in this blog post. At the same time, though, I realize it’s far from enough, and that more needs to happen before it can actually and really be useful:

First of all, at the time of writing the implementation you can find available in the PR does support redundancy via RED, but isn’t using the redundant information as it should: coupled with the fact that no buffering is done when relaying from SIP to WebRTC to compensate for potential out-of-order packets, and no buffering is currently done in the WebRTC capture/encoding part either, this means the communication may not be very robust in problematic networks, at least for packets coming from the SIP endpoint.
Besides, I only tested this with the open source TIPcon1 client: considering it’s a quite old and possibly deprecated implementation, it’s hardly a guarantee that this will work as expected with more advanced clients out there. Unfortunately, though, that’s the only client I could get access to: hopefully the availability of this effort as an open source implementation will encourage interoperability testing by third parties, and hopefully foster fixes and enhancements to be contributed back.
As anticipated, this doesn’t 100% adhere to the specification for RTT over WebRTC yet: that said, there’s not much we can do in that regard, as that will only change once browsers do implement support for the missing SDP attributes. I hope this will still be considered a helpful example to help the specification go forward, since it does send T.140 blocks over data channels already anyway.
Finally, the ugly user interface I wrote to test this all would really need to be improved: while it’s “functional” and works nicely as a simple proof-of-concept, I’m by no means a UI/UX expert, especially in such a delicate context. Hopefully more implementations built on top of Janus will come in the future to fix that shortcoming.

Hope you enjoyed reading all this, and I’m looking forward to your thoughts!

Lorenzo Miniero