Working with real-time media has always been cool (it definitely never really felt like work to me!), but it has become even more fun in the past few years. The explosion of AI/ML really opened the doors to interesting new opportunities in the real-time multimedia space, especially when it comes to integrations in WebRTC-based workflows.

As a research-oriented company, we were of course quite intrigued by the possibilities, and that eventually led us to start working on a new open source project called Juturna, which is what I’ll introduce in this blog post, and what my colleague Antonio Bevilacqua recently talked about at RTC.ON. We’ve been using it extensively ourselves since we started, of course, but what’s important is: will this be useful to you too? IMHO the answer is probably a resounding yes, but if you’re curious about what Juturna is, how it works and what it can do, just grab a cup of tea and keep reading!

The origin story

Each project usually has a cool origin story, something that sparked the flames and turned an idea into something much more concrete, and obviously Juturna is no exception.

Our first steps in the AI/ML+RTC world started from where probably everyone else in the same space did: we wanted live transcriptions for our calls, meetings, or whatever else involved audio. In our case specifically, we needed to add them to our virtual event platform, the one powering remote participation support at IETF meetings. Within the context of hybrid meetings like those at the IETF, we wanted a way to provide participants, whether at the venue or remote, with a live transcription of whatever was being said during a session, to make it easier to follow sessions and understand discussions. This proved particularly useful at the meeting in Yokohama, which is where we first deployed the solution, as attendees with a more limited grasp of the English language had a helpful “crutch” to follow sessions effectively.

When we started tinkering with transcriptions, to keep things easy we decided to rely on an external service for the purpose, Deepgram to be precise. We used it for a few meetings and it worked fine, but we always felt we lacked the proper control and understanding of the process to really get the best out of it. This led us to the decision that we’d try to come up with something of our own instead. At the time, we had just hired Antonio, a former student of ours who had gone on to do great things abroad, obtaining a Ph.D. in Ireland working on AI/ML, and so he made for the perfect candidate to investigate this new activity.

Long story short, this led to Whispy, a project Antonio spoke about at JanusCon too. Whispy was our effort to adapt STT Whisper models to real-time environments, an endeavour that, as everyone who has ever tried to tackle it knows, doesn’t come without challenges. It took a lot of work and lessons learned, but the effort eventually proved successful, and Whispy ended up replacing Deepgram as our backend for live transcriptions at IETF meetings.

At this point, though, we knew we didn’t want to limit our efforts to transcriptions alone. There’s a whole plethora of cool things you can do with AI/ML and real-time media, and we wanted to explore as many as possible, ideally in a flexible and configurable way. Taking the Janus development process as a lesson, we decided to try and come up with a more modular approach instead, where we could dynamically create real-time media pipelines to process data in different ways, using extensible nodes that multiple developers could contribute and/or use.

That ended up being the seed for what turned out to be Juturna itself, as a project stewarded by Antonio and another brilliant colleague of mine, Paolo Saviano, who had already experimented with something like this a few years ago and talked about it at the first edition of JanusCon.

And, if you’re curious about the name, we have a fun story about that too! If you’re here, you probably know about Janus, our open source WebRTC Server. As you may or may not know, we chose that name for the server as, in Roman mythology, Janus was the “God of all beginnings, gates, transitions, time, duality, doorways, passages, and endings”. As the God who saw both the past and the future at the same time, it felt like the perfect name for our server, which at the time had been conceived as a “gateway” between the past (legacy protocols and technologies) and the future (WebRTC and beyond). So, who’s Juturna in all this, you’re asking? Well, long story short, in the same Roman mythology, Juturna was the “goddess of fountains, wells and springs”, and, surprise, the wife of Janus himself! Considering we planned to use Juturna with Janus a lot (even though, as we’ll see, you can easily integrate Juturna in other contexts), it made a lot of sense to name the project like that. Besides, the heavy reliance Juturna places on pipelines and flows mapped pretty well to her role in the mythology as well.

What’s Juturna, then?

In a nutshell, Juturna is a lightweight, plug-in-oriented streaming data-pipeline engine written in Python.
It was designed to make it trivial to assemble audio/video/sensor processing workflows that move data from live sources (RTP, files, cameras, microphones) through arbitrary transformations and finally to sinks (HTTP webhooks, FFmpeg-based streamers, files, or any other piece of code that consumes data locally or remotely).

As such, there were three key aspects that we focused on when designing the framework:

  • Real-Time: it was important for us that it would work in real-time and on real-time data, using streams that could come from many sources. RTP is a first-class citizen, obviously, since as a standard intermediate format it makes it easy to use, e.g., Janus as a source of media to process, but it’s not the only option.
  • Multimedia: we obviously needed to be able to handle, process and transform media of different types and formats: audio and video, of course, but not only that, the idea being that generic data could be processed as well, assuming a shared and more or less “standardized” format was used to address it.
  • Pipelines: in order to handle the transformation process of one or more streams of data concurrently, it was important to use some form of pipelining of the data, by envisaging asynchronous and parallel units that could run multiple workflows.

Introduced like that, it might look like yet-another-pipeline-tool. While we clearly did learn a lot from existing frameworks based on the same principles, we tried to design it differently. For the modular nature of the framework, for instance, we decided to use dynamic and hot-pluggable nodes: this means that not only can you construct a pipeline out of nodes coming from different sources and/or developers, but new nodes can also be discovered (and used) at runtime, making for a much more flexible approach. In order for heterogeneous nodes to be able to work and interact with each other, we obviously needed to figure out a way to properly “type” payloads: to do so, we came up with ways to address, in an extensible way, different kinds of payloads through a flow, like audio frames, images, generic objects, etc., all trying to use a “zero-copy” approach as much as possible. And, to make this modular approach as easy as possible to use and extend, we designed the framework from a developer perspective, and with a developer focus: this means it’s supposed to be very easy to, e.g., create a new node with a CLI command and very few lines of Python.
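
To give a rough idea of what that developer focus means in practice, here’s a minimal sketch of what a custom processing node could look like: the class shape, the method names and the way data flows through it are assumptions on my part for illustration purposes, not the actual Juturna node API, so treat it as pseudocode and check the documentation for the real interfaces.

# Illustrative sketch of a custom processing node (not the actual Juturna API)
import numpy as np

class GainNode:
    # Hypothetical processing node that scales incoming audio samples

    def configure(self, configuration: dict):
        # values would come from the node's "configuration" JSON object
        self.gain = configuration.get('gain', 1.0)

    def update(self, samples: np.ndarray) -> np.ndarray:
        # receive data from the upstream node, transform it,
        # and return what should be sent downstream
        return samples * self.gain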

From a deployment perspective, we also tried to look at it from different angles. While it’s definitely possible to use Juturna in a more scalable and distributed context, we designed it with some local-first policies to make it as easy as possible to deploy and launch. As such, everything runs in a single Python process, and no brokers, databases or extra services are needed for the process itself. Of course, different nodes may have different requirements when it comes to the resources they need to operate, but the key point is that the framework itself should be as lightweight and easy-to-run as possible. That’s why one of our design goals was indeed ensuring that a Juturna instance could be deployed as a single static binary/library or a tiny Docker layer, without additional and possibly complex requirements.

As we’ll see in the next few sections, where we’ll have a look at what pipelines are and how they work, we tried to make everything as easy as possible to get into and understand for new users of the framework. By leveraging a full Python authoring surface, for instance, we designed Juturna so that ML engineers who already know, e.g., NumPy and PyTorch can jump in immediately and with little effort. And while Juturna comes with a few nodes of its own out of the box, one of its strong points is actually its extensibility, and the ability for developers to add nodes of their own to integrate in new or existing pipelines: this is where the hot-pluggable nature of nodes comes into play, as adding, e.g., a new algorithm is supposed to be as simple as a git clone into a plugins folder, without requiring any recompilation or container restart.

And, last but not least, the real-time nature of Juturna pipelines also meant it was important to pay close attention to glass-to-glass latency, especially on commodity hardware, in order to make interactive applications based on the framework feel as responsive as possible.

In a nutshell, those were our main design goals when we first started designing Juturna and working on it. In order to better understand how that actually works in practice and what you can do with it, it may be helpful to have a look at the overall architecture, and pipelines in particular.

What are pipelines, and what do they look like?

To better understand how pipelines work in Juturna, and their role in the workflow, it may be useful to start from a more practical example. Let’s use the following diagram as a reference.

In this particular example, we have a pipeline working on two separate media streams: a live audio stream, and a live video stream, both coming in via RTP. The source of these RTP streams could be anything: they could be coming in via Janus, e.g., via RTP forwarders, or from other common RTP sources like FFmpeg or GStreamer; or they may be coming in from any other implementation that can originate and/or relay RTP streams. The source doesn’t really matter: what’s actually important is that Juturna is receiving two streams via RTP, and that those streams are now feeding the pipeline we configured.

The audio stream goes through a series of nodes that form a combined transcriptor and translator workflow. Specifically, RTP packets are first processed by an audio_rtp node, whose purpose is receiving packets and processing them, most importantly decoding them to a raw format that can be processed along the chain: these raw samples are then buffered according to some rules, so that they can be processed accordingly. The next step is a voice_detector node, whose role is to check whether there’s any audio to process at all: in fact, in order to avoid wasting cycles needlessly, having a VAD-like node is of paramount importance, e.g., to skip processing frames that we know to just be silence. Samples that “survive” the VAD cut are then passed to the next node in the chain, a transcriptor node: this is where the “meat” of the transcription service is, since it’s a node leveraging a Whisper model. At this point, the result of this transcription is forked and passed to two different nodes along the chain: (i) an http_notifier node, where we just push the results via HTTP to an external backend (e.g., for displaying the live transcription somewhere), and (ii) a translator node, where we can take the text we transcribed and translate it to a different language (from English to Italian, in this specific case). The result of this translation is then pushed to a different http_notifier sink node.

The video stream, instead, is processed differently. We still have a video_rtp node to process incoming RTP packets and turn them into raw video frames we can work on, but the nodes that follow are different, since what we want to do with this video stream is something else. More precisely, we first pass the decoded video frames to a motion_detector node, to check whether we can detect any movement in the video stream we’re receiving: you could see this as the video equivalent of the audio VAD we saw before, if you will. Once we know there’s video to process, we pass the frame to a pose_estimator node, which will return both the metadata associated with the processing (e.g., the coordinates of the detection) and a new frame enriched with that metadata as an overlay. Instead of pushing these results somewhere as they are, though, we now feed them to an ffmpeg_videostream node, which will encode a new video stream using the enriched frames: the end result can then be pushed, again via RTP, to an external backend, e.g., a Janus Streaming plugin mountpoint where interested subscribers can consume the processed video.

As you can see, even from a high level overview of this sample pipeline, it’s clear how flexible and configurable Juturna can be, especially in terms of the nodes to involve for each stream, in series or in parallel. As in many other pipeline-oriented frameworks, there are source nodes (nodes that act as the starting point for a specific chain), processing nodes (nodes that receive data from upstream, process it, and send something else downstream), and sink nodes (nodes that present and/or encode results to some external process or backend). With that in mind, it becomes easy to figure out how to integrate Juturna in a workflow that involves a heterogeneous mix of third-party applications: as long as a shared format is used for sending and receiving data, Juturna can process whatever you need it to.

That is how we integrated Juturna in our IETF transcriptions service, for instance. We mentioned how, at the time, we had developed Whispy exactly for that purpose: once we started working on Juturna, we refactored Whispy as a series of independent nodes that could form a chain, including, e.g., the VAD and transcriptor nodes we saw in the diagram above. For the sake of simplicity, that diagram didn’t include all the nodes we typically use for transcriptions ourselves (e.g., the hallucination filter that could be added after the transcriptor node), but in general, as you can imagine it’s simple to just chain multiple nodes to better shape the pipeline you’re interested in, which is exactly what we did.

In the next few sections we’ll go through some practical examples and demos, but before we do that, let’s have a deeper look at what nodes actually are, and how payloads are formatted and processed.

Nodes and payloads

No matter how nodes are actually implemented, when used within the context of a Juturna pipeline they are all basically described by JSON objects: these objects contain all the info Juturna needs to figure out how to load, configure and involve a node in the process.

The following snippet, for instance, could be the JSON object associated to the audio_rtp node we saw in the previous diagram:

{
  "name": "0_src_audio",
  "type": "source",
  "mark": "audio_rtp",
  "configuration": {
    "rec_host": "192.168.1.10",
    "rec_port": 8888,
    "audio_rate": 16000,
    "buffer_size": 3
  }
}

while this could be the configuration of the VAD node instead:

{
  "name": "1_voice_detector",
  "type": "proc",
  "mark": "vad_silero",
  "configuration": {
    "rate": 16000,
    "threshold": 0.75,
    "speech_pad_ms": 400,
    "min_silence_duration_ms": 500
  }
}

While nodes can have very different implementations and roles within a pipeline, the way you configure them is pretty much the same: the type attribute, for instance, specifies whether this is a source, a processing or a sink node, while the mark attribute provides a reference to the node implementation itself. The configuration object is quite important, of course, as this is where you configure all the dynamic parameters needed to tweak the behaviour of the node you’re loading: for an RTP source node this may be info on which address/port to bind to or which codec to use, for instance, while other processing nodes will expose properties to dynamically configure whatever algorithm they implement. As such, the configuration object will always be node-specific.

As you can see in the snippets above, all nodes are named as well, which is quite important when it comes to deciding how each of those nodes should be used within a specific pipeline, i.e., what is connected to what, and how. Just like nodes, pipelines are basically JSON files as well, and a snippet partially matching the diagram we saw before can be seen here:

[
  { "from": "0_src_audio", "to": "1_voice_detector" },
  { "from": "0_src_video", "to": "1_motion_detector" },
  { "from": "1_voice_detector", "to": "2_transcriptor" },
  ... ,
  { "from": "2_transcriptor", "to": "4_dst_trx" },
  { "from": "2_pose_estimator", "to": "4_dst_pose" }
]

In a nutshell, a pipeline is a JSON array that lists one or more objects, where each object represents a specific connection between one node and another. In the snippet above, this means we have an RTP audio source feeding a VAD, for instance, the VAD then feeding a transcriptor, and so on and so forth. The fact we can simply configure which node feeds which node using the from and to attributes makes it very easy to script a complex pipeline made of nodes in sequence and/or in parallel.

And when it comes to nodes, as we mentioned initially, the framework was conceived to be modular, which means these nodes can come from different places. Just like Janus with its plugins, Juturna is just the framework and the “core”: all nodes are external, and optional, components. The Juturna repo does come with some nodes out of the box as part of its core library (specifically those that allow you to set up a live transcription pipeline, at the time of writing), but you’re of course not limited to those, and as a matter of fact you can definitely use nodes written by someone else, or even write your own. This means that nodes can be built-in, community-provided, private, or local: and whether the execution of each node is fully local or relies on external/remote services and resources, everything lives within the same Python process, which is Juturna itself, allowing for streamlined processing of data through a series of potentially heterogeneous nodes.

Of course, this only works if there are ways to access nodes dynamically. Without going into too much detail on this (you can watch Antonio’s presentation or read the documentation for more details), you can have Juturna load a custom module in two separate ways:

  1. you simply import/copy the node into the related folder, so that Juturna can find it locally, or
  2. you reference it externally via a configuration file, and Juturna will pull it on the fly.

At the time of writing, only GitHub is supported as a mechanism for pulling nodes externally (which made sense, as Juturna is an open source project available on GitHub itself), but in the future we’ll obviously add more options. Should the project gather enough interest, it could be interesting to design interfaces for dedicated node repos as well, e.g., a bit like repos work for Linux distros.

Long story short, when Juturna starts, different kinds of nodes can be loaded: some may be part of the core library, some may be plugins available on the official repo, while others may be custom nodes coming from somewhere else. The whole process is summarized in the diagram below.

Once nodes are available, they can be used as part of a pipeline, as we’ve seen in the initial example. We’ve explained how both nodes and pipelines are addressed via JSON files/objects, which means a JSON file is all Juturna needs to instantiate available nodes to form a specific pipeline, like this:
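
(A quick caveat before the snippet: the import name and factory helper below are placeholders I’m using for illustration, not necessarily the actual Juturna API, so refer to the documentation for the real calls.)

# Illustrative: build a pipeline instance out of a JSON description
# (module and function names here are hypothetical)
import json

import juturna  # assuming the package is importable under this name

with open('pipeline.json') as f:
    description = json.load(f)

# hypothetical factory that resolves each node's "mark" and wires the links
pipeline = juturna.Pipeline.from_json(description)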

Once a pipeline has been instantiated, its lifecycle can then be managed via some key functions, e.g., to go through a warmup phase, and then to dynamically start and stop it:
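
(Again, the method names below are placeholders rather than the documented API; take this as a sketch of the lifecycle idea and check the documentation for the actual calls.)

# Hypothetical lifecycle management of an instantiated pipeline
pipeline.warmup()   # e.g., pre-load ML models and allocate buffers
pipeline.start()    # start pulling data from sources and processing it

# ... media flows through the nodes while the pipeline is running ...

pipeline.stop()     # stop processing and release resources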

Once a pipeline is started, media will start going through the different nodes, which will trigger whatever behaviour they were implemented to take care of. As anticipated, while nodes themselves run locally as part of the Juturna instance, that doesn’t mean that the processing of data will be entirely local as well: some processing nodes may refer to local ML models, for instance (e.g., a transcriptor node leveraging a local Whisper model), but there could be nodes that act as local interfaces to remote services, where the bulk of the work actually takes place.

Of course, this modular approach where different nodes from different sources seamlessly work together only works when we can make some assumptions on how data is formatted. If we need to chain two nodes, for instance, we need to make sure that the downstream node supports and understands whatever format the upstream node is using to describe the data payload, otherwise nothing will work. While ideally there could be a globally uniform way of representing data, it may be unrealistic to mandate this on every node, especially when we have nodes working on internal transformations that don’t really need to be consumable anywhere else (which may be common for nodes meant for specific pipelines, or custom workflows). As such, while the core library comes with a few payloads for common formats (e.g., audio, video, images, etc.), Juturna is also conceived to allow developers to customise their own plugins with new and custom payloads, should that be needed.
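
Just to make the idea of a typed payload more concrete, here’s what a custom payload could look like in plain Python; this is a sketch of the concept with field names I made up for illustration, not one of the payload classes actually shipped with the core library.

# Illustrative custom payload for a chunk of raw audio (not a core library class)
from dataclasses import dataclass

import numpy as np

@dataclass
class AudioChunk:
    samples: np.ndarray   # raw PCM samples, e.g. float32 mono
    rate: int             # sampling rate in Hz, e.g. 16000
    start_ts: float       # capture timestamp of the first sample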

An RTP packet goes into a foobar…

I’m sure all this is quite interesting, but why not have a look at some demos of what we’ve been using Juturna for so far, for a more practical understanding of its potential?

As we anticipated, we’ve been using Juturna for quite some time already to perform live transcriptions of IETF meetings, which we provide remote participation support for. These live transcriptions are not only available to remote attendees, via a side panel in the UI, but also to local attendees, in a side panel on one of the available screens in the room.

Functionally, the pipeline we use looks very much like the example we’ve seen before. A Janus AudioBridge mixer feeds the Juturna pipeline with live data, using RTP forwarders: audio frames are then processed in a chain made up of a Silero VAD node, a Whisper-based transcriptor node, and an additional node (currently not available on the public repo, but it may be in the future) responsible for aggregating and correcting results, filtering hallucinations as well. The processed data (timestamped transcription data) is then pushed in real-time via the HTTP notifier node to a component that’s responsible for presenting the results to end users: the live page this component generates is integrated, in both the second screen and the users’ UI, as an iframe. Considering that IETF meetings typically consist of 8 parallel tracks and about 150 sessions that run over the course of a week, we typically deploy a couple of Juturna instances to serve them all, as that’s more than enough.

Antonio showcased a variant of this pipeline during his RTC.ON presentation as well, by having Juturna run on his laptop and processing audio coming from his microphone: he demonstrated how his audio was being transcribed live, and then translated to Italian in real-time as well as part of the process, using a translator node.

There are more interesting things you can do than transcriptions, though, even using video rather than audio. For instance, as part of a PNRR project our company was involved in, Juturna pipelines were used to analyze live video streams in order to identify potential security hazards, like people not wearing a helmet in areas where they should. A short snippet from a demo made for the project (using stock footage as a simulated live video source) can be seen below.

In this specific instance, a YOLO-based estimator node was used to process the live video data, followed by a custom node used for annotation purposes. The processed frames were then re-encoded as an RTP stream and fed back to Janus, so that they could be watched in real-time using WebRTC: the animation above is indeed a partial capture of the web page presenting the real-time processed data.

Both examples are straightforward enough to understand, but should already give you an idea of the flexibility Juturna has as a framework. As long as nodes exist to provide a specific functionality needed for specific processing chains, the sky’s the limit! And with the extensibility we designed for nodes, it should be quite easy to develop, expand or add new nodes any time a new requirement for a Juturna pipeline pops up, as most of the time it might simply be a matter of providing the right interfaces to wrap existing models or solutions.

Where are we now, and what’s next?

Needless to say, as a very new project that we only recently released to the public, Juturna does work, but not without a few rough edges. Besides, there are a few assumptions that the codebase currently makes that may be worth revisiting as the project evolves.

First of all, though, Juturna does work, at least for the scenarios we’ve been using it so far. We’ve explained, for instance, how we refactored Whispy, our original homemade Whisper-based transcription service, as a set of nodes on Juturna itself, and that’s been used in production to power the transcription service for the past few IETF meetings already, and doing a great job at that. We’ve also experimented with different Juturna pipelines within the context of a PNRR project, where we involved not only audio, but video as well, as shown in the previous section.

On a more general note, though, we do know there are a few areas for improvement:

  • The way pipelines are currently conceived, there could be some form of implicit back-pressure/bottleneck if a consumer is slow, which could in turn result in either growing buffers (and their impact on RAM) or dropped packets.
  • The threading model could be improved too, as at the moment the thread count is basically the number of active nodes. This means that, while there typically is no overhead for I/O-heavy sources (e.g., networking for RTP), CPU-bound code might hit a cap. This seems to be a problem mostly with versions of Python lower than 3.13, but we’ll definitely need to investigate whether using more recent versions of Python helps in that regard.
  • As mentioned, GitHub is currently the only “embedded” way to retrieve external nodes: while this works in principle and it’s quite simple to use, it also means you need to pay attention to GitHub API quotas, should you start deploying many instances.

Again, these are just a few random considerations on stuff that we’re aware of, but as you can imagine we’ve also been actively working on this, so we do plan to tackle them in upcoming versions of Juturna, besides all the additional functionality we want to add.

That said, what we’d really love to know is if it would work for other people as well. As the authors of the framework, we did drive the design towards choices that would benefit our own requirements, but it’s of course important to understand if those design choices did preserve the flexibility we wanted for the project in a way that allows other people to use it in scenarios different from our own. Janus is a very flexible component, but it only turned out to be successful because other people saw the same flexibility in it as we did: will it be the same for Juturna?

As such, we’re definitely interested in feedback from developers that may be intrigued by the possibilities Juturna promises, and possibly in building a community around it. What do you think of it? Do you feel you could use it as it is? Was it easy for you to write new nodes? What’s stopping you from using a specific pipeline you’ve been thinking about? The more we know about all this, the better we can work on Juturna to ensure a bright future for the project and its community.

That’s all folks!

We hope you enjoyed this introduction to Juturna, our new framework for real-time AI pipelines, and that we intrigued you enough to give it a try.

If you want to learn more, please make sure you watch Antonio’s presentation on the topic, and read the documentation which comes with extensive details and a few examples. As firm believers in open source, we’re very interested in building a community around this, so should you need help with it, or should you be interested in contributing to the project in any form, please don’t hesitate to reach out!

I'm getting older but, unlike whisky, I'm not getting any better