The importance of accessibility is being increasingly recognized in the digital world, including for Mux and our customers. End-users expect captions to be available on most video on the internet — but having to go to a third party to produce captions files and build a workflow to support uploading those files to Mux is a lot of extra friction for you, our customers. We want to make video easy, and part of that is enabling new functionality whenever we can.
Over three years ago, Mux added initial support for captions. We expanded this to live captions two years ago, then added automatic generation of live captions last year. Now, it’s time for us to fill in the biggest remaining piece. We’re excited to announce automatic captioning for on-demand assets, available now as a public beta.
Our new automatic captioning uses OpenAI’s Whisper model, a state-of-the-art implementation of speech recognition. We run this entirely on our infrastructure, so your video isn’t sent to any third-party vendors.
As we’ve experimented with this model, we’ve been really impressed with how well it works on a wide variety of content. However, there are cases where the output won’t be perfect, and a different workflow might be better for some of our customers. You should evaluate whether the quality of generated captions is sufficient for your use cases.
And one of the best parts: this feature is included in Mux’s video encoding at no additional cost. You can auto-caption every minute of on-demand video you upload to Mux for exactly the price you’re paying today.
We want it to be as easy as possible for you to create great, accessible customer experiences. We’ve also been thinking about other features that speech recognition can enable. Many of our customers have asked for simple transcripts they can feed into other workflows — sentiment analysis, translation, content summarization, moderation tools, and more.
We wanted to make this tool available in a way that doesn’t require you to implement complex audio processing on your own. Alongside streamable captions, a transcript is now available to help you build even greater product experiences using video.
When you create a new asset, you can specify generated_subtitles as a list on the first input, like this:
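For example, a minimal request body might look like the following (the input URL here is a placeholder, and the `name` value is just a display label you choose):

```json
{
  "input": [
    {
      "url": "https://example.com/video.mp4",
      "generated_subtitles": [
        {
          "language_code": "en",
          "name": "English (generated)"
        }
      ]
    }
  ],
  "playback_policy": ["public"]
}
```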
Your asset will initially ingest as normal. When the asset is ready (i.e., you get the video.asset.ready event sent to your webhook endpoint), you’ll see that there’s an additional text track that will look like this:
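Roughly sketched, the generated track entry should resemble the following (the `id` is a placeholder, and exact fields may vary; note the `text_source` value distinguishing auto-generated tracks):

```json
{
  "id": "...",
  "type": "text",
  "text_type": "subtitles",
  "text_source": "generated_vod",
  "language_code": "en",
  "name": "English (generated)",
  "status": "preparing"
}
```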
Then, some time later (generation time depends on the duration of your video, but for a short video it typically takes only a few seconds), you’ll receive a video.asset.track.ready event at your webhook endpoint.
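That event follows the standard Mux webhook shape; as a sketch (field values here are illustrative placeholders), the payload looks something like:

```json
{
  "type": "video.asset.track.ready",
  "data": {
    "type": "text",
    "text_type": "subtitles",
    "language_code": "en",
    "status": "ready"
  }
}
```

The track’s status moving to ready is your signal that the captions are available for playback.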
Now, if you start playback, you will have English subtitles available, just like on this tutorial video from Dave.
For more details on how to enable this feature, as well as details on how to retrieve a transcript of your content from our new transcription endpoint, check out the latest guide in our documentation.
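As a rough sketch of what transcript retrieval looks like (the exact path and parameters are covered in the guide; `PLAYBACK_ID` and `TRACK_ID` are placeholders for your asset’s values):

```
https://stream.mux.com/PLAYBACK_ID/text/TRACK_ID.txt
```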
As we gain experience with the strengths and weaknesses of the Whisper system underlying this feature, we intend to improve the quality of the generated captions over time. We hope that we can make it work well enough for you, even for some cases where it’s not good enough right now.
Automatic captioning for on-demand content is a feature we’re thrilled to bring to you, but there are some additional details coming soon (and we’d love your feedback if these are essential for you):
- Custom vocabulary (as in our live auto-captioning support)
- Additional languages and language autodetection (today, we only support English)
- Profanity filtering
- Adding auto-generated captions to your existing on-demand Mux assets
We’re excited to see what you can build with this! We’re always open to feedback, so let us know what you think at firstname.lastname@example.org.