You just finished building a slick video upload flow. Files come in, get transcoded, and play back beautifully. Then someone asks: "Do these videos have captions?" And suddenly that clean architecture has a conspicuous gap in it.
Captions and transcripts aren't a nice-to-have anymore. Accessibility regulations, search engine indexing, AI-powered content discovery, and plain user preference have made them table stakes for any serious video product. The problem is that most developers treat captioning as an afterthought — something to bolt on later with a manual upload form or a third-party editor. That approach breaks down the moment you're handling more than a handful of videos.
This guide covers the full picture: how auto-generated captions work under the hood, how to enable them programmatically for both VOD assets and live streams, how to attach multi-language subtitle tracks, and how to build searchable transcript experiences on top of it all. By the end, you'll have a concrete, API-driven workflow that scales from one video to one million.
Why Captions and Transcripts Matter More Than You Think
The accessibility case is obvious but worth stating clearly. WCAG 2.1 requires captions for prerecorded synchronized media (Success Criterion 1.2.2, Level A), ADA compliance efforts typically use WCAG as their benchmark, and courts have consistently held that inaccessible video content constitutes a barrier to equal access. For developers, this is a non-negotiable requirement, not a feature request.
The less obvious case is SEO and AI discoverability. Search engines can't watch video — they read text. A video without a transcript is essentially invisible to Google, and increasingly to the AI models that surface content in response to natural language queries. Adding a time-synced transcript to your video pages gives crawlers something to index and dramatically expands the surface area of discoverable content.
There's also the engagement angle. Captions consistently improve completion rates, especially on mobile where audio is often off by default. And for enterprise use cases — recorded webinars, training videos, knowledge bases — the ability to search inside video content is often the entire value proposition.
Manual captioning costs anywhere from $1 to $3 per minute of video and introduces human latency into your pipeline. At 100 hours of uploaded content per week, that's 6,000 minutes and between $6,000 and $18,000 in captioning costs every week, before you've done anything interesting with the data. API-driven auto-captioning eliminates that ceiling.
How Automatic Speech Recognition Pipelines Work
Before diving into the API calls, it helps to understand what's happening inside an auto-captioning pipeline. The process has three stages.
Audio extraction is first. The video file is demuxed and the audio track is isolated, typically as a PCM or compressed audio stream. For multi-track content, you'll need to select which audio track to transcribe.
ASR (Automatic Speech Recognition) is the core step. A neural model processes the audio and produces a sequence of word tokens with associated confidence scores and timestamps. Modern ASR models like Whisper operate at the segment level but can produce word-level timestamps with the right configuration.
Timed text formatting takes the raw ASR output and produces a structured caption format. The two formats you'll encounter most often are SRT (SubRip Text) and WebVTT (Web Video Text Tracks). WebVTT is the standard for the web — it's what HLS players consume natively, and it supports features like positioning, styling, and cue metadata that SRT lacks. Understanding the relationship between WebVTT, HLS, and caption flags is worth reading if you want to go deep on the spec.
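To make the formatting stage concrete, here is a minimal sketch that turns ASR segments into a WebVTT document. The segment shape ({ start, end, text }, with times in seconds) and the function names are assumptions for illustration; real ASR output varies by model and provider.

```javascript
// Hypothetical ASR output shape: [{ start, end, text }] with times in seconds.

// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function toVttTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (value, width) => String(value).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build a WebVTT document: the "WEBVTT" header, then one cue per segment,
// with blocks separated by blank lines.
function segmentsToVtt(segments) {
  const cues = segments.map(
    (segment) =>
      `${toVttTimestamp(segment.start)} --> ${toVttTimestamp(segment.end)}\n${segment.text.trim()}`
  );
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}

const vtt = segmentsToVtt([
  {
    start: 83.5,
    end: 87,
    text: "Welcome to today's session on building scalable video infrastructure.",
  },
]);
console.log(vtt);
```

A production formatter would also split long segments into readable line lengths and escape characters such as & and < in the cue text, which this sketch leaves out.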
Broadcast delivery uses a different standard: CEA-608 and CEA-708 are embedded closed caption formats carried in the video bitstream itself. These matter for live streams going to broadcast or set-top boxes, but for web delivery, WebVTT is what you want.
Quality factors to be aware of include speaker diarization (identifying who is speaking when), punctuation restoration (ASR models output words, not sentences), and confidence scores (useful for flagging low-quality segments for review). Background noise, heavy accents, and domain-specific vocabulary all degrade accuracy.
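As a rough illustration of how confidence scores can drive review, the sketch below averages word-level confidence per segment and returns the segments that fall under a threshold. The segment shape and the 0.8 default are assumptions, not values from any particular ASR provider.

```javascript
// Hypothetical segment shape: { start, end, text, words: [{ word, confidence }] }

// Mean word confidence for one segment (0 when there are no words)
function averageConfidence(segment) {
  if (segment.words.length === 0) return 0;
  const total = segment.words.reduce((sum, w) => sum + w.confidence, 0);
  return total / segment.words.length;
}

// Return the segments whose mean confidence is below the threshold,
// annotated with that mean so reviewers can sort by it.
function flagForReview(segments, threshold = 0.8) {
  return segments
    .map((segment) => ({ ...segment, confidence: averageConfidence(segment) }))
    .filter((segment) => segment.confidence < threshold);
}

const flagged = flagForReview([
  {
    start: 0,
    end: 2,
    text: "Welcome everyone",
    words: [
      { word: "Welcome", confidence: 0.98 },
      { word: "everyone", confidence: 0.95 },
    ],
  },
  {
    start: 2,
    end: 5,
    text: "kubernetes ingress",
    words: [
      { word: "kubernetes", confidence: 0.41 },
      { word: "ingress", confidence: 0.55 },
    ],
  },
]);
console.log(flagged.map((segment) => segment.text));
```

Averaging is the simplest possible signal. Depending on your content, the minimum word confidence in a segment may be a better trigger, since a single badly misheard term can matter more than the mean.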
Using Mux to Auto-Generate Captions for VOD Assets
Mux handles the entire ASR pipeline as part of asset creation. You don't need to provision any speech-to-text infrastructure, manage WebVTT formatting, or figure out how to attach tracks to an HLS manifest. You pass a flag at upload time and captions appear.
To add auto-generated captions to a video asset, include a generated_subtitles array on the input object in your asset creation request:
import Mux from "@mux/mux-node";

const mux = new Mux({
  tokenId: process.env.MUX_TOKEN_ID,
  tokenSecret: process.env.MUX_TOKEN_SECRET,
});

const asset = await mux.video.assets.create({
  input: [
    {
      url: "https://storage.example.com/uploads/webinar-recording.mp4",
      // generated_subtitles is set per input, alongside the source URL
      generated_subtitles: [
        {
          language_code: "en",
          name: "English (auto-generated)",
        },
      ],
    },
  ],
  playback_policy: ["public"],
});

console.log(asset.id);
Once the asset reaches ready status, the generated subtitle track is available in the asset's tracks array. The track will have a type of text, a text_type of subtitles, a text_source of generated_vod, and a status that moves from preparing to ready as the ASR job completes.
Retrieving Transcripts Programmatically
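The caption track is more than just a display feature: the underlying WebVTT file is your transcript. The asset response gives you the track ID, and the file itself is served alongside the stream under the asset's playback ID:
const asset = await mux.video.assets.retrieve(assetId);

// Auto-generated VOD tracks report text_source "generated_vod"; tracks carried
// over from a live stream report "generated_live" or "generated_live_final".
const generatedSources = ["generated_vod", "generated_live", "generated_live_final"];
const subtitleTrack = asset.tracks.find(
  (track) => track.type === "text" && generatedSources.includes(track.text_source)
);

if (subtitleTrack && subtitleTrack.status === "ready") {
  // Text tracks are served as /text/{TRACK_ID}.vtt under the playback ID.
  // This assumes a public playback ID; signed playback IDs also need a token.
  const playbackId = asset.playback_ids[0].id;
  const response = await fetch(
    `https://stream.mux.com/${playbackId}/text/${subtitleTrack.id}.vtt`
  );
  const vttContent = await response.text();
  console.log(vttContent);
}
The WebVTT content gives you timestamped cue blocks you can parse, index, or feed into downstream systems. Each cue has a start time, end time, and the transcribed text, which is everything you need to build a searchable transcript index.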
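If a downstream system only needs plain text, a dependency-free sketch like the one below may be enough. It handles simple WebVTT files (header, optional cue identifiers, timing line, text lines) and is not a full implementation of the spec; the function names are illustrative.

```javascript
// Convert "HH:MM:SS.mmm" or "MM:SS.mmm" into seconds
function vttTimeToSeconds(timestamp) {
  return timestamp
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

// Split a WebVTT document into cues: { start, end, text }
function parseVttCues(vttContent) {
  const blocks = vttContent.replace(/\r\n/g, "\n").split(/\n{2,}/);
  const cues = [];
  for (const block of blocks) {
    const lines = block.split("\n").filter((line) => line.trim() !== "");
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // header, NOTE, or STYLE block
    const [startRaw, rest] = lines[timingIndex].split("-->");
    const endRaw = rest.trim().split(/\s+/)[0]; // drop cue settings such as "line:0"
    cues.push({
      start: vttTimeToSeconds(startRaw.trim()),
      end: vttTimeToSeconds(endRaw),
      text: lines.slice(timingIndex + 1).join(" "),
    });
  }
  return cues;
}

// Flatten cues into a single plain-text transcript
function cuesToTranscript(cues) {
  return cues.map((cue) => cue.text).join(" ");
}

const sampleVtt = [
  "WEBVTT",
  "",
  "00:00:01.000 --> 00:00:04.000",
  "Welcome to today's session.",
  "",
  "00:00:04.500 --> 00:00:07.250",
  "Let's start with",
  "the upload flow.",
].join("\n");

const sampleCues = parseVttCues(sampleVtt);
console.log(cuesToTranscript(sampleCues));
```

For anything beyond simple files (inline tags, regions, unusual cue settings), a maintained parser library is the safer choice.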
Listening for Asset Ready Events
In a real pipeline, you don't want to poll for asset status. Use Mux webhooks to trigger downstream work when captions are ready. The relevant event is video.asset.track.ready, which fires when a text track finishes processing:
// Express webhook handler. In production, verify the Mux-Signature header
// against your webhook signing secret before trusting the payload.
app.post("/webhooks/mux", express.raw({ type: "application/json" }), (req, res) => {
  // express.raw() leaves req.body as a Buffer, so decode it before parsing
  const event = JSON.parse(req.body.toString("utf8"));

  // For track events, event.data is the track itself; skip audio and video tracks
  if (event.type === "video.asset.track.ready" && event.data.type === "text") {
    const { asset_id, id: track_id } = event.data;
    // Trigger your indexing or translation workflow
    queueTranscriptProcessing({ assetId: asset_id, trackId: track_id });
  }

  res.sendStatus(200);
});
This event-driven pattern is the foundation of any robust captioning pipeline. Your upload endpoint creates the asset, Mux handles transcoding and ASR in parallel, and the webhook fires when work is done: no polling, no timeouts, no manual intervention.
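Webhook deliveries can be retried, so the same event may reach your handler more than once. One way to keep downstream work from running twice is to route events through a small dispatcher that remembers which event IDs it has already handled. The sketch below keeps that record in memory, which is an assumption made for brevity; a multi-instance deployment would need a shared store such as Redis or a database table.

```javascript
// Minimal idempotent event router (illustrative, not part of any SDK)
function createWebhookRouter() {
  const handlers = new Map(); // event type -> handler function
  const seenEventIds = new Set(); // in-memory only; use a shared store in production

  return {
    on(type, handler) {
      handlers.set(type, handler);
    },
    dispatch(event) {
      if (seenEventIds.has(event.id)) return "duplicate";
      const handler = handlers.get(event.type);
      if (!handler) return "ignored";
      handler(event.data);
      // Record the ID only after the handler succeeds, so a handler that
      // throws leaves the event eligible for a later retry.
      seenEventIds.add(event.id);
      return "handled";
    },
  };
}

const router = createWebhookRouter();
router.on("video.asset.track.ready", (track) => {
  console.log(`Track ${track.id} is ready on asset ${track.asset_id}`);
});
```

Inside the Express handler you would call router.dispatch(event) after parsing the body, then respond with a 200 regardless of whether the result was "handled", "duplicate", or "ignored".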
Auto-Generated Captions for Live Streams
Live captioning is a harder problem than VOD. You're operating on an unbuffered audio stream with no ability to look ahead, which means the ASR model has to make decisions in real time with less context than it would have over a complete recording. Accuracy is slightly lower than VOD, and there's an inherent latency trade-off: the longer you buffer audio before sending it to the ASR model, the more accurate your captions, but the more delay your viewers experience.
Mux's approach to auto-generated live captions keeps this latency to a few seconds while maintaining accuracy that's practical for most use cases. Configuration happens at live stream creation time:
const liveStream = await mux.video.liveStreams.create({
  playback_policy: ["public"],
  new_asset_settings: {
    playback_policy: ["public"],
  },
  generated_subtitles: [
    {
      language_code: "en",
      name: "English",
      passthrough: "live-session-2024",
    },
  ],
});

console.log(liveStream.stream_key);
console.log(liveStream.playback_ids[0].id);
The passthrough field is useful for correlating caption events with your own session metadata. When the live stream ends and converts to a VOD asset, the captions carry over, so your post-event recording already has captions attached without any additional work.
For use cases that need captions in multiple languages during a live broadcast — multilingual webinars, international events — Mux supports auto-generated live captions in several languages. You specify the source language and Mux handles the rest.
Adding Multi-Language Subtitle Tracks
Auto-generated captions give you one language track. For global products, you need more. The practical workflow is: generate the source transcript, translate it, then attach the translated track as an additional subtitle.
Translating Captions with Mux Robots
Mux Robots includes a translate-captions workflow that translates an existing caption track into another language with a single API call. It preserves the timing structure automatically — no VTT parsing or manual text extraction required. When upload_to_mux is true (the default), the translated track is attached to the asset automatically and appears in the player's caption selector.
import Mux from '@mux/mux-node';

const mux = new Mux();

const job = await mux.robotsPreview.jobs.translateCaptions.create({
  parameters: {
    asset_id: assetId,
    track_id: captionTrackId, // your source caption track
    to_language_code: 'es',   // BCP 47 language code
    upload_to_mux: true,      // auto-attach translated track to the asset
  },
});
To translate into multiple languages at once, kick off a job for each target language. They run in parallel:
const languages = ['es', 'ja', 'pt', 'fr', 'de'];

const jobs = await Promise.all(
  languages.map((lang) =>
    mux.robotsPreview.jobs.translateCaptions.create({
      parameters: {
        asset_id: assetId,
        track_id: captionTrackId,
        to_language_code: lang,
        upload_to_mux: true,
      },
    })
  )
);
Each job fires a robots.job.translate_captions.completed webhook when finished. The webhook payload includes a temporary_vtt_url if you need to download the translated VTT for other purposes (like feeding into a dubbing service). But for simply adding multi-language captions to your video, the auto-attach handles everything, so there is no need to manually host or attach the translated file.
If you need more control over translation quality — for example, using a specific translation provider or post-editing translations before attaching — you can set upload_to_mux: false, download the translated VTT, edit it, and attach it manually using the track creation API.
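For the manual path, the edited VTT has to be hosted somewhere Mux can fetch it, and then attached as a text track. The sketch below builds the parameters for that request; the helper name and the example URL are hypothetical, and the field names follow the asset track creation API as I understand it, so check them against the current API reference before relying on them.

```javascript
// Build the request body for attaching a hosted VTT file as a subtitle track.
// buildSubtitleTrackParams is an illustrative helper, not part of the Mux SDK.
function buildSubtitleTrackParams({ url, languageCode, name }) {
  if (!/^https?:\/\//.test(url)) {
    throw new Error("Track URL must be reachable over HTTP(S)");
  }
  return {
    url,
    type: "text",
    text_type: "subtitles",
    language_code: languageCode, // BCP 47 code, e.g. "es"
    name, // label shown in the player's caption selector
    closed_captions: false, // translated dialogue only, no sound descriptions
  };
}

const trackParams = buildSubtitleTrackParams({
  url: "https://storage.example.com/captions/webinar-recording.es.vtt", // hypothetical location
  languageCode: "es",
  name: "Español",
});
console.log(trackParams);

// With an authenticated client, the attach step would then look like:
// await mux.video.assets.createTrack(assetId, trackParams);
```

Keeping the parameter construction separate from the network call makes it easy to unit test and to reuse across languages.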
Building Searchable Transcript Experiences
Captions displayed in a player are valuable. Captions indexed and made searchable are a product feature that can differentiate your entire platform. The pattern is straightforward: parse the WebVTT output, extract the cue text and timestamps, and index it with a search engine.
Parsing and Indexing Transcript Data
A WebVTT cue looks like this:
00:01:23.500 --> 00:01:27.000
Welcome to today's session on building scalable video infrastructure.
Parse each cue into a document with the asset_id, start_time (in seconds), and text, then push it to your search index:
import { parseSync } from "subtitle";

function vttToSearchDocuments(vttContent, assetId) {
  const cues = parseSync(vttContent);
  return cues
    .filter((node) => node.type === "cue")
    .map((cue) => ({
      id: `${assetId}-${cue.data.start}`,
      asset_id: assetId,
      start_time: cue.data.start / 1000, // convert ms to seconds
      end_time: cue.data.end / 1000,
      text: cue.data.text,
    }));
}

// Index into Algolia, Elasticsearch, or a simple Postgres full-text index
const documents = vttToSearchDocuments(vttContent, assetId);
await searchIndex.saveObjects(documents);
A React Component for Time-Synced Search Results
With transcript data indexed, you can build a search experience where clicking a result jumps the video to that exact moment. Here's a minimal React component using Mux Player:
import MuxPlayer from "@mux/mux-player-react";
import { useRef, useState } from "react";

// Format seconds as m:ss for display next to each result
function formatTime(seconds) {
  const minutes = Math.floor(seconds / 60);
  const secs = Math.floor(seconds % 60);
  return `${minutes}:${String(secs).padStart(2, "0")}`;
}

export function SearchableVideoPlayer({ playbackId, searchResults }) {
  const playerRef = useRef(null);
  const [query, setQuery] = useState("");

  // Narrow the supplied results to cues that contain the current query
  const visibleResults = searchResults.filter((result) =>
    result.text.toLowerCase().includes(query.trim().toLowerCase())
  );

  function seekToResult(startTime) {
    if (playerRef.current) {
      playerRef.current.currentTime = startTime;
      playerRef.current.play();
    }
  }

  return (
    <div className="video-search-container">
      <MuxPlayer
        ref={playerRef}
        playbackId={playbackId}
        streamType="on-demand"
      />
      <input
        type="text"
        value={query}
        onChange={(e) => setQuery(e.target.value)}
        placeholder="Search inside this video..."
      />
      <ul className="search-results">
        {visibleResults.map((result) => (
          <li key={result.id} onClick={() => seekToResult(result.start_time)}>
            <span className="timestamp">
              {formatTime(result.start_time)}
            </span>
            <span className="excerpt">{result.text}</span>
          </li>
        ))}
      </ul>
    </div>
  );
}
For a more complete implementation including AI-powered transcript interaction, Mux has published a detailed walkthrough on building an AI-powered interactive video transcript that's worth reading alongside this guide.
Handling Scale: Bulk Uploads and Failure Recovery
A single-video workflow is straightforward. A bulk import of 500 recorded webinars needs more thought.
Queue everything. Don't fire 500 simultaneous asset creation requests. Use a job queue (BullMQ, SQS, or similar) to control concurrency and handle retries. Mux's API has rate limits, and more importantly, your downstream systems — translation APIs, search indexers — will have their own limits.
Handle track failures gracefully. Caption generation can fail independently of video transcoding. A video.asset.track.errored webhook event signals a caption failure on an otherwise-healthy asset. Your pipeline should handle this case explicitly: log it, alert on it, and optionally re-queue the caption job.
Monitor quality at scale. Auto-generated captions are good but not perfect. For content where accuracy is critical, such as legal recordings or medical training videos, consider sampling a percentage of transcripts for human review. Mux Robots, Mux's hosted AI workflows, provide additional AI-powered processing options that can be layered into your pipeline.
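If you don't want to stand up a full job queue for a one-off import, the "queue everything" advice can be approximated in-process. The sketch below runs a worker over a list of items with a fixed concurrency limit and a bounded number of attempts per item. The function name and defaults are illustrative, and it retries immediately; a real pipeline would add backoff between attempts and persist progress so a crash doesn't lose the queue.

```javascript
// Run worker(item) for every item, at most `limit` at a time,
// retrying each failed item up to `maxAttempts` times.
async function runWithConcurrency(items, limit, worker, maxAttempts = 3) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each lane pulls the next unclaimed item until none are left
  async function runLane() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          const value = await worker(items[index]);
          results[index] = { status: "fulfilled", value };
          break;
        } catch (error) {
          if (attempt === maxAttempts) {
            results[index] = { status: "rejected", reason: error };
          }
        }
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Usage sketch, assuming a createCaptionedAsset(url) function of your own:
// const results = await runWithConcurrency(videoUrls, 5, createCaptionedAsset);
// const failed = results.filter((result) => result.status === "rejected");
```

The result array mirrors the shape of Promise.allSettled, so permanently failed items can be logged, alerted on, or re-queued without aborting the rest of the import.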
The complete architecture looks something like this: client uploads a file → your server creates a Mux asset with generated_subtitles → Mux transcodes and runs ASR in parallel → video.asset.track.ready webhook fires → your handler fetches the WebVTT, queues translation jobs → translated tracks are attached back to the asset → transcript documents are indexed for search. Every step is asynchronous, every failure has a recovery path, and no human touches the process.
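The middle of that flow, from "track ready" to "indexed and queued for translation", can be sketched as a single function with its external dependencies injected. Every name here (handleCaptionTrackReady, fetchVtt, parseCues, queueTranslation, indexDocuments) is hypothetical; the point is the shape of the orchestration, not a specific implementation.

```javascript
// Orchestrate the work that follows a caption track becoming ready.
// deps supplies the I/O: fetching the VTT, parsing cues (times in seconds),
// queueing a translation job, and writing documents to a search index.
async function handleCaptionTrackReady({ assetId, trackId }, deps, targetLanguages) {
  const { fetchVtt, parseCues, queueTranslation, indexDocuments } = deps;

  const vttContent = await fetchVtt(assetId, trackId);

  // Translation jobs are independent of indexing, so start them all at once
  const translationJobs = await Promise.all(
    targetLanguages.map((language) => queueTranslation({ assetId, trackId, language }))
  );

  // One search document per cue, keyed by asset and start time
  const documents = parseCues(vttContent).map((cue) => ({
    id: `${assetId}-${cue.start}`,
    asset_id: assetId,
    start_time: cue.start,
    end_time: cue.end,
    text: cue.text,
  }));
  await indexDocuments(documents);

  return { indexed: documents.length, translationJobs };
}
```

Injecting the dependencies keeps the function testable with stubs and lets you swap the translation provider or search backend without touching the orchestration logic.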
Conclusion
Auto-generated captions, transcripts, and multi-language subtitles are solved infrastructure problems. The ASR technology is mature, the APIs are reliable, and the integration patterns are well-understood. The only reason to skip them is not having a clear path from "video uploaded" to "captions attached" — and this guide has laid that path out.
Mux handles the hardest parts: audio extraction, ASR processing, WebVTT generation, track attachment, and HLS manifest integration. Your job is to wire the events together, decide which languages to support, and build the user experience on top. The webhook-driven, event-based architecture described here scales from a handful of videos to a high-volume platform without changing shape.
If you're starting from scratch, the Mux auto-generated captions documentation is the right place to begin. And if you want to explore what else becomes possible when your video infrastructure is AI-aware — chapters, summaries, semantic search — the AI video workflows guide shows how these pieces connect into something genuinely powerful.
Captions used to be a line item on a post-production budget. Now they're an API call.