Video Dubbing and Multi-Language Audio Tracks: A Developer's Guide to the API Workflow

AI dubbing tools have quietly become good enough to matter. ElevenLabs, Rask AI, HeyGen — these services can take a piece of English video and return convincing dubbed audio in Spanish, Japanese, or Portuguese in minutes. For developers building global video products, that's a genuinely new capability. But here's what most tutorials gloss over: generating the dubbed audio is only one piece of the puzzle.

The full localization pipeline requires translating your captions, generating dubbed audio with proper timing, attaching those audio files to your existing video assets, and letting viewers switch between languages during playback. Apart from the voice synthesis itself, these steps are infrastructure work. They involve APIs, webhooks, encoding requirements, and player configuration, and they're exactly the kind of thing that gets underestimated until you're in the middle of building it.

This guide covers the complete workflow from a single-language video to a fully localized, multi-track asset. Mux handles more of this natively than most developers realize: Mux Robots translates your captions with a single API call, the asset API attaches dubbed audio tracks without re-encoding video, and Mux Player surfaces language switching automatically. The external AI dubbing service handles the voice synthesis. Everything else lives in Mux.

Here's the architecture at a high level: Mux asset → Robots translate-captions → AI dubbing service → audio track API → Mux Player. Let's walk through each step.

Step 1: Translating Captions with Mux Robots

Before you can dub a video, you need accurately translated text with timing information preserved. You could pipe your caption file through a translation API yourself, handle edge cases in the VTT formatting, and write the translated file back to the asset. Or you can make one API call.

Mux Robots is an AI-powered workflow engine that runs directly on your Mux assets. The translate-captions workflow takes an existing caption track and translates it into a target language, optionally attaching the result to the asset automatically.

The required parameters are straightforward: asset_id, track_id (the ID of the existing English caption track), and to_language_code using BCP 47 language codes like es, ja, or pt.

javascript


import Mux from '@mux/mux-node';

const mux = new Mux();

const job = await mux.robotsPreview.jobs.translateCaptions.create({
  parameters: {
    asset_id: 'YOUR_ASSET_ID',
    track_id: 'YOUR_TRACK_ID',
    to_language_code: 'es',
    upload_to_mux: true,
  },
});

console.log('Job ID:', job.id);
console.log('Status:', job.status); // 'pending'

When upload_to_mux is set to true (the default), the translated VTT file is automatically attached to the asset as a new text track. The Spanish captions will immediately appear in Mux Player's caption selector — no additional API call required.

To translate into multiple languages, you kick off a job for each target language:

javascript


import Mux from '@mux/mux-node';

const mux = new Mux();
// assetId and captionTrackId are assumed to be defined earlier in your code.
const targetLanguages = ['es', 'ja', 'pt', 'fr', 'de'];

const jobs = await Promise.all(
  targetLanguages.map((lang) =>
    mux.robotsPreview.jobs.translateCaptions.create({
      parameters: {
        asset_id: assetId,
        track_id: captionTrackId,
        to_language_code: lang,
        upload_to_mux: true,
      },
    })
  )
);

Each job fires a robots.job.translate_captions.completed webhook when it finishes. The webhook payload includes a temporary_vtt_url — a short-lived URL pointing to the translated VTT file. This URL is what you'll feed into your AI dubbing service in the next step. The translation is already timed and structured; the dubbing service just needs to synthesize the voice.

Step 2: Generating Dubbed Audio with an External AI Service

This is the one step Mux doesn't handle natively, and for good reason — voice synthesis is its own specialized problem. Services like ElevenLabs, Rask AI, and HeyGen accept either a video file or a transcript with timing data and return audio files ready to be attached to video.

The Robots-generated translated VTT file is particularly useful here. Rather than sending raw text, you're giving the dubbing service pre-timed, translated segments — which means the synthesized speech can be aligned to the original video's pacing. This matters a lot for lip-sync quality.
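To make "pre-timed, translated segments" concrete, here is a minimal sketch of turning WebVTT text into an array of timed cues. The `parseVtt` and `toSeconds` helpers are hypothetical names introduced for this example, and the parser only handles simple cues (it skips header, NOTE, and STYLE blocks and ignores cue settings), so treat it as an illustration rather than a complete WebVTT implementation.

```javascript
// Hypothetical helper: convert "HH:MM:SS.mmm" or "MM:SS.mmm" to seconds.
function toSeconds(timestamp) {
  const parts = timestamp.split(':').map(Number);
  // Fold hours, minutes, and seconds from left to right.
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Hypothetical helper: parse simple WebVTT text into { start, end, text } cues.
function parseVtt(vttText) {
  const blocks = vttText.replace(/\r\n/g, '\n').split(/\n{2,}/);
  const cues = [];

  for (const block of blocks) {
    const lines = block.split('\n').filter((line) => line.trim() !== '');
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) continue; // header, NOTE, or STYLE block

    const [startRaw, endRaw] = lines[timingIndex].split('-->');
    cues.push({
      start: toSeconds(startRaw.trim()),
      // Drop any cue settings that follow the end timestamp.
      end: toSeconds(endRaw.trim().split(/\s+/)[0]),
      text: lines.slice(timingIndex + 1).join('\n'),
    });
  }
  return cues;
}

const sample = `WEBVTT

1
00:00:01.000 --> 00:00:03.500
Hola y bienvenidos.

00:01:05.250 --> 00:01:07.000 align:start
Empecemos.`;

console.log(parseVtt(sample));
```

Each cue carries the start and end time the synthesized speech for that segment should fit into, which is the information a dubbing service needs to keep pacing close to the original.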

A typical integration looks like this:

javascript


// Triggered by the robots.job.translate_captions.completed webhook.
// Note: `elevenlabs`, `db`, and `getMuxPlaybackUrl` are placeholders for your own
// dubbing client, datastore, and URL helper. The request fields are illustrative;
// check your dubbing provider's API reference for the exact parameter names.
async function handleTranslationComplete(webhookPayload) {
  const { asset_id, to_language_code, temporary_vtt_url } = webhookPayload.data;

  // Fetch the translated VTT content before the temporary URL expires
  const vttContent = await fetch(temporary_vtt_url).then((r) => r.text());

  // Send to your preferred AI dubbing service
  const dubbingJob = await elevenlabs.dubbing.create({
    source_url: getMuxPlaybackUrl(asset_id), // original video
    transcript: vttContent,
    target_lang: to_language_code,
    mode: 'automatic',
  });

  // Store job reference for polling or webhook handling
  await db.dubbingJobs.create({
    mux_asset_id: asset_id,
    language_code: to_language_code,
    external_job_id: dubbingJob.dubbing_id,
  });
}

Quality considerations worth knowing upfront: voice cloning produces significantly better results than generic voices, but requires reference audio and additional setup time. For most production workflows, start with the dubbing service's highest-quality generic voices per language, then add voice cloning for languages where you see the most viewer engagement. Lip-sync quality varies considerably across services — test with a short clip before committing to a full content library.

Using Mux's Audio Track API to Attach Dubbed Audio

Once your AI dubbing service returns an audio file, you need to attach it to the Mux asset. This is where multi-track audio support becomes essential. Mux lets you add alternate audio tracks to an existing asset without re-encoding the video — the original video stream stays untouched.

The API call to add an audio track takes the asset ID, a URL pointing to your audio file (you'll need to host the dubbed audio somewhere accessible), a language_code in BCP 47 format, and a human-readable name for the track selector UI.

javascript


import Mux from '@mux/mux-node';

const mux = new Mux();

const track = await mux.video.assets.createTrack('YOUR_ASSET_ID', {
  url: 'https://storage.example.com/dubbed-audio/asset-123-es.mp3',
  type: 'audio',
  language_code: 'es',
  name: 'Spanish',
});

console.log('Track ID:', track.id);

For a full multi-language batch workflow, loop over your completed dubbing jobs:

javascript


import Mux from '@mux/mux-node';

const mux = new Mux();

async function attachDubbedTracks(assetId, completedDubbingJobs) {
  const results = [];

  for (const job of completedDubbingJobs) {
    const track = await mux.video.assets.createTrack(assetId, {
      url: job.audioFileUrl,
      type: 'audio',
      language_code: job.languageCode,
      name: getLanguageName(job.languageCode),
    });

    results.push({
      language: job.languageCode,
      track_id: track.id,
      status: track.status,
    });
  }

  return results;
}

function getLanguageName(code) {
  const names = {
    es: 'Spanish', ja: 'Japanese', pt: 'Portuguese',
    fr: 'French', de: 'German',
  };
  return names[code] ?? code.toUpperCase();
}

A few encoding considerations to keep in mind. Verify your dubbing service's output format before attaching it: mismatched encoding settings between tracks are the most common source of sync problems. Always keep the original language track as the default, since removing or overwriting it creates a confusing fallback experience for viewers on unsupported devices.

For more detail on how multiplexing separate audio streams works at the streaming protocol level, it's worth understanding how HLS handles alternate audio renditions — this explains why adding tracks to Mux doesn't require transcoding the video.
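As a rough illustration of that protocol-level idea, the sketch below builds the `EXT-X-MEDIA` lines of an HLS multivariant playlist from a list of audio tracks. The playlist URIs, the `audio` group ID, the bandwidth value, and the `buildAudioRenditionTags` helper are all made up for the example; this is not the manifest Mux actually generates.

```javascript
// Hypothetical helper: emit one EXT-X-MEDIA tag per audio track.
function buildAudioRenditionTags(tracks, groupId = 'audio') {
  return tracks.map((track) =>
    [
      '#EXT-X-MEDIA:TYPE=AUDIO',
      `GROUP-ID="${groupId}"`,
      `NAME="${track.name}"`,
      `LANGUAGE="${track.languageCode}"`,
      `DEFAULT=${track.isDefault ? 'YES' : 'NO'}`,
      'AUTOSELECT=YES',
      `URI="${track.uri}"`,
    ].join(',')
  );
}

const audioTags = buildAudioRenditionTags([
  { name: 'English', languageCode: 'en', isDefault: true, uri: 'audio-en.m3u8' },
  { name: 'Spanish', languageCode: 'es', isDefault: false, uri: 'audio-es.m3u8' },
]);

// The video variant references the audio group by name instead of embedding audio.
const playlist = [
  '#EXTM3U',
  ...audioTags,
  '#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720,AUDIO="audio"',
  'video-720p.m3u8',
].join('\n');

console.log(playlist);
```

Because each audio rendition is its own playlist and the video variants only point at the group, adding a language means adding one more `EXT-X-MEDIA` entry while the video renditions stay exactly as they were.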

Orchestrating the Complete Pipeline with Webhooks

The steps above don't run as a single synchronous sequence. They're asynchronous jobs that complete at different rates, so a well-architected pipeline uses webhooks to drive each stage forward automatically.

Here's the full event-driven flow:

video.asset.ready → trigger Robots translate-captions jobs → robots.job.translate_captions.completed → send translated VTT to AI dubbing service → dubbing service webhook → attach dubbed audio to Mux asset → done

A serverless function can orchestrate this entire chain:

javascript


import Mux from '@mux/mux-node';

const mux = new Mux();

// Webhook handler — runs on Mux webhook events
export async function POST(request) {
  const event = await request.json();

  switch (event.type) {
    case 'video.asset.ready': {
      // The asset object in the webhook payload already includes its tracks.
      const { id: assetId, tracks = [] } = event.data;
      const captionTrack = tracks.find((t) => t.type === 'text');

      if (!captionTrack) break;

      // Kick off translation jobs for all target languages
      const languages = ['es', 'ja', 'pt', 'fr'];
      await Promise.all(
        languages.map((lang) =>
          mux.robotsPreview.jobs.translateCaptions.create({
            parameters: {
              asset_id: assetId,
              track_id: captionTrack.id,
              to_language_code: lang,
              upload_to_mux: true,
            },
          })
        )
      );
      break;
    }

    case 'robots.job.translate_captions.completed': {
      const { asset_id, to_language_code, temporary_vtt_url } = event.data;

      // Fetch translated VTT and send to dubbing service
      const vttContent = await fetch(temporary_vtt_url).then((r) => r.text());
      await submitToDubbingService(asset_id, to_language_code, vttContent);
      break;
    }

    // Your dubbing service calls this endpoint when audio is ready
    case 'dubbing.completed': {
      const { mux_asset_id, language_code, audio_url } = event.data;

      await mux.video.assets.createTrack(mux_asset_id, {
        url: audio_url,
        type: 'audio',
        language_code: language_code,
        name: getLanguageName(language_code),
      });
      break;
    }
  }

  return new Response('ok', { status: 200 });
}

For error handling, implement exponential backoff on any step that calls an external API. The most fragile point is the temporary VTT URL — it expires, so download and store the VTT content immediately when you receive the webhook, rather than passing the URL directly to the dubbing service. If a dubbing job fails, you can re-trigger it without re-running the translation step.
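The backoff advice can be sketched as a small wrapper around any async call. `withBackoff` is a hypothetical helper, and the retry count and base delay are arbitrary starting values chosen for the example, not recommendations from Mux or any dubbing provider.

```javascript
// Hypothetical helper: retry an async operation with exponential backoff.
async function withBackoff(operation, { retries = 4, baseDelayMs = 500, sleep } = {}) {
  // `sleep` can be injected for testing; by default it waits with setTimeout.
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      if (attempt >= retries) throw error; // out of retries: surface the failure
      // Delay doubles each attempt: 500 ms, 1 s, 2 s, 4 s with the defaults.
      await wait(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage sketch: the operation fails twice, then succeeds on the third call.
let calls = 0;
withBackoff(
  async () => {
    calls += 1;
    if (calls < 3) throw new Error('temporary failure');
    return 'ok';
  },
  { baseDelayMs: 10 }
).then((result) => console.log(result, 'after', calls, 'calls'));
```

In the pipeline above, the calls worth wrapping are the VTT download, the dubbing submission, and the track attachment. Adding random jitter to the delay is a common refinement when many jobs retry at once.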

Mux Player: Multi-Track Playback Without Extra Configuration

Once your audio tracks and translated caption tracks are attached to the asset, Mux Player surfaces them automatically. Viewers get a language selector in the audio menu and a separate captions menu, both populated from the tracks on the asset. No custom UI code required for the baseline experience.

For a React application, the setup is minimal:

jsx


import MuxPlayer from '@mux/mux-player-react';

export function LocalizedVideoPlayer({ playbackId, defaultLanguage = 'en' }) {
  return (
    <MuxPlayer
      playbackId={playbackId}
      metadata={{
        video_title: 'My Localized Video',
      }}
      defaultAudioLanguage={defaultLanguage}
      defaultCaptionsLanguage={defaultLanguage}
    />
  );
}

The defaultAudioLanguage and defaultCaptionsLanguage props let you pre-select a language based on the viewer's locale — a small touch that meaningfully improves the experience for non-English audiences.
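One way to choose the value passed as `defaultLanguage` is to match the viewer's preferred locales against the languages you have actually attached to the asset. The sketch below assumes you already know the list of available track languages; `pickLanguage` is a hypothetical helper, and the matching rule (exact tag first, then base language) is a simplification of full BCP 47 lookup.

```javascript
// Hypothetical helper: choose the best track language for a viewer.
function pickLanguage(preferredLocales, availableLanguages, fallback = 'en') {
  const available = availableLanguages.map((code) => code.toLowerCase());

  for (const locale of preferredLocales) {
    const wanted = locale.toLowerCase();

    // 1. Exact match, e.g. "es-MX" against an "es-MX" track.
    const exact = available.indexOf(wanted);
    if (exact !== -1) return availableLanguages[exact];

    // 2. Base-language match, e.g. "es-MX" against an "es" track.
    const base = wanted.split('-')[0];
    const partial = available.findIndex((code) => code.split('-')[0] === base);
    if (partial !== -1) return availableLanguages[partial];
  }

  // No preferred locale is available: keep the original language.
  return fallback;
}

// In a browser, preferredLocales would typically be navigator.languages.
console.log(pickLanguage(['pt-BR', 'en-US'], ['en', 'es', 'pt'])); // prints "pt"
console.log(pickLanguage(['ko-KR'], ['en', 'es', 'pt'])); // prints "en"
```

Falling back to the original language keeps behavior predictable for viewers whose locale you have not localized yet.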

For viewers on devices where a dubbed audio track isn't available (older HLS implementations with limited alternate audio support), the player falls back to the original audio track gracefully. Captions continue to work regardless, which is why translating captions first — even before committing to full audio dubbing — is the right sequencing for most teams.

Practical Considerations Before You Ship

Audio sync is the most common production issue. Dubbed audio is typically slightly longer or shorter than the original due to differences in speech rate across languages. Most AI dubbing services offer time-stretching options; use them. Test sync at the beginning, middle, and end of the video — sync drift often compounds over time.
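A quick length comparison can catch the worst mismatches before you attach a track. The sketch below assumes you already have both durations in seconds (for example from the asset metadata and from your dubbing service's response); `checkDubDuration` is a hypothetical helper and the half-second tolerance is an arbitrary example threshold.

```javascript
// Hypothetical helper: compare dubbed audio length to the original.
function checkDubDuration(originalSeconds, dubbedSeconds, toleranceSeconds = 0.5) {
  const driftSeconds = dubbedSeconds - originalSeconds;
  return {
    driftSeconds,
    // Speed factor that would make the dub match the original length.
    // Above 1 means the dub must be sped up; below 1 means slowed down.
    stretchRatio: dubbedSeconds / originalSeconds,
    withinTolerance: Math.abs(driftSeconds) <= toleranceSeconds,
  };
}

// A 120-second video with a 126-second dub is 6 seconds long.
console.log(checkDubDuration(120, 126));
```

A total-length check only catches overall mismatch. It does not replace spot-checking sync at several points in the video, since two tracks of equal length can still drift in the middle.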

Cost optimization matters at scale. Caption translation via Mux Robots is inexpensive and fast, so translate captions for all your content. Save full audio dubbing for high-performing assets — you can always add dubbed audio tracks later without touching the video or captions. This staged approach also lets you gauge demand by language before investing in voice synthesis.

Naming conventions affect the player UI directly. Use proper BCP 47 language codes (es-MX for Mexican Spanish vs es-ES for Spain Spanish) when your audience is regional enough to care. For most products, the base codes work fine.
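If you would rather not maintain a hand-written table of language names, the standard `Intl` APIs can derive a label from the code itself. This is a minimal sketch: `languageLabel` is a hypothetical helper, and the exact wording returned for regional variants depends on the JavaScript runtime's locale data, so check the output in the environments you target.

```javascript
// Hypothetical helper: derive a human-readable track name from a BCP 47 code.
function languageLabel(code, displayLocale = 'en') {
  try {
    // Canonicalize casing first, e.g. "es-mx" becomes "es-MX".
    const [canonical] = Intl.getCanonicalLocales(code);
    const names = new Intl.DisplayNames([displayLocale], { type: 'language' });
    return names.of(canonical) ?? canonical;
  } catch {
    // Structurally invalid tags throw a RangeError; fall back to the raw code.
    return code;
  }
}

console.log(languageLabel('es'));
console.log(languageLabel('es-mx'));
console.log(languageLabel('not a code'));
```

Passing a different `displayLocale` returns the name in that language instead, which can be useful if the track selector itself should be localized.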

Testing across devices is non-negotiable. Track switching behavior differs between Safari on iOS, Chrome on Android, and desktop browsers. HLS alternate audio support on mobile Safari has historically had quirks — test your most important language pairs on real devices before launch.

For live streams, the real-time dubbing problem is fundamentally different (latency constraints make synchronous AI dubbing impractical). The practical approach is to treat live content as VOD post-stream: once the recording is ready as a Mux asset, the same pipeline above applies.

Putting It All Together

The developer tooling around AI video localization has finally caught up to the quality of the AI models themselves. What used to require significant custom infrastructure — caption translation, audio track management, multi-language playback — now fits into a webhook-driven pipeline that you can build in an afternoon.

Mux's role in this stack is concrete: Robots handles caption translation with a single API call and automatic asset attachment, the asset API attaches dubbed audio tracks without touching the video, and Mux Player handles the language-switching UI automatically. The external AI dubbing service is a well-defined integration point: it receives timed translated text and returns audio. Everything else is Mux.

If you're starting from scratch, the right sequence is: get auto-generated captions working first (the auto-generated captions guide covers this), then add Robots caption translation, then evaluate which content warrants full audio dubbing. You'll ship value at each stage, and the architecture composes cleanly as you add languages.

The full documentation for each piece of this pipeline is available in the Mux Robots translate-captions docs, the alternate audio tracks guide, and the Mux Player documentation. Start with caption translation — it's one API call, and it's a meaningful improvement for your international viewers today.
