
Auto-Generate Video Metadata with AI: Summaries, Chapters, Tags, and Key Moments Using Mux Robots

Every video you publish without a proper title, description, tags, and chapters is a video that's harder to find, harder to navigate, and harder to recommend. For a single video, filling that metadata in manually is a five-minute task. For a catalog of thousands — or a platform where users upload content every hour — it becomes an impossible bottleneck.

The result is always the same: inconsistent tags that break your filters, missing chapters on long recordings, generic descriptions that don't help search, and a backlog of untagged content that keeps growing faster than your team can work through it.

Mux Robots is a set of hosted AI workflows that solve this at the API level. Instead of wiring together your own LLM pipeline, you call an endpoint, point it at an asset, and get structured metadata back — titles, descriptions, keyword tags, timestamped chapters, highlight moments, and answers to arbitrary questions about your video content. This post walks through each of the four metadata workflows, shows you real code for each one, and explains how to chain them into a fully automated pipeline that runs on every upload.

Why Video Metadata Is Worth Getting Right

Metadata is the invisible layer that makes your video catalog actually work. Your search index runs on it. Your recommendation engine trains on it. Your browse filters depend on it. When someone asks an AI assistant to find relevant content from your library, it's reading metadata.

The gap between "good enough" metadata and genuinely useful metadata is larger than most teams expect. Good metadata means:

  • Accurate, descriptive titles that reflect what's actually in the video, not just the filename or the session topic from a calendar invite
  • Keyword-rich descriptions that give search engines and embedding models enough signal to surface the video in context
  • Normalized tags that map to a controlled taxonomy — so "machine learning," "ML," and "AI/ML" don't create three separate buckets in your filter UI
  • Chapter markers that let viewers jump to the section they need, which directly improves watch time and reduces drop-off on long-form content
  • Identified highlights that can feed preview clips, social media cuts, or "best of" features

None of this requires a human to do it manually. Each of these is something an AI model can generate from the video's transcript — and that's exactly what Mux Robots is built for.

The Four Mux Robots Metadata Workflows

Mux Robots provides four AI workflows that cover the full metadata spectrum. Each one is a POST request that takes an asset ID and returns structured output. Before using any of them, make sure your asset has captions: the workflows analyze the transcript, so captions (auto-generated ones work) are a prerequisite for good results.

Summarize: Titles, Descriptions, and Tags

The summarize workflow generates a title, a 2–4 sentence description, and an array of keyword tags from your video's content.

bash
POST /robots/v0/jobs/summarize
json
{
  "asset_id": "your-asset-id",
  "tone": "professional",
  "title_length": "medium",
  "description_length": "long",
  "tag_count": 10
}

The tone parameter accepts neutral, playful, or professional — useful for matching the output to your platform's voice. tag_count defaults to 10, but you can adjust it based on your taxonomy needs.

For platforms serving multiple regions, output_language_code lets you generate metadata directly in the target language rather than translating after the fact:

json
{ "asset_id": "your-asset-id", "tone": "neutral", "output_language_code": "fr" }

The most powerful configuration option is prompt_overrides, which lets you inject custom instructions into the generation process. This is how you enforce a controlled vocabulary — instead of getting whatever tags the model feels like generating, you can constrain the output to your actual taxonomy:

json
{
  "asset_id": "your-asset-id",
  "tone": "professional",
  "tag_count": 5,
  "prompt_overrides": {
    "keywords": "Only use tags from this approved list: javascript, react, node.js, typescript, web-performance, accessibility, css, testing, devops, security. Do not invent new tags.",
    "quality_guidelines": "Descriptions should be concise and written for a developer audience. Avoid marketing language."
  }
}

This is the answer to the tag normalization problem. Rather than post-processing AI-generated tags with a fuzzy-match algorithm, you tell the model exactly what vocabulary it's allowed to use at generation time.
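As a rough sketch of that approach, the helpers below build the keywords override from a taxonomy array and then check the returned tags against the same list, in case the model drifts from the instruction. The names buildKeywordOverride and splitTags are hypothetical (they are not part of the Mux SDK), and the case-insensitive comparison is an assumption about how you want tags matched.

```javascript
// Hypothetical helpers for illustration; not part of the Mux SDK.
const TAXONOMY = ['javascript', 'react', 'node.js', 'typescript', 'testing'];

// Build the `keywords` prompt override from a controlled vocabulary.
function buildKeywordOverride(taxonomy) {
  return `Only use tags from this approved list: ${taxonomy.join(', ')}. Do not invent new tags.`;
}

// Split returned tags into approved and rejected, comparing case-insensitively.
function splitTags(tags, taxonomy) {
  const allowed = new Set(taxonomy.map((t) => t.toLowerCase()));
  const approved = [];
  const rejected = [];
  for (const tag of tags) {
    const normalized = tag.trim().toLowerCase();
    (allowed.has(normalized) ? approved : rejected).push(normalized);
  }
  return { approved, rejected };
}

const result = splitTags(['React', 'typescript', 'server-components'], TAXONOMY);
console.log(result.approved); // [ 'react', 'typescript' ]
console.log(result.rejected); // [ 'server-components' ]
```

Anything that lands in rejected can be logged or dropped before the tags reach your filter UI, so the prompt constraint and the validation step back each other up.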

Here's a minimal Node.js example that summarizes an asset and logs the metadata:

javascript
import Mux from '@mux/mux-node';

const mux = new Mux();

const job = await mux.robotsPreview.jobs.summarize.create({
  parameters: {
    asset_id: assetId,
    tone: 'professional',
    tag_count: 8,
  },
});

// Poll or use webhooks; once the job has completed:
const completed = await mux.robotsPreview.jobs.summarize.retrieve(job.id);

console.log(completed.outputs.title);       // "Introduction to React Server Components"
console.log(completed.outputs.description); // "In this session, we cover..."
console.log(completed.outputs.tags);        // ["react", "server-components", "next.js", ...]

You can see a full working example of this pattern in the summarizing and tagging videos with AI example on Mux Docs.

Generate Chapters: Timestamped Navigation

The generate-chapters workflow analyzes your video's captions to identify topic transitions and returns an array of chapters with start times and titles. The first chapter always starts at timestamp 0.

bash
POST /robots/v0/jobs/generate-chapters
json
{ "asset_id": "your-asset-id" }

A typical response looks like this:

json
{
  "data": {
    "chapters": [
      { "start_time": 0, "title": "Introduction and agenda" },
      { "start_time": 142, "title": "Setting up the development environment" },
      { "start_time": 387, "title": "Building the API layer" },
      { "start_time": 721, "title": "Testing and deployment" },
      { "start_time": 934, "title": "Q&A" }
    ]
  }
}

You can control chapter density and title style using prompt_overrides. For long-form webinars, you might want fewer, broader chapters. For course content, you might want more granular breakpoints:

json
{
  "asset_id": "your-asset-id",
  "prompt_overrides": {
    "chapter_guidelines": "Create chapters every 3-5 minutes minimum. Avoid creating chapters for transitions shorter than 2 minutes. Focus on major topic shifts, not minor asides.",
    "title_guidelines": "Chapter titles should be action-oriented and specific. Use sentence case. Maximum 6 words."
  }
}

Once you have chapters, you can pass them directly to Mux Player's advanced chapter support to render a navigable timeline. The AI-generated chapters guide walks through exactly how to connect the Robots output to the player. For a full code example, see generating video chapters using AI in the Mux docs.
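To make the hand-off to a player concrete, here is a minimal sketch that converts the chapters array into cue objects with explicit end times. It assumes start_time is in seconds and that you know the asset duration. The chaptersToCues name and the startTime/endTime/value cue shape are assumptions for illustration, so check the Mux Player documentation for the exact shape it expects.

```javascript
// Hypothetical helper: map Robots chapters to cue objects with explicit end times.
// Assumes `start_time` is in seconds and `durationSeconds` is the asset duration.
function chaptersToCues(chapters, durationSeconds) {
  // Sort a copy so the original response is left untouched.
  const sorted = [...chapters].sort((a, b) => a.start_time - b.start_time);
  return sorted.map((chapter, index) => {
    const next = sorted[index + 1];
    return {
      startTime: chapter.start_time,
      // Each chapter ends where the next begins; the last one runs to the end.
      endTime: next ? next.start_time : durationSeconds,
      value: chapter.title,
    };
  });
}

const cues = chaptersToCues(
  [
    { start_time: 0, title: 'Introduction and agenda' },
    { start_time: 142, title: 'Setting up the development environment' },
  ],
  300
);
console.log(cues);
```

Deriving each end time from the next chapter's start keeps the cues contiguous, which avoids gaps on the timeline when the model returns only start times.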

Find Key Moments: Highlights and Clips

The find-key-moments workflow identifies the most compelling segments in a video — useful for generating preview clips, social media cuts, highlight reels, or surfacing the best 60 seconds of a one-hour recording.

bash
POST /robots/v0/jobs/find-key-moments
json
{
  "asset_id": "your-asset-id",
  "max_moments": 5,
  "target_duration_ms": { "min": 15000, "max": 60000 }
}

The output per moment is richer than a simple timestamp range. Each moment includes an overall_score between 0 and 1, a title, separate audible_narrative and visual_narrative descriptions, and an array of notable_concepts:

json
{
  "data": {
    "moments": [
      {
        "start_ms": 312000,
        "end_ms": 354000,
        "overall_score": 0.91,
        "title": "The core argument for server-side rendering",
        "audible_narrative": "Speaker delivers a clear, quotable explanation of why SSR matters for initial page load.",
        "visual_narrative": "Diagram comparing client vs. server rendering timelines is on screen.",
        "notable_concepts": ["server-side rendering", "web performance", "time to first byte"]
      }
    ]
  }
}

The scoring model weighs audio clarity, hook strength, emotional intensity, novelty, and soundbite quality. Higher scores indicate moments that are self-contained and likely to work well as standalone clips — a useful signal when you're automating clip generation.

You can combine key moments output with Mux's instant clips feature to build a fully automated highlight pipeline: find the moments, create clips from the timestamp ranges, and serve them as preview content without any manual editing.
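The middle step of that pipeline could look roughly like the sketch below, which filters moments by score and turns each timestamp range into a clip URL. The momentsToClipUrls helper and the 0.8 cutoff are hypothetical. The asset_start_time and asset_end_time query parameters (in seconds) reflect my understanding of Mux instant clipping, so treat them as an assumption and confirm against the instant clips documentation.

```javascript
// Hypothetical helper: turn key moments into clip playback URLs.
// Assumes instant clipping accepts `asset_start_time` / `asset_end_time` in seconds.
function momentsToClipUrls(moments, playbackId, minScore = 0.8) {
  return moments
    // Keep only well-scored moments with a valid range (filter returns a new array).
    .filter((m) => m.overall_score >= minScore && m.end_ms > m.start_ms)
    // Best moments first.
    .sort((a, b) => b.overall_score - a.overall_score)
    .map((m) => {
      const params = new URLSearchParams({
        asset_start_time: String(m.start_ms / 1000),
        asset_end_time: String(m.end_ms / 1000),
      });
      return {
        title: m.title,
        url: `https://stream.mux.com/${playbackId}.m3u8?${params}`,
      };
    });
}

const clips = momentsToClipUrls(
  [{ start_ms: 312000, end_ms: 354000, overall_score: 0.91, title: 'The core argument for server-side rendering' }],
  'PLAYBACK_ID'
);
console.log(clips[0].url);
// https://stream.mux.com/PLAYBACK_ID.m3u8?asset_start_time=312&asset_end_time=354
```

The score cutoff is a tuning knob rather than a recommendation; a stricter value yields fewer but more self-contained clips.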

Ask Questions: Entity Extraction and Classification

ask-questions is the most flexible workflow. You define a set of questions about your video's content, specify the allowed answer options for each one, and get back structured answers with confidence scores and reasoning.

bash
POST /robots/v0/jobs/ask-questions
json
{
  "asset_id": "your-asset-id",
  "questions": [
    {
      "question": "What is the primary topic of this video?",
      "answer_options": ["JavaScript", "Python", "DevOps", "Security", "Design", "Product Management", "Other"]
    },
    {
      "question": "What is the target audience level?",
      "answer_options": ["beginner", "intermediate", "advanced"]
    },
    {
      "question": "Does this video mention any specific cloud providers?",
      "answer_options": ["AWS", "Google Cloud", "Azure", "Multiple providers", "None"]
    }
  ]
}

Each answer includes the selected option, a confidence score from 0 to 1, a reasoning field explaining why the model chose that answer, and a skipped flag for cases where the question doesn't apply:

json
{
  "data": {
    "answers": [
      {
        "answer": "JavaScript",
        "confidence": 0.94,
        "reasoning": "The video is entirely focused on React and Node.js development, both JavaScript-based.",
        "skipped": false
      },
      {
        "answer": "intermediate",
        "confidence": 0.82,
        "reasoning": "The speaker assumes familiarity with React basics but explains more advanced patterns.",
        "skipped": false
      }
    ]
  }
}

The confidence thresholds are worth internalizing: scores above 0.9 indicate clear evidence in the transcript, 0.7–0.9 is strong but not definitive, 0.5–0.7 is moderate, and below 0.5 is weak enough that you probably shouldn't use the answer automatically without a review step.
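One way to put those bands to work is a small triage step that labels each answer and routes low-confidence ones to review. This is a sketch only: classifyConfidence and triageAnswers are hypothetical names, the band labels are mine, and treating exactly 0.9 as "strong" is an assumption about where the boundary falls.

```javascript
// Hypothetical helpers: label answers by confidence band and split out ones needing review.
function classifyConfidence(confidence) {
  if (confidence > 0.9) return 'clear';     // clear evidence in the transcript
  if (confidence >= 0.7) return 'strong';   // strong but not definitive
  if (confidence >= 0.5) return 'moderate';
  return 'weak';                            // should not be used without review
}

function triageAnswers(answers, reviewBelow = 0.5) {
  const accepted = [];
  const needsReview = [];
  for (const a of answers) {
    if (a.skipped) continue; // the question did not apply to this video
    const entry = { answer: a.answer, band: classifyConfidence(a.confidence) };
    (a.confidence < reviewBelow ? needsReview : accepted).push(entry);
  }
  return { accepted, needsReview };
}

const triaged = triageAnswers([
  { answer: 'JavaScript', confidence: 0.94, skipped: false },
  { answer: 'intermediate', confidence: 0.82, skipped: false },
  { answer: 'AWS', confidence: 0.41, skipped: false },
]);
console.log(triaged.accepted);    // the 'clear' and 'strong' answers
console.log(triaged.needsReview); // the 'weak' answer
```

Raising reviewBelow to 0.7 would also send moderate answers to a human, which may suit classification fields that drive moderation or routing.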

For entity extraction — pulling out mentioned companies, products, or people — you can frame questions to match your CMS's content model:

javascript
import Mux from '@mux/mux-node';

const mux = new Mux();

// One yes/no question per company in your CMS.
const questions = companies.map((company) => ({
  question: `Does this video mention ${company}?`,
  answer_options: ['yes', 'no'],
}));

const job = await mux.robotsPreview.jobs.askQuestions.create({
  parameters: { asset_id: assetId, questions },
});

// Once completed (via webhook or polling):
const completed = await mux.robotsPreview.jobs.askQuestions.retrieve(job.id);

// Answers are assumed to come back in question order, so look each one up
// by the company's index. Filtering the answers first would shift the indexes.
const mentioned = companies.filter((company, i) => {
  const a = completed.outputs.answers[i];
  return a && a.answer === 'yes' && a.confidence > 0.8;
});

Building an Automated Metadata Pipeline

Running each workflow in isolation is useful. Chaining them together into a pipeline that triggers on every upload is where the real leverage comes from.

The architecture is straightforward: listen for the video.asset.ready webhook, fire all four Robots jobs in parallel, listen for each job's completion webhook, and push the results to your CMS or search index.

Here's a serverless function that handles the video.asset.ready event and kicks off all metadata workflows simultaneously:

javascript
import Mux from '@mux/mux-node';

const mux = new Mux();

export async function POST(request) {
  const event = await request.json();

  // Ignore every event except a newly ready asset.
  if (event.type !== 'video.asset.ready') {
    return new Response('ok');
  }

  const assetId = event.data.id;

  // Start all four jobs in parallel.
  await Promise.all([
    mux.robotsPreview.jobs.summarize.create({
      parameters: { asset_id: assetId, tone: 'professional', tag_count: 8 },
    }),
    mux.robotsPreview.jobs.generateChapters.create({
      parameters: { asset_id: assetId },
    }),
    mux.robotsPreview.jobs.findKeyMoments.create({
      parameters: { asset_id: assetId, max_moments: 5 },
    }),
    mux.robotsPreview.jobs.askQuestions.create({
      parameters: {
        asset_id: assetId,
        questions: [
          {
            question: 'What content category best describes this video?',
            answer_options: ['tutorial', 'webinar', 'product-demo', 'interview', 'lecture', 'other'],
          },
        ],
      },
    }),
  ]);

  return new Response('jobs started');
}

Each job fires a webhook when it completes. Your completion handler maps the output to your content model and writes it to your CMS:

javascript
// `cms` is your own client and `assetId` identifies the asset the job ran on;
// both are assumed to be in scope here.
if (event.type === 'robots.summarize.completed') {
  const { title, description, tags } = event.data;
  await cms.updateAsset(assetId, { title, description, tags });
}

if (event.type === 'robots.generate_chapters.completed') {
  const { chapters } = event.data;
  await cms.updateAsset(assetId, { chapters });
}

if (event.type === 'robots.find_key_moments.completed') {
  // Sort a copy so the event payload is not mutated, then take the best moment.
  const [topMoment] = [...event.data.moments].sort((a, b) => b.overall_score - a.overall_score);
  // Guard against a job that found no moments.
  if (topMoment) {
    await cms.updateAsset(assetId, {
      highlight_start_ms: topMoment.start_ms,
      highlight_end_ms: topMoment.end_ms,
    });
  }
}

The result is a pipeline where every video that lands in your Mux account automatically gets a title, description, tags, chapters, and a flagged highlight — with zero manual intervention.

Use Cases Across Different Platforms

Course platforms benefit most from chapters and key moments. Chapters give students a navigable table of contents inside the player. Key moments can surface the single most important concept from each lesson for preview cards. ask-questions can classify videos by topic and difficulty level for your curriculum graph.

UGC platforms need summarization at upload time to auto-populate metadata before a video goes live. Combined with ask-questions for content classification, you get a first-pass review layer that can flag videos for human moderation or route them to the right content collections automatically.

Media libraries and archives, including internal enterprise knowledge bases, are where the full pipeline pays off most. A catalog of untagged recordings is a search dead end. Running the complete pipeline across an existing backlog transforms it into a searchable, navigable, recommendable asset. The Mux Robots VHS archive post is a fun illustration of exactly this problem at a personal scale.

Enterprise video platforms host meeting recordings, training sessions, and company all-hands: content that is long, dense, and rarely watched in full. Chapters make these recordings skimmable. Summarization makes them searchable. Key moments surface the parts worth rewatching.

Consistent Metadata at Any Scale

The manual metadata bottleneck is a scaling problem with a technical solution. Every workflow in Mux Robots is designed to generate the kind of structured, consistent output that makes catalogs actually work — not one-off summaries, but normalized, configurable metadata that fits your content model and your taxonomy.

The four workflows complement each other directly: summarize handles discovery metadata, generate-chapters handles in-player navigation, find-key-moments handles clip and preview generation, and ask-questions handles classification and entity extraction. Run them together on every upload and you get a catalog where every video is searchable, navigable, and correctly categorized from the moment it goes live.

The full documentation for each workflow lives in the Mux Robots guide — including the complete request schemas, webhook event shapes, and prompt_overrides reference for each job type. If you want to see the end-to-end pattern before writing your own integration, the AI workflows examples index has working code for the most common use cases.
