These new-fangled AI agent systems can now browse the web, write code, file bug reports, and execute multi-step workflows with surprisingly few hiccups. When they finish a task, though, what do we actually get? A text summary, maybe a screenshot, perhaps a list of files changed.
That’s kinda like hiring a contractor to renovate your kitchen and receiving only a written description of what they did. “Alrighty, we removed the old cabinets, installed new ones, and the sink is now functional.” Cool, but like, can I see the kitchen? I demand receipts. Show me how it works.
agent-video solves this problem by borrowing heavily from how films get made.
Please wait, loading…
When an AI agent navigates a website, it operates differently than a human. A human user experiences time linearly, clicking a button, waiting for a page to load, reading the content, deciding what to do next, and clicking again. The thinking happens between the actions, creating a smooth and continuous flow.
An AI agent experiences time in bursts. It takes a snapshot of the page, sends that snapshot to a language model, waits several seconds for a response, receives instructions, executes them instantly, takes another snapshot, and repeats.
Most of the agent's work happens during those long pauses while the model thinks. If you recorded the raw screen activity, you'd get a video that's 90% still frames with occasional rapid-fire clicking, which makes for unwatchable footage.
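To make that rhythm concrete, here's a minimal sketch of the loop. The takeSnapshot, askModel, and executeActions helpers are hypothetical stand-ins for the real browser and model calls, not part of agent-video.

// The agent loop: snapshot, think, act, repeat (hypothetical helper names)
let done = false;
while (!done) {
  const snapshot = await takeSnapshot();       // capture the current page state
  const decision = await askModel(snapshot);   // long pause while the model thinks
  await executeActions(decision.actions);      // clicks and typing happen in a burst
  done = decision.finished;
}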
A film school graduate’s solution: rehearsal
The architecture of agent-video borrows a concept I learned back in film school. In Hollywood, complex scenes often require multiple phases. There's pre-production where the script gets written and timing gets planned, the actual shoot where cameras roll, and post-production where everything gets edited together.
agent-video applies the same principle to screen recording. The MCP server in this project orchestrates three distinct phases that mirror this filmmaking workflow.
Pre-production happens first. The server opens a browser and visits each page, taking accessibility snapshots of what it sees. It sends each snapshot to Claude with a persona prompt and asks for structured narration that includes specific scroll targets. The AI writes commentary based on the actual page content and indicates which elements should be visible during each segment.
Then the narration goes to ElevenLabs for voice synthesis, which returns character-level timing data alongside the audio. By the end of this phase, the script is written, all the audio is ready, and the server knows exactly when to scroll to each element.
The shoot comes next. The server launches a fresh browser with video recording enabled, navigates to each URL in sequence, and performs content-aware scrolling that's synchronized to the narration. As Claude mentions a specific section of the page, the browser smoothly scrolls to bring that element into view. The timing comes from ElevenLabs' character-level alignment data, which maps each word in the narration to a precise timestamp.
Post-production ties everything together. The server extracts only the relevant segments from the raw recording, concatenates them into a seamless video, and overlays the pre-generated audio at precise intervals. The result is a polished video where narration and visuals sync at the content level.
They did their research
The research pass is where agent-video differs from simpler screen recording tools. Before any video gets captured, the server actually looks at each page and decides what to say about it, while also planning exactly where to scroll during each part of the narration.
// Open browser for research
agentBrowser(`open "${pages[0].url}" --headed`);

for (const page of pages) {
  // Navigate to page
  agentBrowser(`open "${page.url}"`);

  // Take an accessibility snapshot with element refs like @e1, @e5, @e12
  const snapshot = agentBrowser(`snapshot`);

  // Ask Claude to write structured narration with scroll targets
  const narrationData = await generateNarration(persona, page.url, snapshot);
}

// Close browser after research
agentBrowser(`close`);

The narration generation calls Claude's API with the page snapshot and persona description, using structured outputs to guarantee valid JSON.
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "content-type": "application/json",
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "structured-outputs-2025-11-13",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5-20250929",
    max_tokens: 800,
    messages: [{
      role: "user",
      content: `You are narrating a screen recording. Your persona: ${persona}
You are viewing: ${pageUrl}
Here is the accessibility snapshot (elements have refs like @e1, @e5):
${snapshot}
Generate narration that flows naturally through the page from top to bottom.
Create 3-5 segments. Pick refs that correspond to what you're talking about
in that segment. Use 'top' for the first segment, then refs for subsequent
sections as you scroll down.`
    }],
    output_format: {
      type: "json_schema",
      schema: {
        type: "object",
        properties: {
          segments: {
            type: "array",
            items: {
              type: "object",
              properties: {
                text: { type: "string" },
                scrollTo: { type: "string" }
              },
              required: ["text", "scrollTo"],
              additionalProperties: false
            }
          }
        },
        required: ["segments"],
        additionalProperties: false
      }
    }
  })
});

Claude returns something like this for a typical marketing site:
{
  "segments": [
    {
      "text": "Oh wonderful, another startup landing page with a giant hero section.",
      "scrollTo": "top"
    },
    {
      "text": "Let's see what features they're bragging about down here.",
      "scrollTo": "@e15"
    },
    {
      "text": "And of course there's a pricing section, because nothing says confidence like burying your prices at the bottom.",
      "scrollTo": "@e42"
    }
  ]
}

This means the commentary is always relevant and the scrolling is always purposeful. If the homepage has a giant hero image with "We're hiring!" plastered across it, the narration can mention that while the browser shows the hero. If the pricing page shows a free tier, the persona can react to it as the browser scrolls that section into view. The narration emerges from what's actually on the page, and the visuals follow along.
Just like… a real… human.
The accessibility snapshot
The accessibility snapshot deserves its own explanation because it's the bridge between what Claude sees and what the browser can scroll to.
When the browser takes a snapshot, it captures the page's accessibility tree rather than a visual screenshot. This tree represents the semantic structure of the page with each interactive or meaningful element assigned a unique ref like @e1, @e5, or @e42. The snapshot might look something like this for a simple marketing page.
- heading "Welcome to Acme" @e1
- paragraph "We help companies do things better." @e2
- link "Get Started" @e3
- heading "Features" @e15
- list @e16
  - listitem "Fast deployment" @e17
  - listitem "Easy integration" @e18
  - listitem "24/7 support" @e19
- heading "Pricing" @e42
- table @e43
  - row "Free tier - $0/month" @e44
  - row "Pro tier - $99/month" @e45

Claude reads this structure and can reference specific elements by their ref when writing narration. When Claude says "let's look at the pricing" and specifies @e42 as the scroll target, the browser knows exactly which DOM element to bring into view.
In the performance pass, those refs usually still resolve to the same elements they did in the snapshot Claude just analyzed. The catch is that pages can change between the research pass and the performance pass, which occasionally invalidates scroll targets. Dynamic content, lazy loading, or authentication state changes can all cause a ref from research to point to a different element during recording. When that's a risk, you can fall back to a stable CSS selector instead of a ref so you're certain you have a handle on the element you want, as in the sketch below.
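Here's a minimal sketch of that fallback, reusing the same eval and scrollintoview commands that appear in the performance-pass code later in this post. Treating non-@ values as CSS selectors is a hypothetical convention, not something agent-video defines.

// Scroll to a segment target: prefer refs, fall back to CSS selectors
function scrollToTarget(target) {
  if (target === 'top') {
    agentBrowser(`eval "window.scrollTo({ top: 0, behavior: 'smooth' })"`);
  } else if (target.startsWith('@')) {
    // Ref from the research-pass snapshot, e.g. @e42
    agentBrowser(`scrollintoview ${target}`);
  } else {
    // Stable CSS selector, e.g. "#pricing" or "[data-section='pricing']"
    agentBrowser(
      `eval "document.querySelector('${target}')?.scrollIntoView({ behavior: 'smooth' })"`
    );
  }
}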
You can be anyone you want
The persona can be anything you describe, with no predefined list to choose from.
{
  "persona": "a jaded Silicon Valley investor who's seen a thousand pitch decks",
  "pages": [
    { "url": "https://startup.com" },
    { "url": "https://startup.com/pricing" }
  ]
}

You could also try a "Gordon Ramsay reviewing websites instead of restaurants" persona, or "a confused grandparent trying to understand what this company does."
The persona string gets passed directly to Claude, which interprets it and writes narration in that voice. You can get creative with descriptions like "a noir detective narrating in the style of a 1940s radio drama" or "an overenthusiastic intern on their first day" or "a bored teenager who'd rather be anywhere else."
My favorite in testing was the baby who just learned to talk, though I can see how that might be funny to me and nobody else.
Audio generation and timing alignment
The generated narration flows through ElevenLabs for text-to-speech, but we use a special endpoint that returns more than audio.
const response = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`,
  {
    method: 'POST',
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      text: fullNarration,
      model_id: 'eleven_multilingual_v2'
    })
  }
);

const { audio_base64, alignment } = await response.json();

The with-timestamps endpoint returns character-level timing data alongside the audio. The alignment object contains arrays that map every character in the input text to precise timestamps in the generated audio.
alignment: {
  characters: ['O', 'h', ' ', 'w', 'o', 'n', 'd', 'e', 'r', 'f', 'u', 'l', ...],
  character_start_times_seconds: [0.0, 0.05, 0.12, 0.15, 0.22, ...],
  character_end_times_seconds: [0.05, 0.12, 0.15, 0.22, 0.28, ...]
}

This character-level timing is the key to content-aware scrolling. Since we know that each segment starts at a specific character offset in the full narration, we can look up exactly when that segment will be spoken and trigger the scroll at that moment.
function calculateSegmentTimings(segments, charStartTimes) {
  const segmentTimings = [];
  let charOffset = 0;

  for (const segment of segments) {
    const startTime = charStartTimes[charOffset];
    segmentTimings.push({
      text: segment.text,
      scrollTo: segment.scrollTo,
      startTimeMs: Math.round(startTime * 1000)
    });

    // Move past this segment plus the space between segments
    charOffset += segment.text.length + 1;
  }

  return segmentTimings;
}

The result is a precise schedule that tells the browser exactly when to scroll to each element during recording.
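For context, here's roughly how those pieces fit together, assuming narrationData is the structured output for one page from the research loop and that the narration sent to ElevenLabs was built by joining the segment texts with single spaces (which is what the +1 offset above accounts for).

// Build the narration string the way the offsets assume: one space between segments
const fullNarration = narrationData.segments.map((s) => s.text).join(' ');

// After the ElevenLabs call, combine segments with the character-level alignment
const segmentTimings = calculateSegmentTimings(
  narrationData.segments,
  alignment.character_start_times_seconds
);
// e.g. [{ text: "Oh wonderful, ...", scrollTo: "top", startTimeMs: 0 }, ...]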
The performance pass
With all audio generated, timing calculated, and scroll targets identified, the server opens a fresh browser and records the actual video.
// Open browser for recording
agentBrowser(`set viewport 1280 720`);
agentBrowser(`open "${pages[0].url}" --headed`);

// Start recording
agentBrowser(`record start "${videoPath}"`);
const recordingStartMs = Date.now();
const marks = [];

for (let i = 0; i < pages.length; i++) {
  const { url } = pages[i];
  const clip = clips[i];

  // Navigate
  agentBrowser(`open "${url}"`);

  // Mark timestamp for extraction later
  marks.push({
    clipNum: i + 1,
    offsetMs: Date.now() - recordingStartMs,
    durationMs: clip.durationMs
  });

  // Enable smooth scrolling
  agentBrowser(`eval "document.documentElement.style.scrollBehavior = 'smooth'"`);

  // Content-aware scrolling based on segment timings
  const segmentStartTime = Date.now();

  for (const segment of clip.segmentTimings) {
    // Wait until it's time to scroll to this segment
    const elapsedMs = Date.now() - segmentStartTime;
    const waitMs = segment.startTimeMs - elapsedMs;
    if (waitMs > 0) {
      await sleep(waitMs);
    }

    // Scroll to the target element
    if (segment.scrollTo === 'top') {
      agentBrowser(`eval "window.scrollTo({ top: 0, behavior: 'smooth' })"`);
    } else if (segment.scrollTo.startsWith('@')) {
      agentBrowser(`scrollintoview ${segment.scrollTo}`);
    }
  }

  // Wait for remaining audio duration
  const totalElapsed = Date.now() - segmentStartTime;
  const remainingMs = clip.durationMs - totalElapsed;
  if (remainingMs > 0) {
    await sleep(remainingMs);
  }
}

// Stop recording
agentBrowser(`record stop`);
agentBrowser(`close`);

The scrolling happens in sync with the narration because both are driven by the same timing data from ElevenLabs. When Claude's narration mentions "let's see what features they're bragging about," the browser is already scrolling to the features section. When the persona comments on pricing, the pricing table slides into view.
The marks array tracks when each page segment starts relative to the recording beginning, which becomes the extraction map for post-production.
Fix it in post
With the raw video captured and the marks logged, post-production assembles the final output through a series of ffmpeg operations.
The first step extracts precise time windows from the source video based on the marks recorded during the performance pass.
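Here's a rough sketch of how those marks could drive the extraction step. The command formatting is an assumption about how the offsets map onto ffmpeg arguments, not agent-video's actual implementation.

// Hypothetical sketch: turn each mark into an ffmpeg extraction command
const commands = marks.map((mark) => {
  const startSec = (mark.offsetMs / 1000).toFixed(3);       // where this page segment begins
  const durationSec = (mark.durationMs / 1000).toFixed(1);  // how long its narration runs
  return `ffmpeg -ss ${startSec} -t ${durationSec} -i recording.webm segment_${mark.clipNum}.mp4`;
});
// Produces commands like the ones below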
ffmpeg -ss 0.000 -t 3.5 -i recording.webm segment_1.mp4
ffmpeg -ss 3.500 -t 2.9 -i recording.webm segment_2.mp4
ffmpeg -ss 6.400 -t 4.2 -i recording.webm segment_3.mp4

These segments then get joined using ffmpeg's concat demuxer, which works at the container level and avoids re-encoding to keep things fast.
ffmpeg -f concat -safe 0 -i concat_list.txt -c copy concat.mp4

Audio mixing positions each narration clip at the correct offset using the adelay filter, which accepts milliseconds for sub-second precision.
// Build one adelay clause per narration clip, each offset by the clips before it
let audioFilter = '';
let cumulativeOffset = 0;

for (const clip of clips) {
  audioFilter += `[${clip.num}]adelay=${cumulativeOffset}|${cumulativeOffset}[a${clip.num}];`;
  cumulativeOffset += clip.durationMs;
}

The final ffmpeg command merges the positioned audio tracks onto the concatenated video.
ffmpeg -i concat.mp4 -i clip_1.mp3 -i clip_2.mp3 -i clip_3.mp3 \
-filter_complex "[1]adelay=0[a1];[2]adelay=3500[a2];[3]adelay=6400[a3];[a1][a2][a3]amix=inputs=3" \
-c:v copy output.mp4

The finished MP4 then gets uploaded to Mux for encoding, adaptive streaming, and delivery.
const upload = await fetch("https://api.mux.com/video/v1/uploads", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Basic ${credentials}`,
  },
  body: JSON.stringify({
    new_asset_settings: {
      playback_policy: ["public"],
      video_quality: "basic"
    }
  })
});

await fetch(uploadUrl, {
  method: "PUT",
  body: videoBuffer,
  headers: { "Content-Type": "video/mp4" }
});

Mux returns a playback URL that works on any device with adaptive bitrate streaming handled automatically.
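Getting that URL takes one more round trip. Here's a rough sketch of the usual pattern: check the upload until Mux attaches an asset, then read the asset's public playback ID. The uploadId here is assumed to come from the upload response above, and real code would poll with retries.

// Check the upload for the asset Mux created from it (sketch; no retry/backoff shown)
const uploadStatus = await fetch(`https://api.mux.com/video/v1/uploads/${uploadId}`, {
  headers: { Authorization: `Basic ${credentials}` }
});
const { data: { asset_id } } = await uploadStatus.json();

// Look up the asset's public playback ID
const asset = await fetch(`https://api.mux.com/video/v1/assets/${asset_id}`, {
  headers: { Authorization: `Basic ${credentials}` }
});
const { data: { playback_ids } } = await asset.json();

const playbackUrl = `https://stream.mux.com/${playback_ids[0].id}.m3u8`;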
But Dave, this seems… complicated
Fair. But it’s also what makes it effective. The two-pass approach might seem unnecessarily complicated, but it solves real problems with AI agent recordings.
The first benefit is that narration matches content precisely. Recording while an AI writes narration in real-time would produce commentary based on whatever the AI remembered or assumed about the page. Visiting pages and taking snapshots first means the narration responds to what's actually there, and the structured segment format means the AI explicitly indicates which elements it's talking about.
The second benefit is the elimination of timing drift. Real-time speech synthesis introduces unpredictable latency, and the agent would have moved on by the time ElevenLabs returned an audio clip. Pre-generating all audio and extracting character-level timing means the server knows exactly when each word will be spoken before recording starts.
The third benefit is content-aware scrolling. Simple screen recordings scroll at a fixed pace that has no relationship to what's being said. The segment-based approach means the browser scrolls to features when the narration mentions features, scrolls to pricing when the narration discusses pricing, and stays put when the narration dwells on something important.
The fourth benefit is the removal of dead time. A raw recording of an AI agent might run 15 minutes, but only 2 minutes shows anything interesting while the rest captures API calls and model inference. Recording only the performance pass and extracting only the scroll segments produces a final video of continuous, engaging content.
Run it yourself
The system requires four external services to function. An agent browser SDK handles browser automation and screen recording, Claude generates the contextual narration with scroll targets, ElevenLabs converts text to speech with character-level timing, and Mux hosts and delivers the final video.
ANTHROPIC_API_KEY="your-api-key"
ELEVENLABS_API_KEY="your-api-key"
MUX_TOKEN_ID="your-token-id"
MUX_TOKEN_SECRET="your-token-secret"

The MCP server exposes a single tool called create_narrated_recording that handles the entire workflow when given a persona and an array of URLs.
{
  "persona": "a skeptical product reviewer who's hard to impress",
  "pages": [
    { "url": "https://example.com" },
    { "url": "https://example.com/features" },
    { "url": "https://example.com/pricing" }
  ]
}

The tool visits each page, writes structured narration based on what it sees, generates audio with timing data, records the video with content-aware scrolling, and returns a Mux playback URL. The entire workflow runs autonomously from a single tool call.
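If you're curious what that looks like on the server side, here's a rough sketch of registering such a tool with the MCP TypeScript SDK. The createNarratedRecording helper is a hypothetical stand-in for the whole pipeline described above, and the exact input schema is an assumption.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "agent-video", version: "1.0.0" });

// Hypothetical sketch: createNarratedRecording stands in for the research,
// performance, and post-production passes described in this post
server.tool(
  "create_narrated_recording",
  {
    persona: z.string(),
    pages: z.array(z.object({ url: z.string() }))
  },
  async ({ persona, pages }) => {
    const playbackUrl = await createNarratedRecording(persona, pages);
    return { content: [{ type: "text", text: playbackUrl }] };
  }
);

await server.connect(new StdioServerTransport());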
What you can build with this
Product demos become possible without human involvement when you point the tool at your application, give it a persona, and let it generate a walkthrough video with commentary. The skeptical reviewer persona is particularly effective for internal reviews because a critical eye often catches UX issues that a friendly walkthrough would miss.
Competitor analysis gets a new format when you record your competitor's website with a "confused first-time visitor" persona to expose friction points, or use an "enthusiastic salesperson" persona on your own site to hear how the pitch sounds when spoken aloud.
Bug reports gain visual context when an agent records itself triggering the bug while narrating what's happening, giving developers immediate understanding that would take paragraphs of text to convey.
Async handoffs between agents and humans become richer when an agent records a video walkthrough that a human reviewer can watch at 2x speed during their morning coffee.
The technology here builds on mature foundations. Screen recording, language models, text-to-speech, and video hosting all existed before this project. The interesting part is how they're composed into a two-pass workflow that separates research from performance, a freeform persona system that shapes narration style, and a timing engine that uses character-level alignment to sync scrolling with speech.
AI agents are getting better at doing things, and now they can show you what they did while telling you about it in character.