Published on January 28, 2026

Agentic screen recording in the browser

By Dave Kiss · 13 min read · Engineering

These new-fangled AI agent systems can now browse the web, write code, file bug reports, and execute multi-step workflows with limited issues. When they finish a task, though, what do we actually get? A text summary, maybe a screenshot, perhaps a list of files changed.

That’s kinda like hiring a contractor to renovate your kitchen and receiving only a written description of what they did. “Alrighty, we removed the old cabinets, installed new ones, and the sink is now functional.” Cool, but like, can I see the kitchen? I demand receipts. Show me how it works.

agent-video solves this problem by borrowing heavily from how films get made.

Please wait, loading…

When an AI agent navigates a website, it operates differently than a human. A human user experiences time linearly, clicking a button, waiting for a page to load, reading the content, deciding what to do next, and clicking again. The thinking happens between the actions, creating a smooth and continuous flow.

An AI agent experiences time in bursts. It takes a snapshot of the page, sends that snapshot to a language model, waits several seconds for a response, receives instructions, executes them instantly, takes another snapshot, and repeats.

Most of the agent's work happens during those long pauses while the model thinks. If you recorded the raw screen activity, you'd get a video that's 90% still frames with occasional rapid-fire clicking, which makes for unwatchable footage.

Boooooooorrrrrrriiiinnnnnggggg

A film school graduate’s solution: rehearsal

The architecture of agent-video borrows a concept I learned back in film school. In Hollywood, complex scenes often require multiple phases. There's pre-production where the script gets written and timing gets planned, the actual shoot where cameras roll, and post-production where everything gets edited together.

agent-video can apply this same principle to screen recording. The MCP server in this project orchestrates three distinct phases that mirror this filmmaking workflow.

Pre-production happens first. The server opens a browser and visits each page, taking accessibility snapshots of what it sees. It sends each snapshot to Claude with a persona prompt and asks for structured narration that includes specific scroll targets. The AI writes commentary based on the actual page content and indicates which elements should be visible during each segment.

Then the narration goes to ElevenLabs for voice synthesis, which returns character-level timing data alongside the audio. By the end of this phase, the script is written, all the audio is ready, and the server knows exactly when to scroll to each element.

The shoot comes next. The server launches a fresh browser with video recording enabled, navigates to each URL in sequence, and performs content-aware scrolling that's synchronized to the narration. As Claude mentions a specific section of the page, the browser smoothly scrolls to bring that element into view. The timing comes from ElevenLabs' character-level alignment data, which maps each word in the narration to a precise timestamp.

Post-production ties everything together. The server extracts only the relevant segments from the raw recording, concatenates them into a seamless video, and overlays the pre-generated audio at precise intervals. The result is a polished video where narration and visuals sync at the content level.
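At a high level, the whole pipeline could be orchestrated as something like the sketch below. The helper names (researchPages, generateAudio, recordPerformance, assembleVideo) are illustrative stand-ins for the phases described above, not the project's actual functions.

javascript
// A minimal sketch of the three-phase flow, assuming hypothetical helpers
// for each phase described in the prose above.
async function createNarratedRecording(persona, pages) {
  // Pre-production: snapshot each page and have Claude write timed narration
  const scripts = await researchPages(persona, pages);

  // Still pre-production: synthesize narration audio and keep the timing data
  const clips = await generateAudio(scripts);

  // The shoot: record a fresh browser session with content-aware scrolling
  const { videoPath, marks } = await recordPerformance(pages, clips);

  // Post-production: trim, concatenate, and overlay the narration audio
  return assembleVideo(videoPath, marks, clips);
}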

They did their research

The research pass is where agent-video differs from simpler screen recording tools. Before any video gets captured, the server actually looks at each page and decides what to say about it, while also planning exactly where to scroll during each part of the narration.

javascript
// Open browser for research
agentBrowser(`open "${pages[0].url}" --headed`);

for (const page of pages) {
  // Navigate to page
  agentBrowser(`open "${page.url}"`);

  // Take an accessibility snapshot with element refs like @e1, @e5, @e12
  const snapshot = agentBrowser(`snapshot`);

  // Ask Claude to write structured narration with scroll targets
  const narrationData = await generateNarration(persona, page.url, snapshot);
}

// Close browser after research
agentBrowser(`close`);

The narration generation calls Claude's API with the page snapshot and persona description, using structured outputs to guarantee valid JSON.

javascript
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "content-type": "application/json",
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "structured-outputs-2025-11-13",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5-20250929",
    max_tokens: 800,
    messages: [
      {
        role: "user",
        content: `You are narrating a screen recording.

Your persona: ${persona}

You are viewing: ${pageUrl}

Here is the accessibility snapshot (elements have refs like @e1, @e5):
${snapshot}

Generate narration that flows naturally through the page from top to bottom.
Create 3-5 segments. Pick refs that correspond to what you're talking about
in that segment. Use 'top' for the first segment, then refs for subsequent
sections as you scroll down.`
      }
    ],
    output_format: {
      type: "json_schema",
      schema: {
        type: "object",
        properties: {
          segments: {
            type: "array",
            items: {
              type: "object",
              properties: {
                text: { type: "string" },
                scrollTo: { type: "string" }
              },
              required: ["text", "scrollTo"],
              additionalProperties: false
            }
          }
        },
        required: ["segments"],
        additionalProperties: false
      }
    }
  })
});

Claude returns something like this for a typical marketing site:

json
{ "segments": [ { "text": "Oh wonderful, another startup landing page with a giant hero section.", "scrollTo": "top" }, { "text": "Let's see what features they're bragging about down here.", "scrollTo": "@e15" }, { "text": "And of course there's a pricing section, because nothing says confidence like burying your prices at the bottom.", "scrollTo": "@e42" } ] }

This means the commentary is always relevant and the scrolling is always purposeful. If the homepage has a giant hero image with "We're hiring!" plastered across it, the narration can mention that while the browser shows the hero. If the pricing page shows a free tier, the persona can react to it as the browser scrolls that section into view. The narration emerges from what's actually on the page, and the visuals follow along.

Just like… a real… human.

The accessibility snapshot

The accessibility snapshot deserves its own explanation because it's the bridge between what Claude sees and what the browser can scroll to.

When the browser takes a snapshot, it captures the page's accessibility tree rather than a visual screenshot. This tree represents the semantic structure of the page with each interactive or meaningful element assigned a unique ref like @e1, @e5, or @e42. The snapshot might look something like this for a simple marketing page.

- heading "Welcome to Acme" @e1 - paragraph "We help companies do things better." @e2 - link "Get Started" @e3 - heading "Features" @e15 - list @e16 - listitem "Fast deployment" @e17 - listitem "Easy integration" @e18 - listitem "24/7 support" @e19 - heading "Pricing" @e42 - table @e43 - row "Free tier - $0/month" @e44 - row "Pro tier - $99/month" @e45

Claude reads this structure and can reference specific elements by their ref when writing narration. When Claude says "let's look at the pricing" and specifies @e42 as the scroll target, the browser knows exactly which DOM element to bring into view.

In the performance pass, the refs usually still resolve just as they did in the snapshot Claude analyzed. The challenge is that pages can change between the research pass and the performance pass, which occasionally invalidates scroll targets. Dynamic content, lazy loading, or authentication state changes can all cause the refs from research to point to different elements during recording. In that case, you can opt for a persistent CSS or HTML selector instead of a ref so you're certain you've got a handle to the element you're looking for, as sketched below.
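A minimal sketch of that fallback, using the same eval command the performance pass already relies on. The #pricing selector is just an example; pick a selector you control on the page.

javascript
// Scroll by a stable CSS selector instead of a snapshot ref, so the target
// survives re-renders between the research and performance passes.
// The selector here is illustrative.
const selector = "#pricing";
agentBrowser(
  `eval "document.querySelector('${selector}')?.scrollIntoView({ behavior: 'smooth', block: 'start' })"`
);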

You can be anyone you want

The persona can be anything you describe, with no predefined list to choose from.

json
{ "persona": "a jaded Silicon Valley investor who's seen a thousand pitch decks", "pages": [ { "url": "https://startup.com" }, { "url": "https://startup.com/pricing" } ] }

You could also try a "Gordon Ramsay reviewing websites instead of restaurants" persona, or "a confused grandparent trying to understand what this company does."

The persona string gets passed directly to Claude, which interprets it and writes narration in that voice. You can get creative with descriptions like "a noir detective narrating in the style of a 1940s radio drama" or "an overenthusiastic intern on their first day" or "a bored teenager who'd rather be anywhere else."

My favorite in testing was the baby who just learned to talk, but I could see how I would think that’s funny and nobody else would.

Audio generation and timing alignment

The generated narration flows through ElevenLabs for text-to-speech, but we use a special endpoint that returns more than audio.

javascript
const response = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`,
  {
    method: 'POST',
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      text: fullNarration,
      model_id: 'eleven_multilingual_v2'
    })
  }
);

const { audio_base64, alignment } = await response.json();

The with-timestamps endpoint returns character-level timing data alongside the audio. The alignment object contains arrays that map every character in the input text to precise timestamps in the generated audio.

javascript
alignment: {
  characters: ['O', 'h', ' ', 'w', 'o', 'n', 'd', 'e', 'r', 'f', 'u', 'l', ...],
  character_start_times_seconds: [0.0, 0.05, 0.12, 0.15, 0.22, ...],
  character_end_times_seconds: [0.05, 0.12, 0.15, 0.22, 0.28, ...]
}

This character-level timing is the key to content-aware scrolling. Since we know that each segment starts at a specific character offset in the full narration, we can look up exactly when that segment will be spoken and trigger the scroll at that moment.

javascript
function calculateSegmentTimings(segments, charStartTimes) {
  const segmentTimings = [];
  let charOffset = 0;

  for (const segment of segments) {
    const startTime = charStartTimes[charOffset];

    segmentTimings.push({
      text: segment.text,
      scrollTo: segment.scrollTo,
      startTimeMs: Math.round(startTime * 1000)
    });

    // Move past this segment plus the space between segments
    charOffset += segment.text.length + 1;
  }

  return segmentTimings;
}

The result is a precise schedule that tells the browser exactly when to scroll to each element during recording.
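Wiring it together looks roughly like the sketch below; the millisecond values in the comment are illustrative, not real output.

javascript
// segments come from Claude's structured output for this page;
// alignment comes from the ElevenLabs with-timestamps response.
const segmentTimings = calculateSegmentTimings(
  narrationData.segments,
  alignment.character_start_times_seconds
);

// Roughly:
// [
//   { text: "Oh wonderful, another startup landing page...", scrollTo: "top",  startTimeMs: 0 },
//   { text: "Let's see what features they're bragging...",   scrollTo: "@e15", startTimeMs: 3480 },
//   ...
// ]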

The performance pass

With all audio generated, timing calculated, and scroll targets identified, the server opens a fresh browser and records the actual video.

javascript
// Open browser for recording
agentBrowser(`set viewport 1280 720`);
agentBrowser(`open "${pages[0].url}" --headed`);

// Start recording
agentBrowser(`record start "${videoPath}"`);
const recordingStartMs = Date.now();
const marks = [];

for (let i = 0; i < pages.length; i++) {
  const { url } = pages[i];
  const clip = clips[i];

  // Navigate
  agentBrowser(`open "${url}"`);

  // Mark timestamp for extraction later
  marks.push({
    clipNum: i + 1,
    offsetMs: Date.now() - recordingStartMs,
    durationMs: clip.durationMs
  });

  // Enable smooth scrolling
  agentBrowser(`eval "document.documentElement.style.scrollBehavior = 'smooth'"`);

  // Content-aware scrolling based on segment timings
  const segmentStartTime = Date.now();

  for (const segment of clip.segmentTimings) {
    // Wait until it's time to scroll to this segment
    const elapsedMs = Date.now() - segmentStartTime;
    const waitMs = segment.startTimeMs - elapsedMs;
    if (waitMs > 0) {
      await sleep(waitMs);
    }

    // Scroll to the target element
    if (segment.scrollTo === 'top') {
      agentBrowser(`eval "window.scrollTo({ top: 0, behavior: 'smooth' })"`);
    } else if (segment.scrollTo.startsWith('@')) {
      agentBrowser(`scrollintoview ${segment.scrollTo}`);
    }
  }

  // Wait for remaining audio duration
  const totalElapsed = Date.now() - segmentStartTime;
  const remainingMs = clip.durationMs - totalElapsed;
  if (remainingMs > 0) {
    await sleep(remainingMs);
  }
}

// Stop recording
agentBrowser(`record stop`);
agentBrowser(`close`);

The scrolling happens in sync with the narration because both are driven by the same timing data from ElevenLabs. When Claude's narration mentions "let's see what features they're bragging about," the browser is already scrolling to the features section. When the persona comments on pricing, the pricing table slides into view.

The marks array tracks when each page segment starts relative to the recording beginning, which becomes the extraction map for post-production.

Fix it in post

With the raw video captured and the marks logged, post-production assembles the final output through a series of ffmpeg operations.

The first step extracts precise time windows from the source video based on the marks recorded during the performance pass.

bash
ffmpeg -ss 0.000 -t 3.5 -i recording.webm segment_1.mp4
ffmpeg -ss 3.500 -t 2.9 -i recording.webm segment_2.mp4
ffmpeg -ss 6.400 -t 4.2 -i recording.webm segment_3.mp4
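Those -ss and -t values fall straight out of the marks array. A sketch of how the commands might be generated, where runFfmpeg is a hypothetical wrapper around Node's child_process:

javascript
// Each mark records where a page's segment starts in the raw recording
// (offsetMs) and how long its narration runs (durationMs).
for (const mark of marks) {
  const start = (mark.offsetMs / 1000).toFixed(3);
  const duration = (mark.durationMs / 1000).toFixed(3);
  await runFfmpeg(
    `-ss ${start} -t ${duration} -i recording.webm segment_${mark.clipNum}.mp4`
  );
}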

These segments then get joined using ffmpeg's concat demuxer, which works at the container level and avoids re-encoding to keep things fast.

bash
ffmpeg -f concat -safe 0 -i concat_list.txt -c copy concat.mp4
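The concat_list.txt manifest is just a plain-text list of the segment files in playback order, in the format the concat demuxer expects:

concat_list.txt
file 'segment_1.mp4'
file 'segment_2.mp4'
file 'segment_3.mp4'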

Audio mixing positions each narration clip at the correct offset using the adelay filter, which accepts milliseconds for sub-second precision.

javascript
let cumulativeOffset = 0;

for (const clip of clips) {
  audioFilter += `[${clip.num}]adelay=${cumulativeOffset}|${cumulativeOffset}[a${clip.num}];`;
  cumulativeOffset += clip.durationMs;
}

The final ffmpeg command merges the positioned audio tracks onto the concatenated video.

bash
ffmpeg -i concat.mp4 -i clip_1.mp3 -i clip_2.mp3 -i clip_3.mp3 \
  -filter_complex "[1]adelay=0[a1];[2]adelay=3500[a2];[3]adelay=6400[a3];[a1][a2][a3]amix=inputs=3" \
  -c:v copy output.mp4

The finished MP4 then gets uploaded to Mux for encoding, adaptive streaming, and delivery.

javascript
// Create a direct upload and get back a signed upload URL
const upload = await fetch("https://api.mux.com/video/v1/uploads", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Basic ${credentials}`,
  },
  body: JSON.stringify({
    new_asset_settings: { playback_policy: ["public"] },
    video_quality: "basic"
  })
});

// The signed upload URL comes back on the response body
const { data: { url: uploadUrl } } = await upload.json();

// PUT the finished MP4 to the signed URL
await fetch(uploadUrl, {
  method: "PUT",
  body: videoBuffer,
  headers: { "Content-Type": "video/mp4" }
});

Mux returns a playback URL that works on any device with adaptive bitrate streaming handled automatically.
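A sketch of how that playback URL might be resolved once the upload finishes. The endpoint shapes follow Mux's documented API, but real code would poll until the asset is ready; that loop and error handling are omitted here.

javascript
// Resolve the public playback URL for the asset created from a direct upload.
async function getPlaybackUrl(uploadId, credentials) {
  // The upload object gains an asset_id once Mux starts processing the file
  const uploadRes = await fetch(`https://api.mux.com/video/v1/uploads/${uploadId}`, {
    headers: { Authorization: `Basic ${credentials}` }
  });
  const { data: upload } = await uploadRes.json();

  // Read the asset's public playback ID
  const assetRes = await fetch(`https://api.mux.com/video/v1/assets/${upload.asset_id}`, {
    headers: { Authorization: `Basic ${credentials}` }
  });
  const { data: asset } = await assetRes.json();

  // Mux serves HLS from stream.mux.com, keyed by playback ID
  return `https://stream.mux.com/${asset.playback_ids[0].id}.m3u8`;
}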

But Dave, this seems… complicated

Fair. But it’s also what makes it effective. The two-pass approach might seem unnecessarily complicated, but it solves real problems with AI agent recordings.

The first benefit is that narration matches content precisely. Recording while an AI writes narration in real-time would produce commentary based on whatever the AI remembered or assumed about the page. Visiting pages and taking snapshots first means the narration responds to what's actually there, and the structured segment format means the AI explicitly indicates which elements it's talking about.

The second benefit is the elimination of timing drift. Real-time speech synthesis introduces unpredictable latency, and the agent would have moved on by the time ElevenLabs returned an audio clip. Pre-generating all audio and extracting character-level timing means the server knows exactly when each word will be spoken before recording starts.

The third benefit is content-aware scrolling. Simple screen recordings scroll at a fixed pace that has no relationship to what's being said. The segment-based approach means the browser scrolls to features when the narration mentions features, scrolls to pricing when the narration discusses pricing, and stays put when the narration dwells on something important.

The fourth benefit is the removal of dead time. A raw recording of an AI agent might run 15 minutes, but only 2 minutes shows anything interesting while the rest captures API calls and model inference. Recording only the performance pass and extracting only the scroll segments produces a final video of continuous, engaging content.

Run it yourself

The system requires four external services to function. An agent browser SDK handles browser automation and screen recording, Claude generates the contextual narration with scroll targets, ElevenLabs converts text to speech with character-level timing, and Mux hosts and delivers the final video.

.env
ANTHROPIC_API_KEY="your-api-key"
ELEVENLABS_API_KEY="your-api-key"
MUX_TOKEN_ID="your-token-id"
MUX_TOKEN_SECRET="your-token-secret"

The MCP server exposes a single tool called create_narrated_recording that handles the entire workflow when given a persona and an array of URLs.

json
{ "persona": "a skeptical product reviewer who's hard to impress", "pages": [ { "url": "https://example.com" }, { "url": "https://example.com/features" }, { "url": "https://example.com/pricing" } ] }

The tool visits each page, writes structured narration based on what it sees, generates audio with timing data, records the video with content-aware scrolling, and returns a Mux playback URL. The entire workflow runs autonomously from a single tool call.
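For context, registering a tool like that with the TypeScript MCP SDK looks roughly like the sketch below; createNarratedRecording here is the hypothetical function wrapping the three phases described earlier, not the project's literal source, and the exact SDK surface may differ between versions.

javascript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "agent-video", version: "1.0.0" });

// Expose the single tool; the handler runs the full research -> record -> post pipeline
server.tool(
  "create_narrated_recording",
  {
    persona: z.string(),
    pages: z.array(z.object({ url: z.string() }))
  },
  async ({ persona, pages }) => {
    const playbackUrl = await createNarratedRecording(persona, pages);
    return { content: [{ type: "text", text: playbackUrl }] };
  }
);

// MCP servers typically talk to their client over stdio
await server.connect(new StdioServerTransport());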

What you can build with this

Product demos become possible without human involvement when you point the tool at your application, give it a persona, and let it generate a walkthrough video with commentary. The skeptical reviewer persona is particularly effective for internal reviews because a critical eye often catches UX issues that a friendly walkthrough would miss.

Competitor analysis gets a new format when you record your competitor's website with a "confused first-time visitor" persona to expose friction points, or use an "enthusiastic salesperson" persona on your own site to hear how the pitch sounds when spoken aloud.

Bug reports gain visual context when an agent records itself triggering the bug while narrating what's happening, giving developers immediate understanding that would take paragraphs of text to convey.

Async handoffs between agents and humans become richer when an agent records a video walkthrough that a human reviewer can watch at 2x speed during their morning coffee.

The technology here builds on mature foundations. Screen recording, language models, text-to-speech, and video hosting all existed before this project. The interesting part is how they're composed into a two-pass workflow that separates research from performance, a freeform persona system that shapes narration style, and a timing engine that uses character-level alignment to sync scrolling with speech.

AI agents are getting better at doing things, and now they can show you what they did while telling you about it in character.

Written By

Dave Kiss – Senior Community Engineering Lead

Was: solo-developreneur. Now: developer community person. Happy to ride a bike, hike a hike, high-five a hand, and listen to spa music.
