Every video that plays smoothly in a browser, on a phone, or on a smart TV has survived a gauntlet of encoding decisions that most developers never see. Choose the wrong codec and you're paying 40% more in CDN costs than you need to. Design a bad bitrate ladder and your ABR player will stall at a critical moment. Wire up a naive encoding pipeline and you'll be polling job status in a while loop at 3am when a batch upload fails halfway through.
This guide is for developers building video platforms, media engineers moving off manual FFmpeg pipelines, and teams evaluating cloud encoding APIs. By the time you're done reading, you'll know how to pick the right codec for your use case, design a bitrate ladder that balances quality against bandwidth cost, and understand what a production-grade cloud encoding pipeline actually looks like under the hood.
Let's start with the part that trips up almost every developer who's new to video.
Encoding, Transcoding, and Packaging: Not the Same Thing
These three words get used interchangeably in documentation and job descriptions. They mean different things, and conflating them causes real bugs.
Encoding is the process of compressing raw video frames into a deliverable format using a codec. You're taking uncompressed pixel data, which for a single 1080p frame is roughly 6 MB (1920 × 1080 pixels at 3 bytes each), and applying a compression algorithm that gets it down to something a network can actually carry.
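To make the scale of the problem concrete, here is a small sketch of the arithmetic behind that raw frame size. It assumes 8-bit RGB at 3 bytes per pixel; the function names are illustrative, and a 4:2:0 YUV source would use 1.5 bytes per pixel instead.

```javascript
// Size of one uncompressed frame in bytes.
// Assumption: 8-bit RGB, i.e. 3 bytes per pixel.
function rawFrameBytes(width, height, bytesPerPixel = 3) {
  return width * height * bytesPerPixel;
}

// Bitrate of the uncompressed stream in megabits per second.
function rawBitrateMbps(width, height, fps, bytesPerPixel = 3) {
  return (rawFrameBytes(width, height, bytesPerPixel) * 8 * fps) / 1e6;
}

console.log((rawFrameBytes(1920, 1080) / 1e6).toFixed(2), 'MB per frame'); // 6.22 MB per frame
console.log(rawBitrateMbps(1920, 1080, 30).toFixed(0), 'Mbps at 30fps');   // 1493 Mbps at 30fps
```

Against the 3–6 Mbps quoted later for a 1080p H.264 stream, that is a reduction of several hundred times.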
Transcoding is encoding that starts from an already-encoded source. You're decoding a file and re-encoding it, usually to change the codec, resolution, or bitrate. Most "encoding pipelines" in production are technically transcoding pipelines because source files arrive pre-encoded.
Packaging is the step developers most often forget. A transcoded file on its own is not an adaptive stream. Packaging takes your encoded output and segments it into small chunks (typically 2–6 seconds), then generates manifest files that tell a player which segments exist and at which quality levels. HLS produces .m3u8 manifests; DASH produces .mpd files. Skip this step and your beautiful H.265 encode sits on disk with nothing for an ABR player to switch between.
Containers vs. Codecs
One more thing to get straight before we go deeper: containers and codecs are different. MP4, MOV, and MKV are containers — wrappers that hold encoded video, audio, and metadata. H.264, H.265, and AV1 are codecs — the algorithms that actually compress the video data inside the container. An MP4 can contain H.264 video. It can also contain H.265 video. The container doesn't tell you the codec. This is why FFmpeg flags like -c:v libx264 specify the codec separately from the output container.
Key Parameters You Need to Know
Before choosing a codec, you need to understand the parameters you'll be tuning regardless of which codec you pick:
- Bitrate: How many bits per second the encoded stream uses. CBR (constant bitrate) keeps this fixed — good for live streaming where buffer predictability matters. VBR (variable bitrate) allocates more bits to complex scenes and fewer to simple ones — better for VOD quality per bit.
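- Keyframe interval (GOP size): How often the encoder writes a full intra-frame (I-frame). For ABR streaming, the segment duration must be a whole multiple of the keyframe interval, because a segment can only start on a keyframe. A 2-second segment with a 4-second keyframe interval leaves every other segment without a keyframe to start on, so the packager can't cut it cleanly and players can't switch or seek cleanly.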
- Profile and level: H.264's Baseline/Main/High profiles control which compression features are enabled. Baseline is the most compatible; High gives better compression but not every device supports it.
- Encoder preset: FFmpeg's ultrafast through veryslow presets trade encoding CPU time for compression efficiency. For live streaming, veryfast is typical. For VOD archiving, slow or slower gets you significantly better compression.
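The keyframe alignment rule is easy to check in code before you launch an encode. The following is a minimal sketch; `gopFrames` and `segmentsAlign` are hypothetical helper names, and frame counts are rounded to handle fractional frame rates.

```javascript
// Keyframe interval in frames for a given frame rate and keyframe spacing.
function gopFrames(fps, keyframeSeconds) {
  return Math.round(fps * keyframeSeconds);
}

// A segment can only be cut at a keyframe, so the segment length (in frames)
// must be a whole multiple of the keyframe interval (in frames).
function segmentsAlign(fps, keyframeSeconds, segmentSeconds) {
  const gop = gopFrames(fps, keyframeSeconds);
  const segment = Math.round(fps * segmentSeconds);
  return gop > 0 && segment % gop === 0;
}

console.log(gopFrames(30, 2));        // 60
console.log(segmentsAlign(30, 2, 2)); // true
console.log(segmentsAlign(30, 2, 6)); // true: three GOPs per segment
console.log(segmentsAlign(30, 4, 2)); // false: 2 s segments, 4 s keyframe interval
```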
Codec Trade-offs: H.264, H.265, and AV1
Picking a codec is genuinely one of the highest-leverage decisions in your video stack. It affects encoding cost, CDN bandwidth cost, and the devices that can play your video.
H.264 (AVC): The Safe Default
H.264 remains the most universally supported video codec on the planet. Every modern browser, phone, smart TV, and game console can decode it. Encoding complexity is low, tooling is mature, and the bitrate ranges are well understood: 1–3 Mbps for 720p, 3–6 Mbps for 1080p at reasonable quality.
H.264 is the right call when broad device reach is non-negotiable — corporate video platforms, UGC sites with unknown device distributions, or any live streaming scenario where low-latency encoding matters more than compression efficiency.
Here's a basic H.264 transcode with FFmpeg:
ffmpeg -i input.mp4 \
-c:v libx264 \
-preset slow \
-crf 22 \
-profile:v high \
-level 4.0 \
-g 60 \
-keyint_min 60 \
-sc_threshold 0 \
-c:a aac \
-b:a 128k \
output_1080p.mp4
The -g 60 and -keyint_min 60 flags set a fixed GOP of 60 frames (2 seconds at 30fps). The -sc_threshold 0 disables scene-change-triggered keyframes, which is critical for segment-aligned ABR packaging.
H.265 (HEVC): 40–50% Better Compression, Real Licensing Pain
H.265 delivers roughly 40–50% better compression than H.264 at equivalent perceptual quality. For 4K VOD, that compression advantage translates directly into CDN cost reduction. The per-GB price doesn't change, but you ship fewer gigabytes: at $0.08/GB, every gigabyte you would have delivered in H.264 becomes roughly 0.5–0.6 GB in H.265, or about $0.04–0.048 for the same content at comparable quality.
The catch is licensing complexity. HEVC is patent-encumbered, and the licensing pool fragmented badly enough that some browsers (notably Firefox and Chrome on certain platforms) still play it only where the device provides a hardware decoder. For 4K VOD delivered to TVs and iOS devices, H.265 is often the right choice. For browser-first platforms, it's messier.
AV1: Royalty-Free, Future-Forward, CPU-Hungry
AV1 is royalty-free and delivers approximately 30% better compression than HEVC, which is commonly summarized as roughly 50% better than H.264 at equivalent quality. For a platform delivering millions of hours of video, that's a meaningful infrastructure cost reduction.
The trade-off is encoding time and cost. Encoding AV1 with software (libaom-av1) is 10–50x slower than H.264. Hardware AV1 encoders (available in newer Nvidia, Intel, and AMD GPUs) have closed that gap significantly, but you're still paying more per encoding job.
Device support has matured considerably in 2025. Chrome, Firefox, Edge, Android, and newer smart TVs all decode AV1, using hardware decoders where the chipset has one. Safari plays AV1 only on Apple devices with a hardware AV1 decoder, so older iPhones, iPads, and Macs remain the main gap. For high-volume VOD archives (the kind where content sits for years and gets served billions of times), encoding once in AV1 pays for itself quickly. For short-shelf-life UGC content, the encoding cost rarely justifies it.
Choosing Your Codec: A Practical Framework
Rather than prescribing a single answer, think through these questions:
- Is device compatibility the top constraint? Use H.264.
- Is this 4K VOD on TV/iOS with CDN cost pressure? Use H.265.
- Is this a large-scale VOD archive with good encoding budget and modern device targets? Use AV1, potentially alongside H.264 as fallback.
- Is this live sports or low-latency live? Use H.264. AV1 and H.265 software encoding can't meet real-time deadlines reliably.
- Is this a major platform serving diverse devices? Consider multi-codec encoding — H.264 as universal fallback, AV1 for capable clients.
Netflix encodes the same title in multiple codecs and serves the appropriate one based on device capability detected at playback time. You can do this too if you have the encoding budget.
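If you want this framework in your pipeline configuration rather than in a wiki page, it can be sketched as a small decision function. The function name, the option names, and the order in which the questions are checked are all assumptions made for illustration; your own priorities may rank them differently.

```javascript
// Hypothetical helper mirroring the questions above.
// Returns the codecs to encode, most preferred first.
function chooseCodecs(needs = {}) {
  const {
    compatibilityFirst = false, // device reach is the top constraint
    lowLatencyLive = false,     // live sports or low-latency live
    uhdOnTvOrIos = false,       // 4K VOD on TV/iOS with CDN cost pressure
    largeArchive = false,       // long-lived VOD archive, modern devices
    diverseDevices = false,     // major platform serving mixed devices
  } = needs;

  if (compatibilityFirst || lowLatencyLive) return ['h264'];
  if (uhdOnTvOrIos) return ['hevc'];
  if (largeArchive || diverseDevices) return ['av1', 'h264']; // AV1 plus universal fallback
  return ['h264']; // the safe default
}

console.log(chooseCodecs({ lowLatencyLive: true })); // [ 'h264' ]
console.log(chooseCodecs({ uhdOnTvOrIos: true }));   // [ 'hevc' ]
console.log(chooseCodecs({ largeArchive: true }));   // [ 'av1', 'h264' ]
```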
Designing a Bitrate Ladder
The encoding ladder is the set of resolution/bitrate rendition pairs that your ABR player switches between based on network conditions. Getting this right is how you simultaneously improve quality for users on fast connections and prevent rebuffering for users on slow ones.
A reasonable H.264 ladder for a general-purpose VOD platform looks something like this:
360p / 800 kbps video + 96 kbps audio
480p / 1400 kbps video + 128 kbps audio
720p / 2800 kbps video + 128 kbps audio
1080p / 5000 kbps video + 192 kbps audio
The exact numbers matter less than the rationale behind them. Each rung should be meaningfully distinguishable from the one below it: if 720p and 1080p look nearly identical on a phone screen, you're wasting bandwidth on the 1080p rendition without user benefit. Conversely, the gap between rungs can't be so large that the player has to drop multiple quality levels when the network degrades.
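As a rough illustration of how a ladder turns into something a player can read, the sketch below renders the ladder above as an HLS multivariant (master) playlist and prints the step between adjacent rungs. The 16:9 pixel dimensions and the `360p/index.m3u8` style paths are assumptions, and BANDWIDTH here is simply video plus audio bitrate, whereas a real packager would report the measured peak.

```javascript
// The ladder from the text, with assumed 16:9 pixel dimensions.
const ladder = [
  { name: '360p',  width: 640,  height: 360,  videoKbps: 800,  audioKbps: 96 },
  { name: '480p',  width: 854,  height: 480,  videoKbps: 1400, audioKbps: 128 },
  { name: '720p',  width: 1280, height: 720,  videoKbps: 2800, audioKbps: 128 },
  { name: '1080p', width: 1920, height: 1080, videoKbps: 5000, audioKbps: 192 },
];

// Render one EXT-X-STREAM-INF entry per rung. Paths are hypothetical.
function masterPlaylist(rungs) {
  const lines = ['#EXTM3U'];
  for (const r of rungs) {
    const bandwidth = (r.videoKbps + r.audioKbps) * 1000; // bits per second
    lines.push(`#EXT-X-STREAM-INF:BANDWIDTH=${bandwidth},RESOLUTION=${r.width}x${r.height}`);
    lines.push(`${r.name}/index.m3u8`);
  }
  return lines.join('\n') + '\n';
}

// Ratio of each rung's video bitrate to the rung below it.
function rungRatios(rungs) {
  return rungs.slice(1).map((r, i) => r.videoKbps / rungs[i].videoKbps);
}

console.log(masterPlaylist(ladder));
console.log(rungRatios(ladder).map((x) => x.toFixed(2))); // [ '1.75', '2.00', '1.79' ]
```

Steps of roughly 1.5x to 2x between rungs are a common rule of thumb; much smaller and adjacent rungs look the same, much larger and a network dip forces a visible quality drop.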
The CDN Cost Equation
Here's concrete math on why your top rendition bitrate matters. Suppose you're delivering 1 million hours of video per month, and 30% of those hours are served at your 1080p rendition. At 5 Mbps average bitrate, that's:
1,000,000 hours × 0.30 × 3,600 s/hour × 5 Mbit/s ÷ 8 bits/byte
= 675,000,000 MB = 675,000 GB of bandwidth per month
At $0.08/GB CDN pricing, that's $54,000/month from just your top rendition. Drop that rendition bitrate from 5 Mbps to 4 Mbps without perceptible quality loss (something per-title encoding makes achievable) and delivery falls to 540,000 GB, saving $10,800/month. This is why encoding decisions are infrastructure cost decisions.
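The same calculation as a reusable function, so you can plug in your own traffic mix. This is a sketch using decimal units (1 GB = 1,000 MB) and the example's $0.08/GB price; the function names are illustrative.

```javascript
// Gigabytes delivered per month for one rendition.
// hours: total viewing hours, share: fraction served at this rendition.
function monthlyDeliveryGB(hours, share, mbps) {
  const seconds = hours * share * 3600;
  const megabits = seconds * mbps;
  return megabits / 8 / 1000; // Mbit -> MB -> GB
}

function monthlyCostUSD(hours, share, mbps, pricePerGB) {
  return monthlyDeliveryGB(hours, share, mbps) * pricePerGB;
}

const at5 = monthlyCostUSD(1_000_000, 0.30, 5, 0.08);
const at4 = monthlyCostUSD(1_000_000, 0.30, 4, 0.08);
console.log(monthlyDeliveryGB(1_000_000, 0.30, 5)); // 675000
console.log(Math.round(at5));                       // 54000
console.log(Math.round(at5 - at4));                 // 10800
```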
Segment Duration Trade-offs
Segment duration is the other variable developers underestimate. Shorter segments (2 seconds) let the ABR algorithm react faster to network changes but generate more HTTP requests, increasing origin load and player overhead. Longer segments (6 seconds) are more efficient to serve but mean the player is slower to respond to network drops.
For general VOD, 6-second segments hit a reasonable sweet spot. For low-latency live streaming, 2-second segments (or lower with LL-HLS) are necessary. Choose your segment duration before you build your packaging pipeline — it's painful to change later because it affects manifest structure, CDN cache key design, and player buffer assumptions.
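To put a number on the request overhead, here is a quick sketch that counts segment requests per viewer-hour for a single rendition. It ignores manifest refreshes and separate audio segments, both of which add to the total.

```javascript
// Number of media segments a player fetches for one hour of playback
// of a single rendition.
function segmentsPerHour(segmentSeconds) {
  return Math.ceil(3600 / segmentSeconds);
}

for (const s of [2, 4, 6]) {
  console.log(`${s}s segments: ${segmentsPerHour(s)} requests per viewer-hour`);
}
// 2s segments: 1800 requests per viewer-hour
// 4s segments: 900 requests per viewer-hour
// 6s segments: 600 requests per viewer-hour
```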
Per-Title Encoding: Why Static Ladders Leave Quality on the Table
A static bitrate ladder applies the same renditions to every video you encode. That 720p/2800kbps rendition might look excellent for a fast-motion live sports clip, but for a slow-panning nature documentary, it's massively over-provisioned — you could deliver the same quality at 1200kbps. Worse, a complex animation might look mediocre at 2800kbps because you've under-provisioned it.
Per-title encoding (also called per-scene or content-aware encoding) runs a complexity analysis pass before encoding to compute the optimal bitrate for each resolution based on the specific video content. The result: simpler content gets smaller files, complex content gets the bits it needs, and the average bitrate across your library drops 30–50%.
The Convex Hull Approach
Netflix's original per-title encoding paper introduced the convex hull concept: for a given video, encode it at many resolution/bitrate combinations, measure the quality of each point (using a metric like VMAF), and find the Pareto-optimal frontier — the set of renditions where you can't improve quality without increasing bitrate, or reduce bitrate without losing quality.
The practical implementation involves encoding test segments at multiple bitrate points per resolution and plotting VMAF scores against bitrate. The "knee" of each curve tells you the optimal bitrate for that resolution and content type.
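A minimal sketch of the selection step follows, assuming you already have VMAF scores for a grid of test encodes. The sample numbers are invented for illustration, and this is one straightforward way to compute the frontier rather than the method any particular vendor uses: first keep only points that beat every cheaper point, then take the upper convex hull of what remains.

```javascript
// Keep only encodes that deliver more quality than every cheaper encode.
function paretoFrontier(points) {
  const sorted = [...points].sort((a, b) => a.kbps - b.kbps || b.vmaf - a.vmaf);
  const frontier = [];
  let best = -Infinity;
  for (const p of sorted) {
    if (p.vmaf > best) {
      frontier.push(p);
      best = p.vmaf;
    }
  }
  return frontier;
}

// Upper convex hull of a frontier sorted by bitrate: drop any point that
// lies on or below the straight line joining its neighbours.
function upperHull(frontier) {
  const hull = [];
  for (const p of frontier) {
    while (hull.length >= 2) {
      const a = hull[hull.length - 2];
      const b = hull[hull.length - 1];
      const cross = (b.kbps - a.kbps) * (p.vmaf - a.vmaf) - (b.vmaf - a.vmaf) * (p.kbps - a.kbps);
      if (cross >= 0) hull.pop();
      else break;
    }
    hull.push(p);
  }
  return hull;
}

// Invented sample measurements: resolution, bitrate, VMAF.
const encodes = [
  { res: '540p', kbps: 800, vmaf: 62 },  { res: '720p', kbps: 800, vmaf: 55 },
  { res: '540p', kbps: 1500, vmaf: 74 }, { res: '720p', kbps: 1500, vmaf: 76 },
  { res: '1080p', kbps: 1500, vmaf: 68 }, { res: '540p', kbps: 2500, vmaf: 79 },
  { res: '720p', kbps: 2500, vmaf: 85 }, { res: '1080p', kbps: 2500, vmaf: 83 },
  { res: '720p', kbps: 4000, vmaf: 89 }, { res: '1080p', kbps: 4000, vmaf: 92 },
  { res: '1080p', kbps: 6000, vmaf: 95 },
];

const hull = upperHull(paretoFrontier(encodes));
console.log(hull.map((p) => `${p.res}@${p.kbps}`).join(' -> '));
// 540p@800 -> 720p@1500 -> 720p@2500 -> 1080p@4000 -> 1080p@6000
```

Reading along the hull tells you which resolution wins at each bitrate; in this made-up data, 1080p only becomes the best choice at around 4000 kbps.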
Mux has been doing this for years and has written extensively about the engineering behind it. The instant per-title encoding post covers how to do this without the latency penalty of a full-length probe encode.
How Mux Handles Cloud Encoding
Building and maintaining an encoding fleet is a significant operational investment. You need GPU instances for fast encoding, queue infrastructure for job management, retry logic for transient failures, packaging tooling for HLS/DASH output, and origin storage that a CDN can serve from. Then you need to keep all of it running at 99.9%+ availability while the video industry releases new codec standards and the device landscape shifts.
Mux's cloud encoding pipeline handles this as a managed service. The architecture follows a pattern any developer can reason about: ingest → encode → package → store → serve.
A typical Mux integration for VOD looks like this:
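const Mux = require('@mux/mux-node');
// new Mux() with no arguments picks up credentials from the environment
// (MUX_TOKEN_ID / MUX_TOKEN_SECRET)
const { video } = new Mux();
async function main() {
  // Create an asset from a publicly accessible URL
  const asset = await video.assets.create({
    inputs: [{ url: 'https://your-storage.com/source-video.mp4' }],
    playback_policy: ['public'],
  });
  console.log('Asset ID:', asset.id);
  console.log('Status:', asset.status); // 'preparing'
}
// CommonJS has no top-level await, so run the call inside an async function
main().catch(console.error);
Rather than polling the status field in a loop, use webhooks to react to state changes: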
// In your webhook handler (Express example)
// updateCMS, generateThumbnail, and notifyOnCall are placeholders for your own code
app.post('/webhooks/mux', express.raw({ type: 'application/json' }), (req, res) => {
  // express.raw() leaves req.body as a Buffer; in production, verify the
  // Mux-Signature header against this raw body before trusting the payload
  const event = JSON.parse(req.body);
  if (event.type === 'video.asset.ready') {
    const assetId = event.data.id;
    const playbackId = event.data.playback_ids?.[0]?.id;
    // Trigger downstream steps: attach captions, update CMS, generate thumbnails
    updateCMS({ assetId, playbackId });
    generateThumbnail(assetId);
  }
  if (event.type === 'video.asset.errored') {
    // Handle encoding failure: alert, retry, or escalate
    notifyOnCall(event.data);
  }
  res.sendStatus(200);
});
This webhook-driven pattern eliminates polling loops and naturally composes with downstream jobs. Thumbnail generation, caption attachment, search index updates, and CMS metadata writes all trigger off the same video.asset.ready event.
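Once more than a couple of jobs hang off the same event, a small registry keeps the handler readable. The sketch below is one possible shape, not part of any SDK: `on` and `dispatch` are hypothetical names, the in-memory `seen` set stands in for a durable store, and it assumes each webhook event carries a unique `id` so that redelivered events can be skipped.

```javascript
// Hypothetical registry: event type -> list of async handlers.
const handlers = new Map();
// IDs of events already processed. In production this would live in a
// database or cache so it survives restarts.
const seen = new Set();

function on(type, fn) {
  if (!handlers.has(type)) handlers.set(type, []);
  handlers.get(type).push(fn);
}

// Runs every handler registered for the event type and returns how many ran.
// The event is marked as seen only after all handlers succeed, so a failed
// delivery can be retried.
async function dispatch(event) {
  if (seen.has(event.id)) return 0; // duplicate delivery, already handled
  const fns = handlers.get(event.type) ?? [];
  await Promise.all(fns.map((fn) => fn(event.data)));
  seen.add(event.id);
  return fns.length;
}

on('video.asset.ready', async (data) => console.log('update CMS for', data.id));
on('video.asset.ready', async (data) => console.log('generate thumbnail for', data.id));
on('video.asset.errored', async (data) => console.log('page on-call about', data.id));
```

Inside the webhook route you would call `await dispatch(event)` after parsing the body, and respond with a non-2xx status if it throws so the event is delivered again.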
Mux also publishes work on how their encoding approach has evolved over time — from per-title encoding at scale to audience adaptive encoding, which goes further by optimizing renditions based on the actual device capabilities of a title's real viewers.
Quality Metrics: How to Know Your Encode Is Actually Good
"Looks good to me" doesn't scale to an automated encoding pipeline. You need objective metrics you can run programmatically and use as quality gates.
PSNR (Peak Signal-to-Noise Ratio) is the oldest metric. It is computed from the mean squared error between the original and encoded frames and expressed in decibels, where higher means closer to the source. It's fast to compute and useful for regression testing: if PSNR drops suddenly between encoder versions, something changed. But PSNR correlates poorly with human perception. An encode can have high PSNR and still look terrible.
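For reference, the standard definition is PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible sample value (255 for 8-bit video). A minimal sketch over two flat arrays of samples, with tiny made-up inputs:

```javascript
// PSNR in decibels between two frames given as equal-length arrays of samples.
function psnr(reference, distorted, maxValue = 255) {
  if (reference.length === 0 || reference.length !== distorted.length) {
    throw new Error('frames must be non-empty and the same size');
  }
  let sumSquares = 0;
  for (let i = 0; i < reference.length; i++) {
    const diff = reference[i] - distorted[i];
    sumSquares += diff * diff;
  }
  const mse = sumSquares / reference.length;
  if (mse === 0) return Infinity; // identical frames
  return 10 * Math.log10((maxValue * maxValue) / mse);
}

const ref = [100, 150, 200, 250];
const enc = [101, 149, 202, 248];
console.log(psnr(ref, enc).toFixed(2), 'dB'); // 44.15 dB
console.log(psnr(ref, ref));                  // Infinity
```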
SSIM (Structural Similarity Index) measures structural information, luminance, and contrast — it models how human visual perception weighs different types of distortion. Better than PSNR for perceptual quality assessment and well supported in FFmpeg.
VMAF (Video Multimethod Assessment Fusion), developed by Netflix, is the current standard for production encoding pipelines. VMAF is trained on human opinion scores and combines multiple quality metrics into a single score (0–100) that correlates well with how humans actually rate video quality. Running VMAF with FFmpeg:
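ffmpeg -i encoded.mp4 -i source.mp4 \
-lavfi "[0:v][1:v]libvmaf=log_path=vmaf_output.json:log_fmt=json" \
-f null -
Note the input order: libvmaf treats the first input as the distorted (encoded) video and the second as the reference. Both must have the same resolution, so scale lower-resolution renditions up to the source resolution before comparing. A VMAF score above 93 is generally considered transparent quality, meaning viewers can't distinguish it from the source. Scores below 75 are perceptually poor. You can gate your encoding pipeline to fail an asset if VMAF drops below a defined threshold, giving you automated quality assurance at scale.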
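A quality gate over that JSON log can be sketched in a few lines. This assumes the layout written by recent libvmaf versions, with a pooled mean under `pooled_metrics.vmaf.mean` and per-frame scores under `frames[].metrics.vmaf`; check the structure of your own log before relying on it. The function names and the default threshold are illustrative.

```javascript
// Mean VMAF from a parsed libvmaf JSON log.
// Assumption: pooled_metrics.vmaf.mean exists (libvmaf 2.x style logs);
// otherwise fall back to averaging the per-frame scores.
function vmafMean(log) {
  const pooled = log.pooled_metrics?.vmaf?.mean;
  if (typeof pooled === 'number') return pooled;
  const scores = (log.frames ?? []).map((f) => f.metrics.vmaf);
  if (scores.length === 0) throw new Error('no VMAF scores found in log');
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// True when the encode meets the threshold and may proceed to packaging.
function passesQualityGate(log, threshold = 93) {
  return vmafMean(log) >= threshold;
}

const sampleLog = { pooled_metrics: { vmaf: { mean: 95.2 } } };
console.log(vmafMean(sampleLog));              // 95.2
console.log(passesQualityGate(sampleLog));     // true
console.log(passesQualityGate(sampleLog, 96)); // false
```

In a pipeline you would load the log with `JSON.parse(fs.readFileSync('vmaf_output.json', 'utf8'))` and fail the job when the gate returns false. A mean can hide short bad stretches, so some teams also gate on a low percentile or the minimum.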
Live Encoding: Different Rules Apply
Live encoding operates under constraints that make most of the above advice inapplicable. You have no look-ahead — the encoder processes frames as they arrive. Real-time deadlines mean you can't afford slow encoder presets. AV1 software encoding is completely off the table for true live.
The critical discipline for live encoding is fixed GOP size. Every ABR player expects segment boundaries to align with keyframes. If segments don't start on a keyframe, which is what happens when the segment duration isn't a whole multiple of the keyframe interval, players can stall or show corrupted frames at segment boundaries. The simplest way to guarantee alignment is one keyframe per segment. With 30fps video and 2-second segments, that means a keyframe every 60 frames, locked:
ffmpeg -i rtmp://ingest/stream \
-c:v libx264 \
-preset veryfast \
-tune zerolatency \
-g 60 \
-keyint_min 60 \
-sc_threshold 0 \
-b:v 3000k \
-c:a aac \
-b:a 128k \
-f hls \
-hls_time 2 \
-hls_flags delete_segments \
output.m3u8
B-frames add latency because the encoder needs future frames to encode them. For low-latency live, disable B-frames (-bf 0; libx264's -tune zerolatency, used above, already does this) and accept the slight compression penalty in exchange for lower glass-to-glass latency.
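Hard-coding 60 works until someone feeds the pipeline a 25fps or 60fps source. One way to avoid that is to derive the GOP from the frame rate and segment duration when building the command. The sketch below only assembles the argument list, using the same flags as the command above; `liveHlsArgs` and its option names are hypothetical.

```javascript
// Build FFmpeg arguments for a live HLS encode with one keyframe per segment.
function liveHlsArgs({ input, output, fps, segmentSeconds, videoKbps, lowLatency = false }) {
  const gop = Math.round(fps * segmentSeconds); // frames per segment
  const args = [
    '-i', input,
    '-c:v', 'libx264', '-preset', 'veryfast', '-tune', 'zerolatency',
    '-g', String(gop), '-keyint_min', String(gop), '-sc_threshold', '0',
  ];
  if (lowLatency) args.push('-bf', '0'); // explicit, even though zerolatency implies it
  args.push(
    '-b:v', `${videoKbps}k`,
    '-c:a', 'aac', '-b:a', '128k',
    '-f', 'hls', '-hls_time', String(segmentSeconds),
    '-hls_flags', 'delete_segments',
    output,
  );
  return args;
}

const liveArgs = liveHlsArgs({
  input: 'rtmp://ingest/stream',
  output: 'output.m3u8',
  fps: 30,
  segmentSeconds: 2,
  videoKbps: 3000,
  lowLatency: true,
});
console.log(['ffmpeg', ...liveArgs].join(' '));
```

The resulting array can be passed to `child_process.spawn('ffmpeg', liveArgs)`, which avoids shell quoting issues with the ingest URL.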
When a live stream ends, the cloud encoding pipeline should automatically package the recording into a seekable VOD asset — this is sometimes called live-to-VOD. Mux handles this automatically; the live recording becomes a standard asset you can query and play back with the same playback infrastructure.
Conclusion: Four Decisions, One Pipeline
Most video quality and cost problems trace back to four decisions made early in the stack: codec choice, bitrate ladder design, static vs. per-title encoding, and pipeline architecture. Get these right and everything downstream — CDN costs, startup time, rebuffer rate, quality of experience — improves.
The build-vs-buy question on encoding infrastructure has a clear inflection point. At low volume, a manual FFmpeg pipeline on a few instances is totally reasonable. As volume grows, the operational burden of maintaining encoding infrastructure, handling failures gracefully, staying current with codec tooling, and running quality validation at scale starts to dominate engineering time. That's the point where a managed cloud encoding API earns its cost back quickly.
If you're evaluating where you sit on that spectrum, Mux's cloud video encoding handles the encoding infrastructure, per-title optimization, HLS packaging, and CDN delivery in a single API call — so your engineers can focus on the product layer rather than the pipeline layer.
For deeper reading, the adaptive bitrate streaming guide covers ABR algorithm behavior in detail, and what is perceptual quality goes deeper on VMAF and how to think about quality measurement in production systems.