How Mux detects shot boundaries

Shots is a new Mux Video primitive that generates a manifest of shot boundaries and representative shot images for any assets. It's the foundation for a brand new Mux Robots workflow, Find Scenes. Read Victor's post about how the Mux Robots team used Shots to detect scenes.

In this post, we'll go into more detail on the algorithm behind Shots.

The goal

The algorithm must detect the boundaries of each shot. It should be a simple, one-pass algorithm that uses a limited amount of resources. We can’t always avoid invalid results, but we tolerate false positives more than false negatives.

Basic metrics

In order to detect shots within a video, we need to analyze pixel changes between each frame. FFmpeg exposes three basic metrics to help us calculate the differences between them. Let’s start with the definitions of basic metrics that FFmpeg exposes natively:

Sum of absolute differences

\begin{align*} & \small \text{L(x,y,t) - luminance of pixel (x,y) at time t}\newline &\large SAD=\sum_{x,y}|L(x,y,t)-L(x,y,t-1)| \end{align*}

High difference in luminance means there was a visible change in two consecutive frames. That is a good indicator for possible shot change.

Mean Average Frame Difference:

MAFD=\frac{SAD}{width \times height}

Shot change score (MAFD with values normalized to 0-100):

\large score=\frac{MAFD}{2^{bit\_depth}-1}

First approach: Set a threshold for score

Setting a threshold for score is the most straightforward approach and actually the one that FFmpeg uses. The problems begin once we try to find the threshold value. There is no value that would work for every case. It should be higher for dynamic shots without cuts. On the other hand we need a lower value when a new shot starts, but it is quite similar to the previous one.

Choosing different thresholds for different videos would require much more complex analysis, so that is not an option for a single-pass algorithm.

Assuming we prefer false positives, the threshold would be low. Low threshold means more frames flagged as shot changes. The result is almost no false negatives, but there are way too many false positives.

Second approach: Search for peaks

Another approach would be to search for short peaks in shot change score. We set a threshold for a ratio of current score to moving average for three previous frames. This eliminated the threshold sensitivity problem from the first approach. Still, there was one problematic case. When a scene is static for an extended period, the moving average of scores approaches zero. Any subsequent small change produces a disproportionately large ratio, causing false positives.

Solution

Both previous approaches created way too many false positives, but those false positives were not overlapping. Each algorithm detected them in different places.

The best solution was to combine both approaches. The final algorithm assumes that shot change happens when BOTH algorithms detect a shot change. The precise definition of shot change is:

\begin{aligned}&\large score(t)>3\space\space\space\space AND \space\space\space\space \frac{score(t)}{avg\_score}>10 \ \newline \newline &\small avg\_score=\frac{\sum_{i=1..3}{score(t-i)}}{3}\newline\end{aligned}

This approach gives very good results and is resource-efficient. It works in one pass, using only the last three values.

Example

Red wine being poured into a wine glass in slow motion, with visible bubbles and foam forming at the surface. A second empty wine glass is partially visible in the background. The scene is shot on a dark, moody background with warm, dramatic lighting.

There are three shots in this video. That means exactly two shot changes should be detected.

First approach gives one more false positive in the middle.

A shot change score graph showing a blue score line and a red threshold line. Two spikes in the blue line cross the threshold and are labeled "New Shot" on the left and right. A smaller spike in the middle is labeled "False positive quick zoom" in red, indicating a spike that crossed the threshold due to a quick zoom rather than an actual shot change. Below the graph, two video frames of a dimly lit bar or restaurant interior are shown side by side with an arrow between them, illustrating the subtle camera movement that triggered the false positive.

Second approach gives two false positives, but in different places

A line chart titled "Score / Moving Average" with a blue score/average line and a red threshold line at 10. Two tall spikes are labeled "New shot" and two smaller threshold-crossing spikes are labeled "False positive" in red with arrows. Below the chart, two video stills are shown: a dark restaurant interior captioned "Very slow zoom followed by more details appearing on the sides," and a dimly lit dining space captioned "Static view with changes in lights and shadows."

After combining both approaches we get exactly two shot boundaries.

A combined line chart showing three series: blue (score), red (score/average), and yellow (threshold). Two tall spikes are labeled "Shot change detected by both algorithms." Three smaller spikes that cross the threshold are labeled in red as false positives — one from the first algorithm and two from the second algorithm.

Known failure cases

This algorithm works well. Still, there are some videos where it fails. There is one known case that creates false negatives - smooth transitions.

A GIF of a peaceful, sunlit animated scene featuring a mossy hill with a small cave entrance, surrounded by trees with dappled light filtering through the leaves.

A line chart titled "Shot Change Score" with a blue score line and a red threshold line at 3. The blue line remains well below the threshold throughout, with the highest spike barely reaching 1.6, indicating no shot changes detected.

The score changes are gradual and never spike above threshold. There is no simple algorithmic solution for this case. It would require at least a basic understanding of the content of each shot to distinguish a smooth transition from a true shot change.

The most problematic case with the algorithm returning a false positive is a video with flashing lights. The best example may be a video made with a phone camera at a concert. There is one long shot, but this algorithm will detect a shot change during each flash.

A slow-motion GIF of a crowd at a concert or club event, shot from behind. Hands are raised against a backdrop of dramatic blue stage lighting with light beams cutting through haze above the crowd.

A line chart titled "Shot Change Score" with a blue score line and a red threshold line at approximately 3. The blue line is highly active throughout, with numerous spikes frequently crossing the threshold, making it difficult to distinguish true shot changes from noise.

There are some more interesting cases like a slideshow covering only part of the screen or a sharp change in zoom. These are edge cases that fall outside the algorithm's scope by design.

Learn more about how to use Shots in our guide.

How Mux detects shot boundaries

The goal

Basic metrics

Sum of absolute differences

Mean Average Frame Difference:

Shot change score (MAFD with values normalized to 0-100):

First approach: Set a threshold for score

Second approach: Search for peaks

Solution

Example

Known failure cases

Written By

Grzegorz Gronkowski – Senior Software Engineer

Leave your wallet
where it is

Read more like this

React Native needs a new video player

Fine-tuning a multimodal model for video intelligence

The death of the to-do app

Check out our newsletter

How Mux detects shot boundaries

Written By

Leave your wallet where it is

Read more like this

React Native needs a new video player

Fine-tuning a multimodal model for video intelligence

The death of the to-do app

Check out our newsletter

Leave your wallet
where it is