One of our goals with Mux Data is to provide meaningful data about video performance at a quick glance, while also providing the ability to drill deep when needed.
We created the Viewer Experience Score when we launched Mux Data in 2016. The Viewer Experience Score is a single number that describes how satisfied users are with video streaming performance. This overall score is composed of four underlying scores, covering playback failures, rebuffering, startup time, and video quality. Each of these describes how satisfied a user is with that element of video performance.
Think of the inverted pyramid used in journalism. The headline of an article summarizes the entire story in just a few words. The first paragraph summarizes what's most important. By the middle of the article, you're in the finer details, and supporting/background material waits until the end.
A Viewer Experience Score is like this. The headline is a single number that describes how happy your users are with system performance (e.g. 75). Dig deeper, and you find out why (e.g. 3.5% of users are seeing HTML5 playback failures, and these failures are concentrated on MP4 playback). At the bottom is every single view that was measured (a user in Toronto tried to watch video 19284, and saw an error after 30 seconds).
This system helps our customers focus on the actual experience of watching video (Quality of Experience), and not just the underlying service-level performance (Quality of Service). It's one thing to learn that you have a rebuffer frequency of 0.34 events per minute. That's useful data, but it doesn't really tell you how happy your users are. It's better to have a layer on top that says how happy users are with the amount of rebuffering they're seeing.
Our original methodology was inspired by the Apdex score, created by New Relic (and an open alliance) to measure application performance. Apdex divides performance into three buckets: satisfactory, tolerating, and frustrated. Satisfactory requests are given a score of 1.0; tolerating a 0.5; and frustrated a 0.0. So if an application has 60% satisfactory performance (60% * 1.0), 30% tolerating performance (30% * 0.5), and 10% frustrated performance (10% * 0.0), the total score would be 0.75.
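The Apdex calculation above is just a weighted average over the three buckets. A minimal sketch:

```python
def apdex(satisfied, tolerating, frustrated):
    # Apdex weights: satisfied counts as 1.0, tolerating as 0.5,
    # frustrated as 0.0, divided by the total number of requests.
    total = satisfied + tolerating + frustrated
    return (satisfied + 0.5 * tolerating) / total

print(apdex(60, 30, 10))  # 0.75
```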
We originally used a similar methodology for our Viewer Experience Score. Based on research, we learned that 2 seconds of startup time was generally considered satisfactory for web video, and 8 seconds was generally frustrating. So a view with 1.75 seconds of startup time got a 100, a view with 5 seconds got a 50, and a view with 10 seconds got a 0.
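In code, the original bucketed approach looked roughly like this (exact boundary handling at the 2-second and 8-second thresholds is an assumption):

```python
def old_startup_score(seconds):
    # Original bucketed methodology: ~2s of startup was considered
    # satisfactory for web video, ~8s was considered frustrating.
    if seconds < 2:
        return 100   # satisfactory
    if seconds < 8:
        return 50    # tolerating
    return 0         # frustrated
```

So 1.75 seconds scores 100, 5 seconds scores 50, and 10 seconds scores 0, matching the examples above.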
This methodology generally worked well, but based on customer feedback and user research, it has some problems, and we think we can do better.
- It's too binary. There are degrees of bad performance that are lost with the hard thresholds of 100, 50, and 0.
- It stops at "frustrating". The experience of 20 seconds of startup time is worse than the experience of 9 seconds, but our methodology flattened these into the same score.
- Some of our users find it confusing for an unexpected reason. Many of our users learned an A-F grading system on a 100-point scale in school, and associate a 90% with an "A" and anything below 60% with an "F." This introduced confusion and frustration, since our 0-100 scores were not designed with this in mind. A score of 70 was actually above average, but our customers didn't see it that way.
- Finally, the experience of these metrics is not linear. 8 seconds of startup time is not exactly 4 times worse than 2 seconds of startup time. In our research, we actually found that the importance of each metric had a slightly different shape, which can be plotted on a graph sometimes called a utility curve (or indifference curve).
We're excited to announce that we have released a new version of our Viewer Experience Score.
The new methodology keeps many of the features of our earlier methodology. For example, it still focuses on viewer satisfaction and frustration, and still relies on the four dimensions of performance that make up overall experience.
We've made three significant changes, however.
Change from binary scores to a function. 2 seconds of startup time isn't the same as 8 seconds, but our old methodology would have assigned exactly 50 points to each. In the new methodology, 2 seconds of startup time will be an 80, while 8 seconds will be 50.
Capture tradeoffs. We did pairwise research into how users experience tradeoffs in video performance - for example, would you rather have a little rebuffering and a lot of startup time, or a little startup time and a lot of rebuffering - and used that to plot abstract utility curves for each dimension. Look for another blog post on that later.
Extend the scale. Rather than assigning 0 points to the point at which a user first becomes "frustrated," we now assign that point 50 points. This allows us to continue scoring performance beyond the first point at which a viewer is frustrated. Some scores will still reach 0 at a certain point (e.g. after x rebuffering, the score is 0), while others will get close to zero but never reach it (e.g. 30 seconds of startup time gets a score of 6).
Abstractly, the new formula looks like this.
- If viewers are satisfied with performance, the score is 100. (This is unchanged.)
- The score at which viewers become frustrated with performance is now 50. (This is changed from 0.)
- Each score is now computed with a simple formula that describes the relative importance of performance along a utility curve. (This is new.)
Playback Success Score is fairly simple. A failure that ends playback is a 0, while a video that plays through without failure is 100. This is unchanged.
What's new is that we now give a score of 50 if a viewer exits before a video starts (EBVS).
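Those three cases can be sketched as a simple decision function (argument names are illustrative):

```python
def playback_success_score(playback_failed, exited_before_video_start):
    # 0 for a failure that ends playback, 50 for an exit before video
    # start (EBVS), 100 for a playthrough without failure.
    if playback_failed:
        return 0
    if exited_before_video_start:
        return 50
    return 100
```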
Startup Time Score describes how happy or unhappy viewers are with startup time. Longer startup times mean lower scores, while shorter startup times mean higher scores. Once startup time passes a certain point (around 8 seconds), the rate of score decay decreases, since each additional second of startup matters less when startup is already long.
The Startup Time Score decreases quickly after the first 500 milliseconds:
- 400 ms: 95
- 2 seconds: 80
- 8 seconds: 50
- 20 seconds: 29
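One simple curve consistent with all four sample values above is score(t) = 100 / (1 + t/8), where t is startup time in seconds. This is an illustrative fit to the published points, not necessarily the exact production formula:

```python
def startup_time_score(seconds):
    # Illustrative fit to the sample points in the post; the real
    # curve comes from Mux's utility-curve research and may differ.
    return 100 / (1 + seconds / 8)

for t in (0.4, 2, 8, 20):
    print(t, round(startup_time_score(t)))  # 95, 80, 50, 29
```

Note that this curve approaches but never reaches zero, matching the "extend the scale" change described earlier.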
Smoothness Score measures the amount of rebuffering a viewer sees when watching video. A higher Smoothness Score means the user sees less (or no) rebuffering, while a lower score means a user sees more rebuffering.
- No rebuffering: 100
- 5-minute video with a single 5-second rebuffer: average of 80 and 90 = 85
- 20-minute video with four 15-second rebuffers: average of 44 and 60 = 52
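The worked examples read as a simple average of two sub-scores, which we assume here to be a rebuffer-frequency score and a rebuffer-duration score (the derivation of each sub-score is not shown in this post):

```python
def smoothness_score(count_score, duration_score):
    # Hypothetical combination: average the two rebuffering sub-scores
    # (assumed: one for rebuffer frequency, one for rebuffer duration).
    return (count_score + duration_score) / 2

print(smoothness_score(80, 90))  # 85.0
```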
Video Quality Score measures the visual quality a user sees by comparing the resolution of a video stream to the resolution of the player in which it is played. If a video stream is significantly upscaled, quality generally suffers, and viewers have an unacceptable experience.
Note that video quality is notoriously difficult to quantify, especially in a reference-free way (without comparing a video to a pristine master). Bitrate doesn't work, since the same bitrate may look excellent on one video and terrible on another.
Several factors contribute to actual video quality: bitrate, codec, content type, and the quality of the original source. However, if content is encoded well and at the right bitrates, upscaling tracks reasonably well to video quality.
- No upscaling: 100
- 50% upscaling throughout: 85
- 200% upscaling for the first 30 seconds, and no upscaling for the next 20 minutes: 92
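Upscaling itself is straightforward to measure by comparing stream resolution to player resolution. A minimal sketch (names are hypothetical, and the mapping from upscaling percentage to points is not shown here):

```python
def upscaling_percent(stream_height, player_height):
    # How far the player upscales the stream, as a percentage.
    # 0 means the stream is rendered at or above its native size.
    return max(0.0, player_height / stream_height - 1.0) * 100

print(upscaling_percent(720, 1080))  # 50.0 (a 720p stream in a 1080p player)
```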
Overall Score is the combination of four underlying component scores, which track the four major categories of video performance - playback success, startup time, smoothness/rebuffering, and video quality.
The tradeoff functions are based on research into the relative tradeoffs between improving one metric at the expense of another. For example, you can increase Quality at the expense of Startup Time, and vice versa. However, doing so would be a bad idea, because Startup Time is more valuable than Quality. Generally, we found that Playback Success is the most important, followed by Smoothness, then Startup Time, and finally Quality.
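As a rough sketch, the overall combination can be thought of as a weighted blend of the four component scores. The weights below are hypothetical, chosen only to reflect the stated ordering (Playback Success > Smoothness > Startup Time > Quality); Mux's actual tradeoff functions are more nuanced than a fixed linear weighting:

```python
# Hypothetical weights reflecting only the stated importance ordering.
WEIGHTS = {"playback": 0.40, "smoothness": 0.25, "startup": 0.20, "quality": 0.15}

def overall_score(component_scores):
    # Weighted average of the four component scores (each 0-100).
    return sum(w * component_scores[k] for k, w in WEIGHTS.items())

print(overall_score({"playback": 100, "smoothness": 85, "startup": 80, "quality": 92}))
```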
In a future post, data scientist Ben Dodson will talk more about the methodology we use to update our scores, including a more detailed description of the tradeoff functions. Look for that soon.
In the meantime, get in touch with any questions.