December 5, 2022
I’ve seen a lot of captions over the years, and there’s one thing I can tell you with certainty: The way that custom controls and the HTML video element interact right now is inadequate, and I think there’s some work we can do to fix it.
The fact that I’m now an editor of the WebVTT specification and a Chair of the Timed Text Working Group is no accident: I’ve been a heavy captions user since I was a little kid.
Growing up in Israel, I watched most of my favorite shows on an old CRT in the living room. A lot of programming had open captions, or captions that are always shown on screen, due to accessibility requirements. In addition, a lot of imported non–Hebrew language content was subtitled rather than dubbed, unless it was children’s content, because it’s generally a lot more cost-effective to subtitle films and shows than to dub them.1
When I moved to the US in 2000, I continued watching subtitled and captioned content to help me learn English and also to be able to see more non-Hebrew, non-English movies and TV shows — films like “Amelie” and anime like “Cowboy Bebop.”
If you’re a big anime fan like me, unless you know or took the time to learn Japanese (generally considered to be in the hardest class of languages to learn, although this didn’t prevent me from trying), you wouldn’t have been able to see a lot of the shows if not for subtitles. There have also been great strides in bringing more dubbing into anime in recent years, because having both means greater access. Do you have a favorite movie or show that you wouldn’t have been able to watch if not for subtitles? Let me know!
Nowadays, I have captions turned on basically all the time for online streaming services, because it’s hard to hear dialog (although my new soundbar with a center channel definitely helps with that). It also helps increase comprehension, focus, and retention, regardless of dialog audio levels.
(We at Mux think that captions are super important, so we worked to bring you auto-generated captions for your live streams with free minutes! Check out the blog post.)
As I was graduating from college, the FCC adopted new rules that required captions on content that was originally shown on TV and later made available online. I ended up joining a company at just the right time to work on implementing improved captions support. I found this work compelling, so I’ve continued with it, notably adding the improved captions support in Video.js. After my first Demuxed talk on captions, I expressed interest in helping out with WebVTT. Then, in 2018, as the previous editor of WebVTT was moving on to greater things, I joined the Timed Text Working Group to help move WebVTT forward (it’s been a long journey, and there’s still a lot of work there).
When I was improving captioning support in Video.js in 2014, native text track support (I’ll call this “native captions” below) was still a pretty new feature, which meant lots of browser bugs and cross-browser compatibility issues.
When building Mux Player, one of our goals was to avoid deviating from the browser’s native video implementation as much as possible. With that goal in mind, I've had the chance to approach things from a new direction.
Native captions support has come a long way since 2014, so we wanted to continue using native text tracks if possible.
Unlike native captions, built-in player controls are very limiting. A lot of web players implement their own controls, not only for functionality but also for the ability to customize the visual design. Accessibility is another reason: some native controls have complicated interactions with assistive technologies, and implementing the controls manually (using the correct primitives) can produce a more accessible result.
If you create custom controls and then enable native captions, unless you're Crunchyroll, your controls probably take up space at the bottom of the display area. This is the same area where the captions appear, which means that when the controls are displayed, they may overlay the captions. (Please don't hide the captions altogether! Unfortunately, I've seen this on some players that shall remain nameless.)
This is a bad user experience, particularly for users who rely on captions, because they will miss content when the controls are visible. This isn't a problem with native controls and native captions because the captions will move out of the way when the controls are shown. I wanted to try and address this behavior with a cross-browser solution.
There are a couple of ways to change where the captions show up. You can 1) change the cue settings or 2) apply some CSS.
WebVTT has support for positioning each cue via cue settings right within the WebVTT file. This might be useful, but we don't want to require Mux users to modify their files to use our player, so this wasn't going to work for us.
Is it possible to apply CSS to move the cues out of the way? As with a lot of things in computing, the answer is "it depends." WebVTT has a CSS extension for the ::cue pseudo-element, which allows you to apply styles to each cue. However, it limits which properties can be applied (mostly text-level ones such as color, background, and font), so a rule that tries to reposition the cue, such as a transform on ::cue, is going to get ignored by the browser, unfortunately.
Are there other ways to move captions using CSS? Yes, if you're in Safari or Chrome (and likely other Chromium-based browsers, although I haven't tested extensively). This is because WebKit and Blink (remember that Blink was originally forked from WebKit?) render the cues into the video element's shadow DOM and expose pseudo-elements that we can use to target them, specifically ::-webkit-media-text-track-display.
Now, the trick is to make the captions move as expected. Using something like bottom: 3em could work, but it can produce some unexpected results. For example, I’ve seen captions that jumped up twice each time a new cue was shown on screen. Ultimately, the best course of action is to use transform to translate the pseudo-element up, clear of the controls.
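Here is a minimal sketch of the transform approach. It is not Mux Player's code: the helper names and the em-based offset are assumptions for illustration, and the rule is generated from JavaScript only so that the offset can follow the height of your own control bar.

```javascript
// Hypothetical helper (not from Mux Player): builds a CSS rule that lifts
// the caption container exposed by WebKit and Blink by `offsetEm` ems.
function buildCaptionShiftRule(offsetEm) {
  return `video::-webkit-media-text-track-display { transform: translateY(-${offsetEm}em); }`;
}

// Injects the rule into `doc` (normally `document`) and returns the <style>
// element. Removing that element restores the default caption position.
function applyCaptionShift(doc, offsetEm) {
  const style = doc.createElement('style');
  style.textContent = buildCaptionShiftRule(offsetEm);
  doc.head.appendChild(style);
  return style;
}
```

In a page, something like `const style = applyCaptionShift(document, 3)` would run when the controls appear, and `style.remove()` when they hide again. Browsers that don't expose the pseudo-element simply ignore the rule.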
Unfortunately, Firefox doesn't expose any pseudo-elements or other ways to target the display area via CSS.
Remember when you asked about whether you can programmatically modify the cue settings? Well, the answer is that you can!
All the settings can be modified on the VTTCue object, and the changes should be reflected immediately.
The cue setting line controls the position of the cue from the top edge of the video or from either the left or right edge, in the case of vertical cues.
This setting can be a positive value that represents the line number from the top edge of the video (or a negative value for lines from the bottom edge). Or it can be a percentage of the video height or width, depending on the cue’s orientation.
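To make the two forms of the line setting concrete, here is a small sketch. The function names are made up for illustration. The `line` and `snapToLines` properties are part of the VTTCue interface, where `snapToLines` tells the browser whether `line` is a line number or a percentage.

```javascript
// `cue` stands in for a VTTCue; only its `line` and `snapToLines`
// properties are touched, so any object with those fields works here.

// Place the cue on a numbered line. Non-negative numbers count from the top
// edge; negative numbers count from the bottom edge, so -1 is the last line.
function placeCueOnLine(cue, lineNumber) {
  cue.snapToLines = true; // interpret `line` as a line number
  cue.line = lineNumber;
}

// Place the cue at a percentage of the video's height (or of its width,
// for vertical cues).
function placeCueAtPercent(cue, percent) {
  cue.snapToLines = false; // interpret `line` as a percentage
  cue.line = percent;
}
```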
So how does this technique work? Well, when the controls are showing, we modify the cue’s line property to subtract the height of the control bar. Then, when the controls hide again, we restore the line property back to what it was before.
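A rough sketch of that move-and-restore step might look like the following. It is not the Mux Player implementation. `linesToClear` is a hypothetical count of caption lines that the control bar covers, and treating `auto` as -1 assumes a single showing track whose cues snap to lines.

```javascript
// Remembers each cue's original `line` so it can be put back later.
const originalLines = new WeakMap();

// Shift a cue up by `linesToClear` lines while the controls are showing.
function moveCueUp(cue, linesToClear) {
  if (originalLines.has(cue)) return; // already moved, don't shift twice
  originalLines.set(cue, cue.line);
  // Assumption: 'auto' behaves like -1 (the last line) for a single track.
  const base = cue.line === 'auto' ? -1 : cue.line;
  cue.line = base - linesToClear;
}

// Put the cue back where it was once the controls hide again.
function restoreCue(cue) {
  if (!originalLines.has(cue)) return;
  cue.line = originalLines.get(cue);
  originalLines.delete(cue);
}
```

The WeakMap keeps the saved values out of the cue objects themselves and lets them be garbage collected along with the cues.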
But what about cues that aren’t at the bottom of the screen? Chances are they’re OK staying where they are. This means we’d want to check if a cue’s line property is either auto or at the bottom of the video — say, if line is between -1 and -5 (ignoring larger positive values here for simplicity).
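That check could be sketched as a small predicate. The function name is hypothetical, and skipping percentage-based lines is an extra simplification on my part, in the same spirit as ignoring large positive values.

```javascript
// Returns true when a cue is likely to sit where the control bar appears.
// `cue` stands in for a VTTCue (only `line` and `snapToLines` are read).
function isBottomCue(cue) {
  if (cue.line === 'auto') return true;
  // Simplification: percentage-based lines are left where they are.
  if (cue.snapToLines === false) return false;
  // Lines -1 through -5 are the last five lines, counted from the bottom.
  return cue.line >= -5 && cue.line <= -1;
}
```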
Mux Player’s design also calls for different buttons and controls to be present depending on the display size and whether the stream is live or not, so we want to account for that.
We did run into a Chrome bug when implementing this solution. Chrome wouldn’t accept the first value that we set line to. However, we noticed that this can be worked around by first setting the line to a different value before setting it to the value we wanted it to be. We’ve filed a bug with Chromium.
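The workaround can be sketched in a couple of lines. Which throwaway value to use is an assumption here; the only requirement described above is that it differs from the value you actually want.

```javascript
// Sets `cue.line` twice: first to a throwaway value, then to the real one,
// so a browser that drops the first assignment still ends up on `targetLine`.
function setLineWithWorkaround(cue, targetLine) {
  // Assumed placeholder: 0, or 1 when the target itself is 0.
  cue.line = targetLine === 0 ? 1 : 0;
  cue.line = targetLine;
}
```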
There’s still more to do to get this to cover a variety of edge cases — like what if the activeCues change while the controls are still showing? Or what if the selected track changes? The final implementation that we have in Mux Player ends up being a bit complex in order to cover a variety of scenarios, but it does work across browsers, including Firefox, Chrome, and Safari.
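As one example of those edge cases, here is a hedged sketch of reacting to activeCues changing while the controls are visible. The function and parameter names are hypothetical, and the real Mux Player code handles more scenarios than this. The `cuechange` event and `activeCues` list are part of the standard TextTrack API.

```javascript
// Re-applies `moveCue` to newly active cues whenever the track's active
// cues change while the controls are visible.
//   track:              a TextTrack (or anything with the same event API)
//   areControlsVisible: function returning true while the controls show
//   moveCue:            function that shifts a single cue out of the way
function watchActiveCues(track, areControlsVisible, moveCue) {
  const onCueChange = () => {
    if (!areControlsVisible()) return;
    for (const cue of Array.from(track.activeCues || [])) {
      moveCue(cue);
    }
  };
  track.addEventListener('cuechange', onCueChange);
  // Call the returned function when the track is deselected or replaced.
  return () => track.removeEventListener('cuechange', onCueChange);
}
```

Returning a cleanup function makes the selected-track case easier: when the selected track changes, stop watching the old track and start watching the new one.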
After implementing this technique, I spent some time writing a blog post on this topic. However, as I was completing my first draft of that blog post, we noticed a bug in Safari.
At first, we weren't sure what exactly was going on. Someone on the team reported seeing duplicate captions in some cases, but it wasn't easily reproducible. We noticed it again that day, so I set aside time to figure it out.
I don't remember exactly what led me to figuring out how to reproduce it, but the bug happens when we use my new technique to move the captions up and then seek back to replay a portion of video with the captions. It seems like Safari has a check to see if the cue embedded inside an HLS stream exists and, if not, to re-add it. If we modify the cue in any way, like changing the line property, the modified cue becomes invalid, and Safari adds a new cue. Since the modified cue isn’t removed, we end up with duplicated cues. If you seek back multiple times, you'll end up getting multiple duplicates, meaning all the previous cues would've been shifted.
I've opened an issue against WebKit for this, and I hope it gets fixed soon. In the meantime, since it happens only with native HLS playback, we can use MSE-based playback on desktop and iPad and turn the new captions movement behavior off on iPhones.
Custom controls are very common for web-based players, from Mux Player to Video.js and beyond. Pretty much every player has, at some point, struggled with captions and controls. In Video.js, the solution we decided on was to not rely on native captions by default, except in Safari, and then to target Safari specifically with CSS.
Looking at the specification for WebVTT, it specifically accounts for the control bar showing up and asks user agents (browsers) to move the cues out of the way when this happens (see section 7.1 step 4 of the Processing Model algorithm). Given that this mechanism exists in the spec, we need a way to let the browser know about our custom controls so that it can apply the same mechanism for our controls as for native controls.
To that end, I opened an issue on the WebVTT specification (where I’ve already shared one potential solution) and started a discussion in the Timed Text Working Group and the Media Working Group to think about how this should be addressed.
This is where you come in. Your input and thoughts on the discussion can help drive the future of native video captioning. How can we improve captions on the web when using custom controls with native captions? While you're at it, do you see any limitations with captions that can be improved?
Make your voice heard. Give me your take by tweeting @MuxHQ or emailing me at email@example.com - and thanks for reading!
1. The difference between captions and subtitles is that subtitles only have the spoken words, whereas captions also include other auditory cues. Outside the US, there may not be a distinction between the two, or captions may be called Subtitles for the Deaf and Hard of Hearing (SDH).