Every February, millions of fans across America (and, to some extent, around the world) tune in to one of the biggest sporting events of the year.
For the past three years, we’ve been watching too - monitoring livestreams with Mux Data to ensure an average of 3.9 million viewers experienced an optimized stream of the Big Game.
In that time, we’ve had some fumbles and recoveries, and while you may not be planning a streaming event with millions of concurrent viewers, the takeaways from our lessons learned are applicable to live streams of all sizes.
Heading into our first live stream was exciting and a little daunting. We were a two-and-a-half-year-old company. We’d managed live events before, but concurrent viewership for those was maybe half of what we were expecting for this game. On top of all of that, we were heading in with a brand-new feature - our real-time dashboard - which, at the time, was a custom feature we had built for the customer broadcasting the event.
Based on some feedback from our customer, we predicted the viewership load, and even prepared for 2-3X of the expected volume. We made an educated guess, and it served us well. However, come game day, we saw a usage pattern in our application that we hadn’t expected. Since we had planned for so much extra load, we were able to survive.
But we still had a blind spot.
While we had planned for significant viewership traffic of the stream, we hadn’t accounted for a surge in users viewing the quality of experience metrics within the dashboard itself.
During the stream, we quickly recognized problematic signs: alerts firing, data lagging behind, and database compute usage well above benchmarks. These signs typically indicate data loss, or that the data exists but can’t be visualized.
Checking to see what all those alerts are about
In our case, it was the latter. While we had provisioned and tested for ingesting massive amounts of data, we had focused our preparations on not losing that data. We didn’t anticipate that there would be hundreds of users (from our customer and our team) in the dashboard at once, each trying to pull a unique set of data. Our testing had accounted for tens of users, but not for this type of scale.
Mux Data uses ClickHouse to store and deliver all of our performance metrics to our customers. ClickHouse is a highly efficient database that supports massive throughput for both ingesting and retrieving data, and we had thoroughly tested the scalability of each path in isolation. What we hadn’t accounted for was what happens when you pull large amounts of data from the same database while it’s ingesting hundreds of thousands of events per second. Luckily for us, ClickHouse is vertically scalable, and we had practiced rescaling our systems on the fly with no data loss or database downtime. On game day, after a few successive rescales, we were in the clear for the rest of the event.
With learnings from year one under our belts, we built tools to simulate retrieval traffic across every part of the product for subsequent events. Now our testing simulates full product usage - both ingestion and retrieval - at levels well above anything we expect for the main event.
Running that first event was an all hands on deck effort, which was great because the entire company from Jon, our CEO, down was in the trenches, slacking every play-by-play in our internal war room.
We also had an internal media room in our San Francisco office, where our engineers were monitoring the event in real time. Just outside of the media room was a viewing party where we brought our families to celebrate Mux’s role off the field. But when we ran into the dashboard issue, we realized that having our friends and family in the same office with our folks manning the event was actually more stressful than celebratory.
In high stakes situations, having a team of specialists at the ready is your biggest advantage and strongest defense against unexpected issues. Without quick-thinking engineers who recognized the problem and pivoted quickly, we might not have had future opportunities to apply our lessons learned.
One of those lessons is that you don’t need multiple quarterbacks, and in fact you don’t want multiple quarterbacks. When something goes wrong, you want clearly defined roles and responsibilities, so there’s no overlap and the team operates like a well-oiled machine. As our experience in monitoring expanded, our hands-on team consolidated to a few core members who pulled in others as needed and kept the rest of the company up to speed on how things were going.
When we can get back to in person events, we’ll apply another lesson and stick to hosting post-game parties at the office.
Performance is critical for all live streams, and a lot goes into that on our end, including tuning Kafka, ClickHouse, and our data collection layer.
Our game-day preparations included a lot of resource tuning and benchmarking on the data collection layer to ensure the capacity limits of the collectors were well understood. Mux SDKs send beacons to our beacon collectors, written in Go, where they are parsed and written to Kafka topics for asynchronous processing. It’s crucial that our collectors capture and record every beacon to guarantee low latency and high data accuracy.
Collectors are autoscaled in Kubernetes in response to CPU load. We relied on load-testing results to know the number of beacons a single collector could handle, how many views those beacons correspond to, and therefore how many collectors we’d need to have online to support game-day traffic. We also experimented with different AWS EC2 instance types to find the best balance between performance, cost, and responsiveness to spikes in load. This was accomplished by replaying anonymized, scaled beacons from earlier livestream events, allowing us to simulate 2x, 3x, or more of the anticipated traffic levels.
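The replay-and-scale idea can be sketched roughly like this in Go. The beacon shape, the salt, and the hashing scheme here are illustrative assumptions, not our actual tooling; the point is that scaling replicates beacons with distinct anonymized viewer IDs so cardinality grows with volume.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// RecordedBeacon is a hypothetical, pared-down beacon captured from an
// earlier live event.
type RecordedBeacon struct {
	ViewerID  string
	OffsetSec int // seconds since stream start
}

// anonymize replaces a real viewer ID with a stable one-way hash, so the
// replayed data keeps its shape without exposing identities.
func anonymize(id, salt string) string {
	sum := sha256.Sum256([]byte(salt + id))
	return hex.EncodeToString(sum[:8])
}

// scaleTraffic replays each recorded beacon `factor` times, giving every
// synthetic copy a distinct viewer ID so per-viewer cardinality scales too.
func scaleTraffic(recorded []RecordedBeacon, factor int, salt string) []RecordedBeacon {
	out := make([]RecordedBeacon, 0, len(recorded)*factor)
	for _, b := range recorded {
		for i := 0; i < factor; i++ {
			out = append(out, RecordedBeacon{
				ViewerID:  anonymize(fmt.Sprintf("%s#%d", b.ViewerID, i), salt),
				OffsetSec: b.OffsetSec,
			})
		}
	}
	return out
}

func main() {
	recorded := []RecordedBeacon{{"alice", 0}, {"bob", 5}}
	scaled := scaleTraffic(recorded, 3, "game-day")
	fmt.Println(len(scaled)) // 6: two recorded beacons replayed at 3x
}
```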
One big lesson from this experience is that it’s critical to test with real traffic, quirky as it may be. The cardinality of real data is hard to achieve with simulated traffic. Our data processing and database systems are particularly sensitive to high-cardinality data. Using that quirky anonymized beacon data from earlier live-stream events was crucial to stressing these systems and knowing their bounds.
Kafka plays a major role in many real-time data processing and analytics systems, and Mux Data is no exception. We provisioned our Kafka clusters to tolerate up to two node failures (i.e., a replication factor of 3), and tested this repeatedly during load-testing to ensure failures could be tolerated. We closely monitored system performance during those simulated failures, increasing CPU and memory resources on the Kafka brokers as needed to ensure a seamless recovery. Some of the broker-level settings that we experimented with included num.io.threads, num.network.threads, and num.replica.fetchers. One great thing about our game-day Kafka setup is that it’s deployed and managed through Kubernetes in the same manner as our other Kafka clusters - it’s just a lot bigger, with many more brokers.
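For illustration, those broker-level settings live in each broker’s server.properties. The values below are placeholders to show the knobs involved, not our production configuration:

```properties
# Threads handling disk I/O for requests (illustrative value)
num.io.threads=16
# Threads handling network requests
num.network.threads=8
# Threads fetching replicated data from partition leaders;
# more fetchers speed up recovery after a broker failure
num.replica.fetchers=4
# A replication factor of 3 tolerates up to two broker failures
default.replication.factor=3
```

Tuning num.replica.fetchers in particular affects how quickly replicas catch up after the kind of simulated node failures described above.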
Mux Data uses the ClickHouse database for both historical and real-time data. For this event, our main focus was the performance of the real-time ClickHouse database, since its volume of inserts and queries is orders of magnitude higher than the historical database’s. As mentioned earlier, it was critical that we load-test with high-cardinality data representative of what we expected on game day. We had to ensure that both data ingestion (inserting into ClickHouse) and querying performed well.
One lesson we’ve learned is that ClickHouse loves a small number of inserts with large batches of data, and can struggle with a large number of inserts containing small batches. Each insert creates a new data part that must be merged in the background; merges are computationally expensive, and ClickHouse deliberately limits how many can run concurrently. For this reason, we kept the number of concurrent inserts to a minimum. It’s a bit counterintuitive, but we improved insert performance by reducing the number of Kafka topic partitions for our real-time data, thereby reducing the number of concurrent inserts.
We also discovered the importance of tuning ClickHouse’s ‘max_threads’ setting, which controls how many threads a query can use. This is super important given that queries run across ClickHouse nodes in a distributed manner. We experimented with this setting after vertically scaling our ClickHouse nodes and saw significant performance improvements.
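For reference, ‘max_threads’ can be set per session or per query; the value and the table name below are illustrative, not our production settings:

```sql
-- Per-session: cap the number of threads a query may use
SET max_threads = 16;

-- Or per-query, via the SETTINGS clause
SELECT count()
FROM realtime_metrics  -- hypothetical table name
SETTINGS max_threads = 16;
```

Raising it lets a single query use more of a freshly upscaled node’s cores; lowering it keeps one expensive query from starving the rest.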
Learnings from every round of preparations and testing not only change our approach to the game each year we monitor, but they help us improve our Data product as well.
In addition to the lessons shared above, we also learned that for larger events, it makes sense to isolate the infrastructure from other customers, allowing dedicated space for large events to run without interference, and also ensuring large events don’t become noisy neighbors for other users.
Lastly, accurate estimates are crucial for a successful large event. You should prepare for 2-10X the expected volume, depending on how confident you are in your estimates. In subsequent years, we’ve been able to glean data about viewership profiles and work with the broadcasters to estimate traffic patterns more accurately. This encompasses not only people watching the event, but also how our customers intend to use our products to ensure a smooth event. Both help us better plan for what to expect come game day, but we still default to over-anticipating the load.
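The headroom math itself is simple. Here’s a hedged Go sketch with made-up numbers (none of these figures are our real rates or capacities):

```go
package main

import (
	"fmt"
	"math"
)

// collectorsNeeded estimates how many beacon collectors to provision:
// peak beacon rate, multiplied by a headroom factor, divided by what one
// collector can handle. All inputs here are illustrative.
func collectorsNeeded(expectedViewers, beaconsPerViewerPerSec, perCollectorBeaconsPerSec, headroom float64) int {
	peakBeacons := expectedViewers * beaconsPerViewerPerSec
	return int(math.Ceil(peakBeacons * headroom / perCollectorBeaconsPerSec))
}

func main() {
	// 4M expected viewers, 0.1 beacons/viewer/sec, 10k beacons/sec per
	// collector, 3x headroom -> plan for 120 collectors.
	fmt.Println(collectorsNeeded(4_000_000, 0.1, 10_000, 3))
}
```

The headroom factor is where your confidence in the estimate shows up: tight, data-backed estimates might justify 2x, while a first-time event warrants much more.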
After our second year monitoring the Big Game, we hosted a webinar with Fox and AWS. Check it out if you want to learn more about video observability and scaling the world’s largest sporting events.