If you are lucky enough to have ever run software at scale, then having things fail is probably something you're familiar with. If not, you probably at least remember the first time your external hard drive or flash drive with that super important file failed to be read by your computer. Regardless of your background with hardware or software, you probably have experienced some kind of failure. Did you know exactly when and how that failure happened? Were you prepared for it?
Maybe this all sounds crazy to you and you are thinking "I run my software in the magical cloud with durable hard drives and live migrating virtual machines." Depending on your situation, you may still be vulnerable to physical failures such as a hard drive head crash or a network fiber cut, but the majority of software outages are actually caused while engineers are taking action on running systems, frequently while updating or repairing them.
Before we get too deep into the technical discussion about failure detection and architecture, let's stop for a second and remember that not everyone is YouTube (where I used to work) or Google, where minutes of downtime seriously impact revenue. If your website is down for 5 minutes, what is the cost of that outage? If the cost is near $0, you should probably spend nearly $0 on your monitoring, perhaps by using one of the many free uptime check services. When the cost becomes significant, you will not only want to consider more complex monitoring systems in order to assess your SLOs and ensure you meet your SLAs, but you should also consider how to design your system to handle transient failures of dependent services.
The first full data collection outage I have seen since I joined Mux in August of 2016 occurred on April 19th, 2017, around 4:30 PM PST, during an upgrade of our API server which was, ironically, adding additional monitoring capabilities.
A component of that API server which our internal services rely on was misconfigured during the upgrade. This misconfiguration resulted in our data collection agents rejecting all requests once their in-memory caches expired. The cache we had designed to mitigate such outages actually resulted in a slow decay of traffic, as designed.
However, the alerts were configured to notify only in a zero data collection scenario, because the higher values for our "too little data collection" threshold had been misbehaving a couple of days earlier and had subsequently been disabled. The relationship between these components in our staging environment is not the same as in production, which led to a false positive during our release process. We resolved the issue around 7:30 PM PST on the same day, and are working to improve both the monitoring and reliability of our data collection.
I admit I can be a bit arrogant when it comes to the awesomeness of the systems I have built, but there is nothing more humbling than having one of them fail silently. On multiple occasions, I have been looking at the data from a system I built and suddenly realized it was failing in a way I did not anticipate. I am a human. I make mistakes. Today, eyeballs looking at graphs or logs are frequently better at catching anomalies than algorithms. For this reason, I encourage everyone running a system they feel is very important to keep a literal eyeball on their key metrics. YouTube had TVs showing critical metrics to nearby engineers, and on more than one occasion a real live human caught a regression in a well-known graph before any system did. On the day of our outage, the Raspberry Pi that we were using to show off various metrics via Grafana Playlists was not running, and our eyeballs missed a regression.
Active monitoring like that requires human eyeballs looking at graphs 24/7, and that doesn't just suck, it is completely impractical for so many reasons. Reactive monitoring is what most people are relying on these days. This is some software that continually inspects the state of the system, evaluates some (normally human defined) rules which determine if the state of the system is irregular, and then alerts a human if required. The uptime services I mentioned earlier are the simplest example of a reactive monitoring system. More complex systems might look at metrics such as high disk usage and alert a human who will delete unused data. Setting alerting thresholds isn't always obvious or easy. Remembering to set both "too high" and "too low" levels for various metrics is important, but so often it is easy to forget one of these as we did that day. It doesn't matter what monitoring infrastructure you run if you don't have good alerts defined.
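To make the "too high" and "too low" point concrete, here is a minimal sketch of a rule evaluator. Everything in it is an assumption for illustration: the `ThresholdRule` class, the `check` function, and the metric names are hypothetical and do not describe the monitoring system we actually run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """A human-defined rule with optional lower and upper bounds (hypothetical)."""
    metric: str
    too_low: Optional[float] = None   # alert when the value drops below this
    too_high: Optional[float] = None  # alert when the value rises above this

    def evaluate(self, value: float) -> Optional[str]:
        """Return an alert message if the value is out of bounds, else None."""
        if self.too_low is not None and value < self.too_low:
            return f"{self.metric} too low: {value} < {self.too_low}"
        if self.too_high is not None and value > self.too_high:
            return f"{self.metric} too high: {value} > {self.too_high}"
        return None


def check(rules, samples):
    """Evaluate every rule against the latest sample for its metric."""
    alerts = []
    for rule in rules:
        if rule.metric not in samples:
            # A metric that stopped reporting is itself worth an alert.
            alerts.append(f"{rule.metric} has no data")
            continue
        message = rule.evaluate(samples[rule.metric])
        if message is not None:
            alerts.append(message)
    return alerts


# Illustrative rules; the metric names and numbers are made up.
rules = [
    ThresholdRule("events_per_second", too_low=100, too_high=10000),
    ThresholdRule("disk_used_percent", too_high=90),
]
alerts = check(rules, {"events_per_second": 40, "disk_used_percent": 50})
```

A rule that only had `too_low=0` on `events_per_second` would stay quiet while traffic slowly decayed, which is roughly the gap described above.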
Alerting is hard to do right. The only alert that is worse than an alert that doesn't fire is the alert that makes you ask, "is this really a problem?" One technique you can use when defining alerts is to have different tiers. Separate the "this is urgent" alerts, which should actually set off a pager, from the "you should look at this soon" alerts, which can email the whole team. Promote and demote alerts appropriately between the different tiers you define. If an alert that pages someone is being too noisy but it is super critical to operations, consider demoting that alert to just an email instead of deleting it entirely. It actually isn't fair to say we forgot an alert. We had removed an alert that was being too noisy when we probably should have just demoted it.
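As a rough sketch of what tiers with promotion and demotion could look like, consider the following. The `Tier` and `AlertRouter` names, the channel strings, and the alert name are all hypothetical, chosen only to illustrate the idea.

```python
from enum import IntEnum


class Tier(IntEnum):
    EMAIL = 1  # "you should look at this soon"
    PAGE = 2   # "this is urgent"


class AlertRouter:
    """Tracks which tier each alert belongs to (hypothetical helper)."""

    def __init__(self):
        self._tiers = {}

    def register(self, name, tier):
        self._tiers[name] = tier

    def demote(self, name):
        """Move a noisy alert to the email tier instead of deleting it."""
        self._tiers[name] = Tier.EMAIL

    def promote(self, name):
        """Move an alert that has proven urgent to the paging tier."""
        self._tiers[name] = Tier.PAGE

    def route(self, name):
        """Return the channel an alert should be sent to."""
        return "pager" if self._tiers[name] is Tier.PAGE else "team-email"


router = AlertRouter()
router.register("too_little_data_collection", Tier.PAGE)
router.demote("too_little_data_collection")  # noisy, but still visible to the team
```

The point of the design is that a demoted alert still reaches a human, just more slowly, while a deleted one reaches nobody.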
After you see an issue in the system, or an alert tells you there is a problem with it, what do you do? Fix it, obviously. Good job. Can I go back to sleep now? Maybe, but did you understand the problem? Do you know how to prevent that problem from happening again? I know some people who dread writing postmortems. The postmortem process is not a punishment, but rather a mode of learning that you should be excited to participate in. I believe that your company is destined for failure the second this tool becomes a burden or just another part of "the process" to resolve an outage. Take time to think about what happened and what else could happen given all of the new information you just acquired.
Here are a few things I think about when assessing an outage:
- Do we have all the information we need to understand all of the details of what happened and why?
- Could we have detected this outage in another way? (Was the alert we received as specific as possible?)
- Was the action that needed to be taken obvious to all members of the oncall team?
- Can we automate the action that was taken?
- What was the root cause of this outage?
- Can/should we architect our systems to be able to fail more gracefully in this (and similar) circumstance(s)?
When you are operating at a certain scale, failure is bound to happen. Through the lens of the postmortem, you become a better architect against possible failures. There is no one-size-fits-all approach to failure mitigation. Each of the decisions you make, both small and large, should go through a simple cost-benefit analysis in order to assess whether or not you should actually implement it. Here are a few examples from my experience:
- Retry, Backoff and Rate Limit - Sometimes simply adding a retry will go a long way, but that is dangerous because it may add load when the system is already under pressure. In that scenario, it is important to have a backoff strategy that is appropriate for your workload, such as linear, exponential or other special snowflake rate limiting designs.
- Use a Cache - If the server doesn't respond, can you reuse the last response to fulfill the current request? If this is feasible, is the cost of the additional resources worth it?
- Add Redundancy - If a single server fails, does the whole system stop? Can you introduce redundancy to eliminate this kind of failure and again, is that cost worth it?
- Build a Buffer - Can handling of the data wait? If it can, consider adding a component such as Amazon Kinesis or Apache Kafka to enqueue the data for later processing.
- Reconsider Dependencies - Does the system actually need the successful operation of this other component? Can the dependency on the other component be removed entirely or can it be replaced with something that has a more appropriate set of guarantees?
- Introduce Isolation - Was a large set of well-behaved users affected by a single abuser? Can you isolate the resources that known well-behaved users rely on from those which are more likely to encounter abusers? Can you do a better job isolating internal and external traffic?
- Improve Test and Release Practices - Sometimes, the hardest part of architecting a system is actually not architecting the system itself, but rather the continual updating of that system. How much value can you get from improved testing and release practices? The variance in both the cost and benefit here is probably the highest.
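The first two items on that list combine naturally, so here is a minimal sketch of them together: a retry with capped exponential backoff and jitter, wrapped by a cache that serves the last good response when the dependency stays down. The names `call_with_retry` and `StaleCache`, the default delays, and the injectable `sleep` parameter are assumptions made for illustration, not a description of our production code. Note that this sketch never expires the cached value; a real system would need to decide how stale is too stale.

```python
import random
import time


def call_with_retry(fetch, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # Double the delay ceiling each attempt, cap it, and pick a random
            # delay below it so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


class StaleCache:
    """Serve the last good response when the dependency cannot be reached."""

    def __init__(self, fetch, **retry_options):
        self._fetch = fetch
        self._retry_options = retry_options
        self._last_good = None
        self._has_value = False

    def get(self):
        try:
            self._last_good = call_with_retry(self._fetch, **self._retry_options)
            self._has_value = True
        except Exception:
            if not self._has_value:
                raise  # nothing cached yet, so the failure must surface
        return self._last_good
```

Usage would look something like `StaleCache(load_config).get()`, where `load_config` is whatever function talks to the dependency. Whether the extra memory and the risk of serving stale data are worth it is exactly the cost-benefit question raised above.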