12:20am. I was jerked awake by the ringing of my phone. There were problems in production.
I logged on, desperately trying to clear the fog of sleep from my brain. There were already a handful of alerts for a variety of services. I ran some quick tests and soon confirmed that this was an outage.
A bit more investigation narrowed the cause down to a specific service, and before long the platform was fixed. I made some notes while everything was fresh in my head and did my best to get back to sleep.
In the morning, several engineers from different teams worked together to investigate the problem further. It all boiled down to an interaction with a server that had gone down. In theory, Knewton’s architecture should have been resilient to any single server failing. As it turns out, theory and practice weren’t the same.
The Cost of Outages
Failure is part of engineering. But system failures like the one described above are very costly.
First and foremost there is the cost to users. For students using Knewton-powered products, this cost can be significant. When Facebook is down, you can’t look at your friends’ selfies. But when Knewton is down, you can’t do your homework, or finish your take-home test, or study for the upcoming exam. We take these potential consequences seriously.
Second, there is the cost to employees, and to the company they work for. Waking people up in the middle of the night, or otherwise interrupting their personal lives, is a real cost. It creates unhappy employees. If it happens too frequently, then engineers may start ignoring the problems, or leave the company — or both.
Third, system failures can lead to compounding problems. When an alarm wakes me up in the middle of the night, my brain is only functioning at a fraction of its normal capacity. It is much easier for me to make a mistake at 3 a.m. than at 3 p.m. If I make the wrong judgment call while dealing with a production issue, I could actually make things worse, turning a minor issue into a major one.
Fourth, failures hurt a company’s reputation. No one wants to use flaky products. If Gmail were down for an hour every week, would you continue to use it? What if it dropped 5% of all incoming and outgoing emails? Users don’t want the best product in the world 90% of the time. They’d rather use a product that is 90% as good and works 100% of the time.
How Do You Avoid Failure?
The first way to avoid costly outages is to learn from our mistakes. At Knewton, we hold a post-mortem with clearly owned action items every time we have a production outage.
Second, we learn from other companies. Even though our product is unique, designing for failure isn’t a problem unique to Knewton. There is a lot of work being done at other companies to proactively test for reliability. While it is always tempting to reinvent the wheel, it’s better to start by learning from others.
Here are two techniques we learned about elsewhere that Knewton now uses to improve reliability:
1. Introducing Failure
First, we are introducing failures into our platform using Netflix’s Chaos Monkey tool. (Netflix, if you’re reading, THANK YOU for open sourcing this wonderful tool!) For each of our stacks, Chaos Monkey terminates one instance at random, once per workday, during business hours. This allows us to discover the weaknesses of our platform on our own terms, before they impact our users.
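To make the cadence concrete, here is a minimal Python sketch of the underlying idea, not of Chaos Monkey itself: during business hours, pick one random in-service instance per Auto Scaling group and terminate it. The boto3 calls are standard AWS SDK calls; the schedule window and the once-per-round behavior are illustrative assumptions, not Knewton’s actual configuration.

```python
import datetime
import random

import boto3  # AWS SDK for Python; assumes AWS credentials are already configured


def chaos_round(region="us-east-1"):
    """Terminate one random in-service instance per Auto Scaling group."""
    now = datetime.datetime.now()
    # Only act on weekdays during (assumed) business hours, so engineers are around.
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return

    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    for group in autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]:
        instances = [
            i["InstanceId"]
            for i in group["Instances"]
            if i["LifecycleState"] == "InService"
        ]
        if not instances:
            continue
        victim = random.choice(instances)
        print(f"Terminating {victim} from {group['AutoScalingGroupName']}")
        ec2.terminate_instances(InstanceIds=[victim])


if __name__ == "__main__":
    chaos_round()
```

Run on a daily schedule (cron, for example), this exercises the same failure mode as a real instance loss, but at a time when the team can watch what happens.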
2. Mitigating Risk in Production
One of the challenges of effective testing is that you can’t truly duplicate your production environment. Production stresses services in unique ways that are difficult or impossible to reproduce in test environments. Moving a new version of a service into production is always the real, final test. If you fail this test, then you impact your users.
So the question isn’t whether you test in production, it’s how you test in production. Knewton has been using two related techniques to minimize the risk of deploying new services to production.
A shadow deployment sits in parallel to the real service, processing duplicate copies of production messages, but without performing writes. We can monitor the performance of the shadow stack, but it will not directly affect our users. The shadow deployment allows us to test the performance of a new version with real production load, without the risk of impacting our users with bugs or regressions. The downsides, of course, are that a shadow service places additional load on dependent services (possibly indirectly affecting users), and requires work to create a custom read-only shadow configuration. Additionally, the difference in configuration means that we are not truly testing the same thing. In fact, if the service is write-limited, the shadow may give us greatly exaggerated performance measurements.
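As a rough illustration of the read-only idea, here is a hedged Python sketch; the class and function names are invented for this example, not Knewton’s actual code. A dispatcher hands each production message to the live service and a duplicate copy to a shadow instance whose write path is disabled, so shadow failures never reach users.

```python
class Service:
    """Toy service: processes a message and, in normal mode, persists the result."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only  # shadow stacks run with writes disabled
        self.store = {}

    def handle(self, message):
        result = {"input": message["payload"], "length": len(message["payload"])}
        if self.read_only:
            # Shadow configuration: do all the work, but never write.
            return result
        self.store[message["id"]] = result
        return result


def dispatch(message, live, shadow):
    """Send every production message to both stacks; only the live result counts."""
    try:
        shadow.handle(dict(message))  # duplicate copy for the shadow stack
    except Exception:
        pass  # shadow errors are monitored, never surfaced to users
    return live.handle(message)


live = Service("recommendations-v1")
shadow = Service("recommendations-v2", read_only=True)
dispatch({"id": "42", "payload": "quiz-attempt"}, live, shadow)
```

The `read_only` flag is exactly the custom configuration the paragraph above mentions: it is what keeps the shadow safe, and also what makes the shadow an imperfect stand-in for a write-heavy service.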
A canary deployment is a small subset (usually just one) of the boxes for a stack that run the new version of a service. The remaining boxes still run the current version. If something goes wrong with the new version, the damage is limited to only a fraction of our users. Unlike a shadow, this does not require any special configuration.
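Here is a minimal sketch of canary routing, again with invented host names and versions: one box in the fleet runs the new version, the rest run the current one, so only roughly one request in N ever sees the canary.

```python
import random

# Hypothetical fleet: one canary box on the new version, the rest on the current one.
fleet = [
    {"host": "svc-01", "version": "1.4.0-canary"},
    {"host": "svc-02", "version": "1.3.2"},
    {"host": "svc-03", "version": "1.3.2"},
    {"host": "svc-04", "version": "1.3.2"},
]


def route(request):
    """Pick a box uniformly at random, so ~1/len(fleet) of traffic hits the canary."""
    box = random.choice(fleet)
    return box["host"], box["version"]


# If error rates or latency on the canary box regress, it is rolled back;
# otherwise the new version is promoted to the remaining boxes.
print(route({"user": "student-123"}))
```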
What constitutes effective failure testing changes constantly, as new services are introduced and our platform evolves. Creating comprehensive failure testing plans for each new service or major service change that we deploy is key. What are the different ways that this service could fail? What happens if this service fails? How will that affect the connected services? What happens if one of the connected services fails or becomes temporarily unavailable?
Additionally, we are in the process of auditing our outage history and categorizing it into different failure modes. This audit will allow us to design new tools to introduce common failure modes and make sure that we are now resilient to them across our whole platform.
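One simple way to drive such an audit, sketched below with made-up categories and records, is to tally past outages by failure mode so that the most common modes get failure-injection tooling first.

```python
from collections import Counter

# Hypothetical outage log distilled from post-mortem notes.
outage_failure_modes = [
    "single-instance-loss",
    "dependency-timeout",
    "single-instance-loss",
    "bad-deploy",
    "dependency-timeout",
    "single-instance-loss",
]

# Rank failure modes by frequency; the most common ones become the first
# candidates for new failure-injection tools.
for mode, count in Counter(outage_failure_modes).most_common():
    print(f"{mode}: {count}")
```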