Testing for Failure

12:20 a.m. I was jerked awake by the ringing of my phone. There were problems in production.

I logged on, desperately trying to clear the fog of sleep from my brain. There were already a handful of alerts across a variety of services. I ran some quick tests and confirmed that this was an outage.

A bit more investigation narrowed the cause down to a specific service, and before long the platform was fixed. I made some notes while everything was fresh in my head, and did my best to get back to sleep.

In the morning, several engineers from different teams worked together to investigate the problem further. It all boiled down to an interaction with a server that had gone down. In theory, Knewton’s architecture should have been resilient to any single server failing. As it turns out, theory and practice weren’t the same.

The Cost of Outages

Failure is part of engineering. But system failures like the one described above are very costly.

First and foremost there is the cost to users. For students using Knewton-powered products, this cost can be significant. When Facebook is down, you can’t look at your friends’ selfies. But when Knewton is down, you can’t do your homework, or finish your take-home test, or study for the upcoming exam. We take these potential consequences seriously.

Second, there is the cost to employees, and to the company they work for. Waking people up in the middle of the night, or otherwise interrupting their personal lives, is a real cost. It creates unhappy employees. If it happens too frequently, then engineers may start ignoring the problems, or leave the company — or both.

Third, system failures can lead to compounding problems. When an alarm wakes me up in the middle of the night, my brain is only functioning at a fraction of its normal capacity. It is much easier for me to make a mistake at 3 a.m. than at 3 p.m. If I make the wrong judgment call while dealing with a production issue, I could actually make things worse — turning a minor issue into a major one.

Fourth, failures hurt a company’s reputation. No one wants to use flaky products. If Gmail were down for an hour every week, would you continue to use it? What if it dropped 5% of all incoming and outgoing emails? Users don’t want the best product in the world 90% of the time. They’d rather use a product that is 90% as good and works 100% of the time.

How Do You Avoid Failure?

The first way to avoid costly outages is to learn from our mistakes. At Knewton, we hold a post-mortem with clearly owned action items every time we have a production outage.

Second, we learn from other companies. Even though our product is unique, designing for failure isn’t a problem unique to Knewton. There is a lot of work being done at other companies to proactively test for reliability. While it is always tempting to reinvent the wheel, it’s better to start by learning from others.

Here are two techniques that Knewton uses to improve reliability that we heard about elsewhere:

1. Introducing Failure

First, we are introducing failures into our platform using Netflix’s Chaos Monkey tool. (Netflix, if you’re reading: THANK YOU for open sourcing it!) For each of our stacks, Chaos Monkey terminates one instance at random, once per workday, during business hours. This allows us to discover the weaknesses of our platform on our own terms, before they impact our users.
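
A toy version of that behavior can be sketched as follows. This is not Netflix’s implementation or Knewton’s configuration; the `terminate` callback stands in for a real cloud-provider API call, and the stack/instance names are made up.

```javascript
// Toy Chaos Monkey round: for each stack, pick one instance at random and
// hand it to an injected terminate() callback. In production this callback
// would call the cloud provider's termination API.
function chaosMonkeyRound(stacks, terminate, random) {
  random = random || Math.random;
  Object.keys(stacks).forEach(function (stackName) {
    var instances = stacks[stackName];
    if (instances.length === 0) return;                     // nothing to kill
    var victim = instances[Math.floor(random() * instances.length)];
    terminate(stackName, victim);                           // one kill per stack
  });
}
```

Injecting `terminate` and `random` keeps the routine easy to dry-run and test before it is ever pointed at real instances.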

2. Mitigating Risk in Production

One of the challenges of effective testing is that you can’t truly duplicate your production environment. Production stresses services in unique ways that are difficult or impossible to reproduce in test environments. Moving a new version of a service into production is always the real, final test. If you fail this test, then you impact your users.

So the real question isn’t whether you test in production, but how. Knewton has been using two related techniques to minimize the risk of deploying new services to production.

Shadow Deployment

A shadow deployment sits in parallel to the real service, processing duplicate copies of production messages, but without performing writes. We can monitor the performance of the shadow stack, but it will not directly affect our users. The shadow deployment allows us to test the performance of a new version with real production load, without the risk of impacting our users with bugs or regressions. The downsides, of course, are that a shadow service places additional load on dependent services (possibly indirectly affecting users), and requires work to create a custom read-only shadow configuration. Additionally, the difference in configuration means that we are not truly testing the same thing. In fact, if the service is write-limited, the shadow may give us greatly exaggerated performance measurements.
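
A minimal sketch of this dispatch pattern (the function and service names are illustrative, not Knewton’s code) might look like:

```javascript
// Shadow dispatch: every message is handled by the live service; a duplicate
// copy goes to the shadow with writes disabled, and any shadow failure is
// swallowed so it can never reach users.
function dispatch(message, live, shadow) {
  var result = live.handle(message);              // the response users see
  try {
    shadow.handle(message, { readOnly: true });   // duplicate copy, no writes
  } catch (err) {
    // Log and move on: a buggy shadow must not affect production traffic.
  }
  return result;
}
```

The `readOnly` flag is the custom configuration mentioned above, and the reason the shadow is not a perfect replica of production behavior.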

Canary Deployment

In a canary deployment, a small subset of the boxes for a stack (usually just one) runs the new version of a service, while the remaining boxes run the current version. If something goes wrong with the new version, the damage is limited to only a fraction of our users. Unlike a shadow, this does not require any special configuration.
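
As a sketch, assuming a simple modulo spread of requests across boxes (the box names and the one-in-four split are made up, not Knewton’s topology):

```javascript
// One canary box out of four: roughly a quarter of requests exercise the
// new version, the rest hit the current one.
var boxes = ['v2-canary', 'v1-a', 'v1-b', 'v1-c'];

function routeRequest(requestId, boxes) {
  return boxes[requestId % boxes.length];   // canary gets 1/N of the traffic
}
```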

What’s Next?

What constitutes effective failure testing changes constantly, as new services are introduced and our platform evolves. Creating comprehensive failure testing plans for each new service or major service change that we deploy is key. What are the different ways that this service could fail? What happens if this service fails? How will that affect the connected services? What happens if one of the connected services fails or becomes temporarily unavailable?

Additionally, we are in the process of auditing our outage history and categorizing it into different failure modes. This audit will allow us to design new tools to introduce common failure modes and make sure that we are now resilient to them across our whole platform.

Make Your Test Suite Accessible

What does it mean for something as complex and dynamic as a platform to be “well-tested”? The answer goes beyond simple functional coverage.

Testing has been my specialty through much of my 14 years of experience in software. If there is one thing I’ve learned about testing, it is that tests can, and should, do more than just test. Tests can be used to communicate and collaborate. Tests can also be used to discover what your product is, as well as what it should be. At their best, tests can be the frame of reference that anchors a team and solidifies team goals into verifiable milestones.

Testing the platform

The Knewton platform is composed of many component services. Each service is developed by a dedicated team and is tested on its own with standard unit, integration, and performance tests. This article is not about those tests, but about how we test the platform as a whole.

The Knewton platform uses data to continuously personalize the delivery of online learning content for individual students. The platform determines student proficiencies at extremely detailed levels, provides activity recommendations, and generates analytics. To do all this, our platform must be fast, scalable, and reliable. Our team must be skilled at grappling with intricate technical problems, while maintaining high-level perspective and focus on the greater system. Testing is part of how we maintain this dual perspective.

Accessibility is the most important criterion we build into our tests to help us achieve the above goals.

In the context of a full-stack test suite, accessibility to me means at least the following:

– Anyone can run the tests
– Anyone can read the test report and analyze test failures
– Anyone can read, change, extend, or otherwise interact with the test definitions

Making tests accessible and promoting those accessible tests can be a tough cultural challenge as well as a tough technical one. But the cost of failing at this is high. The more isolated your test suite (and the engineers who create and execute it), the less value you will derive from it. Your tests will not reflect involvement from the greater organization, and more importantly, the information your tests generate will not be disseminated as widely throughout the organization as it could be.

So how is a test suite made “accessible”?

Anyone can run the tests

The best thing you can do with a test suite is get it running in your continuous integration server. At Knewton we use Jenkins as our CI server. Anyone in our organization can use Jenkins to invoke the tests against any testing environment, at any time, without any special setup on their computer whatsoever.

Additionally, the test code is in our Git repository, and everyone is encouraged to check it out and invoke the tests in flexible ways: a single test, a set of related tests, or the tests that correlate with a given JIRA ticket, against either a local development environment or a deployed one. A test suite that can be run flexibly is an important part of accessibility.

Anyone can read the test report

Our test suite produces several kinds of test reports. The report I enjoy showing off the most is the HTML report, which lists every test that runs and details every test that fails (this capability is built into the wonderful RSpec testing framework we use). This HTML report is archived in Jenkins with every test run, so anyone can read it for any test run right within their browser. And because the report uses plain English, it is comprehensible by anyone who is familiar with our platform’s features, developers or not.

Here is what a small portion of our HTML test report looks like, showing both passing and failing tests:


What may or may not be obvious here is that tests are really about information. When I test a piece of software, my product is actionable information. When I make an automated test suite, my product is an information generator. Building a generator of information is one of the more valuable and interesting bits of work a QA engineer can do; here at Knewton, we encourage this mentality.

Anyone can change the tests

First and foremost, my job at Knewton is to enable tests to take place easily. Secondly, my job is to assist and initiate the creation of actual tests. Here at Knewton, it’s great for me to see the testing framework I created be picked up by developers, changed, extended and generally used. While we do formal code reviews on the tests, we try to make that process very efficient in order to ensure that there are very low barriers for anyone who creates a platform test.

What does accessibility get you?

Here are just a few of the ways that an accessible test suite brings value to an organization:

- Raising awareness, throughout the entire organization, of the behaviors of the system and the interactions between its various components.

- Eliminating bottlenecks when releasing: got the code deployed and need to run the tests? Just go press the button.

- Enabling continuous deployment: when your tests are in your continuous integration system, it becomes easy to chain together build, deploy, and test plans into a continuous deployment scheme (we are still working on this one).

- Encouraging better tests: when non-testers are encouraged to get involved in testing, unexpected questions get asked.

More to come

Testing is a massively important part of the puzzle for Knewton as we scale our technology and our organization. We are learning more every day about how to make the best, most accessible and valuable tests we can. In a future post, I intend to share some of the technical details and tools we have been using to make our tests. In the meantime, I welcome your feedback on the ideas presented here and around testing in general.

Simplifying Cross-Browser JavaScript Unit Testing

By Daniel Straus and Eric Garside

JavaScript unit testing is complicated and expensive: unlike server-side logic, code that seems to work fine may execute differently on different browser/OS combinations, which means that adequate test coverage requires testing every supported platform thoroughly. To keep our codebase testable as it grows, we came up with a way to make our testing process easy and scalable.

To write our tests we used ScrewUnit, a behavior-driven testing framework that provides an extensible cross-browser utility for developing tests. Each test suite is just a web page; running it is only as difficult as pointing a browser at it. Let’s take a look at an example:
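
The original snippet is not preserved in this version of the post; below is a hypothetical reconstruction of the kind of test in question, assuming the problematic input is an ISO 8601 date string, which WebKit parses but old IE versions do not.

```javascript
// Hypothetical reconstruction: the core of the test is date parsing.
// Date.parse of an ISO 8601 string returns NaN in IE <= 8, so the same
// expectation that passes in WebKit fails there.
var iso = '2011-10-10T14:48:00';
var year = new Date(iso).getFullYear();   // 2011 in WebKit; NaN in old IE

// Wrapped in ScrewUnit's describe/it/expect DSL (assumed syntax), the test
// would look roughly like:
//   describe('date parsing', function () {
//     it('parses ISO 8601 date strings', function () {
//       expect(new Date('2011-10-10T14:48:00').getFullYear()).to(equal, 2011);
//     });
//   });
```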

This test passes in a WebKit browser but not in Internet Explorer (IE can’t process the date format). As you can imagine, it would be quite time-consuming to run tests like this manually on multiple browsers for every check-in.

To make the use of ScrewUnit manageable and efficient, we turned to Selenium WebDriver, a browser automation tool. We can write Selenium routines that open a project’s ScrewUnit test page and analyze its results. With a single line of code, we can check whether ScrewUnit detects failures:
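
The original one-liner is not preserved here; the sketch below captures the idea. ScrewUnit reports failing tests by tagging their elements with a CSS class (assumed here to be `failed`), so a Selenium routine can load the suite page and count those elements. A live check with selenium-webdriver might look roughly like `driver.findElements(By.className('failed'))`; for a self-contained example, the check is stubbed against a page-source string.

```javascript
// Stubbed pass/fail check: the suite passed if no element carries the
// "failed" class in the rendered report page.
function suitePassed(pageSource) {
  var failures = pageSource.match(/class="[^"]*\bfailed\b[^"]*"/g) || [];
  return failures.length === 0;
}
```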

This is an improvement over the previous method, but to make the process optimal we wanted to run these routines in parallel across multiple browser/OS combinations. To accomplish this we used Sauce Labs, which lets us kick off tests from a single machine and doesn’t require a large testing hardware infrastructure on our end (everything runs in the Sauce Labs cloud). Below is an example of how this all comes together:
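
The original example is not preserved here; as an illustrative sketch, the fan-out can be expressed like this. `runSuite` is a stand-in for a function that starts a remote Sauce Labs session and resolves to a pass/fail result, and the platform list is made up.

```javascript
// Fan the same ScrewUnit check out across several browser/OS combinations
// in parallel: one remote session per platform, all started at once.
var platforms = [
  { browser: 'firefox', os: 'Windows' },
  { browser: 'internet explorer', os: 'Windows' },
  { browser: 'safari', os: 'OS X' }
];

function runEverywhere(platforms, runSuite) {
  return Promise.all(platforms.map(runSuite)).then(function (results) {
    return results.every(function (passed) { return passed; });  // all green?
  });
}
```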

By combining ScrewUnit, Selenium, and Sauce Labs, we removed the complexity of running JavaScript unit tests across multiple browsers. We’ve put together a demo of our tests as they run on our own project.