Creating and maintaining the infrastructure to support a 40+ person developer team and more than a million users on the world’s most powerful and rigorously adaptive learning platform is no simple task. Conventional wisdom would suggest that a ten-person team with a wide swath of specialists would be the ideal arrangement. But in this regard, as with a number of other tech team practices, Knewton is anything but conventional.
Simply put, Knewton does not have an Ops team. Instead, the knowledge and tools required for infrastructure and systems tasks are distributed throughout the team. This structure confers a number of benefits. Developers are better able to write configurable, maintainable, deployable software because they have a strong understanding of our systems infrastructure. And systems-focused engineers are better able to optimize and maintain our infrastructure because they are also contributing members of the service developer teams.
In practice, this means that all software developers at Knewton are expected to both understand and utilize infrastructure technologies. All developers may find themselves on production support and tasked with updating environment configurations or deploying new code. Likewise, systems engineers all have a solid software engineering background and are expected to write production code.
Expectations and cross-pollination are only part of the process, however. Here’s some insight into the tools we use to create an environment where development and infrastructure work hand-in-hand.
AWS for everyone!
Every Knewton developer has access to our AWS console as well as the AWS command line tools pre-installed on their machines. Eventually every developer will go through the process of deploying code to production, at which point he or she will learn the basics of the AWS toolkit as well as our deployment and configuration management systems. As a result, no developer need waste time emailing the Systems group for basic information about EC2 instances or other AWS resources.
Knewton makes use of a number of AWS technologies, from EC2, to RDS, to ElastiCache and ElasticMapReduce. Many of these products have their own CLI tools, though they are all simple enough to install and configure that at Knewton we’ve made them part of box setup for new developers. While not every engineer is a wizard with the console or CLI commands, there is enough documentation to ensure any developer working alone on a late night will not be blocked by an AWS issue (unless the block is on AWS’ end…).
We have a few in-house tools that make use of AWS’ APIs, among them k.aws and the Knewton Crab Stacker. To focus this article a bit, I’ll dive specifically into the latter, as it addresses an important problem: deploying and updating software stacks.
CloudFormation and KCS
Knewton Crab Stacker, or KCS, makes use of the AWS CloudFormation toolset. CloudFormation makes it possible to define and deploy combinations of AWS resources, from EC2 instances to load balancers. The “define” part comes in the form of JSON templates, while deployment can be done either using the AWS console or a CLI tool.
Now, CloudFormation on its own is great for just “putting a box out there.” The templates let you define a wide variety of resources and set parameters and mappings for a base level service setup and tuning.
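To make that concrete, a hypothetical minimal template might look like the sketch below. The resource names, AMI ID, and defaults are made up for illustration; they are not from any real Knewton stack.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Parameters": {
    "InstanceType": {"Type": "String", "Default": "m1.medium"},
    "SSHKey": {"Type": "String"}
  },
  "Resources": {
    "ServiceInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": {"Ref": "InstanceType"},
        "KeyName": {"Ref": "SSHKey"},
        "ImageId": "ami-12345678"
      }
    }
  }
}
```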
What CloudFormation doesn’t do well is box configuration. The JSON templates don’t allow you to do much in the way of defining functions, setting conditions, or even basic math. If you try to force configuration management onto it, you end up with a lot of bash scripts and config files floating around, or worse, hardcoded into templates.
Even relegating CloudFormation to deploying boxes can be tricky. The launch command for a given stack can include a dozen or more command line arguments, such as EC2 instance parameters (size, type, etc.) and command flags.
The simplest case launch will make use of all defaults in the template and look something like this on the command line:
cfn-create-stack $StackName --template-file $Template
But if you need to use other CloudFormation functionality and override a number of parameters at launch, you’ll end up with something like this:
cfn-create-stack $StackName --template-file $Template --parameters AWSAccount=Production;InstanceType=m1.large;ClusterSize=4;ConfigTarballVersion=2.1.5;AppVersion=1.2.3;SSHKey=ProductionKey --capabilities CAPABILITY_IAM --disable-rollback
Yikes! That’s a lot to type from memory for each deploy. You’re especially going to want that last option to disable rollback as it keeps the instances from failed launches around for you to debug — essential for when you inevitably mistype a version number.
If stack launches are fairly consistent, you can mitigate the annoyance of launch commands with bash scripts, but those scripts become a pain to maintain. What if you have a number of frequently changing parameters, or decisions that need to be made at launch? What if you need to work with multiple AWS accounts, or validate components of your launch config to avoid a painful debug cycle? (Is that tarball you need in s3? Does this environment have all of my new stack’s dependencies?) Complex stacks can take ten to twenty minutes to launch. You don’t want to have to relaunch just because you fat-fingered the instance type.
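One way to tame a launch command like the one above is to assemble it from version-controlled config rather than typing it from memory. Here is a minimal Python sketch of that idea; the function name and config shape are illustrative, not part of KCS.

```python
# Sketch: build a cfn-create-stack argv list from a config dict so launch
# parameters live in version control instead of shell history.
# build_create_command is an illustrative name, not a KCS function.

def build_create_command(stack_name, template, params, disable_rollback=True):
    """Return the cfn-create-stack argument list for the given launch config."""
    args = ["cfn-create-stack", stack_name, "--template-file", template]
    if params:
        # CloudFormation's CLI takes parameters as semicolon-joined key=value pairs
        joined = ";".join(f"{k}={v}" for k, v in sorted(params.items()))
        args += ["--parameters", joined]
    if disable_rollback:
        args.append("--disable-rollback")  # keep failed instances around to debug
    return args

cmd = build_create_command(
    "Dev-UserService", "user-service.json",
    {"InstanceType": "m1.large", "AppVersion": "1.2.3"},
)
```

From here the list can be handed straight to `subprocess.run`, and the config dict can be diffed and reviewed like any other code.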
The problem with the command above is that every parameter represents a potential point of failure. CloudFormation is only able to ensure that your template is logically consistent and the JSON is valid. It can’t know whether or not AppVersion 1.2.3 is a thing, or whether a four node cluster matches what is in the current environment, or numerous other details that can spoil an update before it begins.
This is where KCS steps in. Knewton Crab Stacker was developed by a team of Knewton engineers (including yours truly). KCS is a Python command line tool designed to make CloudFormation deployment much simpler.
The first nice thing KCS does is add the abstraction of an “environment” to our AWS accounts. It does this by simply taking the stackname parameter and prepending $EnvironmentName + “-” to it. From CloudFormation’s perspective, the stackname is “Dev-UserService,” but KCS understands the stack as “the UserService stack in the Dev environment.”
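The namespacing trick is simple enough to sketch in a couple of lines; the helper names below are illustrative, not KCS's actual API.

```python
# Sketch of the environment-namespacing idea: the CloudFormation-visible
# stack name is just "<Environment>-<Service>". Helper names are illustrative.

def qualify(environment, service):
    """Stack name as CloudFormation sees it."""
    return f"{environment}-{service}"

def split_stack_name(stack_name):
    """Recover (environment, service) from a qualified stack name."""
    environment, _, service = stack_name.partition("-")
    return environment, service
```

So `qualify("Dev", "UserService")` yields the "Dev-UserService" name CloudFormation sees, while KCS can keep reasoning in terms of environments and services.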
Making use of the namespace this way greatly simplifies the task of isolating test environments from one another. It adds one more piece to launch commands, which in the simplest case look like this:
kcs stacks create $Service $Template $Environment
The difference between this and the simple CloudFormation command above is what goes on behind the scenes.
Before initiating the create, KCS checks a number of things. First, KCS makes sure that the environment has any stacks that the new service depends on. If a dependency is missing, you can still force a create. Second, KCS ensures that any s3 resources referenced in the template or launch command actually exist. In other words, if your launch command specifies “ServiceTarballVersion=0.3.444”, KCS makes sure that said tarball is actually there.
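The shape of those pre-flight checks can be sketched as below. The s3 lookup is injected as a callback (in practice it would wrap an AWS API call) so the logic is testable without credentials; every name here is illustrative, not KCS's real interface.

```python
# Sketch of KCS-style pre-flight validation: collect every problem that would
# doom the launch, instead of failing twenty minutes in. Names are illustrative.

def preflight_errors(params, required_stacks, existing_stacks, s3_exists):
    """Return a list of human-readable problems with this launch config."""
    errors = []
    for dep in required_stacks:
        if dep not in existing_stacks:
            errors.append(f"missing dependency stack: {dep}")
    version = params.get("ServiceTarballVersion")
    # Hypothetical key layout for the service tarballs bucket
    if version and not s3_exists(f"tarballs/service-{version}.tar.gz"):
        errors.append(f"tarball for version {version} not found in s3")
    return errors
```

A create (or update) proceeds only when the list comes back empty, or when the operator explicitly forces it.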
Updates are far more common than creates, and it is here where KCS really does a lot for you. Here’s a simple update command:
kcs stacks update $Service $Template $Environment
Like the create, KCS does a ton of validation on the template and environment. With the update however, KCS also runs a diff on the existing stack. Before the update actually runs, you will be shown a list of every parameter the update adds, removes, or changes. From there, you can either proceed with or cancel the update.
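The diff itself is ordinary dictionary comparison; a minimal sketch (function name illustrative, not KCS's API) might look like this:

```python
# Sketch of the pre-update diff KCS shows: compare the running stack's
# parameters against the proposed update. Names are illustrative.

def diff_parameters(current, proposed):
    """Return (added, removed, changed) between two parameter dicts."""
    added = {k: v for k, v in proposed.items() if k not in current}
    removed = {k: v for k, v in current.items() if k not in proposed}
    changed = {k: (current[k], proposed[k])
               for k in current.keys() & proposed.keys()
               if current[k] != proposed[k]}
    return added, removed, changed
```

Printing those three dicts before asking "proceed? [y/N]" is enough to catch a mistyped version number before the update runs.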
Before I do an update of a stack, I can also use “describe” to see what’s in the environment currently. The full command is “kcs stacks describe”, but I can shorten it using “s” and “d”, and aim it at our Dev environment like so:
kcs s d User Dev
Dev - $93.6 monthly
Stack Status $ monthly Creation Last Update
User update complete $93.6 4 months ago 3 months ago
SystemConfig Version: 1.1.55 User App Version: 1.1.224
i-01234567 ec2-01-23-45-678.compute-1.amazonaws.com (m1.medium)
This gives me a lot of cool info including the version of the App, some parameter information, as well as the instance ID, type, and hostname. If I want an exhaustive list of parameters I can do this:
kcs s d User Dev --detailed
Dev - $93.6 monthly
Stack Status $ monthly Creation Last Update
User update complete $93.6 4 months ago 3 months ago
Cluster Size: 4
SystemConfig Version: 1.1.55
Environment Class: Dev
User Version: 1.1.224
Instance Type: m1.medium
DB Password: ******
Billing Environment: Staging
UserDB Address: amazonrds.us-east-1.rds.amazonaws.com
Key Name: STAGING-007
i-1234567 ec2-01-23-45-678.compute-1.amazonaws.com (m1.medium)
These commands make it easy to run updates without knowing much about the stack, but there is an even easier method for the truly lazy:
kcs stacks interactive-update $Service $Environment
This command uses the existing stack’s template and then lets you pass in values for each parameter while showing you what is in the current environment. It guarantees that you only change exactly what you want to change.
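The core of an interactive update like that can be sketched as a loop over the current parameters, where pressing enter keeps the existing value. The prompt function is injected so the sketch can run non-interactively; the function name is illustrative, not KCS's API.

```python
# Sketch of an interactive update prompt: show each parameter's current value
# and let a blank answer keep it. Names are illustrative, not KCS's API.

def interactive_values(current_params, prompt=input):
    """Ask for a new value per parameter; blank input keeps the current one."""
    new_params = {}
    for name, value in current_params.items():
        answer = prompt(f"{name} [{value}]: ").strip()
        new_params[name] = answer or value  # blank answer keeps current value
    return new_params
```

Because the defaults come from the running stack, the only values that change are the ones the operator actually types.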
When the update actually runs, KCS adds a few layers of insurance that CloudFormation does not. For one, it spins up brand new instances, runs their launch config, and then waits for success signals before tearing down the old stack. This allows you to set up whatever level of functionality and performance testing you want as a condition of a successful update. If part of the launch config fails or does something unexpected, KCS rolls everything back.
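That "new boxes first" flow can be sketched as a small state machine. The launch, health-check, and teardown steps are injected callbacks standing in for real AWS operations; none of these names are KCS's actual interface.

```python
import time

# Sketch of the safe-update flow: launch replacement instances, wait for
# success signals, and only then retire the old stack; otherwise roll back
# by tearing down the new instances. Callbacks are illustrative stand-ins.

def safe_update(launch_new, signals_ok, teardown_old, teardown_new,
                timeout=1200, poll_interval=30, sleep=time.sleep):
    """Return True if the new stack passed its checks and replaced the old one."""
    launch_new()
    waited = 0
    while waited < timeout:
        if signals_ok():      # e.g. health and smoke tests on the new instances
            teardown_old()    # new stack is healthy; retire the old one
            return True
        sleep(poll_interval)
        waited += poll_interval
    teardown_new()            # signals never arrived; roll back the new stack
    return False
```

The key property is that the old stack is never touched until the new one has proven itself, so a botched launch config costs a rollback rather than an outage.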
All of this just scratches the surface of what KCS can do. I could write a few dozen pages about KCS’ other abilities, like grabbing service logs, executing remote commands, hotswapping jars on Java stacks, and even snapshotting entire environments and then recreating new environments from snapshots (need to copy a 100-instance staging environment? No problem).
The main thing that KCS does is to kill the need for a Release Engineer or “Deployment Guy.” Nobody is happier about this than I am, as I was the “Deployment Guy” for months. Instead, we have a situation now where systems engineers can focus on improving infrastructure and devs can get new code out easily.
The lion’s share of the credit for KCS has to go to Sarah Haskins and Trevor Smith, the two developers who did the bulk of the coding. It has made life easier for all developers here at Knewton, and we hope to open source it in the future.
Configuration management and future challenges
As nice as KCS is for our deployment workflow, it is only able to solve one part of our infrastructure needs. Like any moderately large tech team, there are natural conflicts of interest that arise between those focused on system stability and maintenance, and those trying to push out a slick new feature. We’re not immune from knowledge bottlenecks and technical debt, but as the team grows and practices are refined, the future looks brighter and brighter for our tech team.
At the very least, thanks to KCS, we have a pretty good handle on deploying services. Swell. But how do we configure boxes once CloudFormation puts them out there? How do we ensure that services are able to talk to one another and that stacks are resilient to a plethora of errors?
Those fantastic questions, I’m afraid, will have to be the subject of another “N Choose K” post.
What's this? You're reading N choose K, the Knewton tech blog. We're crafting the Knewton Adaptive Learning Platform that uses data from millions of students to continuously personalize the presentation of educational content according to learners' needs. Sound interesting? We're hiring.