How to Replace a Microservice: Swapping Out the Engine in Mid-Flight

Photo by Michael VH via Flickr (CC BY 2.0)

Photo by Michael VH via Flickr (CC BY 2.0)

Introduction

So you’ve built an important new service. It satisfies current use cases and is also forward looking. Pretty soon it’s mission-critical, so you optimize where you can and keep it humming along.

Then the product changes, and your service needs to do something different. You don’t have time to rewrite from scratch (and you’re never supposed to do that anyway), so you refactor what you can, then start bolting stuff on.

A couple of years later, the product changes again. This time, refactoring isn’t going to cut it: the entire design needs to change. So you take a deep breath and make your case for the big rewrite. The team is justifiably skeptical, but you’re persuasive, and everyone signs off.

Great! The service is going to be so much better this time. With two years of domain knowledge, you know where all the landmines are, you know the stack and the usage patterns like the back of your hand.

Still, it’s slow-going. A bunch of other services now depend on the existing interface in production. Downtime is unacceptable, and a flip-the-switch cutover is just asking for trouble.

Now what?

The Knewton Graph Store

Knewton’s mission is to personalize education for everyone. To do that, Knewton organizes educational content in graph form that shows how knowledge builds on itself. Graphing is a cornerstone of what gives our platform its power.

Knewton’s graph store initially used an event-based model. In order to modify a graph, internal client code would do the following operations:

  • Read graph data using a library built on Titan (a property graph library with transactions and pluggable datastores) and backed by a Cassandra cluster.
  • Make the desired modifications and commit the Titan transaction.
  • Intercept the Titan transaction, package the modifications up into a delta format and write the delta to a queue.
  • Consumers of the queue would pick up the delta messages and write them back into another Cassandra database in different forms.

By the middle of 2015, Knewton was using graphs in ways that required much lower latencies, which meant that frequent round trips to a database wouldn’t work any more. Graphs had grown significantly bigger, which had caused problems with storage and database schemas. Beyond that, many smaller problems also made the service difficult to work with. So we undertook a project to rewrite the graph store.

Design and Development

The design and development of the new service were relatively painless. It’s tempting to claim that this was purely due to how fantastic Knewton’s engineers are (although we are and you should come work with us), but it’s worth noting how important hindsight was to the process. Having seen how the graph store was used in the real world helped us know what to build, what to fix, and what to avoid. One of Knewton’s core values is to “ship and learn”, and after several years of seeing the graph store evolve, we knew what we needed and built the new service accordingly.

Migration

Migration was by far the most complicated phase of the project. It was composed of two parts: migrating the existing data to the new service, then migrating clients to the new service.

Migrating the data

Best practices for doing data migration usually specify a five-step process:

  1. Set up the new datastore (or new table or schema).
  2. Modify existing business logic to write to both the old and new datastores while still reading from the old one.
  3. Backfill historical data into the new datastore.
  4. Start reading from the new datastore.
  5. Stop writing to the old datastore.

Due to the event-sourced nature of writes, we were able to accomplish Step 2 by adding a new consumer to the events topic. Knewton uses Kafka as our main messaging platform, and one of its most useful features is the ability to have multiple consumers subscribe to the same message stream at the same time. This allowed us to mirror incoming data to both the new and old datastores with a minimum of trouble.

Difficulties arose in Step 3 due to issues with old data. Midway through the migration, we discovered that data stored several years earlier didn’t match current schemas. The validation logic had evolved quite a bit, but the historical data had never been cleansed.

This absence of strict validation mechanisms is a downside of many NoSQL solutions: since the database isn’t enforcing any schemas or constraints, the burden falls entirely on application logic, which typically changes much more quickly than schemas do.

Migrating the data ended up requiring several weeks of time-consuming manual investigation and data introspection.

Migrating the clients

Though ultimately worth it, migrating the clients off of the old service took a lot of time and effort because of all the necessary cross team coordination. When trying to switch clients onto a new interface for your services, it’s customary to leave the old interfaces available for a period of time to allow other teams some breathing room to do their migration. While we were able to accommodate our clients using this strategy, unforeseen interdependencies between clients caused us to have to spend several more weeks than planned coordinating the final rollout.

Results

In the end, we were able to serve all of our clients’ needs while simplifying the system and cutting hosting expenses by more than 70%. Knewton completed the migration with no data loss, no downtime, and no outages. Knewton is now able to handle significantly larger graphs and build new features that enable our partners to do things that weren’t possible before.

Lessons Learned

Migrating takes longer than you think

Migration is often the phase of the project with most uncertainty, so it’s difficult to forecast accurately. It’s easy to assume that the difficulty of migrating to a new service will purely proportional to the amount of data being moved, but there are several other factors to keep in mind:

Datastore differences

Moving between different datastores makes the migration much more complicated. You have to account for the particular mechanisms, guarantees, and performance characteristics of your new datastore. Changing datastores also typically means that you have to rewrite tooling and documentation as well.

Data hygiene

Data will get dirty. There are always going to be a few objects or rows that don’t exactly conform to the rules you expect them to, especially in NoSQL databases where hygiene mechanisms such as constraints and foreign keys often don’t exist. Unless you’ve taken great pains in the past to make sure your data is exactly what you think it is, migration time is when you’ll find out which of your assumptions aren’t true.

Clients

You have much less control over the clients of your service than you do over your own code. Clients have their own services and their own schedules, and you’re the one throwing in a monkey wrench by changing one of their dependencies. There are several ways to accommodate clients, but each approach has costs:

  • Be flexible about cutover deadlines
    • Clients have their own schedules to manage,so it’s a good idea to give them some breathing room to do the migration. A hard cutover date is a recipe for problems.
    • Tradeoff: total migration time (and thus the total time you have to maintain the old service for backwards compatibility) can stretch out longer.
  • Get feedback early and continuously
    • A good way to avoid any unforeseen issues in client migration is to continually solicit feedback on both technical considerations (like interface design and documentation) and project management considerations (like timing and scheduling).
    • Tradeoff: communication requires a lot of energy and effort from all involved parties, with a payoff that is not immediately visible.
  • Minimize the need for coordination
    • Coordination across multiple teams is difficult, and even more so with scale. In general, it’s a good practice to avoid it because it reduces the amount of time any one team spends waiting for other teams to complete precursor tasks.
    • Trade-off: Enabling all clients to complete the migration independently requires more effort and planning from the service owner, especially in response to unforeseen events.

Rewrite only when you need to rearchitect

It’s sometimes difficult to judge when to rewrite a service from the ground up, but a good rule of thumb is to rewrite only if the system’s architecture itself is blocking your progress. It can be easy to convince yourself that the code may be confusing, or that the data model doesn’t make sense, or that there’s too much technical debt, but all those issues are solvable “in place.”

If you can refactor the code, refactor the code. If you can change your database schema incrementally, change your schema. When you can’t move forward any more without changing the architecture, it’s time to think about a rewrite.

Don’t look too far down the road

One of the easiest traps to fall into in software engineering is to build a service (or feature or library or tool) because “it might be needed down the road.” After all, software engineers are supposed to be forward looking and anticipate issues before they occur. But the further you look, the less clearly you can see. It’s important to resist temptation and build only those things that you know you need now (or, at most, in the immediate future).