Distributed Tracing: Motivation

Knewton’s technology solutions comprise more than 50 different services. Such a complex system can be difficult to debug, which is why we set out to integrate distributed tracing into our architecture. A distributed tracing system tracks requests by assigning a unique identifier for the request that is maintained through all service layers, from initial reception through response to clients. Without a distributed tracing system, understanding issues occurring deep in the stack becomes extremely difficult.

Let’s walk through a typical debugging scenario without distributed tracing. Here is a simplified view of the architecture involved in receiving a student event and generating a subsequent recommendation:

distributed-tracing-simplified-flow-knewton

  1. Student completes some work. The results are posted to the HTTP server.
  2. The HTTP server reposts the event to the distributed queue for durability.
  3. The student event processor converts the information from client language to internal Knewton language.
  4. The student event processor posts the translated event to a distributed queue.
  5. The recommendation service reads the event, generates a recommendation, and returns it to the client.

Let’s say the client reports that an event was sent for User A, but no recommendation was received for what User A should work on next. The problem could exist at any of the five layers described above. The identifiers used within each service may be different, so to find where the problem occurred, we’d have to query the logs of each service in serial. Debugging becomes even more complicated in a non-linear dependency chain.

In the world of distributed tracing, debugging would be reduced to two steps: find the incoming event, noting the trace ID assigned to it, and searching across all services for logs associated with that trace ID.

The distributed tracing system provides latency data for each step in the transaction. Before distributed tracing we were unable to calculate end-to-end time for a single transaction, but we could visualize the connections between our services. The Grafana graph below shows 95% latency for various steps in recommendation processing.

grafana-graph-recommended-processing-latency-knewton

To get the most out of a distributed tracing system, we identified key requirements:

Integration:

  • Adding tracing support to a service should require minimal or no code changes.
  • Adding tracing support to a service should not increase latency, nor should it affect service uptime or reliability.
  • The solution must be able to support our interservice communication protocols: Thrift, HTTP and Kafka.
  • The solution must provide a way for an external system to input the trace ID.  Current full-platform tests tell us only that an issue occurred, but have no indication as to where.  Smoke test alerts could include the trace ID, which would make debugging much quicker.
  • The trace ID and information should be accessible to a service for any purpose that the service finds useful, such as logging intermediary actions or including in error logs.

Data:

  • A solution that traces all events.  Some tracing solutions trace only a portion of traffic, and you never know when a particular event will require investigation.
  • A solution that will display an event from end to end, through each service that interacts with it.
  • Tracing information must include unique identification, system, action, timing and sequence.
  • System time across services cannot be guaranteed, so the solution must implement tracking of a logical order so that events can be displayed in the order in which they occurred.
  • Tracing data will not contain any Personally Identifiable Information or information proprietary to customers or Knewton.

Availability:

  • The solution must function in all environments: development, quality assurance, production.  Retention policy may be different for different environments.
  • Tracing reports must be available for immediate debugging and investigation.
  • Tracing data will be available for real-time debugging for one week.  All tracing data will be retained offline for historical analysis.

We analyzed the following distributed tracing solutions for their ability to satisfy our requirements:

Most of these products did not support all of the protocols we require. Kafka was most often missing, with Thrift a close second. And it would not be possible to enhance for the proprietary products to fit our protocol needs. Brave was a particularly compelling solution, but ultimately we decided it was too invasive.

In the end we decided to use Zipkin without Finagle. Finagle is a great product, but did not support Thrift 7 and reverting to an older version of Thrift would have been a large effort in the wrong direction. In the end, we upgraded to Thrift 9, but this was wire compatible between server and client so it was much easier to roll out than a switch to Scrooge.

Our next blog post will explain how we were able to integrate distributed tracing compatible with Zipkin and Finagle into our code while meeting all of the above requirements.

What's this? You're reading N choose K, the Knewton tech blog. We're crafting the Knewton Adaptive Learning Platform that uses data from millions of students to continuously personalize the presentation of educational content according to learners' needs. Sound interesting? We're hiring.