Distributed Tracing: Motivation

Knewton’s technology solutions comprise more than 50 different services. Such a complex system can be difficult to debug, which is why we set out to integrate distributed tracing into our architecture. A distributed tracing system tracks requests by assigning a unique identifier for the request that is maintained through all service layers, from initial reception through response to clients. Without a distributed tracing system, understanding issues occurring deep in the stack becomes extremely difficult.

Let’s walk through a typical debugging scenario without distributed tracing. Here is a simplified view of the architecture involved in receiving a student event and generating a subsequent recommendation:


  1. Student completes some work. The results are posted to the HTTP server.
  2. The HTTP server reposts the event to the distributed queue for durability.
  3. The student event processor converts the information from client language to internal Knewton language.
  4. The student event processor posts the translated event to a distributed queue.
  5. The recommendation service reads the event, generates a recommendation, and returns it to the client.

Let’s say the client reports that an event was sent for User A, but no recommendation was received for what User A should work on next. The problem could exist at any of the five layers described above. The identifiers used within each service may be different, so to find where the problem occurred, we’d have to query the logs of each service in serial. Debugging becomes even more complicated in a non-linear dependency chain.

In the world of distributed tracing, debugging would be reduced to two steps: find the incoming event, noting the trace ID assigned to it, and searching across all services for logs associated with that trace ID.

The distributed tracing system provides latency data for each step in the transaction. Before distributed tracing we were unable to calculate end-to-end time for a single transaction, but we could visualize the connections between our services. The Grafana graph below shows 95% latency for various steps in recommendation processing.


To get the most out of a distributed tracing system, we identified key requirements:


  • Adding tracing support to a service should require minimal or no code changes.
  • Adding tracing support to a service should not increase latency, nor should it affect service uptime or reliability.
  • The solution must be able to support our interservice communication protocols: Thrift, HTTP and Kafka.
  • The solution must provide a way for an external system to input the trace ID.  Current full-platform tests tell us only that an issue occurred, but have no indication as to where.  Smoke test alerts could include the trace ID, which would make debugging much quicker.
  • The trace ID and information should be accessible to a service for any purpose that the service finds useful, such as logging intermediary actions or including in error logs.


  • A solution that traces all events.  Some tracing solutions trace only a portion of traffic, and you never know when a particular event will require investigation.
  • A solution that will display an event from end to end, through each service that interacts with it.
  • Tracing information must include unique identification, system, action, timing and sequence.
  • System time across services cannot be guaranteed, so the solution must implement tracking of a logical order so that events can be displayed in the order in which they occurred.
  • Tracing data will not contain any Personally Identifiable Information or information proprietary to customers or Knewton.


  • The solution must function in all environments: development, quality assurance, production.  Retention policy may be different for different environments.
  • Tracing reports must be available for immediate debugging and investigation.
  • Tracing data will be available for real-time debugging for one week.  All tracing data will be retained offline for historical analysis.

We analyzed the following distributed tracing solutions for their ability to satisfy our requirements:

Most of these products did not support all of the protocols we require. Kafka was most often missing, with Thrift a close second. And it would not be possible to enhance for the proprietary products to fit our protocol needs. Brave was a particularly compelling solution, but ultimately we decided it was too invasive.

In the end we decided to use Zipkin without Finagle. Finagle is a great product, but did not support Thrift 7 and reverting to an older version of Thrift would have been a large effort in the wrong direction. In the end, we upgraded to Thrift 9, but this was wire compatible between server and client so it was much easier to roll out than a switch to Scrooge.

Our next blog post will explain how we were able to integrate distributed tracing compatible with Zipkin and Finagle into our code while meeting all of the above requirements.

On Comparable Noise and Satisfiable Sound

The Motivation

Working for a startup can be intense. Many in this great city succumb to the pressures of work life and relinquish their purely intellectual pursuits. Here at Knewton, however, our love affairs with art, music, literature, formal systems — love affairs that, for the most part, began within very walls of the institutions that we hope to disrupt — these passions inspire our work and our camaraderie with each other. And practically speaking, interview questions at New York’s hottest startups can lean heavily on what was learned in the undergraduate CS lecture hall. The need to keep these skills sharp is clearly still relevant.

The aim here is to kick off a series of blog posts that pair substantial results in CS or formal systems with significant works of classical music. The series is limited to classical works explicitly — there is plenty of headphones-on time at work to listen to The Postal Service.

The Works

Listen to this: Gondwana by Tristan Murail

Read this: On Computable Numbers by Alan Turing

Alan Turing’s On Computable Numbers, with an Application to the Entscheidungsproblem is a seminal paper. Anyone taking a CS undergraduate core will encounter the notion of computability and the halting problem. Further from the ivory tower, it is not difficult to find people who are near-giddy about Turing Completeness:

Breaking news: HTML + CSS3 is Turing Complete,

People looking for help developing a Turing machine in Microsoft Excel,

A Turing machine built using Legos,

Brainfuck, a language described on its website as “an eight instruction Turing-Complete Programming Language”

Turing’s name has become quite literally a “formal” stamp of approval in the computing world. Turing’s seminal paper defined a set of abstractions that formed the basis for what we now call “Turing Completeness,” a requirement of all first-class computing languages. What exactly did Turing say in that paper? And what does it have to do with French spectral music?

Turing’s Paper

Beyond a few feeble mechanical attempts, computers did not exist in 1936, and Mr. Turing didn’t appear that excited about making one — he was interested in making Godel’s long, German work on incompleteness comprehensible. That is, Turing wanted to intuitively construct a class of decision problems that could be proven undecidable. Turing could not tackle this without defining a notion of “computable.” His definition leaned heavily on a few notions that we would recognize today as common, but required a little more formal attention-to-detail back then.

Turing describes memory:

We may compare a man in the process of computing a real number to a machine which is only capable of a finite number of conditions q1, q2, …. qI; which will be called “m-configurations”

and the idea of output:

… the subsequence of the symbols printed by it which are of the first kind will be called the sequence computed by the machine. The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

Academic language aside, the machine Turing is defining is simple. Turing calls this machine an “‘automatic-machine’ (or alpha-machine).” The machine has the following properties:

a tape divided into squares

symbols that are on the squares

the scanned symbol (the rth symbol). Modern texts will likely refer to this as the symbol the ‘head’ is pointing to.

configuration: the current m-configuration and the current scanned symbol (qn, G(r))

But this paper is not without its difficulties, especially if it is approached today.

One initial difficulty: having completed standard undergraduate classes in formal computation theory, Turing’s initial choice of terminology — “Circle-free” and “Circular” — was extremely confusing to my modern eyes. Most undergraduates approach Turing’s ideas in their coursework from the “halting problem” angle. What is initially confusing here is the following:


How can this be? Turing quickly skirts over the definition of a circle-free machine early on in the paper when he describes it as the following:

If a computing machine never writes down more than a finite number of symbols of the first kind, it will be called circular. Otherwise it is said to be circle-free. A machine will be circular if it reaches a configuration from which there is no possible move, or if it goes on moving, and possibly printing symbols of the second kind, but cannot print any more symbols of the first kind. (p. 233)

Finite. This should be the same as a machine that “halts,” right?

No. Turing’s machines are built to write down numbers. These numbers are (somewhat sneakily, in my opinion) implied to be repeating binary decimals on page 232:

The real number whose expression as a binary decimal is obtained by prefacing this sequence by a decimal point is called the number computed by the machine.

So we’re essentially talking about endlessly repeating floats — the kind of numbers that you get when you type in [1] [/] [3] on a calculator, except in binary. Numbers with bars over the top of the last digit: 0.33333333… for example (0.010101010… in binary). In Turing’s paper, the machine that prints 0.010101010… is “circle-free.” This means it encounters no terminal difficulty in computation — it can keep printing out the endless decimal representation without pausing indefinitely to calculate. By this definition, “finite” numbers such as 4, 5, and 6 would be circle-free numbers, because the machine could be constructed to just print out a sequence of zeroes after the decimal point: 6.000000000….

“Halting” really has no place in his discussion: if we took a machine that “did not halt” by the “halting problem definition”, it would print out, say, 6.000 and then wait for an indefinite amount of time before printing out the next zero.

Understanding the important concept of a circle-free machine is critical, even if your only goal in reading the Turing is to see Godel’s diagonalization “done simply” (see the argument on page 247 of the article, where he argues that the machine H is circular by suggesting an infinite loop).

Why the Murail?

The Murail is a significant representation of the spectral music genre which has significant parallels with the work of Turing.

The Murail, like the Turing, uses an intuitive language, completely divorced from the method-as-music ideal of serialism (about serial music). Gondwana is based entirely off of the first sounds you hear in the piece — it is the subsequent deconstruction that leads the listener down a path that is nearly always defined momentarily, in that each subsequent moment of sound in the piece is very often based off of the prior moment. There are few moments of discontinuity.

The work starts off very simply. The piece begins with seven to eight moments of repeated sound (see 0:12, 0:30, 0:46, 1:01, 1:15, 1:28, 1:38, 1:48, and then MAYBE [1:55, 2:01, 2:07, 2:13-2:40] in this recording), as easily understood derivations from a simple sonic idea.

Indeed, these events are like Turing’s “appeal to intuition” by Murail. There is no recognition of pattern possible without repetition. With the seven chime-strike chords, Murail is handing us a ball of yarn — we will need it to find our way home from the complicated, thorny walk in the woods that is the rest of the piece. There will be no resounding return, no recapitulation to save the day here.

Beyond the opening, the Murail begins to dig deep into the essence of sound. At roughly eight minutes into the work, a hissing begins to become a main sonic feature. The connection to the beginning is simple — there is a brief, percussive sound at the beginning of the piece with a few strings playing, perhaps a maracca — there are lots of high, hiss-like frequencies that comprise the sound of a maracca. But it brings to mind an important question — a question that is at the heart of spectral music: what kind of music is happening in the parts of sound waves that we DON’T hear? What about all the extremely high frequencies that dogs can hear? Above those, even? They exist. They have to. A sound wave can be encoded by a number. Large numbers exist. What if we slowed the noise, say, of water molecules at room temperature bouncing around inside a mason jar? Anyone that has been distracted by the sound of a tea kettle on the verge of boiling already knows that interesting sonic things can happen with water molecules on the surface of a boiling hot-plate. But what about beyond the ear? What types of large-number-based frequencies exist at this sonic extremes of our world?

Computable vs. Uncomputable Sounds?

There is a strong argument that much of “Western Harmony” can be boiled down into small, rational-number relationships between different frequencies; numerical relationships, in fact, that are easily computable by a Turing machine (with a reasonable “skeleton table,” even!). A major triad, for example, is little more than the three ratios:

1, 5/4, and 3/2 (pitches ‘do’, ‘mi’ and ‘so’, if you’ve ever watched “The Sound of Music”)

By the standards of Murail (and Turing, for that matter), these are small numbers, and the human ear is quite adept at performing the computation of recognizing them and placing them in reasonable cultural and musical contexts.

The human ear, from a perspective of what it can understand or “compute” in this way, is quite good. We like minor chords, diminished chords — heck, we’ll even listen to Coltrane. Serial music is tougher. In general, music that doesn’t “sound good” to the western ear because of dissonance, etc. likely involves sound frequency relationships that make use of larger numbers. A minor chord, for example, requires use of the 6th and 8th partials. Its ratios, depending on how you define the third of the chord, require much larger numbers than that of a major chord. The numbers involved with frequencies heard in the Murail then, in all their roaring, bending, and hissing glory, beg a final question:

What is the limit? What is uncomputable for the human ear?


Parallels with music aside, readers not familiar with Turing’s life and the circumstances of his death should read this: turing wikipedia article. Oppression and intolerance are real, and Turing suffered greatly at the hands of the British government.

from: www.nateburke.com

Code Swarm Post Git: Data Visualization

The above visualization shows the history of commits for the Knewton Adaptive Learning Platform. A commit occurs when a developer alters the code and transfers the changes into our central git repository. Each file in a commit is represented by a colored-coded blob which indicates the type of file committed. Each time a developer commits a file, the algorithm brings the file closer to the user’s blob; this proximity reflects the associativity between code files and users as well as the relationships among users themselves. The histogram at the bottom displays the frequency of commits over time.

This visualization can also be referred to as a “code swarm”, and is an example of “organic information visualization” (a phrase coined by Ben Fry), that is, any data visualization that allows the elements to interact in a dynamic and unpredictable fashion. The benefit of such a visualization is that it brings the relationships between different project components into relief. The viewer can also observe the evolution of a project over time.