Analyzing Java “Garbage First Garbage Collection” (G1GC) Logs

Garbage Collection can take a big toll on any Java application, so it’s important to understand its behavior and impact. After a JVM upgrade of Knewton’s Cassandra database, we needed a tool to compare the performance and impact of different garbage collection strategies, but we couldn’t find an existing tool that would parse GC logs and analyze them.

This post explains the process we followed and discusses some results that we believe may be useful when evaluating Java garbage collection strategies.

Evolution of Garbage Collection on Java

One of the biggest challenges for any programming language is the way it allocates memory while executing a task. Garbage collection is the process of reclaiming sectors of memory that are no longer in use. Garbage collection mechanisms are usually evaluated along three dimensions known as the three legs of the performance stool: latency, throughput, and footprint. In real life there is always a tradeoff, and if all the stars align, you can optimize for two of the three dimensions.

carlos-blog-1

Source: Explorations of the three legged performance stool

Over the course of its history, Java has handled garbage collection in different ways. The early JDK (before 1.4) had a single thread to sweep memory to find out which areas were in use and which ones could be reclaimed.

Java 4 introduced more complex algorithms based on tracing collectors. The idea was to find root objects and identify all instances that were still in use. If an object didn’t have a trace, it was considered not in use, and that area of memory would be reclaimed. The trouble with this approach is that applications had to cease all activities to allow the analysis. Examples of these “stop the world” algorithms include Mark and Sweep, Heap Compaction, and Copying Collector.

Garbage collection strategies before Java 7 relied on the definition of fixed regions of memory named generations:

  • Objects created recently were gathered in the Eden Generation.
  • When the garbage collector ran, objects that had traces showing they were still in use were “promoted” to the Survivor Generation, while objects that were no longer referenced by other objects were removed.
  • Objects that survived multiple passes eventually reached the Permanent Generation.

One disadvantage of these garbage collection algorithms is that they require copying and deleting objects during promotions, temporarily using more memory and in some cases causing an overflow that could crash an application. As the size of those memory buckets is inflexible, this approach could limit applications when they needed to scale. Examples of these strategies are Concurrent Mark-Sweep (CMS) and Parallel Collector.

Java 7 update 4 came with a new algorithm: Garbage First (G1), which divides memory into heap regions of the same size. Unlike CMS, while sweeping memory, G1 uses a pause predictive model to identify areas likely to contain objects not in use and targets them as the first to be evacuated, hence the name. Only the evacuation process, in which objects from the heap that are still in use are compacted and copied to a different region to avoid fragmentation, has stop-the-world behavior. In our tests, however, the stop-the-world phase has been imperceptible.

G1 doesn’t sort objects by generation, thus there are no limits on generation size. The result is a strategy that frees up large portions of memory in less time and that keeps adjusting based on the results of previous garbage collections.

Java 8 supports G1, CMS, and Parallel Collector, and G1 has been proposed as the default strategy for Java 9.

Why Knewton Created a Parser/Analyzer

While upgrading the Cassandra database, the database team searched for a tool that could objectively evaluate the performance of G1 vs. CMS. Although G1 was released in April 2012, the utilities available for analyzing garbage collection logs, like Netflix’s gcviz, didn’t support G1 logs. We also tried an approach proposed by Jeff Taylor of Oracle, but his code was not able to interpret all the information from our servers. Our scripts are an update of Taylor’s approach: they use AWK to collect the logs and format them as a table so that they can be interpreted and analyzed with R.

The JVM options that we have been using are:

-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms1995M -Xmx1995M
-Xss256k -XX:StringTableSize=1000003
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:CompileCommandFile=/hotspot_compiler
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=25
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:+ParallelRefProcEnabled
-XX:ParallelGCThreads=2
-XX:ConcGCThreads=2
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/<valid path>/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M

The resulting logs contain entries like the following:

2016-05-12T20:42:14.917+0000: 2.438: Total time for which application threads were stopped: 0.0204068 seconds, Stopping threads took: 0.0000649 seconds
{Heap before GC invocations=2 (full 0):
garbage-first heap  total 2043904K, used 120064K [0x0000000083400000, 0x0000000083503e60, 0x0000000100000000) region size 1024K, 111 young (113664K), 7 survivors (7168K)
Metaspace     used 18551K, capacity 18694K, committed 18944K, reserved 1067008K 
class space    used 2265K, capacity 2299K, committed 2304K, reserved 1048576K
2016-05-12T20:42:15.303+0000: 2.824: [GC pause (G1 Evacuation Pause) (young) 
Desired survivor size 7340032 bytes, new threshold 1 (max 1) - age   1:    6402368 bytes,    6402368 total, 0.0183564 secs]
  [Parallel Time: 17.2 ms, GC Workers: 2]
     [GC Worker Start (ms): Min: 2824.3, Avg: 2824.3, Max: 2824.3, Diff: 0.0]
     [Ext Root Scanning (ms): Min: 2.2, Avg: 2.2, Max: 2.2, Diff: 0.0, Sum: 4.5]
     [Update RS (ms): Min: 0.4, Avg: 0.4, Max: 0.4, Diff: 0.1, Sum: 0.8]
        [Processed Buffers: Min: 2, Avg: 3.5, Max: 5, Diff: 3, Sum: 7]
     [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.2]
     [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.2]
     [Object Copy (ms): Min: 9.2, Avg: 11.7, Max: 14.2, Diff: 5.0, Sum: 23.4]
     [Termination (ms): Min: 0.0, Avg: 2.6, Max: 5.2, Diff: 5.2, Sum: 5.2]
        [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
     [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
     [GC Worker Total (ms): Min: 17.1, Avg: 17.1, Max: 17.1, Diff: 0.0, Sum: 34.3]
     [GC Worker End (ms): Min: 2841.5, Avg: 2841.5, Max: 2841.5, Diff: 0.0]
  [Code Root Fixup: 0.0 ms]
  [Code Root Purge: 0.0 ms]
  [Clear CT: 0.1 ms]
  [Other: 1.1 ms]
     [Choose CSet: 0.0 ms]
     [Ref Proc: 0.6 ms]
     [Ref Enq: 0.0 ms]
     [Redirty Cards: 0.0 ms]
     [Humongous Register: 0.1 ms]
     [Humongous Reclaim: 0.0 ms]
     [Free CSet: 0.1 ms]
  [Eden: 104.0M(104.0M)->0.0B(1189.0M) Survivors: 7168.0K->8192.0K Heap: 117.3M(1996.0M)->15.0M(1996.0M)]
Heap after GC invocations=3 (full 0):
garbage-first heap   total 2043904K, used 15360K [0x0000000083400000, 0x0000000083503e60, 0x0000000100000000) region size 1024K, 8 young (8192K), 8 survivors (8192K)
Metaspace       used 18551K, capacity 18694K, committed 18944K, reserved 1067008K
class space     used 2265K, capacity 2299K, committed 2304K, reserved 1048576K
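
To illustrate the parsing approach described earlier, here is a minimal Java sketch (not the actual AWK scripts) that extracts the application-stopped time from entries like the first line of the sample above and emits a tab-separated table suitable for loading into R; the class name and output columns are only illustrative.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoppedTimeParser {

    // Matches lines such as:
    // 2016-05-12T20:42:14.917+0000: 2.438: Total time for which application
    // threads were stopped: 0.0204068 seconds, ...
    private static final Pattern STOPPED = Pattern.compile(
            "^(\\S+): [\\d.]+: Total time for which application threads "
            + "were stopped: ([\\d.]+) seconds");

    public static void main(String[] args) throws IOException {
        System.out.println("timestamp\tstopped_seconds");
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                System.out.println(m.group(1) + "\t" + m.group(2));
            }
        }
    }
}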

The Cassandra upgrade was a consequence of a fix for a Thrift Buffer anomaly; this is another example of the continuous efforts of Knewton’s database team to tune and optimize our database servers.

Results

Knewton has conducted extensive analysis of the performance and tuning of our Cassandra nodes. The following is some of the information collected from a 24-node cluster, presented as the hypothetical Knerd Cassandra cluster also described in Dr. Josh Wickman’s post. Each node has 2 CPUs and 8 GB of RAM.

The following examples show ways to analyze the gc.log files. They consider only the logs obtained from production Cassandra nodes during a 24-hour window; during that period a full repair of the database and an automated backup also took place. Note that the results shown are from the test run after the change was made; during exploratory phases we used the cassandra-stress tool for functional testing and ran an end-to-end load test using Locust.

The following histogram compares the durations of garbage collection executions and groups them into three categories:

  1. Real Time: the total time from start to finish for the garbage collection.
  2. User Time: the time spent on processing not related to the kernel (operating system) or to pauses.
  3. Sys Time: the time spent on kernel operations, most of it corresponding to time when the CPU is locked, also known as a stop-the-world scenario.

carlos-blog-2

Time spent on garbage collection in a “stop-the-world” phase (Sys Time) is a fraction of the time when user processes are blocking operations. The garbage collection adapts to minimize impact on the application.

We can also compare the sizes for each generation before and after the garbage collection and confirm the impact that a Cassandra nodetool repair action can have on the garbage collection.

First, a brief explanation of how repairs work in Cassandra: Cassandra is a masterless database, which means that each server that is part of the same cluster will share data (sometimes several copies of the same record). Each node is responsible for knowing where to find a record for a fraction of the data; each record will have a unique token identification; and each node will be responsible for keeping track of all the copies of specific tokens.

When a repair is requested, each node will be responsible for validating the consistency of all copies of the data. (If there are differences the latest version will prevail.) Depending on the amount of data and the replication factor, this can be a very expensive operation on CPU and memory.

The graph below shows the total memory used before and after a garbage collection.

carlos-blog-3

At (1), the command nodetool repair executed a sequential repair in the entire cluster, which used more memory after the garbage collection since the anti-entropy mechanism of the repair put more objects in use.

carlos-blog-4

Garbage collection has a nondeterministic behavior: it does not start at the exact time that the memory threshold is surpassed (2); sometimes it starts fractions of a second afterward. There can also be a delay while G1 determines the areas that will release the most memory; that delay depends on the heap size allocated and how much of it is in use. This can be an area of concern for environments with limited resources (e.g., development virtual environments or Docker containers).

carlos-blog-5

After the repair, the pattern changed, with garbage collections that effectively decreased the memory used below 0.25 GB (3); one explanation is that the garbage collector learns from its previous runs when the repair was being executed.

carlos-blog-6

Finally, there is an additional spike (4) caused by our automated backup mechanism.

carlos-blog-7

From the total heap size, we can also analyze and compare components; in the example we’ll show only the parts of the New Generation: Eden and Survivor spaces, as they usually are the earliest indicators of issues with memory consumption.

carlos-blog-8
carlos-blog-9
Garbage collection is triggered by the total heap size, so garbage collection will happen even when the Eden and Survivor spaces are relatively small.

After garbage collection, the Eden space should be at or close to zero, while the Survivor space gets bigger. The Eden space is also larger than the Survivor space; in this example, by 20:1. This ratio will be different depending on the application.

If needed, we can get more granular information about the different operations (like scanning memory sections or copying objects) that take place during a garbage collection; in the same way as before, we can also identify the change of the behavior that occurs during a repair.
carlos-blog-10
carlos-blog-11

As explained before, a Cassandra repair can consume a lot of resources, but this is necessary to ensure that the data is consistent (especially when there are constant updates). One finding from this breakdown is a change in the behavior pattern: the Object Copy operation (in green) stayed longer for hours after the repair was completed, which could also be an effect of the adaptive nature of the G1 algorithm.

Conclusions

The algorithms and strategies for garbage collection will keep evolving and changing, as we have not yet found one that maximizes throughput with minimum latency and footprint. In the meantime, Java applications will continue to rely upon G1GC for versions 7, 8, and, most likely, 9. By sharing some experiences with logs from Garbage First Garbage Collection (G1GC) and open source data modeling tools, we hope to help other database teams understand their systems better.

References

For Garbage Collection

IBM – Java theory and practice: A brief history of garbage collection

infoq – G1: One Garbage Collector To Rule Them All

Oracle – Getting Started with the G1 Garbage Collector

Oracle – Java SE 6 Garbage Collection Tuning

jaxenter – Java 9’s new garbage collector: What’s changing? What’s staying?

Oracle – Understanding G1 GC Logs

For visualization tools of Garbage Collection logs

GCViewer

Netflix’s gcviz

For data modeling with R

r-project.org

Code

Code is available at this gist

Simplifying Cassandra Heap Size Allocation

As discussed previously, Knewton has a large Cassandra deployment to meet its data store needs. Despite best efforts to standardize configurations across the deployment, the systems are in a near-constant flux. In particular, upgrading the version of Cassandra may be quick for a single cluster but doing so for the entire deployment is usually a lengthy process, due to the overhead involved in bringing legacy stacks up to date.

At the time of this writing, Knewton is in the final stages of a system-wide Cassandra upgrade from Version 1.2 to 2.1. Due to a breaking change in the underlying SSTable format introduced in Version 2.1, this transition must be performed in stages, first from 1.2 to 2.0, then 2.0 to 2.1. While the SSTable format change does not have a direct bearing on the topic of interest here, another change introduced in Version 2.0 is of material importance.

This post describes some nuances of Cassandra’s JVM (maximum) heap memory allocation that we discovered along the way, and how we handled aspects that posed a threat to our systems.

A problem case

Our journey of discovery began with a struggle to address crashes on a particular cluster due to running out of heap memory. For ease of reference, let’s call this cluster Knerd. In our AWS infrastructure, the Knerd cluster was running on m4.large nodes having 8 GB of memory, and by default Cassandra was allocating 4 GB of heap to each node. After exhausting other options, we decided to migrate the cluster to m4.xlarge nodes with 16 GB of memory, believing that these larger nodes would be allocated 8 GB of heap.

Imagine our surprise to see the new nodes come up with 4 GB of heap, same as before.

Default heap allocation in Cassandra 1.2

This phenomenon forced us to take a harder look at the code behind the heap allocation, contained in the cassandra-env.sh config file. The relevant code block, as it ships with Cassandra 1.2, is contained in the function calculate_heap_sizes:

   # set max heap size based on the following
   # max(min(1/2 ram, 4096MB), min(1/4 ram, 8GB))
   # calculate 1/2 ram and cap to 4096MB
   # calculate 1/4 ram and cap to 8192MB
   # pick the max
   half_system_memory_in_mb=`expr $system_memory_in_mb / 2`
   quarter_system_memory_in_mb=`expr $half_system_memory_in_mb / 2`
   if [ "$half_system_memory_in_mb" -gt "4096" ]
   then
       half_system_memory_in_mb="4096"
   fi
   if [ "$quarter_system_memory_in_mb" -gt "8192" ]
   then
       quarter_system_memory_in_mb="8192"
   fi
   if [ "$half_system_memory_in_mb" -gt "$quarter_system_memory_in_mb" ]
   then
       max_heap_size_in_mb="$half_system_memory_in_mb"
   else
       max_heap_size_in_mb="$quarter_system_memory_in_mb"
   fi
   MAX_HEAP_SIZE="${max_heap_size_in_mb}M"

In short, this code block imposes a cap on each of two parameters, then uses the larger parameter as the maximum heap size.

The graph below shows the heap size in GB as a function of the system memory (also in GB), up to the absolute cap of an 8 GB heap. (We’ll focus on integer values for ease of discussion.)

cassandra-2-1-default-heap-allocation

The heap size scales linearly as half the system memory up to 4 GB, then plateaus for systems with total memory between 8 GB and 16 GB. This is precisely the range of values for which we observed a steady maximum heap size of 4 GB on the Knerd cluster.

This plateau is a direct consequence of the half-memory parameter in the shell code being capped at a smaller value than the quarter-memory parameter. While scrutiny of the shell code above will show exactly the behavior displayed in the graph, a new user is unlikely to see this coming, especially in the typical NoSQL climate with short time scales and developer-managed data stores. Indeed, since Cassandra’s heap scales with system memory, it is reasonable to assume that it will scale smoothly for all values of the system memory below the cap. The plateau is at odds with this assumption, and therefore constitutes behavior that might be deemed unpredictable or counterintuitive, and can disrupt provisioning and capacity planning. In other words, this internal Cassandra feature violates the Principle of Least Surprise.

Beyond the plateau, the curve begins to rise linearly again, this time scaling as one-quarter of system memory. Then, at a system memory of 32 GB, the heap size reaches the absolute cap of 8 GB.
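
To make the shape of this curve concrete, here is a small Java sketch that re-expresses the shell logic above (values in MB); the sample inputs in the comments trace out the half-memory segment, the plateau, the quarter-memory segment, and the absolute cap.

public final class DefaultHeapRule {

    // Cassandra 1.2 default: max(min(1/2 ram, 4096 MB), min(1/4 ram, 8192 MB))
    static int maxHeapSizeInMb(int systemMemoryInMb) {
        int half = Math.min(systemMemoryInMb / 2, 4096);
        int quarter = Math.min(systemMemoryInMb / 4, 8192);
        return Math.max(half, quarter);
    }

    public static void main(String[] args) {
        // 8 GB  -> 4096 MB (half wins)
        // 12 GB -> 4096 MB (plateau: half is capped, quarter is only 3072 MB)
        // 16 GB -> 4096 MB (end of the plateau)
        // 24 GB -> 6144 MB (quarter wins)
        // 32 GB -> 8192 MB (absolute cap)
        for (int gb : new int[] {8, 12, 16, 24, 32}) {
            System.out.println(gb + " GB RAM -> " + maxHeapSizeInMb(gb * 1024) + " MB heap");
        }
    }
}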

Modified heap allocation in Cassandra 1.2

If you’re using Cassandra 1.2, we strongly recommend doing something about this plateau in order for your nodes to scale predictably. Here are some options.

The 8 GB cap is generally desirable, because if the heap is too large then stop-the-world garbage collection cycles will take longer and can cause downstream failures. On the other hand, the system memory value upon reaching the cap is not terribly important. It could be 32 GB, as in the default scheme, or it could be 24 GB with little noticeable difference. Thus the plateau can be eliminated in one of two ways:

  1. Keep the remaining features, shifting the transition points to lower system memory values.
cassandra-2-1-modified-heap-no-plateau
  2. Take the average over the values less than system memory of 16 GB, which yields a simple one-quarter slope up to the cap.

cassandra-2-1-modified-heap-constant-slope

Averaging is simpler, but it may not be suitable for small values of system memory. Either approach can be achieved with a simple set of conditionals specifying the scaling for each segment; for example, the first could be coded like:

   if [ "$system_memory_in_mb" -lt "8192" ]
   then
       max_heap_size_in_mb="$half_system_memory_in_mb"
   elif [ "$system_memory_in_mb" -lt "24576" ]
   then
       max_heap_size_in_mb="$quarter_system_memory_in_mb"
   else
       max_heap_size_in_mb="8192"
   fi

Either approach eliminates the undesirable plateau. An upgrade to Cassandra 2.0 or above also effectively removes its impact.

Heap allocation in Cassandra 2.0+

Cassandra 2.0 introduced a change in heap allocation, although this change was easy to miss since it did not get highlighted in the What’s new in Cassandra post for 2.0.

The change is a small one: the cap of the one-half-memory variable was changed from 4 GB to 1 GB, such that the plateau occurs at a heap size of 1 GB and extends from system memory values of 2 GB to 4 GB. While the plateau still exists, it is no longer an issue for systems having memory greater than 4 GB. Production systems will doubtless be running with more memory than this, so upgrading alone will solve the scaling predictability issue in nearly all cases.

However, systems previously operating with system memory at or below 16 GB will have a smaller heap after upgrading from 1.2 to 2.0+. This is because the heap now scales as one-quarter of the system memory for a broader range of memory values than it did on Cassandra 1.2. Although the picture hasn’t changed for high-memory systems, this shift to smaller heaps can cause further problems on systems with relatively small node memory, such as the Knerd cluster.

Moving memtables off heap in Cassandra 2.1

The change in the cap in Cassandra 2.0 may be related to the effort to move more Cassandra components off of the JVM heap. Version 2.0 offered off-heap partition summaries, and version 2.1 introduced the ability to move memtables off heap. These features make a smaller heap size desirable in order to make room for the memtable components in off-heap memory.

With memtables off heap, your heap may behave better, at the tradeoff of possible out-of-memory events at the system level. Since each Java process allocates a separate heap, other Java processes running on the same machine (e.g., Datastax OpsCenter) will result in less memory available for Cassandra’s off-heap components. In the extreme case, the operating system may kill the Cassandra process or crash altogether, producing identical behavior to the heap-crash case from the perspective of the application (i.e., Cassandra is not available on that node).

Knewton’s approach

At first, we applied a Band-Aid fix in Cassandra 1.2 to always allocate a heap size equal to half the system memory. This worked reasonably well, until migrating to 2.1. The relevant change to configuration templates had only been made for Cassandra 1.2, and other versions fell back to the default allocation code (which is how we discovered the properties described in this post).

That’s when we decided it was time to do some testing. We had three options:

  1. Move only the memtable buffers off heap (offheap_buffers);
  2. Move both the memtable buffers and the objects off heap (offheap_objects);
  3. Leave the memtables on heap and raise the heap size.

We regarded raising the heap size as our fallback, since we knew how the system would behave if we did that. Due to considerations related to scaling (e.g. data size growth), we weren’t very confident that any testing would yield a clear choice of how large to make the heap. In the end, then, this amounted to simply raising the maximum heap size to half the system memory. While doing so would be nice and simple, we hesitated to make our Cassandra installation deviate this much from the base installation.

So we experimented with off-heap memtables. As noted earlier, moving components off of the heap appears to be related to why the maximum heap size calculation was changed in the first place. Moreover, moving both the memtable buffers and the objects off heap is slated to become the default behavior in a future release of Cassandra, which led us to believe that this option’s performance is good enough that it is recommended in most cases.

Both offheap_buffers and offheap_objects showed good results in tests, reducing the stress on the heap and system performance overall relative to heap_buffers (the default configuration). As expected, the GC time was lowest for offheap_objects, as was the end-to-end processing time. To our surprise, however, offheap_buffers was the star performer. This setting yielded the best numbers while maintaining the most consistent read and write rates. In some of these measures, such as percentage of heap used and GC pause time, offheap_buffers narrowly edged out offheap_objects. In others, notably write latency, bloom filter false-positive ratio, and average response time, offheap_buffers was the clear winner. This performance was enough to convince us to move forward with offheap_buffers.

After deploying this change, we mostly saw positive results. Unfortunately, we did still sometimes see transient high-heap events, mostly on clusters with large data sets. On the other hand, these events were not accompanied by crashes. This caused us to reevaluate our longstanding alert threshold for heap usage, whereupon we concluded that our threshold needed some tuning. This resolved the remainder of the issue.

In summary, investigation of some puzzling behavior revealed unexpected heap allocation details in Cassandra’s internal configuration. Studying the configuration code allowed us to evaluate our options to address the behavior we originally observed. Our goal was to yield more predictable heap allocation results when scaling a cluster without impeding Cassandra’s performance. We achieved this by moving the memtable buffers off heap, which is possible in Cassandra 2.1+.

Digging Deep Into Cassandra Thrift Buffer Behavior

Everyone who works in tech has had to debug a problem.  Hopefully it is as simple as looking into a log file, but many times it is not.  Sometimes the problem goes away and sometimes it only looks like it goes away.  Other times it might not look like a problem at all.  A lot of factors will go into deciding if you need to investigate, and how deep you need to go.  These investigations can take a lot of resources for an organization, and there is always the chance of coming up empty-handed, which should never be seen as a failure.

This post will summarize an investigation into some Cassandra memory behavior that the database team at Knewton conducted recently.  It is a good illustration of the kind of multi-pronged approach needed to unravel strange behaviors low in the stack.  While we will be specifically talking about Cassandra running on AWS, the takeaways from this article are applicable to systems running on other platforms as well.

Uncovering a Problem

Background

Knewton had a Cassandra cluster that was very overprovisioned. The instances were scaled up (each instance is made more powerful) rather than scaled out (more instances are added to the cluster).  Cassandra is a horizontally scalable datastore and it benefits from having more machines rather than better ones.

So we added more machines and reduced the power of each one. We now had 4GB of available memory for the Java heap instead of 8GB.  This configuration worked well in all of our tests and in other clusters but in this specific overprovisioned cluster we found we had scaled each node down too much, so we moved to machines that could accommodate 8GB of heap, m4.xlarge instances.

Anomalous Behavior on New Instances

After moving all nodes over to m4.xl instances we saw our heap situation stabilize. However, we began to notice anomalous CPU and load averages across the cluster.  Some nodes would run higher than other nodes.  The metrics showed that, out of the four cores on an m4.xl instance, one was completely saturated.

If you saw this load on its own you would not think that it is a problem. Total usage of CPU on the box is at 25% and there are no catastrophically long GC pauses.  However, the cluster was not uniform, which called for further investigation.

In these CPU graphs, you can see nodes running on low CPU that encountered something that would rapidly promote them into the high CPU band, and they would almost never drop back down.

cassandra-thrift-buffer-high-cpu-band

We found that this graph was correlated with the graph of average garbage collection time.

cassandra-thrift-buffer-garbage-collection-high-cpu

When nodes promote themselves into the high CPU band, their GC times spike.

What is Cassandra holding in its heap that is causing GC issues?  With a lot of overhead in memory and CPU, crashes are not a risk, but performance is far from where we want to be.

Before tackling the garbage collection question, we have two other questions that we can answer already:

Why is this behavior showing up now?

We had an 8GB heap before and should have seen the same behavior.  The reason we only saw this CPU and GC behavior once we were on m4.xlarge instances is twofold:

  1. Something unique in this cluster caused a 4GB heap to be fatal but an 8GB heap to be adequate.  Expanding the heap did not get rid of the problem.
  2. The original cluster that had an 8GB heap was around for years and all nodes were promoted to the high CPU band.  The baseline operation of the cluster looked like the high CPU band.  It is only because of previous issues with provisioning this cluster that we were  watching closely when we moved to these m4.xlarge instances.  This highlights the importance of understanding your baseline resource utilization and not assuming that it means you are in a healthy state.

Why is this behavior a problem?

Even though the high usage band is not a problematic load, the behavior of the cluster was unexpected, and this promotion behavior is problematic.  The biggest reason it is problematic is that it meant we had a nondeterministic factor in our latencies.  We can expect different performance from a node in the high CPU band than from one in the low CPU band.  However, we cannot predict when a node will promote itself, as we don’t know the cause of this behavior.  So we have an unknown factor in our system, which is dangerous for repeatability and reliability.

Investigation

Investigating further is resource intensive, and often your senior engineering staff has to undertake the investigation, as it requires some degree of independence and experience.  So make sure you decide that the problem is actually worth the time of your organization before sinking days or weeks of engineering time into it.

Logs

The first thing to do is to look through the relevant logs.  Looking through the Cassandra logs we found, unsurprisingly, a lot of garbage collections.  We’d had GC logging on and found several “large block” messages in the CMS garbage collection logs at about the time these promotions happened.  To get more visibility into what is actually in the heap, we turned on GC class histogram logs, but these said that almost all the memory was being held in byte arrays.

Not helpful.  Time to go even deeper and see what is in the heap.

Heap Dump

So we took a heap dump from a problematic node and from a node that was exhibiting good GC behavior as a control.  A heap dump file is as large as the used Java heap. Dumping the heap is a “stop the world” operation for the process you are dumping, so when doing this in production be sure that the service can be unresponsive for a minute or more.  The file is binary and examining it is labor intensive, so it is best to move the heap dump to a computer that’s not being used in production to investigate.

We used the Eclipse Memory Analyzer Tool (MAT) to investigate the heap dumps and found that almost all of the memory was taken up by the write buffers of the TFramedTransport objects.  There were several hundred of these objects and the write buffer size ranged from 1kB to 30MB, with many in the range of 20-30MB.  We saw similar objects in the heap dump of the control node, but not nearly as many.  The write buffer contains what is being written to the Thrift transport socket and does not correspond to Cassandra reads and writes.  In this case Cassandra is writing the output of an incoming read to this buffer to send to the client that has requested the data.

It became pretty clear that this was a Thrift protocol issue so we searched for related issues.

cassandra-thrift-buffer-related-issues

Literature Search

Any problem you find has been reported in some way or another by someone else, especially if you are using anything but the latest versions of open-source software.  It is very useful to search the web for related problems at every step of the investigation, but as you get more specific you might uncover things that you would not have encountered previously.  In this case the investigation led us to the Thrift protocol, which is not something we would have searched for earlier.

The Thrift library that our version of Cassandra used had some memory issues referenced in CASSANDRA-4896. These are referenced again in more detail and resolved in CASSANDRA-8685.  So that you don’t have to read through the tickets: basically, the write buffer — the thing taking up all of our space and the likely cause of our issues — will grow to accommodate a larger payload but never shrinks back down. So once an operation with a large payload is requested of Cassandra, such as a query returning a 20MB record, these buffers increase in size and never drop back down.

This behavior lines up with what we are seeing in the cluster, where a node could be operating smoothly until a large set of queries come in and expand the write buffers on the Thrift frame objects.  Once the frame size increases, it stays big for the duration of the connection with the client. (We’ll come back to this point shortly).
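
A toy illustration of a grow-only buffer (written to make the failure mode concrete, not to reproduce Thrift’s actual implementation) looks like this:

import java.util.Arrays;

// Simplified sketch of a write buffer that grows to fit the largest payload
// ever written but never shrinks back down, in the spirit of the behavior
// described in the tickets above (not the actual Thrift code).
public class GrowOnlyWriteBuffer {

    private byte[] buffer = new byte[1024]; // starts small, e.g. 1 kB
    private int position = 0;

    public void write(byte[] payload) {
        int required = position + payload.length;
        if (required > buffer.length) {
            // Grow to fit the payload... and stay that large for the
            // lifetime of the connection that owns this buffer.
            buffer = Arrays.copyOf(buffer, Math.max(required, buffer.length * 2));
        }
        System.arraycopy(payload, 0, buffer, position, payload.length);
        position += payload.length;
    }

    public void flush() {
        // Send buffer[0..position) to the socket and reset the position,
        // but keep the (possibly 20-30 MB) backing array allocated.
        position = 0;
    }
}

One such buffer per open client connection, at 20-30MB apiece, is enough to account for the heap usage seen in the dumps.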

This Thrift behavior is noted in THRIFT-1457 and a fix is out for it. The Thrift library fix is also rolled into Cassandra 2.1.3 and later.  This is also only in Thrift so any clients using CQL won’t see this.

Verify the Hypothesis

The link between this memory behavior and client connections was a working hypothesis. We got a very strong signal supporting the theory one morning when we deployed a new version of services that communicated with our cluster.  This deployment meant closing all of the Cassandra connections on the client side during a low-traffic period and then reopening them.  Cold caches cause a lot of queries to occur during a deploy, but these queries do not return large blocks of data, so the TFramedTransport write buffers do not expand. Memory stays low until we start getting production traffic, which then expands the buffers as described above.

cassandra-thrift-buffer-cpu-usage

The CPU and GC graphs from the deploy at 10:45 show that we have almost no traffic beforehand.  We’re still holding onto these large memory objects and they aren’t cleaned up, because the TFramedTransport objects live as long as the corresponding client connection.

cassandra-thrift-buffer-garbage-collection-time

Once the client connections close, the objects are garbage collected and the new connections start with small TFramedTransport write buffers. So we see low heap and GC times at about 10:45.

cassandra-thrift-buffer-heap-size-by-node

There is a dramatic heap usage drop when the deploy happens at 10:45.

Conclusions

Technical Wrapup

The behavior of these TFramedTransport buffers makes the Thrift protocol in pre-2.1 Cassandra dangerous.  Each connection holds a TFramedTransport object in the Cassandra heap that can grow depending on the query executed but cannot shrink.  Memory then begins to fill with these TFramedTransport objects, as they cannot be cleaned out as long as the connection to the client is open.  If the client connection is severed by a client restart or a deploy, these objects get garbage collected.  New connections begin with a small TFramedTransport object size, which behaves well with GC / Heap / CPU on the node until a large query comes in and expands the write buffer on the TFramedTransport object again, promoting the node to a high-CPU-usage node and damaging its performance and response time.  Eventually all nodes in the cluster reach this high-CPU-usage band, and the extra resources taken up by these large TFramedTransport objects become a burden on the performance of our database.

Next Steps

Because these buffers never decrease in size, the memory usage of the Cassandra instance is bounded only by the number of connections to the cluster, which is determined by the number of active clients.  Our previous 4GB heap was unable to accommodate the number of clients in Knewton’s infrastructure, so it would perpetually crash; 8GB, however, is sufficient.

Once we have identified the source of the problem, we have to decide what, if anything, to do about it. Some options would be:

  • Upgrading to Cassandra 2.1.3 or later.
  • Migrating to CQL from Thrift.

We know that the behavior is not fatal to our processes, so depending on the opportunity cost of each of these processes, we can decide if we will live with this behavior or if we are going to fix it.

Cost Analysis

It is difficult to quantify how much an investigation like this costs in terms of time, opportunity cost, and money to an organization.  Before starting an investigation, it is useful to timebox the project and restrict yourself to a certain number of engineering hours to investigate.  In hindsight one might quantify the cost of this investigation as:

Monetary Cost = Engineering Time – Hardware Optimization

In this case we will need less hardware once this memory issue is resolved, perhaps three fewer m4.xlarge instances.  The yearlong cost of a reserved m4.xl instance is currently $1,200. We spent about three engineer days investigating.  Estimating engineering costs at $200 an hour (including overhead and opportunity cost to the organization) each engineering day is $1,600.  So over the year, we spent about $1,200 more on the investigation than we recovered from optimizations.

Final Remarks

Beyond monetary considerations, we have a better understanding of our system, which benefits us in several ways.  We are able to adjust our alerts to better monitor the system.  We can justify our instance sizes.  And finally we are able to support larger engineering objectives, advising the organization on the best course of action for future projects.  For example, we decided to move aggressively to upgrade this cluster to 2.1.

Deep investigations like this do not come with guarantees of strong takeaways or definitive answers.  They are expensive in terms of time and often require senior engineers to dig into deep system internals.  Yet at Knewton we have found that these investigations are valuable for the database team and the wider tech organization.

Introducing the Knewton Client Library

Why Build a Client Library?

As part of Knewton’s mission to personalize learning for everyone, Knewton strives to provide an easy-to-use API that our partners can leverage in order to experience the power of adaptive learning. Knewton’s API follows industry norms and standards for authentication and authorization protocols, error handling, rate limiting, and API structure.

To authenticate incoming API requests from client applications, Knewton uses two-legged OAuth 2.0. Currently, each partner has to implement OAuth 2.0 on the client side from scratch, which may increase the potential of authentication failures when users try to log in.

Partners also need to be well-versed in Knewton’s API standards. Using the right API endpoint along with valid credentials is critical to ensuring that the client application scales effectively upon launching. Beyond the initial integration, partners are generally cautious about upgrading their applications when Knewton releases updates or new endpoints.

To help simplify and standardize development of client applications, Knewton is starting to pilot the Knewton Client Library (KCL). The KCL boilerplate code will allow partners to build client applications faster and more efficiently. Implementing KCL in client applications will allow partners to devote more resources to the actual client application and enriching the overall experience of the students and teachers who use it. Using KCL will also offer partners a smoother upgrade path, since Knewton will absorb any backwards compatibility complexities into the library.

Components of the Knewton Client Library

The Knewton Client Library uses the Swagger framework. Utilizing the contract-first method of SOA development, Swagger provides a set of tools for REST APIs. It has a standardized definition format which can be used to generate code and documentation.

A centralized API definition ties together 1) server side implementation, 2) client code generation, and 3) API documentation, ensuring that they stay in sync and reducing time spent writing boilerplate code.

As part of the integration process, our partners must create and register student accounts through the Knewton API. The following Swagger definition is a simplified example of a dummy endpoint to create a student account, POST /student.

# Path
/student:
  post:
    summary: Create a student
    operationId: createStudent  
    parameters:
      - name: StudentCreatePayload
        in: body
        required: true
        schema:
          properties:
            name:
              description: Student's first and last name
              type: string
            courses:
              description: Array of Course IDs
              type: array
              items:
                type: integer  
            date_created:
              description: Date student was added
              type: string
              format: date
    responses:
      200:
        description: On success, returns no content

This example describes a dummy POST REST endpoint which takes in a JSON object with a student’s name, courses, and creation date. A request to such an endpoint might look like:

curl -H "Content-Type: application/json" -X POST -d '{"name":"Knewton Student","courses":[19, 123],"date_created":"September 1, 2016",}' http://knewton-dummy-api.com/student

The Knewton Java Client Library

The swagger-codegen library can generate API clients in many popular programming languages from a definition file. Let’s take a look at how we use this to generate the Knewton Java Client Library.

We use the swagger-codegen library to generate a Java Retrofit API interface from our API definition. Retrofit uses annotations on interface methods and parameters to describe the API.  These are then parsed by an HTTP client to execute requests.

Here is the Retrofit interface, which was automatically generated from the student creation endpoint definition. The interface matches the Swagger definition above: it defines a POST request to the “/student” endpoint, which takes a StudentCreatePayload (containing name, courses, and date created) as a parameter that is automatically serialized to JSON and added to the request body.

public interface Api {

    @POST("/student")
    Void createStudent(
        @Body StudentCreatePayload studentCreatePayload
    );
}

The auto-generated code provides a baseline client implementation. To make KCL as simple to use as possible, KCL wraps the auto-generated client with higher-level functionality such as authentication and rate limiting. The wrapper also refactors methods, parameters, and return objects to provide a more natural interface for Java developers.

For example, while Java has the java.util.Date object, the autogenerated code uses a String. The wrapper allows users to pass java.util.Date objects and will restructure them to the strings that the Swagger-generated API class expects.

Continuing with the student creation example above, we can further simplify the partner-facing interface. The API endpoint requires that the caller provides the date that the student record is created. There is no need for KCL to require the current date, as this can be determined automatically. Therefore the only parameters in our wrapped createStudent method are the student name and course list.

public void createStudent(String name, List<Integer> courses) {
    Date dateCreated = new Date();
    StudentCreatePayload studentCreatePayload = new StudentCreatePayload();
    studentCreatePayload.setName(name);
    studentCreatePayload.setCourses(courses);
    studentCreatePayload.setDateCreated(dateCreated.toString());
    generatedApi.createStudent(studentCreatePayload);
}

Instead of writing all of the code in the create student method, a partner application can now simply call wrapper.createStudent(name, courses);

OAuth, Retrying, and Logging

KCL uses OkHttp as its underlying client library.  OkHttp integrates well with Retrofit and provides an easy way to extend its functionality via OkHttp’s interceptor interface. Each client can have multiple interceptors. Each interceptor can modify a request or response at any point. For example, an interceptor could monitor, amend, and retry requests based on information read from responses.

As the diagram below illustrates, OkHttp nests interceptors added to a client, so the first interceptor to process a request will be the last interceptor to process a response.

KCL uses interceptors for logging, OAuth and retrying on rate limiting.
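
As an illustration of how interceptors are attached to a client, here is a minimal sketch using the okhttp3 builder API; the inline lambdas stand in for the interceptor classes described below, and the wiring is ours rather than KCL’s actual code.

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class ClientAssemblyExample {

    public static OkHttpClient buildClient() {
        return new OkHttpClient.Builder()
                // Outermost interceptor: first to see the request, last to see the response
                .addInterceptor(chain -> {
                    Request request = chain.request();
                    System.out.println("Request URL: " + request.url());
                    Response response = chain.proceed(request);
                    System.out.println("Response Code: " + response.code());
                    return response;
                })
                // Innermost interceptor: attach an Authorization header before the call goes out
                .addInterceptor(chain -> chain.proceed(
                        chain.request().newBuilder()
                                .header("Authorization", "Bearer <token>")
                                .build()))
                .build();
    }
}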

Logging Interceptor

The logging interceptor is the simplest interceptor because it does not modify or react to a request or response. This interceptor logs information from the request made to Knewton and the response received from Knewton.

The following is a simplified example of the KCL logging interceptor. This sample interceptor logs the request headers and request body, as well as the response code.

@Override
public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();

    // log request information
    logger.log("Request Headers: " + request.headers());
    logger.log("Request Body: " + request.body());

    // Continue with the interceptor chain
    Response response = chain.proceed(request);

    // log response information
    logger.log("Response Code: " + response.code());

    return response;

}

OAuth is an industry-standard protocol used for API authentication.

KCL includes an OAuth interceptor to handle authentication. If a user is already authenticated, there will be an authorization header on the request and the interceptor will no-op. If there is no authorization header, two cases require (re-)authentication: either the user does not have an access token, in which case KCL requests one, or the user’s access token is not working. In the latter case, either the access token has expired, in which case KCL uses the refresh token to request a new access token, or the refresh token has also expired, in which case KCL must request a new authorization altogether.

Here is a sample OAuth interceptor:

knewton-client-library-oauth-flowchart

@Override
public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();

    // Already authorized (no-op)
    if (request.header("Authorization") != null) {
        return chain.proceed(request);
    }

    // Get a new auth and add it to the request
    // (OkHttp requests are immutable, so the request is rebuilt to add the header)
    Auth auth = authCache.getAuth();
    request = request.newBuilder().header("Authorization", auth.toString()).build();

    Response response = chain.proceed(request);

    // Access token expired
    if (response.code() == 401) {
        auth = authCache.getNewAccessToken();
        request = request.newBuilder().header("Authorization", auth.toString()).build();
        response = chain.proceed(request);
    }

    // Refresh token expired
    if (response.code() == 401) {
        auth = authCache.getNewRefreshToken();
        request = request.newBuilder().header("Authorization", auth.toString()).build();
        response = chain.proceed(request);
    }

    return response;

}

Rate Limiting and Entity Limiting

Knewton enforces a set of rate limits to ensure that no single user can overwhelm the API with traffic and shut out other partners. After a specified number of requests per minute for a given endpoint, the Knewton API will respond with HTTP 429 errors, indicating that the request has been rate-limited and rejected. It can be retried in the next minute.

Some requests to Knewton’s API are not so time-sensitive that a delay of up to a minute would affect users. Therefore, KCL has incorporated automatic configurable retrying after a time increment specified by the partner.

This is an example of one retry.

private static final int HTTP_TOO_MANY_REQUESTS = 429;

@Override
public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();
    Response response = chain.proceed(request);

    if (response.code() == HTTP_TOO_MANY_REQUESTS) {
        try {
            // Wait for the configured interval before retrying once
            Thread.sleep(RETRY_WAIT_TIME);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return response;
        }
        response = chain.proceed(request);
    }

    return response;

}

If KCL continues to receive HTTP 429s, it will repeat this process until it times out. Since partners can configure the timeline for this process, they do not have to worry about handling these errors in their client applications.
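
A sketch of what that bounded, configurable retrying might look like, assuming partner-configured fields retryWaitMillis and retryTimeoutMillis (the names and deadline handling are illustrative, not KCL’s actual implementation):

@Override
public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();
    Response response = chain.proceed(request);

    // Keep retrying on 429 until the partner-configured timeout is reached
    long deadline = System.currentTimeMillis() + retryTimeoutMillis;
    while (response.code() == HTTP_TOO_MANY_REQUESTS
            && System.currentTimeMillis() + retryWaitMillis < deadline) {
        try {
            Thread.sleep(retryWaitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
        response = chain.proceed(request);
    }

    return response;
}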

Conclusion

With the Knewton Client Library, partners can more easily integrate with the Knewton adaptive learning platform, and consequently focus their efforts on delivering a great user experience to students and teachers accessing their applications. KCL also allows partners to take advantage of new features as Knewton releases them.

Thank you to Aditya Subramaniam and Paul Sastrasinh for their contributions to this blog post.

API Infrastructure at Knewton: What’s in an Edge Service?

The Knewton API gives students and teachers access to personalized learning recommendations and analytics in real time. In this post, we will pull back the covers of our API to explain how we handle user requests. You will first learn how to build an edge service with Netflix Zuul, the framework we chose for its simplicity and flexibility. Then, we’ll dive into the Knewton edge service to show you how it improves API simplicity, flexibility, and performance.

What’s in an Edge Service

An edge service is a component which is exposed to the public internet. It acts as a gateway to all other services, which we will refer to as platform services. For example, consider an Nginx reverse proxy in front of some web resource servers. Here, Nginx acts as an edge service by routing public HTTP requests to the appropriate platform service.

An example edge service: Nginx as a reverse proxy for two resource servers. Requests for images are routed to one server, while requests for text are routed to another.

Beyond Routing: Edge Service Architecture

The reverse proxy pattern makes sense if every one of your resource servers is hardened to receive raw internet traffic. But in the typical service-oriented architecture, implementing a simple reverse proxy would duplicate a substantial amount of security and reliability code across platform services. For example, we do not want to reimplement user authentication in each platform service. Instead, we factor these responsibilities directly into the edge service: it validates and authenticates incoming requests, enforces rate limits to protect the platform from undue load, and routes requests to the appropriate upstream services. Factoring these high-level features into the edge service is a huge win for platform simplicity.

Building this service was a substantial engineering task. We wanted to leverage our experience developing and deploying JVM-based services by creating our edge service following the same pattern. This made Nginx and Apache’s Lua and PHP-based scripting capabilities unattractive. We would have had to rewrite our standard libraries and debug code in a new paradigm. Instead, we built our edge service on top of Netflix Zuul.

The edge service is Knewton’s interface to the public internet. It is registered directly with an AWS Elastic Load Balancer (ELB), and is responsible for sanitizing and routing requests for the Knewton Platform. To maintain high availability, our edge service runs as a cluster. On startup, edge service nodes register themselves with the load balancer, which then distributes requests across the cluster.

Netflix Zuul

Zuul is a framework created by Netflix to handle their own API and website traffic. The framework is structured around filters, a concept borrowed from the Java Servlet Filter pattern. A Zuul edge service is made up of a series of such filters, each performing some action on an HTTP request and / or response before passing control along to the next filter. For example, a filter might add authentication information to request headers, write an error response to an invalid request, or forward a request to an upstream service for further processing. In this section, we will walk through an example Zuul filter to show you the simplicity of the pattern.

Zuul Filters

We will consider three categories of Zuul filters: pre-filters, route-filters, and post-filters. (Zuul also supports error-filters and static-filters.) Pre-filters run before the edge service routes a request; route-filters forward requests to upstream services; and post-filters run after the proxied service returns its response.

The edge service consists of a series of Zuul filters which work together to write a response for a given request. The route-filters make requests to platform services to retrieve data and update state.

Filters are defined by three pieces of logic:

  • Filter execution order
  • Conditional filter execution
  • Execution logic

Let’s dive into our example. We’ve written this filter in Java, but Zuul is compatible with all JVM languages.

Filter Execution Order

Zuul filters run in the same order for every request. This enables successive filters to make assumptions about the validations run and to access accumulated state. For example, we might store the request’s user ID in one filter and use it to apply rate limits in the next. Zuul filter ordering is defined with two methods: filterType, which specifies whether a filter is a pre-filter, route-filter or post-filter, and filterOrder, which orders filters of the same type.

// Defining the execution order for a Zuul filter in Java

@Override
public String filterType() {
   // run this filter before routing
   return "pre";
}

@Override
public int filterOrder() {
   // run this filter first among pre-filters
   return 0;
}

Conditional Filter Execution

Filters may be run conditionally for each request. This gives the designer significant flexibility. For example, a rate limit filter may not run for an authenticated user. This conditional logic is factored into the shouldFilter method.

// Configuring a Zuul filter to run on every request

@Override
public boolean shouldFilter() {
   // Always run this filter
   return true;
}
// Configuring a Zuul filter to run on requests from unauthenticated users only
@Override
public boolean shouldFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   return !(boolean)context.getOrDefault("IS_AUTHENTICATED", false);
}

Execution Logic

The code to execute in a filter is defined in the runFilter method. Here, we make use of the static RequestContext object to store state for later filters, or to write the response.

// Defining Zuul filter logic to record and check rate limits.

@Override
public ZuulFilterResult runFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   boolean isOverRate = rateLimiter.recordRate(context.getRequest());
   if (isOverRate) {
       context.set("IS_RATE_LIMITED", true);
   }
   return null;
}

All together, this gives us an example filter:

// A simple Zuul filter that runs for unauthenticated users and records
// whether a rate limit has been exceeded.
public class RateLimitFilter extends ZuulFilter {

   @Override
   public String filterType() {
       // run this filter before routing
       return "pre";
   }

   @Override
   public int filterOrder() {
       return 0;
   }

   @Override
   public ZuulFilterResult runFilter() {
       // records the request with the rate limiter and checks if the
       // current rate is above the configured rate limit
       RequestContext context = RequestContext.getCurrentContext();
       boolean isOverRate = rateLimiter.recordRate(context.getRequest());
       if (isOverRate) {
           context.set("IS_RATE_LIMITED", true);
       }
       return null;
   }

   @Override
   public boolean shouldfilter() {
       // should only run if the user is not authenticated
       RequestContext context = RequestContext.getCurrentContext();
       return !(boolean)context.getOrDefault("IS_AUTHENTICATED", false);
   }
}

In this way, we build up a modular set of request-processing functionality. This pattern makes it easy for multiple engineers to contribute new features. At Knewton, engineers outside of the API team have committed code for edge service features.

For more information about the design and lifecycle of a Zuul service, see this Netflix blog post.

Zuul and the Knewton Edge Service

Zuul is an opinionated but barebones framework. While adding functionality is simple, you will have to implement the filters yourself. Now we will explain the most important filters we wrote for the Knewton edge service. You will learn how to reject bad requests, reverse-proxy good ones, and reduce interservice traffic.

Pre-filters

Pre-filters in our edge service validate, rate-limit, and authenticate incoming requests. This enables us to reject bad requests as early in the pipeline as possible.

Rate Limiting

Our edge service is responsible for protecting the Platform from bursts of requests. When the rate of requests from a given IP or user exceeds a specified threshold, the edge service responds with 429: Too Many Requests. This threshold helps to prevent harm to the platform from Denial of Service attacks and other excessive request load.

Rate limits are enforced in a pre-filter so that these excess requests will not place any load on platform services. This pre-filter tracks the number of requests made during a one-minute window. If this number exceeds the rate limit, the filter skips further request processing and immediately writes a “429” response.

Logic in the rate-limiting Zuul pre-filter. For a simple rate limiting implementation with Redis, see the Redis documentation.
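
The short-circuit itself is only a few lines of filter code. The sketch below is illustrative: rateLimiter stands in for our internal rate-limiting component and is not part of Zuul, while setSendZuulResponse and setResponseStatusCode are standard RequestContext methods.

// Hedged sketch: rejecting an over-limit request at the edge without routing it.
// rateLimiter is a hypothetical stand-in for the rate-limiting component.
@Override
public ZuulFilterResult runFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   if (rateLimiter.isOverLimit(context.getRequest())) {
       context.setSendZuulResponse(false);   // skip the route-filters entirely
       context.setResponseStatusCode(429);   // Too Many Requests
       context.setResponseBody("Too Many Requests");
   }
   return null;
}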

Rates are stored in memory on each edge service node to provide the lowest latency possible to each request. The rates recorded by each node are reconciled asynchronously using a shared Redis cache. This means that, no matter which node handles a given request, all nodes will eventually acknowledge it. In practice, this reconciliation happens quickly; convergence occurs within a constant factor of AWS’s region-internal network latency.

A request to an endpoint is rejected when more than the configured limit of requests have been made within a minute. Edge service nodes coordinate request counts through a shared Redis cache.
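
For illustration, the sketch below shows a much simpler, fully synchronous variant of this idea: a fixed-window counter kept directly in Redis using the Jedis client. The key layout and limit handling are assumptions for the example, not our production reconciliation scheme.

// Illustrative fixed-window rate counter backed by Redis (Jedis client).
// Key naming and expiry are assumptions for this sketch.
import redis.clients.jedis.Jedis;

public class RedisRateCounter {
   private final Jedis jedis = new Jedis("localhost");

   // Returns true if the caller has exceeded the per-minute limit.
   public boolean isOverLimit(String clientId, long limitPerMinute) {
       long currentMinute = System.currentTimeMillis() / 60_000;
       String key = "rate:" + clientId + ":" + currentMinute;
       long count = jedis.incr(key);   // atomic increment for this one-minute window
       jedis.expire(key, 120);         // old windows expire on their own
       return count > limitPerMinute;
   }
}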

Surge Queuing

Load on the Knewton API is not constant. Even within the broad patterns of school days and lunch breaks, request rates vary. Rather than passing these bursts on to platform services, our edge service smooths traffic, so that the platform sees smaller peaks. To accomplish this, our rate limiter follows the same Leaky Bucket pattern used in Nginx. Requests exceeding the rate limit get added to a fixed-length queue. The queued requests are processed at the rate limit, ensuring that the platform does not see a surge. If the queue is full when the rate limit is exceeded, these excess requests are rejected with a 429 error. This approach has the added benefit of simplifying Knewton API clients, because requests can be made in reasonable bursts without being rejected. Read more about how we implement request queueing in our next blog post.

A request surge queue sits in front of the rate limiter to smooth out bursts of requests.
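
The sketch below illustrates the leaky-bucket idea with a bounded in-memory queue drained at a fixed rate. It is a simplified illustration, not Knewton's implementation, and the queue capacity and drain rate are assumptions.

// Illustrative leaky-bucket surge queue: excess requests wait in a bounded queue
// and are released at a steady rate; when the queue is full, the caller should
// reject the request with a 429.
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SurgeQueue {
   private final LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(100);
   private final ScheduledExecutorService drain = Executors.newSingleThreadScheduledExecutor();

   public SurgeQueue(long permitsPerSecond) {
       long periodMillis = 1000 / permitsPerSecond;
       // Release one queued request per period (the "leak" in the leaky bucket).
       drain.scheduleAtFixedRate(() -> {
           Runnable next = queue.poll();
           if (next != null) {
               next.run();
           }
       }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
   }

   // Returns false when the queue is full and the request should be rejected.
   public boolean trySubmit(Runnable forwardRequest) {
       return queue.offer(forwardRequest);
   }
}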

Authentication

At Knewton, we use OAuth to authenticate users. Each API request must contain a valid OAuth token. The edge service rejects requests which do not contain these tokens. This obviates the need for end-user authentication in upstream platform services and reduces interservice traffic by rejecting unauthenticated requests immediately at the edge.
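
A rough sketch of that check as a pre-filter is shown below. tokenValidator stands in for our OAuth validation logic and is not part of Zuul; the context key matches the IS_AUTHENTICATED flag used in the earlier examples.

// Hedged sketch of an authentication pre-filter: reject requests that lack a
// valid OAuth token before they reach any platform service.
// tokenValidator is a hypothetical stand-in for the OAuth validation component.
@Override
public ZuulFilterResult runFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   String header = context.getRequest().getHeader("Authorization");
   if (header == null || !tokenValidator.isValid(header)) {
       context.setSendZuulResponse(false);   // do not route upstream
       context.setResponseStatusCode(401);   // Unauthorized
   } else {
       context.set("IS_AUTHENTICATED", true);   // later filters can rely on this
   }
   return null;
}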

Route-filters

The edge service is responsible for forwarding requests to the appropriate upstream microservice. Once a request has made it through the gauntlet of pre-filters, it is handed off to the route-filters. Since Knewton services use Eureka for service discovery, we chose Netflix’s Ribbon HTTP client to make these upstream requests. Ribbon allows our edge service to automatically discover upstream services, load-balance traffic between them, and retry failed requests across instances of a service, all with minimal configuration.

// Creates a Ribbon client with Eureka discovery, zone affinity, and
// availability-filtered round-robin load balancing
AbstractLoadBalancerAwareClient setupRibbonClient(String upstreamName) {
    // Discover instances of the upstream service through Eureka
    ServerList<DiscoveryEnabledServer> serverList =
            new DiscoveryEnabledNIWSServerList(upstreamName);
    // Skip instances that are circuit-tripped or heavily loaded
    IRule rule = new AvailabilityFilteringRule();
    // Prefer instances in the caller's availability zone
    ServerListFilter<DiscoveryEnabledServer> filter = new ZoneAffinityServerListFilter();
    ZoneAwareLoadBalancer<DiscoveryEnabledServer> loadBalancer =
            LoadBalancerBuilder.<DiscoveryEnabledServer>newBuilder()
                               .withDynamicServerList(serverList)
                               .withRule(rule)
                               .withServerListFilter(filter)
                               .buildDynamicServerListLoadBalancer();
    AbstractLoadBalancerAwareClient client =
            (AbstractLoadBalancerAwareClient) ClientFactory.getNamedClient(upstreamName);
    client.setLoadBalancer(loadBalancer);
    return client;
}
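
The client built above can then be used inside a route-filter to execute the proxied request. The sketch below assumes Ribbon's default RestClient implementation and an illustrative path; it is not our exact routing code.

// Hedged sketch: forwarding a request through the Ribbon client built above.
// Assumes the named client is Ribbon's RestClient; the path is illustrative.
HttpResponse forwardUpstream(RestClient client) throws Exception {
    HttpRequest upstreamRequest = HttpRequest.newBuilder()
            .verb(HttpRequest.Verb.GET)
            .uri(new URI("/recommendations"))
            .build();
    // Ribbon picks a healthy instance, applies retries, and executes the call.
    return client.executeWithLoadBalancer(upstreamRequest);
}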

The Knewton edge service is not limited to endpoint-based routing. Our next blog post will cover specialized routing we have implemented to support shadow services and server-side API mocking.

Post-filters

After receiving the upstream responses, the post-filters record metrics and logs, and write a response back to the user.

Response Writing

The post-filter stage is responsible for writing a response to the user. It is worth emphasizing that the edge service must always write back some response. If an unhandled exception or upstream timeout were to prevent this, the user would be left waiting indefinitely. This means catching exceptions and handling errors in all situations, so that the edge service always returns a timely answer, even if that answer is an error code (i.e., a 5xx response).
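
A trimmed-down sketch of a response-writing post-filter appears below. It assumes the upstream response body has already been placed in the RequestContext by a route-filter; production code would also stream large bodies and handle more failure modes.

// Hedged sketch of a response-writing post-filter. Error handling is trimmed;
// a real filter must guarantee that some response is always written.
@Override
public ZuulFilterResult runFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   HttpServletResponse response = context.getResponse();
   try {
       response.setStatus(context.getResponseStatusCode());
       String body = context.getResponseBody();
       if (body != null) {
           response.getWriter().write(body);
       }
   } catch (IOException e) {
       // Never leave the client hanging: fall back to a 5xx with no body.
       response.setStatus(500);
   }
   return null;
}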

Metrics and Logging

Recording the right metrics and logs helps to ensure quality of service of the Knewton platform. The edge service encapsulates the entire lifecycle of every API request: it records the full duration of each request and handles every failure. This uniquely positions the edge service to report on the end-user experience. Our post-filters publish metrics and logs to Graphite relays and Splunk indexers. The resulting dashboards, alerts, and ad-hoc queries give us all the information we need to investigate problems, optimize performance, and guide future development work. Watch for our upcoming blog post on API monitoring for more details.
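
As a sketch of the kind of measurement involved, the post-filter below computes end-to-end latency from a timestamp assumed to have been stored by an early pre-filter; the metrics object is a hypothetical Graphite/StatsD-style client, not a Zuul API.

// Hedged sketch: recording request latency and response status in a post-filter.
// REQUEST_START_NANOS is assumed to be set by an early pre-filter, and metrics
// is a hypothetical metrics client.
@Override
public ZuulFilterResult runFilter() {
   RequestContext context = RequestContext.getCurrentContext();
   long startNanos = (long) context.getOrDefault("REQUEST_START_NANOS", System.nanoTime());
   long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
   metrics.recordTimer("edge.request.duration", elapsedMillis);
   metrics.incrementCounter("edge.response.status." + context.getResponseStatusCode());
   return null;
}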

Conclusion

The Knewton edge service validates and rate limits API requests, authenticates users, routes requests to upstream microservices, and records end-to-end metrics. Building this logic into the edge service simplifies our platform and provides performance and reliability guarantees to our users. We built our edge service on top of Zuul because its filter-based design is simple and extensible, its Java construction was compatible with our architecture, and because it has been proven in production at Netflix.

In this blog post, you learned how to build a Zuul edge service and how this design can improve your API. A follow-up post will describe custom logic we built to support surge queuing, shadow services, and API mocking in the Zuul framework.

Thanks to all the contributors on the Knewton API Platform team: Paul Sastrasinh, Rob Murcek, Daniel Singer, Bennett Wineholt, Robert Schultheis, Aditya Subramaniam, Stephanie Killian