Everyone who works in tech has had to debug a problem. Hopefully it is as simple as looking into a log file, but many times it is not. Sometimes the problem goes away and sometimes it only looks like it goes away. Other times it might not look like a problem at all. Many factors go into deciding whether you need to investigate and how deep you need to go. These investigations can consume a lot of an organization’s resources, and there is always the chance of coming up empty-handed, which should never be seen as a failure.
This post will summarize an investigation into some Cassandra memory behavior that the database team at Knewton conducted recently. It is a good illustration of the kind of multi-pronged approach needed to unravel strange behaviors low in the stack. While we will be specifically talking about Cassandra running on AWS, the takeaways from this article are applicable to systems running on different platforms as well.
Uncovering a Problem
Knewton had a Cassandra cluster that was very overprovisioned. The instances were scaled up (each instance is made more powerful) rather than scaled out (more instances are added to the cluster). Cassandra is a horizontally scalable datastore and it benefits from having more machines rather than better ones.
So we added more machines and reduced the power of each one. We now had 4GB of available memory for the Java heap instead of 8GB. This configuration worked well in all of our tests and in other clusters, but in this specific overprovisioned cluster we found we had scaled each node down too much. We therefore moved to machines that could accommodate 8GB of heap: m4.xlarge instances.
Anomalous Behavior on New Instances
After moving all nodes over to m4.xlarge instances we saw our heap situation stabilize. However, we began to notice anomalous CPU and load averages across the cluster: some nodes would run higher than others. The metrics showed that, out of the four cores on an m4.xlarge instance, one was completely saturated.
If you saw this load on its own you would not think that it is a problem. Total usage of CPU on the box is at 25% and there are no catastrophically long GC pauses. However, the cluster was not uniform, which called for further investigation.
In these CPU graphs, you can see nodes running on low CPU that encountered something that would rapidly promote them into the high CPU band, and they would almost never drop back down.
We found that this graph was correlated with the graph of average garbage collection time.
When nodes promote themselves into the high CPU band, their GC times spike.
What is Cassandra holding in its heap that is causing GC issues? With plenty of headroom in memory and CPU, crashes are not a risk, but performance is far from where we want it to be.
Before tackling the garbage collection question, we have two other questions that we can answer already:
Why is this behavior showing up now?
We had an 8GB heap before and should have seen the same behavior. The reason we only saw this CPU and GC behavior once we were on m4.xlarge instances is twofold:
- Something unique in this cluster caused a 4GB heap to be fatal but an 8GB heap to be adequate. Expanding the heap did not get rid of the problem.
- The original cluster that had an 8GB heap was around for years and all nodes were promoted to the high CPU band. The baseline operation of the cluster looked like the high CPU band. It is only because of previous issues with provisioning this cluster that we were watching closely when we moved to these m4.xlarge instances. This highlights the importance of understanding your baseline resource utilization and not assuming that it means you are in a healthy state.
Why is this behavior a problem?
Even though the high usage band is not a problematic load, the behavior of the cluster was unexpected, and this promotion behavior is problematic. The biggest reason is that it introduces a nondeterministic factor into our latencies. We can expect different performance from a node in the high CPU band than from one in the low CPU band. However, we cannot predict when a node will promote itself, because we don’t know the cause of this behavior. So we have an unknown factor in our system, which is dangerous for repeatability and reliability.
Investigating further is resource intensive, and often your senior engineering staff has to undertake the investigation, as it requires some degree of independence and experience. So make sure you decide that the problem is actually worth the time of your organization before sinking days or weeks of engineering time into it.
The first thing to do is to look through the relevant logs. Looking through the Cassandra logs we found, unsurprisingly, a lot of garbage collections. We’d had GC logging on and found several “large block” messages in the CMS garbage collection logs at about the time these promotions happened. To get more visibility into what is actually in the heap, we turned on GC class histogram logs, but these said that almost all the memory was being held in byte arrays.
Not helpful. Time to go even deeper and see what is in the heap.
So we took a heap dump from a problematic node and, as a control, from a node that was exhibiting good GC behavior. A heap dump file is roughly the size of the used Java heap. Dumping the heap is a “stop the world” operation for the process you are dumping, so when doing this in production be sure that the service can be unresponsive for a minute or more. The file is binary and examining it is labor intensive, so it is best to move the heap dump to a computer that’s not being used in production to investigate.
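The post does not say which tool was used to take the dumps. As a minimal sketch, assuming the JDK's jmap utility is available on the node, a dump could be scripted as below. The helper names, the process ID, and the output path are hypothetical and only for illustration.

```python
import subprocess
from typing import List


def build_heap_dump_command(pid: int, dump_path: str, live_only: bool = False) -> List[str]:
    """Build a jmap invocation that writes a binary heap dump for a JVM process."""
    options = ["format=b", f"file={dump_path}"]
    if live_only:
        # "live" restricts the dump to live (reachable) objects.
        options.insert(0, "live")
    return ["jmap", "-dump:" + ",".join(options), str(pid)]


def take_heap_dump(pid: int, dump_path: str) -> None:
    """Run jmap against the given process (hypothetical helper)."""
    # The JVM is paused while the dump is written, so expect the
    # service to be unresponsive for the duration of this call.
    subprocess.run(build_heap_dump_command(pid, dump_path), check=True)


if __name__ == "__main__":
    # Only print the command here; the PID and path are made-up examples.
    print(" ".join(build_heap_dump_command(1234, "/tmp/cassandra.hprof")))
```

Building the command separately from running it makes it easy to review exactly what will be executed before pausing a production process.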
We used the Eclipse Memory Analyzer Tool (MAT) to investigate the heap dumps and found that almost all of the memory was taken up by the write buffer of the TFramedTransport objects. There were several hundred of these objects and the write buffer size ranged from 1kB to 30MB, with many in the range of 20-30MB. We saw similar objects in the heap dump of the control node, but not nearly as many. The write buffer contains what is being written to the Thrift transport socket and does not correspond to Cassandra reads and writes. In this case Cassandra is writing the output of an incoming read to this buffer to send to the client that has requested the data.
It became pretty clear that this was a Thrift protocol issue so we searched for related issues.
Any problem you find has been reported in some way or another by someone else, especially if you are using anything but the latest versions of open-source software. It is very useful to search the web for related problems at every step of the investigation, because as you get more specific you might uncover things that you would not have encountered earlier. In this case the investigation led us to the Thrift protocol, which is not something we would have searched for at the start.
The Thrift library that our version of Cassandra used had some memory issues referenced in CASSANDRA-4896. These are referenced again in more detail and resolved in CASSANDRA-8685. To save you reading through the tickets: the write buffer (the thing taking up all of our space and the likely cause of our issues) grows to accommodate a larger payload but never shrinks back down. So once an operation with a large payload is requested of Cassandra, such as a query returning a 20MB record, the buffer for that connection increases in size and never drops back down.
This behavior lines up with what we saw in the cluster, where a node could be operating smoothly until a large set of queries came in and expanded the write buffers on the Thrift frame objects. Once the frame size increases, it stays big for the duration of the connection with the client. (We’ll come back to this point shortly.)
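The grow-but-never-shrink behavior described above can be illustrated with a toy model. This is only a sketch of the pattern, not the Thrift library's actual code; the class name, function name, and the 1kB starting capacity are assumptions made for the example.

```python
from typing import Iterable

MB = 1024 * 1024


class GrowOnlyBuffer:
    """Toy model of a per-connection write buffer that grows but never shrinks."""

    def __init__(self, initial_capacity: int = 1024) -> None:
        self.capacity = initial_capacity

    def write_frame(self, payload_size: int) -> None:
        # Grow to fit the largest payload seen so far; memory is never released.
        if payload_size > self.capacity:
            self.capacity = payload_size


def retained_capacity(payload_sizes: Iterable[int], initial_capacity: int = 1024) -> int:
    """Return the memory one connection still holds after serving the given payloads."""
    buf = GrowOnlyBuffer(initial_capacity)
    for size in payload_sizes:
        buf.write_frame(size)
    return buf.capacity


if __name__ == "__main__":
    # A single 20MB response among small ones pins 20MB for the life of the connection.
    print(retained_capacity([512, 20 * MB, 2048]) // MB, "MB retained")
```

In this model the retained size depends only on the largest payload the connection has ever served, which is why one large query is enough to change a node's memory profile until the connection closes.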
This Thrift behavior is noted in THRIFT-1457 and a fix is out for it. The Thrift library fix is also rolled into Cassandra 2.1.3 and later. The problem is specific to Thrift, so clients using CQL won’t see it.
Verify the Hypothesis
The link between this memory behavior and client connections was a working hypothesis. We got a very strong signal supporting the theory one morning when we deployed a new version of services that communicated with our cluster. This deployment meant closing all of the Cassandra connections on the client side during a low traffic period and then reopening them. Cold caches cause a lot of queries to occur during a deploy, but these queries do not return large blocks of data, so the TFramedTransport write buffers do not expand. Memory stays low until we start getting production traffic, which then expands the buffers as described above.
The CPU and GC graphs from the deploy at 10:45 show that we have almost no traffic beforehand. We’re still holding onto these large memory objects and they aren’t cleaned up because the TFramedTransport objects live as long as the corresponding client connection.
Once the client connections close, the objects are garbage collected and the new connections start with small TFramedTransport write buffers. So we see low heap and GC times at about 10:45.
There is a dramatic heap usage drop when the deploy happens at 10:45.
The behavior of these TFramedTransport buffers makes the Thrift protocol dangerous in Cassandra versions before 2.1.3. Each connection holds a TFramedTransport object in the Cassandra heap that can grow depending on the query executed but cannot shrink. Memory then fills with these TFramedTransport objects, which cannot be cleaned out as long as the connection to the client is open. If the client connection is severed by a client restart or a deploy, these objects get garbage collected. New connections begin with a small TFramedTransport write buffer, which behaves well with GC, heap, and CPU on the node until a large query comes in. That query expands the write buffer on the TFramedTransport object again, promoting the node to the high-CPU-usage band and damaging its performance and response time. Eventually all nodes in the cluster reach this high-CPU-usage band, and the extra resources taken up by these large TFramedTransport objects become a burden on the performance of our database.
Because these buffers never decrease in size, the memory usage of the Cassandra instance is bounded only by the number of connections to the cluster, which is determined by the number of active clients. Our previous 4GB heap was unable to accommodate the number of clients in Knewton’s infrastructure, so it would perpetually crash; 8GB, however, is sufficient.
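A rough way to reason about this bound is to multiply the number of open connections by the largest buffer each could retain. The sketch below uses illustrative numbers only: 250 connections and 25MB per buffer are assumptions picked from within the "several hundred objects" and "20-30MB" ranges mentioned earlier, not measured values, and a real heap also needs room for everything else Cassandra holds.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def worst_case_buffer_bytes(connections: int, max_buffer_bytes: int) -> int:
    """Upper bound on heap held by write buffers if every one has fully expanded."""
    return connections * max_buffer_bytes


def fits_in_heap(connections: int, max_buffer_bytes: int, heap_bytes: int) -> bool:
    """True if the worst-case buffer footprint alone fits in the given heap."""
    return worst_case_buffer_bytes(connections, max_buffer_bytes) <= heap_bytes


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    connections, buffer_size = 250, 25 * MB
    total = worst_case_buffer_bytes(connections, buffer_size)
    print(f"worst case: {total / GB:.2f} GB")
    print("fits in 4GB heap:", fits_in_heap(connections, buffer_size, 4 * GB))
    print("fits in 8GB heap:", fits_in_heap(connections, buffer_size, 8 * GB))
```

With these assumed figures the worst case lands between the two heap sizes, which is consistent with a 4GB heap failing while an 8GB heap survives.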
Once we have identified the source of the problem, we have to decide what, if anything, to do about it. Some options would be:
- Upgrading to Cassandra 2.1.3 or later
- Migrating to CQL from Thrift
We know that the behavior is not fatal to our processes, so depending on the opportunity cost of each of these options, we can decide whether to live with this behavior or fix it.
It is difficult to quantify how much an investigation like this costs in terms of time, opportunity cost, and money to an organization. Before starting an investigation, it is useful to timebox the project and restrict yourself to a certain number of engineering hours to investigate. In hindsight one might quantify the cost of this investigation as:
Monetary Cost = Engineering Time – Hardware Optimization
In this case we will need less hardware once this memory issue is resolved, perhaps three fewer m4.xlarge instances. The yearlong cost of a reserved m4.xlarge instance is currently $1,200, so that is about $3,600 a year in savings. We spent about three engineer days investigating. Estimating engineering costs at $200 an hour (including overhead and opportunity cost to the organization), each eight-hour engineering day is $1,600, or $4,800 in total. So over the year, we spent about $1,200 more on the investigation than we recovered from optimizations.
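The arithmetic behind that estimate can be written out as a short sketch. The figures come from the paragraph above; the eight-hour working day is an assumption implied by the $1,600 daily figure, and the function and constant names are made up for the example.

```python
HOURLY_RATE = 200            # dollars per engineering hour, from the estimate above
HOURS_PER_DAY = 8            # assumption implied by $1,600 per engineering day
ENGINEER_DAYS = 3            # time spent investigating
INSTANCES_SAVED = 3          # m4.xlarge instances no longer needed
YEARLY_INSTANCE_COST = 1200  # dollars per reserved instance per year


def engineering_cost(days: int, hours_per_day: int, hourly_rate: int) -> int:
    """Total cost of the engineering time spent on the investigation."""
    return days * hours_per_day * hourly_rate


def hardware_savings(instances: int, yearly_cost: int) -> int:
    """Yearly savings from running fewer instances."""
    return instances * yearly_cost


def net_monetary_cost() -> int:
    """Monetary Cost = Engineering Time - Hardware Optimization."""
    spent = engineering_cost(ENGINEER_DAYS, HOURS_PER_DAY, HOURLY_RATE)
    saved = hardware_savings(INSTANCES_SAVED, YEARLY_INSTANCE_COST)
    return spent - saved


if __name__ == "__main__":
    print("net cost over one year: $", net_monetary_cost())
```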
Beyond monetary considerations, we have a better understanding of our system, which benefits us in several ways. We are able to adjust our alerts to better monitor the system. We can justify our instance sizes. And finally we are able to support larger engineering objectives, advising the organization on the best course of action for future projects. For example, we decided to move aggressively to upgrade this cluster to 2.1.
Deep investigations like this do not come with guarantees of strong takeaways or definitive answers. They are expensive in terms of time and often require senior engineers to dig into deep system internals. Yet at Knewton we have found that these investigations are valuable for the database team and the wider tech organization.
What's this? You're reading N choose K, the Knewton tech blog. We're crafting the Knewton Adaptive Learning Platform that uses data from millions of students to continuously personalize the presentation of educational content according to learners' needs. Sound interesting? We're hiring.