Cassandra – Successes, Failures, Headaches, Lessons Learned


My long project of converting my monitoring application at work to use Apache Cassandra is finally in production, and for the most part, running smoothly. All data storage and retrieval is using Cassandra, although I’ve yet to take the training wheels off, so to speak. Everything is still logging into the old MySQL databases until I’m confident that everything is working properly. The Cassandra storage setup has been the primary storage method for a bit over a week now, and while everything is functioning properly on the surface, I’d still like to do some additional sanity checks to make sure the data going in is completely consistent in both storage methods. I’ve had enough weirdness during the process to warrant caution.

For the most part, the process of adapting my code base to use Cassandra was straightforward, but getting to know Cassandra itself has been more complicated. A fully redundant, replicated system leaves a lot more room for strangeness than your typical single-host software package, and the fact that it’s a Java app meant that I was stepping into largely unfamiliar territory, since I typically run far, far away from Java apps (except you, Minecraft… we love you!). What follows are descriptions of some of the bumps in my road to victory, so that others may benefit from the lessons learned from my ignorance.

Picking the right interface library is key

Since the bulk of my monitoring application is written in PHP, and I didn’t really feel like doing a complete rewrite, I needed to find an interface library that allowed PHP to talk to a Cassandra cluster. The first library I decided to try was the Cassandra PHP Client Library (CPCL). It was pretty easy to learn and get working, and at first glance, seemed to do exactly what I wanted with a very small amount of pain. When I really got deep into the code conversion, I started having some issues. Things were behaving strangely, and I got fairly (ok, seriously) frustrated. Then I found out it was my fault, and that I was doing things wrong. There’s always a “man, do I feel stupid” moment when things happen like that, but I quickly got past it and started moving on.

Then I came across some issues that really weren’t my fault (I think). Once the basics of converting the code were done, I started adding some new pieces to take advantage of the fact that I’d be able to store many times more data than I could previously. One of my previous posts on the topic mentioned my problems with drawing graphs when a large number of data points were involved. My solution to that was to incorporate rrdtool-style averaging (henceforth known as “crunching”) of the data, which takes the raw data and averages it over fixed time intervals, typically 5 minutes, 30 minutes, 2 hours, 1 day, etc. rrdtool only keeps the raw data for a short period of time, but my system keeps it for however long Cassandra is told to keep it (currently two years). Combining the averages over the various time intervals with the raw data itself, I can quickly graph data series over many different time ranges while keeping the PHP memory footprint and data access time at sane levels.

My solution for crunching the data was fairly simple to implement, but works in a fundamentally different way from rrdtool. rrdtool does its averaging calculations as data is submitted, and it does so very quickly because it enforces a fixed number of data points in its database, with fixed intervals between collections. The scheduling of data collections in my system is completely arbitrary and can happen on any regular or irregular interval. It uses agents that report the monitored data to the storage cluster via HTTP POSTs, so doing the averaging in-line would be very costly in terms of web server processing and latency for the monitoring agent. A big goal is to keep the monitoring agent idle and asleep as much as possible, so waiting on long(ish) HTTP POSTs is far from ideal. Therefore, the data processing needed to happen out of band from the data submission.
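The bucketing logic itself is simple. The real system is PHP, but here’s a minimal sketch in Python of the averaging step (the function name and sample data are made up for illustration):

```python
from collections import defaultdict

def crunch(samples, interval):
    """Average raw (timestamp, value) samples into fixed buckets.

    samples  -- iterable of (unix_timestamp, value) pairs, any spacing
    interval -- bucket width in seconds (e.g. 300 for 5 minutes)

    Returns a sorted list of (bucket_start, average) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        bucket = ts - (ts % interval)   # align to the interval boundary
        sums[bucket] += value
        counts[bucket] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]

# Irregularly spaced samples collapse to one point per 5-minute bucket:
raw = [(0, 1.0), (90, 3.0), (310, 4.0), (400, 2.0)]
print(crunch(raw, 300))   # [(0, 2.0), (300, 3.0)]
```

Because the buckets are computed from the timestamps rather than from collection slots, it doesn’t matter how regular or irregular the agents’ reporting schedule is.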

That’s where I came across the first big gotcha with the CPCL package. After I had my crunching code written and made the necessary changes to my graphing code to use the averaged data, I started seeing some really weird graphs. Certain ranges of the data were completely wrong, in ways that made no sense at all. For example, on an otherwise normal graph of a machine’s load average, there would be a day’s worth of data showing the machine’s load average at over 1,000,000,000,000 (which probably isn’t even possible without a machine catching on fire). After a lot of debugging, I finally came to the conclusion that even though the query to Cassandra was right, the wrong data was being returned. That huge number of over 1,000,000,000,000 was actually a counter belonging to a network interface on that machine. I tried everything I could think of to find or fix the problem, but nothing I did eliminated the problem completely.

The other big problem manifested itself at the same time as I was seeing the first. For no reason I could discern, large-ish write operations to Cassandra would just hang indefinitely. Not so good for a long-running process that processes data in a serial fashion. I tried debugging this problem extensively as well, and came up with nothing but more questions. What had me completely confused was that I could unfreeze the write operation using a few non-standard methods – either by attaching a strace process to the cruncher process, or by stopping the cruncher process using a Ctrl-Z, then putting it back into the foreground. Both of these methods reliably unfroze the hung cruncher process, but it would typically freeze up sometime later.

These two issues had my wits completely frayed. In desperation, I put up a plea for help on the CPCL GitHub bug tracker. I couldn’t find a mailing list or forum, so I had to post it on the bug tracker, even though that’s not the optimal place for such a request. I waited a few weeks, but never received any kind of response.

Instead of kicking the dead horse further, I decided to try another library – phpcassa. Its classes were laid out in a similar way to CPCL’s, so converting the CPCL code over to phpcassa wasn’t too hard. And guess what! Everything worked properly! No strange hangups, no incorrect data. I’m not sure there was any way I could have known to try phpcassa first, since both libraries looked feature-complete and fairly well documented, but man, I wish I had tried phpcassa first. It would have saved me a lot of headaches.

Think hard about your hardware configuration, then test the crap out of it

Once I had all of my code changes ready, I was pretty anxious to get them pushed out to my production setup. Before I could do so, I needed to set up a production Cassandra cluster. My development clusters were done in a virtualized manner with ten or so Cassandra VMs spread across a couple hardware nodes. This worked perfectly well in my development setup, but my production setup monitors a few orders of magnitude more hosts, so I knew there was going to be some trial and error involved with finding the right production setup.

After talking things over with my boss and another co-worker familiar with Cassandra, I came up with a template of what I was shooting for. I aimed for some fairly capable machines that had been retired from our normal product offerings. They were servers that were just occupying space that would otherwise go unused. I ended up setting up eight Harpertown Xeon (E5420) machines, each with 8GB of RAM, a 500GB disk for the OS and Cassandra commit logs, and three 3TB drives in a RAID0 for the Cassandra data. I figured that the RAID0 would make things nice and fast, and that the large amount of data each node could store would get me to my utilization goals.

I set up the nodes, and quickly ran into a few issues in my pre-deployment testing. The boxes kept losing disks at a rapid rate, even under fairly paltry load conditions. It turns out that the motherboard model that was present in those machines had a fairly high rate of goofy onboard SATA controllers. It was known to our server setup team and our systems restoration folks, but I was unaware of it. Lesson? Research your hardware. Make sure it isn’t crap.

I was able to bypass the issue in the affected servers by using add-on SATA controllers, so I proceeded undeterred. After much preparation, I was ready to put things live in production. I activated the Cassandra code in a write-only state, and everything worked pretty well. Anxious to see how it functioned under heavier load, I started doing some data imports from the old MySQL database into Cassandra, and things got a lot less awesome. Doing any more than one or two concurrent data imports caused a lot of congestion in the Cassandra cluster and made the whole system slow down. I figured I could survive if I could only import data serially, so I let it run for a while. I really wanted to test the data crunching code, so after a while I stopped the imports and started up a couple cruncher processes. They dragged Cassandra down a bit, but things were keeping up. I decided to see how far I could push things, so I started a few more cruncher processes, and they proceeded to drive Cassandra straight into the ground. The entire system ground to a halt. Crap. Frustrated and completely annoyed, I disabled the Cassandra code and went back to the drawing board.

After taking a few days away from the project to clear my head, I realized that I had forgotten a fairly important piece of the process of building my Cassandra cluster. I never really tested it under load. Sure, I pointed my dev environment at it to make sure it worked, but I never tried in any way to simulate the huge difference in load between what my dev environment generates and the total onslaught generated by the 20,000+ servers being monitored in the production environment. Lesson? Load testing is good. Beat the crap out of your setup, then when you think you’re satisfied it can handle the load, double it. Beat it up more.

I decided that my next round of servers would be newer and less flaky. My MySQL setup uses a good number of AMD X6 1055T servers as cheap and fairly capable storage nodes, so I went in that direction. I started with 8GB of RAM in each box, and did a lot of testing to figure out whether I should use single data disks, or multiple disks in a RAID configuration. My first test was with a single disk compared with a two-disk RAID0, and the RAID0 performed much better. The next iteration compared the same two-disk RAID0 configuration with a four-disk RAID10. I expected the RAID10 to perform better, but the numbers really weren’t that different, so I decided on the two-disk RAID0 as my setup to save on equipment costs.

One thing that I could see very clearly in the benchmarking is Cassandra’s huge optimization of write performance at the expense of read performance. I could easily throw a few hundred thousand write operations per second at the test cluster, but it struggled to manage more than a few hundred reads per second. I did a lot of reading to find out what I was doing wrong, and came up with a few pointers (some of which are discussed later), but the big one was simply the scale of the cluster. If you want better read performance out of Cassandra, throw more hardware at it.

So I did. I added more RAM to each box (for a total of 12GB per box currently), and added many additional nodes. My current cluster comprises 22 machines, each with 12GB of RAM and at least 3TB of storage. Go big or go home, right? The big lesson learned here is that Cassandra does better with more hardware nodes of moderate capability than with fewer large nodes. Give the cluster as many I/O paths to your data as possible, or risk trapping it behind I/O bottlenecks.

Node death can be a pain even when you have many other active nodes

During the course of my first attempt at pushing things live in production, a very painful problem emerged as a result of the frequent drive issues the Harpertown boxes suffered. If my web nodes were configured to connect to a node that was offline (as in, the machine was not responding on the network), the whole system would grind to a standstill as web nodes tried, in vain, to connect to the dead node. Even with eleven of twelve nodes active, a large enough percentage of the web processes tried to connect to the dead node that the whole system would eventually jam up waiting for the connection to time out before trying the next node in the list. My long-running processes, like the cruncher script, weathered this gracefully, since connection latency was not a factor for them, and they would remember that a particular node was down. Short-running tasks, namely the HTTP interactions, suffered greatly when a node died, since the system essentially has to relearn the state of each Cassandra node each time a process starts. Since the HTTP stack relies on moving an enormous number of short-lived HTTP connections through the queue quickly, wasting a bunch of time figuring out that a node is down is very costly – the stalled HTTP processes pile up and quickly prevent real work from occurring.
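To illustrate the failure mode, here’s a minimal Python sketch (not my actual PHP client code) of a client walking a node list with a short per-node connect timeout – the kind of fast failover that a short-lived process needs, since it starts with no memory of which nodes are dead:

```python
import socket

def connect_any(nodes, timeout=0.5):
    """Try each node in turn, skipping dead ones quickly.

    nodes   -- list of (host, port) tuples (hypothetical addresses)
    timeout -- per-node connect timeout in seconds; without a short one,
               an unreachable node can stall the caller for the OS
               default, which may be on the order of minutes

    Returns a connected socket, or raises if every node is down.
    """
    last_error = None
    for host, port in nodes:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err    # dead node: note it and try the next one
    raise ConnectionError("no Cassandra nodes reachable") from last_error
```

Multiply even a half-second stall by thousands of fresh HTTP processes per minute and the pile-up described above follows naturally, which is why pushing the health checking down into a load balancer was attractive.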

My solution to this was to add a pair of load balancers into the mix. I have a good amount of experience using the Linux kernel’s load balancing (managed with ipvsadm) with ldirectord as a monitoring agent, but the network setup required to properly implement it wasn’t really conducive to my needs. I wanted to be able to forward the Cassandra TCP connections to anywhere, not just a local subnet directly connected to the load balancers. So I looked around a bit and found haproxy. It’s a lightweight process that does load balancing and service checking for either HTTP or bare TCP connections, which fit my needs to a T. I did some quick testing and found it to be exactly what I needed.

In my final production setup, I grabbed two of my previously discarded Harpertown boxes (making sure they weren’t ones with the SATA issues) and configured them in an active-active setup using haproxy and heartbeat. Each load balancer has a VIP that is active on it at any given time, but that can also fail over to the other box, so the load balancers themselves don’t become a new single point of failure. I configured haproxy to run in multi-process mode, one process for each of the eight CPU cores in the Harpertown machines. The end result is a redundant, service-checking load balancer setup that ignores downed Cassandra nodes and quite easily passed over 100MByte/sec of Cassandra traffic in my testing.
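For reference, a stripped-down haproxy.cfg along these lines looks something like the following. The node names and addresses are hypothetical, and I’m assuming the Thrift RPC port of 9160; note that in tcp mode, the `check` keyword performs a plain TCP connect check, which is all that’s needed to detect a dead node.

```
global
    nbproc 8                  # one haproxy process per CPU core

defaults
    mode    tcp               # bare TCP, not HTTP
    timeout connect 2s
    timeout client  60s
    timeout server  60s

listen cassandra
    bind 0.0.0.0:9160         # Thrift RPC port
    balance roundrobin
    server cass01 10.0.0.11:9160 check
    server cass02 10.0.0.12:9160 check
    server cass03 10.0.0.13:9160 check
```

With this in place, the web nodes only ever connect to the VIP, and a dead Cassandra node simply drops out of the rotation instead of stalling every fresh client process.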

Make sure your Cassandra JVM is properly tuned

Before Cassandra, my typical interaction with Java applications was to see them, acknowledge their presence, then run in the opposite direction. I had some pretty bad experiences trying to deal with Tomcat in my early days as a SysAdmin, and they really turned me off to Java as a whole. Once Cassandra came into the picture, I didn’t have much choice but to learn about it.

I read plenty of articles where people talked about tuning the Java heap, but I really didn’t have a concept of what all the variables in the equation meant. My first attempt at changing the heap size came after a Cassandra process crashed and wouldn’t start up again. I’m not sure what prompted me to think of changing the heap size as a solution, but it worked, temporarily at least. I increased the heap size to (I think) 6GB, out of 8GB of total system RAM. Cassandra started, worked fine for a short while, then promptly crashed and burned. Attempt number one – resounding failure!

I read up some more on what each heap-related configuration setting controlled. Learning the relationship between the new and old generations of heap memory, and how each is used by the garbage collection process, was key. In my first attempt, not knowing what I was changing, I had set the new generation heap size to the same value as the total heap size – pretty much the definition of Doing It Wrong (TM). After some more reading and tweaking, I found that with the 12GB memory footprint on my finalized hardware setup, a 6GB heap gave Cassandra the memory it needed to stay stable, while leaving the OS a fair amount of room for its filesystem caches. I later refined those settings to give the new generation 1GB, leaving 5GB for older data objects.
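For the record, the two knobs in question live in conf/cassandra-env.sh (the values shown are the ones that worked for my 12GB boxes; the right numbers will vary with your hardware and workload):

```shell
# conf/cassandra-env.sh -- heap sizing for a 12GB node
MAX_HEAP_SIZE="6G"    # total JVM heap: half of system RAM
HEAP_NEWSIZE="1G"     # new generation, leaving 5G for the old generation
```

Leaving the other half of RAM to the OS matters more than it might look: Cassandra read performance leans heavily on the kernel’s filesystem cache.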

One tool that I found very helpful was jconsole. It’s included in my OS’s Java Development Kit, and was very useful in getting a good idea how Cassandra was using memory. It gives a real-time view into the innards of a running Java process, which, among many other things, gave me the ability to see how memory was being utilized in the various heap regions, which made how Java uses memory much clearer and easier to understand. Old Java pros probably know all about this, but it’s still pretty new and novel to me.

Tune your column family’s read repair chance

After I had observed the fairly awful read performance in my benchmarks, I started doing some research. In my readings, I came across the description of a column family’s read repair chance, and realized that I was, once again, Doing It Wrong. Basically, the read repair chance is the likelihood that a particular read operation will initiate a read repair operation between the various Cassandra servers that contain replicas of the data object you’re trying to retrieve. If a read repair is initiated, the read operation will go to all replicas, and the results are then compared to make sure all replicas are in sync. If you have your cluster configured to have four replicas of each piece of data, a read operation triggering a read repair will cause all four replicas to do read operations instead of just one (or however many you asked for). For a system that is already under heavy read load, this can make things much slower.

I looked at the column families the benchmarking utility was creating, and sure enough, the read repair chance was set to 1, or 100%. I tried setting it much lower, to 0.1, and my read rates improved dramatically. Setting it lower still, to 0.01, pushed them even higher. I was using a replication factor of four in my benchmarking tests, and the closer I got to a read repair chance of zero, the closer I got to a factor-of-four increase over the “base” read rate at a read repair chance of 100%. This brought my read rates out of “crap” territory and into “probably acceptable” territory. My production schemas were also using a read repair chance of 100%, so I adjusted them down to far more conservative values.
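On the cassandra-cli versions I was using, the adjustment is a one-liner per column family (the column family name here is hypothetical):

```
update column family metrics with read_repair_chance = 0.01;
```

At 0.01, roughly one read in a hundred still fans out to all replicas and repairs any divergence, which keeps the integrity benefit while shedding nearly all of the extra read load.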

It’s worth noting that read repairs aren’t a bad thing per se; they’re just really good at adding load to a cluster. They’re actually a very good sanity check for data integrity, but they come with a cost.

Tune key/row caches

Another thing I found that can improve read performance is the key and row caches. I was well aware of the row cache concept, but mostly discarded it in my case, since my data model didn’t really work with it. The key cache, however, was definitely useful. Basically, the key cache holds a pointer to the disk location where the data for a particular key is stored. This makes disk access a lot less painful, since Cassandra doesn’t have to search through its data files to find the piece of data it’s looking for.

I initially tried setting my key caches large enough to account for all keys currently in each column family. My data model has a fairly fixed number of row keys that scales mostly linearly with the number of hosts being monitored, so I was able to figure out pretty easily how many keys I needed to cache. I set the limit comfortably above that value, and watched as read performance improved as the caches filled in. And then, after coming back to work from a weekend away, I noticed that my key caches seemed to have shrunk far below the maximums I had set. Some quick searches in the Cassandra system logs showed that Cassandra had lowered those limits because of memory pressure inside the JVM. The whole cluster’s stability and proper operation seemed to be negatively impacted by this state (all nodes were showing the ‘memory pressure’ messages in their logs), so I experimented with lower key cache settings until I found a good balance between performance and stability.
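On the pre-1.1 releases I was running, the key cache is sized per column family through cassandra-cli, something like the following (the column family name and count are made up; later releases moved this to a global key_cache_size_in_mb setting in cassandra.yaml):

```
update column family metrics with keys_cached = 500000;
```

The ‘memory pressure’ messages are the signal that this number, multiplied across your column families, no longer fits comfortably inside the JVM heap.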