Tag Archives: big data

Cassandra – Successes, Failures, Headaches, Lessons Learned

Victory!

My long project of converting my monitoring application at work to use Apache Cassandra is finally in production, and for the most part, running smoothly. All data storage and retrieval is using Cassandra, although I’ve yet to take the training wheels off, so to speak. Everything is still logging into the old MySQL databases until I’m confident that everything is working properly. The Cassandra storage setup has been the primary storage method for a bit over a week now, and while everything is functioning properly on the surface, I’d still like to do some additional sanity checks to make sure the data going in is completely consistent in both storage methods. I’ve had enough weirdness during the process to warrant caution.

For the most part, the process of adapting my code base to use Cassandra was straightforward, but getting to know Cassandra has been more complicated. Being a fully redundant and replicated system leaves a lot more room for strangeness than your typical single-host software package, and the fact that it’s a Java app meant that I was stepping into largely unfamiliar territory, since I typically run far, far away from Java apps (except you, Minecraft… we love you!). What follows are descriptions of some of the bumps in my road to victory, so that others may benefit from the lessons learned from my ignorance.

Picking the right interface library is key

Since the bulk of my monitoring application is written in PHP, and I didn’t really feel like doing a complete rewrite, I needed to find an interface library that allowed PHP to talk to a Cassandra cluster. The first library I decided to try was the Cassandra PHP Client Library (CPCL). It was pretty easy to learn and get working, and at first glance, seemed to do exactly what I wanted with a very small amount of pain. When I really got deep into the code conversion, I started having some issues. Things were behaving strangely, and I got fairly (ok, seriously) frustrated. Then I found out it was my fault, and that I was doing things wrong. There’s always a “man, do I feel stupid” moment when things happen like that, but I quickly got past it and started moving on.

Then I came across some issues that really weren’t my fault (I think). Once the basics of converting the code were done, I started adding some new pieces to take advantage of the fact that I’d be able to store many times more data than I was able to previously. One of my previous posts on the topic mentioned my problems with drawing graphs when a large number of data points were involved. My solution to that was to incorporate rrdtool-style averaging of data (henceforth known as “crunching”), which takes the raw data and averages it over fixed time intervals, typically 5 minutes, 30 minutes, 2 hours, 1 day, etc. rrdtool only keeps the raw data for a short period of time, but my system keeps it for as long as Cassandra is told to keep it (currently two years). Combining the averaging over the various time intervals with the raw data itself, I can quickly graph data series over many different time ranges while keeping the PHP memory footprint and data access time at sane levels.
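To make the idea concrete, here’s a minimal sketch of the bucketing math behind the crunching, in plain PHP. The function name, the sample data, and the exact intervals are illustrative, not the production code.

<?php
// Sketch of "crunching": collapse raw (timestamp => value) samples into
// fixed-width buckets (e.g. 300s, 1800s, 7200s, 86400s) by averaging every
// sample that falls inside each bucket.
function crunch(array $raw, $interval)
{
    $buckets = array();
    foreach ($raw as $ts => $value) {
        $bucket = $ts - ($ts % $interval);      // start of the interval this sample belongs to
        if (!isset($buckets[$bucket])) {
            $buckets[$bucket] = array('sum' => 0.0, 'count' => 0);
        }
        $buckets[$bucket]['sum']   += $value;
        $buckets[$bucket]['count'] += 1;
    }

    $averaged = array();
    foreach ($buckets as $start => $b) {
        $averaged[$start] = $b['sum'] / $b['count'];
    }
    ksort($averaged);
    return $averaged;
}

// Example: irregularly spaced raw samples averaged into 5-minute buckets.
$raw = array(1339529068 => 0.01, 1339529071 => 1.01, 1339529200 => 2.01);
print_r(crunch($raw, 300));

The nice part of this scheme is that the raw samples can arrive on any schedule; the bucket math doesn’t care how many samples land in each interval.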

My solution for crunching the data was fairly simple to implement, but works in a fundamentally different way from rrdtool. rrdtool does the averaging calculations as data is submitted, and it does them very quickly because it enforces a fixed number of data points in its database, with fixed intervals between each collection. The scheduling of data collections in my system is completely arbitrary and can happen on any regular or irregular interval. It uses agents that report the monitored data to the storage cluster via HTTP POSTs, so doing the averaging inline would be very costly in terms of web server processing and latency for the monitoring agent. A big goal is to keep the monitoring agent idle and asleep as much as possible, so waiting on long(ish) HTTP POSTs isn’t optimal. Therefore, the data processing needed to be done out of band from the data submission.
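For context, the agent side of that submission path boils down to something like the following; the endpoint URL and field names are made up for illustration and aren’t the real API.

<?php
// Hypothetical agent-side report: one HTTP POST per collection cycle.
$load = sys_getloadavg();               // three floats: 1, 5, 15 minute load averages
$payload = array(
    'host'      => php_uname('n'),      // this machine's hostname
    'metric'    => 'load_average',
    'timestamp' => time(),
    'value'     => $load[0],
);

$ch = curl_init('https://monitor.example.com/submit');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($payload));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);   // don't let the agent hang around on a slow POST
$response = curl_exec($ch);
curl_close($ch);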

That’s where I came across the first big gotcha with the CPCL package. After I had my crunching code written and made the necessary changes to my graphing code to use the averaged data, I started seeing some really weird graphs. Certain ranges of the data were completely wrong, in ways that made no sense at all. For example, on an otherwise normal graph of a machine’s load average, there would be a day’s worth of data showing the machine’s load average at over 1,000,000,000,000 (which probably isn’t even possible without a machine catching on fire). After a lot of debugging, I finally came to the conclusion that even though the query to Cassandra was right, the wrong data was being returned. That huge number of over 1,000,000,000,000 was actually a counter belonging to a network interface on that machine. I tried everything I could think of to find or fix the problem, but nothing I did eliminated it completely.

The other big problem manifested itself at the same time as the first. For no reason I could discern, large-ish write operations to Cassandra would just hang indefinitely. Not so good for a long-running process that handles data in a serial fashion. I tried debugging this problem extensively as well, and came up with nothing but more questions. What had me completely confused was that I could unfreeze the write operation using a few non-standard methods: either by attaching strace to the cruncher process, or by stopping the cruncher with Ctrl-Z and then putting it back into the foreground. Both of these methods reliably unfroze the hung cruncher process, but it would typically freeze up again sometime later.

These two issues had my wits completely frayed. In desperation, I put up a plea for help on the CPCL GitHub issue tracker. I couldn’t find a mailing list or forum, so the bug tracker was the only place to post it, even though that’s not the optimal venue for such a request. I waited a few weeks, but I never received any kind of response.

Instead of kicking the dead horse further, I decided to try another library – phpcassa. The classes were laid out in a similar way to CPCL, so converting the CPCL code over to phpcassa wasn’t too hard. And guess what! Everything worked properly! No strange hangups, no incorrect data. I’m not sure there was any way I could have known to try phpcassa first, since both libraries looked feature-complete and fairly well documented, but man, I wish I had tried phpcassa first. It would have saved me a lot of headache.
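For anyone evaluating the same thing, the basic phpcassa pattern looks roughly like this; the require/autoload mechanics and class namespaces vary a bit between phpcassa releases, and the keyspace, column family, and server names below are just placeholders.

<?php
// Load phpcassa per its README; require/autoload paths differ between releases.
// Keyspace, column family, row key, and server names are placeholders.
$pool = new ConnectionPool('Monitoring', array('cass01:9160', 'cass02:9160'));
$cf   = new ColumnFamily($pool, 'Snapshots');

// Write a row of columns...
$cf->insert('host123:load_average', array(
    'timestamp' => (string) time(),
    'value'     => '0.42',
));

// ...and read it back.
print_r($cf->get('host123:load_average'));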

Think hard about your hardware configuration, then test the crap out of it

Once I had all of my code changes ready, I was pretty anxious to get them pushed out to my production setup. Before I could do so, I needed to set up a production Cassandra cluster. My development clusters were done in a virtualized manner with ten or so Cassandra VMs spread across a couple hardware nodes. This worked perfectly well in my development setup, but my production setup monitors a few orders of magnitude more hosts, so I knew there was going to be some trial and error involved with finding the right production setup.

After talking things over with my boss and another co-worker familiar with Cassandra, I came up with a template of what I was shooting for. I aimed for some fairly capable machines that had been retired from our normal product offerings. They were servers that were just taking up rack space and would otherwise go unused. I ended up setting up eight Harpertown Xeon (E5420) machines, each with 8GB of RAM, a 500GB disk for the OS and Cassandra commit logs, and three 3TB drives in a RAID0 for the Cassandra data. I figured that the RAID0 would make things nice and fast, and that the large amount of data each node could store would get me to my utilization goals.

I set up the nodes, and quickly ran into a few issues in my pre-deployment testing. The boxes kept losing disks at a rapid rate, even under fairly paltry load conditions. It turns out that the motherboard model that was present in those machines had a fairly high rate of goofy onboard SATA controllers. It was known to our server setup team and our systems restoration folks, but I was unaware of it. Lesson? Research your hardware. Make sure it isn’t crap.

I was able to bypass the issue in the affected servers by using add-on SATA controllers, so I proceeded undeterred. After much preparation, I was ready to put things live in production. I activated the Cassandra code in a write-only state, and everything worked pretty well. Anxious to see how it functioned under heavier load, I started doing some data imports from the old MySQL database into Cassandra, and things got a lot less awesome. Doing any more than one or two concurrent data imports caused a lot of congestion in the Cassandra cluster and slowed the whole system to a crawl. I figured I could survive if I could only import data in serial fashion, so I let it run for a while. I really wanted to test the data crunching code, so after a while I stopped the imports and started up a couple cruncher processes. They dragged Cassandra down a bit, but things were keeping up. I decided to see how far I could push things, so I started up a few more cruncher processes, and they proceeded to drive Cassandra straight into the ground. The entire system ground to a halt. Crap. Frustrated and completely annoyed, I disabled the Cassandra code and went back to the drawing board.

After taking a few days away from the project to clear my head, I realized that I had forgotten a fairly important piece of the process of building my Cassandra cluster. I never really tested it under load. Sure, I pointed my dev environment at it to make sure it worked, but I never tried in any way to simulate the huge difference in load between what my dev environment generates and the total onslaught generated by the 20,000+ servers being monitored in the production environment. Lesson? Load testing is good. Beat the crap out of your setup, then when you think you’re satisfied it can handle the load, double it. Beat it up more.

I decided that my next round of servers would be newer and less flaky. My MySQL setup uses a good number of AMD X6 1055T servers as cheap and fairly capable storage nodes, so I went in that direction. I started with 8GB of RAM in each box, and did a lot of testing to figure out whether I should use single data disks, or multiple disks in a RAID configuration. My first test was with a single disk compared with a two-disk RAID0, and the RAID0 performed much better. The next iteration compared the same two-disk RAID0 configuration with a four-disk RAID10. I expected the RAID10 to perform better, but the numbers really weren’t that different, so I decided on the two-disk RAID0 as my setup to save on equipment costs.

One thing that I could see very clearly in the benchmarking was Cassandra’s huge optimization of write performance at the expense of read performance. I could easily throw a few hundred thousand write operations per second at the test cluster, but it struggled to get more than a few hundred reads per second. I did a lot of reading to find out what I was doing wrong, and came up with a few pointers (some of which will be discussed later), but the big one was simply the scale of the cluster. If you want better read performance out of Cassandra, throw more hardware at it.

So I did. I added more RAM to each box (for a total of 12GB per box currently), and added many additional nodes. My current cluster is made up of 22 machines, each with 12GB of RAM and at least 3TB of storage. Go big or go home, right? The big lesson learned here is that Cassandra does better with more hardware nodes of moderate capability, rather than fewer large nodes. Give the cluster as many I/O paths to your data as possible, or risk trapping it behind I/O bottlenecks.

Node death can be a pain even when you have many other active nodes

During the course of my first attempt at pushing things live in production, a very painful problem emerged as a result of the frequent drive issues the Harpertown boxes suffered. If I had my web nodes configured to connect to a node that was offline (as in, the machine is not responding on the network), the whole system would grind to a standstill as web nodes tried, in vain, to connect to the dead node. Even with eleven out of twelve nodes active, a large enough percentage of the web processes tried to connect to the dead node that the whole system would eventually jam up waiting for the connection to time out before trying the next node in the list. My long-running processes, like the cruncher script, weathered this gracefully, since connection latency was not a factor for them, and they would remember that a particular node was down. Short-running tasks, namely the HTTP interactions, suffered greatly when a node died, since the system essentially has to relearn the state of each Cassandra node each time the process starts. Because the HTTP stack relies on moving an enormous number of short-lived HTTP connections through the queue in quick fashion, wasting a bunch of time figuring out that a node is down is very costly – the stalled HTTP processes pile up quickly and prevent real work from occurring.

My solution to this was to add a pair of load balancers into the mix. I have a good amount of experience using the Linux kernel’s load balancing (managed with ipvsadm) with ldirectord as a monitoring agent, but the network setup required to properly implement it wasn’t really conducive to my needs. I wanted to be able to arbitrarily forward the Cassandra TCP connections to anywhere, not just a local subnet directly connected to the load balancers. So I looked around a bit and found haproxy. It’s a lightweight process that will do load balancing and service checking for either HTTP or bare TCP connections, which fit my needs to a T. I did some quick testing and found it to be exactly what I needed.

In my final production setup, I grabbed two of my previously discarded Harpertown boxes (making sure they weren’t the ones with SATA issues) and configured them in an active-active setup using haproxy and heartbeat. Each load balancer carries a VIP that is active at any given time but can fail over to the other box, so the load balancers themselves aren’t a single point of failure. I configured haproxy to run in a multi-process fashion, one process for each of the eight CPU cores in the Harpertown machines. The end result is a redundant, service-checking load balancer setup that ignores downed Cassandra nodes and quite easily passed over 100MByte/sec of Cassandra traffic in my testing.
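The relevant pieces of the haproxy configuration amount to something like the snippet below; the addresses, timeouts, and check intervals are illustrative, not the production values.

global
    nbproc 8                      # one haproxy process per CPU core

defaults
    mode    tcp                   # bare TCP balancing, no HTTP inspection
    timeout connect 5s
    timeout client  60s
    timeout server  60s

listen cassandra
    bind 10.0.0.50:9160           # the shared VIP managed by heartbeat
    balance leastconn
    server cass01 10.0.1.1:9160 check inter 2s fall 3 rise 2
    server cass02 10.0.1.2:9160 check inter 2s fall 3 rise 2
    # ...one "server" line per Cassandra node

With this in place, the web nodes only ever point at the VIP, and haproxy’s health checks quietly stop sending traffic to any Cassandra node that goes dark.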

Make sure your Cassandra JVM is properly tuned

Before Cassandra, my typical interaction with Java applications was to see them, acknowledge their presence, then run in the opposite direction.  I had some pretty bad experiences trying to deal with Tomcat in my early days as a SysAdmin, and they really turned me off to Java as a whole. Once Cassandra came into the picture, I didn’t have much choice but to learn about it.

I read plenty of articles where people talked about tuning the Java heap, but really didn’t have a concept of what all the variables in the equation meant. My first attempts at changing the heap size came after a Cassandra process crashed, and wouldn’t start up again. I’m not sure what prompted me to think about changing the heap size as a solution, but it worked, temporarily at least. I increased the heap size to 6GB I think, out of 8GB of total system RAM. Cassandra started, worked fine for a short while, then promptly crashed and burned. Attempt number one – resounding failure!

I read up some more on what each heap-related configuration setting actually controlled. Learning the relationship between the new and old generations of heap memory, and how the garbage collector uses them, was key. In my first attempt, not knowing what I was changing, I set the new generation heap memory to the same value as the total heap memory. Pretty much the definition of Doing It Wrong (TM). After some more reading and tweaking, I found that with the 12GB memory footprint on my finalized hardware setup, a 6GB heap size provided Cassandra with the memory it needed to stay stable, while leaving the OS with a fair amount of room for its filesystem caches. I later refined those heap settings to set the new generation heap size to 1GB, leaving 5GB for older data objects.
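In the Cassandra packages of that era, those two numbers live in conf/cassandra-env.sh and translate to the JVM’s -Xms/-Xmx and -Xmn flags; the change looks something like this (the values are the ones discussed above, not a universal recommendation):

# conf/cassandra-env.sh -- override the auto-calculated defaults
MAX_HEAP_SIZE="6G"    # total heap (-Xms/-Xmx)
HEAP_NEWSIZE="1G"     # new-generation size (-Xmn)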

One tool that I found very helpful was jconsole. It’s included in my OS’s Java Development Kit, and was very useful in getting a good idea of how Cassandra was using memory. It gives a real-time view into the innards of a running Java process and, among many other things, let me watch how memory was being utilized in the various heap regions, which made Java’s memory behavior much clearer and easier to understand. Old Java pros probably know all about this, but it’s still pretty new and novel to me.

Tune your column family’s read repair chance

After I had observed the fairly awful read performance in my benchmarks, I started doing some research. In my readings, I came across the description of a column family’s read repair chance, and realized that I was, once again, Doing It Wrong. Basically, the read repair chance is the likelihood that a particular read operation will initiate a read repair operation between the various Cassandra servers that contain replicas of the data object you’re trying to retrieve. If a read repair is initiated, the read operation will go to all replicas, and the results are then compared to make sure all replicas are in sync. If you have your cluster configured to have four replicas of each piece of data, a read operation triggering a read repair will cause all four replicas to do read operations instead of just one (or however many you asked for). For a system that is already under heavy read load, this can make things much slower.

I looked at the column families that the benchmarking utility was creating, and sure enough, the read repair chance was set to 1, or 100%. I tried setting it much lower, to 0.1, and my read rates improved dramatically. Setting it even lower, to 0.01, pushed them higher still. I was using a replication factor of four in my benchmarking tests, and the closer I got to a read repair chance of zero, the closer I got to a factor-of-four improvement over the “base” read rate measured at a read repair chance of 100%. This brought my read rates out of “crap” territory and into “probably acceptable” territory. My production schemas were also using a read repair chance of 100%, so I adjusted them down to far more conservative values.
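Adjusting it is a one-liner in cassandra-cli; the column family name here is a placeholder:

[default@keyspace] update column family DataPoints with read_repair_chance = 0.1;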

It’s worth noting that read repairs aren’t a bad thing, per se; they’re just really good at adding load to a cluster. They’re actually a very good sanity check for data integrity, but they come with a cost.

Tune key/row caches

Two other things I found that can improve read performance are the key and row caches. I was well aware of the row cache concept, and mostly discarded it in my case, since my data model didn’t really work with it. The key cache, however, was definitely useful. Basically, the key cache stores a pointer to the on-disk location where the data for a particular key lives. This makes disk access a lot less painful, since Cassandra doesn’t have to search through its data files to find the piece of data it’s looking for.

I initially tried setting my key caches large enough to account for all keys currently in each column family. My data model has a fairly fixed number of row keys that scales mostly linearly with the number of hosts being monitored by the system, so I was able to figure out pretty easily how many keys I needed to cache. I set the number comfortably above that value, and watched as read performance improved while the caches filled in. Then, after coming back to work from a weekend, I noticed that my key caches seemed to have shrunk far below the maximums I had set previously. Some quick searches in the Cassandra system logs showed that Cassandra had lowered those limits because of memory pressure inside the JVM. The whole cluster’s stability and proper operation seemed to be negatively impacted by this state (all nodes were showing the ‘memory pressure’ messages in their logs), so I experimented with some lower key cache settings until I found a good balance between performance and stability.
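Depending on the Cassandra version, the key cache is either a per-column-family attribute or a global key_cache_size_in_mb setting in cassandra.yaml. For the per-column-family style, the cassandra-cli adjustment looks something like this, with a placeholder name and count:

[default@keyspace] update column family DataPoints with keys_cached = 200000;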

 

The Problem With Dealing With More Data Than You Can Deal With

Over the past few weeks, as I’ve mentioned in previous posts, I’ve been working on converting a server monitoring application to use Apache Cassandra as its storage engine. Now that I’ve gotten past the initial hurdles of learning the system and my own stupidity while making code modifications, the code is successfully converted and all of my collected data is dumping into Cassandra. Now what?

For the life of the application, I’ve stored collected data in two ways. First is a simple snapshot of the latest value collected, along with its time stamp, which is used for simple numeric threshold checks, e.g. “Is the server’s memory usage currently too high?” or “Is free disk space currently too low?”. Each piece of snapshot data is overwritten with the newest value when it’s collected. The other method is a historical record of all values collected. Numeric data gets stored each time it’s collected, and text-based data (software versions, currently loaded kernel modules, etc.) is logged when it changes. This allows the application to draw (somewhat) pretty graphs of numerical data or provide a nice change log of text-based data.

An Example Graph

My current quandary is how to deal with the vast amounts of data I’ll be able to store. Previously I had to constantly prune the amount of data stored so that MySQL wouldn’t melt down under the weight of indexing and storing millions of data points. I set up scripts that would execute nightly and trim away data that was older than a certain point in time, and then optimize the data tables to keep things running quickly. Cassandra shouldn’t have that problem.

Even though I’ve only been storing data in Cassandra for a few weeks, I’m already running into issues with having more data than I can handle. My graphing scripts are currently set up to get all of the data that will be graphed in a single request, iterate through it to determine the Y-axis minimums and maximums, and then build the graph. They then grab another set of data via a single request to draw the colored bar at the bottom of the graph, which shows whether data collection attempts succeeded or failed. With that approach, I’m a slave to the amount of memory PHP can allocate, since the arrays I’m building with the data from Cassandra can only get so large before PHP shuts things down. I’m already hitting that ceiling with test servers in my development environment.

Some of the possible solutions to this problem are tricky. Some of them are easy, but won’t work forever. Some of them require out-of-band processing of data that makes graphing easier. None of the potential solutions I’ve come up with is a no-brainer. Since some of the graphed data is customer-facing, performance is a concern.

  1. Increase the PHP memory limit. This one is easy, but will only work for so long. I’m already letting the graph scripts allocate 128MB of RAM, which is on the high side in my book.
  2. Pull smaller chunks of my data set in the graphing code, and iterate through them to create graphs (see the sketch after this list). This is probably the most sane approach, all told, but it seems fairly inefficient with how things are currently structured. I’d have to do two passes through the graph data in order to draw the graph (the first to grab the data set boundaries, and the second to actually draw the data points within the graph), and a single pass through the data detailing whether collections were successful or not. For a larger number of data points, this could mean a fair number of Cassandra get operations, which would cause slow graphing performance.
  3. Take an approach similar to how MRTG does things, where data is averaged over certain time frames, with the higher-resolution data kept for shorter periods and the longer-interval averages stored longer. This is something I’ve wanted to do for a while, but I’m not sure how much out-of-band processing it would require in the production cluster. One possible advantage is that if I did some basic analysis, I could store things like maximum and minimum values for particular time ranges ahead of time, and use those in my graphs instead of calculating them on the fly.
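For option 2, the shape of the chunked approach would be roughly the following; fetch_slice() is a stand-in for whatever slice call the client library provides, so its signature here is hypothetical.

<?php
// First pass of option 2: walk a row in fixed-size column slices instead of
// one giant get(), keeping only the running min/max needed for the Y-axis.
// A second, similar pass would plot the actual points.
function graph_bounds($row_key, $start_ts, $end_ts, $chunk = 1000)
{
    $min = null;
    $max = null;
    $cursor = $start_ts;

    while ($cursor <= $end_ts) {
        // Hypothetical helper: returns up to $chunk columns (ts => value)
        // starting at $cursor, or an empty array when the row is exhausted.
        $slice = fetch_slice($row_key, $cursor, $end_ts, $chunk);
        if (empty($slice)) {
            break;
        }
        foreach ($slice as $ts => $value) {
            $min = ($min === null) ? $value : min($min, $value);
            $max = ($max === null) ? $value : max($max, $value);
        }
        // Resume just past the last column we saw.
        $cursor = max(array_keys($slice)) + 1;
    }
    return array($min, $max);
}

Memory stays bounded by the chunk size, but every chunk is another round trip to Cassandra, which is exactly the performance trade-off described above.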

I’m sure there are brilliant folks out there who have come up with elegant solutions to this type of problem, but at this point, I’m kind of stuck.

First Steps Into Big Data With Apache Cassandra

I’ve got a monitoring application at work that I wrote and maintain which currently uses MySQL as a storage back end. With the amount of data it holds and the activity in the system, MySQL has gone from “probably not the optimal solution” to “really stupid”. The system is made up of many storage servers, most of which are completely I/O bottlenecked because of MySQL’s write overhead. It’s a typical “big data” kind of problem, and I need a big data solution.

Over the past couple of weeks, I’ve been experimenting with Apache Cassandra. We recently started using it in another context, and it seems pretty damned slick. Based on what I read, it seems like a great fit for my needs. The data model is consistent with what I do in MySQL, and the built-in redundancy and replication is awesome.

Most of the stuff I’ve tried so far has Just Worked (TM). Setting up a basic test cluster was easy, and once I found a suitable PHP client library for accessing Cassandra, I was able to make my development setup store collected data in under 30 minutes. I started off using a data model that pretty much mirrored what I was doing in MySQL, but as I learned more, I was able to strip away a few of the MySQL-specific “optimizations” (read: hacks) in favor of a more streamlined setup.

However, there are a few things that just make me scratch my head. From what I can tell, updating rows in Cassandra is “strange”. In my testing so far, inserting new data works flawlessly. Both adding new rows and adding columns onto an existing row work as expected. However, I notice lots of weirdness when updating pre-existing columns in pre-existing rows. It seems as though Cassandra only updates a column’s value if the new value is “larger” than the previous one. See the following for an example.

# ./casstest
truncating column family for cleanliness...
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529068
    [value] => 0.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529068
    [value] => 0.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529071
    [value] => 1.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529071
    [value] => 1.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529074
    [value] => 2.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529074
    [value] => 2.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529077
    [value] => 1.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529077
    [value] => 2.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529080
    [value] => 0.05
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529080
    [value] => 2.01
)
========================================================

In the example above, the timestamp column is just the result of a call to time(), so it will always increment over time. The values for the value column are just a few static entries pulled from a pre-populated array I used for testing. They rise for the first three writes, then fall for the last two. I’m just making a simple array out of the two pieces of data, and then doing a set operation to write the data into Cassandra. As you can see, the timestamp fields show the proper values each time the key is retrieved, but the value column only shows the proper value when the value being written is larger than the last. WTF? I don’t know whether to blame Cassandra or the PHP client library I’m using (CPCL), but it’s really cramping my style at this point. I’ve gone as far as watching the contents of the TCP connections between client and server with tcpdump/wireshark to see if the client is making the same set requests for all values, and it seems to be. I’ve also tried changing the write consistency level, with no change.

It is also worth noting that when using the cassandra-cli utility to do sets/gets manually, things work as I would expect.

[default@keyspace] assume TestCF VALIDATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF SUB_COMPARATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF keys as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF COMPARATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=2.01, timestamp=172800)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('0.0');
Value inserted.
Elapsed time: 1 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=0.0, timestamp=1339532783904000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('2.0');
Value inserted.
Elapsed time: 2 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=2.0, timestamp=1339532783913000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('1.5');
Value inserted.
Elapsed time: 1 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=1.5, timestamp=1339532783923000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('0.2');
Value inserted.
Elapsed time: 0 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=0.2, timestamp=1339532783933000)
Returned 2 results.
Elapsed time: 2 msec(s).

Another thing that isn’t acting as I would expect is row deletions. In my testing, it seems that once a row has been deleted, subsequent attempts to write to that row will just silently fail. I suspect it has to do with the fact that Cassandra’s distributed nature makes deletes a bit counter-intuitive, which is outlined in the Cassandra documentation. It would be nice to know for sure, though.

EDIT: I was doing it wrong. Sigh. Deletes are still weird to me though.