The Problem With Dealing With More Data Than You Can Deal With

Over the past few weeks, as I’ve mentioned in previous posts, I’ve been working on converting a server monitoring application to use Apache Cassandra as its storage engine. Now that I’ve gotten past the initial hurdles of learning the system (and my own stupidity while making code modifications), the conversion is complete and all of my collected data is flowing into Cassandra. Now what?

For the life of the application, I’ve stored collected data in two ways. The first is a simple snapshot of the latest value collected, along with its timestamp, which is used for simple numeric threshold checks, i.e. “is the server’s memory usage currently too high?” or “is free disk space currently too low?” Each piece of snapshot data is overwritten with the newest value when it’s collected. The other method is a historical record of all values collected. Numeric data gets stored each time it’s collected, and text-based data (software versions, currently loaded kernel modules, etc.) is logged when it changes. This allows the application to draw (somewhat) pretty graphs of numerical data or provide a nice change log of text-based data.
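
For illustration, here’s a very rough sketch of the two write paths. The $cassandra wrapper, column family names, and key layout are all hypothetical placeholders, not the application’s actual schema:

```php
<?php
// Hypothetical client wrapper and schema, for illustration only.
function store_metric($cassandra, $serverId, $metric, $value, $timestamp)
{
    // Snapshot: one column per metric, overwritten on every collection.
    $cassandra->insert('MetricSnapshots', $serverId, array(
        $metric => array('value' => $value, 'collected_at' => $timestamp),
    ));

    // History: one column per collection time, so every value is kept
    // (text-based data would only be written here when it changes).
    $cassandra->insert('MetricHistory', $serverId . ':' . $metric, array(
        $timestamp => $value,
    ));
}
```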

[Image: An Example Graph]

My current quandary is how to deal with the vast amounts of data I’ll be able to store. Previously I had to constantly prune the amount of data stored so that MySQL wouldn’t melt down under the weight of indexing and storing millions of data points. I set up scripts that would execute nightly and trim away data that was older than a certain point in time, and then optimize the data tables to keep things running quickly. Cassandra shouldn’t have that problem.

Even though I’ve only been storing data in Cassandra for a few weeks, I’m already running into issues with having more data than I can handle. My graphing scripts are currently set up to fetch all of the data that will be graphed in a single request, iterate through it to determine the Y-axis minimum and maximum, and then build the graph. They then grab another set of data, again in a single request, to draw the colored bar at the bottom of the graph showing whether data collection attempts succeeded or failed. With that approach, I’m a slave to the amount of memory PHP can allocate, since the arrays I’m building with the data from Cassandra can only get so large before PHP shuts things down. I’m already hitting that ceiling with test servers in my development environment.
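
Roughly, the current flow looks something like this (the function names are placeholders). Everything comes back as one big array, and those arrays are what run into the memory limit:

```php
<?php
// Placeholder names -- a sketch of the current load-everything approach.
$points = fetch_history($serverId, $metric, $start, $end);    // all data points at once
$status = fetch_collection_status($serverId, $start, $end);   // success/failure bar data

$yMin = min($points);
$yMax = max($points);

draw_graph($points, $status, $yMin, $yMax);   // every array held in memory at the same time
```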

Some of the possible solutions to this problem are tricky. Some of them are easy, but won’t work forever. Some of them require out-of-band processing of data that makes graphing easier. None of the potential solutions I’ve come up with is a no-brainer. Since some of the graphed data is customer-facing, performance is a concern.

  1. Increase the PHP memory limit. This one is easy, but will only work for so long. I’m already letting the graph scripts allocate 128MB of RAM, which is on the high side in my book.
  2. Pull smaller chunks of the data set in the graphing code, and iterate through them to create graphs (see the sketch after this list). This is probably the sanest approach, all told, but it seems fairly inefficient with how things are currently structured. I’d have to do two passes through the graph data in order to draw the graph (the first to grab the data set boundaries, and the second to actually draw the data points within the graph), plus a single pass through the data detailing whether collections were successful or not. For a larger number of data points, this could mean a fair number of Cassandra get operations, which would mean slow graphing performance. 
  3. Take an approach similar to how MRTG does things, where data is averaged over certain time frames: higher-resolution data is kept for shorter periods, while longer-interval averages are stored longer (a rough sketch follows this list). This is something I’ve wanted to do for a while, but I’m not sure how much out-of-band processing it would require in the production cluster. One possible advantage is that if I did some basic analysis, I could store things like maximum and minimum values for particular time ranges ahead of time, and use those in my graphs instead of calculating them on the fly. 

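Option 1 is literally a one-line change (ini_set('memory_limit', '256M'), or the equivalent php.ini tweak), which is exactly why it’s tempting and exactly why it only postpones the problem.

To make option 2 concrete, here’s a rough sketch of walking the time range in fixed-size slices instead of one giant request. The fetch_history_slice() wrapper and its “resume after the last timestamp” pagination are assumptions about how the range query could be wrapped, not the actual client API:

```php
<?php
// Sketch of option 2: walk the time range in slices, keeping only one
// slice in memory at a time. fetch_history_slice() is a hypothetical
// wrapper around a Cassandra range/slice query.
function each_history_slice($serverId, $metric, $start, $end, $callback, $sliceSize = 5000)
{
    $cursor = $start;
    while ($cursor <= $end) {
        $slice = fetch_history_slice($serverId, $metric, $cursor, $end, $sliceSize);
        if (empty($slice)) {
            break;
        }
        $callback($slice);
        $cursor = max(array_keys($slice)) + 1;   // resume just past the last timestamp seen
        if (count($slice) < $sliceSize) {
            break;                               // short slice means we hit the end of the range
        }
    }
}

// First pass: find the Y-axis bounds without ever holding the whole set.
$min = null;
$max = null;
each_history_slice($serverId, $metric, $start, $end, function ($slice) use (&$min, &$max) {
    foreach ($slice as $ts => $value) {
        if ($min === null || $value < $min) { $min = $value; }
        if ($max === null || $value > $max) { $max = $value; }
    }
});
// A second pass over the same slices would plot the points against those bounds.
```

And a sketch of the MRTG-style idea in option 3: collapse raw points into fixed-width time buckets, storing the average, minimum, and maximum for each bucket so the graphs never have to touch the raw data. The bucket math below is generic; how and where the rollups get written back to Cassandra is the open “out-of-band processing” question:

```php
<?php
// Sketch of option 3: roll raw points up into fixed-width buckets
// (e.g. 300 seconds for five-minute averages), keeping avg/min/max per bucket.
function rollup(array $points, $bucketSeconds)
{
    $buckets = array();
    foreach ($points as $ts => $value) {
        $bucketStart = $ts - ($ts % $bucketSeconds);
        if (!isset($buckets[$bucketStart])) {
            $buckets[$bucketStart] = array('sum' => 0, 'count' => 0,
                                           'min' => $value, 'max' => $value);
        }
        $b = &$buckets[$bucketStart];
        $b['sum']   += $value;
        $b['count'] += 1;
        $b['min']    = min($b['min'], $value);
        $b['max']    = max($b['max'], $value);
        unset($b);
    }
    foreach ($buckets as &$b) {
        $b['avg'] = $b['sum'] / $b['count'];
    }
    return $buckets;   // keyed by bucket start time
}
```
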
I’m sure there are brilliant folks out there who have come up with elegant solutions to this type of problem, but at this point, I’m kind of stuck.

My boss proposed a solution related to the number of pixels the graph can contain. His idea was to “stream” the data through a function that averages it out on a per-pixel basis as the graph is built. This would essentially give me a fixed-size data structure to render the graph from, and would allow for grabbing smaller chunks of the data set in a single pass. I could also use it to draw minimums and maximums for each pixel’s time range, but that might make the graph noisy.
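
A rough sketch of that per-pixel idea, assuming the graph is $widthPx pixels wide and points can be fed in as they’re fetched (slice-at-a-time, as above). Nothing here ever grows past one small entry per pixel column:

```php
<?php
// Sketch of per-pixel averaging: map each timestamp to a pixel column and
// keep a running sum/count/min/max per column. Fixed-size output no matter
// how many raw points go in.
function pixel_buckets($widthPx, $start, $end)
{
    return array(
        'width'   => $widthPx,
        'start'   => $start,
        'perPx'   => max(1, ($end - $start) / $widthPx),   // seconds of data per pixel
        'columns' => array(),
    );
}

function add_point(array &$graph, $ts, $value)
{
    $px = (int) floor(($ts - $graph['start']) / $graph['perPx']);
    $px = max(0, min($graph['width'] - 1, $px));           // clamp to the drawable area
    if (!isset($graph['columns'][$px])) {
        $graph['columns'][$px] = array('sum' => 0, 'count' => 0,
                                       'min' => $value, 'max' => $value);
    }
    $col = &$graph['columns'][$px];
    $col['sum']   += $value;
    $col['count'] += 1;
    $col['min']    = min($col['min'], $value);
    $col['max']    = max($col['max'], $value);
}
```

Each column’s average is just sum/count at render time, and the per-pixel min/max values are sitting there if the noisier band turns out to be worth drawing.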
