The Problem With Dealing With More Data Than You Can Deal With

Over the past few weeks, as I’ve mentioned in previous posts, I’ve been working on converting a server monitoring application to use Apache Cassandra as its storage engine. Now that I’ve gotten past the initial hurdles of learning the system and my own stupidity while making code modifications, the code is successfully converted and all of my collected data is dumping into Cassandra. Now what?

For the life of the application, I’ve stored collected data in two ways. The first is a simple snapshot of the latest value collected, along with its time stamp, which is used for simple numeric threshold checks, e.g. “is the server’s memory usage currently too high?” or “is free disk space currently too low?”. Each piece of snapshot data is overwritten with the newest value when it’s collected. The other method is a historical record of all values collected. Numeric data gets stored each time it’s collected, and text-based data (software versions, currently loaded kernel modules, etc.) is logged when it changes. This allows the application to draw (somewhat) pretty graphs of numerical data or provide a nice change log of text-based data.
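
In Cassandra terms, the two shapes end up looking something like this. The $cassandra->set() wrapper and the column family names are made up for illustration; they’re not the application’s actual schema:

<?php
// Rough sketch of the two storage shapes described above. The wrapper and
// column family names are hypothetical, not the application's real schema.

$server = 'web01';
$metric = 'memory_used_pct';
$now    = time();
$value  = 72.4;

// 1) Snapshot: one row per server/metric pair, overwritten on every collection.
$cassandra->set('MetricSnapshot', "$server:$metric", array(
    'timestamp' => $now,
    'value'     => $value,
));

// 2) History: time-ordered columns appended to a per-metric row, so a graph
//    can pull a date range as a slice of a single row.
$cassandra->set('MetricHistory', "$server:$metric", array(
    $now => $value,
));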

An Example Graph

My current quandary is how to deal with the vast amounts of data I’ll be able to store. Previously I had to constantly prune the amount of data stored so that MySQL wouldn’t melt down under the weight of indexing and storing millions of data points. I set up scripts that would execute nightly and trim away data that was older than a certain point in time, and then optimize the data tables to keep things running quickly. Cassandra shouldn’t have that problem.
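
The nightly cleanup was along these lines; this is just a rough sketch with made-up table, column, and connection names, not the actual job:

<?php
// Hypothetical nightly prune job of the sort described above; the table,
// column, and connection details are illustrative only.
$pdo = new PDO('mysql:host=localhost;dbname=monitoring', 'monitor', 'secret');

// Trim history older than the retention window (say, 90 days).
$cutoff = time() - (90 * 86400);
$stmt = $pdo->prepare('DELETE FROM metric_history WHERE collected_at < :cutoff');
$stmt->execute(array(':cutoff' => $cutoff));

// Defragment the table and refresh index statistics after the bulk delete.
$pdo->exec('OPTIMIZE TABLE metric_history');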

Even though I’ve only been storing data in Cassandra for a few weeks, I’m already running into issues with having more data than I can handle. My graphing scripts are currently set up to fetch all of the data to be graphed in a single request, iterate through it to determine the Y-axis minimum and maximum, and then build the graph. They then grab another set of data, again in a single request, to draw the colored bar at the bottom of the graph, which shows whether each data collection attempt succeeded or failed. With that approach, I’m a slave to the amount of memory PHP can allocate, since the arrays I’m building with the data from Cassandra can only get so large before PHP shuts things down. I’m already hitting that ceiling with test servers in my development environment.
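
In rough pseudo-PHP (the $cassandra->get() call and the drawing helper are stand-ins, not the real graphing code), the current flow is something like the following, and it’s that single $points array that blows through the memory limit:

<?php
// Roughly the current approach, with stand-in names: the whole time range
// lands in one PHP array before anything is drawn. $server, $metric, $from,
// and $to are assumed to describe the graph being requested.
$points = $cassandra->get('MetricHistory', "$server:$metric",
                          array('start' => $from, 'end' => $to));

// Pass 1: scan everything to find the Y-axis bounds.
$yMin = min($points);
$yMax = max($points);

// Pass 2: plot each point against those bounds.
foreach ($points as $ts => $value) {
    plot_point($graph, $ts, $value, $yMin, $yMax);  // hypothetical drawing helper
}

// A second full-range fetch drives the success/failure bar along the bottom.
$statuses = $cassandra->get('CollectionStatus', "$server:$metric",
                            array('start' => $from, 'end' => $to));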

Some of the possible solutions to this problem are tricky. Some of them are easy, but won’t work forever. Some of them require out-of-band processing of data that makes graphing easier. None of the potential solutions I’ve come up with is a no-brainer. Since some of the graphed data is customer-facing, performance is a concern.

  1. Increase the PHP memory limit. This one is easy, but will only work for so long. I’m already letting the graph scripts allocate 128MB of RAM, which is on the high side in my book.
  2. Pull smaller chunks of my data set in the graphing code, and iterate through them to create graphs (see the sketch after this list). This is probably the most sane approach, all told, but it seems fairly inefficient with how things are currently structured. I’d have to do two passes through the graph data in order to draw the graph (the first to grab the data set boundaries, and the second to actually draw the data points within the graph), plus a single pass through the data detailing whether collections were successful or not. For a larger number of data points, this could mean a fair number of Cassandra get operations, which would cause slow graphing performance.
  3. Take an approach similar to how MRTG does things, where data is averaged over set time frames: higher-resolution data is kept for shorter periods, and longer-interval averages are stored longer. This is something I’ve wanted to do for a while, but I’m not sure how much out-of-band processing it would require in the production cluster. One possible advantage is that if I did some basic analysis, I could store things like maximum and minimum values for particular time ranges ahead of time, and use those in my graphs instead of calculating them on the fly.
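
Here’s a minimal sketch of option 2, assuming a hypothetical getSlice() wrapper that returns at most $chunk columns of a row between two timestamps. Each pass holds only one chunk in memory at a time:

<?php
// Sketch of option 2: walk the row in fixed-size slices instead of one giant
// get(). getSlice() is a hypothetical wrapper returning at most $chunk columns
// of $key between the $start and $end timestamps, ordered by column name.
function each_chunk($cassandra, $cf, $key, $start, $end, $chunk, $callback) {
    $cursor = $start;
    while (true) {
        $slice = $cassandra->getSlice($cf, $key, $cursor, $end, $chunk);
        if (empty($slice)) {
            break;
        }
        foreach ($slice as $ts => $value) {
            $callback($ts, $value);
        }
        if (count($slice) < $chunk) {
            break;                             // a short slice means end of range
        }
        $cursor = max(array_keys($slice)) + 1; // resume just past the last column
    }
}

// Pass 1: streaming min/max for the Y-axis; only one chunk in memory at a time.
$yMin = INF;
$yMax = -INF;
each_chunk($cassandra, 'MetricHistory', "$server:$metric", $from, $to, 1000,
    function ($ts, $value) use (&$yMin, &$yMax) {
        $yMin = min($yMin, $value);
        $yMax = max($yMax, $value);
    });

// Pass 2: draw each point against those bounds.
each_chunk($cassandra, 'MetricHistory', "$server:$metric", $from, $to, 1000,
    function ($ts, $value) use ($graph, $yMin, $yMax) {
        plot_point($graph, $ts, $value, $yMin, $yMax); // hypothetical drawing helper
    });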

I’m sure there are brilliant folks out there who have come up with elegant solutions to this type of problem, but at this point, I’m kind of stuck.

First Steps Into Big Data With Apache Cassandra

I’ve got a monitoring application at work that I wrote and maintain, which currently uses MySQL as a storage back end. With the amount of data it holds and the activity in the system, MySQL has gone from “probably not the optimal solution” to “really stupid”. The system is made up of many storage servers, most of which are completely I/O bottlenecked because of MySQL’s write overhead. It’s a typical “big data” kind of problem, and I need a big data solution.

Over the past couple of weeks, I’ve been experimenting with Apache Cassandra. We recently started using it in another context, and it seems pretty damned slick. Based on what I’ve read, it looks like a great fit for my needs. The data model is consistent with what I do in MySQL, and the built-in redundancy and replication are awesome.

Most of the stuff I’ve tried so far has Just Worked (TM). Setting up a basic test cluster was easy, and once I found a suitable PHP client library for accessing Cassandra, I was able to make my development setup store collected data in under 30 minutes. I started off using a data model that pretty much mirrored what I was doing in MySQL, but as I learned more, I was able to strip away a few of the MySQL-specific “optimizations” (read: hacks) in favor of a more streamlined setup.

However, there are a few things that just make me scratch my head. From what I can tell, updating rows in Cassandra is “strange”. In my testing so far, inserting new data works flawlessly: both adding new rows and adding columns onto an existing row work as expected. But I notice lots of weirdness when updating pre-existing columns in pre-existing rows. It seems as though Cassandra only updates a column’s value if the new value is “larger” than the previous one. See the following for an example.

# ./casstest
truncating column family for cleanliness...
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529068
    [value] => 0.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529068
    [value] => 0.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529071
    [value] => 1.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529071
    [value] => 1.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529074
    [value] => 2.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529074
    [value] => 2.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529077
    [value] => 1.01
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529077
    [value] => 2.01
)
========================================================
What we're storing...
Array
(
    [timestamp] => 1339529080
    [value] => 0.05
)
storing ...
sleeping a second for consistency...
What is retrieved from a get()...
Array
(
    [timestamp] => 1339529080
    [value] => 2.01
)
========================================================

In the example above, the timestamp column is just the result of a call to time(), so it will always increment over time. The values for the value column are just a few static entries pulled from a pre-populated array I used for testing. They increment three times, then decrement twice. I’m just building a simple array out of the two pieces of data and doing a set operation to write it into Cassandra. As you can see, the timestamp column shows the proper value each time the key is retrieved, but the value column only shows the proper value when the value being written is larger than the last one. WTF? I don’t know whether to blame Cassandra or the PHP client library I’m using (CPCL), but it’s really cramping my style at this point. I’ve gone as far as watching the contents of the TCP connections between client and server with tcpdump/wireshark to see whether the client is making the same set requests for all values, and it seems to be. I’ve also tried changing the write consistency level, with no change.
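
For reference, the test driver is roughly this shape (the $cassandra wrapper and the TestCF/TestKey names are stand-ins, not CPCL’s actual API):

<?php
// Rough shape of the casstest script above; $cassandra is a stand-in wrapper,
// not CPCL's actual API, and the column family/key names are illustrative.
$values = array(0.01, 1.01, 2.01, 1.01, 0.05);

echo "truncating column family for cleanliness...\n";
$cassandra->truncate('TestCF');

foreach ($values as $value) {
    $data = array('timestamp' => time(), 'value' => $value);

    echo "What we're storing...\n";
    print_r($data);

    echo "storing ...\n";
    $cassandra->set('TestCF', 'TestKey', $data);

    echo "sleeping a second for consistency...\n";
    sleep(1);

    echo "What is retrieved from a get()...\n";
    print_r($cassandra->get('TestCF', 'TestKey'));
}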

It is also worth noting that when I use the cassandra-cli utility to do sets and gets manually, things work as I would expect.

[default@keyspace] assume TestCF VALIDATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF SUB_COMPARATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF keys as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] assume TestCF COMPARATOR as utf8; 
Assumption for column family 'TestCF' added successfully.
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=2.01, timestamp=172800)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('0.0');
Value inserted.
Elapsed time: 1 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=0.0, timestamp=1339532783904000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('2.0');
Value inserted.
Elapsed time: 2 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=2.0, timestamp=1339532783913000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('1.5');
Value inserted.
Elapsed time: 1 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=1.5, timestamp=1339532783923000)
Returned 2 results.
Elapsed time: 2 msec(s).
[default@keyspace] set TestCF['TestKey']['value'] = utf8('0.2');
Value inserted.
Elapsed time: 0 msec(s).
[default@keyspace] get TestCF['TestKey'];
=> (column=timestamp, value=1339532764, timestamp=172800)
=> (column=value, value=0.2, timestamp=1339532783933000)
Returned 2 results.
Elapsed time: 2 msec(s).

Another thing that isn’t acting as I would expect is row deletion. In my testing, it seems that once a row has been deleted, subsequent attempts to write to that row just silently fail. I suspect it has to do with the fact that Cassandra’s distributed nature makes deletes a bit counter-intuitive, which is outlined in the Cassandra documentation. It would be nice to know for sure, though.

EDIT: I was doing it wrong. Sigh. Deletes are still weird to me though.