Friday, February 11, 2011

Ben Black always has interesting things on the go. He’s now down in San Francisco working on his startup Fastip which he describes as “an incredible platform for operating, exploring, and optimizing data networks.” A couple of days ago Deepak Singh sent me to a recent presentation of Ben’s I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

The problem described in this talk was “Collect, index, and query trillions of high dimensionality

records with seconds of latency for ingestion and response.” What Ben is doing is collecting per flow networking data with tcp/ip 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, he is tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven’t really a clue on the details of what traffic is clogging it and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample it with solutions like Cisco's NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast and  I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it’s an interesting problem in that it’s both hard and very useful to solve.

 

Ben presented 3 possible solutions and why they don’t work before offering a solution. The failed approaches that couldn’t cope with high dimensionality and the sheer volume of the dataset:

1.       HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2.       Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3.       Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.

 

The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as “A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically eliminate expensive processing at query time, even at a cost of enormous storage consumption.” OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.

 

Ben draws from two papers to implement an OLAP based solution at this scale:

·         High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

·         Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche

 

The end solution is:

·         Insert records into Cassandra.

·         Materialize lower-dimensional cuboids using bitsets and then join as needed.

·         Perform all query steps directly in the database


Ben’s concluding advice:

·         Read the literature

·         Generic software is 90% wrong at scale, you just don’t know which 90%.

·         Iterate to discover and be prepared to start over

 

If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, February 11, 2011 5:41:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
Software
Friday, February 11, 2011 11:40:54 AM (Pacific Standard Time, UTC-08:00)
Skytide has been doing something similar for a number of years. I know one of their CDN customers needs to process 50 TB of edge cache log data daily into a repository containing a years worth (20 PB) to support traffic segmentation analysis.

I thought Skytide's approach here was interesting - to transform the log format into XML, and then use XPATH to define the dimensions and metrics for analysis. Worth a look...
Anurag Gupta
Friday, February 11, 2011 5:29:05 PM (Pacific Standard Time, UTC-08:00)
That's clearly easy to do but transfers the problem to an xpath query engine that supports clustered queries. What are the Skytide folks using Anurag?

--jrh
Friday, February 11, 2011 6:37:09 PM (Pacific Standard Time, UTC-08:00)
Something's wrong with the High-Dimensional OLAP link. Here's a working link:
http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS14P1.PDF

Also, I got an error trying to post this.
Saturday, February 12, 2011 3:35:03 AM (Pacific Standard Time, UTC-08:00)
There are a couple of web analytic products that use the OLAP approach - Oracle (ex Monitored) RUEI is one.

They get a data feed from a span or tap port, extract the TCP and HTTP information they're interested in and stuff it into a set of cubes.

It's pretty powerful but at the same time limiting, as it's difficult to ask some what if questions because the raw data has already gone.

Currently pondering using the Pion collector from Atomic Labs and pumping the data into something that I can run map reduce over.

Andy
Saturday, February 12, 2011 6:19:49 AM (Pacific Standard Time, UTC-08:00)
Thanks for catching the URL glitch Michael. I've updated the main text.

--jrh
Saturday, February 12, 2011 6:25:21 AM (Pacific Standard Time, UTC-08:00)
Thanks Andy, I am familiar with the Oracle OLAP product. SQL Server includes one in the box as well.

You mentioned the problem of wanting to drill deeper than supported by the aggregations but not being able to do. I've not used them but I vaguely recall some products advertising drill through where they can drill through a cube back to the source database.

--jrh
Monday, February 14, 2011 10:26:15 AM (Pacific Standard Time, UTC-08:00)
RUEI doesn't use Oracle's OLAP products - it was built by Dutch(?) company called Moniforce before Oracle acquired them.

RUEI only retains the source data for a limited period of time (depending on how much space is a available), it then aggregates and consolidates the data depending on how old it is, what dimensions are available etc.

We set the satisfied page load-time to be 3 secs and generated an Apdex from this, but also wanted to look at how the number of satisfied / tolerating / frustrated requested would change if we reduced the allowable load-time. Unfortunately the source data isn't maintained at a low enough level to be able to do this.

Andy
Monday, February 14, 2011 10:38:03 AM (Pacific Standard Time, UTC-08:00)
Makes sense Andy.

I'm a data guy so I generally favor a "keep it all" strategy. Low-end, low-performance disk is pretty cheap these days. Also you could use cloud storage where prices range from 6 cents to 14 cents a GB/month on S3. Neither approach will support direct drill through but you could run an MapReduce job to ask deeper questions when the aggregations have hidden detail. I generally hate through data out.
Tuesday, February 15, 2011 6:53:52 PM (Pacific Standard Time, UTC-08:00)
My memory of Skytide is fuzzy, so the below may be inaccurate.

The use of XPATH expressions is primarily to provide flexibility in defining the notions of dimensions and metrics across source semi-structured log data. Based on the particular target dimensions and metrics required for the various reports being generated, they push the values into (what sounded to me like) a column store which implements one or more logical OLAP cubes. The column store was custom, as I recall. The metrics at the leaves of the dimensions are accumulated (for example, if I only analyze down to day grain, all my source data will be accumulated to day grain in the store). All queries happen against the column store (not through clustered XPATH queries). As you mention, one can always run some sort of drill-through to data below the leaf levels of the cube if necessary - at this point, hopefully the data set is more manageable - I don't know if they supported this.

They claimed getting about 10x compression from the column store (going from unstructured logs to the structured cube, removing unanalyzed elements and compressing out repeated dimension values) and another 10x from the accumulation of metrics at dimension leaves (eg day grain, source mac, etc). The latter may seem surprising since the data is so dimensionally sparse, but seems more plausible when you consider that it does cluster a fair bit (lots of mac addresses, but I'll get a lot of packets from any given one for a given period).

At my limited understanding of both, it sounds like they circled around to a similar approach to Fastip. I was interested in their XPATH work because I think, in this sort of domain, once you get the data into a column store (with leaf accumulation), you're kind of good to go. The harder problem to me was dealing with a multitude of source formats and still providing the kind of parallelization of loads you would only get from the core product, rather than site customization. XPATH for me was a nice declarative mechanism to provide a parallelizable mapping from site-specific source log format to site-specific target reports.
Anurag Gupta
Wednesday, February 16, 2011 7:32:26 AM (Pacific Standard Time, UTC-08:00)
Thanks for the additional detail Anurag. At the core store level, I agree, its sounds like the skytide approach is closely related to that proposed by Ben (based upon an OLAP engine ) and, in addition, they support Xpath.

--jrh
Wednesday, February 16, 2011 12:38:44 PM (Pacific Standard Time, UTC-08:00)
As one of the founders of Skytide and their former CTO I can con firm that Anurag's description is very close. Skytide is a proprietary columnar store using MDX as its query language and XPath as a declarative language to describe dimensional and other properties of the cube. Data aggregation model closely resembles ROLAP with one important distinction, the leaves contain pre aggregated data only and the details are thrown away once they get aggregated into leaves. The approach works very well for netflow data as well as all kinds of weblogs where people rarely want to analyze individual flows but instead are interested in the IP or network level analysis. The product also features a continuous data load and incremental update process that allows for ingestion of very "big data" (tens of terabytes a day). Compression is actually much better than 10X simply because the size of the cube is bound by its dimensions' cardinality while the incoming data volume is virtually unlimited. Skytide's patent provides a few more details if anyone is interested http://www.faqs.org/patents/app/20100076977.

Joseph Rozenfeld
Joseph Rozenfeld
Wednesday, February 16, 2011 6:03:30 PM (Pacific Standard Time, UTC-08:00)
Thanks for the background on Skytide Joseph. Useful and much appreciated.

--jrh
Comments are closed.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<February 2011>
SunMonTueWedThuFriSat
303112345
6789101112
13141516171819
20212223242526
272812345
6789101112

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton