Challenges and Trade-offs in Building a Web-scale Real-time Analytics System

Ben Black always has interesting things on the go. He’s now down in San Francisco working on his startup Fastip which he describes as “an incredible platform for operating, exploring, and optimizing data networks.” A couple of days ago Deepak Singh sent me to a recent presentation of Ben’s I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

The problem described in this talk was “Collect, index, and query trillions of high dimensionality

records with seconds of latency for ingestion and response.” What Ben is doing is collecting per flow networking data with tcp/ip 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, he is tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven’t really a clue on the details of what traffic is clogging it and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample it with solutions like Cisco’s NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast and I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it’s an interesting problem in that it’s both hard and very useful to solve.

Ben presented 3 possible solutions and why they don’t work before offering a solution. The failed approaches that couldn’t cope with high dimensionality and the sheer volume of the dataset:

1. HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2. Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3. Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.

The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as “A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically eliminate expensive processing at query time, even at a cost of enormous storage consumption.” OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.

Ben draws from two papers to implement an OLAP based solution at this scale:

· High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

· Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche

The end solution is:

· Insert records into Cassandra.

· Materialize lower-dimensional cuboids using bitsets and then join as needed.

· Perform all query steps directly in the database

Ben’s concluding advice:

· Read the literature

· Generic software is 90% wrong at scale, you just don’t know which 90%.

· Iterate to discover and be prepared to start over

If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.


James Hamilton



b: /

12 comments on “Challenges and Trade-offs in Building a Web-scale Real-time Analytics System
  1. Thanks for the background on Skytide Joseph. Useful and much appreciated.


  2. Joseph Rozenfeld says:

    As one of the founders of Skytide and their former CTO I can con firm that Anurag’s description is very close. Skytide is a proprietary columnar store using MDX as its query language and XPath as a declarative language to describe dimensional and other properties of the cube. Data aggregation model closely resembles ROLAP with one important distinction, the leaves contain pre aggregated data only and the details are thrown away once they get aggregated into leaves. The approach works very well for netflow data as well as all kinds of weblogs where people rarely want to analyze individual flows but instead are interested in the IP or network level analysis. The product also features a continuous data load and incremental update process that allows for ingestion of very "big data" (tens of terabytes a day). Compression is actually much better than 10X simply because the size of the cube is bound by its dimensions’ cardinality while the incoming data volume is virtually unlimited. Skytide’s patent provides a few more details if anyone is interested

    Joseph Rozenfeld

  3. Thanks for the additional detail Anurag. At the core store level, I agree, its sounds like the skytide approach is closely related to that proposed by Ben (based upon an OLAP engine ) and, in addition, they support Xpath.


  4. Anurag Gupta says:

    My memory of Skytide is fuzzy, so the below may be inaccurate.

    The use of XPATH expressions is primarily to provide flexibility in defining the notions of dimensions and metrics across source semi-structured log data. Based on the particular target dimensions and metrics required for the various reports being generated, they push the values into (what sounded to me like) a column store which implements one or more logical OLAP cubes. The column store was custom, as I recall. The metrics at the leaves of the dimensions are accumulated (for example, if I only analyze down to day grain, all my source data will be accumulated to day grain in the store). All queries happen against the column store (not through clustered XPATH queries). As you mention, one can always run some sort of drill-through to data below the leaf levels of the cube if necessary – at this point, hopefully the data set is more manageable – I don’t know if they supported this.

    They claimed getting about 10x compression from the column store (going from unstructured logs to the structured cube, removing unanalyzed elements and compressing out repeated dimension values) and another 10x from the accumulation of metrics at dimension leaves (eg day grain, source mac, etc). The latter may seem surprising since the data is so dimensionally sparse, but seems more plausible when you consider that it does cluster a fair bit (lots of mac addresses, but I’ll get a lot of packets from any given one for a given period).

    At my limited understanding of both, it sounds like they circled around to a similar approach to Fastip. I was interested in their XPATH work because I think, in this sort of domain, once you get the data into a column store (with leaf accumulation), you’re kind of good to go. The harder problem to me was dealing with a multitude of source formats and still providing the kind of parallelization of loads you would only get from the core product, rather than site customization. XPATH for me was a nice declarative mechanism to provide a parallelizable mapping from site-specific source log format to site-specific target reports.

  5. Makes sense Andy.

    I’m a data guy so I generally favor a "keep it all" strategy. Low-end, low-performance disk is pretty cheap these days. Also you could use cloud storage where prices range from 6 cents to 14 cents a GB/month on S3. Neither approach will support direct drill through but you could run an MapReduce job to ask deeper questions when the aggregations have hidden detail. I generally hate through data out.

  6. Andy Davies says:

    RUEI doesn’t use Oracle’s OLAP products – it was built by Dutch(?) company called Moniforce before Oracle acquired them.

    RUEI only retains the source data for a limited period of time (depending on how much space is a available), it then aggregates and consolidates the data depending on how old it is, what dimensions are available etc.

    We set the satisfied page load-time to be 3 secs and generated an Apdex from this, but also wanted to look at how the number of satisfied / tolerating / frustrated requested would change if we reduced the allowable load-time. Unfortunately the source data isn’t maintained at a low enough level to be able to do this.


  7. Thanks Andy, I am familiar with the Oracle OLAP product. SQL Server includes one in the box as well.

    You mentioned the problem of wanting to drill deeper than supported by the aggregations but not being able to do. I’ve not used them but I vaguely recall some products advertising drill through where they can drill through a cube back to the source database.


  8. Thanks for catching the URL glitch Michael. I’ve updated the main text.


  9. Andy Davies says:

    There are a couple of web analytic products that use the OLAP approach – Oracle (ex Monitored) RUEI is one.

    They get a data feed from a span or tap port, extract the TCP and HTTP information they’re interested in and stuff it into a set of cubes.

    It’s pretty powerful but at the same time limiting, as it’s difficult to ask some what if questions because the raw data has already gone.

    Currently pondering using the Pion collector from Atomic Labs and pumping the data into something that I can run map reduce over.


  10. Something’s wrong with the High-Dimensional OLAP link. Here’s a working link:

    Also, I got an error trying to post this.

  11. That’s clearly easy to do but transfers the problem to an xpath query engine that supports clustered queries. What are the Skytide folks using Anurag?


  12. Anurag Gupta says:

    Skytide has been doing something similar for a number of years. I know one of their CDN customers needs to process 50 TB of edge cache log data daily into a repository containing a years worth (20 PB) to support traffic segmentation analysis.

    I thought Skytide’s approach here was interesting – to transform the log format into XML, and then use XPATH to define the dimensions and metrics for analysis. Worth a look…

Leave a Reply

Your email address will not be published. Required fields are marked *