Ben Black always has interesting things on the go. He’s now down in San Francisco working on his startup Fastip which he describes as “an incredible platform for operating, exploring, and optimizing data networks.” A couple of days ago Deepak Singh sent me to a recent presentation of Ben’s I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.
The problem described in this talk was “Collect, index, and query trillions of high dimensionality
records with seconds of latency for ingestion and response.” What Ben is doing is collecting per flow networking data with tcp/ip 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, he is tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven’t really a clue on the details of what traffic is clogging it and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample it with solutions like Cisco’s NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast and I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it’s an interesting problem in that it’s both hard and very useful to solve.
Ben presented 3 possible solutions and why they don’t work before offering a solution. The failed approaches that couldn’t cope with high dimensionality and the sheer volume of the dataset:
1. HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort
2. Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.
3. Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.
The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as “A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically eliminate expensive processing at query time, even at a cost of enormous storage consumption.” OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.
Ben draws from two papers to implement an OLAP based solution at this scale:
· High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez
· Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche
The end solution is:
· Insert records into Cassandra.
· Materialize lower-dimensional cuboids using bitsets and then join as needed.
· Perform all query steps directly in the database
Ben’s concluding advice:
· Read the literature
· Generic software is 90% wrong at scale, you just don’t know which 90%.
· Iterate to discover and be prepared to start over
If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.