Wednesday Yahoo announced they have a built a petascale, distributed relational database. In Yahoo Claims Record With Petabyte Database, the details are thin but they built on the PostgreSQL relational database system. In Size matters: Yahoo claims 2-petabyte database is world’s biggest, busiest, the system is described as an over 2 petabyte repository of user click stream and context data with an update rate for 24 billion events per day. Waqar Hasan, VP of Engineering at Yahoo! Data group, describes the system as updated in real time and live – essentially a real time data warehouse where changes go in as they are made and queries always run against the most current data. I strongly suspect they are bulk parsing logs and the data is being pushed into the system in large bulk units but, even near real time at this update rate, is impressive.

The original work was done at a Seattle startup called Mahat Technologies acquired by Yahoo! in November 2005.

The approach appears to be similar to what we did with IBM DB2 Parallel Edition. 13 years ago we had it running on a cluster of 512 RS/6000s at the Maui Super Computer Center and 256 nodes at the Cornel Theory Center. It’s a shared nothing design which means that each server in the cluster have independent disk and don’t share memory. The upside of this approach is it scales incredibly well. It looks like Yahoo! has done something similar using PostgreSQL as the base technology. Each node in the cluster runs a full copy of the storage engine. The query execution engine is replaced with one modified to run over a cluster and use a communications fabric to interconnect the nodes in the cluster. The parallel query plans are run over the entire cluster with the plan nodes interconnected by the communication fabric. The PostgreSQL client, communications protocol and server side components with some big exceptions run mostly unchanged. The query optimizer is either replaced completely with a cluster parallel aware implementation that models the data layout and cluster topology in making optimization decisions. Or the original, non-cluster parallel optimizer is used and the resultant single node plans are then optimized for the cluster in a post optimization phase. The former will yield provably better plans but it’s also more complex. I’m fearful of complexity around optimizers and, as a consequence, I actually prefer the slightly less optimal, post-optimization phase. Many other problems have to be addressed including having the cluster metadata available on each node to support SQL query compilation but what I’ve sketched here covers the major points required to get such a design running.

The result is a modified version of PostgreSQL runs on each node. A client can connect to any of the nodes in the cluster (or a policy restricted subset). A query flows from the client to the server it chose to connect with. The SQL compiler on that node compiles and optimizes the query on that single node (no parallelism). The query optimizer is either cluster-aware or uses a post-optimization cluster-aware component. The resultant query plan when ready for execution is divided up into sub-plans (plan fragments) that run on each node connected over the communication fabric. Some execution engines initiate top-down and some bottom up. I don’t recall what PostgreSQL uses but bottom-up is easier in this case. However, either can be made to work. The plan fragments are distributed to the appropriate nodes in the cluster. Each runs on local data and pipes results to other nodes which run plan fragments and forward the results yet again toward the root of the plan. The root of the plan runs on the node that started the compilation and the final results end up there to be returned to the client.

It’s a nice approach and as evidenced by Yahoo’s experience it scales, scales, scales. I also like the approach in that most tools and applications can continue to work with little change. Most clusters of this design have some restrictions such unique ID generation is either not supported or slow as is referential integrity. Nonetheless, a large class of software can be run without change.

If you are interested in digging deeper into Relational Database technology and how the major commercial systems are written, see Architecture of a Database System.

Yahoo has a long history of contributing to Open Source and they are the largest contributor to the Apache Hadoop project. It’ll be interesting to see if Yahoo! Data ends up open source or held as an internal only asset.

Kevin Merritt pointed me to the Yahoo! Data work.


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | | | blog: