MapReduce has created some excitement in the relational database community. Dave Dewitt and Michael Stonebraker’s MapReduce: A Major Step Backwards is perhaps the best example. In that posting they argued that map reduce is a poor structured storage technology, the execution engine doesn’t include many of the advances found in modern, parallel RDBMS execution engines, it’s not novel, and its missing features.
In Mapreduce: A Minor Step Forward I argued that MapReduce is an execution model rather than storage engine. It is true that it is typically run over a file system like GFS or HDFS or simple structured storage system like BigTable or Hbase. But, it could be run over a full relational database.
Why would we want to run Hadoop over a full relational database? Hadoop scales: Hadoop has been scaled to 4,000 nodes at Yahoo! Scaling Hadoop to 4000 nodes at Yahoo!. Scaling a clustered RDBMS too 4k nodes is certainly possible but the high scale single system image cluster I’ve seen was 512 nodes (what was then called DB2 Parallel Edition). Getting to 4k is big. Hadoop is simple: automatic parallelism has been an industry goal for decades but progress has been limited. There really hasn’t been success in allowing programmers of average skill to write massively parallel programs except for SQL and Hadoop. Programmers of bounded skill can easily write SQL that will be run in parallel over high scale clusters. Hadoop is the only other example I know where this is possible and happening regularily.
Hadoop makes the application of 100s or even 1000s of nodes of commodity computers easy so why not Hadoop over full RDBMS nodes? Daniel Abadi and team from Yale and Brown have done exactly that. In this case, Hadoop over PostgresSQL. From Daniel’s blog:
1. A hybrid of DBMS and MapReduce technologies targeting analytical query workloads
2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
3. An attempt to fill the gap in the market for a free and open source parallel DBMS
4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).
5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads
See: http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html for more detail and http://sourceforge.net/projects/hadoopdb/ for source code for HadoopDB.
A more detailed paper has been accepted for publication at VLDB: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf.
The development work for HadoopDB was done using AWS Elastic Compute Cluster. Nice work Daniel.
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | email@example.com