MapReduce has created some excitement in the relational database community. Dave Dewitt and Michael Stonebraker’s MapReduce: A Major Step Backwards is perhaps the best example. In that posting they argued that map reduce is a poor structured storage technology, the execution engine doesn’t include many of the advances found in modern, parallel RDBMS execution engines, it’s not novel, and its missing features.
In Mapreduce: A Minor Step Forward I argued that MapReduce is an execution model rather than storage engine. It is true that it is typically run over a file system like GFS or HDFS or simple structured storage system like BigTable or Hbase. But, it could be run over a full relational database.
Why would we want to run Hadoop over a full relational database? Hadoop scales: Hadoop has been scaled to 4,000 nodes at Yahoo! Scaling Hadoop to 4000 nodes at Yahoo!. Scaling a clustered RDBMS too 4k nodes is certainly possible but the high scale single system image cluster I’ve seen was 512 nodes (what was then called DB2 Parallel Edition). Getting to 4k is big. Hadoop is simple: automatic parallelism has been an industry goal for decades but progress has been limited. There really hasn’t been success in allowing programmers of average skill to write massively parallel programs except for SQL and Hadoop. Programmers of bounded skill can easily write SQL that will be run in parallel over high scale clusters. Hadoop is the only other example I know where this is possible and happening regularily.
Hadoop makes the application of 100s or even 1000s of nodes of commodity computers easy so why not Hadoop over full RDBMS nodes? Daniel Abadi and team from Yale and Brown have done exactly that. In this case, Hadoop over PostgresSQL. From Daniel’s blog:
HadoopDB is:
1. A hybrid of DBMS and MapReduce technologies targeting analytical query workloads
2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
3. An attempt to fill the gap in the market for a free and open source parallel DBMS
4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).
5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads
See: http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html for more detail and http://sourceforge.net/projects/hadoopdb/ for source code for HadoopDB.
A more detailed paper has been accepted for publication at VLDB: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf.
The development work for HadoopDB was done using AWS Elastic Compute Cluster. Nice work Daniel.
–jrh
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | james@amazon.com
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
My question was a little bit more narrowly focused, I think — I was trying to understand the advantages of a HadoopDB setup (MapReduce over RDBMS) with a Hadoop+HBase setup (MapReduce over BigTable).
Perhaps I am completely underestimating the value of query planning at the individual node level here but it seemed like the strictures imposed by the parallelization would end up neutralizing a lot of the potential benefits of SQL/RDMS here.
If the data in the sharded HadoopDB’s RDBMS instances is heavily denormalized with tables split by a single primary key, the data model looks pretty similar to BigTable. Also, it seems to me that usefulness of the query flexibility that the sharded RDBMS instances provide is diminished. For example, it seems like only very limited joins can be provided (joining on data with the same primary key?).
Thoughts?
I’m a huge believer in simple structured stores like SimpleDB (http://aws.amazon.com/simpledb/) but I’ve also worked on RDBMS for a decade and a half. Relational databases have profited from 35 years of evolution and they are very efficient at complex joins, aggregations,and queries in general. They are extremely good at optimizing easy to express SQL Queries into parallel plans and efficiently executing them.
For simple problems, hyper-scalable, simple structured stores are the right answer. For more complex problems, I would still use a relational database. For OLTP workloads, a RDBMS is a better choice than a BigTable-like solution. There’s a place for both and that’s why companies like Google, IBM and Amazon have development teams on both. Both are useful.
James Hamilton
jrh@mvdirona.com
It’s not clear to me what advantage having separate relational databases provide over having a BigTable-like system such as HBase, and BigTable/HBase/et al. will certain win out big time in terms of ease of administration.
Can anyone provide some insights here?
@Nicolas – HadoopDB is an extension of Hive, check out the VLDB paper for details
There is also Hive from the Facebook team. See http://www.facebook.com/note.php?note_id=16121578919
What do you think of that?