Dave Dewitt and Michael Stonebraker posted an article worth reading yesterday titled MapReduce: A Major Step Backwards (thanks to Kevin Merrit and Sriram Krishnan for sending this one my way). Their general argument is that MapReduce isn’t better than current generation RDBMS, which is certainly true in many dimensions, and that it isn’t a new invention, which is also true. I don’t agree with the conclusion that MapReduce is a major step backwards, but I fully agree with many of the points building towards that conclusion. Let’s look at the major points made by the article:
1. MapReduce is a step backwards in database access
In this section, the authors argue that schema is good, that separation of schema and application is good, and that high level language access is good. On the first two points, I agree: schema is good, and application/schema separation proved itself long ago. The thing to keep in mind is that MapReduce is only an execution framework. The data store is GFS or sometimes Bigtable in the case of Google, or HDFS or HBase in the case of Hadoop. So it’s not 100% correct to argue that MapReduce doesn’t support schema. That’s a store issue, and it is true that most stores that MapReduce is run over don’t implement these features today.
I argue that a separation of execution framework from store and indexing technology is a good thing in that MapReduce can be run over many stores. You can use MapReduce over either Bigtable (which happens to be implemented on GFS) or directly over GFS, depending upon the type of data you have at hand. I think that Dewitt and Stonebraker would both agree that breaking up monolithic database management systems into extensible components is a very good thing to do. In fact, much of the early work in extensible database management systems was done by David Dewitt. The point here is that Dewitt and Stonebraker would like to see schema enforcement as part of the store and, generally, I agree that this would be useful. However, MapReduce is not a store.
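To make the separation of execution framework and store concrete, here is a minimal single-process sketch in Python. It is not how Google's MapReduce or Hadoop is implemented. The runner `run_map_reduce`, the two toy stores, and the sample data are all hypothetical names invented for illustration. The point is only that the same runner and the same reduce function work over a schema-less flat file and over a keyed table of rows, with just the map function adapting to the store.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Toy execution framework: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical store 1: a flat file of comma-separated lines, no schema.
def flat_file_store():
    yield from ["alice,3", "bob,5", "alice,4"]


# Hypothetical store 2: a keyed table whose rows carry named columns.
def table_store():
    table = {
        1: {"user": "alice", "clicks": 3},
        2: {"user": "bob", "clicks": 5},
        3: {"user": "alice", "clicks": 4},
    }
    yield from table.values()


def map_line(line):
    # The job has to know the layout of the line itself.
    user, clicks = line.split(",")
    yield user, int(clicks)


def map_row(row):
    # The store supplies column names, so the job only picks fields.
    yield row["user"], row["clicks"]


def total(key, values):
    return sum(values)


from_file = run_map_reduce(flat_file_store(), map_line, total)
from_table = run_map_reduce(table_store(), map_row, total)
print(from_file)
print(from_table)
```

Both runs produce the same per-user totals. Whether schema is enforced is decided entirely by which store feeds the runner.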
They also argue that high level languages are good. I agree, and since any language can be used with MapReduce systems, this isn’t a problem: it is supported today.
2. MapReduce is a poor implementation
The argument here is that any reasonable structured store will support indexes. I agree that for many workloads you absolutely must have indexes. However, many data mining and analysis algorithms access all the data in a data set, and in those cases indexes don’t help. This is one of the reasons why many data mining algorithms run poorly over an RDBMS: if all they are going to do is repeatedly scan the same data, a flat file is faster. It depends upon the application access pattern and the amount of data that is accessed. A common execution approach for data mining algorithms is to export the data to a flat file and then operate on it there. An index helps when you are looking at a small subset of the data. There is a point N such that if you are looking at less than N% of the data, the index helps and should be used, but if you are looking at more than N%, you are better off table scanning. The point N is implementation dependent, but storage technology trends have been pushing this number down over the years. Basically, some algorithms look at all the data and some look at only a portion of it, and for any that look at more than N% of the data, the index won’t help.
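The crossover point N can be illustrated with a deliberately crude cost model, sketched below in Python. Everything in it is an assumption made for illustration: costs are in arbitrary units, a table scan reads every page sequentially, and an unclustered index lookup pays one random page read per selected row (the cost of traversing the index itself is ignored). The function names and the numbers are hypothetical and are not taken from any particular database.

```python
import math


def scan_cost(total_rows, rows_per_page, seq_page_cost):
    """Cost of reading every page of the table sequentially."""
    pages = math.ceil(total_rows / rows_per_page)
    return pages * seq_page_cost


def index_cost(total_rows, fraction, random_page_cost):
    """Cost of fetching the selected rows, one random page read per row (assumed)."""
    return total_rows * fraction * random_page_cost


def crossover_fraction(rows_per_page, seq_page_cost, random_page_cost):
    """Fraction of rows at which index access and table scan cost the same."""
    return seq_page_cost / (rows_per_page * random_page_cost)


def cheaper_plan(total_rows, fraction, rows_per_page, seq_page_cost, random_page_cost):
    """Return 'index' or 'scan', whichever is cheaper under this model."""
    scan = scan_cost(total_rows, rows_per_page, seq_page_cost)
    index = index_cost(total_rows, fraction, random_page_cost)
    return "index" if index < scan else "scan"


ROWS = 1_000_000       # hypothetical table size
ROWS_PER_PAGE = 100    # hypothetical rows per page
SEQ = 1.0              # cost of one sequential page read (arbitrary unit)

# Compare two assumed ratios of random to sequential page cost.
for random_cost in (4.0, 40.0):
    n = crossover_fraction(ROWS_PER_PAGE, SEQ, random_cost)
    print(f"random/sequential cost ratio {random_cost:>4}: N = {n:.4%}")
```

Under these assumptions, N shrinks as random reads become more expensive relative to sequential reads, which is one way to read the trend described above. A job that touches every row sits far above any such N and gains nothing from the index.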
There is no question that indexes are a good thing and there is no arguing that much of the world’s persistent storage access is done through indexes. Indexes are good. But they are not good for all workloads and all access patterns. Remember MapReduce is not a store, only an execution framework. Implementing indexing in a store used by MapReduce would be easy, and presumably someone will when its need is broadly noticed. In the interim, indexes can be built using MapReduce jobs and then used by subsequent MapReduce jobs. That is certainly more of a hassle than stores that automatically maintain indexes, but it is acceptable for some workloads.
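As a rough illustration of building an index with one MapReduce job and using it in a later one, here is a small Python sketch. The single-process runner `run_map_reduce`, the sample documents, and the job functions are hypothetical. A real deployment would write the index to the store between jobs rather than keep it in memory.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Toy single-process runner: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical document store: doc id -> text.
documents = {
    1: "mapreduce scales well",
    2: "indexes help point lookups",
    3: "mapreduce builds indexes",
}


# Job 1: build an inverted index (term -> sorted list of doc ids).
def map_terms(item):
    doc_id, text = item
    for term in set(text.split()):
        yield term, doc_id


def collect_postings(term, doc_ids):
    return sorted(doc_ids)


index = run_map_reduce(documents.items(), map_terms, collect_postings)


# Job 2: use the index to read only documents containing "indexes",
# then count the words in just those documents.
candidates = [(doc_id, documents[doc_id]) for doc_id in index.get("indexes", [])]


def map_word_count(item):
    _doc_id, text = item
    yield "words", len(text.split())


def total(key, values):
    return sum(values)


result = run_map_reduce(candidates, map_word_count, total)
print(index["indexes"], result)
```

The hassle is visible here: nothing keeps `index` up to date when `documents` changes, so the first job has to be rerun, whereas a store with automatic index maintenance would handle that itself.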
3. MapReduce is not novel
This is clearly true. These ideas have been fully and deeply investigated by the database community in the distant past. What is innovative is scale. I’ve seen MapReduce clusters of 3,000 nodes and I strongly suspect that clusters of 5,000+ servers can be found if you look in the right places. I’ve been around parallel database management systems for many years but have never seen multi-thousand node clusters of Oracle RAC or IBM DB2 Parallel Edition. The innovative part of MapReduce is that it REALLY scales and, for where MapReduce is used today, scale matters more than everything else. I’ll claim that 3,000 server query engines ARE novel but I agree that the constituent technologies have been around for some time.
4. MapReduce is missing features
All of the missing features (bulk loader, indexing, updates, transactions, referential integrity (RI), views) are features that could be implemented in a store used by MapReduce. As these features become important in domains over which MapReduce is used, they can be implemented in the underlying stores. I suspect that, as long as MapReduce is used for analysis and data mining workloads, the need for RI may never get strong enough to motivate someone to implement it. However, it clearly could be done, and the absence of RI in many stores is not a shortcoming of MapReduce.
5. MapReduce is incompatible with the DBMS tools
I 100% agree. Tools are useful, and today many of these tools target RDBMS. It’s not mentioned by the authors, but another useful characteristic of RDBMS is that developers understand them and many people know how to write SQL. It’s a data access and manipulation language that is broadly understood. The thing to keep in mind is that MapReduce is part of a componentized system. It’s just the execution framework. I could easily write a SQL compiler that emitted MapReduce jobs (SQL doesn’t dictate or fundamentally restrict the execution engine design). MapReduce can be run over simple stores, as it mostly is today, or over stores with near database level functionality if needed.
I’m arguing that the languages with which MapReduce jobs are expressed could be higher level and there have been research projects to do this (for example: http://research.microsoft.com/research/sv/dryad/). Even a SQL Compiler is possible over MapReduce. And I’m arguing that MapReduce could be run over very rich stores with indexes and integrity constraints should that become broadly interesting. MapReduce is just an execution engine that happens to scale extremely well. For example, in the MapReduce-like system used around Microsoft, there exist layers of languages above the execution engine that offer different levels of abstraction and control on the same engine.
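To show why a SQL front end over MapReduce is plausible, here is a minimal Python sketch that turns one query shape into a map function and a reduce function. It is far from a real compiler: there is no SQL parser, and the `GroupByQuery` class stands in for an already parsed `SELECT group_col, AGG(agg_col) FROM rows WHERE ... GROUP BY group_col`. All names, the single-process runner, and the sample rows are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


def run_map_reduce(records, map_fn, reduce_fn):
    """Toy single-process runner: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


@dataclass
class GroupByQuery:
    """Hypothetical parsed form of a single-table GROUP BY query."""
    group_col: str
    agg_col: str
    agg: str  # "SUM" or "COUNT"
    predicate: Optional[Callable[[dict], bool]] = None  # the WHERE clause


def compile_query(query):
    """Emit a (map_fn, reduce_fn) pair implementing the query."""
    if query.agg not in ("SUM", "COUNT"):
        raise ValueError(f"unsupported aggregate: {query.agg}")

    def map_fn(row):
        # WHERE becomes a filter in the map phase; GROUP BY picks the key.
        if query.predicate is None or query.predicate(row):
            yield row[query.group_col], row[query.agg_col]

    def reduce_fn(key, values):
        # The aggregate becomes the reduce phase.
        return sum(values) if query.agg == "SUM" else len(values)

    return map_fn, reduce_fn


rows = [
    {"region": "west", "sales": 10},
    {"region": "east", "sales": 5},
    {"region": "west", "sales": 7},
    {"region": "east", "sales": 0},
]

# SELECT region, SUM(sales) FROM rows WHERE sales > 0 GROUP BY region
query = GroupByQuery("region", "sales", "SUM", predicate=lambda r: r["sales"] > 0)
map_fn, reduce_fn = compile_query(query)
answer = run_map_reduce(rows, map_fn, reduce_fn)
print(answer)
```

The query says what to compute and the emitted functions say how, which is the sense in which SQL doesn’t restrict the execution engine design. Joins, ordering, and multi-stage plans would need more than one job and are left out of this sketch.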
An execution engine that runs on multi-thousand node clusters really is an important step forward. The separation of execution engine and storage engine into extensible parts isn’t innovative but it is a very flexible approach that current generation commercial RDBMS could profit from.
I love MapReduce because I love high scale data manipulation. What can be frustrating for database folks is that 1) most of the ideas of MapReduce have been around for years and 2) there have been decades of good research in the DB community on execution engine techniques and algorithms that haven’t yet been applied to the MapReduce engines. Many of these optimizations from the DB world will help make better MapReduce engines. But, for all these faults, MapReduce sure does scale, and it’s hard not to love being able to submit a job and see several thousand nodes churning over several petabytes of data. Priceless.
–jrh
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com