In this month’s Communications of the Association for Computing Machinery (CACM), a rematch of the MapReduce debate was staged. In the original debate, David DeWitt and Michael Stonebraker, both giants of the database community, complained that:
1. MapReduce is a step backwards in database access
2. MapReduce is a poor implementation
3. MapReduce is not novel
4. MapReduce is missing features
5. MapReduce is incompatible with the DBMS tools
Unfortunately, the original article appears to be no longer available, but you will find the debate branching out from that original article by searching on the title MapReduce: A Major Step Backwards. The debate was huge, occasionally entertaining, but not always factual. My contribution was MapReduce a Minor Step Forward.
Update: In comments, csliu offered updated URLs for the original blog post and a follow-on article:
I like MapReduce for a variety of reasons, the most significant of which is that it allows non-systems programmers to write very high-scale, parallel programs with comparative ease. There have been many attempts to allow mortals to write parallel programs, but only two widely adopted solutions allow modestly skilled programmers to write highly concurrent programs: SQL and MapReduce. Ironically, the two communities participating in the debate, Systems and Database, have each produced a great success by this measure.
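To see why the model is approachable for non-systems programmers, consider the canonical word-count example from the original MapReduce paper. The sketch below is a minimal single-machine illustration in Python (the function names and the toy driver are my own, hypothetical scaffolding): the programmer writes only the two small functions at the top, and the framework, simulated here by the driver, handles grouping and could run every map and reduce call in parallel across thousands of machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # User-written "map": emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # User-written "reduce": sum the counts for a single key.
    return (word, sum(counts))

def mapreduce(documents):
    # Framework's job (here simulated serially): run maps, then
    # shuffle -- group all intermediate pairs by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    # Each key's reduce is independent, so reduces parallelize trivially.
    return dict(reduce_phase(k, v) for k, v in groups.items())

docs = {"d1": "a rose is a rose", "d2": "a daisy"}
print(mapreduce(docs))  # {'a': 3, 'rose': 2, 'is': 1, 'daisy': 1}
```

Note that neither user function mentions threads, locks, partitioning, or failure handling; that separation of concerns is what lets mortals write thousand-server jobs.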
More than 15 years ago, back when I worked on IBM DB2, we had DB2 Parallel Edition running well over a 512-server cluster. Even back then you could write a SQL statement that would run over half a thousand servers. Similarly, programmers without special skills can run MapReduce programs that run over thousands of servers. The last I checked, Yahoo was running MapReduce jobs over a 4,000-node cluster: Scaling Hadoop to 4,000 nodes at Yahoo!.
The update on the MapReduce debate is worth reading but, unfortunately, the ACM has marked the first article as “premium content” so you can only read it if you are a CACM subscriber:
Update: Moshe Vardi, Editor-in-Chief of the Communications of the ACM, has kindly decided to make both of the above articles freely available to all, whether or not they are CACM members. Thank you Moshe.
Even more important to me than the MapReduce debate is seeing this sort of content made widely available. I hate seeing it classified as premium content restricted to members only. You really all should be members but, with the plunging cost of web publishing, why can’t the above content be made freely available? But, while complaining about the ACM publishing policies, I should hasten to point out that the CACM has returned to greatness. When I started in this industry, the CACM was an important read each month. Well, good news, the long boring hiatus is over. It’s now important reading again and has been for the last couple of years. I just wish the CACM would follow the lead of ACM Queue and make the content more broadly available outside of the membership community.
Returning to the MapReduce discussion: in the second CACM article above, MapReduce: A Flexible Data Processing Tool, Jeff Dean and Sanjay Ghemawat do a thoughtful job of working through some of the recent criticism of MapReduce.
If you are interested in MapReduce, I recommend reading the original Operating Systems Design and Implementation (OSDI) MapReduce paper, MapReduce: Simplified Data Processing on Large Clusters, and the detailed MapReduce-versus-database comparison paper, A Comparison of Approaches to Large-Scale Data Analysis.