MapReduce in CACM

In this month’s Communications of the Association for Computing Machinery, a rematch of the MapReduce debate was staged. In the original debate, Dave DeWitt and Michael Stonebraker, both giants of the database community, complained that:

1. MapReduce is a step backwards in database access

2. MapReduce is a poor implementation

3. MapReduce is not novel

4. MapReduce is missing features

5. MapReduce is incompatible with the DBMS tools

Unfortunately, the original article appears to be no longer available, but you will find the debate branching out from that original article by searching on the title MapReduce: A Major Step Backwards. The debate was huge, occasionally entertaining, but not always factual. My contribution was MapReduce a Minor Step forward.

Update: In comments, csliu offered updated URLs for the original blog post and a follow-on article:

· MapReduce: A Major Step Backwards

· MapReduce II

I like MapReduce for a variety of reasons, the most significant of which is that it allows non-systems programmers to write very high-scale, parallel programs with comparative ease. There have been many attempts to allow mortals to write parallel programs, but there really have only been two widely adopted solutions that allow modestly skilled programmers to write highly concurrent executions: SQL and MapReduce. Ironically, the two communities participating in the debate, Systems and Database, have each produced a great success by this measure.

More than 15 years ago, back when I worked on IBM DB2, we had DB2 Parallel Edition running well across a 512-server cluster. Even back then you could write a SQL statement that would run over half a thousand servers. Similarly, programmers without special skills can run MapReduce programs that run over thousands of servers. Last I checked, Yahoo was running MapReduce jobs over a 4,000-node cluster: Scaling Hadoop to 4,000 nodes at Yahoo!.
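To make the "comparative ease" point concrete, here is a minimal word-count sketch of the MapReduce programming model in Python. It is an illustration only, not Google's or Hadoop's implementation: the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical, and a local thread pool stands in for the cluster. The idea it tries to show is that the programmer writes only the map and reduce functions, while the grouping and parallel scheduling belong to the framework.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the values collected for one word."""
    return word, sum(counts)


def word_count(documents, workers=4):
    # Assumption: threads stand in for the worker machines of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, documents))
        groups = shuffle(mapped)
        reduced = pool.map(lambda item: reduce_counts(*item), groups.items())
        return dict(reduced)


documents = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(documents)
print(counts)
```

Nothing in map_words or reduce_counts mentions threads, machines, or failure handling, which is roughly why the model is approachable for programmers who are not systems specialists.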

The update on the MapReduce debate is worth reading but, unfortunately, the ACM has marked the first article as “premium content” so you can only read it if you are a CACM subscriber:

· MapReduce and Parallel DBMSs: Friend or Foe

· MapReduce: A Flexible Data Processing Tool

Update: Moshe Vardi, Editor in Chief of the Communications of the Association for Computing Machinery, has kindly decided to make both of the above articles freely available to all, whether or not they are CACM members. Thank you Moshe.

Even more important to me than the MapReduce debate is seeing this sort of content made widely available. I hate seeing it classified as premium content restricted to members only. You really all should be members but, with the plunging cost of web publishing, why can’t the above content be made freely available? But, while complaining about the ACM publishing policies, I should hasten to point out that the CACM has returned to greatness. When I started in this industry, the CACM was an important read each month. Well, good news, the long boring hiatus is over. It’s now important reading again and has been for the last couple of years. I just wish the CACM would follow the lead of ACM Queue and make the content more broadly available outside of the membership community.

Returning to the MapReduce discussion, in the second CACM article above, MapReduce: A Flexible Data Processing Tool, Jeff Dean and Sanjay Ghemawat do a thoughtful job of working through some of the recent criticism of MapReduce.

If you are interested in MapReduce, I recommend reading the original Operating Systems Design and Implementation MapReduce paper, MapReduce: Simplified Data Processing on Large Clusters, and the detailed MapReduce vs. database comparison paper, A Comparison of Approaches to Large-Scale Data Analysis.

–jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com


8 comments on “MapReduce in CACM”

James Hamilton says:

January 6, 2010 at 3:36 pm

Thanks for your comment Scott and hats off to you and the rest of the CACM team for consistently delivering a great magazine.

I fully understand the tension between "free" and having paying members that allow for the good quality content to be published. Many argue that what I would like to see happen isn’t practical. I think it is practical but freely admit that it is difficult. What I have in mind is dedicating the ACM to free and wide dissemination of information. Making that the primary focus of the organization gives up one source of revenue and I fully understand that a high quality organization, even one run extremely frugally, will still have bills that need to be paid.

To stay solvent, I would take a three-pronged approach: 1) deliver all content electronically, which is remarkably cheap these days (I pay for this site personally); 2) where possible and practical, depend upon volunteers that believe in the goals of the organization and are excited about helping to fulfill them (we already do a lot of that); and 3) recognize that the community is incredibly valuable and sell advertising in and around the content delivered. If folks find that objectionable, make one of the advantages of membership be that all content is advertising free to members.

Congratulations on achieving such broad distribution of the digital library. That’s impressive and I know it’s difficult. And I know what I’m advocating here is even more difficult: make the content freely available while at the same time maintaining a strong, very high quality organization with high editorial standards on all delivered content. I think it can be done and I think it is worth doing. I hope we continue to experiment with ways to maintain the strength of the ACM and the quality of the CACM. Thanks again.

James Hamilton
jrh@mvdirona.com

James Hamilton says:

January 6, 2010 at 3:13 pm

Thanks for making those two articles available as non-premium content Moshe. Having them widely read is good for our community and I hope it exposes more people to both the Association for Computing Machinery (http://www.acm.org/) and the CACM (http://cacm.acm.org/).

For those that may not know Moshe, he is Editor in Chief of the Communications of the Association for Computing Machinery and partly responsible for the continuing stream of high quality content in CACM. Thanks Moshe,

James Hamilton
jrh@mvdirona.com

Scott Delman says:

January 5, 2010 at 9:57 pm

Thank you for posting your comments about the MapReduce debate in the Communications of the ACM. As the Publisher of the magazine, I was also particularly interested in reading your comments about making the content as widely available as possible. From a purely theoretical perspective, I think most would agree with you that broad distribution of high quality content is a social good that authors and publishers alike should strive for. ACM is no exception here, but as a membership organization that exists to service the computing community we also need to balance the "ideal" of complete open access with the realities of publishing high quality content in print and online formats. One of these realities is that there are significant costs involved in publishing content in the form of journals, magazines, proceedings, newsletters, and web sites. Finding the right balance of completely opening up articles on the web that have mass appeal versus making certain content available to paying subscribers is a balance many publishers are struggling with these days, and for good reason. Not all information sources on the internet ensure quality control and so it is also important to protect those sources or publication "brands" that have a tradition of quality control for the community’s benefit. If all content was instantly opened up and made freely available to all, many of these sources of trusted information would disappear over time.

With this said, I do not believe that premium content and broad dissemination are necessarily mutually exclusive concepts. As a matter of fact, up until today (when we opened the MapReduce articles at the request of our EIC Moshe Vardi) all articles published in the Communications of the ACM have been available for unlimited access through ACM’s Digital Library (http://acm.org/dl), which is currently accessible by over 1,500,000 computing students and professionals at over 2,700 academic, government, and corporate institutions in over 170 countries around the world.

Moshe Vardi says:

January 5, 2010 at 7:34 pm

Both articles are open.

Moshe

James Hamilton says:

January 2, 2010 at 10:31 pm

Thanks for the comment Andrey.

Pregel, created at Google, is a language for expressing computation over graphs. For those interested in learning more about Pregel (and the origins of the name), see Large Scale Graph Computing at Google (http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html).

James Hamilton
jrh@mvdirona.com

James Hamilton says:

January 2, 2010 at 10:20 pm

Thanks for the URLs to the original articles. I updated the post to point to them.

James Hamilton
jrh@mvdirona.com

Andrey Kuzmin says:

January 2, 2010 at 9:30 pm

Nice read, thanks. It would be interesting to follow MapReduce’s evolution into more elaborate computation models like large-scale graph processing (there was an announcement-type Google paper on that recently; Pregel :) is understandably the system’s name).

csliu says:

January 2, 2010 at 5:22 pm

Thanks for the information about the new issue of CACM.

The two original posts are still available; only the URLs have changed a little. In fact, the new addresses are in the reference section of Jeffrey Dean’s "MapReduce: A Flexible Data Processing Tool". :-)

5. DeWitt, D. and Stonebraker, M. MapReduce: A Major Step Backwards blogpost; http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

6. DeWitt, D. and Stonebraker, M. MapReduce II blogpost; http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
