Hadoop Summit Summary

Yahoo! hosted the Hadoop Summit Tuesday of this week. I posted my rough notes on the conference over the course of the day; this posting summarizes some of what caught my interest and consolidates my notes.

Yahoo expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend. For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start: roughly half the crowd were running Hadoop in production, and around one fifth have clusters of over 100 nodes. Yahoo remains the biggest with 2,000 nodes in their cluster.

Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all Yahoo! crawled pages and all the metadata they extract or compute on those pages. There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce job runs for over three days and has 100k maps and 10k reduces. It reads 300 TB and produces 200 TB.

Another informative talk was given by the Facebook team. Joydeep Sen Sarma and Ashish Thusoo described Hive, the data warehouse at Facebook. I liked this talk as it was 100% customer driven. They implemented what the analysts and programmers inside Facebook needed, and I found their observations credible and interesting. They reported that analysts are used to SQL and found a SQL-like language most productive, but that programmers like to have direct access to map/reduce primitives. As a consequence, they provide both (so do we). The Facebook team reports that roughly 25% of the development team uses Hive and that they process 3,500 map/reduce jobs a week.

Google is heavily invested in Hadoop, using it as a teaching vehicle even though it’s not used internally. The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop. For example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.
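
For readers who have not seen the map/reduce programming model, here is a minimal single-process sketch in Python of the classic word count. This is not Hadoop's API (Hadoop jobs are normally written in Java and run across a cluster); the function names `map_words`, `reduce_counts`, and `run_map_reduce` are hypothetical and exist only to illustrate the map, shuffle, and reduce phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for a single word."""
    return word, sum(counts)


def run_map_reduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver (an assumption for illustration, not a Hadoop component)."""
    # Shuffle phase: group every mapped value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: collapse each key's values into one result.
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["hadoop runs map reduce", "map reduce at scale"]
    print(run_map_reduce(sample, map_words, reduce_counts))
```

In a real cluster the map and reduce calls run in parallel on many nodes and the framework handles the grouping step; the programmer mostly supplies the two small functions.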

The agenda for the day:

Time | Topic | Speaker(s)
8:00-8:55 | Breakfast/Registration |
8:55-9:00 | Welcome & Logistics | Ajay Anand, Yahoo!
9:00-9:30 | Hadoop Overview | Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 | Pig | Chris Olston, Yahoo!
10:00-10:30 | JAQL | Kevin Beyer, IBM
10:30-10:45 | Break |
10:45-11:15 | DryadLINQ | Michael Isard, Microsoft
11:15-11:45 | Monitoring Hadoop using X-Trace | Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 | Zookeeper | Ben Reed, Yahoo!
12:15-1:15 | Lunch |
1:15-1:45 | Hbase | Michael Stack, Powerset
1:45-2:15 | Hbase at Rapleaf | Bryan Duxbury, Rapleaf
2:15-2:45 | Hive | Joydeep Sen Sarma / Ashish Thusoo, Facebook
2:45-3:05 | GrepTheWeb – Hadoop on AWS | Jinesh Varia, Amazon.com
3:05-3:20 | Break |
3:20-3:40 | Building Ground Models of Southern California | Steve Schlosser, David O’Hallaron, Intel / CMU
3:40-4:00 | Online search for engineering design content | Mike Haley, Autodesk
4:00-4:20 | Yahoo – Webmap | Arnab Bhattacharjee, Yahoo!
4:20-4:45 | Natural Language Processing | Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:45-5:30 | Panel on future directions | Sameer Paranjpye, Sanjay Radia, Owen O’Malley (Yahoo), Chad Walters (Powerset), Jeff Eastman (Mahout)

My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh
