Yahoo! hosted the Hadoop Summit Tuesday of this week. I posted my rough notes on the conference over the course of the day – this posting summarizes some of what caught my interest and consolidates my notes.
Yahoo! expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend. For me, the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start: roughly half the crowd is running Hadoop in production, and around one fifth have clusters of over 100 nodes. Yahoo! remains the biggest, with 2,000 nodes in its cluster.
Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all pages Yahoo! has crawled and all the metadata they extract or compute on those pages. There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce jobs run for over three days, with 100k maps and 10k reduces, reading 300 TB and producing 200 TB.
Another informative talk was given by the Facebook team. Joydeep Sarma and Ashish Thusoo described Hive, the data warehouse at Facebook. I liked this talk as it was 100% customer driven: they implemented what the analysts and programmers inside Facebook needed, and I found their observations credible and interesting. They reported that analysts are used to SQL and found a SQL-like language most productive, but that programmers like to have direct access to map/reduce primitives. As a consequence, they provide both (so do we). The Facebook team reports that roughly 25% of the development team uses Hive and processes 3,500 map/reduce jobs a week.
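To illustrate the gap between the two styles (my own sketch, not Facebook's code – the table, field names, and data are hypothetical): an analyst would write something like "SELECT status, COUNT(1) FROM users GROUP BY status" in a SQL-like language, while a programmer working at the map/reduce level has to spell out the map, shuffle, and reduce phases explicitly. In Python, the equivalent computation looks roughly like this:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows from a "users" table: (user_id, status).
users = [(1, "active"), (2, "inactive"), (3, "active"), (4, "active")]

# Map phase: emit an intermediate (status, 1) pair for each row.
mapped = [(status, 1) for _, status in users]

# Shuffle phase: bring identical keys together (the Hadoop framework
# does this between the map and reduce phases).
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each status key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'active': 3, 'inactive': 1}
```

The one-line query and the three explicit phases compute the same group-by aggregate, which is why offering both interfaces serves both audiences.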
Google is heavily invested in Hadoop, using it as a teaching vehicle even though it's not used internally. The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop; for example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.
The agenda for the day:
Time | Topic | Speaker(s)
8:00-8:55 | Breakfast/Registration |
8:55-9:00 | Welcome & Logistics | Ajay Anand, Yahoo!
9:00-9:30 | Hadoop Overview | Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 | Pig | Chris Olston, Yahoo!
10:00-10:30 | JAQL | Kevin Beyer, IBM
10:30-10:45 | Break |
10:45-11:15 | DryadLINQ | Michael Isard, Microsoft
11:15-11:45 | Monitoring Hadoop using X-Trace | Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 | Zookeeper | Ben Reed, Yahoo!
12:15-1:15 | Lunch |
1:15-1:45 | HBase | Michael Stack, Powerset
1:45-2:15 | HBase at Rapleaf | Bryan Duxbury, Rapleaf
2:15-2:45 | Hive | Joydeep Sen Sarma / Ashish Thusoo, Facebook
2:45-3:05 | GrepTheWeb – Hadoop on AWS | Jinesh Varia, Amazon.com
3:05-3:20 | Break |
3:20-3:40 | Building Ground Models of Southern California | Steve Schlosser, David O'Hallaron, Intel / CMU
3:40-4:00 | Online search for engineering design content | Mike Haley, Autodesk
4:00-4:20 | Yahoo – Webmap | Arnab Bhattacharjee, Yahoo!
4:20-4:45 | Natural Language Processing | Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:45-5:30 | Panel on future directions | Sameer Paranjpye, Sanjay Radia, Owen O'Malley (Yahoo!), Chad Walters (Powerset), Jeff Eastman (Mahout)
My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com