Yahoo is hosting a conference, the Hadoop Summit, down in Sunnyvale today. There are over 400 attendees, of which more than half are current Hadoop users, and roughly 15 to 20% are running clusters of more than 100 nodes.
I’ll post my rough notes from the talks over the course of the day. So far, it’s excellent. My notes on the first two talks are below.
Hadoop: A Brief History
· Doug Cutting
· Started with Nutch in 2002 to 2004
o Initial goal was web-scale, crawler-based search
o Distributed by necessity
o Sort/merge based processing
o Demonstrated on 4 nodes over 100M web pages.
o Was operationally onerous. “Real” web scale was still a ways away.
· 2004 through 2006: Gestation period
o GFS & MapReduce papers published (addressed the scale problems we were having)
o Add DFS and MapReduce to Nutch
o Two part-time developers over two years
o Ran on 20 nodes at Internet Archive (IA) and UW
o Much easier to program and run
o Scaled to several hundred million web pages
· 2006 to 2008: Childhood
o Y! hired Doug Cutting and a dedicated team to work on it reporting to E14 (Eric Baldeschwieler)
o Hadoop project split out of Nutch
o Hit web scale in 2008
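The sort/merge-based processing model mentioned above is the heart of MapReduce: a map phase emits key/value pairs, the pairs are sorted so equal keys become adjacent, and a reduce phase merges each key’s values. A toy single-machine sketch in Python (illustrative only — not Nutch’s or Hadoop’s actual code) using word counting as the example job:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def sort_merge(pairs):
    # Shuffle/sort: order pairs by key so equal keys are adjacent,
    # then merge each key's values into one group per key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped}

docs = ["the crawler fetched the page",
        "the page linked more pages"]
counts = reduce_phase(sort_merge(map_phase(docs)))
print(counts["the"])   # "the" appears 3 times across both documents
```

In the real system each phase runs in parallel across many nodes, with the sort/merge step moving map output over the network to the reducers; the single-process version above just makes the data flow visible.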
Yahoo Grid Team Perspective: Eric Baldeschwieler
· Grid is Eric’s team’s internal name
o On-demand, shared access to vast pools of resources
o Support massively parallel execution (2k nodes and roughly 10k processors)
o Data Intensive Super Computing (DISC)
o Centrally provisioned and managed
o Service-oriented, elastic
o Utility for users and researchers inside Y!
· Open Source Stack
o Committed to open source development
o Y! is Apache Platinum Sponsor
· Projects on Eric’s team:
§ Distributed File System
§ MapReduce Framework
§ Dynamic Cluster Management (HOD)
· Allows sharing of a Hadoop cluster with hundreds of users at the same time.
· HOD: Hadoop on Demand. Creates virtual clusters using Torque (an open source resource manager). Allocates the cluster into many virtual clusters.
§ Parallel Programming Language and Runtime
§ High-availability directory and configuration service
§ Cluster and application monitoring
§ Collects stats from hundreds of clusters in parallel (fairly new so far). Also will be open sourced.
§ All will eventually be part of Apache
§ Similar to Ganglia but more configurable
§ Builds real-time reports.
§ Goal is to use Hadoop to monitor Hadoop.
· Largest production clusters are currently 2k nodes. Working on further scaling: the goal is not just one cluster, but the ability to run much bigger clusters. Investing heavily in scheduling to handle more concurrent jobs.
· Using 2 data centers and moving to three soon.
· Working with Carnegie Mellon University (Yahoo provided a container of 500 systems – it appears to be a Rackable Systems container)
· We’re running Megawatts of Hadoop
· Over 400 people expressed interest in this conference.
o About half the room is running Hadoop
o About the same number are running over 20 nodes
o About 15 to 20% are running over 100 nodes