I didn’t attend the Hadoop Summit this year or last but was at the inaugural event back in 2008 and it was excellent. This year, the Hadoop Summit 2010 was held June 29 again in Santa Clara. This agenda for the 2010 event is at: Hadoop Summit 2010 Agenda. Since I wasn’t able to be there, Adam Gray of the Amazon AWS team was kind enough to pass on his notes and let me use them here:

Key Takeaways

· Yahoo and Facebook operate the world largest Hadoop clusters, 4,000/2,300 nodes with 70/40 petabytes respectively. They run full cluster replicas to assure availability and data durability.

· Yahoo released Hadoop security features with Kerberos integration which is most useful for long running multitenant Hadoop clusters.

· Cloudera released paid enterprise version of Hadoop with cluster management tools and several dB connectors and announced support for Hadoop security.

· Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.

· Many Hadoop users use the service in conjunction with NoSQL DBs like Hbase or Cassandra.


Yahoo had the opening keynote with talks by Blake Irving, Chief Products Officer, Shelton Shugar, SVP of Cloud Computing, and Eric Baldeschwieler, VP of Hadoop. They talked about Yahoo’s scale, including 38k Hadoop servers, 70 PB of storage, and more than 1 million monthly jobs, with half of those jobs written in Apache Pig. Further their agility is improving significantly despite this massive scale—within 7 minutes of a homepage click they have a completely reconfigured preference model for that user and an updated homepage. This would not be possible without Hadoop. Yahoo believes that Hadoop is ready for enterprise use at massive scale and that their use case proves it. Further, a recent study found that 50% of enterprise companies are strongly considering Hadoop, with the most commonly cited reason being agility. Initiatives over the last year include: further investment and improvement in Hadoop 0.20, integration of Hadoop with Kerberos, and the Oozie workflow engine.

Next, Peter Sirota gave a keynote for Amazon Elastic MapReduce that focused on how the service makes combining the massive scalability of MapReduce with the web-scale infrastructure of AWS more accessible, particularly to enterprise customers. He also announced several new features including expanding and shrinking the cluster size of running job flows, support for spot instances, and premium support for Elastic MapReduce. Further, he discussed Elastic MapReduce’s involvement in the ecosystem including integration with Karmasphere and Datameer. Finally, Scott Capiello, Senior Director of Products at Microstrategy, came on stage to discuss their integration with Elastic MapReduce.

Cloudera followed with talks by Doug Cutting, the creator of Hadoop, and Charles Zedlweski, Senior Director of Product Management. They announced Cloudera Enterprise, a version of their software that includes production support and additional management tools. These tools include improved data integration and authorization management that leverages Hadoops security updates. And they demoed a WebUI for using these management tools.

The final keynote was given by Mike Schroepfer, VP of Engineering at Facebook. He talked about Facebook’s scale with 36 PB of uncompressed storage, 2,250 machines with 23k processors, and 80-90 TB growth per day. Their biggest challenge is in getting all that data into Hadoop clusters. Once the data is there, 95% of their jobs are Hive-based. In order to ensure reliability they replicate critical clusters in their entirety. As far as traffic, the average user spends more time on Facebook than the next 6 web pages combined. In order to improve user experience Facebook is continually improving the response time of their Hadoop jobs. Currently updates can occur within 5 minutes; however, they see this eventually moving below 1 minute. As this is often an acceptable wait time for changes to occur on a webpage, this will open up a whole new class of applications.

Discussion Tracks

After lunch the conference broke into three distinct discussion tracks: Developers, Applications, and Research. These tracks had several interesting talks including one by Jay Kreps, Principal Engineer at LinkedIn, who discussed LinkedIn’s data applications and infrastructure. He believes that their business data is ideally suited for Hadoop due to its massive scale but relatively static nature. This supports large amounts of computation being done offline. Further, he talked about their use of machine learning to predict relationships between users. This requires scoring 120 billion relationships each day using 82 Hadoop jobs. Lastly, he talked about LinkedIn’s in-house developed workflow management tool, Azkaban, an alternative to Oozie.

Eric Sammer, Solutions Architect at Cloudera, discussed some best practices for dealing with massive data in Hadoop. Particularly, he discussed the value of using workflows for complex jobs, incremental merges to reduce data transfer, and the use of Sqoop (SQL to Hadoop) for bulk relational database imports and exports. Yahoo’s Amit Phadke discussed using Hadoop to optimize online content. His recommendations included leveraging Pig to abstract out the details of MapReduce for complex jobs and taking advantage of the parallelism of HBase for storage. There was also significant interest in the challenges of using Hadoop for graph algorithms including a talk that was so full that they were unable to let additional people in.

Elastic MapReduce Customer Panel

The final session was a customer panel of current Amazon Elastic MapReduce users chaired by Deepak Singh. Participants included: Netflix, Razorfish, Coldlight Solutions, and Spiral Genetics. Highlights include:

· Razorfish discussed a case study in which a combination of Elastic MapReduce and cascading allowed them to take a customer to market in half the time with a 500% return in ad spend. They discussed how using EMR has given them much better visibility into their costs, allowing them to pass this transparency on to customers.

· Netflix discussed their use of Elastic MapRedudce to setup a hive-based data warehouseing infrastructure. They keep a persistent cluster with data backups in S3 to ensure durability. Further, they also reduce the amount of data transfer through pre-aggregation and preprocessing of data.

· Spiral Genetics talked about they had to leverage AWS to reduce capital expenditures. By using Amazon Elastic MapReduce they were able to setup a running job in 3 hours. They are also excited to see spot instance integration.

· Coldlight Solutions said that buying $1/2M in infrastructure wasn’t even an option when they started. Now it is, but they would rather focus on their strength: machine learning and Amazon Elastic MapReduce allows them to do this.

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com