HBase: Michael Stack (Powerset)
· Distributed DB built on Hadoop core
· Modeled on BigTable
· Same advantages as BigTable:
o Column store
§ Efficient compression
§ Support for very wide tables when most columns aren’t looked at together
o Nulls stored for free
o Cells are versioned (cells addressed by row, col, and timestamp)
· No join support
· Rows are ordered lexicography
· Columns grouped into columnfamilies
· Tables are horizontally partitioned into regions
· Like Hadoop: master node and regionServers
· Client initially goes to master to find the RegionServer. Cached thereafter.
o On failure (or split) or other change, fail the client and it will go back to master.
· All java access and implementation.
o Thrift server hosting supports C++, Ruby, and Java (via thrift) clients
o Rest server supports Ruby gem
· Focusing on developer a user/developer base for HBase
· Three committers: Jim Bryan Duxbury, and Michael Stack
Hbase at Rapleaf: Bryan Duxbury
· Rapleaf is a people search application. Supports profile aggregation, Data API
· “It’s a privacy tool for yourself and a stalking tool for others”
· Customer Ruby web crawler
· Index structured data from profiles
· They are using HBase to store pages (HBase via REST servlet)
· Cluster specs:
o HDFS/Hbase cluster of 16 macdhines
o 2TB of disk (big plans to grow)
o 64 cores
o 64GB memory
o Average row size: 65KB (14KB gzipped)
o Predominantly new rows (not versioned)
Facebook Hive: Joydeep Sen Sarma & Ashish Thusoo (Facebook Data Team)
· Data Warehousing use Hadoop
· Hive is the Facebook datawarehouse
· Query language brings together SQL and streaming
o Developers love direct access to map/reduce and streaming
o Analyst love SQL
· Hive QL (parser, planner, and execution engine)
· Uses the Thrift API
· Hive CLI implemented in Python
· Query operators in initial versions
o Projections, equijoins, cogroups, groupby, & sampling
· Supports views as well
· Supports 40 users (about 25% of engineering team)
· 200GB of compressed data per day
· 3,514 jobs run over the last 7 days
· 5 engineers on the project
· Q: Why not use PIG? A: Wanted to support SQL and python.
Processing Engineering Design Content with Hadoop and Amazon
· Mike Haley (Autodesk)
· Running classifiers over CAD drawings and classifying them according to what the objects actually are. The problem they are trying to solve is to allow someone to look for drawings of wood doors and to find elm doors, wood doors, pine doors and not find non-doors.
· They were running on an internal autodesk cluster originally. Now running on an EC2 cluster to get more resources in play when needed.
· Mike showed some experimental products that showed power and gas consumption over entire cities by showing the lines and using color and brightness to show consumption rate. Showed the same thing to show traffic hot spots. Pretty cool visualizations.
Yahoo! Webmap: Christian Kunz
· Webmap is now build in production usng Hadoop
· Webmap is the a gigantic table o finformation about every web site, page, and link Yahoo! tracks.
· Why port to Hadoop
o Old system only scales to 1k nodes (Hadoop cluster at Y! is at 2k servers)
o One failed or slow server, used to slow all
o High management costs
o Hard to evolve infrastructure
· Challenges: port ~100 webmap applications to map/reduce
· Webmap builds are not done on latest Hadoop release without any patches
· These are almost certainly the largest Hadoop jobs in the world:
o 100,000 maps
o 10,000 reduces
o Runs 3 days
o Moves 300 terabytes
o Produces 200 terabytes
· Believe they can gain another 30 to 50% improvement in run time.
Computing in the cloud with Hadoop
· Christophe Bisciglia: Google open source team
· Jimmy Lin: Assistant Professor at University of Maryland
· Set up a 40 node cluster at UofW.
· Using Hadoop to help students and academic community learn the map/reduce programming model.
· It’s a way for Google to contribute to the community without open sourcing Map/Reduce
· Interested in making Hadoop available to other fields beyond computer science
· Five universities in program: Berekeley, CMU, MIT, Stanford, UW, UMD
· Jimmy Lin shows some student projects including a statistical machine translations project that was a compelling use of Hadoop.
· Berkeley will use Hadoop in their introductory computing course (~400 students).
Panel on Future Directions:
· Five speakers from the Hadoop community:
1. Sanjay Radia
2. Owen O’Malley (Yahoo & chair of Apache PMC for Apache)
3. Chad Walters (Powerset)
4. Jeff Eastman (Mahout)
5. Sameer Paranjpye
· Yahoo planning to scale to 5,000 nodes in near future (at 2k servers now)
· Namespace entirely in memory. Considering implementing volumes. Volumes will share data. Just the volumes will be partitioned. Volume name spaces will be “mounted” into a shared file tree.
· HoD scheduling implementation has hit the wall. Need a new scheduler. HoD was a good short term solution but not adequate for current usage levels. It’s not able to handle the large concurrent job traffic Yahoo! is currently experiencing.
· Jobs often have a large virtual partition for the maps. Because they are held during reduce phase, considerable resources are left unused.
· FIFO scheduling doesn’t scale for large, diverse user bases.
· What is needed to declare Hadoop 1.0: API Stability, future proof API to use single object parameter, add HDFS single writer append, & Authentication (Owen O’Malley)
· Malhout project build classification, clustering, regression, etc. kernels that run on hadoop and release under commercial friendly, Apache license.
· Plans for HBase looking forward:
1. 0.1.0: Initial release
2. 0.2.0: Scalability and Robustness
3. 0.3.0: Performance
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com