James Hamilton's Blog
 Thursday, April 03, 2008

Two interesting directions brought together: 1) an Oracle-compatible DB startup, and 2) a cloud-based implementation.

 

The Oracle-compatible offering is EnterpriseDB. They use the PostgreSQL code base and implement Oracle compatibility to make it easy for the huge Oracle install base to move to them.  An interesting approach.  I used to lead the SQL Server Migration Assistant team so I know that true Oracle compatibility is tough but, even falling short of 100% compatibility, it makes it easier for Oracle apps to port over to them. The pricing model is free for a developer license and $6k/socket for their Advanced Server edition.

 

The second interesting direction is an offering from Elastra.  It’s a management and administration system that automates deploying and managing dynamically scalable services. Part of the Elastra offering is support for Amazon AWS EC2 deployments.

 

Bring together EnterpriseDB and Elastra and you have an Oracle-compatible database, hosted in EC2, with deployment and management support: ELASTRA Propels EnterpriseDB into the Cloud. I couldn’t find any customer usage examples, so this may be more press release than a fully exercised, ready-for-prime-time solution, but it’s a cool general direction and I expect to see more offerings along these lines over the next few months.  Good to see.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Thursday, April 03, 2008 11:17:15 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Wednesday, April 02, 2008

I’m a big believer in auto-installable client software but I also want a quality user experience.  For data intensive applications, I want a caching client. I use and love many browser-hosted clients but, for development work, email clients, and photo editing, I still use installable software. I want a snappy user experience, I need to be able to run disconnected or weakly connected, and I want to fully use my local resources.  Speed and richness are king for these apps – it’s the casual apps that are getting replaced well by browser-based software in my world.

 

However, I’ve been blown away by how fast the set of applications I’m willing to run in the browser has been expanding. For example, Yahoo Mail impressed me when it came out. Both Google and Live maps are impressive (how can anyone understand and maintain that much JavaScript?).  In fact, in the ultimate compliment, these mapping services are good enough that, even though I have local mapping software installed, I seldom bother to start it.

 

Here’s another one, announced last week, that is truly impressive: https://www.photoshop.com/express/landing.html.  The Adobe online implementation of Photoshop is an eye opener. Predictably, it’s Flash and Flex based and, wow, it’s amazing for a within-the-browser experience.  I’m personally still editing my pictures locally but Photoshop Express shows a bit of what’s possible.

 

                                --jrh

 


Wednesday, April 02, 2008 11:18:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services | Software
 Tuesday, April 01, 2008

Microsoft has been investigating and testing containers and modular data centers for some time now.  I wrote about them some time back in Architecture for Modular Data Centers (presentation) at the 2007 Conference on Innovative Data Systems Research (CIDR). Around that time Rackable Systems and Sun Microsystems announced shipping-container-based solutions and Rackable shipped the first production container.  That first unit had more than 1,000 servers.  Rackable and Sun helped get this started when, early on, most of the industry was somewhere between skeptical and actively resistant.

 

Over the last couple of years, the modular datacenter approach has gained momentum.  Now nearly all data center equipment providers have started offering container-based solutions:

·         IBM Scalable modular data center

·         Rackable ICE Cube™ Modular Data Center

·         Sun Modular Datacenter S20 (project Blackbox)

·         Dell Insight

·         Verari Forest Container Solution

 

It’s great to see all the major systems providers investing in modular data centers. I expect the pace of innovation to pick up; over the last two weeks I’ve seen three new designs.  Things are moving.

 

Yesterday Mike Manos, who leads the Microsoft Global Foundations Data Center team, made the first public announcement of a containerized production data center at Data Center World. The Microsoft Chicago facility is a two-floor design where the first floor is a containerized design housing 150 to 220 40’ containers, each holding 1,000 to 2,000 servers.   Chicago is a large facility, with the low end of the ranges Mike quoted yielding 150k servers and the high end running to 440k servers.  If you assume 200W/server, the critical load would run between 30MW and 88MW for the half of the data center that is containerized.  If you conservatively assume a PUE of 1.5, we can estimate the containerized portion of the data center at between 45MW and 132MW total load.  It’s a substantial facility.
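
For anyone who wants to play with the assumptions, here is a minimal sketch of the back-of-envelope arithmetic above. The container and server ranges are the ones Mike quoted; the 200W/server and PUE of 1.5 are the assumptions already stated in the text, and the function and variable names are made up for illustration.

```python
# Back-of-envelope sizing for the containerized floor of the Chicago facility.
# Inputs are the ranges quoted in the talk plus two assumptions from the text:
# 200 W per server and a PUE of 1.5.

WATTS_PER_SERVER = 200   # assumed average draw per server
PUE = 1.5                # assumed power usage effectiveness


def estimate(containers, servers_per_container):
    """Return (servers, critical load in MW, total facility load in MW)."""
    servers = containers * servers_per_container
    critical_mw = servers * WATTS_PER_SERVER / 1_000_000
    total_mw = critical_mw * PUE
    return servers, critical_mw, total_mw


low = estimate(150, 1_000)    # low end of both quoted ranges
high = estimate(220, 2_000)   # high end of both quoted ranges

for label, (servers, critical_mw, total_mw) in (("low", low), ("high", high)):
    print(f"{label}: {servers:,} servers, "
          f"{critical_mw:.0f} MW critical, {total_mw:.0f} MW total")
```

Changing WATTS_PER_SERVER or PUE shows how sensitive the total is to those two guesses.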

 

John Rath posted great notes on Mike’s entire talk: http://datacenterlinks.blogspot.com/2008/04/miichael-manos-keynote-at-data-center.html.  I’m excited that this news is now public, so when Mike gets back into the office in Redmond I’ll pester him to see if he can release the slides he used.  If so, I’ll post them here.

 

Thanks to Rackable Systems and Sun Microsystems for getting the industry started on commodity-based containerized designs.  We now have modular components from most major server vendors, and Mike’s talk yesterday at Data Center World marked the first publicly announced modular facility.

 

                                    --jrh

 


Tuesday, April 01, 2008 11:19:54 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Monday, March 31, 2008

Tom Kleinpeter was one of the founders of Foldershare (acquired by Microsoft in 2006) and before that he was a part of the original team at Audiogalaxy. I worked with Tom while he was at Microsoft working on Mesh. Tom recently decided to take some time off, to relax, be a father, and it looks like he’s also finding time to write up some of his experiences. I particularly like the Audiogalaxy Chronicles, where he writes up his experiences with Audiogalaxy, which grew like only a successful startup can, shooting to 80 million page views a day from 35 million unique users.

 

I found this post particularly interesting where Tom describes scaling the Audiogalaxy design and some of the challenges they had in scaling to 80 million page views a day: http://www.spiteful.com/2008/02/27/scaling-audiogalaxy-to-80-million-daily-page-views/.

 

Read them all: http://www.spiteful.com/the-audiogalaxy-chronicles/.

 

                                    --jrh

 


Monday, March 31, 2008 11:21:05 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, March 30, 2008

There is (again) a rumor out there that Google will soon offer a third party service platform: http://www.scripting.com/stories/2008/03/29/pigs.html.  I mostly ignore the rumors but this is one I find hard to ignore. Why?  Mostly because it makes too much sense.  The Google infrastructure investment combined with phenomenal scale yields some of the lowest cost compute and storage in the industry.  They can sell compute and storage at considerably above their costs and yet still be offering substantial cost reductions to smaller services.  That’s if they choose to charge for it.  Google also has the highest scale advertising platform in the world, offering the opportunity to monetize even that for which they don’t directly charge.  When something looks like it makes sense economically and fits in strategically, it just about has to happen.

 

We all know that these rumors often have nothing at all behind them.  Some are simply excited fabrications. But, even knowing that, on this one it’s a matter of when rather than if.

 

Thanks to Dare Obasanjo for pointing me to the blog posting above.

 

                                    --jrh

 


Sunday, March 30, 2008 9:23:04 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, March 27, 2008

Yahoo! hosted the Hadoop Summit Tuesday of this week.  I posted my rough notes on the conference over the course of the day – this posting summarizes some of what caught my interest and consolidates my notes.

 

Yahoo expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend.  For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start and roughly ½ the crowd were running Hadoop in production and around 1/5 have clusters of over 100 nodes. Yahoo remains the biggest with 2,000 nodes in their cluster.

 

Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all Yahoo! crawled pages and all the metadata they extract or compute on those pages.  There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce jobs run for over three days and have 100k maps and 10k reduces. This job reads 300 TB and produces 200 TB.

 

Another informative talk was given by the Facebook team. They described Hive, the data warehouse at Facebook.  Joydeep Sarma and Ashish Thusoo presented this work. I liked this talk as it was 100% customer driven. They implemented what the analysts and programmers inside Facebook needed and I found their observations credible and interesting.  They reported that analysts are used to SQL and found a SQL-like language most productive but that programmers like to have direct access to map/reduce primitives.  As a consequence, they provide both (so do we).  The Facebook team reports that roughly 25% of the development team uses Hive and that they process 3,500 map/reduce jobs a week.

 

Google is heavily invested in Hadoop, using it as a teaching vehicle even though it’s not used internally.  The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop. For example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.
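
For readers who haven’t seen the map/reduce programming model, here is a toy, single-process sketch of the classic word count. It is not the Hadoop API and nothing here was shown at the summit; the function names are made up, and the shuffle step stands in for the grouping a real framework does across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["hadoop runs map reduce", "map then reduce"])
print(counts)
```

The programmer writes only the map and reduce functions; partitioning, grouping, and retrying failed tasks are the framework’s job, which is what makes the model attractive for teaching.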

 

The agenda for the day:

Time | Topic | Speaker(s)
8:00-8:55 | Breakfast/Registration
8:55-9:00 | Welcome & Logistics | Ajay Anand, Yahoo!
9:00-9:30 | Hadoop Overview | Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 | Pig | Chris Olston, Yahoo!
10:00-10:30 | JAQL | Kevin Beyer, IBM
10:30-10:45 | Break
10:45-11:15 | DryadLINQ | Michael Isard, Microsoft
11:15-11:45 | Monitoring Hadoop using X-Trace | Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 | Zookeeper | Ben Reed, Yahoo!
12:15-1:15 | Lunch
1:15-1:45 | HBase | Michael Stack, Powerset
1:45-2:15 | HBase at Rapleaf | Bryan Duxbury, Rapleaf
2:15-2:45 | Hive | Joydeep Sen Sarma / Ashish Thusoo, Facebook
2:45-3:05 | GrepTheWeb - Hadoop on AWS | Jinesh Varia, Amazon.com
3:05-3:20 | Break
3:20-3:40 | Building Ground Models of Southern California | Steve Schlosser, David O'Hallaron, Intel / CMU
3:40-4:00 | Online search for engineering design content | Mike Haley, Autodesk
4:00-4:20 | Yahoo - Webmap | Arnab Bhattacharjee, Yahoo!
4:20-4:45 | Natural Language Processing | Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:45-5:30 | Panel on future directions | Sameer Paranjpye, Sanjay Radia, Owen O'Malley (Yahoo), Chad Walters (Powerset), Jeff Eastman (Mahout)

My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/

 


Thursday, March 27, 2008 11:53:46 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, March 25, 2008

HBase: Michael Stack (Powerset)

·         Distributed DB built on Hadoop core

·         Modeled on BigTable

·         Same advantages as BigTable:

o   Column store

§  Efficient compression

§  Support for very wide tables when most columns aren’t looked at together

o   Nulls stored for free

o   Cells are versioned (cells addressed by row, col, and timestamp)

·         No join support

·         Rows are ordered lexicographically

·         Columns grouped into column families

·         Tables are horizontally partitioned into regions

·         Like Hadoop: master node and RegionServers

·         Client initially goes to master to find the RegionServer. Cached thereafter. 

o   On failure (or split) or other change, fail the client and it will go back to master.

·         All Java access and implementation.

o   Thrift server hosting supports C++, Ruby, and Java (via Thrift) clients

o   REST server supports a Ruby gem

·         Focusing on developing a user/developer base for HBase

·         Three committers: Jim, Bryan Duxbury, and Michael Stack
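
To make the data model in these notes concrete, here is a toy in-memory sketch of cells addressed by row, column, and timestamp, with rows scanned in lexicographic order. It is not the HBase API: the ToyTable class, its method names, and the "family:qualifier" column names in the usage lines are all made up for illustration.

```python
import time


class ToyTable:
    """Toy in-memory model of a BigTable-style table (not the HBase API)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), oldest first.
        # Absent cells take no space, which is the "nulls stored for free" idea.
        self._rows = {}

    def put(self, row, column, value, timestamp=None):
        """Store a new version of one cell."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self._rows.get(row, {}).get(column, [])
        for ts, value in reversed(versions):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row="", stop_row=None):
        """Yield row keys in lexicographic order within [start_row, stop_row)."""
        for row in sorted(self._rows):
            if row >= start_row and (stop_row is None or row < stop_row):
                yield row


table = ToyTable()
table.put("com.example/a", "contents:html", "<v1>", timestamp=1)
table.put("com.example/a", "contents:html", "<v2>", timestamp=2)
table.put("com.example/b", "anchor:text", "link", timestamp=1)

latest = table.get("com.example/a", "contents:html")
as_of_1 = table.get("com.example/a", "contents:html", timestamp=1)
in_range = list(table.scan("com.example/a", "com.example/b"))
print(latest, as_of_1, in_range)
```

A real system keeps rows sorted on disk and splits contiguous row ranges into regions; the sorted() call here only imitates the ordering.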

 

HBase at Rapleaf: Bryan Duxbury

·         Rapleaf is a people search application.  Supports profile aggregation, Data API

·         “It’s a privacy tool for yourself and a stalking tool for others”

·         Custom Ruby web crawler

·         Index structured data from profiles

·         They are using HBase to store pages (HBase via REST servlet)

·         Cluster specs:

o   HDFS/HBase cluster of 16 machines

o   2TB of disk (big plans to grow)

o   64 cores

o   64GB memory

·         Load:

o   3.6TB/month

o   Average row size: 65KB (14KB gzipped)

o   Predominantly new rows (not versioned)
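
A rough sketch of what those load figures imply. Two things here are my assumptions and were not stated in the talk: decimal units (1 TB = 10**12 bytes), and that the 3.6TB/month figure refers to uncompressed row data.

```python
# Rough sizing from the Rapleaf load figures above.
# Assumptions (not from the talk): decimal units, and 3.6 TB/month is
# measured before gzip compression.

TB = 10**12
KB = 10**3

monthly_bytes = 3.6 * TB
row_bytes = 65 * KB            # average row size
gzipped_row_bytes = 14 * KB    # average row size after gzip

rows_per_month = monthly_bytes / row_bytes
compression_ratio = row_bytes / gzipped_row_bytes
gzipped_monthly_tb = monthly_bytes / compression_ratio / TB

print(f"~{rows_per_month / 1e6:.0f} million rows/month")
print(f"~{compression_ratio:.1f}x compression")
print(f"~{gzipped_monthly_tb:.2f} TB/month if stored gzipped")
```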

 

Facebook Hive: Joydeep Sen Sarma & Ashish Thusoo (Facebook Data Team)

·         Data warehousing using Hadoop

·         Hive is the Facebook data warehouse

·         Query language brings together SQL and streaming

o   Developers love direct access to map/reduce and streaming

o   Analysts love SQL

·         Hive QL (parser, planner, and execution engine)

·         Uses the Thrift API

·         Hive CLI implemented in Python

·         Query operators in initial versions

o   Projections, equijoins, cogroups, groupby, & sampling

·         Supports views as well

·         Supports 40 users (about 25% of engineering team)

·         200GB of compressed data per day

·         3,514 jobs run over the last 7 days

·         5 engineers on the project

·         Q: Why not use Pig? A: Wanted to support SQL and Python.
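
The talk didn’t go into how Hive plans its operators, so the following is not Hive’s implementation. It is a toy sketch of one standard way to express an equijoin on map/reduce, a reduce-side join, with made-up function names and sample data and an in-memory dictionary standing in for the framework’s shuffle.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_column):
    """Map: emit (join key, (source table, row)) for every input row."""
    for row in rows:
        yield row[key_column], (table_name, row)


def reduce_join(key, tagged_rows, left_name):
    """Reduce: pair every left row with every right row sharing the key."""
    left = [row for name, row in tagged_rows if name == left_name]
    right = [row for name, row in tagged_rows if name != left_name]
    for l_row in left:
        for r_row in right:
            yield {**l_row, **r_row}


def equijoin(left_rows, right_rows, key_column):
    groups = defaultdict(list)  # stands in for the framework's shuffle
    for key, tagged in map_tagged("left", left_rows, key_column):
        groups[key].append(tagged)
    for key, tagged in map_tagged("right", right_rows, key_column):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key], "left"))
    return joined


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
views = [{"user_id": 1, "page": "/home"}, {"user_id": 1, "page": "/photos"}]
joined_rows = equijoin(users, views, "user_id")
for row in joined_rows:
    print(row)
```

Because this is an inner join, the user with no page views drops out of the result. A SQL-like layer lets analysts write the join declaratively while still running on the same map/reduce machinery that developers can reach directly.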

 

Processing Engineering Design Content with Hadoop and Amazon

·         Mike Haley (Autodesk)

·         Running classifiers over CAD drawings and classifying them according to what the objects actually are. The problem they are trying to solve is to allow someone to look for drawings of wood doors and to find elm doors, wood doors, pine doors and not find non-doors.

·         They were running on an internal Autodesk cluster originally. Now running on an EC2 cluster to get more resources in play when needed.

·         Mike showed some experimental products that showed power and gas consumption over entire cities by showing the lines and using color and brightness to show consumption rate.  Showed the same thing to show traffic hot spots.  Pretty cool visualizations.

 

Yahoo! Webmap: Christian Kunz

·         Webmap is now built in production using Hadoop

·         Webmap is a gigantic table of information about every web site, page, and link Yahoo! tracks.

·         Why port to Hadoop

o   Old system only scales to 1k nodes (Hadoop cluster at Y! is at 2k servers)

o   One failed or slow server used to slow them all

o   High management costs

o   Hard to evolve infrastructure

·         Challenges: port ~100 webmap applications to map/reduce

·         Webmap builds are now done on the latest Hadoop release without any patches

·         These are almost certainly the largest Hadoop jobs in the world:

o   100,000 maps

o   10,000 reduces

o   Runs 3 days

o   Moves 300 terabytes

o   Produces 200 terabytes

·         Believe they can gain another 30 to 50% improvement in run time.
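
To put the job figures above in perspective, here is a quick sketch of the averages they imply. The assumptions are mine, not the speaker’s: decimal terabytes, a run time of exactly 3 days, and all 2,000 nodes of the cluster working on the job.

```python
# Averages implied by the Webmap job figures above.
# Assumptions (not from the talk): decimal units, exactly 3 days of run
# time, and the full 2,000-node cluster dedicated to the job.

TB = 10**12
GB = 10**9
MB = 10**6

input_bytes = 300 * TB
output_bytes = 200 * TB
maps, reduces, nodes = 100_000, 10_000, 2_000
run_seconds = 3 * 24 * 60 * 60

input_per_map_gb = input_bytes / maps / GB
output_per_reduce_gb = output_bytes / reduces / GB
read_rate_gb_s = input_bytes / run_seconds / GB
read_rate_per_node_mb_s = input_bytes / run_seconds / nodes / MB

print(f"{input_per_map_gb:.0f} GB read per map task")
print(f"{output_per_reduce_gb:.0f} GB written per reduce task")
print(f"{read_rate_gb_s:.2f} GB/s aggregate read rate")
print(f"{read_rate_per_node_mb_s:.2f} MB/s average read rate per node")
```

The per-node average is low because it spreads input reads over the whole run; it says nothing about peak rates or about the intermediate data moved between maps and reduces.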

 

Computing in the cloud with Hadoop

·         Christophe Bisciglia: Google open source team

·         Jimmy Lin: Assistant Professor at University of Maryland

·         Set up a 40 node cluster at UofW.

·         Using Hadoop to help students and academic community learn the map/reduce programming model.

·         It’s a way for Google to contribute to the community without open sourcing Map/Reduce