Thursday, April 03, 2008

A couple of interesting directions brought together: 1) an Oracle-compatible DB startup, and 2) a cloud-based implementation.

 

The Oracle-compatible offering is EnterpriseDB. They use the PostgreSQL code base and implement Oracle compatibility to make it easy for the huge Oracle install base to support them.  An interesting approach.  I used to lead the SQL Server Migration Assistant team so I know that true Oracle compatibility is tough but, even short of 100% compatibility, it’s easier for Oracle apps to port over to them. The pricing model is free for a developer license and $6k/socket for their Advanced Server edition.

 

The second interesting offering is from Elastra.  It’s a management and administration system that automates deploying and managing dynamically scalable services. Part of the Elastra offering is support for Amazon AWS EC2 deployments.

 

Bring together EnterpriseDB and Elastra and you have an Oracle-compatible database, hosted in EC2, with deployment and management support: ELASTRA Propels EnterpriseDB into the Cloud. I couldn’t find any customer usage examples so this may be more press release than a fully exercised, ready-for-prime-time solution, but it’s a cool general direction and I expect to see more offerings along these lines over the next few months.  Good to see.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Thursday, April 03, 2008 11:17:15 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Wednesday, April 02, 2008

I’m a big believer in auto-installable client software but I also want a quality user experience.  For data intensive applications, I want a caching client. I use and love many browser-hosted clients but, for development work, email clients, and photo editing, I still use installable software. I want a snappy user experience, I need to be able to run disconnected or weakly connected, and I want to fully use my local resources.  Speed and richness are king for these apps – it’s the casual apps that are getting replaced well by browser-based software in my world.

 

However, I’ve been blown away by how fast the set of applications I’m willing to run in the browser has been expanding. For example, Yahoo Mail impressed me when it came out. Both Google and Live maps are impressive (how can anyone understand and maintain that much JavaScript?).  In fact, in the ultimate compliment, these mapping services are good enough that, even though I have local mapping software installed, I seldom bother to start it.

 

Here’s another one that was announced last week that is truly impressive: https://www.photoshop.com/express/landing.html.  The Adobe online implementation of Photoshop is an eye opener. Predictably, it’s Flash and Flex based and, wow, it’s amazing for a within-the-browser experience.  I’m personally still editing my pictures locally but Photoshop Express shows a bit of what’s possible.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, April 02, 2008 11:18:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services | Software
 Tuesday, April 01, 2008

Microsoft has been investigating and testing containers and modular data centers for some time now.  I wrote about them some time back in Architecture for Modular Data Centers (presentation) at the 2007 Conference on Innovative Data Systems Research (CIDR). Around that time Rackable Systems and Sun Microsystems announced shipping-container-based solutions and Rackable shipped the first production container.  That first unit had more than 1,000 servers.  Rackable and Sun helped get this started since, early on, most of the industry was somewhere between skeptical and actively resistant.

 

Over the last couple of years, the modular datacenter approach has gained momentum.  Now nearly all data center equipment providers have started offering container-based solutions:

·         IBM Scalable modular data center

·         Rackable ICE Cube™ Modular Data Center

·         Sun Modular Datacenter S20 (project Blackbox)

·         Dell Insight

·         Verari Forest Container Solution

 

It’s great to see all the major systems providers investing in modular data centers. I expect the pace of innovation to pick up; over the last two weeks alone I’ve seen three new designs.  Things are moving.

 

Yesterday Mike Manos, who leads the Microsoft Global Foundations Data Center team, made the first public announcement of a containerized production data center at Data Center World. The Microsoft Chicago facility is a two-floor design where the first floor is containerized, housing 150 to 220 40’ containers with 1,000 to 2,000 servers each.   Chicago is a large facility with the low end of the ranges Mike quoted yielding 150k servers and the high end running to 440k servers.  If you assume 200W/server, the critical load would run between 30MW and 88MW for the half of the data center that is containerized.  If you conservatively assume a PUE of 1.5, we can estimate the containerized portion of the data center at between 45MW and 132MW total load.  It’s a substantial facility.
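As a quick sanity check on those numbers, here’s the back-of-envelope arithmetic in Python. The 200W/server and 1.5 PUE figures are the assumptions from above; everything else follows from them.

# Back-of-envelope estimate of the containerized portion of the Chicago facility.
# Assumptions (from the discussion above): 200W per server, PUE of 1.5.

WATTS_PER_SERVER = 200
PUE = 1.5  # total facility power / critical (IT) power

def facility_power(containers, servers_per_container):
    servers = containers * servers_per_container
    critical_mw = servers * WATTS_PER_SERVER / 1e6
    total_mw = critical_mw * PUE
    return servers, critical_mw, total_mw

for label, c, s in [("low end", 150, 1000), ("high end", 220, 2000)]:
    servers, critical, total = facility_power(c, s)
    print(f"{label}: {servers:,} servers, {critical:.0f}MW critical, {total:.0f}MW total")

# low end: 150,000 servers, 30MW critical, 45MW total
# high end: 440,000 servers, 88MW critical, 132MW total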

 

John Rath posted great notes on Mike’s entire talk: http://datacenterlinks.blogspot.com/2008/04/miichael-manos-keynote-at-data-center.html.  I’m excited that this news is now public, so when Mike gets back to the office in Redmond I’ll pester him to see if he can release the slides he used.  If so, I’ll post them here.

 

Thanks to Rackable Systems and Sun Microsystems for getting the industry started on commodity-based containerized designs.  We now have modular components from most major server vendors and Mike’s talk yesterday at Data Center World marked the first publicly announced modular facility.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Tuesday, April 01, 2008 11:19:54 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Monday, March 31, 2008

Tom Kleinpeter was one of the founders of Foldershare (acquired by Microsoft in 2006) and before that he was a part of the original team at Audiogalaxy. I worked with Tom while he was at Microsoft working on Mesh. Tom recently decided to take some time off to relax and be a father, and it looks like he’s also finding time to write up some of his experiences. I particularly like the Audiogalaxy Chronicles, where he describes his time at Audiogalaxy, which grew like only a successful startup can, shooting to 80 million page views a day from 35 million unique users.

 

I found this post, where Tom describes the Audiogalaxy design and some of the challenges they had in scaling to 80 million page views a day, particularly interesting: http://www.spiteful.com/2008/02/27/scaling-audiogalaxy-to-80-million-daily-page-views/.

 

Read them all: http://www.spiteful.com/the-audiogalaxy-chronicles/.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Monday, March 31, 2008 11:21:05 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, March 30, 2008

There is (again) a rumor out there that Google will soon offer a third party service platform: http://www.scripting.com/stories/2008/03/29/pigs.html.  I mostly ignore the rumors but this is one I find hard to ignore. Why?  Mostly because it makes too much sense.  The Google infrastructure investment combined with phenomenal scale yields some of the lowest cost compute and storage in the industry.  They can sell compute and storage at considerably above their costs and yet still be offering substantial cost reductions to smaller services.  That is, if they choose to charge for it.  Google also has the highest scale advertising platform in the world, offering the opportunity to monetize even that for which they don’t directly charge.  When something looks like it makes sense economically and fits in strategically, it just about has to happen.

 

We all know that these rumors often have nothing at all behind them.  Some are simply excited fabrications. But, even knowing that, on this one it’s a matter of when rather than if.

 

Thanks to Dare Obasanjo for pointing me to the blog posting above.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Sunday, March 30, 2008 9:23:04 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, March 27, 2008

Yahoo! hosted the Hadoop Summit Tuesday of this week.  I posted my rough notes on the conference over the course of the day – this posting summarizes some of what caught my interest and consolidates those notes.

 

Yahoo expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend.  For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start and roughly ½ the crowd were running Hadoop in production and around 1/5 have over 100-node clusters. Yahoo remains the biggest with 2,000 nodes in their cluster.

 

Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all pages Yahoo! has crawled and all the metadata they extract or compute on those pages.  There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce jobs run for over three days, have 100k maps and 10k reduces, read 300 TB, and produce 200 TB.

 

Another informative talk was given by the Facebook team. They described Hive, the data warehouse at Facebook.  Joydeep Sen Sarma and Ashish Thusoo presented this work. I liked this talk as it was 100% customer driven. They implemented what the analysts and programmers inside Facebook needed and I found their observations credible and interesting.  They reported that analysts are used to SQL and find a SQL-like language most productive but that programmers like to have direct access to map/reduce primitives.  As a consequence, they provide both (so do we).  The Facebook team reports that roughly 25% of the development team uses Hive and that they process 3,500 map/reduce jobs a week.

 

Google is heavily invested in Hadoop as a teaching vehicle even though it’s not used internally.  The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop; for example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.
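For anyone who hasn’t seen the programming model these courses teach, here’s a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary programs reading stdin and writing tab-separated key/value pairs. It’s purely illustrative and isn’t taken from any of the summit talks.

#!/usr/bin/env python
# Minimal map/reduce word count in the Hadoop Streaming style (illustrative sketch).
# Mapper: emit "word<TAB>1" for every word on stdin.
# Reducer: Hadoop delivers mapper output sorted by key, so counts can be summed per run of equal keys.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as: script.py map    (mapper phase)
    #     or: script.py reduce (reducer phase)
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)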

 

The agenda for the day:

Time | Topic | Speaker(s)
8:00-8:55 | Breakfast/Registration |
8:55-9:00 | Welcome & Logistics | Ajay Anand, Yahoo!
9:00-9:30 | Hadoop Overview | Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 | Pig | Chris Olston, Yahoo!
10:00-10:30 | JAQL | Kevin Beyer, IBM
10:30-10:45 | Break |
10:45-11:15 | DryadLINQ | Michael Isard, Microsoft
11:15-11:45 | Monitoring Hadoop using X-Trace | Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 | ZooKeeper | Ben Reed, Yahoo!
12:15-1:15 | Lunch |
1:15-1:45 | HBase | Michael Stack, Powerset
1:45-2:15 | HBase at Rapleaf | Bryan Duxbury, Rapleaf
2:15-2:45 | Hive | Joydeep Sen Sarma / Ashish Thusoo, Facebook
2:45-3:05 | GrepTheWeb - Hadoop on AWS | Jinesh Varia, Amazon.com
3:05-3:20 | Break |
3:20-3:40 | Building Ground Models of Southern California | Steve Schlosser, David O'Hallaron, Intel / CMU
3:40-4:00 | Online search for engineering design content | Mike Haley, Autodesk
4:00-4:20 | Yahoo - Webmap | Arnab Bhattacharjee, Yahoo!
4:20-4:45 | Natural language processing | Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:45-5:30 | Panel on future directions | Sameer Paranjpye, Sanjay Radia, Owen O'Malley (Yahoo), Chad Walters (Powerset), Jeff Eastman (Mahout)

My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Thursday, March 27, 2008 11:53:46 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, March 25, 2008

HBase: Michael Stack (Powerset)

·         Distributed DB built on Hadoop core

·         Modeled on BigTable

·         Same advantages as BigTable:

o   Column store

§  Efficient compression

§  Support for very wide tables when most columns aren’t looked at together

o   Nulls stored for free

o   Cells are versioned (cells addressed by row, col, and timestamp)

·         No join support

·         Rows are ordered lexicographically

·         Columns grouped into columnfamilies

·         Tables are horizontally partitioned into regions

·         Like Hadoop: master node and regionServers

·         Client initially goes to master to find the RegionServer. Cached thereafter. 

o   On failure (or split) or other change, the client request fails and the client goes back to the master.

·         All java access and implementation. 

o   Thrift server hosting supports C++, Ruby, and Java (via thrift) clients

o   REST server supports a Ruby gem

·         Focusing on developing a user/developer base for HBase

·         Three committers: Jim Kellerman, Bryan Duxbury, and Michael Stack
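To make the data model above concrete, here’s a toy in-memory sketch of BigTable/HBase-style cell addressing (row, column, timestamp). It’s a conceptual illustration only, not the HBase client API, and the example row and column names are made up.

# Toy illustration of the BigTable/HBase data model described above: a cell is
# addressed by (row, column family:qualifier, timestamp) and rows sort lexicographically.
# This is a conceptual sketch, not the HBase client API.
from collections import defaultdict
import time

class ToyTable:
    def __init__(self):
        # row -> column -> {timestamp: value}; missing columns cost nothing ("nulls stored for free")
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        self.rows[row][column][ts or time.time()] = value

    def get(self, row, column):
        versions = self.rows[row][column]
        return versions[max(versions)] if versions else None  # newest version wins

    def scan(self, start_row=""):
        # Tables are horizontally partitioned into regions along these sorted row keys.
        for row in sorted(r for r in self.rows if r >= start_row):
            yield row, self.rows[row]

t = ToyTable()
t.put("com.example/index.html", "content:html", "<html>...</html>")
t.put("com.example/index.html", "anchor:other.com", "Example")
print(t.get("com.example/index.html", "content:html"))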

 

Hbase at Rapleaf: Bryan Duxbury

·         Rapleaf is a people search application.  Supports profile aggregation, Data API

·         “It’s a privacy tool for yourself and a stalking tool for others”

·         Custom Ruby web crawler

·         Index structured data from profiles

·         They are using HBase to store pages (HBase via REST servlet)

·         Cluster specs:

o   HDFS/HBase cluster of 16 machines

o   2TB of disk (big plans to grow)

o   64 cores

o   64GB memory

·         Load:

o   3.6TB/month

o   Average row size: 65KB (14KB gzipped)

o   Predominantly new rows (not versioned)

 

Facebook Hive: Joydeep Sen Sarma & Ashish Thusoo (Facebook Data Team)

·         Data warehousing using Hadoop

·         Hive is the Facebook data warehouse

·         Query language brings together SQL and streaming

o   Developers love direct access to map/reduce and streaming

o   Analysts love SQL

·         Hive QL (parser, planner, and execution engine)

·         Uses the Thrift API

·         Hive CLI implemented in Python

·         Query operators in initial versions

o   Projections, equijoins, cogroups, groupby, & sampling

·         Supports views as well

·         Supports 40 users (about 25% of engineering team)

·         200GB of compressed data per day

·         3,514 jobs run over the last 7 days

·         5 engineers on the project

·         Q: Why not use PIG? A: Wanted to support SQL and python.

 

Processing Engineering Design Content with Hadoop and Amazon

·         Mike Haley (Autodesk)

·         Running classifiers over CAD drawings and classifying them according to what the objects actually are. The problem they are trying to solve is to allow someone to look for drawings of wood doors and to find elm doors, wood doors, pine doors and not find non-doors.

·         They were running on an internal Autodesk cluster originally. Now running on an EC2 cluster to get more resources in play when needed.

·         Mike showed some experimental products that showed power and gas consumption over entire cities by showing the lines and using color and brightness to show consumption rate.  Showed the same thing to show traffic hot spots.  Pretty cool visualizations.

 

Yahoo! Webmap: Christian Kunz

·         Webmap is now built in production using Hadoop

·         Webmap is a gigantic table of information about every web site, page, and link Yahoo! tracks.

·         Why port to Hadoop

o   Old system only scales to 1k nodes (Hadoop cluster at Y! is at 2k servers)

o   One failed or slow server used to slow them all

o   High management costs

o   Hard to evolve infrastructure

·         Challenges: port ~100 webmap applications to map/reduce

·         Webmap builds are now done on the latest Hadoop release without any patches

·         These are almost certainly the largest Hadoop jobs in the world:

o   100,000 maps

o   10,000 reduces

o   Runs 3 days

o   Moves 300 terabytes

o   Produces 200 terabytes

·         Believe they can gain another 30 to 50% improvement in run time.

 

Computing in the cloud with Hadoop

·         Christophe Bisciglia: Google open source team

·         Jimmy Lin: Assistant Professor at University of Maryland

·         Set up a 40 node cluster at UofW.

·         Using Hadoop to help students and academic community learn the map/reduce programming model.

·         It’s a way for Google to contribute to the community without open sourcing Map/Reduce

·         Interested in making Hadoop available to other fields beyond computer science

·         Six universities in the program: Berkeley, CMU, MIT, Stanford, UW, UMD

·         Jimmy Lin showed some student projects, including a statistical machine translation project that was a compelling use of Hadoop.

·         Berkeley will use Hadoop in their introductory computing course (~400 students).

 

Panel on Future Directions:

·         Five speakers from the Hadoop community:

1.       Sanjay Radia

2.       Owen O’Malley (Yahoo & chair of the Apache PMC for Hadoop)

3.       Chad Walters (Powerset)

4.       Jeff Eastman (Mahout)

5.       Sameer Paranjpye

·         Yahoo planning to scale to 5,000 nodes in near future (at 2k servers now)

·         Namespace entirely in memory.  Considering implementing volumes. Volumes will share data. Just the volumes will be partitioned.  Volume name spaces will be “mounted” into a shared file tree.

·         HoD scheduling implementation has hit the wall.  Need a new scheduler. HoD was a good short term solution but not adequate for current usage levels.  It’s not able to handle the large concurrent job traffic Yahoo! is currently experiencing.

·         Jobs often have a large virtual partition for the maps. Because they are held during reduce phase, considerable resources are left unused.

·         FIFO scheduling doesn’t scale for large, diverse user bases.

·         What is needed to declare Hadoop 1.0: API Stability, future proof API to use single object parameter, add HDFS single writer append, & Authentication (Owen O’Malley)

·         The Mahout project builds classification, clustering, regression, etc. kernels that run on Hadoop, released under the commercially friendly Apache license.

·         Plans for HBase looking forward:

1.       0.1.0: Initial release

2.       0.2.0: Scalability and Robustness

3.       0.3.0: Performance

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, March 25, 2008 11:51:51 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, March 24, 2008

X-Tracing Hadoop: Andy Konwinski

·         Berkeley student with the Berkeley RAD Lab

·         Motivation: Make Hadoop map/reduce jobs easier to understand and debug

·         Approach: X-trace Hadoop (500 lines of code)

·         X-trace is a path based tracing framework

·         Generates an event graph to capture causality of events across a network.

·         Xtrace collects: Report label, trace id, report id, hostname, timestamp, etc.

·         What we get from Xtrace:

o   Deterministic causality and concurrency

o   Control over which events get traced

o   Cross-layer

o   Low overhead (modest sized traces produced)

o   Modest implementation complexity

·         Want real, high scale production data sets. Facebook has been very helpful but Andy is after more data to show the value of the xtrace approach to Hadoop debugging.  Contact andyk@cs.berkeley.edu if you want to contribute data.

 

ZooKeeper: Benjamin Reed (Yahoo Research)

·         Distributed consensus service

·         Observation:

o   Distributed systems need coordination

o   Programmers can’t use locks correctly

o   Message based coordination can be hard to use in some applications

·         Wishes:

o   Simple, robust, good performance

o   Tuned for read dominant workloads

o   Familiar models and interface

o   Wait-free

o   Need to be able to wait efficiently

·         Google uses Locks (Chubby) but we felt this was too complex an approach

·         Design point: start with a file system API model and strip out what is not needed

·         Don’t need:

o   Partial reads & writes

o   Rename

·         What we do need:

o   Ordered updates with strong persistence guarantees

o   Conditional updates

o   Watches for data changes

o   Ephemeral nodes

o   Generated file names (mktemp)

·         Data model:

o   Hierarchical name space

o   Each znode has data and children

o   Data is read and written in its entirety

·         All APIs take a path (no file handles and no open and close)

·         Quorum-based updates with reads from any server (you may get old data – if you call sync first, the next read will be current at least as of the point in time when the sync ran).  All updates flow through an elected leader (re-elected on failure).

·         Written in Java

·         Started oct/2006.  Prototyped fall 2006.  Initial implementation March 2007.  Open sourced in Nov 2007.

·         A Paxos variant (modified multi-paxos)

·         Zookeeper is a software offering in Yahoo whereas Hadoop
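To make the primitives above concrete (ephemeral nodes, generated sequential names, watches, whole-node reads and writes), here’s a toy in-memory sketch of the data model. It’s purely illustrative, not the real client API, and it leaves out the quorum replication that is the whole point of a real ZooKeeper ensemble.

# Toy in-memory model of the ZooKeeper primitives described above (hierarchical znodes,
# whole-node reads/writes, ephemeral nodes, generated sequential names, data watches).
# Purely illustrative -- a real ensemble replicates these updates via quorum.

class ToyZooKeeper:
    def __init__(self):
        self.nodes = {}        # path -> (data, ephemeral_owner)
        self.watches = {}      # path -> [callback]
        self.seq = 0

    def create(self, path, data=b"", ephemeral_owner=None, sequence=False):
        if sequence:                       # "mktemp"-style generated names
            self.seq += 1
            path = f"{path}{self.seq:010d}"
        self.nodes[path] = (data, ephemeral_owner)
        return path

    def set(self, path, data):
        owner = self.nodes[path][1]
        self.nodes[path] = (data, owner)        # data is written in its entirety
        for cb in self.watches.pop(path, []):   # watches fire once, then must be re-set
            cb(path, data)

    def get(self, path, watch=None):
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes[path][0]

    def session_closed(self, owner):
        # Ephemeral nodes vanish with their owner's session -- handy for group membership.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != owner}

zk = ToyZooKeeper()
member = zk.create("/workers/worker-", b"host1", ephemeral_owner="session-1", sequence=True)
zk.get(member, watch=lambda p, d: print("changed:", p, d))
zk.set(member, b"host1:busy")
zk.session_closed("session-1")    # the worker's znode disappears automatically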

 

Note: Yahoo is planning to start a monthly Hadoop user meeting.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Monday, March 24, 2008 11:50:52 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

JAQL: A Query Language for JSON

·         Kevin Beyer from IBM (did the DB2 XQuery implementation)

·         Why use JSON?

o   Want complete entities in one place (non-normalized)

o   Want evolvable schema

o   Want standards support

o   Didn’t want a document markup language (XML)

·         Designed for JSON data

·         Functional query language (few side effects)

·         Core operators: iteration, grouping, joining, combining, sorting, projection, constructors (arrays, records, values), unnesting, …

·         Operates on anything that is JSON format or can be transformed to JSON and produces JSON or any format that can be transformed from JSON.

·         Planning to

o   add indexing support   

o   Open source next summer

o   Adding schema and integrity support
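I won’t try to reproduce JAQL syntax from memory, but the class of operation it targets (filter, group, and aggregate over JSON records, producing JSON again) looks roughly like this in plain Python. The record fields are made up for illustration.

# Rough illustration of the kind of operation a JSON query language performs:
# filter a collection of JSON records, then group and aggregate, producing JSON again.
# Plain Python standing in for the query language, not JAQL syntax.
import json
from collections import defaultdict

records = [json.loads(line) for line in [
    '{"user": "alice", "action": "view", "bytes": 120}',
    '{"user": "bob",   "action": "edit", "bytes": 900}',
    '{"user": "alice", "action": "edit", "bytes": 300}',
]]

edits = [r for r in records if r["action"] == "edit"]   # filter
totals = defaultdict(int)
for r in edits:                                          # group by user, sum bytes
    totals[r["user"]] += r["bytes"]

print(json.dumps([{"user": u, "total_bytes": b} for u, b in totals.items()]))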

 

DryadLINQ: Michael Isard (Msft Research)

·         Implementation performance:

o   Rather than writing temporaries between every stage, join them together and stream

o   Makes failure recovery more difficult but it’s a good trade off

·         Join and split can be done with Map/Reduce but ugly to program and hard to avoid performance penalty

·         Dryad is more general than Map/Reduce and addresses the above two issues

o   Implements a uniform state machine for scheduling and fault tolerance

·         LINQ addresses the programming model and makes it more accessible

·         Dryad supports changing the resource allocation (number of servers used) dynamically during job execution

·         Generally, Map/Reduce is complex so front-ends are being built to make it easier: e.g. PIG & Sawzall

·         LINQ: general purpose data-parallel programming constructs

·         LINQ+C# provides parsing, type-checking, & is a lazy evaluator

o   It builds an expression tree and materializes data only when requested

·         PLINQ: supports parallelizing LINQ queries over many cores

·         Lots of interest in seeing this code released as open source and in the community building upon it.  Some comments were very positive about how far along the work is, matched with more negative comments on it being closed rather than open source available for others to innovate upon.
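The deferred-execution point above is easy to miss. As a loose analogy only (Python generators rather than LINQ expression trees, and without any of the remoting that makes DryadLINQ interesting), nothing below runs until the results are materialized.

# Loose analogy for LINQ-style deferred execution using Python generators:
# each stage describes work; nothing executes until results are materialized.
# (LINQ builds a real expression tree that DryadLINQ can ship to a cluster;
# generators only capture the laziness, not the remotability.)

def read_records(n):
    for i in range(n):
        print(f"producing {i}")
        yield i

pipeline = (x * x for x in read_records(5) if x % 2 == 0)  # nothing printed yet

print("materializing...")
results = list(pipeline)   # the work happens here, like enumerating a LINQ query
print(results)             # [0, 4, 16]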

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Monday, March 24, 2008 11:47:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

PIG: Web-Scale Processing

·         Christopher Olston

·         The project originated in Y! Research.

·         Example data analysis task: Find users that visit “good” web pages (a rough sketch of this pipeline appears after these notes).

·         Christopher points out that joins are hard to write in Hadoop, there are many ways of writing joins, and choosing a join technique is actually a problem that requires some skill.  Basically the same point made by the DB community years ago.  PIG is a dataflow language that describes what you want to happen logically and then maps it to map/reduce.  The language of PIG is called Pig Latin.

·         Pig Latin allows the declaration of “views” (late bound queries)

·         Pig Latin is essentially a text form of a data flow graph.  It generates Hadoop Map/Reduce jobs.

o   Operators: filter, foreach … generate, & group

o   Binary operators: join, cogroup (“more customizable type of join”), & union

o   Also support split operator

·         How different from SQL?

o   It’s a sequence of simple steps rather than a declarative expression.  SQL is declarative whereas Pig Latin says what steps you want done in what order.  Much closer to imperative programming and, consequently, they argue it is simpler.

o   They argue that it’s easier to build a set of steps, work with each one at a time, and slowly build them up into a complete and correct program.

·         PIG is written as a language processing layer over Map/Reduce

·         He proposed writing SQL as a processing layer over PIG but this code isn’t yet written

·         Is PIG+Hadoop a DBMS? (there have been lots of blogs on this question :-))

o   P+H only support sequential scans super efficiently (no indexes or other access methods)

o   P+H operate on any data format (PIGS eat anything) whereas DBMS only run on data that they store

o   P+H is a sequence of steps rather than a sequence of constraints as used in DBMS

o   P+H has custom processing as a “first class object” whereas UDFs were added to DBMSs later

·         They want an Eclipse development environment but don’t have it running yet. Planning an Eclipse Plugin.

·         Team of 10 engineers currently working on it.

·         New version of PIG to come out next week will include “explain” (shows mapping to map/reduce jobs to help debug).

·         Today PIG does joins exactly one way; they are adding more join techniques.  There aren’t explicit stats tracked other than file size.  The next version will allow the user to specify the join technique, and they will explore optimization.
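Here’s the “find users that visit good pages” example expressed as a sequence of explicit dataflow steps in plain Python. This is standing in for Pig Latin, whose syntax I won’t attempt from memory; the field names and the 0.5 pagerank cutoff are made up for illustration.

# The example task as explicit dataflow steps: load, filter, join, group, aggregate.
# Plain Python standing in for Pig Latin; field names and the 0.5 cutoff are illustrative only.
from collections import defaultdict

visits = [("amy", "cnn.com"), ("amy", "frogs.com"), ("fred", "snails.com")]
pages = {"cnn.com": 0.9, "frogs.com": 0.2, "snails.com": 0.4}   # url -> pagerank

good_pages = {url for url, rank in pages.items() if rank > 0.5}           # FILTER
good_visits = [(user, url) for user, url in visits if url in good_pages]  # JOIN (hash join on url)

by_user = defaultdict(list)                                               # GROUP BY user
for user, url in good_visits:
    by_user[user].append(pages[url])

result = {user: sum(ranks) / len(ranks) for user, ranks in by_user.items()}  # aggregate
print(result)   # {'amy': 0.9}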

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Monday, March 24, 2008 11:46:32 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

Yahoo is hosting a conference, the Hadoop Summit, down in Sunnyvale today. There are over 400 attendees, of which more than ½ are current Hadoop users and roughly 15 to 20% are running more than 100-node clusters.

 

I’ll post my rough notes from the talks over the course of the day.  So far, it's excellent. My notes on the first two talks are below.

 

                                    --jrh

 

Hadoop: A Brief History

·         Doug Cutting

·         Started with Nutch in 2002 to 2004

o   Initial goal was web-scale, crawler-based search

o   Distributed by necessity

o   Sort/merge based processing

o   Demonstrated on 4 nodes over 100M web pages. 

o   Was operationally onerous. “Real” web scale was still a ways away

·         2004 through 2006: Gestation period

o   GFS & MapReduce papers published (addressed the scale problems we were having)

o   Added DFS and MapReduce to Nutch

o   Two part-time developers over two years

o   Ran on 20 nodes at Internet Archive (IA) and UW

o   Much easier to program and run

o   Scaled to several 100m web pages

·         2006 to 2008: Childhood

o   Y! hired Doug Cutting and a dedicated team to work on it reporting to E14 (Eric  Baldeschwieler)

o   Hadoop project split out of Nutch

o   Hit web scale in 2008

 

Yahoo Grid Team Perspective: Eric Baldeschwieler

·         Grid is Eric’s team’s internal name

·         Focus:

o   On-demand, shared access to vast pools of resources

o   Support massive parallel execution (2k nodes and roughly 10k processors)

o   Data Intensive Super Computing (DISC)

o   Centrally provisioned and managed

o   Service-oriented, elastic

o   Utility for users and researchers inside Y!

·         Open Source Stack

o   Committed to open source development

o   Y! is Apache Platinum Sponsor

·         Projects on Eric’s team:

o   Hadoop:

§  Distributed File System

§  MapReduce Framework

§  Dynamic Cluster Management (HOD)

·         Allows sharing of a Hadoop cluster with 100’s of users at the same time.

·         HOD: Hadoop on Demand. Creates virtual clusters using Torque (an open source resource manager).  Allocates the cluster into many virtual clusters.

o   PIG

§  Parallel Programming Language and Runtime

o   Zookeeper:

§  High-availability directory and configuration service

o   Simon:

§  Cluster and application monitoring

§  Collects stats from 100’s of clusters in parallel (fairly new so far).  Also will be open sourced.

§  All will eventually be part of Apache

§  Similar to Ganglia but more configurable

§  Builds real time reports.

§  Goal is to use Hadoop to monitor Hadoop.

·         Largest production clusters are currently 2k nodes.  Working on more scaling.  Don’t want to have just one cluster but want to run much bigger clusters. We’re investing heavily in scheduling to handle more concurrent jobs.

·         Using 2 data centers and moving to three soon.

·         Working with Carnegie Mellon University (Yahoo provided a container of 500 systems – it appears to be a Rackable Systems container)

·         We’re running Megawatts of Hadoop

·         Over 400 people expressed interest in this conference.

o   About ½ the room running Hadoop

o   Just about the same number running over 20 nodes

o   About 15 to 20% running over 100 nodes

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Monday, March 24, 2008 11:45:30 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

I’ve long been a big fan of modular data centers using ISO standard shipping containers as the component building block.

Containers have revolutionized shipping and are by far the cheapest way to move goods over sea, land, rail or truck. I’ve seen them used to house telecommunications equipment, power generators, and even stores and apartments have been made using them: http://www.treehugger.com/files/2005/01/shipping_contai.php.

 

The datacenter-in-a-box approach to datacenter design is beginning to be deployed more widely with Lawrence Berkeley National Lab having taken delivery of a Sun Black Box and a “customer in eastern Washington” having taken delivery of a Rackable Ice Cube Module earlier this year.

 

Last summer I came across a book on Shipping Containers by Marc Levinson: The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger. It’s a history of containers from the early experiments in 1956 through to mega-containers terminals distributed throughout the world. The book doesn’t talk about all the innovative applications of containers outside of shipping but does give an interesting background on their invention, evolution, and standardization.

 

                                                -jrh

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Monday, March 24, 2008 10:42:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Tuesday, March 18, 2008

Earlier today I viewed Steve Jobs’ 2005 Commencement Speech at Stanford University. In this talk Jobs recounts three stories and ties them together with a common theme.  The first was dropping out of Reed College and showing up for the courses he wanted to take rather than spending time on those he had to take. Dropping out was a tough decision but some of what he learned in these audited courses had a fundamental impact on the Mac and, in retrospect, it appeared to be a good decision or at least one that led to a good outcome.  Getting fired from Apple was the second.  Clearly not his choice, not what he would have wanted to happen, but it led to Pixar, NeXT, and rejoining Apple stronger and more experienced than before. Again, a tough path but one that may have led to a better overall outcome. Likely he is a better and more capable leader for the experience.  Finally, facing death. Death awaits us all and, when facing death, it becomes clear what really matters.  It becomes clear that following your heart is what is really important. Don’t be trapped by dogma, don’t live other people’s lives, and have the courage to follow your own intuition. Clearly nobody wants to approach death but knowing it is coming can free each of us to realize we can’t hide, we don’t have forever, and those things that scare us most are really tiny and insignificant when compared with death. Facing death can free us to take chances and to do what is truly important even if success looks uncertain or the risk is high.

 

The theme that wove these three stories together, and Jobs’ parting words for his listeners, was to “stay hungry and stay foolish”.


It’s a good read: http://news-service.stanford.edu/news/2005/june15/jobs-061505.html.  Or you can view it at: http://www.youtube.com/watch?v=D1R-jKKp3NA.

 

Sent my way by Michael Starbird-Valentine.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Tuesday, March 18, 2008 11:55:59 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Saturday, March 15, 2008

Theo Jansen is a Dutch artist and engineer.  His work is truly amazing. What he builds are massive synthetic animals, many more than a story tall, that walk.  Their gait is surprisingly realistic and they are wind powered. More than anything they are spooky and yet deeply engaging.

 

Check out a selection of Jansen’s work on Marni’s blog: http://delvecem.spaces.live.com/blog/cns!58EB8CB464718427!148.entry.

 

                                                --jrh      

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Saturday, March 15, 2008 11:56:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, March 12, 2008

Past experience suggests that disk and memory are the most common server component failures but what about power supplies and motherboards?  Amaya Souarez of Global Foundation Services pulled the data on component replacements for the last six months of 2007 and we saw this distribution:

 

1.       Disks: 59.0%

2.       Memory: 23.1%

3.       Disk controller: 5.0% (Fiber HBAs and array controllers)

4.       Power supply: 3.4%

5.       Fan: 1.1%

6.       NIC: 1.0%

 

After disk and memory the numbers fade to the noise fairly quickly.  Not a big surprise. What I did find quite surprising was the percentage of systems requiring service.  Some systems will require service more than once and some systems will have multiple components replaced in a single service.  Ignoring these factors and treating each logged component replacement as a service event, in the sample we looked at, we found we had 192 service events per 1,000 servers in six months. Making the reasonable assumption that this data is not sensitive to the time of year, that would be 384 service events per 1,000 servers per year.

 

The good news is that server service is fairly inexpensive.  Nonetheless, these data reinforce the argument I’ve been making for the last couple of years: the service-free, fail-in-place model is where we should be going over the longer term.  I wrote this up in http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_CIDR.doc but the basic observation is that the cost per server continues to decline while people costs don’t. 

 

Going to a service-free model can save service costs but, even more interesting, in this model the servers can be packaged in designs optimized for cooling efficiency without regard to human access requirements. If technicians need to go into a space, then the space needs to be safe for humans and meet multiple security regulations, a growing concern, and there needs to be space for them.  I believe we will get to a model where servers are treated like flash memory blocks: you have a few thousand in a service-free module, over time some fail and are shut off, and the overall module capacity diminishes over time.  When server work done/watt improves sufficiently or when the module capacity degrades far enough, the entire module is replaced and returned to the manufacturer for recycling.
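To make the fail-in-place idea a little more concrete, here’s a toy capacity-decay model. The attrition rate and replacement threshold are illustrative assumptions only (they aren’t derived from the replacement data above), but the shape of the model is the point: failed servers are switched off rather than repaired, and the whole module is swapped once capacity drops far enough.

# Toy fail-in-place model: servers that fail are switched off rather than repaired,
# so module capacity decays until the whole module is swapped out.
# The attrition rate and replacement threshold below are illustrative assumptions only.

ANNUAL_ATTRITION = 0.05      # assumed fraction of servers lost per year with no servicing
REPLACE_THRESHOLD = 0.80     # assumed: swap the module once capacity drops below 80%
SERVERS_PER_MODULE = 2000

capacity = SERVERS_PER_MODULE
year = 0
while capacity / SERVERS_PER_MODULE > REPLACE_THRESHOLD:
    year += 1
    capacity *= (1 - ANNUAL_ATTRITION)
    print(f"year {year}: {capacity:.0f} of {SERVERS_PER_MODULE} servers still in service")

print(f"module falls below {REPLACE_THRESHOLD:.0%} capacity in year {year}; replace and recycle")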

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, March 12, 2008 11:57:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware | Services
 Thursday, March 06, 2008

Rules of thumb help us understand complex systems at a high level.  Examples are that high performance server disks will do roughly 180 IOPS, or that enterprise system administrators can manage roughly 100 systems.  These numbers ignore important differences between workloads and therefore can’t be precise, but they serve as a quick check. They ignore, for example, that web servers are MUCH easier to administer than database servers.  Whenever I’m looking at a new technique, algorithm, or approach, I always start with the relevant rules of thumb and note the matches and differences.  Where it differs, I look deeper.  Sometimes I find great innovation and learn that the rules need to be updated to take into account the efficiencies of this new approach.  But, more frequently, I find an error in the data, incorrect measurement technique, an algorithm that only works over a narrow set of workloads, or other restriction.  Basically, a good repertoire of rules of thumb is useful in helping to find innovation and quickly spot mistakes.

 

Everyone does this at some level, although often they aren’t using formal rules but more of an informal gut feel.  This gut feel helps people avoid mistakes and move more quickly without having to stare at each new idea and understand it from first principles. But there is a danger.  Every so often, the rules change and if you don’t update your “gut feel” you’ll miss opportunities and new innovations.

 

Over the years, I’ve noticed that the duration from the first breakthrough idea on a topic to it actually making sense and having broad applicability is 7 to 10 years. The earliest research work is usually looking beyond current conditions and when the ideas are first published, we usually don’t know how to employ them, don’t yet have efficient algorithms, or the breadth of problems solved by the new ideas is not yet sufficiently broad.  It usually  takes 7 to 10 years to refine and generalize an idea to the point where it is correct and broadly useable.

 

Now you would think that once an idea “makes sense” broadly, once it’s been through its 7 to 10 years of exile, it would be ready for broad deployment.  Ironically, there is yet one more delay and this one has been the death of many startups and a great many development projects.  Once an idea is clearly “correct”, applicable, and broadly generalized, enterprise customers still typically won’t buy it for 5 to 7 years.  What happens is that the new idea, product, or approach violates their rules of thumb and they simply won’t buy it until the evidence builds over time and they begin to understand that the rules have changed.  Some examples:

·         Large memories: In the early ’90s it became trivially true that very large memories were the right answer for server-side workloads, especially database workloads.  The combination of large SMPs and rapidly increasing processor performance coupled with lagging I/O performance and falling memory prices made memory a bargain.  You couldn’t afford to not buy large memories and yet many customers I was working with at the time were much more comfortable buying more disk and more CPU even though it was MORE expensive and less effective than adding memory.  They were trapped in their old rules of thumb on memory cost vs value.

·         Large SMPs: In the late ’80s customers were still spending huge sums of money buying very high end water cooled ECL mainframes when they should have been buying the emerging commodity UNIX SMPs and saving a fortune.  It takes a while for customers and the market as a whole to move between technologies.

·         Large clusters: in the late ’90s and to a lesser extent even to this day, customers often buy very large SMPs when they should be buying large clusters of commodity servers.  Sure, they have to rewrite software systems to run in this environment but, even with those costs, in large deployments, mammoth savings are possible.   It’ll take time before they are comfortable and it’ll take time before they have software that’ll run in the new environment. 

 

Basically, once an idea becomes “true” it still has 5 to 7 more years before it’s actually in broad use. What do we learn from this?  First, it’s very easy to be early and to jump on an idea before it’s time.  The dot com era was full of failures even though the same idea will actually succeed in the hands of a new startup over the next couple of years (5 to 7 year delay).  It’s hard to have the right amount of patience and it’s very hard to sell new ideas when they violate the current commonly held rules of thumb.  The second thing we learn is to check our rules of thumb more frequently and to get good at challenging  the rules  of thumb or gut feels of others when trying to get new ideas adopted.  Understand that some of our rules might no longer be true and that some of the rules of thumb used by the person you are speaking with may be outdated.

 

Here are four rules of thumb, all of which were inarguably true at one point in time, and each is either absolutely not true today or on the way to being broken:

·         Compression is a lose in an OLTP system: This is a good place to start since compression is a clear and obvious win today. Back in 1990 or thereabout I argued strongly against adding compression to DB2 UDB (where I was lead architect at the time).  At the time, a well tuned OLTP system was CPU bound and sufficient I/O devices had been added to max out the CPU. The valuable resource was CPU in that you could always add more disk (at a cost) but you can’t just add more CPUs.  At the time, 4-way to 8-way systems were BIG and CPUs were 100x slower than they are today.  Under those conditions, it would be absolutely nuts to trade off CPU for a reduction in I/O costs for the vast majority of workloads.  Effectively we would be getting help with a solvable problem and, in return, accepting more of an unsolvable problem.  I was dead against it at the time and we didn’t do it then.  Today, compression is so obviously the right answer it would be nuts not to do it.  Most very large scale services are running their OLTP systems over clusters of many database servers.  They have CPU cycles to burn but I/O is what they are short of and where the costs are.  Any trick that can reduce I/O consumption is worth considering and compression is an obvious win.  In fact, compression is now making sense higher up the memory hierarchy and there are times when it makes sense to leave data compressed in memory, only decompressing when needed rather than when first brought in from disk. Compression is obviously a win in high end OLTP systems and beyond and, as more customers move to clusters and multi-core systems, this will just keep getting clearer.

 

·         Bottlenecks are either I/O, CPU, or inter-dispatch unit contention: I love performance problems and have long used the magic three as a way of getting a high level understanding of what’s wrong with a complex system that is performing poorly.  The first thing I try to understand when looking at a poorly performing system is whether the system is CPU bound, I/O bound (network, disk, or UI if there is one), or contention bound (processes or threads blocked on other processes or threads – basically, excess serialization).  Looking at these three has always been a gross simplification but it’s been a workable and useful simplification and has many times guided me to the source of the problem quickly.  Increasingly, these three are inadequate in describing what’s wrong with a slow system in that memory bandwidth/contention is increasingly becoming the blocker.  Now in truth, memory bandwidth has always been a factor but it typically wasn’t the biggest problem.  Cache conscious algorithms attempt to solve the memory bandwidth/contention problem but how do you measure whether you have a problem or not?  How do you know if the memory and/or cache hierarchy is the primary blocking factor? 

 

The simple answer is actually very near to the truth but not that helpful: it probably IS the biggest problem.  Memory bandwidth/contention is becoming at least the number two problem for many apps, only behind I/O contention. And, for many, it’s the number one performance issue.  How do we measure it quickly and easily and understand the magnitude of the problem?  I use cycles per instruction as an approximation.  CPUs are mostly super-scalar which means they can retire (complete) more than one instruction per clock cycle.  What I like to look at is the rate at which instructions are retired.  This gives a rough view of how much time is being wasted in pipeline stalls, most of which are caused by waiting on memory.  As an example to give a view for the magnitude of this problem, note that most CPUs I’ve worked with over the last 15 years can execute more than one instruction per cycle and yet I’ve NEVER seen it happen running a data-intensive, server-side workload.  Good database systems will run in the 2.0 to 2.5 cycles per instruction (CPI) and I’ve seen operating system traces that were as bad as 7.5 cycles per instruction.  Instead of executing multiple instructions per cycle, we are typically only executing fractions of instructions each cycle.  So, another rule of thumb is changing: you now need to look at the CPI of your application in addition to the big three if you care deeply about performance (a rough sketch of this check appears after these examples).

 

·         CPU cycles are precious: This has always been true and is still true today but it’s become less true fast.  10 years ago, most systems I used, whether clients or servers, were CPU bound.  It’s rare to find a CPU bound client machine these days.  Just about every laptop in the world is now I/O bound.  Servers are quickly going the same route and many are already there.  CPU performance increases much faster than I/O performance so this imbalance will continue to worsen.  As the number of cores per socket climbs, this imbalance will climb at an accelerated pace.  We talked above about compression making sense today. That’s because CPU cycles are no longer the precious resource. Instead of worrying about path length, we should be spending most of our time reducing I/O and memory references.  CPU cycles are typically not the precious resource and multi-core will accelerate the devaluation of CPU resources over memory and I/O resources.

 

·         Virtual Machines can’t be used in services: I chose this as the last example as it is something that I’ve said in the last 12 months and yet it’s becoming increasingly clear that this won’t stay true for long.  It’s worth looking more closely at this one.  First, the argument behind virtual machines not making sense in the services world is this: when you are running 10s to 1000s of systems of a given personality, why mix personalities on the same box?  Just write the appropriate image to as many systems as you need and run only a single application personality per server.  To get the advantages of dynamic resource management often associated with using virtual machines, take a system running one server personality and re-image it to another.  It’s easy to move the resources between roles.  In effect you get all the advantages without the performance penalty of virtualization.  And, by the way, it’s the virtualization penalty that really drives this point home.  I/O intensive workloads often run as badly as 1/5 the speed (20% of native) when running in a virtual machine and ½ the throughput is a common outcome. The simple truth is that the benefits of using VMs in high scale services are too easy to obtain using other not quite as elegant techniques and the VM overhead at scale is currently unaffordable.  Who could afford to take a 20,000 system cluster and, post virtualization, have to double the server count to 40,000? It’s simply too large a bill to pay today.  What would cause this to ever become interesting?  Why would we ever run virtual machines in a service environment?

 

At a factor of two performance penalty, it’s unlikely to make sense in large services (hardware is the number one cost in services -- well ahead of all others – I’ll post the detailed breakdown sometime) so doubling the server count comes close to adding 25% to the overall service cost.  Unacceptable.  Where they do make sense:

 

1.       When running non-trusted code, some form of isolation layer is needed.  ASP and the .Net Common Language Runtime are viable isolation boundaries but, for running arbitrary code, VMs are an excellent choice.  That’s what Amazon EC2 uses for isolation. Another scenario used by many customers is to run lightweight clients and centrally host the client side systems in the data center.  For these types of scenarios, VMs are a great answer.

2.       When running very small services.  Large services need 100s to 10s of thousands of systems so sharing servers isn’t very interesting.  However, for every megaservice, there are many very small services some of which could profit from server resource sharing. 

3.       Hardware independence. One nice way to get real hardware independence and to enforce it is to go with a VM.  I wouldn’t pay a factor of two overhead to get that benefit but, for 10%, it probably would make sense.  You could pay for that overhead with the reduction in operational complexity, which lowers costs, brings increased reliability, and allows more flexible and effective use of hardware resources.

 

As a final observation in support of VMs in the longer term, I note that the resource they are wasting in largest quantity today is the resource that is about to be the most plentiful looking forward.  Multi-core, many core, and the continuing separation of compute performance vs I/O performance, will make VMs a great decision and a big part of the future service landscape.

 

This last example is perhaps the most interesting example of a changing rule of thumb in that it’s one that is still true today.  In mega-deployments, VMs aren’t worth the cost.  However, it looks like this is very unlikely to stay true.  As VM overhead is reduced and the value of the squandered CPU resources continues to fall, VMs will look increasingly practical in the service world.  It’s a great example of a rule of thumb that is about to be repealed.
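Returning to the cycles-per-instruction rule of thumb above, here’s the rough check I have in mind. The counter values would come from whatever hardware performance counter tool you have handy (perf on Linux, for example); the example numbers and the thresholds are simply the rules of thumb from the discussion above, not measurements.

# Rough CPI (cycles per instruction) check from hardware counter deltas.
# Counter values come from whatever PMU tool you use (e.g. "perf stat -e cycles,instructions"
# on Linux); the thresholds below are the rules of thumb from the text, not hard limits.

def cpi_report(cycles, instructions_retired):
    cpi = cycles / instructions_retired
    if cpi <= 1.0:
        verdict = "superscalar heaven -- rarely seen on data-intensive server workloads"
    elif cpi <= 2.5:
        verdict = "typical of a well-tuned database engine (roughly 2.0 to 2.5)"
    else:
        verdict = "heavily stalled -- likely memory/cache bound; look at locality"
    return cpi, verdict

cycles, instructions = 7.2e9, 3.1e9        # example counter deltas over a measurement window
cpi, verdict = cpi_report(cycles, instructions)
print(f"CPI = {cpi:.2f}: {verdict}")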

 

Rules of thumb help us quickly get a read on a new idea, algorithm or approach.   They are a great way to quickly check a new idea for reasonableness. But, they don’t stay true forever.  Ratios change over time. Make sure you are re-checking your rules every year or so and, when selling new ideas, be aware the person you are talking to may be operating under an outdated set of rules.  Bring the data to show which no longer apply, or you’ll be working harder than you need to.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Thursday, March 06, 2008 11:58:48 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Wednesday, March 05, 2008

When you look at disk transfer rates, it’s pretty obvious that the faster you spin them, the lower the rotational latency and the better the transfer rates.  It’s also very clear that disk transfer rates are improving much more slowly than memory subsystem bandwidth.  Why not rotate disks faster?  We’ve had 15,000 RPM enterprise disks for a considerable time.  Why have there been no recent improvements on this dimension?

 

The reason I’ve long understood is that power consumption scales cubically with RPM, but there is more to it than that.  I recently came across a claim that power scaled with the fifth power of RPM. That didn’t seem right so I decided to go check the facts.  I asked Dave Anderson of Seagate Research if he could help chase down the real data. By the way, if you ever get a chance to see Dave speak, jump on it. He knows an enormous amount about all aspects of spinning media technology and, when he really gets down to the details, it’s amazing that disks can work at all.  At the outside edge of a 15,000 RPM drive, the platter is going by the head at 118 MPH, the head is flying at 0.5 microinches (less than the wavelength of visible light), the track width is less than 1/300 the width of a page, nearby vibration sources such as other disks mean the platter will be vibrating slightly, and there is always some non-repeatable runout (NRRO).  As the track shoots by the head, the blocks need to be read, decoded, identified to ensure the right block was read, and error checked, so significant computational work needs to be done as well. Phenomenal technology.

 

Disks really are amazing but, returning to why RPMs haven't been increasing of late, I checked with Dave and the short answer is cost. As RPM is increased, NRRO increases quickly, as does vibration, both of which lead to serious tracking complexity. The cost of the drive motors goes up, air drag increases dramatically, and so does power consumption. What's the exact relationship between disk power consumption and RPM? From Dave:

 

There are 2 general cases for power consumption today, a cube relation for the general case, and a square relation, where the discs are packed so closely together, they approximate a cylinder.

 

Apparently there can still be a fourth-power (**4) relation where there is extensive turbulence in a single disk drive.
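A minimal sketch of what those relations imply follows; the baseline spindle wattage is an assumed, illustrative figure, not a measured one.

```python
# Rough scaling of spindle power with RPM under the cube and square relations.
# base_watts is an assumed illustrative baseline, not a measured value.
def spindle_power(rpm, base_rpm=15000, base_watts=12.0, exponent=3):
    return base_watts * (rpm / base_rpm) ** exponent

for rpm in (10000, 15000, 20000):
    cube = spindle_power(rpm, exponent=3)    # general case
    square = spindle_power(rpm, exponent=2)  # discs packed into a near-cylinder
    print(f"{rpm:>6} RPM: ~{cube:4.1f}W (cube law), ~{square:4.1f}W (square law)")
```

Even under the gentler square relation, a move from 15,000 to 20,000 RPM pushes spindle power up sharply.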

 

Power consumption going up cubically is problematic, but enterprise disks today draw only 15 to 18W each. I would never recommend wasting power, but many deployments need to use more disks to get the performance they need; in effect, the power is often getting spent to support the given workload anyway. Would less power be consumed by fewer, faster disks? Further complicating the equation and making that gain harder to find, rotational latency improvements diminish rapidly as RPM goes up. The improvement from the current high of 15,000 RPM to, say, 20,000 RPM isn't very significant:

 

  Dave Anderson Chart
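The arithmetic behind the diminishing returns is simple enough to sketch: average rotational latency is half a revolution, so it only improves as 1/RPM.

```python
# Average rotational latency is half a revolution: (60 / RPM) / 2 seconds.
def avg_rotational_latency_ms(rpm):
    return (60.0 / rpm) / 2.0 * 1000.0

for rpm in (7200, 10000, 15000, 20000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms average rotational latency")
```

Going from 7,200 to 15,000 RPM saves over 2 ms per access; going from 15,000 to 20,000 RPM saves only another 0.5 ms, while power and mechanical complexity climb steeply.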

 

I suspect we will see increased RPM eventually, but the increases in cost and power, coupled with diminishing returns and far greater complexity from all the factors listed above, suggest RPMs won't be going up very quickly. Some speculate we've seen the last RPM increase, but I've never been a subscriber to hit-the-wall theories. We always eventually find a way.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, March 05, 2008 12:03:44 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Monday, March 03, 2008

In past blog entries, I've talked about the impact of flash on server-side systems. On the client, flash SSDs can help with at least two different markets, paradoxically at opposite ends of the cost spectrum: 1) economy, low-cost laptops, and 2) high-performance laptops. At the low end, flash can help make less expensive systems in that hard disk drives are mechanical devices with motors and actuators, and so they have a price floor. Even very small disks need a motor and an actuator, so getting an HDD for much less than $50 is quite difficult. For very low-cost devices with very small storage requirements, a flash SSD can be cheaper than a disk of similar size. And, in addition to being cheaper than HDDs in very small form factors, flash SSDs also consume less power, are more durable, and can operate reliably in broader environmental conditions. Perhaps the prototypical inexpensive laptop is the one from the One Laptop Per Child project. It uses NAND flash for persistent storage: http://wiki.laptop.org/index.php/Hardware_specification.
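To illustrate the price-floor argument, here's a quick sketch with purely made-up numbers; none of the dollar figures below are real quotes.

```python
# Illustrative only: every price below is an assumption, not a real quote.
def storage_costs(capacity_gb, flash_per_gb=8.0, hdd_floor=50.0, hdd_per_gb=0.50):
    flash_cost = capacity_gb * flash_per_gb
    hdd_cost = max(hdd_floor, capacity_gb * hdd_per_gb)  # motors and actuators set a floor
    return flash_cost, hdd_cost

for gb in (2, 4, 8, 16, 32):
    flash_cost, hdd_cost = storage_costs(gb)
    winner = "flash" if flash_cost < hdd_cost else "HDD"
    print(f"{gb:>3} GB: flash ~${flash_cost:.0f}, HDD ~${hdd_cost:.0f} -> {winner} wins")
```

With these assumptions flash wins below a handful of gigabytes, which is exactly the regime very low-cost devices with small storage requirements live in.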

 

At the other end of the spectrum, in high-end laptops where performance, light weight, silence, and long battery life are all important factors, NAND flash SSDs are again becoming common. Many high-end laptops now ship with flash SSDs rather than an HDD. Some examples: Samsung, Sony Vaio, Dell Latitude, HP Compaq, Asus, and many others.

 

Flash SSDs are also emerging as a common choice in ruggedized laptops, due to the broad environmental range within which they operate reliably, and in Ultra-Mobile PCs such as the Samsung Q1.

 

Flash SSDs are on track to be used in a broad percentage of the laptop market. EE Times, for example, estimates that flash SSDs will be the supplied storage medium in 38% of the laptop market.

 

That estimate is from http://www.eetimes.com/showArticle.jhtml?articleID=204400359 (Jack Creasey sent the article my way). Looking specifically at corporate laptops, I would expect the penetration to be far higher than 38%.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Monday, March 03, 2008 12:05:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, February 26, 2008

The internet was designed in a different time at a different scale. It's rare that a design continues to work at all when scaled up by multiple orders of magnitude, so it remains impressive, but there are issues. The blackholing of YouTube over the weekend showed one of them. Routing is fragile, open to administrative error and to certain forms of attack, but this particular example was the more common one: human error.

 

Over the weekend, a decision made in Pakistan took YouTube down for two hours. Here's what happened. Pakistan Telecom received a government order to block Pakistani access to YouTube. They did this for their own network (most of Pakistan) but advertised the incorrect route to their provider, PCCW, as well. PCCW shouldn't have accepted it but did. Since PCCW is big and hence fairly credible, the error propagated quickly throughout the world from there.
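The mechanics are easy to see in a small sketch: forwarding prefers the most specific (longest) matching prefix, so a bogus, more-specific announcement captures the traffic. The prefixes below are illustrative placeholders, not the actual ones involved in the incident.

```python
# Longest-prefix-match illustration of a route hijack.
# Prefixes are illustrative placeholders, not the real ones from the incident.
import ipaddress

routes = {
    ipaddress.ip_network("10.1.0.0/22"): "legitimate origin",
    ipaddress.ip_network("10.1.1.0/24"): "bogus, more specific announcement",
}

def chosen_route(destination):
    dest = ipaddress.ip_address(destination)
    matches = [net for net in routes if dest in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(chosen_route("10.1.1.25"))  # -> bogus, more specific announcement
print(chosen_route("10.1.2.25"))  # -> legitimate origin
```

BGP layers AS-path and policy on top of this, but the longest-match behavior is what lets a single bad announcement blackhole a site once a large provider accepts and re-advertises it.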

 

This incident was inconvenient, but the same sort of attack can be constructed intentionally to disrupt access to a web site or to direct users to a site masquerading as another.

 

Perhaps the best detail is on the Renesys blog: http://www.renesys.com/blog/2008/02/pakistan_hijacks_youtube_1.shtml. Other sources: http://www.news.com/8301-10784_3-9878655-7.html, http://www.nytimes.com/2008/02/26/technology/26tube.html?_r=1&ref=business&oref=slogin.  

 

                        --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Tuesday, February 26, 2008 12:07:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, February 22, 2008

I get several hundred emails a day, some absolutely vital and needing prompt action and some about the closest thing there is to corporate spam. I know I'm not alone. I've developed my own systems for managing the traffic load and, on different days, have varying degrees of success in sticking to them. In my view, it's important not to confuse "processing email" with what we actually get paid to do. Email is often the delivery vehicle for work needing to be done and work that has been done, but email isn't what we "do".

 

We all need to find ways of coping with all the email while still getting real work done and having a shot at a life outside of work.  My approach is fairly simple:

·         Don’t process email in real time or it’ll become your job.  When I’m super busy, I process email twice a day: early in the morning and again in the evening.  When I’m less heavily booked, I’ll try to process email in micro bursts rather than in real time.  It’s more efficient and allows more time to focus on other things.

·         Shut off email arrival sounds and the “new mail” toast or you’ll end up with 100 interruptions an hour and get nothing done but email.

·         I get up early and try to get my email down to under 10 each morning. I typically fail but get close. And I hold firm on that number once a week: each weekend I do get down to fewer than 10 messages. If I enter the weekend with hundreds of email items, I end up working all weekend, which is a great motivator not to take a huge number of unprocessed messages into the weekend.

·         Do everything you can to process a message fully in one touch. I work hard to process email once. As I work through it, I delete or respond to everything I can quickly. Messages that really do require more work I divide into two groups: 1) things I will do today or, at the very latest, by end of week, which I flag with a priority and leave in my inbox for processing later in the day (many argue these should be moved to a separate folder, and they may be right); and 2) longer-lived items, which go into my todo list and come out of my inbox. Because I get my email down to under 10 each week and spend as much of my weekend as needed to do this, I'm VERY motivated not to have many emails hanging around waiting to be processed. Consequently, most email is handled up front as I see it and the big things are moved to the todo list. Very few are prioritized for handling later in the day.

·         I chose not to use rules to auto-file email, primarily because I found that if email went directly to another folder, I almost never looked at it. So I let everything come into my inbox, deal with it quickly, and the vast majority gets touched only once. If I really don't even want to see something once, I just don't subscribe or I ask not to get it.

·         Set your draft folder to be your inbox. With email systems that use a separate folder for unsent mail, there is a risk that you'll get a message 90% written and ready to be sent, get interrupted, and then forget to send it. I set my draft folder to be my inbox so I don't lose unsent email. Since my email is worked down to under 10 daily, I'll find it there for sure before end of day.

·         Don’t bother with complicated folder hierarchies—they are time-consuming to manage. If you want to save something, save it in a single folder or simple folder hierarchy and let desktop search find it when you need it.  Don’t waste time filing in complex ways.

·         Finally, be realistic: if you can’t process at the incoming rate, it’ll just keep backing up indefinitely.  If you aren’t REALLY going to read it, then delete it or file it on the first touch.  Filing it has some value in that, should you start to care more in the future, you can find it via full text search and read it then. 

 

Jeff Johnson of MSN pointed out this excellent talk on email management called “Inbox Zero” by Merlin Mann: http://www.43folders.com/2007/07/25/merlins-inbox-zero-talk.  Merlin’s advice is good and he presents well.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Friday, February 22, 2008 12:09:06 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
