James Hamilton's Blog RSS 2.0
 Wednesday, June 11, 2008

Jeff Dean did a great talk at Google IO this year. Some key points from Steve Garrity (msft pm) and some note from the excellent write-up at Google spotlights data center inner workings:

·         many unreliable servers to fewer high cost servers

·         Single search query touches 700 to up to 1k machines in < 0.25sec

·         36 data centers containing > 800K servers

o   40 servers/rack

·         Typical H/W failures: Install 1000 machines and in 1 year you’ll see: 1000+ HD failures, 20 mini switch failures, 5 full switch failures, 1 PDU failure

·         There are more than 200 Google File System clusters

·         The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines

·          MapReduce is increasing used within Google.

o   29,000 jobs in August 2004 and 2.2 million in September 2007

o   Average time to complete a job has dropped from 634 seconds to 395 seconds

o   Output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes

·         Typical day will run about 100,000 MapReduce jobs

o   each occupies about 400 servers

o   takes about 5 to 10 minutes to finish

 

More detail on the typical failures during the first year of a cluster from Jeff:

·         ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

·         ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

·         ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

·         ~1 network rewiring (rolling ~5% of machines down over 2-day span)

·         ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

·         ~5 racks go wonky (40-80 machines see 50% packetloss)

·         ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

·         ~12 router reloads (takes out DNS and external vips for a couple minutes)

·         ~3 router failures (have to immediately pull traffic for an hour)

·         ~dozens of minor 30-second blips for dns

·         ~1000 individual machine failures

·         ~thousands of hard drive failures

 

A pictorial history of Google hardware through the years starting with the current generation server hardware and working backwards from Jeff’s talk at the 2007 Seattle Scalability Conference:

Current Generation Google Servers

 

Google Servers 2001

 

Google Servers 2000

 

Google Servers 1999

 

Google Servers 1997

My general rule on hardware is that, if you have a viewing window into the data center, you are probably spending too much on servers. The Google model of cheap servers with software redundancy is the only economic solution at scale.

 

Other notes from Google IO:

·         http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

·         http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx

·         http://perspectives.mvdirona.com/2008/05/30/IO2008RoughNotesFromSelectedSessionsAtGoogleIODay2.aspx

 

All pictures above courtesy of Jeff Dean.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Wednesday, June 11, 2008 4:57:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Sunday, June 08, 2008

I’m interested in high-scale web sites, their architecture and their scaling problems.  Last Thursday, Oren Hurvitz posted a great blog entry summarizing two presentations at Java One on the LinkedIn service architecture.

 

LinkedIn scale is respectable:

·         22M members

·         130M connections

·         2M email messages per day

·         250k invitations per day

 

No big surprises other than LinkedIn are still using a big, expensive commercial RDBMS and a large central Sun SPARC Server. LinkedIn call this central server “The Cloud” and its described as the “backend server caching the entire LinkedIn Network”.  Note that “server” is not plural – it appears that the cloud is not partitioned  but it is replicated 40 times. However, each instance of the cloud hosts the entire social graph in 12GB of memory. 

 

Social graphs are one of my favorite services problems in that they are notoriously difficult to effectively partition. I often refer to this as the “hairball problem”.  There is no clean data partition that will support the workload with adequate performance. Typical approaches to this problem redundantly store a several copies of the data with different lookup keys and potentially different partitions. For example, most social networks have some notion of a user group.  The membership list is stored with each group.  

 

But, a common request is to find all groups for which a specific user is a member. Storing users with the group allows efficient group enumeration but doesn’t support the “find all groups of a given user” query.  Storing the group membership with each user supports this query well but doesn’t allow efficient group membership enumeration. The most common solutions are to store the data redundantly both ways. Typically one is the primary copy and the other is updated asynchronously after the primary is updated. This effectively makes read much more efficient at the cost of more work at update time.  Sometimes, the secondary copy isn’t persisted in the backend database and is only stored in an in memory cache often implemented using memcached.

 

The LinkedIn approach of using a central in-memory, social graph avoids some of the redundancy of the partitioned model I described above at the expensive of requiring a single-server memory large enough to store the entire un-partitioned social graph.  Clearly this is more efficient by many measures but requiring that the entire social graph fit into a single servers memory means that more expensive servers are required as the service grows. And, many of us can’t sleep at night with hard scaling limits even if they are apparently fairly large.

 

Other scaling web sites pointers are posted at: http://perspectives.mvdirona.com/2007/11/12/ScalingWebSites.aspx.

 

Thanks to Dare Obasanjo and Kevin Merritt for sending my way.

 

                                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Sunday, June 08, 2008 11:14:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Friday, June 06, 2008

There was an interesting talk earlier today at Microsoft Research by Jason Cong of the UCLA Computer Science Department on compiling design specifications in C/C++/SystemC and user constraints into ASIC and FPGA design. The advantage of compiler based approaches include, more productivity working at a higher level, automating verification, allows optimization, and allows rapid experimentation at different frequencies and different optimization goals (performance vs power, for example). As design complexity increases higher level language and optimization have excellent potential.  Essentially the same thing is happening in hardware design as happened 30 years ago in operating systems implementation languages. High level implementation languages replace lower level as complexity climbs.  For example, the Intel Core 2 Duo processor is a 1B transistor implementation whereas the 386 was 1/1000th of that complexity at 1M.

 

Also super interesting was the example from a financial institution that is taking a software based stock analysis system where they take the hottest parts of the system and compile these to FPGA implementations. 30x faster at 1/10th the power. Very cool.  

 

Now that AMD supports hyper transport it is possible to implement custom processors with excellent overall system performance.  Intel has opened up the FSB and is also expected to offer a non-compatible hyper transport-like implementation in the future.

 

My rough notes from the talk follow:

 

·         Speaker: Jason Cong, UCLA Computer Science (cong@cs.ucla.edu)

·         Working on on-chip interconnects & communications

o   3D IC design

o   RF-interconnects

§  Note that power restrictions restrict processors to ~5GH

·         But, communications lines can scale to 100s of GH

·         Dividing communications link into 10 or more “channels” that operate at different frequencies

·         This talk focused on  ESL SystemC to FPGA compiler

·         Why?

o   700,000 lines of RTL for a 10M gate design is too much

o   Allows executable specification

o   Verification requires executable design

o   Accelerated computing or reconfigurable computing also need C/C++ based compilation/synthesis to FPGAs

§  CPUs coupled with FPGA to support common functions at high performance and lower power

·         Note that performance limited by communications (getting data to the CPU)

o   Long wires that have to be traversed in a single clock are the limiting factor

o   This research focuses on supporting multi-cycle communications

·         xPilot: Behavior-to-RTL (Register Transfer Level design) synthesis flow

o   takes behavior spec in C/SystemC to front-end compiler to SSDM

o   SSDM is optimized using standard compiler optimization (loop unrolling, strength reduction, scheduling, etc.)

o   SSDM is compiled to:

§  Verilog/VHDL/SystemC

§  FPGAs: Altera, Xilinx

§  ASICs: Magma, Synopsys

o   UPS: Uniform Power Specification

o   During final compilation optimize for power and shut off compute units that are not being used and shut off those that are being during idle periods (a busy disk controller is frequently waiting for mechanicals and not need to execute instructions)

§  Can’t shut FPGAs but can with ASICs

§  The only solution to dynamic power leakage only solution is shut the component of

o   Allows faster experimentation than hand coding.  You can try different frequencies and different power optimizations (too complex for most humans)

o   Scheduling (allocation of operations to compute logic and specific clock cycles) is NP-complete and automated techniques can exceed quality of expert designs

·         Example:

o   Schedule the behavior to RTL using the following characterization, cycle time, constraints, and objectives:

§  Platform characterization: adder (2ns) & multiplier (5ns)

§  Target cycle time (10ns)

§  Resource constraint: only one multiplier is available

§  Objective: high performance or lower power as examples

·         Note as optimizing and reducing component counts, less space is required, which can allow faster clocking

·         Investigating compilation for Reconfigurable Accelerated Computing

o   Take GCC 3DES implementation and synthesis FPGS RTL description

o   Example took 3DES from C level implementation to a FPGA (Xilinx Virtex-5)

·         Investment bank is using this tool to compile financial optimizations from S/W implementation to FPGA accelerators (Black-Scholes S/w kernel)

o   30x speed-up over software implementation and 1/10 the power (6 vs 68W)

 

A related presentation: http://cadlab.cs.ucla.edu/~cong/slides/fpt05_xpilot_final.pdf.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Friday, June 06, 2008 10:52:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, June 03, 2008

 

Last week at Google IO, pricing was announced for Google Application Engine. Actually it was blogged the night before at: http://googleappengine.blogspot.com/2008/05/announcing-open-signups-expected.html.

 

The prices are close to identical with Amazon AWS although GAE differs substantially from the AWS offerings.  The former offers a easy to use Python  execution environment whereas Amazon offers the infinitely flexible run-this-virtual-machine model. Clearly the Amazon model costs more to provide so, by that measure, AWS pricing is somewhat better:

 

Google Application Engine Pricing:

·    $0.10 - $0.12 per CPU core-hour

·    $0.15 - $0.18 per GB-month of storage

·    $0.11 - $0.13 per GB outgoing bandwidth

·    $0.09 - $0.11 per GB incoming bandwidth

·    From: http://googleappengine.blogspot.com/2008/05/announcing-open-signups-expected.html

Compared with AWS Pricing:

·         $0.10 - $0.80 per VM hour (depending upon resources allocated)

·         $0.15 per GB-month of storage

·         $0.100 - $0.170 per GB outgoing bandwidth

·         $0.100 per GB incoming bandwidth

·         From: http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA and http://www.amazon.com/EC2-AWS-Service-Pricing/b/ref=sc_fe_l_2?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA.

There are some important differences that make the pricing comparison somewhat biased in a couple of ways. Two important differences: 1) as mentioned above, Amazon gives an entire virtual machine so EC2 is much more flexible than GAE both in that it can run arbitrary applications in arbitrary languages and that it supports all execution models whereas GAE only supports HTTP request/response.  Another key difference is the storage subsystem.  In the numbers above, we’re comparing the Amazon blob store (S3) with the more structured storage model offered by GAE.  The more comparable AWS SimpleDB pricing is considerably higher than the GAE storage pricing. SimpleDB charges $1.50 GB/month in addition to machine usage and network transmission costs.  GAE is offering much more affordable semi-structured storage and the GAE storage model actually supports data types rather than having to force everything to character format.

 

GAE is still free to start with under 5M page views/month and up to ½ GB storage for free.  Obviously this helps developers get started without strings and that’s a good thing. But, more importantly, it avoids Google from having to go to the expense of billing very small values.  In a weird sort of way, I’m more impressed with AWS billing $0.04 on some accounts in that it shows there billing system is incredibly lean. Scaling down billing is hard, hard, hard.

 

In addition to announcing prices, GAE went from a controlled admission beta to a fully open beta where all comers are welcome.  I’m impressed how quickly they have gone from making the service initially available to a full open beta.  Impressive.

 

Also announced last week was a new GAE Memcached API which appears to have been a 20% project of Brad Fitzpatrick (Sriram Krishnan sent my way). And a set of image manipulation APIs supporting image scaling, rotating, etc. will now be part of the GAE API set.

 

My notes from Google IO:

·         http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

·         http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx

·         http://perspectives.mvdirona.com/2008/05/30/IO2008RoughNotesFromSelectedSessionsAtGoogleIODay2.aspx

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Tuesday, June 03, 2008 8:05:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, June 01, 2008

Yesterday the Tribute to Honor Jim Gray was held at the University of California at Berkeley. We all miss Jim deeply so it really is a tough topic.  But it was great to get together with literally 100s of Jim’s friends and share stories and talk about some of his accomplishments, his contributions to the field, and his contributions to each of us.  Jim is amazing across all three dimensions but what is most remarkable is the profound way he helped others achieve more throughout the industry.  We’re all better engineers, researchers, and human beings for having been lucky enough to have known and worked with Jim.

 

Also announced yesterday was the creation of the Jim Gray Chair at Cal Berkeley.  Bill Gates Eric Schmidt, Marc Benioff, and Mike Stonebraker each donated $250,000 which were matched by a $1,000,000 from the Hewlett Foundation.

 

Seattle PI coverage: Gathering in Berkeley, Calif., today to honor legendary scientist, Microsoft researcher Jim Gray.

 

The morning, general session agenda:

             Welcome - Shankar Sastry

             Opening Remarks - Joseph Hellerstein

             A Tribute, Not a Memorial: Understanding Ambiguous Loss - Pauline Boss

             The Amateur Search - Michael Olson

             Jim Gray at Berkeley - Michael Harrison

             Knowledge and Wisdom - Pat Helland

             Why Did Jim Gray Win the Turing Award? - Michael Stonebraker

             Jim Gray Chair - Stuart Russell

             500 Special Relationships: Jim as a Mentor to Faculty and Students - Ed Lazowska

             Jim Gray: His Contributions to Industry - David Vaskevitch

             A "Gap Bridger" - Richard Rashid

             Thanks to the U.S. Coast Guard - Paula Hawthorn

 

The afternoon, technical session agenda:

·         Welcome - Shankar Sastry

·         Opening Remarks - Joseph Hellerstein

·         A Tribute, Not a Memorial: Understanding Ambiguous Loss - Pauline Boss

·         The Amateur Search - Michael Olson

·         Jim Gray at Berkeley - Michael Harrison

·         Knowledge and Wisdom - Pat Helland

·         Why Did Jim Gray Win the Turing Award? - Michael Stonebraker

·         Jim Gray Chair - Stuart Russell

·         500 Special Relationships: Jim as a Mentor to Faculty and Students - Ed Lazowska

·         Jim Gray: His Contributions to Industry - David Vaskevitch 

·         A "Gap Bridger" - Richard Rashid

·         Thanks to the U.S. Coast Guard - Paula Hawthorn

 

The event was video recorded and streamed via http://webcast.berkeley.edu/.

 

Update: the video will be at:  Tribute to Honor Jim Gray - General Session (thanks to George Spix for sending my way).

Second Update: A good article by John Markoff of the NY Times: A Tribute to Jim Gray: Sometimes Nice Guys Do Finish First.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Sunday, June 01, 2008 10:08:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Thursday, May 29, 2008

Google IO notes continued from earlier in the day: http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx and yesterday: http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

 

Google Web Toolkit and Client-Server Communications

·         Speaker: Miquel Mendez

·         GWT client/server communication options:

o   Frames

o   Form Panel

o   XHR: RequestBuilder (be careful don’t to start too many—many browsers have limits)

o   XML RPC

·         XML Encoding/Decoding: com.google.gwt.xml defines XML related classes

·         JSON Encoding/Decoding: com.google.gwt.json.JSON defines JSON related classes

·         GWT RPC: Generator that generates code and makes use of RequestBuilder

 

Reusing Google APIs with Google Web Toolkit

·         Speaker: Miquel Mendez

·         GALGWT: Google API Library for GWT.  It’s an open source project lead by Miquel (Javascript bindings to GWT).

o   It’s a collection of easy to use GWT bindings for existing Google JavaScript APIs

o   It’s a Google code open source project

·         Reminder: GWT is a java to Javascript compiler.

·         GWT now has a gadget class.  Google Gadget creation using GWT by extending the Gadget class and implementing the NeedsXXX intrerfaces.

·         Gears support:

o   Exposes database, LocalServer, and WorkerPool JS modules

o   Provides an offline module that automates the process fo going offline (creates the necessary manifests automatically)

·         Google Maps support as well

 

Engaging User Experiences with Google App Engine

·         Speakers: John Skidgel (designer) & & Lindsay Simon (developer).

·         Showed a guest book application written using Djanjo Form.  It’s been modified to run under App Eng