Jeff Dean: Design Lessons and Advice from Building Large Scale Distributed Systems

Jeff Dean of Google did an excellent keynote talk at LADIS 2009. Jeff’s talk is up at: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf and my notes follow:

· A data-center-wide storage hierarchy (a back-of-envelope sketch using these numbers follows the list):

o Server:

§ DRAM: 16GB, 100ns, 20GB/s

§ Disk: 2TB, 10ms, 200MB/s

o Rack:

§ DRAM: 1TB, 300us, 100MB/s

§ Disk: 160TB, 11ms, 100MB/s

o Aggregate Switch:

§ DRAM: 30TB, 500us, 10MB/s

§ Disk: 4.8PB, 12ms, 10MB/s
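
To make these numbers concrete, here is a minimal sketch (mine, not from the talk) that estimates how long a 1 MB read takes at each level, modeled simply as latency plus size divided by throughput. The constants are the figures quoted above; the level names and helper function are invented for the sketch.

```python
# Back-of-envelope: rough time to read 1 MB at each level of the hierarchy
# above, modeled as latency + size / throughput. Constants are the figures
# from the notes; names are invented for this sketch.

LEVELS = {
    # name: (latency_seconds, throughput_bytes_per_second)
    "server DRAM": (100e-9, 20e9),
    "server disk": (10e-3, 200e6),
    "rack DRAM":   (300e-6, 100e6),
    "rack disk":   (11e-3, 100e6),
    "agg DRAM":    (500e-6, 10e6),
    "agg disk":    (12e-3, 10e6),
}

def read_time_seconds(level, size_bytes):
    """Approximate fetch time: fixed latency plus streaming transfer."""
    latency, bandwidth = LEVELS[level]
    return latency + size_bytes / bandwidth

for name in LEVELS:
    print(f"1 MB from {name:12s}: {read_time_seconds(name, 1 << 20) * 1e3:8.2f} ms")
```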

· Failure Inevitable:

o Disk annualized failure rate: 1 to 5%

o Servers: 2 to 4%

· Excellent set of distributed systems rules of thumb (a worked example follows the list):

o L1 cache reference 0.5 ns

o Branch mispredict 5 ns

o L2 cache reference 7 ns

o Mutex lock/unlock 25 ns

o Main memory reference 100 ns

o Compress 1K bytes with Zippy 3,000 ns

o Send 2K bytes over 1 Gbps network 20,000 ns

o Read 1 MB sequentially from memory 250,000 ns

o Round trip within same datacenter 500,000 ns

o Disk seek 10,000,000 ns

o Read 1 MB sequentially from disk 20,000,000 ns

o Send packet CA->Netherlands->CA 150,000,000 ns
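
These rules of thumb are meant for back-of-envelope design comparisons. The sketch below (an illustrative scenario, not something in my notes) uses the two disk numbers above to compare building a page of 30 thumbnails with serial disk reads versus reads issued in parallel across 30 disks.

```python
# Worked example with the rules of thumb above: build a results page that
# needs 30 thumbnails, each a 256 KB read from disk. Illustrative scenario;
# the two constants come straight from the list.

DISK_SEEK_NS = 10_000_000        # disk seek: 10 ms
DISK_READ_1MB_NS = 20_000_000    # read 1 MB sequentially from disk: 20 ms

def disk_read_ns(size_mb):
    """One random read: a seek plus a sequential transfer of size_mb."""
    return DISK_SEEK_NS + size_mb * DISK_READ_1MB_NS

num_thumbs, thumb_mb = 30, 0.25

serial_ms = num_thumbs * disk_read_ns(thumb_mb) / 1e6   # one disk, reads back to back
parallel_ms = disk_read_ns(thumb_mb) / 1e6              # reads fanned out across 30 disks

print(f"serial reads:   ~{serial_ms:.0f} ms")           # ~450 ms
print(f"parallel reads: ~{parallel_ms:.0f} ms")         # ~15 ms
```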

· Typical first year for a new cluster (a quick per-day rescaling of these rates follows the list):

o ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

o ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

o ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

o ~1 network rewiring (rolling ~5% of machines down over 2-day span)

o ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

o ~5 racks go wonky (40-80 machines see 50% packet loss)

o ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

o ~12 router reloads (takes out DNS and external VIPs for a couple minutes)

o ~3 router failures (have to immediately pull traffic for an hour)

o ~dozens of minor 30-second blips for DNS

o ~1000 individual machine failures

o ~thousands of hard drive failures

o slow disks, bad memory, misconfigured machines, flaky machines, etc.
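
Rescaling those yearly event counts to per-day rates makes the point that failure handling is routine work, not an exception path. A quick sketch of the arithmetic (mine; the "thousands of hard drive failures" line is stood in by a round 3,000):

```python
# The yearly event counts above, rescaled to per-day rates. Counts are the
# rough figures from the list; 3,000 is a stand-in for "thousands" of drives.

EVENTS_PER_YEAR = {
    "individual machine failures": 1000,
    "hard drive failures": 3000,   # stand-in for "thousands"
    "rack failures": 20,
    "router reloads": 12,
    "network maintenances": 8,
}

for event, per_year in EVENTS_PER_YEAR.items():
    print(f"{event:28s} ~{per_year:4d}/year  =  ~{per_year / 365:5.2f}/day")
```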

· GFS Usage at Google:

o 200+ clusters

o Many clusters of over 1000 machines

o Filesystems of 4+ PB

o 40 GB/s read/write load

· MapReduce Usage at Google: 3.5M jobs/year, averaging 488 machines each and taking ~8 minutes
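
A quick bit of arithmetic on those MapReduce figures (my derivation, not a number from the slides) shows the aggregate scale they imply:

```python
# Rough aggregate implied by the MapReduce figures above (my arithmetic):
# 3.5M jobs/year x ~488 machines/job x ~8 minutes/job.
jobs_per_year = 3.5e6
machines_per_job = 488
minutes_per_job = 8

machine_minutes = jobs_per_year * machines_per_job * minutes_per_job
machine_years = machine_minutes / (60 * 24 * 365)
print(f"~{machine_years:,.0f} machine-years of computation per year")   # roughly 26,000
```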

· BigTable Usage at Google: 500 clusters, with the largest having 70PB and 30+ GB/s of I/O

· Working on a next-generation GFS system called Colossus

· Metadata management for Colossus is handled in BigTable

· Working on a next-generation BigTable system called Spanner

o Similar to BigTable in that Spanner has tables, families, groups, coprocessors, etc.

o But it has hierarchical directories rather than rows, fine-grained replication (at the directory level), and ACLs

o Supports both weak and strong data consistency across data centers

o Strong consistency implemented using Paxos across replicas

o Supports distributed transactions across directories/machines

o Much more automated operation

§ Auto data movement and replicas on basis of computation, usage patterns, and failures

o Spanner design goals: 10^6 to 10^7 machines, 10^13 directories, ~10^18 bytes of storage, 10^3 to 10^4 locations spread over long distances

o Users specify required latency, replication factor, and locations
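
As a purely hypothetical illustration of that last point, the sketch below shows what a per-directory placement policy plus a Paxos majority quorum might look like. Every class, field, and function name here is invented for the sketch; it is not Spanner's actual API.

```python
# Hypothetical illustration only: the notes say users specify a required
# latency, a replication factor, and locations, with strong consistency via
# Paxos across replicas. All names below are invented for this sketch.
from dataclasses import dataclass
from typing import List

@dataclass
class DirectoryPlacementPolicy:
    directory: str              # hierarchical directory, the unit of fine-grained replication
    replicas: int               # how many copies to keep
    locations: List[str]        # preferred datacenters/regions for those copies
    max_read_latency_ms: int    # latency target the placement should satisfy

def paxos_write_quorum(replicas: int) -> int:
    """Strongly consistent writes commit once a majority of replicas accept."""
    return replicas // 2 + 1

policy = DirectoryPlacementPolicy(
    directory="/photos/user123",
    replicas=5,
    locations=["us-east", "us-west", "eu-west"],
    max_read_latency_ms=50,
)
print(f"{policy.directory}: {policy.replicas} replicas, "
      f"write quorum = {paxos_write_quorum(policy.replicas)}")
```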

I did the keynote at last year’s LADIS: http://perspectives.mvdirona.com/2008/09/16/InternetScaleServiceEfficiency.aspx

–jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

4 comments on “Jeff Dean: Design Lessons and Advice from Building Large Scale Distributed Systems”
  1. @Dave: "Disk: 4.8PB, 12ms, 10MB/s" refers to the average network bandwidth you should expect between any 2 servers placed in _different_ racks. If you compare it with the intra-rack bandwidth (Disk: 160TB, 11ms, 100MB/s), you get a 1:10 oversubscription ratio (which sometimes could be even optimistic…).

    Disclaimer: I didn’t attend LADIS2009, so I could have interpreted it wrongly!

  2. Great article Bradford. Thanks for pointing me to it.

    From your article, I see you are at Visible Technologies. That’s just down the road from where I am. If it’s OK to discuss your Hadoop work publicly, I would love to hear more about it. Collins Pub is nearby. I love big data.

    James Hamilton
    jrh@mvdirona.com

  3. Dave Quick says:

    > Aggregate Switch:
    >§ DRAM: 30TB, 500us, 10MB/s
    >§ Disk: 4.8PB, 12ms, 10MB/s

    Isn’t that wrong? If there are 30+ racks shouldn’t it be 3GB/s for disk transfer?

  4. Awesome post.

    I’ve recently finished an article of a similar nature, about “Making Life Suck Less (While Making Scalable Systems)”: http://www.roadtofailure.com/2009/09/09/how-to-make-life-suck-less-while-making-scalable-systems/
