Saturday, October 17, 2009

Jeff Dean of Google gave an excellent keynote talk at LADIS 2009.  Jeff’s slides are posted at http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf and my notes follow:

·         A data-center-wide storage hierarchy (a quick read-time comparison follows these numbers):

o   Server:

§  DRAM: 16GB, 100ns, 20GB/s

§  Disk: 2TB, 10ms, 200MB/s

o   Rack:

§  DRAM: 1TB, 300us, 100MB/s

§  Disk: 160TB, 11ms, 100MB/s

o   Aggregate Switch:

§  DRAM: 30TB, 500us, 10MB/s

§  Disk: 4.8PB, 12ms, 10MB/s
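
The interesting part is how quickly bandwidth drops as you move away from the local server. As a minimal back-of-envelope sketch (my own arithmetic using the figures above, not something from the slide), here is roughly how long it takes to stream 1 TB at each level:

    # Back-of-envelope: time to stream 1 TB at each level of the hierarchy,
    # using the bandwidth figures listed above (my arithmetic, not a slide).
    TB_IN_GB = 1024.0

    bandwidth_gb_per_s = {
        "server DRAM (20 GB/s)":      20.0,
        "server disk (200 MB/s)":      0.2,
        "rack DRAM/disk (100 MB/s)":   0.1,
        "aggregate switch (10 MB/s)":  0.01,
    }

    for level, gbps in bandwidth_gb_per_s.items():
        seconds = TB_IN_GB / gbps
        print(f"{level}: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")

The steep drop between in-rack and cross-rack bandwidth (the 1:10 oversubscription a commenter points out below) is why locality-aware data placement gets so much attention.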

·         Failure is inevitable:

o   Disks: 1 to 5% annual failure rate

o   Servers: 2 to 4% annual failure rate

·         Excellent set of distributed systems rules of thumb (a worked example follows this list):

o   L1 cache reference 0.5 ns

o   Branch mispredict 5 ns

o   L2 cache reference 7 ns

o   Mutex lock/unlock 25 ns

o   Main memory reference 100 ns

o   Compress 1K bytes with Zippy 3,000 ns

o   Send 2K bytes over 1 Gbps network 20,000 ns

o   Read 1 MB sequentially from memory 250,000 ns

o   Round trip within same datacenter 500,000 ns

o   Disk seek 10,000,000 ns

o   Read 1 MB sequentially from disk 20,000,000 ns

o   Send packet CA->Netherlands->CA 150,000,000 ns
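
Jeff’s point with these constants is that they let you compare designs on the back of an envelope before writing any code. As an illustrative sketch in that spirit (the scenario is mine; only the constants come from the list above):

    # "Numbers everyone should know", encoded from the list above (times in ns).
    L1_REF            = 0.5
    BRANCH_MISPREDICT = 5
    L2_REF            = 7
    MUTEX             = 25
    MEM_REF           = 100
    COMPRESS_1K       = 3_000
    SEND_2K_NET       = 20_000
    READ_1MB_MEM      = 250_000
    DC_ROUND_TRIP     = 500_000
    DISK_SEEK         = 10_000_000
    READ_1MB_DISK     = 20_000_000
    WAN_ROUND_TRIP    = 150_000_000

    def ms(ns):
        return ns / 1e6

    # Illustrative design comparison (my example): read thirty 256 KB images.
    images, size_mb = 30, 0.25

    serial   = images * (DISK_SEEK + size_mb * READ_1MB_DISK)  # one disk, one read at a time
    parallel = DISK_SEEK + size_mb * READ_1MB_DISK             # spread across 30 disks (ignoring stragglers)
    print(f"serial reads:   {ms(serial):.0f} ms")    # ~450 ms
    print(f"parallel reads: {ms(parallel):.0f} ms")  # ~15 ms

Fanning the reads out across disks changes the answer by roughly 30x, and the constants let you see that before building anything.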

·         Typical first year for a new cluster (rough per-day arithmetic follows this list):

o   ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

o   ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

o   ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

o   ~1 network rewiring (rolling ~5% of machines down over 2-day span)

o   ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

o   ~5 racks go wonky (40-80 machines see 50% packet loss)

o   ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

o   ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)

o   ~3 router failures (have to immediately pull traffic for an hour)

o   ~dozens of minor 30-second blips for DNS

o   ~1000 individual machine failures

o   ~thousands of hard drive failures

o   slow disks, bad memory, misconfigured machines, flaky machines, etc.
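
Taken together, this list means failure handling is routine operational background noise, not an exception path. A quick bit of arithmetic on the yearly counts above (my calculation; the “~thousands” of disk failures is read here as roughly 3,000):

    # Rough steady-state rates implied by the first-year counts listed above.
    machine_failures_per_year = 1000    # "~1000 individual machine failures"
    disk_failures_per_year    = 3000    # assumed midpoint of "~thousands"

    print(f"machine failures/day:  {machine_failures_per_year / 365:.1f}")          # ~2.7
    print(f"disk failures/day:     {disk_failures_per_year / 365:.1f}")             # ~8.2

    # Expected machine failures during any given hour of operation.
    print(f"machine failures/hour: {machine_failures_per_year / (365 * 24):.2f}")   # ~0.11

Even before counting the rarer rack and network events, a single cluster loses a few machines and several disks every day, so the software above it has to mask failures continuously.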

·         GFS Usage at Google:

o   200+ clusters

o   Many clusters of over 1000 machines

o   4+ PB filesystems

o   40 GB/s read/write load

·         MapReduce Usage at Google: 3.5M jobs/year, averaging 488 machines each and taking ~8 minutes
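
Multiplying those averages out (my arithmetic, not a figure from the talk) gives a feel for the aggregate scale:

    # Aggregate scale implied by the MapReduce averages above (my arithmetic).
    jobs_per_year    = 3_500_000
    machines_per_job = 488
    minutes_per_job  = 8

    machine_minutes = jobs_per_year * machines_per_job * minutes_per_job
    machine_years   = machine_minutes / (60 * 24 * 365)
    print(f"~{machine_years:,.0f} machine-years of computation per year")   # ~26,000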

·         BigTable Usage at Google: 500 clusters, with the largest holding 70 PB and doing 30+ GB/s of I/O

·         Working on a next-generation GFS system called Colossus

·         Metadata management for Colossus is done in BigTable

·         Working on a next-generation BigTable system called Spanner

o   Similar to BigTable in that Spanner has tables, families, groups, coprocessors, etc.

o   But has hierarchical directories rather than rows, fine-grained replication (at the directory level), and ACLs

o   Supports both weak and strong data consistency across data centers

o   Strong consistency implemented using Paxos across replicas

o   Supports distributed transactions across directories/machines

o   Much more automated operation

§  Automatic data movement and replication based on computation, usage patterns, and failures

o   Spanner design goals: 10^6 to 10^7 machines, 10^13 directories, 10^18 bytes of storage, 10^3 to 10^4 locations spread over long distances

o   Users specify required latency, replication factor, and location

 

I did the keynote at last year’s LADIS: http://perspectives.mvdirona.com/2008/09/16/InternetScaleServiceEfficiency.aspx

 

                                                                --jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Comments
Sunday, October 18, 2009 9:54:12 AM (Pacific Standard Time, UTC-08:00)
Awesome post.

I've recently finished an article of a similar nature, about "Making Life Suck Less (While Making Scalable Systems)": http://www.roadtofailure.com/2009/09/09/how-to-make-life-suck-less-while-making-scalable-systems/
Monday, October 19, 2009 9:43:36 AM (Pacific Standard Time, UTC-08:00)
> Aggregate Switch:
>§ DRAM: 30TB, 500us, 10MB/s
>§ Disk: 4.8PB, 12ms, 10MB/s

Isn't that wrong? If there are 30+ racks shouldn't it be 3GB/s for disk transfer?
Monday, October 19, 2009 11:06:19 AM (Pacific Standard Time, UTC-08:00)

Great article Bradford. Thanks for pointing me to it.

From your article, I see you are at Visible Technologies. That's just down the road from where I am. If it's OK to discuss your Hadoop work publicly, I would love to hear more about it. Collins Pub is just down the road. I love big data.

James Hamilton
jrh@mvdirona.com
Wednesday, November 11, 2009 3:41:26 AM (Pacific Standard Time, UTC-08:00)
@Dave: "Disk: 4.8PB, 12ms, 10MB/s" refers to the average network bandwidth you should expect between any 2 servers placed in _different_ racks. If you compare it with the intra-rack bandwidth (Disk: 160TB, 11ms, 100MB/s), you get a 1:10 oversubscription ratio (which sometimes could be even optimistic...).

Disclaimer: I didn't attend LADIS2009, so I could have interpreted it wrongly!