Jeff Dean of Google gave an excellent keynote talk at LADIS 2009. Jeff’s slides are up at: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf and my notes follow:
· A data-center-wide storage hierarchy (back-of-envelope transfer times sketched after the list):
o Server:
§ DRAM: 16GB, 100ns, 20GB/s
§ Disk: 2TB, 10ms, 200MB/s
o Rack:
§ DRAM: 1TB, 300us, 100MB/s
§ Disk: 160TB, 11ms, 100MB/s
o Aggregate Switch:
§ DRAM: 30TB, 500us, 10MB/s
§ Disk: 4.8PB, 12ms, 10MB/s
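A quick way to internalize the hierarchy is to turn the bandwidth numbers into transfer times. The sketch below is my own back-of-envelope illustration (not from the talk): it reads the per-level bandwidth figures above literally and ignores latency, which is negligible for bulk transfers at this scale.

```python
# Back-of-envelope only: time to move 1 TB at each level of the hierarchy,
# using the bandwidth figures from the notes above (latency ignored).
bandwidth_bytes_per_sec = {
    "local DRAM": 20e9,                     # 20 GB/s
    "local disk": 200e6,                    # 200 MB/s
    "rack (remote server)": 100e6,          # 100 MB/s across the rack switch
    "cross-rack (aggregate switch)": 10e6,  # 10 MB/s across racks
}

data_bytes = 1e12  # 1 TB
for level, bw in bandwidth_bytes_per_sec.items():
    hours = data_bytes / bw / 3600
    print(f"{level:32s} {hours:7.2f} hours")
```

Moving 1 TB takes under a minute from local DRAM, a bit over an hour from local disk, and more than a day across the aggregate switch, which is why locality-aware data placement matters so much at this scale.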
· Failure is inevitable:
o Disks: 1 to 5% annual failure rate
o Servers: 2 to 4% annual failure rate
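To make those rates concrete, here is a rough expected-failure count for a hypothetical cluster. The cluster size and disks-per-server figures are my own assumptions for illustration, not numbers from the talk.

```python
# Hypothetical cluster sizing -- illustrative assumptions, not from the talk.
servers = 2000
disks_per_server = 12
disk_afr = (0.01, 0.05)    # 1-5% of disks fail per year
server_afr = (0.02, 0.04)  # 2-4% of servers fail per year

disks = servers * disks_per_server
print(f"expected disk failures/year:   {disks * disk_afr[0]:.0f} to {disks * disk_afr[1]:.0f}")
print(f"expected server failures/year: {servers * server_afr[0]:.0f} to {servers * server_afr[1]:.0f}")
```

Even at the low end that is a failed disk every day or two and a failed server roughly every week, which is why recovery has to be fully automated rather than an operator event.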
· Excellent set of distributed systems rules of thumb (worked example after the list):
o L1 cache reference 0.5 ns
o Branch mispredict 5 ns
o L2 cache reference 7 ns
o Mutex lock/unlock 25 ns
o Main memory reference 100 ns
o Compress 1K bytes with Zippy 3,000 ns
o Send 2K bytes over 1 Gbps network 20,000 ns
o Read 1 MB sequentially from memory 250,000 ns
o Round trip within same datacenter 500,000 ns
o Disk seek 10,000,000 ns
o Read 1 MB sequentially from disk 20,000,000 ns
o Send packet CA->Netherlands->CA 150,000,000 ns
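These numbers are mostly useful for back-of-envelope estimates. As an illustration (my own example, not from the talk), here is the classic comparison of reading 1 GB sequentially from memory versus from disk, and what a design that pays a disk seek per small record would cost for the same gigabyte:

```python
# Back-of-envelope using the rules of thumb above (all constants in ns).
READ_1MB_MEMORY = 250_000
READ_1MB_DISK = 20_000_000
DISK_SEEK = 10_000_000

mb_per_gb = 1024
print(f"read 1 GB sequentially from memory: {mb_per_gb * READ_1MB_MEMORY / 1e9:6.2f} s")
print(f"read 1 GB sequentially from disk:   {mb_per_gb * READ_1MB_DISK / 1e9:6.2f} s")

# The same 1 GB read as 4 KB records with one disk seek per record:
records = mb_per_gb * 1024 // 4
print(f"read 1 GB via per-record seeks:     {records * DISK_SEEK / 1e9:6.0f} s")
```

Sequential disk is roughly 80x slower than memory but still tolerable; seeking per record turns the same read into roughly three quarters of an hour. And a single datacenter round trip (500 us) costs about as much as reading 2 MB from memory.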
· Typical first year for a new cluster (rough per-day rates summed after the list):
o ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
o ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
o ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
o ~1 network rewiring (rolling ~5% of machines down over 2-day span)
o ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
o ~5 racks go wonky (40-80 machines see 50% packet loss)
o ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
o ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
o ~3 router failures (have to immediately pull traffic for an hour)
o ~dozens of minor 30-second blips for DNS
o ~1000 individual machine failures
o ~thousands of hard drive failures
o slow disks, bad memory, misconfigured machines, flaky machines, etc.
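Summed to per-day rates, the machine-level items above work out roughly as follows. This is just arithmetic on the counts as reported; the hard-drive figure is an assumed order of magnitude since the notes only say "thousands".

```python
# Rough per-day rates implied by the "typical first year" counts above.
# The hard drive figure is an assumed order of magnitude ("thousands").
counts_per_year = {
    "individual machine failures": 1000,
    "hard drive failures": 3000,
    "rack failures": 20,
}

for name, per_year in counts_per_year.items():
    print(f"{name:28s} ~{per_year / 365:.1f} per day")
```

A few machines and a handful of drives fail every single day, so the software has to treat failure as the normal case rather than the exception.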
· GFS Usage at Google:
o 200+ clusters
o Many clusters of over 1000 machines
o 4+ PB filesystems
o 40 GB/s read/write load
· MapReduce usage at Google: 3.5M jobs/year, averaging 488 machines each and taking ~8 minutes
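Multiplying those MapReduce figures out gives a feel for the aggregate compute involved; this is my arithmetic, not a figure from the talk:

```python
# Aggregate machine-time implied by the MapReduce usage figures above.
jobs_per_year = 3.5e6
machines_per_job = 488
minutes_per_job = 8

machine_minutes = jobs_per_year * machines_per_job * minutes_per_job
machine_years = machine_minutes / (60 * 24 * 365)
print(f"~{machine_years:,.0f} machine-years of MapReduce computation per year")
```

That works out to roughly 26,000 machine-years annually, i.e. the equivalent of about 26,000 machines doing nothing but MapReduce around the clock.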
· BigTable usage at Google: 500 clusters, with the largest holding 70 PB and doing 30+ GB/s of I/O
· Working on a next-generation GFS system called Colossus
· Metadata management for Colossus is in BigTable
· Working on a next-generation BigTable system called Spanner
o Similar to BigTable in that Spanner has tables, families, groups, coprocessors, etc.
o But has hierarchical directories rather than rows, fine-grained replication (at the directory level), and ACLs
o Supports both weak and strong data consistency across data centers
o Strong consistency implemented using Paxos across replicas
o Supports distributed transactions across directories/machines
o Much more automated operation
§ Automatic data movement and replication based on computation, usage patterns, and failures
o Spanner design goals: 10^6 to 10^7 machines, 10^13 directories, 10^18 bytes of storage, 10^3 to 10^4 locations over long distances
o Users specify required latency, replication factor, and location
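That last point is the interesting one operationally: rather than placing replicas by hand, users state a policy per directory and the system handles placement. Purely as an illustration of what such a declaration might look like (my sketch; the actual Spanner interface was not shown in the talk, and all names below are hypothetical):

```python
# Hypothetical sketch of a per-directory replication policy of the kind
# described above (required latency, replication factor, locations).
# None of these names come from the talk or from Spanner itself.
from dataclasses import dataclass

@dataclass
class ReplicationPolicy:
    directory: str                  # the Spanner directory the policy applies to
    replicas: int                   # how many copies to keep
    max_read_latency_ms: int        # target latency for nearby readers
    preferred_locations: list[str]  # regions; the system picks exact datacenters

policy = ReplicationPolicy(
    directory="/users/example/mail",
    replicas=3,
    max_read_latency_ms=50,
    preferred_locations=["us-west", "us-east", "europe"],
)
# Per the notes above, the system would then move and re-replicate the
# directory automatically as computation, usage patterns, and failures dictate.
```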
I did the keynote at last year’s LADIS: http://perspectives.mvdirona.com/2008/09/16/InternetScaleServiceEfficiency.aspx
–jrh
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
@Dave: "Disk: 4.8PB, 12ms, 10MB/s" refers to the average network bandwidth you should expect between any 2 servers placed in _different_ racks. If you compare it with the intra-rack bandwidth (Disk: 160TB, 11ms, 100MB/s), you get a 1:10 oversubscription ratio (which sometimes could be even optimistic…).
Disclaimer: I didn’t attend LADIS 2009, so I could have interpreted it wrongly!
Great article Bradford. Thanks for pointing me to it.
From your article, I see you are at Visible Technologies. That’s just down the road from where I am. If it’s OK to discuss your Hadoop work publicly, I would love to hear more about it. Collins Pub is close by. I love big data.
James Hamilton
jrh@mvdirona.com
> Aggregate Switch:
>§ DRAM: 30TB, 500us, 10MB/s
>§ Disk: 4.8PB, 12ms, 10MB/s
Isn’t that wrong? If there are 30+ racks shouldn’t it be 3GB/s for disk transfer?
Awesome post.
I’ve recently finished an article of a similar nature, about "Making Life Suck Less (While Making Scalable Systems)": http://www.roadtofailure.com/2009/09/09/how-to-make-life-suck-less-while-making-scalable-systems/