Jim Gray proposed the original sort benchmark back in his famous Anon et al paper A Measure of Transaction Processing Power originally published in Datamation April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.
TeraSort is essentially a sequential I/O benchmark and the best way to get lots of I/O capacity is to have many servers. The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model that is understandable to allow thousands of nodes to be harnessed. MapReduce takes some heat for not being innovative and not having learned enough from the database community (MapReduce – A Major Step Backwards). However, Google, Microsoft, and Yahoo run the model over thousands of nodes. And all three have written higher level languages layers above MapReduce some of which look very SQL-like.
Owen O’Malley of the Yahoo Grid team took a moderate sized Hadoop cluster of 910 nodes and won the TeraSort benchmark. Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result Owen.
Here’s the configuration that won:
- 910 nodes
- 4 dual core Xeons @ 2.0ghz per a node
- 4 SATA disks per a node
- 8G RAM per a node
- 1 gigabit ethernet on each node
- 40 nodes per a rack
- 8 gigabit ethernet uplinks from each rack to the core
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
- Sun Java JDK 1.6.0_05-b13
Yahoo bought expensive 4-socket servers for this configuration but, even then, this effort was won on less than ½ million in hardware. Let’s assume that their fat nodes are $3.8k each. They have 40 servers per rack so well need 23 racks. Let’s assume $4.2k per top of rack switch and $100k for core switching. That’s 910*$3.8k+23*4.2k+$100k or $3,655k. That means you can go out and spend roughly $3.5m and have the same resources that won the last sort benchmark. Amazing. I love what’s happening in our industry.
Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).
The next thing I would like to see is this same test run on very low power servers. Assuming the fairly beefy nodes used above are 350W each (they may well be more), the overall cluster ignoring networking will be 318kW and it ran for 209 seconds which is 18.490kW/hrs. Let’s focus on power and show what can be sorted for 1kW/hr. The kW/hr sort.
Congratulations to the Yahoo and Hadoop team for a great result.
Wei Xiao of the Internet Search Research Center sent Owen’s result my way.
James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com