Jim Gray proposed the original sort benchmark back in his famous Anon et al. paper, A Measure of Transaction Processing Power, originally published in Datamation on April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.
TeraSort is essentially a sequential I/O benchmark, and the best way to get lots of I/O capacity is to have many servers. The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates, but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model understandable enough to allow thousands of nodes to be harnessed. MapReduce takes some heat for not being innovative and for not having learned enough from the database community (MapReduce: A Major Step Backwards). However, Google, Microsoft, and Yahoo all run the model over thousands of nodes, and all three have written higher-level language layers above MapReduce, some of which look very SQL-like.
Owen O’Malley of the Yahoo Grid team took a moderately sized Hadoop cluster of 910 nodes and won the TeraSort benchmark. Owen blogged the result, Apache Hadoop Wins Terabyte Sort Benchmark, and provided more details in a short paper, TeraByte Sort on Apache Hadoop. Great result, Owen.
Here’s the configuration that won:
- 910 nodes
- 2 quad core Xeons @ 2.0GHz per node
- 4 SATA disks per node
- 8GB RAM per node
- 1 gigabit Ethernet on each node
- 40 nodes per rack
- 8 gigabit Ethernet uplinks from each rack to the core
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
- Sun Java JDK 1.6.0_05-b13
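To put those per-node numbers in perspective, here’s a minimal sketch that tallies the aggregate capacity of the cluster. It’s pure arithmetic from the list above; the 8 cores per node follows from the two quad-core Xeons.

```python
# Aggregate capacity of the 910-node cluster, computed from the
# per-node figures listed above.
nodes = 910
cores = nodes * 8          # 2 quad-core Xeons per node -> 7,280 cores
disks = nodes * 4          # 4 SATA disks per node      -> 3,640 spindles
ram_gb = nodes * 8         # 8GB RAM per node           -> 7,280 GB

print(f"{cores:,} cores, {disks:,} disks, {ram_gb / 1024:.1f} TB of RAM")
```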
Yahoo bought fairly expensive servers for this configuration but, even then, this effort was won on less than $4 million in hardware. Let’s assume their fat nodes are $3.8k each. They have 40 servers per rack, so we’ll need 23 racks. Let’s assume $4.2k per top-of-rack switch and $100k for core switching. That’s 910*$3.8k + 23*$4.2k + $100k, or $3,655k. That means you can go out and spend roughly $3.7m and have the same resources that won the last sort benchmark. Amazing. I love what’s happening in our industry.
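Here’s the same cost arithmetic as a small script, so the assumptions are easy to vary. The per-server, per-switch, and core-switch prices are the estimates from the paragraph above, not actual purchase prices.

```python
# Back-of-envelope cluster cost using the assumed prices above.
import math

nodes = 910
nodes_per_rack = 40
server_cost = 3_800         # $ per node (assumed)
rack_switch_cost = 4_200    # $ per top-of-rack switch (assumed)
core_switch_cost = 100_000  # $ for core switching (assumed)

racks = math.ceil(nodes / nodes_per_rack)   # 23 racks
total = nodes * server_cost + racks * rack_switch_cost + core_switch_cost
print(f"{racks} racks, total hardware ~= ${total / 1e6:.2f}M")  # ~$3.65M
```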
Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).
The next thing I would like to see is this same test run on very low-power servers. Assuming the fairly beefy nodes used above draw 350W each (they may well draw more), the overall cluster, ignoring networking, draws roughly 318kW, and it ran for 209 seconds, which works out to about 18.5kWh. Let’s focus on power and show what can be sorted for 1kWh. The kWh sort.
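And the power math, again as a rough sketch under the 350W-per-node assumption, with networking and cooling excluded:

```python
# Energy used by the sort run under the assumptions above.
nodes = 910
watts_per_node = 350       # assumed; the real draw may be higher
run_seconds = 209          # reported TeraSort elapsed time

cluster_kw = nodes * watts_per_node / 1000      # ~318.5 kW
energy_kwh = cluster_kw * run_seconds / 3600    # ~18.5 kWh
print(f"{cluster_kw:.1f} kW for {run_seconds}s = {energy_kwh:.1f} kWh")
```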
Congratulations to the Yahoo and Hadoop team for a great result.
–jrh
Wei Xiao of the Internet Search Research Center sent Owen’s result my way.
James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
I agree, Robert. The engineering investment in Hadoop has been substantial. Many super-useful software projects like Hadoop and Memcached get huge contributions from services companies like Yahoo and Facebook.
I imagine that the soft costs way exceed the hardware cost :)
How many engineers paved the way inside Yahoo building this cool architecture and making the test possible? 10, 25, 50? Over 2 or 3 years? 25 employees on the team over 2 years at an average salary of $125k (the salary could be way off) already comes to $6.25 million, plus benefits.
I really like your thoughts on doing it with low-power servers; that would be a cool stat to track.
Thanks for the update, Arun. What’s the per-node cost?
–jrh
jrh@mvdirona.com
James, there was a typo in the original blog post; it should have read:
2 quad core Xeons @ 2.0GHz per node
rather than
4 dual core Xeons @ 2.0GHz per node.
It’s been fixed since…
You wrote kW / hr instead of kW * hr
Thanks for the correction Ari and Nathan.
SteveL, you asked why not include storage and RHEL licenses. Storage is included. You can include RHEL if you want O/S support. I chose not to. DC costs are not included but I agree that power consumption is interesting and worth tracking.
–jrh
jrh@mvdirona.com
Given your cost assumptions, the actual formula should be:
23 racks * (4.2k/rack switch + (40 servers/rack * $4k/server)) + $100k core switch = $3,876k.
I don’t follow your math. It looks like you’re accounting for the price of 40 nodes, not 910. Why?
Don’t forget the cost of 900 RHEL licenses plus the datacentre to house it all. Sort/kWh is kind of nice. Sort/$ could be good too, to compare EC2 with alternates. Upload the TB of data there and see how much AWS will bill you for the storage and then the sort itself.