Tuesday, July 08, 2008

Jim Gray proposed the original sort benchmark back in his famous "Anon et al." paper, A Measure of Transaction Processing Power, originally published in Datamation on April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.


TeraSort is essentially a sequential I/O benchmark, and the best way to get lots of I/O capacity is to have many servers. The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates, but it doesn't scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model simple enough to understand yet able to harness thousands of nodes. MapReduce takes some heat for not being innovative and for not having learned enough from the database community (MapReduce – A Major Step Backwards). However, Google, Microsoft, and Yahoo all run the model over thousands of nodes, and all three have written higher-level language layers above MapReduce, some of which look very SQL-like.
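The model itself is tiny, which is exactly why it scales to so many programmers and so many nodes. As a hedged sketch in plain Python (not Hadoop's actual API), the essence is: map emits (key, value) pairs, the framework shuffles and sorts by key, and reduce consumes each key's values in order. With identity map and reduce functions, the framework degenerates into a sort, which is why TeraSort is such a natural stress test of the shuffle:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # "Map" phase: emit (key, value) pairs from every input record.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # "Shuffle" phase: order by key -- the part TeraSort exercises at scale.
    pairs.sort(key=itemgetter(0))
    # "Reduce" phase: call the reducer once per key with its grouped values.
    return [out
            for key, group in groupby(pairs, key=itemgetter(0))
            for out in reducer(key, [v for _, v in group])]

# Identity mapper/reducer turn the framework into a sort.
lines = ["banana", "apple", "cherry"]
result = map_reduce(lines,
                    mapper=lambda line: [(line, None)],
                    reducer=lambda key, values: [key])
print(result)  # ['apple', 'banana', 'cherry']
```

A real implementation partitions the key space across nodes and sorts each partition in parallel, but the programming contract the user sees is no bigger than this.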


Owen O’Malley of the Yahoo Grid team took a moderate-sized Hadoop cluster of 910 nodes and won the TeraSort benchmark. Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result, Owen.


Here’s the configuration that won:

  • 910 nodes
  • 2 quad-core Xeons @ 2.0GHz per node
  • 4 SATA disks per node
  • 8GB RAM per node
  • 1 gigabit Ethernet on each node
  • 40 nodes per rack
  • 8 gigabit Ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Yahoo bought relatively expensive fat servers for this configuration but, even then, this effort was won on well under $4 million in hardware. Let’s assume the fat nodes are $3.8k each. At 40 servers per rack, we’ll need 23 racks. Let’s assume $4.2k per top-of-rack switch and $100k for core switching. That’s 910*$3.8k + 23*$4.2k + $100k, or $3,655k. That means you can go out and spend roughly $3.7m and have the same resources that won the last sort benchmark. Amazing. I love what’s happening in our industry.
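Sketching that arithmetic out (the $3.8k/node, $4.2k/rack-switch, and $100k core-switching figures are my assumptions above, not Yahoo's actual prices):

```python
import math

NODES = 910
NODES_PER_RACK = 40
NODE_COST = 3_800            # assumed cost per fat node (USD)
RACK_SWITCH_COST = 4_200     # assumed top-of-rack switch cost (USD)
CORE_SWITCHING_COST = 100_000  # assumed core switching cost (USD)

racks = math.ceil(NODES / NODES_PER_RACK)  # 910/40 rounds up to 23 racks
total = NODES * NODE_COST + racks * RACK_SWITCH_COST + CORE_SWITCHING_COST
print(racks, total)  # 23 3654600  -> roughly $3.7m
```

The node cost dominates: the 23 rack switches and the core together are under $200k of the $3.65m total.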

Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Schrenk).


The next thing I would like to see is this same test run on very low-power servers. Assuming the fairly beefy nodes used above draw 350W each (they may well be more), the overall cluster, ignoring networking, draws 318.5kW, and it ran for 209 seconds, which is 18.5kWh. Let’s focus on power and show what can be sorted for 1kWh. The kWh sort.
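The energy arithmetic, with the 350W-per-node figure as an assumption on my part:

```python
NODES = 910
WATTS_PER_NODE = 350   # assumed draw per node; the real figure may be higher
RUN_SECONDS = 209      # reported TeraSort elapsed time

cluster_kw = NODES * WATTS_PER_NODE / 1000     # 318.5 kW, networking excluded
energy_kwh = cluster_kw * RUN_SECONDS / 3600   # kW * hours = kWh for the run
print(round(cluster_kw, 1), round(energy_kwh, 1))  # 318.5 18.5
```

So the winning run consumed roughly 18.5kWh of server power, which is what makes a sorted-bytes-per-kWh metric an interesting complement to raw elapsed time.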


Congratulations to the Yahoo and Hadoop team for a great result.


                                --jrh


Wei Xiao of the Internet Search Research Center sent Owen’s result my way.


James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com


Tuesday, July 08, 2008 4:03:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Software
Tuesday, July 08, 2008 5:56:11 AM (Pacific Standard Time, UTC-08:00)
Don't forget the cost of 900 RHEL licenses plus the datacentre to house it all. Sort/KWh is kind of nice. Sort/$ could be good too, to compare EC2 with alternates. Upload the TB of data there and see how much AWS will bill you for the storage and then the sort itself.
SteveL
Tuesday, July 08, 2008 6:22:05 AM (Pacific Standard Time, UTC-08:00)
I don't follow your math. It looks like you're accounting for the price of 40 nodes, not 910. Why?
Ari Rabkin
Tuesday, July 08, 2008 6:48:29 AM (Pacific Standard Time, UTC-08:00)
Given your cost assumptions the actual formula should be:

23 racks * (4.2k/rack switch + (40 servers/rack * $4k/server)) + $100k core switch = $3,876k.

Nathan Schrenk
Tuesday, July 08, 2008 6:59:34 AM (Pacific Standard Time, UTC-08:00)
Thanks for the correction Ari and Nathan.

SteveL, you asked why not include storage and RHEL licenses. Storage is included. You can include RHEL if you want O/S support. I chose not to. DC costs are not included but I agree that power consumption is interesting and worth tracking.

--jrh
jrh@mvdirona.com
Tuesday, July 08, 2008 7:41:43 AM (Pacific Standard Time, UTC-08:00)
You wrote kW / hr instead of kW * hr
Anon
Tuesday, July 08, 2008 11:49:18 AM (Pacific Standard Time, UTC-08:00)
James, there was a typo in the original blog post; it should have read:

2 quad core Xeons @ 2.0ghz per a node

rather than

4 dual core Xeons @ 2.0ghz per a node.

It's been fixed since...

Arun C Murthy
Wednesday, July 09, 2008 1:44:40 AM (Pacific Standard Time, UTC-08:00)
Thanks for the update Arun. What's the per node cost?

--jrh
jrh@mvdirona.com
Thursday, July 10, 2008 4:15:43 PM (Pacific Standard Time, UTC-08:00)
I imagine that the soft cost way exceeds the hardware :)

How many engineers have paved the way inside Yahoo building this cool architecture making the test possible? 10-25-50? Over 2 or 3 years? 25 employees on the team over 2 years earning an average salary of 125k (salary could be way off) already pushes you well over 6.25 million + benefits.

I really like your thoughts on doing it with low power, that would be a cool stat to track..
robert towne
Friday, July 11, 2008 6:08:43 AM (Pacific Standard Time, UTC-08:00)
I agree Robert. The engineering investment in Hadoop has been substantial. Many super-useful software projects like Hadoop and Memcache get huge contributions from services companies like Yahoo and Facebook.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton