Large sorts need to be done daily and doing it well actually is economically relevant. Last July, Owen O’Malley of the Yahoo Grid team announced they had achieved a 209 second TeraSort run: Apache Hadoop Wins Terabyte Sort Benchmark. My summary of the Yahoo result with cluster configuration: Hadoop Wins TeraSort.
Google just announced a MapReduce sort result on the same benchmark: Sorting 1PB with MapReduce. They improved on the 209 second result that Yahoo produced achieving 68 seconds. How did they get roughly 3x speedup? Google used slightly more servers at 1,000 than the 910 used by Hadoop but that difference is essentially rounding error and doesn’t explain the difference.
We know that sorting is essentially, an I/O problem. The more I/O a cluster has, the better the performance of a well written sort. It’s not quite the case that computation doesn’t matter but close. A well written sort will scale almost linearly with the I/O capacity of the cluster. Let’s look closely at the I/O sub-systems used in these two sorts and see if that can explain some of the differences between the two results. Yahoo used 3,640 disks in their 209 second run. The Google cluster uses 12 disks per server for a total of 12,000. Both are using commodity disks. The Hadoop result uses 3,640 disk for 209 seconds (761k disk seconds) and the Google result uses 12,000 disks for 68 seconds (816k disk seconds).
Normalizing for number of disks, the Google result is roughly 7% better than the Hadoop number from earlier in the year. That fairly small difference could be explained by more finely tuned software, better disks, or a combination of both.
The Google experiment included a petabyte sort on a 4,000 node cluster. This result is impressive for at least two reasons: 1) a 4,000 node, 48000 disk cluster running a commercial workload is impressive, and 2) sorting a petabyte in 6 hours in 2 min is noteworthy.
In my last posting on high-scale sorts Hadoop Wins TeraSort I argued that we should be also be measuring power consumed. Neither the Google nor Yahoo results report power consumption but there is a enough data to strongly suggest the Google number is better by this measure. Since the data isn’t published, let’s assume that commodity disks draw roughly 10W each and that each server is drawing 150W not including the disks. Using that data, let’s compute the number kilo-watt hours for each run:
· Google: 68*(1000*150+1000*12*10)/3600/1000 => 5.1 kwh
· Yahoo: 209*(910*150+910*4*10)/3600/1000 => 10.0 kwh
Both are good results and both very similar in their utilization of I/O resources but the Google result uses much less power under our assumptions. The key advantage is that they have 4x the number of disks per server so can amortize the power “overhead” of each server over more disks.
When running scalable algorithms like sort, a larger cluster will produce a faster result unless cluster scaling limits are hit. I argue that to really understand the real quality of the implementation in a comparable way we need to report work done per dollar and work done per joule.
Thanks to Savas Parastatidis for pointing this result out to me.