The server tax is what I call the mark-up applied to servers, enterprise storage, and high scale networking gear. Client equipment is sold in much higher volumes with more competition and, as a consequence, is priced far more competitively. Server gear, even when using many of the same components as client systems, comes at a significantly higher price. Volumes are lower, competition is less, and there are often many lock-in features that help maintain the server tax. For example, server memory subsystems support Error Correcting Code (ECC) whereas most client systems do not. Ironically both are subject to many of the same memory faults and the cost of data corruption in a client before the data is sent to a server isn’t obviously less than the cost of that same data element being corrupted on the server. Nonetheless, server components typically have ECC while commodity client systems usually do not.
Back in 1987 Garth Gibson, Dave Patterson, and Randy Katz invented Redundant Array of Inexpensive Disks (RAID). Their key observation was that commodity disks in aggregate could be more reliable than very large, enterprise class proprietary disks. Essentially they showed that you didn’t have to pay the server tax to achieve very reliable storage. Over the years, the “inexpensive” component of RAID was rewritten by creative marketing teams as “independent” and high scale RAID arrays are back to being incredibly expensive. Large Storage Area Networks (SANs) are essentially RAID arrays of “enterprise” class disk, lots of CPU and huge amounts of cache memory with a Fibre Channel attach. The enterprise tax is back with a vengeance and an EMC NS-960 prices in at $2,800 a terabyte.
BackBlaze, a client computer backup company, just took another very innovative swipe at destroying the server tax on storage. Their work shows how to bring the “inexpensive” back to RAID storage arrays and delivers storage at $81/TB. Many services are building secret storage subsystems that deliver super reliable storage at very low cost. What makes the BackBlaze work unique is they have published the details on how they built the equipment. It’s really very nice engineering.
In Petabytes on a budget: How to Build Cheap Cloud Storage they outline the details of the storage pod:
· 1 storage pod per 4U of standard rack space
· 1 $365 motherboard and 4GB of RAM per storage pod
· 2 non-redundant Power Supplies
· 4 SATA cards
· Case with 6 fans
· Boot drive
· 9 backplane multipliers
· 45 1.5 TB commodity hard drives at $120 each.
Each storage pod runs Apache Tomcat 5.5 on Debian Linux and implements 3 RAID6 volumes of 15 drives each. They provide a full hardware bill of materials in Appendix A of Petabytes on a budget: How to Build Cheap Cloud Storage.
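The capacity and drive cost implied by that parts list are easy to work out. What follows is a rough sketch only: the constant and function names are my own, it counts drives alone (no motherboard, case, power supplies, or SATA cards), and it ignores file system overhead and the decimal versus binary terabyte distinction.

```python
# Figures come from the parts list above; all names here are illustrative.
DRIVES_PER_POD = 45
DRIVE_TB = 1.5
DRIVE_PRICE_USD = 120
RAID6_VOLUMES = 3
DRIVES_PER_VOLUME = 15
RAID6_PARITY_DRIVES = 2  # RAID6 gives up two drives' worth of capacity per volume


def pod_capacity_tb():
    """Return (raw_tb, usable_tb) for one storage pod."""
    raw = DRIVES_PER_POD * DRIVE_TB
    data_drives = RAID6_VOLUMES * (DRIVES_PER_VOLUME - RAID6_PARITY_DRIVES)
    usable = data_drives * DRIVE_TB
    return raw, usable


def drive_cost_per_tb(capacity_tb):
    """Drive-only cost per terabyte for the given capacity."""
    return DRIVES_PER_POD * DRIVE_PRICE_USD / capacity_tb


raw_tb, usable_tb = pod_capacity_tb()
print(f"raw capacity:    {raw_tb} TB")
print(f"usable capacity: {usable_tb} TB after RAID6 parity")
print(f"drive cost per raw TB:    ${drive_cost_per_tb(raw_tb):.2f}")
print(f"drive cost per usable TB: ${drive_cost_per_tb(usable_tb):.2f}")
```

The raw figure is the 45 × 1.5 TB that gets quoted as a 67TB pod; RAID6 parity is what separates it from the usable number.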
Predictably some have criticized the design as inappropriate for many workloads and they are right. The I/O bandwidth is low so this storage pod would be a poor choice for data intensive applications like OLTP databases. But, it’s amazingly good for cold storage like the BackBlaze backup application. Some folks have pointed out that the power supplies are very inefficient at around 80% peak efficiency and the configuration chosen will have them far below peak efficiency. True again but it wouldn’t be hard to replace these two PSUs with a single, 90+% efficiency, commodity unit. Many are concerned with cooling and vibration. I doubt cooling is an issue and, in the blog posting, they addressed the vibration issue and talked briefly about how they isolated the drives. The technique they chose might not be adequate for high IOPS arrays but it seems to be working for their workload. Some are concerned by the lack of serviceability in that the drives are not hot swappable and the entire 67TB storage pod has to be brought offline to do drive replacements. Again, this concern is legitimate but I’m actually not a big fan of hot swapping drives – I always recommend bringing down a storage server before service (I hate risk and complexity). And, I hate paying for hot swamp gear and there isn’t space for hot swap in very high density designs. Personally, I’m fine with a “shut-down to service” model but others will disagree.
The authors compared their hardware storage costs to a wide array of storage sub-systems from EMC through Sun and Netapp. They also compared to Amazon S3 and made what is a fairly unusual mistake for a service provider. They compared on-premise storage equipment purchase cost (just the hardware) with a general storage service. The storage pod costs include only hardware while the S3 costs include data center rack space, power for the array, cooling, administration, inside the data center networking gear, multi-data center redundancy, a general I/O path rather than one only appropriate for cold storage, and all the software to support a highly reliable, geo-redundant storage service. So I’ll quibble on their benchmarking skills – the comparison is of no value as currently written — but, on the hardware front, it’s very nice work.
Good engineering and a very cool contribution to the industry to publish the design. One more powerful tool to challenge the server tax. Well done Backblaze.
VentureBeat article: http://venturebeat.com/2009/09/01/backblaze-sets-its-cheap-storage-designs-free/.
James Hamilton, Amazon Web Services
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
Perhaps a stupid question — where can one find cases like the aforementioned SC833S2-550 on the cheap? I can’t find anything near the $50 price tag mentioned.
Luke, you are 100% correct that the most common hardware failure is disk. My observation was that hardware failures are not the most common service failure. Administrative error and software problems dominate hardware.
But, as you argue, disk failures are common. The only question is whether we can afford to bring a storage server down to service a disk or whether we need hot swap. Since the storage server WILL go down occasionally no matter what we do to avoid it, we have to harden the service to these events and still be able to meet SLA. Having done that, taking a storage server down for a short period of time to change a disk doesn’t impact SLA. At this point we have a choice between hot swap and reboot for disk change since neither impacts SLA. I prefer cheap and simple and don’t fully trust the O/S to handle hot swap correctly (even though errors are rare) so I prefer reboot to hot swap.
I agree that it’s best to have redundancy at the software/system layer. Not all failures are disk failures, and redundancy at the software layer saves you from many other problems as well. (Now, if a significant portion of your failures are not disk failures, I’d take a long hard look at your ESD precautions and/or change VARs. But you do need to be prepared to deal with non-disk hardware failures.) But even when you have redundancy at the software layer, hot-swapping a drive is a lot less labor than shutting down the box, dragging it to an ESD-safe workstation, opening/unscrewing it, replacing the drive, and re-racking the server. More importantly, it’s harder to screw up. If your servers are assembled by people who take fair ESD precautions, disk failures will be 90% or more of your hardware failures, so it makes sense to build a system where it’s easy to deal with the common failures.
I suppose you could also have a google-style system, where you just leave bad servers dead in the rack until it’s time to replace the whole rack, but hard drives fail a lot. Few companies can afford to waste servers like that.
I agree the server tax can be avoided and some are doing it but it’s a lot of work and more time than some teams can afford. That’s why I like ZT Systems and the Dell Data Center Solutions teams. They will build what you want in volume.
I hear you on not wanting to bring down big disk subsystems to change a single drive. I 100% agree that hot swap works but I like systems that have redundancy in the storage software. Once you have that, you can bring down a server without loss of availability so that’s my recommendation. Nothing against hot swap – I just like using boring and simple as dirt techniques at scale. Thanks for the comment.
First, I’m with djb on this one. My single-socket systems sport unbuffered ecc ram.
The server tax is mostly a problem with ‘enterprise’ vendors. If you make sure you only buy commodities, it can be avoided. In the case of servers, the tax is added on by sales departments at the VAR.
I buy 3u supermicro chassis from secondary vendors for $50 each: http://supermicro.com/products/chassis/3U/833/SC833S2-550.cfm – another $80 or so gets me a sata backplane, so for the cost of a good consumer case, I have a really solid case with hot swap drives.
Next, I use Opterons, not Xeons. FB-DIMMs suck a lot of power, and registered ECC DDR3 is still pretty expensive. If you wait for the sales, you can get 2.2GHz quad-core Shanghai Opterons for under $180 each. That puts you ahead of the slowest Nehalems. Socket F motherboards are super cheap, especially if you use something like the Tyan Thunder n3600R that was built during the dual-core days (a simple firmware upgrade and it works fine with quad-core chips). I buy registered ECC DDR2 from Kingston; around $20 per gigabyte. All told, I get out the door between $1200 and $1500 for an 8-core server with 32GiB RAM.
heh. and not too long ago, newegg had a sale on SuperMicro 2 in 1u servers: http://prgmr.com/~lsc/luke_opterons.jpg
The important thing about assembling your own hardware is to use ESD protection. If you can’t be bothered to set up a good work space with a wrist strap, buy from a VAR, and never open the box.
Personally I will not field gear without hot-swap drives. If I have to take down a box, that’s 32GiB worth of VPS customers who are mad at me. But yeah, I tend to shop around for a ‘scratch and dent’ case, and supermicro backplanes are not so expensive.
Hey, thanks for the comment Gleb. My only point of disagreement is you can’t compare a single storage appliance not plugged in to a multi-data center redundant storage service. They are pretty different beasts. Comparing two storage services makes sense and comparing two hardware systems makes sense but it’s hard to do a good job of comparing services to hardware without adding in all the other factors.
Congratulations on the excellent hardware design and thanks for contributing it publicly. If you are ever in the Seattle area, I would love to buy you a beer and get into more detail. You folks are doing very interesting work.
Mike, I believe you are right that Backblaze likely has redundancy at the application layer “above” the storage POD so likely could operate the pods without RAID6. That was my read as well. I suspect they went with RAID6 to reduce overhead on drive failure. Without some form of within-the-storage pod redundancy, they would have to recreate an entire 15 disk volume upon drive failure. With RAID volumes they can bring the pod down and service the disk without having to regenerate an entire 20TB volume. This potentially makes disk failure less resource intensive since data regeneration can be hard on the network and other resources depending upon what encoding they use for their application level redundancy.
Given that the data is fairly cold, the I/O penalty of RAID shouldn’t be a problem in this case.
Great writeup James. We share the same view and refer to it internally as the "IT tax".
Wanted to clarify one note – we actually struggled with how to compare apples-to-apples between Amazon S3 and a Backblaze storage pod, realizing that one is a service and one is hardware. Since there is very little published about the hardware side of S3, we took our best guess and the pricing was our estimate if you subtract electricity, colo, and administration costs. Doing so, we assumed 1/2 of S3 pricing went to these costs and would be borne by anyone running pods.
Appreciate your well-thought-out post – and believe in S3 as a great service for many companies in general.
I am stuck wondering what the net benefit of RAID is here. I don’t know the answer but maybe you do. Presumably, they have redundant data copies so that data remains available when particular servers are being serviced. So why use two layers of redundancy? Why aren’t they more like GFS?
In the configuration described, when one drive fails the server operates in a degraded but fully functional state. But the entire 67TB pod must go offline for ~5-10 minutes to replace the drive. Then it will be degraded, but functional while the RAID set is rebuilt. You need to rely on alternate copies of the data during the downtime.
If instead you just used 45 individual file systems (one per drive), you get ~15% more storage capacity. When a drive fails, you have to copy the data that was on it from still available servers to new drives. This is effectively the same quantity of IO that happens when rebuilding RAID, but it is copied via the network instead of the PCI bus. The other ~44 drives can remain available during this time.
15% cheaper would seem to be important to them.
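The ~15% figure follows from the layout in the post: 3 RAID6 volumes of 15 drives each spend 2 drives per volume on parity, leaving 39 of 45 drives for data. A minimal sketch of that arithmetic, with function names that are purely illustrative and with file system overhead ignored:

```python
def data_drives_with_raid6(total_drives, volumes, parity_per_volume=2):
    """Drives left for data once each RAID6 volume sets aside its parity."""
    return total_drives - volumes * parity_per_volume


def capacity_gain_without_raid(total_drives, volumes, parity_per_volume=2):
    """Fractional capacity gained by using every drive as its own file system."""
    with_raid = data_drives_with_raid6(total_drives, volumes, parity_per_volume)
    return total_drives / with_raid - 1.0


# The pod described in the post: 45 drives in 3 RAID6 volumes.
gain = capacity_gain_without_raid(total_drives=45, volumes=3)
print(f"data drives with RAID6: {data_drives_with_raid6(45, 3)} of 45")
print(f"capacity gain without RAID: {gain:.1%}")
```

This only covers capacity. It says nothing about the rebuild traffic, which moves from inside the pod onto the network in the no-RAID case.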
Maybe it didn’t perform very well during testing. Maybe the network overhead was a significant bottleneck. Maybe their high-availability is not as advanced as I assume. Maybe my thinking is flawed.
I love it Frank. I actually do want "hot swamp" data centers (//perspectives.mvdirona.com/2009/05/05/NextPointOfServerDifferentiationEffiiciencyAtVeryHighTemprature.aspx). Just not hot swap disks :-).
"I hate paying for hot swamp gear"
I love the Freudian slip.