The Conference on Innovative Data Systems Research was held last week at Asilomar, California. It’s a biennial systems conference. At the last CIDR, two years ago, I wrote up Architecture for Modular Data Centers where I argued that containerized data centers are an excellent way to increase the pace of innovation in data center power and mechanical systems and are also a good way to grow data centers more cost effectively with a smaller increment of growth.
Containers have both supporters and detractors and it’s probably fair to say that the jury is still out. I’m not stuck on containers as the only solution but any approach that supports smooth, incremental data center expansion is interesting to me. Some high-scale modular deployments are in the works (First Containerized Data Center Announcement) so, as an industry, we’re starting to get some operational experience with the containerized approach.
One of the arguments that I made in the Architecture for Modular Data Centers paper was that a fail-in-place model might be the right approach to server deployment. In this approach, a module of servers (multiple server racks) is deployed and, rather than servicing them as they fail, the overall system capacity just slowly goes down as servers fail. As each server fails, it is shut off but not serviced. Most data centers are power-limited rather than floor-space-limited. Allowing servers to fail in place trades off space, which we have in abundance, in order to get high-efficiency service. Rather than servicing systems as they fail, just let them fail in place and, when the module's healthy-server density gets too low, send it back for remanufacturing at the OEM, who can do it faster and cheaper and can recycle all that is possible.
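To make the fail-in-place trade-off concrete, here is a minimal sketch of how a module's healthy-server density might decline over time. It assumes independent failures at a constant annual rate and no repairs; the 5% failure rate and the 80% return threshold are hypothetical numbers chosen purely for illustration and do not come from the paper.

```python
import math


def healthy_fraction(annual_failure_rate: float, years: float) -> float:
    """Fraction of servers still healthy after `years`.

    Assumes independent failures at a constant annual rate and no servicing
    (failed servers are simply shut off).
    """
    return (1.0 - annual_failure_rate) ** years


def years_until_threshold(annual_failure_rate: float,
                          min_healthy_fraction: float) -> float:
    """Years until the healthy fraction decays to `min_healthy_fraction`,
    the point at which the module would be sent back for remanufacturing."""
    return math.log(min_healthy_fraction) / math.log(1.0 - annual_failure_rate)


if __name__ == "__main__":
    afr = 0.05        # hypothetical annual failure rate
    threshold = 0.80  # hypothetical healthy-density floor for the module

    for year in range(6):
        print(f"year {year}: {healthy_fraction(afr, year):.3f} healthy")

    print(f"return module after ~{years_until_threshold(afr, threshold):.2f} years")
```

Under these assumed numbers the module stays above the floor for a little over four years, which is the kind of figure that would need to be compared against the planned service life of the servers.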
Fail in place (Service Free Systems) was by far the most debated part of the modular data center work. But it did get me thinking about how cheaply a server could be delivered. And, over time, I’ve become convinced that optimizing for server performance is silly. What we should be optimizing for is work done/$ and work done/joule (a joule is a watt-second). Taking those two optimization points, with a goal of a sub-$500 server, led to the Cooperative, Expendable, Micro-Slice Server (CEMS) project that I wrote up for this year’s CIDR.
In this work, we took an existing very high scale web property (many thousands of servers) and ran their production workload on the servers currently in use. We compared the server SKU currently being purchased with a low-cost, low-power design using work done/$ and work done/joule as the comparison metrics. Using this $500 server design, we were able to achieve:
· RPS/Joule: 3.9x
· RPS/Rack: 9.4x
Note that I’m not a huge fan of gratuitous density (density without customer value). See Why Blade Servers aren’t the Answer to all Questions for the longer form of this argument. I show density here only because many find it interesting, it happens to be quite high and, in this case, did not bring a cost penalty.
The paper is at: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_CEMS.pdf.
Abstract: The CEMS project evaluates low-cost, low-power servers for high-scale internet services using commodity, client-side components. It is a follow-on project to the 2007 CIDR paper Architecture for Modular Data Centers. The goals of the CEMS project are to establish that low-cost, low-power servers produce better price/performance and better power/performance than current purpose-built servers. In addition, we aim to establish the viability and efficiency of a fail-in-place model. We use work done per dollar and work done per joule as measures of server efficiency and show that more, lower-power servers produce the same aggregate throughput much more cost effectively, and we use measured performance results from a large consumer internet service to argue this point.
Thanks to Giovanni Coglitore and the rest of the Rackable Systems team for all their engineering help with this work.
Amazon Web Services
I agree that ARM looks VERY interesting. Any product where the power consumption is quoted in milliwatts/megahertz is popular with me. Device volumes are huge which helps keep costs down and there are multiple sources. The downside is the software stack needs to be ported but that’s not a huge blocker. Some Linux variants already run on ARM. I’ve met with ARM several times and agree this approach has merit and I expect we’ll see some examples of it over the next 12 to 18 months.
What is the chance that an alternate CPU architecture which supports a full LAMP stack could break into this market? ARM, for example, could attack this market (if rumors are true) with a much better balance between compute, memory, and I/O at a fraction of the cost and power. Given the movement to managed code, it does seem possible.
They are mid-tier IIS servers handling relatively light-weight transactions. Very little local I/O with only the local O/S and logging local. There is no local persistent state across transactions with all state stored in the data tier.
Thanks, James Hamilton. That paper was well-written, succinct, and informative.
As a software and distributed systems guy, I’m interested in what sorts of software workloads those systems are running. The paper describes it as Windows Server 2003 and IIS, but what is the CPU actually doing most of the time? Is it reading files from disk and handing them out over TCP in response to HTTP GETs? Or is it running a C# interpreter and some application-specific C# code, or what?
You’re 100% right Greg, there are always many factors in play and one is that many server purchases aren’t well matched to the workload. But it’s more than just that. Rather than tease apart all the possible factors, which is almost impossible to do in a controlled way, I took an existing production service and the servers they were buying in very large numbers and compared them with what we could do. The goal is not to show that the CEMS prototype is optimal. It’s far from that. But I would like to show that we can do a lot better even without rewriting the service, and I’m arguing that work done/$ and work done/joule are the right way to choose a server.
Summary: I’m after two things with this work: 1) establish that the right criteria for server purchases are work done/$ and work done/joule and 2) show that high-volume client parts can produce better results with (at least) some workloads.
It’s pretty surprising that the System-X detailed in the paper costs over 4.5x as much as the Athlon 4850e system ($2371 compared to $500) but only yields about 25% higher performance for this task (96 requests/second compared to 75 requests/second). That would seem to indicate a mismatch between what the System-X server provides and what the application needs.
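The prices and request rates quoted above can be turned directly into a work done/$ comparison. The following is a small sketch of that arithmetic; it considers purchase price only (no power or operating cost), and the function and variable names are illustrative rather than taken from the paper.

```python
def work_per_dollar(requests_per_second: float, server_cost_usd: float) -> float:
    """Throughput delivered per dollar of server purchase price."""
    return requests_per_second / server_cost_usd


# Figures quoted in the comment above.
system_x = work_per_dollar(requests_per_second=96, server_cost_usd=2371)
athlon_4850e = work_per_dollar(requests_per_second=75, server_cost_usd=500)

# How many times more work per dollar the low-cost server delivers.
advantage = athlon_4850e / system_x

if __name__ == "__main__":
    print(f"System-X:     {system_x:.4f} RPS/$")
    print(f"Athlon 4850e: {athlon_4850e:.4f} RPS/$")
    print(f"Advantage:    {advantage:.2f}x")  # roughly 3.7x
```

So although System-X is faster per server, on these numbers the cheaper system delivers roughly 3.7 times the requests per second for each dollar spent. The same ratio could be formed for work done/joule given measured power draw for each system, which is not quoted here.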
In particular, it seems odd that "current purpose-built servers" had expensive enterprise SCSI disks that were not needed by the application. Just replacing those two disks with a normal disk could close much of the gap, no? So, perhaps the lesson here is to have some purpose-built servers with little or no disk for applications that do not need disk?
I think we all agree that commodity servers are more cost effective than high-end servers. But, in this particular case, it looks like there was more going on, where the "purpose-built" servers were not actually a good match for the needs of the app.