In The Case for Low-Cost, Low-Power Servers, I argued that the right measures of server efficiency are work done per dollar and work done per joule. Purchasing servers on single-dimensional metrics like performance, power, or even cost alone makes no sense at all. Single-dimensional purchasing leads to micro-optimizations that push one dimension to the detriment of others. Blade servers have been one of my favorite examples of optimizing the wrong metric (Why Blade Servers aren’t the Answer to All Questions). Blades often trade increased cost to achieve server density. But density doesn’t improve work done per dollar, nor does it produce better work done per joule. In fact, density often takes work done per joule in the wrong direction by driving higher power consumption due to the challenge of cooling higher power densities.
There is no question that selling in high volume drives price reductions, so client and embedded parts have the potential to be the best price/performing components. And, as focused as the server industry has been on power of late, the best work is still in the embedded systems world, where cell phone designers would sell their souls for a few more amp-hours if they could have them without extra size or extra weight. Nobody focuses on power as much as embedded systems designers, and many of the tricks arriving in the server world showed up years ago in embedded devices.
A very common processor used in cell phone applications is the ARM. The ARM business model is somewhat unusual in that ARM sells a processor design, and that design is then taken and customized by many teams including Texas Instruments, Samsung, and Marvell. These processors find their way into cell phones, printers, networking gear, low-end Storage Area Networks, Network Attached Storage devices, and other embedded applications. The processors produce respectable performance, great price/performance, and absolutely amazing power/performance.
Could this processor architecture be used in server applications? The first and most obvious pushback is that it’s a different instruction set architecture, but server software stacks really are not that complex. If you can run Linux and Apache, some web workloads can be hosted. There are many Linux ports to ARM, so the software will run. The next challenge, and this one is the hard one, is whether the workload partitions into sufficiently fine slices to be hosted on servers built using low-end processors. Memory size limitations are particularly hard to work around in that ARM designs have the entire system on the chip, including the memory controller, and none I’ve seen address more than 2GB. But, for those workloads that do scale sufficiently finely, ARM can work.
I’ve been interested in seeing this done for a couple of years and have been watching ARM processors scale up for quite some time. Well, we now have an example. Check out http://www.linux-arm.org/Main/LinuxArmOrg. That web site is hosted on 7 servers, each running the following:
· 1 disk
· 1.5 GB DDR2 with ECC!
· Nginx web proxy/load balancer
· Apache web server
Note that, unlike Intel Atom based servers, this ARM-based solution has the full ECC memory support we want in server applications (actually you really want ECC in all applications from embedded through client to servers).
Clearly this solution won’t run many server workloads but it’s a step in the right direction. The problems I have had when scaling systems down to embedded processors have been dominated by two issues: 1) some workloads don’t scale down to sufficiently small slices (what I like to call bad software but, as someone who spent much of his career working on database engines, I probably should know better), and 2) surrounding component and packaging overhead. Basically, as you scale down the processor expense, other server costs begin to dominate. For example, if you halve the processor cost and also halve the throughput, it’s potentially a step backwards since all the other components in the server didn’t also halve in cost. So, in this example, you would get ½ the throughput with something more than ½ the cost. Generally not good. But what’s interesting are those cases where it’s non-linear in the other direction: cut the cost to N% with throughput at M%, where M is much more than N. As these system on a chip (SOC) server solutions improve, this is going to be more common.
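To make the arithmetic concrete, here is a minimal sketch of the work done per dollar comparison. All of the dollar and throughput figures are hypothetical, chosen only to illustrate the two cases; they are not measurements from any real server, and the function names are mine.

```python
def work_per_dollar(throughput, cpu_cost, other_cost):
    """Throughput delivered per dollar of total server cost."""
    return throughput / (cpu_cost + other_cost)


def scaled_work_per_dollar(throughput, cpu_cost, other_cost,
                           cpu_scale, other_scale, throughput_scale):
    """Work per dollar after scaling each part of the server independently.

    Each *_scale argument is a fraction of the baseline value,
    e.g. 0.5 means "half of the baseline".
    """
    return work_per_dollar(throughput * throughput_scale,
                           cpu_cost * cpu_scale,
                           other_cost * other_scale)


# Hypothetical baseline: $500 of processor, $500 of everything else,
# delivering 1000 units of work per second.
BASELINE = dict(throughput=1000.0, cpu_cost=500.0, other_cost=500.0)

baseline = work_per_dollar(**BASELINE)

# Case 1: halve processor cost and throughput; other costs unchanged.
# Total cost falls only to 75% while throughput falls to 50%.
halved_cpu = scaled_work_per_dollar(**BASELINE, cpu_scale=0.5,
                                    other_scale=1.0, throughput_scale=0.5)

# Case 2 (assumed SOC-style design): integration shrinks the surrounding
# costs too, so total cost falls to 30% (N) with throughput at 50% (M).
soc_style = scaled_work_per_dollar(**BASELINE, cpu_scale=0.2,
                                   other_scale=0.4, throughput_scale=0.5)

if __name__ == "__main__":
    print(f"baseline:   {baseline:.3f} work units per dollar")
    print(f"halved CPU: {halved_cpu:.3f} work units per dollar")
    print(f"SOC style:  {soc_style:.3f} work units per dollar")
```

In the first case work done per dollar drops below the baseline because the fixed surrounding costs dominate; in the second it rises because M (50%) is well above N (30%). The same calculation applies to work done per joule if you substitute power draw for cost.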
It’s not always a win based upon the discussion above but it is a win for some workloads today. And, if we can get multi-core versions of ARM, it’ll be a clear win for many more workloads. Actually, the Marvell MV78200 is a two-core SOC, but it’s not cache coherent, which isn’t a useful configuration in most server applications.
The ARM is a clear win on work done per dollar and work done per joule for some workloads. If a 4-core, cache-coherent version were available with a reasonable memory controller, we would have a very nice server processor with record-breaking power consumption numbers. Thanks for the great work, ARM and Marvell. I’m looking forward to tracking this work closely and I love the direction it’s taking. Keep pushing.