In The Case For Low-Power Servers I reviewed the Cooperative, Expendable, Micro-slice Servers project. CEMS is a project I had been doing in my spare time in investigating using low-power, low costs servers running internet-scale workloads. The core premise of the CEMS project: 1) servers are out-of-balance, 2) client and embedded volumes, and 3) performance is the wrong metric.

Out-of-Balance Servers: The key point is that CPU bandwidth is increasing far faster than memory bandwidth (see page 7 of Internet-Scale Service Efficiency). CPU performance continues to improve at roughly historic rates. Core count increases have replaced the previous reliance on frequency increase but performance improvements continue unabated. As a consequence, CPU performance is outstripping memory bandwidth with the result that more and more cycles are spent in pipeline stalls. There are two broad approaches to this problem: 1) improve the memory subsystem, and 2) reduce CPU performance. The former drives up design cost and consumes more power. The later is a counter-intuitive approach. Just run the CPU slower.

The CEMS project investigates using low-cost, low-power client and embedded CPUs to produce better price-performing servers. The core observation is that internet-scale workloads are partitioned over 10s to 1000s of servers. Running more slightly slower servers is an option if it produces better price performance. Raw, single-server performance is neither needed nor the most cost effective goal

Client and Embedded Volumes: It’s always been a reality of the server world that volumes are relatively low. Clients and embedded devices are sold at an over 10^9 annual clip. Volume drives down costs. Servers leveraging client and embedded volumes can be MUCH less expensive and still support the workload.

Performance is the wrong metric: Most servers are sold on the basis of performance but I’ve long argued that single dimensional metrics like raw performance are the wrong measure. What we need to optimize for is work done per dollar and work done per joule (a watt-second). In a partitioned workload running over many servers, we shouldn’t care about or optimize for single server performance. What’s relevant is work done/$ and work done/joule. The CEMS projects investigates optimizing for these metrics rather than raw performance.

Using work done/$ and work done/joule as the optimization point, we tested a $500/slice server design on a high-scale production workload and found nearly 4x improvement over the current production hardware.

Earlier this week Rackable Systems announced Microslice Architecture and Products. These servers come in at $500/slice and optimize for work done/$ and work done/joule. I particularly like this design in that its using client/embedded CPUS but includes full ECC memory and the price/performance is excellent. These servers will run partitionable workloads like web-serving extremely cost effectively.


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | | | blog: