I met Google’s Wolf-Dietrich Weber at the 2009 CIDR conference where he presented what is still one of my favorite datacenter power-related papers. I liked the paper because the gain was large, the authors weren’t confused or distracted by much of what is incorrectly written on datacenter power consumption, and the technique is actually practical. In Power Provisioning for a Warehouse-sized Computer, the authors argue that we should oversell power, the most valuable resource in a data center. Just as airlines oversell seats, their key revenue producing asset, datacenter operators should oversell power.
Most datacenter operators take the critical power, the total power available to the data center less power distribution losses and mechanical system cooling loads, then reduce it by at least 10 to 20% to protect against the risk of overdraw which can draw penalty or power loss. Servers are then provisioned to this reduced critical power level. But, the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter but that’s another discussion) so there is close to no chance that all servers will draw their full load at the same time. And, with some diversity of workloads, even with some services spiking to 100%, we can often exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than we actually have critical power.
This exactly what airlines do when selling seats. And, just as airlines need to be able to offer a free ticket to Hawaii in the unusual event that they find a flight over-subscribed, we need the same safety valve here. Some datacenter equivalents of a free ticket to Hawaii is: 1) delay all non-customer impacting workloads (administrative and operational batch jobs, 2) stop non-critical or best-effort workloads, 3) force servers into lower power states. This last one is a favorite research topic but is almost never done in practice because it is the equivalent of solving the oversold airline seat problem by actually having two people sit in the same seat. It sort of works but isn’t safe and doesn’t make for happy customers. Option #3 reduces the resources available to all workloads by lowering overall quality of service. For most businesses this is not a good economic choice. The best answers are options 1 and 2 above.
One class of application that is particularly difficult to manage efficiently are online data-intensive workloads. Web search, advertising, and machine translation are examples of this workload type. These workloads can be very profitable so option #3 above, that of reducing the quality of service doesn’t make economic sense. In the note the cost of latency we reviewed the importance of very rapid response in these workload types and ecommerce systems. Reducing the quality of service for these high value workloads to save power, doesn’t make economic sense.
The best answer for these workloads is what Barroso and Hoelzle refer to Energy Proportional Computing (The Case for Energy Proportional Computing). Essentially the goal of energy proportional computing is that a server at 10% load should consume 10% of the power of a server running at 100% load. Clearly there is overhead and this goal will never be fully achieved but, the closer we get, the lower the cost and environmental impact for hosing OLDI workloads.
The good news is there has been progress. When energy proportional computing was first proposed, many servers at idle would consume 80% of the power that it would consume at full load. Today, a good server can be as low as 45% at idle. We are nowhere close to where we want to be but good progress is being made. In fact, CPUs are quite good by this measure today — the worst offenders are the other components in the server. Memory has big opportunities and the mobile consumer device world shows us what is possible. I expect we’ll continue to progress by stealing ideas from the cell phone industry and applying them to servers.
In Power Management of Online Data-Intensive Services, a research team from Google and the University of Michigan target the OLDI power proportionality problem focusing on Google search, advertising, and translation workloads. These workloads are difficult because the latency goals are achieved using large in-memory caches and, as workload moves from peak to valley, all these machines need to stay available in order to meet the application latency goals. It is not an option to concentrate the workload on fewer servers – the cache size requires all the servers continue to be available so, as workload goes down towards idle, all the servers continue to have some small amount of workload so they can’t be dropped into full system low power states.
The data cache size requires the memory of all the servers so as the workload volume goes down, each server gets progressively less busy but never actually hit idle. They always need to be online and available so the next request can be served at the required latency. The paper draws the following conclusions:
· CPU active low-power modes provide the best single power-performance mechanism but, by themselves, cannot achieve power proportionality
· There is a pressing need to improve idle low-power modes for shared caches and on-chip memory controllers
· There is a substantial opportunity to save memory system power with low-power modes [mobile systems do this well today so the techniques are available]
· Even with query-batching, full system idle low-power modes cannot provide acceptable latency-power tradeoffs
· Coordinated, full-system active low-power modes hold the greatest promise to achieve energy proportionality with acceptable query latency
Summarizing the OLDI workload type as presented in the paper, the workload latency goals are achieved by spreading very large data caches over the operational servers. As the workload goes from peak to trough, these servers all get less busy but never are actually at idle so can’t be dropped into a full system lower power state.
I like to look at servers supporting these workloads as being in a two dimensional grid. Each row represents one entire copy of the cache spread over 100s of servers. A single row could serve the workload and successfully deliver on the application latency goals but a single row will not scale. To scale to workloads beyond that which can be served on a single row, more rows are added. When a search query comes into the system, it is sent to the 100s of systems in a single row but only to the servers in a single row. Looking at the workload this way, I would argue, we actually do have some ability to make OLDI workloads power proportional at a warehouse-scale. When the workload goes up towards peak, more rows are needed. When the workload reduces towards trough, fewer rows are used and the rows not currently in use can be used to support other workloads.
This row-level scaling technique produces very nearly full proportionality at the overall datacenter level with two problems: 1) the workload can’t scale down below a row for all the reason outlined in the paper, 2) if the workload is very dynamic and jumps from trough to peak quickly, more rows need to be kept ready in case they are needed which further reduces the power proportionality of the technique.
If a workload is substantially higher scale than a single row and predictably swings from trough to peak, this per-row scaling technique produces very good results. It fails where workloads change dramatically or where less-than-single row scaling is needed.
The two referenced papers:
Thanks to Alex Mallet for sending me Power Management of Data-Intensive Services.