I met Google’s Wolf-Dietrich Weber at the 2009 CIDR conference where he presented what is still one of my favorite datacenter power-related papers. I liked the paper because the gain was large, the authors weren’t confused or distracted by much of what is incorrectly written on datacenter power consumption, and the technique is actually practical. In Power Provisioning for a Warehouse-sized Computer, the authors argue that we should oversell power, the most valuable resource in a data center. Just as airlines oversell seats, their key revenue-producing asset, datacenter operators should oversell power.
Most datacenter operators take the critical power, the total power available to the data center less power distribution losses and mechanical system cooling loads, then reduce it by at least 10 to 20% to protect against the risk of overdraw, which can bring utility penalties or even loss of power. Servers are then provisioned to this reduced critical power level. But the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter, but that’s another discussion), so there is close to no chance that all servers will draw their full load at the same time. And, with some diversity of workloads, even with some services spiking to 100%, we can exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than the critical power could support if they all peaked at the same time.
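To make the oversubscription argument concrete, here is a minimal sketch (my own illustration; the service names, traces, and power numbers are invented) comparing the sum of per-service peaks against the actual aggregate peak when those peaks don’t line up:

```python
# Hypothetical illustration of power oversubscription headroom.
# The services, hourly traces, and kW numbers are invented for this sketch.

# Hourly power draw (kW) for three dissimilar services over one day.
search = [300, 280, 260, 250, 260, 300, 360, 420, 480, 520, 540, 550,
          540, 530, 520, 510, 500, 490, 470, 440, 400, 370, 340, 320]
batch  = [500, 520, 540, 550, 540, 500, 420, 340, 280, 240, 220, 210,
          220, 230, 240, 260, 290, 330, 380, 430, 470, 490, 500, 510]
email  = [150, 140, 130, 130, 140, 170, 220, 280, 320, 340, 350, 350,
          340, 330, 330, 320, 310, 290, 260, 230, 200, 180, 170, 160]

sum_of_peaks   = max(search) + max(batch) + max(email)            # naive provisioning
aggregate_peak = max(s + b + e for s, b, e in zip(search, batch, email))

print(f"sum of individual peaks: {sum_of_peaks} kW")
print(f"actual aggregate peak:   {aggregate_peak} kW")
print(f"oversubscription headroom: {1 - aggregate_peak / sum_of_peaks:.0%}")
```

With these made-up traces the aggregate load never reaches the 1,450 kW that peak-of-peaks provisioning would reserve, leaving roughly 23% headroom that could host additional servers.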
This is exactly what airlines do when selling seats. And, just as airlines need to be able to offer a free ticket to Hawaii in the unusual event that they find a flight over-subscribed, we need the same safety valve here. Some datacenter equivalents of a free ticket to Hawaii are: 1) delay all non-customer-impacting workloads (administrative and operational batch jobs), 2) stop non-critical or best-effort workloads, and 3) force servers into lower power states. This last one is a favorite research topic but is almost never done in practice because it is the equivalent of solving the oversold airline seat problem by actually having two people sit in the same seat. It sort of works but isn’t safe and doesn’t make for happy customers. Option #3 reduces the resources available to all workloads by lowering overall quality of service. For most businesses this is not a good economic choice. The best answers are options 1 and 2 above.
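As a rough sketch of what that safety valve could look like in software (the actions mirror the three options above, but the function, thresholds, and shed fractions are mine and purely illustrative), a controller can walk the options in priority order until the fleet is back under its power budget:

```python
# Hypothetical power-cap responder: shed load in priority order rather than
# degrading all workloads at once. Actions and shed fractions are illustrative only.

SHED_ACTIONS = [
    ("delay admin/operational batch jobs", 0.08),   # option 1: est. fraction of load shed
    ("stop best-effort workloads",         0.15),   # option 2
    ("force servers into low-power states", 0.25),  # option 3: last resort
]

def respond_to_overdraw(current_kw: float, budget_kw: float) -> list[str]:
    """Return the actions to take, in order, until the fleet is back under budget."""
    taken = []
    for action, shed_fraction in SHED_ACTIONS:
        if current_kw <= budget_kw:
            break
        current_kw *= (1 - shed_fraction)
        taken.append(f"{action} -> est. {current_kw:.0f} kW")
    return taken

for step in respond_to_overdraw(current_kw=1180, budget_kw=1000):
    print(step)
```

Note that with these example numbers the controller stops after options 1 and 2; option 3 is only reached if the cheaper responses are insufficient, which matches the ordering argued for above.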
One class of application that is particularly difficult to manage efficiently is online data-intensive (OLDI) workloads. Web search, advertising, and machine translation are examples of this workload type. These workloads can be very profitable, so option #3 above, reducing the quality of service to save power, doesn’t make economic sense. In the note The Cost of Latency, we reviewed the importance of very rapid response times for these workload types and for ecommerce systems.
The best answer for these workloads is what Barroso and Hoelzle refer to as Energy Proportional Computing (The Case for Energy Proportional Computing). Essentially the goal of energy proportional computing is that a server at 10% load should consume 10% of the power of a server running at 100% load. Clearly there is overhead and this goal will never be fully achieved but, the closer we get, the lower the cost and environmental impact of hosting OLDI workloads.
The good news is there has been progress. When energy proportional computing was first proposed, many servers at idle consumed 80% of the power they drew at full load. Today, a good server can be as low as 45% at idle. We are nowhere close to where we want to be but good progress is being made. In fact, CPUs are quite good by this measure today; the worst offenders are the other components in the server. Memory has big opportunities and the mobile consumer device world shows us what is possible. I expect we’ll continue to progress by stealing ideas from the cell phone industry and applying them to servers.
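A back-of-the-envelope model makes the gap concrete. Assuming a simple linear power curve between idle and peak (an approximation of mine, not taken from the papers; the 500 W peak is also an assumption), the idle fraction dominates power draw at the low utilizations typical of real fleets:

```python
# Simple linear server power model: P(u) = idle + (peak - idle) * u.
# The 80%/45% idle fractions come from the post; the 500 W peak is an assumption.

PEAK_W = 500.0

def power_watts(utilization: float, idle_fraction: float) -> float:
    idle_w = idle_fraction * PEAK_W
    return idle_w + (PEAK_W - idle_w) * utilization

for label, idle_frac in [("older server (80% idle)", 0.80),
                         ("good server today (45% idle)", 0.45),
                         ("ideal energy proportional", 0.00)]:
    p = power_watts(utilization=0.10, idle_fraction=idle_frac)
    print(f"{label}: {p:.0f} W at 10% load ({p / PEAK_W:.0%} of peak)")
```

At 10% load this model gives roughly 82%, 51%, and 10% of peak power respectively, which is why idle power is the number to chase.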
In Power Management of Online Data-Intensive Services, a research team from Google and the University of Michigan target the OLDI power proportionality problem, focusing on Google search, advertising, and translation workloads. These workloads are difficult because the latency goals are achieved using large in-memory caches and, as the workload moves from peak to valley, all these machines need to stay available in order to meet the application latency goals. It is not an option to concentrate the workload on fewer servers: the cache size requires that all the servers remain available, so as the workload drops towards idle, every server still carries some small amount of work and can’t be dropped into a full-system low-power state.
The data cache size requires the memory of all the servers, so as the workload volume goes down, each server gets progressively less busy but never actually hits idle. They always need to be online and available so the next request can be served at the required latency. The paper draws the following conclusions:
· CPU active low-power modes provide the best single power-performance mechanism but, by themselves, cannot achieve power proportionality
· There is a pressing need to improve idle low-power modes for shared caches and on-chip memory controllers
· There is a substantial opportunity to save memory system power with low-power modes [mobile systems do this well today so the techniques are available]
· Even with query-batching, full system idle low-power modes cannot provide acceptable latency-power tradeoffs
· Coordinated, full-system active low-power modes hold the greatest promise to achieve energy proportionality with acceptable query latency
Summarizing the OLDI workload type as presented in the paper, the workload latency goals are achieved by spreading very large data caches over the operational servers. As the workload goes from peak to trough, these servers all get less busy but are never actually idle, so they can’t be dropped into a full-system lower power state.
I like to look at servers supporting these workloads as being in a two dimensional grid. Each row represents one entire copy of the cache spread over 100s of servers. A single row could serve the workload and successfully deliver on the application latency goals, but a single row will not scale. To scale to workloads beyond what can be served by a single row, more rows are added. When a search query comes into the system, it is sent to the 100s of servers in a row, but only to the servers in that single row. Looking at the workload this way, I would argue, we actually do have some ability to make OLDI workloads power proportional at warehouse scale. When the workload goes up towards peak, more rows are needed. When the workload reduces towards trough, fewer rows are used and the rows not currently in use can be used to support other workloads.
This row-level scaling technique produces very nearly full proportionality at the overall datacenter level with two problems: 1) the workload can’t scale down below a single row for all the reasons outlined in the paper, and 2) if the workload is very dynamic and jumps from trough to peak quickly, more rows need to be kept ready in case they are needed, which further reduces the power proportionality of the technique.
If a workload is substantially higher scale than a single row and predictably swings from trough to peak, this per-row scaling technique produces very good results. It fails where workloads change dramatically or where less-than-single-row scaling is needed.
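To illustrate the row-scaling idea, here is a minimal sketch (my own, not code from Google; the per-row capacity and headroom numbers are hypothetical) of a scaler that chooses how many rows to keep active:

```python
# Hypothetical row-level scaler for an OLDI service. Each row holds a full copy
# of the in-memory cache and can serve ROW_CAPACITY_QPS on its own; rows scale
# capacity, never cache coverage. All numbers are illustrative only.
import math

ROW_CAPACITY_QPS = 50_000   # sustainable query rate of one full row (assumption)
HEADROOM_ROWS = 1           # extra rows kept warm for sudden trough-to-peak swings

def rows_needed(forecast_qps: float) -> int:
    # Never go below one row: a single row is the minimum unit that can serve
    # the workload at the required latency (problem 1 above).
    active = max(1, math.ceil(forecast_qps / ROW_CAPACITY_QPS))
    # Keeping spare rows warm protects against fast swings (problem 2),
    # at the cost of some proportionality.
    return active + HEADROOM_ROWS

for qps in (20_000, 180_000, 600_000):
    print(f"{qps:>7} qps -> {rows_needed(qps)} rows active")
```

The rows taken out of rotation are then free to run other workloads, which is where the warehouse-level proportionality comes from.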
The two referenced papers:
· Power Management of Online Data-Intensive Services
· Power Provisioning for a Warehouse-sized Computer
Thanks to Alex Mallet for sending me Power Management of Online Data-Intensive Services.
–jrh
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Dave comments: "James, you assume that many redundant copies of the data are kept in order to scale the read workload. No doubt this is true for some services, but certainly not all, and my intuition says not even most. Most of the read-replica scaled systems I’ve seen are cases where there is a single monolithic store being read scaled over several machines."
You are right that most workloads aren’t this well written. This note focused on OLDI workloads in general and Google Search, Advertising, and Translate specifically. Although these workloads are hosted in datacenters all over the world, I suspect they need to be able to sustain query rates beyond what can be supported by a single row. Recall I defined a row to be the 100 to 1,000 servers that handle a portion of each search. For low-scale request rates, a single row is all that is needed. But workloads beyond the request rate that the slowest server in a row can support require adding rows. As rows are added, search workloads can scale almost without bound.
In this architecture, scaling down only requires taking rows out of rotation and keeping the load on the remaining fleet from dropping near idle.
For workloads that are written this way, Dave’s suggestion of making a spot market along the lines of EC2 Spot (http://aws.amazon.com/ec2/spot-instances/) makes perfect sense.
You are right that, over time, server components outside of the CPU will eventually receive the power-focused engineering that has been applied to the CPU. Memory is on the verge of being the biggest power-consuming component in the datacenter. I expect the focus on memory power to ramp up over the next 18 to 24 months.
–jrh
James, you assume that many redundant copies of the data are kept in order to scale the read workload. No doubt this is true for some services, but certainly not all, and my intuition says not even most. Most of the read-replica scaled systems I’ve seen are cases where there is a single monolithic store being read scaled over several machines. Most of the scaled out data systems I’ve seen are scaled based on required storage capacity. And most services seem to use an enormous number of boxes in roles like "webworker" which scale proportional to load (not that very many organizations bother to leverage this fact). Even at, eg, Facebook, my impression was that the memcache tier had redundancy only for durability and not as a read-scaling mechanism. Your exposure is quite a bit broader than mine. Have you seen many examples of redundant read replicas dominating the scale equation (beyond the 3 or so replicas needed for durability)?
Overall, the oversubscription of power is one of the (great) points I’ve seen you make over and over. It will be exciting to see how far these and other techniques take us in the direction of reducing $/unit-of-work.
A few other points:
* You could have mentioned the EC2 Spot market
* #3 (low power mode) – doing this across all servers is pretty silly, but it could be seen as an alternative to powering off low-priority workloads. Even though 1% utilized = 45% baseline power, this technique has a few advantages. Power cycles increase wear-out rates, boot-up times for complex software stacks can take many minutes, and resetting a restartable batch job can waste many minutes more … all adding up to a non-trivial fixed cost for powering off a machine. Because of that minimum-idle-power flaw, shaving long-duration peaks will probably be most economically accomplished by finding servers to turn off entirely. However, having banks of servers that can instantly shed half their power usage without resetting workloads, even if those workloads slow by 80% or pause entirely, would make a great low-risk "safety valve".
* Future Trends – Of course, in the long, long run, if you look at the physics of the situation, power consumption grows exponentially with performance. This is obviously why bleeding-edge CPUs are the most "energy proportional". With enough focus, other system aspects may catch up … for example, your previous post about running banks of commodity power supplies to keep the "on" supplies in their efficiency band. And my personal pet peeve – servers that ship with more chips and peripherals than needed – usb, dvi, dvd reader, etc – seems like trimming down to cpu+memory+network+debug controller (eg serial port) would save some base-load power.