Every so often, I come across a paper that just nails it, and this one is pretty good. Using a Market Economy to Provision Compute Resources Across Planet-wide Clusters doesn’t fully investigate the space, but it makes real progress on this important area and is a strong step in the right direction.
I spend much of my time working on driving down infrastructure costs. There is lots of great work that can be done in datacenter infrastructure, networking, and server design. It’s both a fun and important area. But, an even bigger issue is utilization. As an industry, we can and are driving down the cost of computing and yet it remains true that most computing resources never get used. Utilization levels at large and small companies typically run in the 10 to 20% range. I occasionally hear reference to 30% but it’s hard to get data to support it. Most compute cycles go wasted. Most datacenter power doesn’t get useful work done. Most datacenter cooling is not spent supporting productive work. Utilization is a big problem. Driving down the cost of computing certainly helps but it doesn’t address the core issue: low utilization.
That’s one of the reasons I work at a cloud computing provider. When you have very large, very diverse workloads, wonderful things happen. Workload peaks are not highly correlated. For example, tax preparation software is busy around tax time, retail software towards the end of the year, and social networking while folks in the region are awake. All these peaks and valleys overlay to produce a much flatter peak-to-trough curve. As the peak-to-trough ratio decreases, utilization skyrockets. You can only get these massively diverse workloads in public clouds, and it’s one of the reasons why private clouds are a bit depressing (see Private Clouds are not the Future). Private clouds are so close to the right destination, and yet that last turn was a wrong one and the potential gains won’t be achieved. I hate wasted work as much as I hate low utilization.
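To make the overlay effect concrete, here’s a toy sketch with invented demand curves: three workloads that each peak at a different time of day, so each is spiky on its own but the aggregate is far flatter.

```python
# Toy illustration (made-up numbers): aggregating workloads whose peaks
# don't coincide yields a much flatter combined demand curve.

def peak_to_trough(curve):
    """Ratio of the highest to the lowest point in a demand curve."""
    return max(curve) / min(curve)

hours = range(24)
# Three hypothetical workloads, each peaking at a different time of day.
tax_prep = [10 + (40 if 8 <= h < 12 else 0) for h in hours]    # morning peak
retail   = [10 + (40 if 12 <= h < 18 else 0) for h in hours]   # afternoon peak
social   = [10 + (40 if 18 <= h < 23 else 0) for h in hours]   # evening peak

combined = [a + b + c for a, b, c in zip(tax_prep, retail, social)]

for name, curve in [("tax", tax_prep), ("retail", retail),
                    ("social", social), ("combined", combined)]:
    print(f"{name:8s} peak/trough = {peak_to_trough(curve):.1f}")
```

With these made-up curves, each workload alone has a 5.0 peak-to-trough ratio, while the aggregate comes in around 2.3. Real workload mixes are messier, but the direction of the effect is the same.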
The techniques above smooth the aggregated utilization curve and, the flatter that curve gets, the higher the utilization, the lower the cost, and the better it is for the environment. Large public clouds flatten the workload peaks considerably, but the goal of steady, unchanging load 24 hours a day, 7 days a week isn’t achievable. Even power companies have base load and peak load. What to do with the remaining utilization valleys? The next technique is to use a market economy to incent developers and users to use resources that aren’t currently fully utilized.
In The Cost of A Cloud: Research Problems in Datacenter Networks, we argued that turning servers off is a mistake in that the most you can hope to achieve is to save the cost of the power, which is tiny when compared to the cost of the servers, the cost of power distribution gear, and the cost of the mechanical systems. See the Cost of Power in Large-Scale Datacenters (I’ve got an update of this work coming – the changes are interesting but the cost of power remains the minority cost). Rather than shutting off servers, the current darling idea of the industry, we should be productively using them. If we can run any workload worth more than the marginal cost of power, we should. Again, this is a strong argument for public clouds with large pools of resources on which a market can be made.
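A back-of-envelope version of that argument, with every number invented for illustration: once a server is bought and racked, its capital, power distribution, and mechanical costs are sunk, so the only cost a marginal workload adds is the power it burns.

```python
# Back-of-envelope sketch with invented numbers: any workload worth
# more than the marginal power it consumes is worth running rather
# than shutting the server off.

SERVER_WATTS = 300        # assumed marginal draw under load
PUE = 1.5                 # assumed power usage effectiveness
POWER_COST_KWH = 0.07     # assumed utility price in $/kWh

marginal_cost_per_hour = SERVER_WATTS / 1000 * PUE * POWER_COST_KWH

def worth_running(workload_value_per_hour):
    """Run the workload if it earns more than the marginal power it consumes."""
    return workload_value_per_hour > marginal_cost_per_hour

print(f"marginal power cost: ${marginal_cost_per_hour:.4f}/hour")  # $0.0315/hour
print(worth_running(0.05))   # True: a nickel an hour beats the power bill
print(worth_running(0.01))   # False: below even the marginal power cost
```

At these assumed rates, the bar is around three cents an hour, which is why almost any paying workload beats an idle or powered-off server.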
Continuing with making a market, and offering computing resources that are not under supply crunch (under-utilized) at lower cost, Amazon Web Services has a super interesting offering called Spot Instances. Spot Instances allow customers to bid on unused EC2 capacity and run those instances as long as their bid exceeds the current Spot price.
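A minimal sketch of that rule, with invented prices (the real Spot market has considerably more detail than this captures):

```python
# Sketch of the Spot rule described above: an instance keeps running
# while the customer's bid meets or exceeds the current spot price and
# is interrupted as soon as the price rises above the bid.
# All prices here are invented for illustration.

def spot_lifetime(bid, price_history):
    """Number of pricing periods the instance runs before interruption."""
    periods = 0
    for spot_price in price_history:
        if bid < spot_price:   # outbid: the instance is terminated
            break
        periods += 1
    return periods

prices = [0.03, 0.04, 0.05, 0.09, 0.04]   # hypothetical hourly spot prices
print(spot_lifetime(bid=0.06, price_history=prices))  # 3: outbid in hour four
print(spot_lifetime(bid=0.10, price_history=prices))  # 5: never outbid
```

The higher the bid, the longer the instance survives price spikes; a low bid buys cheap capacity but only while the valley lasts, which is exactly the incentive the market is designed to create.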
The paper I mentioned above is heading in a similar direction, but this time working on the Google MapReduce cluster utilization problem. Technically, the paper is working on a private cloud, but it’s still nice work, and it is using the biggest private cloud in the world at well over a million servers, so I can’t complain too much. I really like the paper. From the conclusion:
In this paper, we have thus proposed a framework for allocating and pricing resources in a grid-like environment. This framework employs a market economy with prices adjusted in periodic clock auctions. We have implemented a pilot allocation system within Google based on these ideas. Our preliminary experiments have resulted in significant improvements in overall utilization; users were induced to make their services more mobile, to make disk/memory/network tradeoffs as appropriate in different clusters, and to fully utilize each resource dimension, among other desirable outcomes. In addition, these auctions have resulted in clear price signals, information that the company and its engineering teams can take advantage of for more efficient future provisioning.
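The clock-auction pricing idea in the quoted conclusion can be sketched in a few lines. The linear demand model, step size, and prices below are all invented, and the paper’s actual mechanism is considerably more sophisticated; this only shows the basic feedback loop.

```python
# Sketch of the price-adjustment idea behind a periodic clock auction:
# each round, raise the resource price when demand exceeds supply and
# lower it when capacity sits idle, until the market roughly clears.
# The demand model and step size are invented for illustration.

def demand_at(price_cents):
    """Hypothetical aggregate demand: fewer machines wanted as price rises."""
    return max(0, 1000 - price_cents)

def clock_auction(supply, price_cents=100, step=10, max_rounds=200):
    for _ in range(max_rounds):
        d = demand_at(price_cents)
        if d > supply:
            price_cents += step                       # over-subscribed: raise
        elif d < supply:
            price_cents = max(0, price_cents - step)  # idle capacity: lower
        else:
            break                                     # market clears
    return price_cents

print(clock_auction(supply=400))  # 600: the clearing price in cents
```

The price that falls out of the auction is itself useful: as the conclusion notes, it is a clear signal the engineering teams can use for future provisioning.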
It’s worth reading the full paper: http://www.stokely.org/papers/google-cluster-auctions.pdf. One of the authors, Murray Stokely of Google, also wrote an interesting blog entry Fun with Amazon Web Services where he developed many of the arguments above. Thanks to Greg Linden and Deepak Singh for pointing me to this paper. It made for a good read this morning and I hope you enjoy it as well.