Every so often, I come across a paper that just nails it, and this one is pretty good. Using a Market Economy to Provision Resources across Planet-wide Clusters doesn’t fully investigate the space, but it’s great to see progress in this important area and the paper is a strong step in the right direction.
I spend much of my time working on driving down infrastructure costs. There is lots of great work that can be done in datacenter infrastructure, networking, and server design. It’s both a fun and important area. But, an even bigger issue is utilization. As an industry, we can and are driving down the cost of computing and yet it remains true that most computing resources never get used. Utilization levels at large and small companies typically run in the 10 to 20% range. I occasionally hear reference to 30% but it’s hard to get data to support it. Most compute cycles go wasted. Most datacenter power doesn’t get useful work done. Most datacenter cooling is not spent supporting productive work. Utilization is a big problem. Driving down the cost of computing certainly helps but it doesn’t address the core issue: low utilization.
That’s one of the reasons I work at a cloud computing provider. When you have very large, very diverse workloads, wonderful things happen. Workload peaks are not highly correlated. For example, tax preparation software is busy around tax time. Retail software is busy towards the end of the year. Social networking is busy while folks in the region are awake. All these peaks and valleys overlay to produce a much flatter peak-to-trough curve. As the peak-to-trough ratio decreases, utilization skyrockets. You can only get these massively diverse workloads in public clouds, and it’s one of the reasons why private clouds are a bit depressing (see Private Clouds are not the Future). Private clouds are so close to the right destination and yet that last turn was a wrong one and the potential gains won’t be achieved. I hate wasted work as much as I hate low utilization.
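To make the overlay effect concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the bump-shaped load curves, the numbers, and the simplification that capacity is provisioned for the peak, so utilization is average load divided by peak load. It is not measured data from any provider.

```python
import math

def daily_load(peak_hour, base=1.0, amplitude=9.0, width=3.0):
    """Hypothetical 24-hour load curve: a flat base plus one bump centered on peak_hour."""
    curve = []
    for hour in range(24):
        # Circular distance, so a peak near midnight wraps around the day.
        d = min(abs(hour - peak_hour), 24 - abs(hour - peak_hour))
        curve.append(base + amplitude * math.exp(-(d / width) ** 2))
    return curve

def peak_to_trough(curve):
    """Ratio of the busiest hour to the quietest hour."""
    return max(curve) / min(curve)

def utilization(curve):
    """Average load over peak load, assuming capacity is sized for the peak."""
    return sum(curve) / len(curve) / max(curve)

# One workload on its own, peaking mid-afternoon.
single = daily_load(peak_hour=14)

# Eight workloads whose peaks are spread evenly around the clock,
# summed hour by hour onto one shared pool of capacity.
workloads = [daily_load(peak_hour=h) for h in range(0, 24, 3)]
combined = [sum(hourly) for hourly in zip(*workloads)]

print(f"single:   peak/trough={peak_to_trough(single):.2f}, "
      f"utilization={utilization(single):.0%}")
print(f"combined: peak/trough={peak_to_trough(combined):.2f}, "
      f"utilization={utilization(combined):.0%}")
```

Real workloads are never this neatly staggered, so the combined curve in practice is flatter than any single workload but not flat.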
The techniques above smooth the aggregated utilization curve and, the flatter that curve gets, the higher the utilization, the lower the cost, and the better it is for the environment. Large public clouds flatten the workload peaks considerably, but the goal of steady, unchanging load 24 hours a day, 7 days a week isn’t achievable. Even power companies have base load and peak load. What to do with the remaining utilization valleys? The next technique is to use a market economy to incent developers and users to use resources that aren’t currently fully utilized.
In The Cost of A Cloud: Research Problems in Datacenter Networks, we argued that turning servers off is a mistake in that the most you can hope to achieve is to save the cost of the power which is tiny when compared to the cost of the servers, the cost of power distribution gear, and the cost of the mechanical systems. See the Cost of Power in Large-Scale Datacenters (I’ve got an update of this work coming – the changes are interesting but the cost of power remains the minority cost). Rather than shutting off servers, the current darling idea of the industry, we should be productively using the servers. If we can run any workload worth more than the marginal cost of power, we should. Again, a strong argument for public clouds with large pools of resources on which a market can be made.
Continuing with making a market and offering computing resources not under supply crunch (under-utilized) at lower costs, Amazon Web Services has a super interesting offering called spot instances. Spot instances allow customers to bid on unused EC2 capacity and allow them to run those instances as long as their bids exceed the current instance spot price.
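The bidding rule is simple enough to sketch. This is a toy model and not the AWS API: the price series and the bid are made-up numbers, the function names are hypothetical, and it assumes the customer is charged the spot price rather than the bid for each period the instance runs.

```python
def running_periods(bid, spot_prices):
    """True for each period in which the bid exceeds the spot price."""
    return [bid > price for price in spot_prices]

def total_cost(bid, spot_prices):
    """Total charge over the series.

    Assumption: the customer pays the spot price, not the bid,
    for every period in which the instance runs.
    """
    return sum(price for price in spot_prices if bid > price)

# Hypothetical hourly spot prices in dollars per instance-hour.
hypothetical_prices = [0.030, 0.032, 0.045, 0.060, 0.041, 0.029]
bid = 0.042

running = running_periods(bid, hypothetical_prices)
cost = total_cost(bid, hypothetical_prices)
print(running)
print(f"{cost:.3f}")
```

The instance is interrupted while demand pushes the price above the bid and resumes when the price falls back, which is exactly the behavior that steers flexible work into the utilization valleys.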
The paper I mentioned above is heading in a similar direction, but this time working on the Google MapReduce cluster utilization problem. Technically the paper is actually working on a private cloud, but it’s still nice work and it is using the biggest private cloud in the world at well over a million servers, so I can’t complain too much. I really like the paper. From the conclusion:
In this paper, we have thus proposed a framework for allocating and pricing resources in a grid-like environment. This framework employs a market economy with prices adjusted in periodic clock auctions. We have implemented a pilot allocation system within Google based on these ideas. Our preliminary experiments have resulted in significant improvements in overall utilization; users were induced to make their services more mobile, to make disk/memory/network tradeoffs as appropriate in different clusters, and to fully utilize each resource dimension, among other desirable outcomes. In addition, these auctions have resulted in clear price signals, information that the company and its engineering teams can take advantage of for more efficient future provisioning.
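For readers who haven’t seen a clock auction before, here is a minimal sketch of the general idea: the price rises round by round until demand no longer exceeds supply. This is a generic, single-resource simplification and not the mechanism from the paper. The bidders, the price step, and the function name are all assumptions for illustration.

```python
def clock_auction(supply, bidders, start_price=1.0, step=0.05, max_rounds=1000):
    """Raise the price each round until total demand fits within supply.

    bidders is a list of (quantity, limit_price) pairs. Assumption: a bidder
    wants its full quantity while the price is at or below its limit, and
    nothing once the price goes above it.
    """
    price = start_price
    for _ in range(max_rounds):
        demand = sum(qty for qty, limit in bidders if limit >= price)
        if demand <= supply:
            return price, demand
        # Excess demand: tick the clock and raise the price.
        price *= 1 + step
    raise RuntimeError("auction did not clear within max_rounds")

# Hypothetical bidders: (units wanted, highest price per unit they will pay).
hypothetical_bidders = [(40, 5.0), (30, 3.0), (50, 2.0), (20, 1.5)]
clearing_price, allocated = clock_auction(supply=80, bidders=hypothetical_bidders)
print(f"clearing price ~{clearing_price:.2f}, units allocated: {allocated}")
```

The clearing price is the kind of price signal the conclusion refers to: a resource that clears far above its starting price is telling you where more capacity is needed.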
It’s worth reading the full paper: http://www.stokely.org/papers/google-cluster-auctions.pdf. One of the authors, Murray Stokely of Google, also wrote an interesting blog entry Fun with Amazon Web Services where he developed many of the arguments above. Thanks to Greg Linden and Deepak Singh for pointing me to this paper. It made for a good read this morning and I hope you enjoy it as well.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Dennis asked if public clouds could possibly host HPC or analytic workloads, especially those dependent upon specialized and high performance hardware. Yes, absolutely. If there is a broad market that needs GPUs, FPGAs, high performance networking, etc., then cloud providers will emerge that offer this support.
Where the market is narrow, I expect you are correct that support won’t show up in the cloud. The common case is that customers aren’t alone in their hardware needs and, where there is broad interest, there will be cloud solutions. Market economies are a wonderful thing.
Re your referenced entry on so-called private clouds. One argument being surfaced in favor of keeping compute capability in house is the need to provision application-specific infrastructure capabilities (e.g. engines such as Oracle’s Exadata supposedly tuned for database applications, HPC for analytics on massive data sets, etc.). Can a public cloud, presumably based on commodity technology, effectively handle these sorts of workloads? Thanks.
It’s great hearing from you, Murray.
If you are around Seattle area anytime in the future, I would love to hear more about the work you are doing. Drop by and grab a coffee or an after-work beer.
James, thanks for the kind words about the paper. I’ve been reading your blog for ~2 years, but just now saw this. I’ve enjoyed the links and commentary on all the interesting work you write about here, so please keep it up.
You’re right, the savings from shutting off servers are 15 to 20%. So, the quick answer is the one that is currently exciting the industry: shut them off and save money. This logic is almost correct but doesn’t look closely enough. When you shut a server off it is at zero utilization, but you are still paying for the power distribution system, the datacenter shell, the mechanical systems, the networking gear, and the servers. If utilization is resources used divided by resources spent, then shutting off reduces the denominator by 20% but drops the numerator to 0. For example, if you were running at 20% utilization over a total resource allocation R, utilization was 0.2R/R = 20%, and shutting off yields a utilization of 0/(0.8R) = 0. It actually reduces the resource utilization.
Your point that it saves 20% of the resources is still correct, but it is useful to look at the problem the other way around. If you have a workload that does productive work worth more than 20% of the overall total cost of all datacenter resources, then you should run it. Any workload worth more than the marginal cost of power (a fairly low bar) should be run, and doing so will increase resource utilization and improve work done per resource dollar.
It’s counterintuitive but "off" is not nearly as good as "in use". Thanks for the question, Mikio.
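The arithmetic can be worked through in a few lines. This is a sketch of the argument and not a cost model: R = 100 is an arbitrary unit of total spend, the 20% power share is the round number from this thread, and the function names are hypothetical.

```python
def utilization(resources_used, resources_spent):
    """Utilization as defined in this thread: resources used over resources spent."""
    return resources_used / resources_spent

R = 100.0           # hypothetical total resource spend, arbitrary units
POWER_SHARE = 0.20  # power as a fraction of total cost (round number from the thread)

# Server left on at 20% utilization: using 0.2R while spending all of R.
util_on = utilization(0.20 * R, R)

# Server shut off: the power spend is saved, but nothing useful is done.
util_off = utilization(0.0, (1 - POWER_SHARE) * R)

def worth_running(workload_value, marginal_power_cost):
    """Run anything whose value exceeds the marginal cost of the power it uses."""
    return workload_value > marginal_power_cost

print(f"on:  {util_on:.0%}")
print(f"off: {util_off:.0%}")
print(worth_running(workload_value=25.0, marginal_power_cost=POWER_SHARE * R))
```

Shutting off trims the denominator by a fifth but takes the numerator to zero, which is why any workload that clears the power bill is better than an idle, powered-down server.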
Interesting subject. I agree that increasing utilization is important and that there are ways to improve it. But looking at your cost analysis of the datacenter, the cost of power represents around 20%. I do not think this is tiny; it’s a big portion of datacenter cost. I believe shutting down servers when they are not in use makes sense.
Am I missing something?
Thanks for the feedback Javier.
I usually don’t write comments; I just read your posts … Thank you for sharing your ideas, and for your lucid vision or, better said, perspective.
Good to see Ed. Finding idle servers is hard in that even "idle" servers are usually running at least some daemon processes. What’s really challenging is knowing what to do with an idle server. Will it be needed 2 msec from now or not for the rest of the day? It’s a hard problem and I’m glad you are working on it.
I’m a software engineer at 1e. I work on a product that is targeted at discovering the useful work that a server estate is doing. Currently we’re focused on Windows, with some Unix/Linux support.
I thought that you would be interested to know that someone is trying to solve the useful work and utilisation problem, at least from the point of view of discovering servers that aren’t doing much useful work.
If you google ‘night watchman server’ you will find lots of information about our product.