If you read this blog in the past, you’ll know I view cloud computing as a game changer (Private Clouds are not the Future) and spot instances as a particularly powerful innovation within cloud computing. Over the years, I’ve enumerated many of the advantages of cloud computing over private infrastructure deployments. A particularly powerful cloud computing advantage is driven by noting that when combining a large number of non-correlated workloads, the overall infrastructure utilization is far higher for most workload combinations. This is partly because the reserve capacity to ensure that all workloads are able to support peak workload demands is a tiny fraction of what is required to provide reserve surge capacity for each job individually.

This factor alone is a huge gain but an even bigger gain can be found by noting that all workloads are cyclic and go through sinusoidal capacity peaks and troughs. Some cycles are daily, some weekly, some hourly, and some on different cycles but nearly all workloads exhibit some normal expansion and contraction over time. This capacity pumping is in addition to handling unusual surge requirements or increasing demand discussed above.

To successfully run a workload, sufficient hardware must be provisioned to support the peak capacity requirement for that workload. Cost is driven by peak requirements but monetization is driven by the average. The peak to average ratio gives a view into how efficiently the workload can be hosted. Looking at an extreme, a tax preparation service has to provision enough capacity to support their busiest day and yet, in mid-summer, most of this hardware is largely unused. Tax preparation services have a very high peak to average ratio so, necessarily, utilization in a fleet dedicated to this single workload will be very low.

By hosting many diverse workloads in a cloud, the aggregate peak to average ratio trends towards flat. The overall efficiency to host the aggregate workload will be far higher than any individual workloads on private infrastructure. In effect, the workload capacity peak to trough differences get smaller as the number of combined diverse workloads goes up. Since costs tracks the provisioned capacity required at peak but monetization tracks the capacity actually being used, flattening this out can dramatically improve costs by increasing infrastructure utilization.

This is one of the most important advantages of cloud computing. But, it’s still not as much as can be done. Here’s the problem. Even with very large populations of diverse workloads, there is still some capacity that is only rarely used at peak. And, even in the limit with an infinitely large aggregated workload where the peak to average ratio gets very near flat, there still must be some reserved capacity such that surprise, unexpected capacity increases, new customers, or new applications can be satisfied. We can minimize the pool of rarely used hardware but we can’t eliminate it.

What we have here is yet another cloud computing opportunity. Why not sell the unused reserve capacity on the spot market? This is exactly what AWS is doing with Amazon EC2 Spot Instances. From the Spot Instance detail page:

Spot Instances enable you to bid for unused Amazon EC2 capacity. Instances are charged the Spot Price set by Amazon EC2, which fluctuates periodically depending on the supply of and demand for Spot Instance capacity. To use Spot Instances, you place a Spot Instance request, specifying the instance type, the Availability Zone desired, the number of Spot Instances you want to run, and the maximum price you are willing to pay per instance hour. To determine how that maximum price compares to past Spot Prices, the Spot Price history for the past 90 days is available via the Amazon EC2 API and the AWS Management Console. If your maximum price bid exceeds the current Spot Price, your request is fulfilled and your instances will run until either you choose to terminate them or the Spot Price increases above your maximum price (whichever is sooner).

It’s important to note two points:

1. You will often pay less per hour than your maximum bid price. The Spot Price is adjusted periodically as requests come in and available supply changes. Everyone pays that same Spot Price for that period regardless of whether their maximum bid price was higher. You will never pay more than your maximum bid price per hour.

2. If you’re running Spot Instances and your maximum price no longer exceeds the current Spot Price, your instances will be terminated. This means that you will want to make sure that your workloads and applications are flexible enough to take advantage of this opportunistic capacity. It also means that if it’s important for you to run Spot Instances uninterrupted for a period of time, it’s advisable to submit a higher maximum bid price, especially since you often won’t pay that maximum bid price.

Spot Instances perform exactly like other Amazon EC2 instances while running, and like other Amazon EC2 instances, Spot Instances can be terminated when you no longer need them. If you terminate your instance, you will pay for any partial hour (as you do for On-Demand or Reserved Instances). However, if the Spot Price goes above your maximum price and your instance is terminated by Amazon EC2, you will not be charged for any partial hour of usage.

Spot instances effectively harvest unused infrastructure capacity. The servers, data center space, and network capacity are all sunk costs. Any workload worth more than the marginal costs of power is profitable to run. This is a great deal for customers in because it allows non-urgent workloads to be run at very low cost. Spot Instances are also a great for the cloud provider because it further drives up utilization with the only additional cost being the cost of power consumed by the spot workloads. From Overall Data Center Costs, you’ll recall that the cost of power is a small portion of overall infrastructure expense.

I’m particularly excited about Spot instances because, while customers get incredible value, the feature is also a profitable one to offer. Its perhaps the purest win/win in cloud computing.

Spot Instances only work in a large market with many diverse customers. This is a lesson learned from the public financial markets. Without a broad number of buyers and sellers brought together, the market can’t operate efficiently. Spot requires a large customer base to operate effectively and, as the customer base grows, it continues to gain efficiency with increased scale.

I recently came across a blog posting that ties these ideas together: New CycleCloud HPC Cluster Is a Triple Threat: 30000 cores, $1279/Hour, & Grill monitoring GUI for Chef. What’s described in this blog posting is a mammoth computational cluster assembled in the AWS cloud. The speeds and feeds for the clusters:

· C1.xlarge instances: 3,809

· Cores: 30,472

· Memory: 26.7 TB

The workload was molecular modeling. The cluster was managed using the Condor job scheduler and deployment was automated using the increasingly popular Opscode Chef. Monitoring was done using a packaged that CycleComputing wrote that provides a nice graphical interface to this large cluster: Grill for CycleServer (very nice).

The cluster came to life without capital planning, there was no wait for hardware arrival, no datacenter space needed to be built or bought, the cluster ran 154,116 condor jobs with 95,078 compute hours of work and, when the project was done, was torn down without a trace.

What is truly eye opening for me in this example is that it’s a 30,000 core cluster for $1,279/hour. The cloud and Spot instances changes everything. $1,279/hour for 30k cores. Amazing.

Thanks to Deepak Singh for sending the CycleServer example my way.

–jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com