Wednesday, December 31, 2008

In a previous posting, Pat Selinger IBM Ph.D. Fellowships, I mentioned Pat Selinger as one of the greats of the relational database world.  Working with Pat was one of the reasons why leaving IBM back in the mid-90’s was a tough decision for me.  The December 2008 edition of the Communications of the ACM includes an interview I did with Pat back in 2005: Database Dialogue with Pat Selinger. It originally ran as an ACM Queue article.


If you haven’t checked out the CACM recently, you should. The new format is excellent and the articles are now worth reading. The magazine is regaining its old position of decades ago as a must-read publication.


Thanks to Andrew Cencini for pointing me towards this one. I hadn’t yet read my December issues.


James Hamilton
Amazon Web Services

Wednesday, December 31, 2008 5:35:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Sunday, December 28, 2008

I’ve long argued that tough constraints often make for a better service, and few services are more constrained than Wikipedia, where the only source of revenue is user donations. I came across this talk by Domas Mituzas of Wikipedia while reading old posts on Data Center Knowledge. The posting A Look Inside Wikipedia’s Infrastructure includes a summary of the talk Domas gave at Velocity last summer.


Interesting points from the Data Center Knowledge posting and the longer document referenced below from the 2007 MySQL conference:

·  Wikipedia serves the world from roughly 300 servers

o  200 application servers

o  70 Squid servers

o  30 Memcached servers (2GB each)

o  20 MySQL servers using InnoDB, each with 16GB of memory and 200 to 300GB of data

o  They also use Squid, Nagios, dsh, nfs, Ganglia, Linux Virtual Service, Lucene over .net on Mono, PowerDNS, lighttpd, Apache, PHP, MediaWiki (originated at Wikipedia)

·  50,000 http requests per second

·  80,000 MySQL requests per second

·  7 million registered users

·  18 million objects in the English version


For the 2007 MySQL Users Conference, Domas posted great details on the Wikipedia architecture: Wikipedia: Site internals, configuration, code examples and management issues (30 pages).  I’ve posted other big service scaling and architecture talks at:


James Hamilton
Amazon Web Services


Updated: Corrected formatting issue.

Sunday, December 28, 2008 7:04:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Saturday, December 27, 2008

Viraj Mody of the Microsoft Live Mesh team sent this my way: Dan Farino About MySpace Architecture.


MySpace, like Facebook, uses relational DBs extensively, front-ended by a layer of Memcached servers. There’s less open source at MySpace but it’s otherwise unsurprising – a nice scalable design with 3000 front-end servers and well over 100 database servers (1M users per DB server).


Notes from Viraj:

·         ~3000 FEs running IIS 6

·         .NET 2.0 and 3.5 on FE and BE machines

·         DB is SQL 2005 but hit scaling limits, so they built their own unmanaged memcache implementation on 64-bit machines; uses .NET for exposing communications with the layer

·         DB partitioned to assign ~1million users per DB and Replicated

·         Media content (audio/video) hosted on DFS built using Linux served over http

·         Extensive use of PowerShell for server management

·         Started using ColdFusion, moved when scale became an issue

·         Profiling tools build using CLR profiler and technology from Microsoft Research

·         Looking to upgrade code to use LINQ

·         Spent a lot of time building diagnostic utilities

·         Pretty comfortable with the 3-tier FE + memcache + DB architecture

·         Dealing with caching issues – not a pure write-thru/read-thru cache. Currently reads populate the cache while writes flush the cache entry and just write to the DB. Looking to update this, but it worked well since it was ‘injected’ into the architecture.
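The read/write flow described in that last bullet is a cache-aside pattern. Here's a minimal sketch of the idea (plain dicts stand in for SQL Server and the memcache tier; all names are illustrative, not MySpace's actual code):

```python
# Cache-aside sketch: reads populate the cache on a miss; writes go
# straight to the DB and invalidate (flush) the cached entry.

class CacheAside:
    def __init__(self, db):
        self.db = db          # authoritative store (stand-in for the DB)
        self.cache = {}       # volatile cache tier (stand-in for memcache)

    def read(self, key):
        if key in self.cache:             # cache hit
            return self.cache[key]
        value = self.db[key]              # miss: fetch from the DB...
        self.cache[key] = value           # ...and populate the cache
        return value

    def write(self, key, value):
        self.db[key] = value              # write to the DB
        self.cache.pop(key, None)         # flush the cache entry

store = CacheAside({"user:1": "alice"})
assert store.read("user:1") == "alice"    # miss populates the cache
store.write("user:1", "bob")              # invalidates the entry
assert store.read("user:1") == "bob"      # repopulated from the DB
```

Because the cache is only a hint and never the system of record, this pattern can be "injected" into an existing DB-backed architecture, which matches the history described above.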


I collect high scale service architecture and scaling war stories.  These were previously posted here:

·         Scaling Amazon:

·         Scaling Second Life:

·         Scaling Technorati:

·         Scaling Flickr:

·         Scaling Craigslist:

·         Scaling Findory:

·         MySpace 2006:

·         MySpace 2007:

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, SlideShare, and eBay:

·        Scaling LinkedIn:


James Hamilton
Amazon Web Services


Updated: Corrected formatting issue.

Saturday, December 27, 2008 3:12:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, December 24, 2008

Five or six years ago Bill Gates did a presentation to a small group at Microsoft on his philanthropic work at the Bill and Melinda Gates Foundation.  It was far and away the most compelling talk I had seen: Bill applying his talent to solving world health problems with the same relentless drive, depth of understanding, constant learning, excitement, and focus with which he applied himself daily (at the time) at Microsoft.


Thanks to O’Reilly, I just watched an interview with Bill by Charlie Rose that has a lot of the same character; Bill makes some of the same points as in that talk I saw some years ago. Find it in Dale Dougherty’s post Admiring Bill Gates.  It’s well worth watching.



Wednesday, December 24, 2008 2:07:17 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Tuesday, December 23, 2008

Related to The Cost of Bulk Storage posting, Mike Neil dropped me a note. He’s built an array based upon this Western Digital part. It’s unusually power efficient:

Power Dissipation

·  Read/Write: 5.4 Watts

·  Idle: 2.8 Watts

·  Standby: 0.40 Watts

·  Sleep: 0.40 Watts


And it’s currently only $105: 


It’s always been the case that home storage is wildly cheaper than data center hosted storage.  What excites me even more than the continued plunging cost of raw storage is that data center hosted storage is asymptotically approaching home storage costs. Data center storage includes someone else doing capacity planning and buying new equipment when needed, someone else replacing failed disks and servers and, in the case of S3, it’s geo-redundant (data is stored in multiple data centers).


I’ve not yet discarded my multi-TB home storage system, but the time is near.



Tuesday, December 23, 2008 8:46:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Monday, December 22, 2008

I wrote this blog entry a few weeks ago before my recent job change.  It’s a look at the cost of high-scale storage and how it has fallen over the last two years based upon the annual fully burdened cost of power in a data center and industry disk cost trends. The observations made in this post are based upon understanding these driving costs and should model any efficient, high-scale bulk storage farm. But, and I need to be clear on this, it was written prior to my joining AWS and there is no information below relating to any discussions I’ve had with AWS or how the AWS team specifically designs, deploys, or manages their storage farm.


When Amazon released Amazon S3, I argued that it was priced below cost at $1.80/GB/year.  At that time, my estimate of their cost was $2.50/GB/year.  The Amazon charge of $1.80/GB/year for data to be stored twice in each of two data centers is impressive. It was amazing when it was released and it remains an impressive value today. 


Even though the storage price was originally below cost by my measure, Amazon could still make money if they were running a super-efficient operation (likely the case).  How could they make money charging less than cost for storage? Customers are charged for ingress/egress on all data entering or leaving the AWS cloud.  The network ingress/egress rates charged by AWS are reasonable, but telecom pricing strongly rewards volume purchases, so what Amazon pays is likely much less than the AWS ingress/egress charges.  This potentially allows the storage business to be profitable even when operating at a storage cost loss.


One concern I’ve often heard is the need to model the networking costs between the data centers since there are actually two redundant copies stored in two independent data centers.  Networking, like power, is usually billed at the 95th percentile over a given period. The period is usually a month, but more complex billing systems exist. The constant across most of these high-scale billing systems is that the charge is based upon peaks. What that means is that adding ingress or egress at an off-peak time is essentially free. Assuming peaks are short-lived, the sync to the other data center can be delayed until the peak has passed.  If the SLA doesn’t have a hard deadline on when the sync will complete (it doesn’t), then the inter-DC bandwidth is effectively without cost.  I call this technique Resource Consumption Shaping and it’s one of my favorite high-scale service cost savers.
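The effect is easy to see in a toy model of 95th-percentile billing (all numbers invented for illustration): the bill is set by the 95th-percentile bandwidth sample, so sync traffic added only in the valleys never changes it.

```python
# Toy illustration of 95th-percentile billing: the bill is set by the
# 95th-percentile sample of measured bandwidth, so traffic added in the
# valleys (e.g. a delayed inter-DC sync) doesn't change the bill.

def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s)) - 1]   # simple 95th-percentile estimate

# 100 five-minute samples in Mbps: mostly base load, a few peaks.
base = [100] * 90 + [400] * 10
assert p95(base) == 400                # billed at the peak rate

# Add 200 Mbps of sync traffic, but only during off-peak samples.
shaped = [s + (200 if s == 100 else 0) for s in base]
assert p95(shaped) == 400              # bill unchanged: the sync was free
```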


What is the cost of storage today in an efficient, commodity bulk-storage service? Building upon the models in the cost of power in large-scale data centers and the annual fully burdened cost of power, here’s the model I use for cold storage with current data points:

Note that this is for cold storage and I ignore the cost of getting the data to or from the storage farm.  You need to pay for the networking you use.  Again, since it’s cold storage, the model assumes you can use 80% of the disk, which wouldn’t be possible for data with high I/O rates per GB. And we’re using commodity SATA disks at 1TB that only consume 10W of power. This is a cold storage model.  If you are running higher I/O rates, figure out what percentage of the disk you can successfully use and update the model in the spreadsheet (ColdStorageCost.xlsx (13.86 KB)). If you are using higher-power, enterprise disks, you can update the model to use roughly 15W for each.

Update: Bryan Apple found two problems with the spreadsheet that have been corrected in the linked spreadsheet above. Ironically, the resulting fully burdened cost/GB/year is unchanged. Thanks Bryan.

For administration costs, I’ve used a fixed, fairly conservative factor of a 10% uplift on all other operations and administration costs. Most large-scale services are better than this and some are more than twice as good, but I included the conservative 10% number.


Cold storage with 4x copies at high scale can now be delivered at $0.80/GB/year.  It’s amazing what falling server prices and rapidly increasing disk sizes have done.  But it’s actually pretty hard to do, and I’ve led storage-related services that didn’t get close to this efficient --  I still think that Amazon S3 is a bargain.


Looking at the same model but plugging in numbers from about two years ago shows how fast storage costs are plunging. Using $2,000 servers rather than $1,200, server power consumption at 250W rather than 160W, disk size at ½ TB, and disk cost at $250 rather than $160 yields an amazingly different $2.40/GB/year.
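The shape of the model can be sketched in a few lines. The linked spreadsheet is authoritative; the disks-per-server count (12), the three-year amortization on servers and disks, and the $2.12/W/year burdened power rate below are my assumptions, so the output lands near, not exactly on, the figures quoted.

```python
# Rough re-creation of the cold storage cost model: amortized hardware
# plus fully burdened power, divided by usable (80%, 4-copy) capacity,
# with a 10% admin uplift. Disks/server, amortization period, and the
# power rate are assumptions; the spreadsheet may differ in detail.

def cost_per_gb_year(server_cost, server_w, disks, disk_cost, disk_tb,
                     disk_w, utilization=0.8, copies=4, amort_years=3,
                     power_rate=2.12, admin_uplift=0.10):
    usable_gb = disks * disk_tb * 1000 * utilization / copies
    hardware = (server_cost + disks * disk_cost) / amort_years   # $/year
    power = (server_w + disks * disk_w) * power_rate             # $/year
    return (hardware + power) * (1 + admin_uplift) / usable_gb

now  = cost_per_gb_year(1200, 160, 12, 160, 1.0, 10)   # today's numbers
then = cost_per_gb_year(2000, 250, 12, 250, 0.5, 10)   # ~2 years ago
print(f"today:       ${now:.2f}/GB/year")
print(f"2 years ago: ${then:.2f}/GB/year")
```

With these assumptions the model gives roughly $0.75 and $2.25/GB/year, in the same ballpark as the $0.80 and $2.40 figures; the gap comes from the guessed disk count and amortization periods, and the three-fold drop over two years comes through clearly either way.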


Cold storage with redundancy at $0.80/GB/year and still falling. Amazing.




James Hamilton


Monday, December 22, 2008 7:24:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [19] - Trackback

 Wednesday, December 17, 2008

Resource Consumption Shaping is an idea that Dave Treadwell and I came up with last year.  The core observation is that service resource consumption is cyclical. We typically pay for near-peak consumption and yet frequently consume far below this peak.  For example, network egress is typically charged at the 95th percentile of peak consumption over a month and yet the real consumption is highly sinusoidal and frequently far below this charged-for rate.  Substantial savings can be realized by smoothing the resource consumption.


Looking at the network egress traffic report below, we can see this prototypical resource consumption pattern:


You can see from the chart above that resource consumption over the course of a day varies by more than a factor of two. This variance is driven by a variety of factors, but an interesting one is the size of the Pacific Ocean, where the population density is near zero. As the service peak load time-of-day sweeps around the world, network load falls to base load levels as the peak time range crosses the Pacific Ocean.  Another contributing factor is wide variance in the success of this example service in different geographic markets.


We see the same opportunities with power.  Power is usually charged at the 95th percentile over the course of the month.  It turns out that some negotiated rates are more complex than this, but the same principle can be applied to any peak-load-sensitive billing system.  For simplicity’s sake, we’ll look at the common case of systems that charge the 95th percentile over the month.


Server power consumption varies greatly depending upon load.  Data from an example server SKU shows idle power consumption of 158W and full-load consumption of about 230W.  If we defer batch and non-user-synchronous workload as we approach the current data center power peak, we can reduce overall peaks. As server power consumption moves away from a peak, we can reschedule this non-critical workload.  Using this technique, we throttle back power consumption and knock off the peaks by filling the valleys. Another often-discussed technique is to shut off non-needed servers and use workload peak clipping and trough filling to run the workload with fewer servers turned on. Using this technique, it may actually be possible to run the service with fewer servers overall.  In Should we Shut Off Servers, I argue that shutting off servers should NOT be the first choice.


Applying this technique to power has a huge potential upside because power provisioning and cooling dominates the cost of a data center.  Filling valleys allows better data center utilization in addition to lowering power consumption charges. 


The resource-shaping techniques we’re discussing here, smoothing spikes by knocking off peaks and filling valleys, apply to all data center resources.  We have to buy servers to meet the highest load requirements.  If we knock off peaks and fill valleys, fewer servers are needed.  The same applies to internal networking.  In fact, resource shaping as a technique applies to all resources across the data center; the only difference is the varying complexity of scheduling the consumption of these different resources.


One more observation along this theme, this time returning to egress charges. We mentioned earlier that egress is charged at the 95th percentile.  What we didn’t mention is that ingress/egress are usually purchased symmetrically.  If you need to buy N units of egress, then you just bought N units of ingress whether you need it or not. Many services are egress dominated. If we can find a way to trade ingress to reduce egress, we save.  In effect, it’s cross-dimensional resource shaping, where we are trading off consumption of a cheap or free resource to save an expensive one.  On an egress-dominated service, even inefficient trades that spend, say, 10 units of ingress to save only 1 unit of egress may still work economically.  Remote Differential Compression is one approach to reducing egress at the expense of a small amount of ingress.


The cross-dimensional resource-shaping technique described above where we traded off ingress to reduce egress can be applied across other dimensions as well.  For example, adding memory to a system can reduce disk and/or network I/O.  When does it make sense to use more memory resources to save disk and/or networking resources?  This one is harder to dynamically tune in that it’s a static configuration option but the same principles can be applied.


We find another multi-resource trade-off possibility with disk drives.  When a disk is purchased, we are buying both a fixed I/O capability and a fixed disk capacity in a single package.  For example, when we buy a commodity 750GB disk, we get a bit less than 750GB of capacity and the capability of somewhat more than 70 random I/Os per second (IOPS).  If the workload needs more than 70 IOPS per disk, we run out of I/O capability before the capacity is consumed and capacity is wasted. If the workload fills the disk capacity without using the full IOPS capability, the I/O capability is wasted.


Even more interesting, we can mix workloads from different services to “absorb” the available resources. Some workloads are I/O bound while others are storage bound. If we mix these two storage workloads types, we may be able to fully utilize the underlying resource.  In the mathematical limit, we could run a mixed set of workloads with ½ the disk requirements of a workload partitioned configuration.  Clearly most workloads aren’t close to this extreme limit but savings of 20 to 30% appear attainable.   An even more powerful saving is available from mixing workloads using storage by sharing excess capacity. If we pool the excess capacity and dynamically move it around, we can safely increase the utilization levels on the assumption that not all workloads will peak at the same time. As it happens, the workloads are not highly correlated in their resource consumption so this technique appears to offer even larger savings than what we would get through mixing I/O and capacity-bound workloads.  Both gains are interesting and both are worth pursuing.
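The partitioned-vs-mixed comparison can be sketched numerically. The 750GB/70-IOPS disk figures are from above; the two workloads are invented for illustration, one IOPS-bound and one capacity-bound:

```python
# Sketch of the workload-mixing argument: each disk supplies both
# capacity and IOPS. Partitioned, each workload strands one dimension;
# mixed on shared spindles, both dimensions get absorbed.

import math

DISK_GB, DISK_IOPS = 750, 70

def disks_needed(gb, iops):
    # Disks required to satisfy BOTH the capacity and the IOPS demand.
    return max(math.ceil(gb / DISK_GB), math.ceil(iops / DISK_IOPS))

# Workload A: IOPS-bound (little data, lots of random I/O).
# Workload B: capacity-bound (lots of data, little I/O).
a_gb, a_iops = 3_000, 14_000
b_gb, b_iops = 150_000, 1_000

partitioned = disks_needed(a_gb, a_iops) + disks_needed(b_gb, b_iops)
mixed = disks_needed(a_gb + b_gb, a_iops + b_iops)
print(partitioned, mixed)   # mixed needs far fewer disks
```

This invented pair is close to the mathematical limit (near half the disks); real workload mixes are less complementary, which is why 20 to 30% is the more realistic savings cited above.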


Note that the techniques that I’ve broadly called resource shaping are an extension of an existing principle called network-traffic shaping. I see great potential in fundamentally changing the cost of services by making services aware of the real second-to-second value of a resource and allowing them to break their resource consumption into classes of urgent (expensive), less urgent (somewhat cheaper), and bulk (near free).



James Hamilton

Wednesday, December 17, 2008 9:16:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback

 Saturday, December 13, 2008

I’ve resigned from Microsoft and will join the Amazon Web Services team at the start of next year.  As an AWS user, I’ve written thousands of lines of app code against S3, and now I’ll have an opportunity to help improve and expand the AWS suite.

In this case, I’m probably guilty of what many complain about in bloggers: posting rehashed news reported broadly elsewhere without adding anything new:










Job changes generally bring some stress, and that’s probably why I’ve only moved between companies three times in 28 years. I worked 6 years as an auto-mechanic, 10 years at IBM, and 12 years at Microsoft. Looking back over my 12 years at Microsoft, I couldn’t have asked for more excitement, more learning, more challenges, or more trust.

I’ve had a super interesting time at Microsoft and leaving is tough, but I also remember feeling the same way when I left IBM after 10 years to join Microsoft. Change is good; change challenges; change forces humility; change teaches. I’m looking forward to it even though all new jobs are hard. Onward!



Saturday, December 13, 2008 5:26:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
 Saturday, December 06, 2008

In the Cost of Power in Large-Scale Data Centers, we looked at where the money goes in a large scale data center.  Here I’m taking similar assumptions and computing the Annual Cost of Power including all the infrastructure as well as the utility charge. I define the fully burdened cost of power to be the sum of 1) the cost of the power from the utility, 2) the cost of the infrastructure that delivers that power, and 3) the cost of the infrastructure that gets the heat from dissipating the power back out of the building.


We take the monthly cost of the power and cooling infrastructure, assuming a 15-year amortization cycle and a 5% annual cost of money billed annually, divided by the overall data center critical load to get the annual infrastructure cost per watt. The fully burdened cost of power is the cost of consuming 1W for an entire year and includes the power and cooling infrastructure as well as the power consumed. Essentially it’s the cost of all the infrastructure except the cost of the data center shell (the building).  From Intense Computing or In Tents Computing, we know that 82% of the cost of the entire data center is power delivery and cooling, so the entire monthly facility cost times 82%, divided by the facility critical load, is a good estimator of the infrastructure cost of power.


The fully burdened cost of power is useful for a variety of reasons but here are two: 1) current-generation servers get more work done per joule than older servers -- when is it cost effective to replace them?  And 2) SSDs consume much less power than HDDs -- how much can I save in power over three years by moving to SSDs and is it worth doing?


We’ll come back to those two examples after we work through what power costs annually. In this model, like the last one, we’ll assume a 15MW data center that was built at a cost of $200M and runs at a PUE of 1.7. This is better than most, but not particularly innovative.
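The calculation can be sketched as follows. The facility cost, critical load, PUE, 82% infrastructure share, and 15-year/5% amortization are from this post; the $0.07/kWh utility rate is my assumption for illustration. With these inputs the result lands within a couple of cents of the roughly $2.12/W/year figure used in the examples below.

```python
# Sketch of the fully burdened cost of power: annualized power/cooling
# infrastructure cost per critical watt, plus the utility cost of
# running 1W of critical load for a year at the facility PUE.

facility_cost   = 200e6        # $ (from the post)
critical_load_w = 15e6         # W (from the post)
pue             = 1.7
infra_share     = 0.82         # power + cooling share of facility cost
years, rate     = 15, 0.05     # amortization period, cost of money
utility         = 0.07         # $/kWh -- ASSUMED rate for illustration

# Capital recovery factor: annual payment per $1 amortized.
crf = rate * (1 + rate) ** years / ((1 + rate) ** years - 1)

infra_per_w = facility_cost * infra_share * crf / critical_load_w
power_per_w = pue * 8760 / 1000 * utility     # 1W critical for a year

burdened = infra_per_w + power_per_w
print(f"infrastructure: ${infra_per_w:.2f}/W/year")
print(f"utility:        ${power_per_w:.2f}/W/year")
print(f"fully burdened: ${burdened:.2f}/W/year")
```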


Should I Replace Old Servers?

Let’s say we have 500 servers, each of which can process 200 application operations/second. These servers are about 4 years old and consume 350W each.  A new server has been benchmarked at 250 operations/second; each of these servers costs $1,300 and consumes 160W at full load. Should we replace the farm?


Using the new server, we only need 400 servers to do the work of the previous 500 (500*200/250). The new server farm consumes 111kW less power ((500*350W)-(400*160W)).  Let’s assume a plan to keep the new servers for three years.  We save 111kW for three years and we know from the above model that power costs $2.12/W/year fully burdened. Over three years, we’ll save $705,960.  The new servers will cost $520,000 so, by recycling the old servers and buying new ones, we save $185,960. To be fair, we should accept a charge to recycle the old ones and we need to model the cost of tying up $520k in capital. We ignore the recycling costs and use a 5% cost of money to model the capital cost of the servers. Using a 5% cost of money over a three-year amortization period, we’ll have another $52,845 in interest if we were to borrow to buy these servers, or simply in recognition that tying up capital has a cost.


Accepting this $52k charge for tying up capital, it’s still a gain of roughly $133k to recycle the old servers and buy new ones. In this case, we should replace the servers.
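The replacement arithmetic above can be checked with a short script. Annual-payment amortization is my assumption; it reproduces the interest figure in the post to within a couple of dollars.

```python
# Server-replacement economics from the worked example: 500 old servers
# (350W, 200 ops/s) replaced by $1,300 servers (160W, 250 ops/s), power
# at the fully burdened $2.12/W/year, capital amortized over 3 years
# at 5% with annual payments (my assumption).

old_n, old_w, old_ops = 500, 350, 200
new_w, new_ops, new_cost = 160, 250, 1300
power_rate, years, rate = 2.12, 3, 0.05

new_n = old_n * old_ops // new_ops                 # 400 servers
watts_saved = old_n * old_w - new_n * new_w        # 111,000 W
power_savings = watts_saved * power_rate * years   # over 3 years

capital = new_n * new_cost                         # $520,000
payment = capital * rate / (1 - (1 + rate) ** -years)  # annual payment
interest = payment * years - capital               # ~ the post's $52,845

net = power_savings - capital - interest
print(f"power savings: ${power_savings:,.0f}")
print(f"capital: ${capital:,}, interest: ${interest:,.0f}")
print(f"net gain from replacing: ${net:,.0f}")     # positive: replace
```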


What is an SSD Worth?

Let’s look at the second example of the two I brought up above. Let’s say I can replace 10 disk drives with a single SSD. If the workload is not capacity bound and is I/O intensive, this can be the case (see When SSDs Make Sense in Server Applications). Each HDD consumes roughly 10W whereas the SSD only consumes 2.5W. Replacing these 10 HDDs with a single SSD saves 97.5W and, over a three-year life, that’s a savings of 292.5 watt-years. Using the fully burdened cost of power from the above model, we could save $620 (292.5*$2.12) on power alone.  Let’s say the disk drives are $160 each and will last three years. What’s the break-even point where the SSD is a win, assuming the performance is adequate and ignoring other factors such as lifetime and service?  We take the cost of the 10 disks and add in the cost of power saved to see what we could afford to pay for an SSD – the break-even point (10*160+620 => $2,220).  If the SSD is under $2,220, then it is a win. The Intel X-25E had a street price of around $700 the last time I checked and, in many application workloads, it will easily replace 10 disks. Our conclusion is that, in this case with these assumptions, the SSD looks like a better investment than 10 disks.
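The SSD break-even calculation, using only the numbers from the paragraph above:

```python
# SSD break-even sketch: 10 HDDs at 10W each vs one SSD at 2.5W, power
# at the fully burdened $2.12/W/year over a 3-year life, HDDs at $160.

hdd_n, hdd_w, hdd_cost = 10, 10.0, 160
ssd_w = 2.5
power_rate, years = 2.12, 3

watts_saved = hdd_n * hdd_w - ssd_w                # 97.5 W continuous
power_savings = watts_saved * power_rate * years   # ~$620 over 3 years
breakeven = hdd_n * hdd_cost + power_savings       # what an SSD is worth
print(f"break-even SSD price: ${breakeven:,.0f}")  # ~$2,220
```

Any SSD under the break-even price that delivers adequate IOPS is the better buy, which is why a ~$700 drive wins here.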


When you factor in the fully burdened price of power, savings can add up quickly.  Compute your fully burdened cost of power and figure out when you should be recycling old servers or considering lower-power components.


If you are interested in tuning the assumptions to more closely match your current costs, here it is: PowerCost.xlsx (11.27 KB).




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Saturday, December 06, 2008 4:23:54 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback

 Wednesday, December 03, 2008

Michael Manos yesterday published Our Vision for Generation 4 Modular Data Centers – One Way of Getting it Just Right. In this posting, Mike goes through the next-generation modular data center designs for Microsoft. Things are moving quickly. I first argued for modular designs in a Conference on Innovative Data Systems paper submitted in 2006.  Last spring I blogged First Containerized Data Center Announcement, which looks at the containerized portion of the Chicago data center.


In this more recent post, the next generation design is being presented in surprising detail. The Gen4 design has 4 classes of service:

·         A: No UPS and no generator

·         B: UPS with optional generator

·         C: UPS, generator with +1 maintenance support

·         D: UPS and generator with +2 support


I’ve argued for years that high-minute UPS and generators are a poor investment.  We design services to be able to maintain SLA through server hardware or software error.  If a service is hosted over a large number of data centers, the loss of an entire data center should not impact the ability of the service to meet the SLA. There is no doubt that this is true and there are services that exploit this fact and reduce their infrastructure costs by not deploying generators. The problem is the vast majority of services don’t run over a sufficiently large number of data centers and some have single points of failure not distributed across data centers. Essentially some services can be hosted without high-minute UPSs and generators but many can’t be. Gen4 gets around that by offering a modular design where A class has no backup and D class is a conventional facility with good power redundancy (roughly a tier-3 design).


The Gen4 design is nearly 100% composed of prefabricated parts. Rather than just the server modules, all power distribution, mechanical, and even administration facilities are modular and prefabricated. This allows for rapid and incremental deployment.  With a large data center costing upwards of $200m (Cost of Power in High Scale Data Centers), an incremental approach to growth is a huge advantage.


Gen4 aims to achieve a PUE of 1.125 and to eliminate the use of water in the mechanical systems relying instead 100% on air-side economization.


Great data, great detail, and hats off to Mike and the entire Microsoft Global Foundation Services team for sharing this information with the industry. It’s great to see.




Thanks to Mike Neil for pointing this posting out to me.


James Hamilton, Data Center Futures


Wednesday, December 03, 2008 7:05:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

Ed Lazowska of the University of Washington invited me in to speak to his CSE 490H class. This is a great class that teaches distributed systems in general, and the programming assignments are MapReduce workloads using Hadoop. I covered two major topics. The first was high-scale service best practices: how to design, develop, and efficiently operate high-scale services. The second looked at where the power goes in a high-scale data center and what to do about it.

·         Designing and Deploying Internet-Scale Services

·         Alaska Marine Lines Bimodal Loading

·         Where Does the Power go and What to do About it


The middle presentation, on what I call Bimodal Loading, is just a fun little two-slide thing answering a question Ed asked me some months back in passing: why are the barges heading up to Alaska from Seattle loaded high in the front and in the rear but almost always unloaded or only lightly loaded in the middle?




James Hamilton, Data Center Futures


Wednesday, December 03, 2008 6:57:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, December 02, 2008

Yesterday, AWS announced new pricing for SimpleDB and it’s noteworthy: free developer usage for at least 6 months. No charge for up to 1GB of ingress+egress, 25 machine hours, and 1GB of storage.

To help you get started with Amazon SimpleDB, we are providing a free usage tier for at least the next six months. Each month, there is no charge for the first 25 machine hours, 1 GB of data transfer (in and out), and 1 GB of storage that you use. Standard pricing will apply beyond these usage levels, and free usage does not accumulate over time.

In addition, beginning today, Amazon SimpleDB customers will now enjoy significantly reduced storage pricing, only $0.25 per gigabyte-month. This new rate reflects an 83% reduction in storage pricing. For more information, see

From: Amazon SimpleDB now in Unlimited Beta.
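The tiered pricing described in the announcement is easy to model. The free-tier limits and the $0.25/GB-month storage rate are from the quote above; the machine-hour and transfer rates below are placeholder assumptions for illustration only, not AWS's actual rates.

```python
# Toy free-tier bill calculator: first 25 machine hours, 1GB of transfer,
# and 1GB of storage are free each month; standard rates apply beyond.
# hour_rate and transfer_rate are ASSUMED values; only the $0.25/GB-month
# storage rate comes from the announcement.

def simpledb_monthly_bill(machine_hours, transfer_gb, storage_gb,
                          hour_rate=0.14, transfer_rate=0.10,
                          storage_rate=0.25):
    def billable(used, free):
        return max(0.0, used - free)   # only usage beyond the free tier
    return (billable(machine_hours, 25) * hour_rate
            + billable(transfer_gb, 1) * transfer_rate
            + billable(storage_gb, 1) * storage_rate)

print(simpledb_monthly_bill(20, 0.5, 0.8))   # inside the free tier: 0.0
print(simpledb_monthly_bill(100, 5, 10))     # 75h + 4GB + 9GB billed
```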

I’ve been arguing for several years that the utility computing pricing model and the ability to near-instantly grow or shrink make the move to the cloud inevitable.  Google, Microsoft, Amazon, and numerous startups are all producing interesting offerings and all are moving quickly. Enterprises normally move slowly to new technologies, but when the price advantage is close to 10x, enterprise decision makers can and will move more quickly. The weak economy provides an even stronger push. It won’t happen all at once, but I’m convinced that within 7 years the vast majority of enterprises will be using utility computing for some part of their enterprise IT infrastructure.

It reminds me of a decade and a half ago when nearly all enterprise ERP software was home grown.  The price advantage of SAP, Baan, PeopleSoft, etc. was so great at the time that the impossibly difficult and slow move from home-grown, internally written ERP software to packaged apps happened nearly overnight by enterprise standards.  We see the same pricing advantage again with utility computing.


Thanks to Jeff Currier for sending this one my way yesterday.

James Hamilton, Data Center Futures


Tuesday, December 02, 2008 5:32:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, December 01, 2008

In a comment on the last blog entry, Cost of Power in Large-Scale Data Centers, Doug Hellmann brought up a super interesting point:


It looks like you've swapped the "years" values from the Facilities Amortization and Server Amortization lines. The Facilities Amortization line should say 15 years, and Server 3. The month values are correct, just the years are swapped.

I wonder if the origin of "power is the biggest cost" is someone dropping a word from "power is the biggest *manageable* cost"? If there is an estimated peak load, the server cost is fixed at the rate necessary to meet the load. But average load should be less than peak, meaning some of those servers could be turned off or running in a lower power consumption mode much (or most) of the time.


Doug Hellmann


Yes, you’re right: the comment explaining the formula on the amortization period in Cost of Power in Large-Scale Data Centers is incorrect.  Thanks to you, Mark Verber, and Ken Church for catching this.


You brought up another important point that is worth digging deeper into. You point out that we need to buy enough servers to handle peak load, and you argue that the servers not in use should be shut off.  This is another one of those points that I’ve heard frequently and am not fundamentally against but, as always, it’s more complex than it appears.  There are two issues here: 1) you can actually move some workload from the peak to the valley through a technique that I call Resource Consumption Shaping, and 2) turning servers off isn’t necessarily the right mechanism to run more efficiently.  Let’s look at each:


Resource Consumption Shaping is a technique that Dave Treadwell and I came up with last year.  I’ve not blogged it in detail (I will in the near future), but the key concept is prioritizing workload into at least two groups: 1) customer waiting and 2) customer not waiting.  For more detail, see page 22 of the talk Internet-Scale Service Efficiency from Large Scale Distributed Systems & Middleware (LADIS 2008).  The “customer not waiting” class includes reports, log processing, re-indexing, and other administrative tasks.  Resource consumption shaping argues that you should move “customer not waiting” workload from the peak to off-peak times, where you can process it essentially for free since you have already paid for the servers and the power.  Resource consumption shaping builds upon Degraded Operations Mode.
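A minimal sketch of the two-class idea in Python. The class names, the 0.7 utilization threshold, and the scheduler interface are all illustrative assumptions, not part of the actual design:

```python
from collections import deque

class ShapingScheduler:
    """Toy resource-consumption-shaping scheduler: "customer waiting"
    work runs immediately; "customer not waiting" work (reports, log
    processing, re-indexing) is deferred until utilization drops."""

    def __init__(self, offpeak_threshold=0.7):
        self.offpeak_threshold = offpeak_threshold
        self.deferred = deque()

    def submit(self, job, customer_waiting, utilization):
        if customer_waiting:
            return job()              # latency-sensitive: run now
        if utilization < self.offpeak_threshold:
            return job()              # capacity is free: run now
        self.deferred.append(job)     # peak: push work into the valley
        return None

    def drain(self, utilization):
        """Run deferred work while the facility is off-peak."""
        results = []
        while self.deferred and utilization < self.offpeak_threshold:
            results.append(self.deferred.popleft()())
        return results
```

The point of the shape is that the deferred queue consumes capacity that was bought for the peak anyway.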


The second issue is somewhat counter-intuitive.  The industry is nearly uniform in arguing that you should shut off servers during non-peak periods. I think Luiz Barroso was probably the first to argue that you should NOT shut off servers, and we can use the data from Cost of Power in Large-Scale Data Centers to show that Luiz is correct. The short form of the argument goes like this: you have already paid for the servers, the cooling, and the power distribution.  Shutting a server off only saves the power it would have consumed.  So it’s a mistake to shut servers off as long as you have workload to run whose marginal value exceeds the cost of the power they consume, since everything else is already paid for.  If you can’t come up with any workload worth more than the marginal cost of power, then I agree you should shut them off.
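The decision rule reduces to a one-line comparison. The wattage and electricity price below are illustrative assumptions, not figures from the post:

```python
def keep_server_on(workload_value_per_hour,
                   server_watts=200, usd_per_kwh=0.07):
    """Capital (servers, cooling, power distribution) is sunk, so run the
    server iff the available work is worth more than the marginal cost
    of the energy it burns."""
    power_cost_per_hour = server_watts / 1000 * usd_per_kwh
    return workload_value_per_hour > power_cost_per_hour
```

At these assumed numbers a 200W server costs about $0.014/hour in power, so even quite low-value batch work clears the bar.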


Albert Greenberg, Parveen Patel, Dave Maltz, and I make a longer form of this argument against shutting servers off in an article to appear in the next issue of SIGCOMM Computer Communication Review. We also look more closely at networking issues in that paper.




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Monday, December 01, 2008 12:46:29 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, November 28, 2008

I’m not sure how many times I’ve read or been told that power is the number one cost in a modern mega-data center, but it has been a frequent refrain.  And, like many stories that get told and retold, there is an element of truth to it. Power is absolutely the fastest-growing operational cost of a high-scale service. And, except for server hardware, power and the costs functionally related to power usually do dominate.


However, it turns out that power alone isn’t anywhere close to the most significant cost. Let’s look at this more deeply. If you amortize power distribution and cooling infrastructure over 15 years and server costs over 3 years, you get a fair comparative picture of how server costs compare to infrastructure (power distribution and cooling). But how do we compare the capital costs of servers, and of power and cooling infrastructure, with the monthly bill for power?


The approach I took is to convert everything into a monthly charge: amortize the power distribution and cooling infrastructure over 15 years at a 5% per annum cost of money and compute the monthly payment; amortize the servers over a 3-year life, again at 5%, and compute that monthly payment; then compute the facility’s overall monthly power consumption and compare the costs.
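For readers who want to redo this outside the spreadsheet, the conversion is the standard loan-amortization payment formula. The function below is a sketch; the amortization periods and 5% rate come from the post, and any dollar inputs are your own:

```python
def monthly_payment(principal, annual_rate=0.05, years=15):
    """Monthly payment to amortize `principal` at `annual_rate` over
    `years`: principal * r / (1 - (1 + r)^-n), with a monthly rate r
    and n monthly payments."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

# 15-year infrastructure dollars and 3-year server dollars can then be
# compared on the same monthly basis as the power bill.
```

For scale, $100K amortized over 15 years at 5% is about $791/month, while the same $100K over 3 years is roughly $3K/month.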

Update: fixed error in spreadsheet comments.

What can we learn from this model?  First, we see that power costs not only don’t dominate, they trail both server costs and aggregate infrastructure costs.  Server hardware is actually the largest cost.  However, if we look more deeply, we see that the infrastructure cost is almost completely functionally dependent on power. From Belady and Manos’ article Intense Computing or In Tents Computing, we know that 82% of the overall infrastructure cost is power distribution and cooling. Power distribution costs are functionally related to power in that you can’t consume power if you can’t get it to the servers.  Similarly, cooling costs are driven entirely by the power dissipated in the data center, so they too are functionally related to power.


We define the fully burdened cost of power to be the sum of the cost of the power consumed and the cost of both the cooling and power distribution infrastructure. In this model, that number is still somewhat less than the cost of servers but, with cheaper servers or more expensive power assumptions, it would dominate. And it’s easy to pay more for power, although very large data centers are often sited to pay less (e.g., the Microsoft Columbia or Google Dalles facilities).


Since power and infrastructure costs continue to rise while the cost of servers, measured in work done per dollar, continues to fall, it is correct to say that the fully burdened cost of power does, or soon will, dominate all other data center costs.


For those of you interested in playing with different assumptions, the spreadsheet is here: OverallDataCenterCostAmortization.xlsx (14.4 KB).




Friday, November 28, 2008 2:03:07 PM (Pacific Standard Time, UTC-08:00)  #    Comments [14] - Trackback
 Saturday, November 22, 2008

Large sorts are run daily across the industry, and doing them well is economically relevant.  Last July, Owen O’Malley of the Yahoo! Grid team announced a 209-second TeraSort run: Apache Hadoop Wins Terabyte Sort Benchmark. My summary of the Yahoo! result, with cluster configuration, is in Hadoop Wins TeraSort.


Google just announced a MapReduce result on the same benchmark: Sorting 1PB with MapReduce.  They improved on the 209-second Yahoo! result, finishing in 68 seconds.   How did they get a roughly 3x speedup?  Google used slightly more servers, 1,000 versus the 910 used for Hadoop, but that difference is essentially rounding error and doesn’t explain the gap.


We know that sorting is essentially an I/O problem: the more I/O capacity a cluster has, the better a well-written sort performs. It’s not quite the case that computation doesn’t matter, but close.  A well-written sort will scale almost linearly with the I/O capacity of the cluster.  Let’s look closely at the I/O sub-systems used in these two sorts and see if that explains some of the difference between the two results. Yahoo! used 3,640 disks in their 209-second run. The Google cluster uses 12 disks per server, for a total of 12,000. Both are commodity disks.  The Hadoop result uses 3,640 disks for 209 seconds (761k disk-seconds) and the Google result uses 12,000 disks for 68 seconds (816k disk-seconds).


Normalizing for the number of disks, the two results are within roughly 7% of each other (761k versus 816k disk-seconds), with the Hadoop run slightly more efficient per disk. That fairly small difference could be explained by more finely tuned software, better disks, or a combination of both.
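The disk-second arithmetic can be checked directly from the published cluster sizes and elapsed times:

```python
# Disk-seconds consumed by each 1TB sort, from the published configs.
hadoop_disk_seconds = 3_640 * 209    # Yahoo!: 3,640 disks for 209 seconds
google_disk_seconds = 12_000 * 68    # Google: 12,000 disks for 68 seconds

# Ratio of I/O resource consumed; ~1.07, i.e. within roughly 7%.
ratio = google_disk_seconds / hadoop_disk_seconds
```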


The Google experiment also included a petabyte sort on a 4,000-node cluster. This result is impressive for at least two reasons: 1) a 4,000-node, 48,000-disk cluster running a commercial workload is impressive on its own, and 2) sorting a petabyte in 6 hours and 2 minutes is noteworthy.


In my last posting on high-scale sorts, Hadoop Wins TeraSort, I argued that we should also be measuring power consumed. Neither the Google nor the Yahoo! result reports power consumption, but there is enough data to strongly suggest the Google number is better by this measure.  Since the data isn’t published, let’s assume that commodity disks draw roughly 10W each and that each server draws 150W not including the disks. Using those assumptions, let’s compute the number of kilowatt-hours for each run:


·         Google: 68*(1000*150+1000*12*10)/3600/1000 => 5.1 kWh

·         Yahoo: 209*(910*150+910*4*10)/3600/1000 => 10.0 kWh
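The same estimate as a small helper, using the 150W-per-server and 10W-per-disk assumptions stated above:

```python
def sort_kwh(seconds, servers, disks_per_server,
             server_w=150, disk_w=10):
    """Energy for a sort run: elapsed time x cluster draw, converted
    from watt-seconds to kilowatt-hours."""
    watts = servers * (server_w + disks_per_server * disk_w)
    return seconds * watts / 3_600 / 1_000

google_kwh = sort_kwh(68, 1_000, 12)   # ~5.1 kWh
yahoo_kwh = sort_kwh(209, 910, 4)      # ~10.0 kWh
```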


Both are good results, and both are similar in their utilization of I/O resources, but the Google run uses much less power under our assumptions. The key advantage is that Google has 4x the number of disks per server, so the power “overhead” of each server is amortized over more disks.


When running scalable algorithms like sort, a larger cluster will produce a faster result unless cluster scaling limits are hit. I argue that to really understand the quality of an implementation in a comparable way, we need to report work done per dollar and work done per joule.




Thanks to Savas Parastatidis for pointing this result out to me.


James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859



Saturday, November 22, 2008 7:46:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Monday, November 17, 2008

Two weeks ago I posted the notes I took from Tony Hoare’s “The Science of Programming” talk at the Computing in the 21st Century Conference in Beijing.  


Here are the slides from the original talk: Tony Hoare Science of Programming (199 KB).

Here are my notes from two weeks back: Tony Hoare on The Science of Programming.




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859

Monday, November 17, 2008 8:37:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Saturday, November 15, 2008

Last week, IBM honored database giant Pat Selinger by creating a Ph.D. Fellowship in her name.  I worked with Pat closely for many years at IBM, and much of what I learned about database management systems I learned from Pat during those years.   She was one of the original members of the IBM System R team and is probably best known as the inventor of the cost-based optimizer. Access Path Selection in a Relational Database Management System is a paper from that period that I particularly enjoyed.


From the IBM press release:


Pat Selinger IBM Ph.D. Fellowship: awarded to an exceptional female Ph.D. student worldwide with special focus on database design and management


Pat Selinger IBM Ph.D. Fellowship
Dr. Pat Selinger was a leading member of the IBM Research team that produced the world's first relational database system and established the basic architecture for the highly successful IBM DB2 database product family. Her innovative work on cost-based query optimization for relational databases has been adopted by nearly all relational database vendors and is now taught in virtually every university database course. In 1994, Dr. Selinger was named an IBM Fellow -- an honor accorded only to the top 50 technical experts in IBM -- and in 2004, she was inducted into the Women in Technology International Hall of Fame.


An ACM Queue interview with Pat: A conversation with Pat Selinger.


It’s great to see IBM actively supporting engineering education, particularly encouraging female engineers, and recognizing Pat Selinger’s contribution to the commercial and academic database communities.




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Saturday, November 15, 2008 11:10:14 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, November 12, 2008

Intel Fellow and Director of Storage Architecture Knut Grimsrud presented at WinHEC 2008 last week, and the talk caught my interest for several reasons: 1) he discussed Intel findings on their new SSD, which looks like an extremely interesting price/performer; 2) they have found power savings in their SSD experiments beyond the easy-to-predict reduction in device power consumption of SSDs over HDDs; and 3) Knut presented a list of useful SSD usage do’s and don’ts.


Starting from the best practices:

·         DO queue requests to SSD as deeply as possible

  • SSD has massive internal parallelism and generally is underutilized. Parallelism will further increase over time.
  • Performance scales well with queue depth

·         DON’T withhold requests in order to “optimize” or “aggregate” them

  • Traditional schemes geared towards reducing HDD latencies do not apply. Time lost in withholding requests difficult to make up.

·         DO worry about software/driver overheads & latencies

  • At 100K IOPS how does your SW stack measure up?

·         DON’T use storage “backpressure” to pace activity

  • IO completion time (or rate) is not a useful pacing mechanism and attempting to use that as throttle can result in tasks generating more activity than desired
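The "queue deeply" advice above can be sketched in a few lines: issue all requests concurrently and let the device's internal parallelism absorb them, instead of holding requests back to "optimize" them. The block size and queue depth below are illustrative, and a thread pool is just one convenient way to keep many requests outstanding:

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096        # illustrative request size
QUEUE_DEPTH = 32    # illustrative number of outstanding requests

def read_block(path, offset):
    """One random read; each call is an independent I/O request."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(BLOCK)

def read_many(path, offsets):
    """Deep queue: all requests in flight at once rather than one
    at a time, which is what an SSD's parallelism rewards."""
    with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        return list(pool.map(lambda off: read_block(path, off), offsets))
```

On an HDD this pattern buys little; on an SSD with massive internal parallelism it is what keeps the device busy.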


Common HDD optimizations you should avoid:

·         Block/page sizes, alignments and boundaries

  • Intel® SSD is insensitive to whether host writes have any relationship to internal NAND boundaries or granularities
  • Expect other high-performing SSDs to also handle this
  • Internal NAND structures constantly changing anyway, so chasing this will be a losing proposition

·         Write transfer sizes & write “globbing”

  • No need to accumulate writes in order to create large writes
  • Temporarily logging writes sequentially and later re-locating to final destination unhelpful to Intel SSD (and is detrimental to longevity)

·         Software “helping” by making near-term assumptions about SSD internals will become a long-term hindrance

  • Any SW assistance must have longevity


On the power savings point, Knut laid out an interesting argument for increased power savings from SSDs over HDDs beyond the standard device power difference.  The standard differences are real but, on a laptop where an HDD typically draws around 2.5W active, these oft-cited savings are relatively small. However, Knut reported an additional measurable savings. Because SSDs are considerably faster than HDDs, the speculative page fetching done by Windows SuperFetch is not needed.  And, because SuperFetch is sometimes incorrect, the additional I/Os and processing it performs consume extra power.  Essentially, with the very high random I/O rates offered by SSDs, SuperFetch isn’t needed and, if disabled, there are additional power savings from reduced I/O and page-processing activity.


Another potential factor I’ve discussed with Knut is that, in standard laptop usage, there are long periods of inactivity and short bursts of peak workload, typically accompanied by high random I/O rates.  More often than not, laptop performance is bounded by random I/O performance. If an SSD allows these bursts of work to complete more quickly, the system can return sooner to an idle, low-power state.  We haven’t measured this gain, but it seems intuitive that finishing the work faster keeps the system active for shorter periods and idle for longer.  Assuming a faster system spends the recovered time idle (rather than simply doing more work), we should be able to measure additional power savings indirectly attributable to SSD usage.
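This "race to idle" intuition is just arithmetic. The platform wattages and durations below are illustrative guesses, not measurements:

```python
def burst_energy_j(active_s, period_s, active_w=25.0, idle_w=5.0):
    """Energy (joules) over one workload period: time at the active
    platform power plus the remainder at the idle power."""
    idle_s = period_s - active_s
    return active_s * active_w + idle_s * idle_w

# Same burst of work over a 60s period: the slower storage keeps the
# whole platform at active power longer.
hdd_j = burst_energy_j(active_s=8.0, period_s=60.0)   # 460 J
ssd_j = burst_energy_j(active_s=2.0, period_s=60.0)   # 340 J
```

The saving comes from the whole platform (CPU, memory, display logic) dropping to idle sooner, not just from the storage device itself.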


Knut’s slides: Intel’s Solid State Drives. Thanks to Vlad Sadovsky for sending this one my way.




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Wednesday, November 12, 2008 5:01:17 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Sunday, November 09, 2008

Abolade Gbadegesin, Windows Live Mesh Architect, gave a great talk at the Microsoft Professional Developers Conference on Windows Live Mesh (talk video, talk slides). Live Mesh is a service that supports peer-to-peer file sharing among your devices, file storage in the cloud, remote access to all your devices (through firewalls and NATs), and web access to the files you choose to store in the cloud. Live Mesh is a good service and worth investigating in its own right, but what makes this talk particularly interesting is that Abolade gets into the architecture of how the system is written and, in many cases, why it is designed that way.


I’ve been advocating redundant, partitioned, fail-fast service designs based upon Recovery-Oriented Computing for years; see, for example, Designing and Deploying Internet Scale Services (paper, slides). Live Mesh is a great example of such a service.   It’s designed with enough redundancy and monitoring that service anomalies are detected and, when detected, it auto-recovers by first restarting, then rebooting, and finally re-imaging the failing system.
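The restart, reboot, re-image escalation can be sketched as a small ladder. The action names and the final "replace hardware" fallback are illustrative, not Live Mesh internals:

```python
# Escalating recovery actions, cheapest first.
ESCALATION = ["restart_service", "reboot_machine", "reimage_machine"]

def recover(is_healthy, apply_action):
    """Walk the escalation ladder until the node reports healthy;
    if nothing works, hand off to hardware replacement."""
    for action in ESCALATION:
        apply_action(action)
        if is_healthy():
            return action              # the recovery step that worked
    return "replace_hardware"          # out of automated options
```

The appeal of the pattern is that each rung is a blunt, well-tested action, so the recovery path itself stays simple enough to trust.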


It’s partitioned across multiple data centers and, in each data center, across many symmetric commodity servers, each a 2-core, 4-disk, 8GB system. The general design principles are:

·         Commodity hardware

·         Partitioning for scaling out, redundancy for availability

·         Loose coupling across roles

·         Xcopy deployment and configuration

·         Fail-fast, recovery-oriented error handling

·         Self-monitoring and self-healing


The scale out strategy is to:

·         Partition by user, device, and Mesh Object

·         Use soft state to minimize I/O load

·         Leverage HTTP 1.1 semantics for caching, change notification, and incremental state transfer

·         Leverage client-side resources for holding state

·         Leverage peer connectivity for content replication
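The "leverage HTTP 1.1 semantics" bullet above is worth a concrete sketch: a validator such as an ETag lets a client confirm its cached copy is fresh (a 304) instead of re-transferring state. This is a minimal in-memory stand-in for the idea, not the Live Mesh protocol:

```python
import hashlib

def etag_of(body):
    """Derive a validator from the resource's current bytes."""
    return '"%s"' % hashlib.sha1(body).hexdigest()

def conditional_get(body, if_none_match):
    """Serve a resource HTTP 1.1-style: 304 with no body when the
    client's cached validator still matches, 200 with the bytes and
    a fresh validator otherwise."""
    tag = etag_of(body)
    if if_none_match == tag:
        return 304, None, tag      # client copy is still fresh
    return 200, body, tag          # full transfer plus new validator
```

Used this way, caching and change notification ride on plain HTTP machinery instead of a custom sync protocol.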


Experiences and lessons learned on availability:

·         Design for loosely coupled dependence on building blocks

·         Diligently validate client/cloud upgrade scenarios

·         Invest in pre-production stress and functional coverage in environments that look like production

·         Design for throttling based on both dynamic thresholds and static bounds


Experiences and lessons learned on monitoring:

·         Continuously refine performance counters, logs, and log processing tools

·         Monitor end-user-visible operations (Keynote)

·         Build end-to-end tracing across tiers

·         Self-healing is hard:  Invest in tuning watchdogs and thresholds


Experiences and lessons learned on deployment:

·         Deployments every other week, client upgrades every month

·         Major functionality roughly each quarter

·         Took advantage of gradual ramp to learn lessons early




Thanks to Andrew Enfield  for sending this one my way.


James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Sunday, November 09, 2008 10:05:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, November 05, 2008

Butler Lampson, one of the founding members of Xerox PARC, a Turing Award winner, and one of the most practical engineering thinkers I know, spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow.  Overall, Butler argues that “embodiment” is the next big phase of computing after simulation and communications.  He defines embodiment as computers interacting directly with the physical world; for example, autonomously driven vehicles.  Butler argues that this class of applications is only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.


He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications.  Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete.  Dependability matters because these systems interact directly with the physical world, where actions can have life-critical failure modes.


Butler’s recommendation for building incredibly complex systems that directly interact with the physical world, and yet are dependable, is to build them in two tiers.  At the core is a small, simple kernel that doesn’t do a great job of the task but doesn’t hard fail and won’t kill anyone.  He calls this “catastrophe mode”.  For example, an autonomous vehicle might slow to 10 MPH or simply stop safely in catastrophe mode.


The software stack is designed in two layers: the top layer is responsible for the complex, real-time interaction the system is designed to deliver; the inner, lower layer is catastrophe mode, designed to be simple and, as only simple systems can be, correct.  I like the approach.
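A toy version of the two-layer structure, using the autonomous-vehicle example. The speed framing, the sanity bound, and the 10 MPH figure are illustrative (the 10 MPH matches the example above, the rest is mine):

```python
CATASTROPHE_MODE_MPH = 10.0   # the simple kernel's safe behavior

def plan_speed(sensor_input, complex_planner):
    """Use the complex, optimized planner when it works; on any
    failure, fall back to the small catastrophe-mode kernel."""
    try:
        speed = complex_planner(sensor_input)
        if not (0.0 <= speed <= 120.0):   # distrust implausible output too
            raise ValueError("implausible plan")
        return speed
    except Exception:
        return CATASTROPHE_MODE_MPH       # slow and safe, never hard-fail
```

The key property is that the fallback path is small enough to be verifiably correct, while the optimized layer above it is free to be arbitrarily complex.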


Butler’s slides are: ButlerLampson_China_Microsoft2008 (1.49 MB).




Title: The Uses of Computers: What's Past is Merely Prologue

Speaker: Butler Lampson


Implication of Moore

·         Spend hardware to simplify software

·         Hardware enables new applications

·         Pull complexity up into software (if unavoidable)

The uses of computers:

·         1950: Simulation

·         1980: Communications

·         2010: Embodiment (computers interacting directly with the physical world)

Argument: embodiment is now possible and there are some grand challenges that fall into this category:

·         Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)

·         Butler’s example: Reduce highway traffic deaths to zero

What do we need to learn to deal with to achieve embodiment in general, and zero traffic deaths in particular?

·         Dealing with uncertainty

o   Need good models of what can happen (what is possible)

o   Need boundaries for models (where they don’t apply)

·         Dependability

o   The system meets its spec

o   Measure: probability(failure) x Cost(failure)

o   Hard to model dependability. Recommends using “no catastrophes”

o   Must have a threat model of what can go wrong

o   Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers above, but a reliable system needs to be able to fall back to the reliable base kernel (less than 50K LOC?)

Conclusions for Engineers:

·         Understand Moore’s Law

·         Aim for mass markets

·         Learn how to deal with uncertainty

·         Learn how to avoid catastrophe (avoiding all faults is not possible in systems at scale)


James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Wednesday, November 05, 2008 1:07:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
