In the Cost of Power in Large-Scale Data Centers, we looked at where the money goes in a large scale data center. Here I’m taking similar assumptions and computing the Annual Cost of Power including all the infrastructure as well as the utility charge. I define the fully burdened cost of power to be the sum of 1) the cost of the power from the utility, 2) the cost of the infrastructure that delivers that power, and 3) the cost of the infrastructure that gets the heat from dissipating the power back out of the building.
We take the monthly cost of the power and cooling infrastructure, assuming a 15 year amortization cycle and a 5% annual cost of money billed annually, and divide by the overall data center critical load to get the annual infrastructure cost per watt. The fully burdened cost of power is the cost of consuming 1W for an entire year and includes the power and cooling infrastructure as well as the power consumed. Essentially it’s the cost of all the infrastructure except the cost of the data center shell (the building). From Intense Computing or In Tents Computing, we know that 82% of the cost of the entire data center is power delivery and cooling. So taking 82% of the entire monthly facility cost and dividing by the facility critical load is a good estimator of the infrastructure cost of power.
The fully burdened cost of power is useful for a variety of reasons but here are two: 1) current generation servers get more work done per joule than older servers, so when is it cost effective to replace them? And 2) SSDs consume much less power than HDDs, so how much can I save in power over three years by moving to SSDs and is it worth doing?
We’ll come back to those two examples after we work through what power costs annually. In this model, like the last one (http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx), we’ll assume a 15MW data center that was built at a cost of $200M and runs at a PUE of 1.7. This is better than most, but not particularly innovative.
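The calculation described above can be sketched in a few lines of Python. This is a rough illustration rather than the spreadsheet itself: the function names are mine, and the utility rate of $0.07/kWh is an assumed placeholder that does not appear in this post, so the result will not necessarily match the figure used in the examples that follow.

```python
HOURS_PER_YEAR = 8760

def annual_payment(principal, rate, years):
    """Level annual payment that repays `principal` at `rate` over `years`."""
    return principal * rate / (1 - (1 + rate) ** -years)

def fully_burdened_cost_per_watt(facility_cost, critical_load_watts, pue,
                                 price_per_kwh, infra_share=0.82,
                                 rate=0.05, years=15):
    """Return (infrastructure, utility, total) in dollars per critical watt per year."""
    # 82% of the facility cost is power delivery and cooling; amortize it
    # and spread it over every watt of critical load.
    infra = annual_payment(facility_cost * infra_share, rate, years) / critical_load_watts
    # One critical watt for a year draws `pue` watts from the utility.
    utility = pue * HOURS_PER_YEAR / 1000 * price_per_kwh
    return infra, utility, infra + utility

# 15MW critical load, $200M facility, PUE 1.7 (from the post);
# $0.07/kWh is an ASSUMED utility rate, not a number from the post.
infra, utility, total = fully_burdened_cost_per_watt(200e6, 15e6, 1.7, 0.07)
print(f"infrastructure ${infra:.2f}/W/yr, utility ${utility:.2f}/W/yr, total ${total:.2f}/W/yr")
```

Changing the assumed utility rate or the PUE shows how sensitive the total is to each input.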

Should I Replace Old Servers?
Let’s say we have 500 servers, each of which can process 200 application operations/second. These servers are about 4 years old and consume 350W each. A new server has been benchmarked to process 250 operations/second, and each of these servers costs $1,300 and consumes 160W at full load. Should we replace the farm?
Using the new server, we only need 400 servers to do the work of the previous 500 (500*200/250). The new server farm consumes less power. The savings are 111kW ((500*350)-(400*160) = 111,000W). Let’s assume a plan to keep the new servers for three years. We save 111kW each year for three years and we know from the above model that we are paying $2.12/W/year. Over three years, we’ll save $705,960 (111,000W * $2.12 * 3). The new servers will cost $520,000 so, by recycling the old servers and buying new ones, we can save $185,960. To be fair, we should accept a charge to recycle the old ones and we need to model the cost of money to spend $520k in capital. We ignore the recycling costs and use a 5% cost of money to model the impact of the capital cost of the servers. Using a 5% cost of money over a three year amortization period, we’ll have another $52,845 in interest if we were to borrow to buy these servers, or just in recognition that tying up capital has a cost.
Accepting this $52k charge for tying up capital, it’s still a gain of roughly $133k ($185,960 - $52,845) to recycle the old servers and buy new ones. In this case, we should replace the servers.
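The arithmetic in this example can be replayed with a short script. It is a sketch under the post’s stated assumptions: it uses the 160W per new server that appears in the savings formula, $2.12 per watt per year, and three level annual payments at 5%. The variable and function names are mine.

```python
import math

def annual_payment(principal, rate, years):
    """Level annual payment that repays `principal` at `rate` over `years`."""
    return principal * rate / (1 - (1 + rate) ** -years)

old_count, old_ops, old_watts = 500, 200, 350
new_ops, new_watts, new_price = 250, 160, 1300
years = 3
cost_per_watt_year = 2.12   # fully burdened cost of power from the model

# Servers needed to match the old farm's throughput.
new_count = math.ceil(old_count * old_ops / new_ops)
watts_saved = old_count * old_watts - new_count * new_watts
power_savings = watts_saved * cost_per_watt_year * years

capital = new_count * new_price
# Cost of money: total of the annual payments minus the principal.
interest = annual_payment(capital, 0.05, years) * years - capital
net_gain = power_savings - capital - interest

print(f"{new_count} servers, {watts_saved / 1000:.0f}kW saved")
print(f"power savings ${power_savings:,.0f}, capital ${capital:,.0f}, "
      f"interest ${interest:,.0f}, net gain ${net_gain:,.0f}")
```

Raising the server price or lowering the cost of power quickly shows where replacement stops paying for itself.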
What is an SSD Worth?
Let’s look at the second example of the two I brought up above. Let’s say I can replace 10 disk drives with a single SSD. If the workload is not capacity bound and is I/O intensive, this can be the case (see When SSDs Make Sense in Server Applications). Each HDD consumes roughly 10W whereas the SSD only consumes 2.5W. Replacing these 10 HDDs with a single SSD could save 97.5W (10*10W - 2.5W) and, over a three year life, that’s a savings of 292.5 watt-years. Using the fully burdened cost of power from the above model, we could save $620 (292.5 * $2.12) on power alone. Let’s say the disk drives are $160 each and will last three years. What’s the break-even point where the SSD is a win, assuming the performance is adequate and ignoring other factors such as lifetime and service? We’ll take the cost of the 10 disks and add in the cost of power saved to see what we could afford to pay for an SSD, which is the break-even point (10*160+620 => $2,220). If the SSD is under $2,220, then it is a win. The Intel X25-E had a street price of around $700 the last time I checked and, in many application workloads, it will easily replace 10 disks. Our conclusion is that, in this case with these assumptions, the SSD looks like a better investment than 10 disks.
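The break-even calculation above is simple enough to express as a small function. This is a sketch using the numbers from this example; the function and parameter names are mine, and it ignores lifetime, service, and capacity just as the text does.

```python
def ssd_breakeven_price(hdd_count, hdd_watts, hdd_price, ssd_watts,
                        years, cost_per_watt_year):
    """Most we could pay for one SSD that replaces `hdd_count` disks."""
    watts_saved = hdd_count * hdd_watts - ssd_watts
    power_savings = watts_saved * years * cost_per_watt_year
    return hdd_count * hdd_price + power_savings

breakeven = ssd_breakeven_price(hdd_count=10, hdd_watts=10.0, hdd_price=160,
                                ssd_watts=2.5, years=3, cost_per_watt_year=2.12)
print(f"break-even SSD price: ${breakeven:,.0f}")
```

If only 4 disks can be replaced instead of 10, the same function gives a much lower ceiling, which is why the result depends so heavily on the workload being I/O bound rather than capacity bound.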
When you factor in the fully burdened price of power, savings can add up quickly. Compute your fully burdened cost of power (see the spreadsheet) and figure out when you should be recycling old servers or considering lower power components.
If you are interested in tuning the assumptions to more closely match your current costs, here it is: PowerCost.xlsx (11.27 KB).
--jrh
James Hamilton, Data Center Futures Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Michael Manos yesterday published Our Vision for Generation 4 Modular Data Centers – One Way of Getting it Just Right. In this posting, Mike goes through the next generation modular data center designs for Microsoft. Things are moving quickly. I first argued for modular designs in a Conference on Innovative Data Systems paper submitted in 2006. Last spring I blogged First Containerized Data Center Announcement, which looks at the containerized portion of the Chicago data center.
In this more recent post, the next generation design is being presented in surprising detail. The Gen4 design has 4 classes of service:
· A: No UPS and no generator
· B: UPS with optional generator
· C: UPS, generator with +1 maintenance support
· D: UPS and generator with +2 support
I’ve argued for years that high-minute UPS and generators are a poor investment. We design services to be able to maintain SLA through server hardware or software error. If a service is hosted over a large number of data centers, the loss of an entire data center should not impact the ability of the service to meet the SLA. There is no doubt that this is true and there are services that exploit this fact and reduce their infrastructure costs by not deploying generators. The problem is the vast majority of services don’t run over a sufficiently large number of data centers and some have single points of failure not distributed across data centers. Essentially some services can be hosted without high-minute UPSs and generators but many can’t be. Gen4 gets around that by offering a modular design where A class has no backup and D class is a conventional facility with good power redundancy (roughly a tier-3 design).
The Gen4 design is nearly 100% composed of prefabricated parts. Rather than just the server modules, all power distribution, mechanical, and even administration facilities are modular and prefabricated. This allows for rapid and incremental deployment. With a large data center costing upwards of $200M (Cost of Power in Large-Scale Data Centers), an incremental approach to growth is a huge advantage.
Gen4 aims to achieve a PUE of 1.125 and to eliminate the use of water in the mechanical systems, relying instead 100% on air-side economization.
Great data, great detail, and hats off to Mike and the entire Microsoft Global Foundation Services team for sharing this information with the industry. It’s great to see.
--jrh
Thanks to Mike Neil for pointing this posting out to me.
Yesterday, AWS announced new pricing for SimpleDB and it’s noteworthy: free developer usage for 6 months. No charge for up to 1GB of ingress+egress, 25 machine hours, and 1GB of storage.
To help you get started with Amazon SimpleDB, we are providing a free usage tier for at least the next six months. Each month, there is no charge for the first 25 machine hours, 1 GB of data transfer (in and out), and 1 GB of storage that you use. Standard pricing will apply beyond these usage levels, and free usage does not accumulate over time.
In addition, beginning today, Amazon SimpleDB customers will now enjoy significantly reduced storage pricing, only $0.25 per gigabyte-month. This new rate reflects an 83% reduction in storage pricing. For more information, see aws.amazon.com/simpledb.
From: Amazon SimpleDB now in Unlimited Beta.
I’ve been arguing for several years that the utility computing pricing model and the ability to grow or shrink near instantly make the move to the cloud inevitable. Google, Microsoft, Amazon, and numerous startups are all producing interesting offerings and all are moving quickly. Enterprises normally move slowly to new technologies but when the price advantage is close to 10x, enterprise decision makers can and will move more quickly. The weak economy provides a yet stronger push. It won’t happen all at once and it won’t happen instantly but I’m convinced that in 7 years, the vast majority of enterprises will be using utility computing for some part of their enterprise IT infrastructure.
It reminds me of a decade and a half ago when nearly all enterprise ERP software was home grown. The price advantage of SAP, Baan, PeopleSoft, etc. was so great at the time that the impossibly difficult and slow move from home grown, internally written ERP software to packaged apps happened nearly overnight by enterprise standards. We see the same pricing advantage again with utility computing.
--jrh
Thanks to Jeff Currier for sending this one my way yesterday.
In a comment on the last blog entry, Cost of Power in Large-Scale Data Centers, Doug Hellmann brought up a super interesting point:
It looks like you've swapped the "years" values from the Facilities Amortization and Server Amortization lines. The Facilities Amortization line should say 15 years, and Server 3. The month values are correct, just the years are swapped.
I wonder if the origin of "power is the biggest cost" is someone dropping a word from "power is the biggest *manageable* cost"? If there is an estimated peak load, the server cost is fixed at the rate necessary to meet the load. But average load should be less than peak, meaning some of those servers could be turned off or running in a lower power consumption mode much (or most) of the time.
Doug Hellmann
Yes, you’re right: the comment explaining the formula on amortization period in Cost of Power in Large-Scale Data Centers is incorrect. Thanks to you, Mark Verber, and Ken Church for catching this.
You brought up another important point that is worth digging deeper into. You point out that we need to buy enough servers to handle maximum load and argue that you should shut off those you are not using. This is another one of those points that I’ve heard frequently and am not fundamentally against but, as always, it’s more complex than it appears. There are two issues here: 1) you can actually move some workload from peak to the valley through a technique that I call Resource Consumption Shaping and 2) turning off isn’t necessarily the right mechanism to run more efficiently. Let’s look at each:
Resource Consumption Shaping is a technique that Dave Treadwell and I came up with last year. I’ve not blogged it in detail (I will in the near future), but the key concept is prioritizing workload into at least two groups: 1) customer waiting and 2) customer not waiting. For more detail, see page 22 of the talk Internet-Scale Service Efficiency from Large Scale Distributed Systems & Middleware (LADIS 2008). The “customer not waiting” class includes reports, log processing, re-indexing, and other admin tasks. Resource consumption shaping argues you should move “customer not waiting” workload from the peak load to off-peak times where you can process it effectively for free since you already have the servers and power. Resource consumption shaping builds upon Degraded Operations Mode.
The second issue is somewhat counter-intuitive. The industry is pretty much uniform in arguing that you should shut off servers during non-peak periods. I think Luiz Barroso was probably the first to argue NOT to shut off servers and we can use the data from Cost of Power in Large-Scale Data Centers to show that Luiz is correct. The short form of the argument goes like this: you have paid for the servers, the cooling, and the power distribution for the servers. Shutting them off only saves the power they would have consumed. So, it’s a mistake to shut them off unless you don’t have any workload to run with a marginal value above the cost of the power the server consumes since you have already paid for everything else. If you can’t come up with any workload to run worth more than the marginal cost of power, then I agree you should shut them off.
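The decision rule in this argument can be written down directly. The sketch below is only an illustration: the function names, the 160W server, the PUE of 1.7, and the $0.07/kWh rate are all assumed values chosen for the example, and counting cooling overhead through PUE as part of the marginal cost is my simplification.

```python
def marginal_power_cost_per_hour(server_watts, pue, price_per_kwh):
    """Utility cost of keeping one server on for an hour, including PUE overhead."""
    return server_watts / 1000 * pue * price_per_kwh

def worth_running(workload_value_per_hour, server_watts, pue, price_per_kwh):
    # Servers, cooling, and power distribution are already paid for (sunk),
    # so only the power actually consumed counts against the workload.
    return workload_value_per_hour > marginal_power_cost_per_hour(
        server_watts, pue, price_per_kwh)

cost = marginal_power_cost_per_hour(160, 1.7, 0.07)   # hypothetical inputs
print(f"marginal cost: ${cost:.4f}/server-hour")
print(worth_running(0.05, 160, 1.7, 0.07))   # workload worth 5 cents/hour
print(worth_running(0.01, 160, 1.7, 0.07))   # workload worth 1 cent/hour
```

With these assumed inputs the bar is roughly two cents per server-hour, which gives a sense of how little a workload has to be worth before running it beats shutting the server off.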
Albert Greenberg, Parveen Patel, Dave Maltz, and I make a longer form of this argument against shutting servers off in an article to appear in the next issue of SIGCOMM Computer Communication Review. We also looked more closely at networking issues in this paper.
--jrh
I’m not sure how many times I’ve read or been told that power is the number one cost in a modern mega-data center, but it has been a frequent refrain. And, like many stories that get told and retold, there is an element of truth to it. Power is absolutely the fastest growing operational cost of a high-scale service. Except for server hardware costs, power and costs functionally related to power usually do dominate.
However, it turns out that power alone isn’t anywhere close to the most significant cost. Let’s look at this more deeply. If you amortize power distribution and cooling systems infrastructure over 15 years and amortize server costs over 3 years, you can get a fair comparative picture of how server costs compare to infrastructure (power distribution and cooling). But how do we compare the capital costs of servers and of power and cooling infrastructure with the monthly bill for power?
The approach I took is to convert everything into a monthly charge. Amortize the power distribution and cooling infrastructure over 15 years and use a 5% per annum cost of money and compute the monthly payments. Then take the server costs and amortize them over a three year life and compute the monthly payment again using a 5% cost of money. Then compute the overall power consumption of the facility per month and compare the costs.
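A minimal sketch of the monthly-charge conversion follows. The payment formula is the standard level-payment loan formula. The $200M facility cost and 82% infrastructure share are figures used elsewhere on this blog, while the server count and price are hypothetical values picked only to make the comparison concrete; the monthly power bill would be added as a third figure.

```python
def monthly_payment(principal, annual_rate, years):
    """Level monthly payment on `principal` at `annual_rate` over `years`."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

# Power distribution and cooling: 82% of a $200M facility, 15 years at 5%.
infra_monthly = monthly_payment(0.82 * 200e6, 0.05, 15)

# Servers: HYPOTHETICAL fleet of 45,000 servers at $2,000 each, 3 years at 5%.
server_monthly = monthly_payment(45_000 * 2_000, 0.05, 3)

print(f"infrastructure: ${infra_monthly:,.0f}/month")
print(f"servers:        ${server_monthly:,.0f}/month")
```

The short server amortization period is what makes the server payment so large relative to infrastructure, even when the infrastructure principal is higher.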

Update: fixed error in spread sheet comments.
What can we learn from this model? First, we see that power costs not only don’t dominate, but are behind the cost of servers and the aggregated infrastructure costs. Server hardware costs are actually the largest. However, if we look more deeply, we see that the infrastructure is almost completely functionally dependent on power. From Belady and Manos’ article Intense Computing or In Tents Computing, we know that 82% of the overall infrastructure cost is power distribution and cooling. The power distribution costs are functionally related to power, in that you can’t consume power if you can’t get it to the servers. Similarly, the cooling costs are clearly 100% related to the power dissipated in the data center, so cooling costs are functionally related to power as well.
We define the fully burdened cost of power to be the sum of the cost of the power consumed and the cost of both the cooling and power distribution infrastructure. This number is still somewhat less than the cost of servers in this model but, with cheaper servers or more expensive power assumptions, it actually would dominate. And it’s easy to pay more for power, although very large data centers are often located to pay less (e.g. Microsoft Columbia or Google Dalles facilities).
Since power and infrastructure costs continue to rise while the cost of servers measured in work done per $ continues to fall, it actually is correct to say that the fully burdened cost of power does, or soon will, dominate all other data center costs.
For those of you interested in playing with different assumptions, the spreadsheet is here: OverallDataCenterCostAmortization.xlsx (14.4 KB).
--jrh
Large sorts need to be done daily and doing it well actually is economically relevant. Last July, Owen O’Malley of the Yahoo Grid team announced they had achieved a 209 second TeraSort run: Apache Hadoop Wins Terabyte Sort Benchmark. My summary of the Yahoo result with cluster configuration: Hadoop Wins TeraSort.
Google just announced a MapReduce sort result on the same benchmark: Sorting 1PB with MapReduce. They improved on the 209 second result that Yahoo produced achieving 68 seconds. How did they get roughly 3x speedup? Google used slightly more servers at 1,000 than the 910 used by Hadoop but that difference is essentially rounding error and doesn’t explain the difference.
We know that sorting is essentially an I/O problem. The more I/O a cluster has, the better the performance of a well written sort. It’s not quite the case that computation doesn’t matter, but it’s close. A well written sort will scale almost linearly with the I/O capacity of the cluster. Let’s look closely at the I/O sub-systems used in these two sorts and see if that can explain some of the differences between the two results. Yahoo used 3,640 disks in their 209 second run. The Google cluster uses 12 disks per server for a total of 12,000. Both are using commodity disks. The Hadoop result uses 3,640 disks for 209 seconds (761k disk seconds) and the Google result uses 12,000 disks for 68 seconds (816k disk seconds).
Normalizing for number of disks, the two results are within roughly 7% of each other: the Google run consumed about 7% more disk seconds than the Hadoop run from earlier in the year. That fairly small difference could be explained by differences in software tuning, disks, or a combination of both.
The Google experiment included a petabyte sort on a 4,000 node cluster. This result is impressive for at least two reasons: 1) a 4,000 node, 48,000 disk cluster running a commercial workload is impressive, and 2) sorting a petabyte in 6 hours and 2 minutes is noteworthy.
In my last posting on high-scale sorts, Hadoop Wins TeraSort, I argued that we should also be measuring power consumed. Neither the Google nor Yahoo results report power consumption but there is enough data to strongly suggest the Google number is better by this measure. Since the data isn’t published, let’s assume that commodity disks draw roughly 10W each and that each server is drawing 150W not including the disks. Using that data, let’s compute the number of kilowatt-hours for each run:
· Google: 68*(1000*150+1000*12*10)/3600/1000 => 5.1 kwh
· Yahoo: 209*(910*150+910*4*10)/3600/1000 => 10.0 kwh
Both are good results and both are very similar in their utilization of I/O resources, but the Google result uses much less power under our assumptions. The key advantage is that they have 3x the number of disks per server (12 versus 4) so they can amortize the power “overhead” of each server over more disks.
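The disk-second and energy estimates above can be reproduced with the short script below. It is a sketch using the same guessed power draws as the text (150W per server excluding disks, 10W per disk); neither figure is published data, and the function names are mine.

```python
def disk_seconds(servers, disks_per_server, seconds):
    """Total disk time consumed by the run."""
    return servers * disks_per_server * seconds

def run_kwh(seconds, servers, disks_per_server, server_watts=150, disk_watts=10):
    """Energy for one sort run under the ASSUMED per-server and per-disk draw."""
    watts = servers * (server_watts + disks_per_server * disk_watts)
    return seconds * watts / 3600 / 1000

google_ds = disk_seconds(1000, 12, 68)
yahoo_ds = disk_seconds(910, 4, 209)
google_kwh = run_kwh(68, 1000, 12)
yahoo_kwh = run_kwh(209, 910, 4)

print(f"Google: {google_ds:,} disk seconds, {google_kwh:.1f} kWh")
print(f"Yahoo:  {yahoo_ds:,} disk seconds, {yahoo_kwh:.1f} kWh")
```

Varying `server_watts` shows how much of the gap comes from spreading each server’s fixed draw over more disks.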
When running scalable algorithms like sort, a larger cluster will produce a faster result unless cluster scaling limits are hit. I argue that to really understand the real quality of the implementation in a comparable way we need to report work done per dollar and work done per joule.
--jrh
Thanks to Savas Parastatidis for pointing this result out to me.
Last week, IBM honored database giant Pat Selinger by creating a Ph.D. Fellowship in her name. I worked with Pat closely for many years at IBM and much of what I learned about database management systems was from Pat during those years. She was one of the original members of the IBM System R team and is probably best known as the inventor of the cost based optimizer. Access Path Selection in a Relational Database Management System is a paper from that period that I particularly enjoyed.
From the IBM press release:
Pat Selinger IBM Ph.D. Fellowship: awarded to an exceptional female Ph.D. student worldwide with special focus on database design and management
Dr. Pat Selinger was a leading member of the IBM Research team that produced the world's first relational database system and established the basic architecture for the highly successful IBM DB2 database product family. Her innovative work on cost-based query optimization for relational databases has been adopted by nearly all relational database vendors and is now taught in virtually every university database course. In 1994, Dr. Selinger was named an IBM Fellow -- an honor accorded only to the top 50 technical experts in IBM -- and in 2004, she was inducted into the Women in Technology International Hall of Fame.
An ACM Queue interview with Pat: A conversation with Pat Selinger.
It’s great to see IBM actively supporting engineering education, particularly encouraging female engineers, and recognizing Pat Selinger’s contribution to the commercial and academic database community.
--jrh
Intel Fellow and Director of Storage Architecture Knut Grimsrud presented at WinHEC 2008 last week and it caught my interest for several reasons: 1) he talked about Intel findings with their new SSD which looks like an extremely interesting price/performer, 2) they have found interesting power savings in their SSD experiments beyond the easy to predict reduction in power consumption of SSDs over HDDs, and 3) Knut presented a list of useful SSD usage do’s and don’ts.
Starting from the best practices:
· DO queue requests to SSD as deeply as possible
- SSD has massive internal parallelism and generally is underutilized. Parallelism will further increase over time.
- Performance scales well with queue depth
· DON’T withhold requests in order to “optimize” or “aggregate” them
- Traditional schemes geared towards reducing HDD latencies do not apply. Time lost in withholding requests difficult to make up.
· DO worry about software/driver overheads & latencies
- At 100K IOPS how does your SW stack measure up?
· DON’T use storage “backpressure” to pace activity
- IO completion time (or rate) is not a useful pacing mechanism and attempting to use that as throttle can result in tasks generating more activity than desired
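As an illustration of the first two recommendations, the sketch below submits every read immediately and keeps many of them in flight instead of holding requests back to batch them. It is not from Knut’s slides: the thread pool is my stand-in for asynchronous I/O, `read_blocks` and the queue depth of 32 are hypothetical, and `os.pread` is only available on Unix-like systems. The demo reads a small temporary file through the page cache, so it shows the submission pattern rather than real device performance.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096

def read_blocks(path, offsets, queue_depth=32):
    """Read BLOCK bytes at each offset, keeping up to `queue_depth` reads in flight."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # Every request is handed to the pool right away; nothing is
            # withheld to be sorted, merged, or aggregated first.
            return list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))
    finally:
        os.close(fd)

# Demo: a 64-block scratch file, each block holding the same 4096-byte pattern.
pattern = bytes(range(256)) * 16
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(pattern * 64)
    demo_path = f.name

blocks = read_blocks(demo_path, [i * BLOCK for i in range(64)])
os.remove(demo_path)
print(len(blocks), "blocks read")
```

A real measurement would bypass the page cache and compare throughput across several queue depths on the actual device.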
Common HDD optimizations you should avoid:
· Block/page sizes, alignments and boundaries
- Intel® SSD is insensitive to whether host writes have any relationship to internal NAND boundaries or granularities
- Expect other high-performing SSDs to also handle this
- Internal NAND structures constantly changing anyway, so chasing this will be a losing proposition
· Write transfer sizes & write “globbing”
- No need to accumulate writes in order to create large writes
- Temporarily logging writes sequentially and later re-locating to final destination unhelpful to Intel SSD (and is detrimental to longevity)
· Software “helping” by making near-term assumptions about SSD internals will become a long-term hindrance
- Any SW assistance must have longevity
On the power savings point, Knut laid out an interesting argument on increased power savings for SSDs over HDDs beyond the standard device power difference. These standard power differences are real of course but, on a laptop where an HDD typically draws around 2.5W active, these often-cited savings are relatively small. However, an additional measurable savings was reported by Knut. Because SSDs are considerably faster than HDDs, the speculative page fetching done by Windows Superfetch is not needed. And, because Superfetch is sometimes incorrect, the additional I/Os and processing it does consume more power. Essentially, with the very high random I/O rates offered by SSDs, Superfetch isn’t needed and, if disabled, there will be additional power savings due to reduced I/O and page processing activity.
Another potential factor I’ve discussed with Knut is that in standard laptop operating mode, the common usage model is one where there are periods of inactivity and short periods of peak workload typically accompanied by high random I/O rates. More often than not, laptop performance is bounded by random I/O performance. If SSD usage allows these periods of work to be completed more quickly, the system can quickly return to an idle, low-power state. We’ve not measured this gain but it seems intuitive that getting the work done more quickly will leave the system active for shorter periods and have it in idle states for longer. Assuming a faster system spends more time in idle states (rather than simply doing more work), we should be able to measure additional power savings indirectly attributable to SSD usage.
Knut’s slides: Intel’s Solid State Drives. Thanks to Vlad Sadovsky for sending this one my way.
--jrh
Abolade Gbadegesin, Windows Live Mesh Architect, gave a great talk at the Microsoft Professional Developers Conference on Windows Live Mesh (talk video, talk slides). Live Mesh is a service that supports p2p file sharing amongst your devices, file storage in the cloud, remote access to all your devices (through firewalls and NATs), and web access to those files you choose to store in the cloud. Live Mesh is a good service and worth investigating in its own right but what makes this talk particularly interesting is that Abolade gets into the architecture of how the system is written and, in many cases, why it is designed that way.
I’ve been advocating redundant, partitioned, fail fast service designs based upon Recovery Oriented Computing for years. For example, Designing and Deploying Internet Scale Services (paper, slides). Live Mesh is a great example of such a service. It’s designed with enough redundancy and monitoring that service anomalies are detected and, when detected, it’ll auto-recover by first restarting, then rebooting, and finally re-imaging the failing system.
It’s partitioned across multiple data centers and, in each datacenter, across many symmetric commodity servers each of which is a 2 core, 4 disk, 8 GB system. The general design principles are:
· Commodity hardware
· Partitioning for scaling out, redundancy for availability
· Loose coupling across roles
· Xcopy deployment and configuration
· Fail-fast, recovery-oriented error handling
· Self-monitoring and self-healing
The scale out strategy is to:
· Partition by user, device, and Mesh Object
· Use soft state to minimize I/O load
· Leverage HTTP 1.1 semantics for caching, change notification, and incremental state transfer
· Leverage client-side resources for holding state
· Leverage peer connectivity for content replication
Experiences and lessons learned on availability:
· Design for loosely coupled dependence on building blocks
· Diligently validate client/cloud upgrade scenarios
· Invest in pre-production stress and functional coverage in environments that look like production
· Design for throttling based on both dynamic thresholds and static bounds
Experiences and lessons learned on monitoring:
· Continuously refine performance counters, logs, and log processing tools
· Monitor end-user-visible operations (Keynote)
· Build end-to-end tracing across tiers
· Self-healing is hard: Invest in tuning watchdogs and thresholds
Experiences and lessons learned on deployment:
· Deployments every other week, client upgrades every month
· Major functionality roughly each quarter
· Took advantage of gradual ramp to learn lessons early
--jrh
Thanks to Andrew Enfield for sending this one my way.
Butler Lampson, one of the founding members of Xerox PARC, Turing award winner, and one of the most practical engineering thinkers I know, spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow. Overall Butler argues that “embodiment” is the next big phase of computing after simulation and communications. Butler defines embodiment as computers interacting directly with the physical world. For example, autonomously driven vehicles. Butler argues that this class of applications is only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.
He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications. Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete. Dependability is important because these systems are directly interacting with the physical world and actions in the physical world can have life-critical failure modes.
Butler’s recommendation on how to build incredibly complex systems that directly interact with the physical world and yet have these systems be dependable is to build them two tier. At the core is a small, simple kernel that doesn’t do a great job of its task but doesn’t hard fail and won’t kill anyone. He calls this “catastrophe mode”. For example, an autonomous vehicle may slow down to 10 MPH or just safely stop in catastrophe mode.
The software stack is designed in two layers where the top layer is responsible for the complex, real time interaction the system is designed to deliver. The inner or lower layer is catastrophe mode designed to be simple and, as only simple systems can be, correct. I like the approach.
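One way to picture the two-layer structure is the sketch below. It is my own illustration, not code from Butler’s talk: the function names, the command format, and the sanity limit are all hypothetical, and a real catastrophe mode would be far more carefully specified.

```python
def catastrophe_mode(sensors):
    """Inner layer: deliberately simple, ignores everything clever and stops."""
    return {"target_speed_mph": 0.0}

def is_sane(command, max_speed_mph=80.0):
    """Cheap check the inner layer applies to whatever the outer layer asks for."""
    if not isinstance(command, dict):
        return False
    speed = command.get("target_speed_mph")
    return isinstance(speed, (int, float)) and 0.0 <= speed <= max_speed_mph

def control_step(planner, sensors):
    """Run the complex outer layer, falling back to the simple base on any trouble."""
    try:
        command = planner(sensors)
    except Exception:
        return catastrophe_mode(sensors)
    return command if is_sane(command) else catastrophe_mode(sensors)

def crashing_planner(sensors):
    raise RuntimeError("planner bug")

print(control_step(lambda s: {"target_speed_mph": 55.0}, {}))   # normal operation
print(control_step(lambda s: {"target_speed_mph": 900.0}, {}))  # rejected output
print(control_step(crashing_planner, {}))                       # planner failure
```

The point of the structure is that only `catastrophe_mode` and `is_sane` have to be correct for the system to avoid the worst outcomes; the planner can be as complex and as heavily optimized as needed.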
Butler’s slides are: ButlerLampson_China_Microsoft2008 (1.49 MB).
--jrh
Title: The Uses of Computers: What's Past is Merely Prologue
Speaker: Butler Lampson
Implications of Moore’s Law:
· Spend hardware to simplify software
· Hardware enables new applications
· Pull complexity up into software (if unavoidable)
The uses of computers:
· 1950: Simulation
· 1980: Communications
· 2010: Embodiment (computers interacting directly with the physical world)
Argument: embodiment is now possible and there are some grand challenges that fall into this category:
· Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)
· Butler example: Reduce highway traffic deaths to zero
What do we need to learn how to deal with to achieve embodiment in general and zero traffic deaths in particular:
· Dealing with uncertainty
o Need good models of what can happen (what is possible)
o Need boundaries for models (where they don’t apply)
· Dependability
o The system meets its spec
o Measure: probability(failure) x Cost(failure)
o Hard to model dependability. Recommends using “no catastrophes”
o Must have a threat model of what can go wrong
o Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers but a reliable system needs to be able to fall back to the reliable base kernel (less than 50k loc?)
Conclusions for Engineers:
· Understand Moore’s Law
· Aim for mass markets
· Learn how to deal with uncertainty
· Learn how to avoid catastrophe (avoiding fault not possible in systems at scale)
James Hamilton, Data Center Futures Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Tony Hoare spoke yesterday at the Computing in the 21st Century Conference in Beijing. Tony is a Turing award winner, Quicksort inventor, author of the influential Communicating Sequential Processes (CSP) formal language, and long time advocate of program verification and tools to help produce reliable software systems. In his talk he argues that programming should be and can be a science, and that the goal should be correct programs that stay correct through change: zero defect software.
He explains that engineers will accept that there will be defects but the scientist should pursue perfection far beyond that for which there is a commercial need. Tony has spent a big part of his successful career in pursuit of techniques and tools to produce reliable complex systems.
Tony ended his talk on a practical engineering note, hoping that we can advance our field to the point that “Software will contain no more errors than other engineering disciplines”. We’re not there yet.
My rough notes from the talk follow.
Title: The Science of Programming
Speaker: Tony Hoare
The Vision:
· Computer software contains no more errors
o Software is the most reliable component of any device that contains it
· Programmers make no mistakes
o Programs work the first time they run
o They run forever after, even after changing
· Programming is an engineering discipline
o Respected for its delivered benefits and its foundation on basic science
· Semantics is the science of programming
o Explores the meaning of computer programs
o Operational: correctness of implementation
o Algebraic: Correctness of optimization
o Axiomatic
The Insight:
· Computer programs are mathematical formulae
o They don’t suffer from rust, wear, decay, fatigue
o If a correct program is started in a correct state, then it will stay correct
· Their correctness is a mathematical conjecture
o To be proved by logic and calculation
o Checked by the computer itself
History of the idea:
· Aristotle (350bc): Syllogistic logic
· Euclid (300bc): geometry
· Leibniz (1700): calculus
· Boole (1850): laws of thought
· Frege (1880): predicate logic
· Russell (1920): Principia
· Hao Wang (1956): Computer checks
Basic Science:
· Answers fundamental questions
· What does it do?
· How does it work?
· Why does it work?
· How do we know?
What does it do?
· Answered by its behavioral specification
How does it work?
· Answered by its internal interface contracts
Why does the program work?
· Answered by programming theory
How do we know?
· By logical/mathematical proof
Ideals in Basic Science
· Pursued for the sake of scientific glory far in advance of commercial need
· Physics: accuracy of measurement
· Chemistry: purity of materials
· Computing Science: zero defect programs
Unifying Theory
· Basic science seeks unifying theories
· Explains diverse phenomena
· Supported by evidence
Overall, industry is not heavily using software verification along the lines that Tony wants to see but there are some in use. For example, some tools in use at Microsoft:
· PREfix and PREfast
· Static Driver Verifier
· ESP (locates potential buffer overflows)
The Hope:
· Software will contain no more errors than other engineering disciplines.
Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn anything about a problem your service is experiencing from a customer. How could they possibly know first when there is a service outage or issue? And, yet it happens frequently. The reason it happens is most sites don’t have close to an adequate level of instrumentation. Without this instrumentation, you are flying blind.
Systems monitoring data can be used to drive alerts, to compute SLAs, to drive capacity planning, to find latencies, and to understand customer access patterns. Some sites also use it to drive billing, although the latter is probably a mistake.
In the rare cases where I’ve come across high quality monitoring systems that actually do fine-grained data collection, the data is often not looked at or is underutilized. It turns out that fully using and exploiting very large amounts of monitoring data isn’t much easier than collecting it.
Returning to the challenge of efficiently collecting fine grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday in making Scribe available as an open source project: Facebook's Scribe technology now open source. Scribe is used at Facebook to monitor their more than 10k servers across multiple data centers. Scribe is a Sourceforge project at: http://sourceforge.net/projects/scribeserver/.
Facebook continues to both develop interesting and broadly useful software and often contributes it to the community by making it open source. For example, Facebook Releases Cassandra as Open Source.
Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important:
Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).
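The two alerting metrics above are simple to track once alerts, trouble tickets, and health issues are counted. The sketch below is illustrative only: the function names and the idea of recording an alert id (or `None`) against each health issue are assumptions, not part of the paper.

```python
from typing import Optional, Sequence


def alerts_to_ticket_ratio(alert_count: int, ticket_count: int) -> Optional[float]:
    """Metric 1: alerts per trouble ticket; the goal is a value near one."""
    if ticket_count == 0:
        return None  # undefined when no tickets were opened
    return alert_count / ticket_count


def unalerted_issues(issue_alert_ids: Sequence[Optional[str]]) -> int:
    """Metric 2: health issues with no corresponding alert; goal is near zero.

    Each entry is the id of the alert that flagged the issue, or None if
    the issue was found some other way (for example, a customer call).
    """
    return sum(1 for alert_id in issue_alert_ids if alert_id is None)


# Example: 120 alerts against 100 tickets, and 2 of 4 issues never alerted.
ratio = alerts_to_ticket_ratio(120, 100)
missed = unalerted_issues(["a1", None, "a7", None])
```

A ratio well above one suggests noisy alerts that operations will learn to ignore; a non-zero count of unalerted issues points at missing instrumentation.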
· Instrument everything. Measure every customer interaction or transaction that flows through the system and report anomalies. There is a place for “runners” (synthetic workloads that simulate user interactions with a service in production) but they aren’t close to sufficient. Using runners alone, we’ve seen it take days to even notice a serious problem, since the standard runner workload was continuing to be processed well, and then days more to know why.
· Data is the most valuable asset. If the normal operating behavior isn’t well-understood, it’s hard to respond to what isn’t. Lots of data on what is happening in the system needs to be gathered to know it really is working well. Many services have gone through catastrophic failures and only learned of the failure when the phones started ringing.
· Have a customer view of service. Perform end-to-end testing. Runners are not enough, but they are needed to ensure the service is fully working. Make sure complex and important paths such as logging in a new user are tested by the runners. Avoid false positives. If a runner failure isn’t considered important, change the test to one that is. Again, once people become accustomed to ignoring data, breakages won’t get immediate attention.
· Instrumentation required for production testing. In order to safely test in production, complete monitoring and alerting is needed. If a component is failing, it needs to be detected quickly.
· Latencies are the toughest problem. Examples are slow I/O and not quite failing but processing slowly. These are hard to find, so instrument carefully to ensure they are detected.
· Have sufficient production data. In order to find problems, data has to be available. Build fine grained monitoring in early or it becomes expensive to retrofit later. The most important data that we’ve relied upon includes:
Thanks to Sriram Krishnan for pointing me to the release of Scribe.
--jrh
In When SSDs Make Sense in Server Applications, we looked at where Solid State Drives (SSDs) were practical in servers and services. On the client side, there are even more reasons to use SSDs and I expect that within three years, more than half of enterprise laptops will have NAND Flash as at least part of their storage subsystems. This estimate has SSDs in 38% of all laptops by 2011: Flash SSD in 38% of Laptops by 2011.
What follows is a quick summary of SSD advantages on the client side, followed by the disadvantages, and then a closer look at the write endurance (wear-out) problem that has been the topic of much discussion recently.
Client SSD Advantages:
· Random IOPS: Laptop I/O patterns are dominated by random workloads and, as argued in When SSDs Make Sense in Server Applications, these workloads run cost effectively on SSDs
· Low Power: SSD power consumption is typically under 2W and often under 1W. Enterprise disks can run 15 to 18W and desktop parts are typically in the 10W range, but laptop drives usually run a more modest 2.5W when active. So, on one hand, this represents an exciting reduction in storage power of a factor of 2 but, on the other, it’s actually only a 1W saving when the HDD is active and even less when idle. A savings but a small one overall. If you are interested in more data on laptop power consumption see Client-Side Power Consumption. Some very efficient HDDs actually have less idle power consumption than some SSDs so it’s not even the case that SSDs are all better under all conditions from a power consumption perspective.
· Quiet. HDDs can be noisy. They are mechanical parts with precision bearings spinning at high speeds and they make noise. Semi-conductor-based SSDs avoid this.
· Small Form Factors: SSDs can be small and light weight.
· Scale Down Floor: Disks have a price floor where further lowering the capacity of the device doesn’t save money. This price floor changes over time but, at this point, it’s hard to get much below $30 for a disk regardless of how small. The fixed costs of the mechanical parts dominate the media and the cost of the disk doesn’t scale down. SSD costs scale down well and for applications with modest storage requirements, they can be less expensive. This makes them interesting for very low-end laptops, netPCs, ultra-mobile PCs, and, of course, NAND Flash is the storage of choice in cell phones, music players, cameras, and other related applications.
· Shock and Vibration: HDDs usually spec max shock in the 50G to as high as 100G range and vibration in the ¼G to ½G range. SSD specs run well over 1,000G shock and around 20G vibration. They are much more durable against this common threat in the laptop world.
· Latency: I/O latency is far lower on an SSD than a HDD and this is particularly noticeable when I/O queues get deep as they often do on single disk laptops.
· Reliability: HDDs are the number one failing component on clients (and servers). This is particularly a problem on laptops as they are (usually) single drive devices and often not well backed up. HDD failures represent a substantial service cost in most enterprises so eliminating them is appealing. Our operational history with SSDs is fairly short so far but we expect they will exhibit less frequent failures than hard disks. However, like all new components, they bring additional failure modes as well as eliminating a few. The biggest concern around SSDs is write endurance with SLC part lifetimes typically in the range of 10^5 writes and MLC parts down around 10^4 write cycles (some even lower). We’ll look at that in more depth below.
· Temperature: SSDs have a much wider temperature and humidity operating range than HDDs.
Client SSD Disadvantages:
· Capacity/$: Flash devices can deliver excellent random I/O performance and laptops, with only a single disk, are frequently random I/O bound rather than capacity limited. In fact, many enterprise customers actually want LESS storage on their laptop fleet. For them, having less capacity is often either not a problem or even a potential advantage. For my uses and for many consumer usage patterns, capacity remains important with pictures, audio, and other media files driving space requirements up to the point where SSDs can be tough to afford. As a direct consequence, I expect that we’ll see more enterprise than consumer use of SSDs in clients.
· Performance Degradation: There have been many reports of SSDs initially performing well and then degrading over time. See Laptop SSD Performance Degradation Problems for more detail.
· Endurance: This is the most common concern I’ve heard of late with MLC write endurance only around 10,000 writes.
Write Endurance
I keep hearing anecdotal reports that SSDs in laptops are going to fail in the first year due to the poor write endurance of MLC SSDs. The typical MLC write endurance is usually quoted at around 10,000 cycles which I agree does sound quite low.
Let’s do a quick back of the envelope on MLC SSD write endurance (SLC parts are typically more expensive but have longer write endurance specifications). Assume a client system is used four hours a day and that it spends ¼ of that time at the max I/O rate of 100 IOPS. My gut feel says this number very likely errs high. Let’s include write amplification. Write amplification is a side effect of Flash memory designs having larger blocks as the unit of erase and smaller pages as the unit of read and programming (write). This combined with wear leveling leads to the device having to do some overhead housekeeping writes when servicing writes from the host system. Assume an average write amplification of 3x over three years of life which again seems high. To make it really aggressive we’ll assume a write to read ratio of 1:1 (50% writes) which is very high. Finally, let’s assume it’s a 64GB MLC device and that my writes are all to 4k pages and the overheads are all accounted for by my 3x write amplification number.
4*60*60*365*3*.25*.5*3*100 => 591m
Reading left to right, that’s 4 hours a day * 60 to get minutes * 60 to get seconds * 365 to get seconds of use per year * 3 to get seconds of use in three years, * 25% of time at max I/O, * 50% of I/Os are writes, * 3 write amplification, * 100 I/Os per second.
In aggregate, that’s a bit over ½ billion write I/Os needed by each laptop over a three year life. But a 64GB device has 16m pages. If you spread ½ billion writes over 16m pages with perfect wear leveling, you would have 36 write I/Os per page. Very low. With terrible wear leveling, it could go up an order of magnitude but it’s still a very low number. Move write amplification up to 5x and the wear/page still looks tiny. Move the usage pattern up from 4 hours a day to an aggressive 16 hours a day and it’s still only 147 writes per page. Perhaps we’ll do more lifetime I/Os with an SSD than my magnetic disk model above assumes, since we’ll spend less time waiting, but the number of lifetime writes required still doesn’t look that big.
If we use very low endurance MLC where write endurance is specified down around 1,000 cycles rather than the more common 10,000, it’s still not a problem. But it is within an order of magnitude, so it is arguably a concern over a three year life. And it would definitely be a concern over a 5 year life.
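The back-of-the-envelope above is easy to re-run with different assumptions. The sketch below simply encodes the same arithmetic; the function and parameter names are mine, and it assumes decimal units (64GB as 64 * 10^9 bytes and 4k pages as 4,000 bytes) to match the 16m page count used above.

```python
def lifetime_page_writes(hours_per_day=4, years=3, busy_fraction=0.25,
                         write_fraction=0.5, write_amplification=3, iops=100):
    """Total page writes the device performs over its life."""
    seconds_in_use = hours_per_day * 60 * 60 * 365 * years
    return (seconds_in_use * busy_fraction * write_fraction
            * write_amplification * iops)


def writes_per_page(total_writes, capacity_bytes=64 * 10**9,
                    page_bytes=4 * 10**3):
    """Average writes per page, assuming perfect wear leveling."""
    pages = capacity_bytes / page_bytes  # 16 million pages by default
    return total_writes / pages


base = lifetime_page_writes()                        # 591,300,000 writes
per_page = writes_per_page(base)                     # just under 37
heavy = writes_per_page(lifetime_page_writes(hours_per_day=16))  # ~147.8
amplified = writes_per_page(lifetime_page_writes(write_amplification=5))
```

The base case works out to just under 37 writes per page, which the text above truncates to 36; either way, it is orders of magnitude below a 10,000 cycle endurance specification.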
Because client systems spend such a small percentage of their working lifetimes at 100% I/O rates, it’s hard to see a credible usage model that has MLC write endurance as a serious problem if using parts specified at 10,000 write cycles.
In a subsequent post, we’ll look back at server applications and, in contrast with When SSDs Make Sense in Server Applications, we’ll look at where SSDs don’t make sense on the server side.
In past posts, I’ve talked a lot about Solid State Drives. I’ve mostly discussed why they are going to be relevant on the server side, and the shortest form of the argument is based on extremely hot online transaction processing (OLTP) systems. There are potential applications as reliable boot disks in blade servers and other small data applications but I’m focused on high-scale OLTP in this discussion. OLTP applications are random I/O bound workloads such as ecommerce systems, airline reservation systems, and any data intensive application that does lots of small reads and writes, usually on a database where future access patterns are unknown. When sizing a server for one of these workloads, the key dimension is the number of small random I/Os per second. You need to add memory to increase the memory hit rate and reduce the number of I/Os or you need to add disks to support the application-required I/O rates. The problem with adding memory is that it has linear cost (the last DIMM costs as much as the first DIMM) but only logarithmic value. Because the workloads are random, adding memory only delivers a reduction in I/Os roughly proportional to the square root of the memory size. Cheap memory helps but, even then, the costs add up as does the power consumption as memory is added. Alternatively, you can add disk but each disk added gives only another roughly 200 I/Os per second (IOPS) when using very expensive, 15k RPM disks.
The problem is best summarized by my favorite chart these days from Dave Patterson of Berkeley:

This chart is from an amazingly useful paper, Latency Lags Bandwidth (if you know of a no-charge location for this paper, let me know). In this chart, Dave tracks the trend of bandwidth and latency over the last 20+ years. For the purposes of this discussion ignore the latency row and focus on bandwidth. Disk bandwidth is growing slower than DRAM and CPU bandwidth. I love looking for divergent trends in that they direct us to the more fundamental problems needing innovation.
Understanding that lagging disk bandwidth growth is a growing problem, let’s compare disk sequential bandwidth with random I/O rates over time. In the chart below, I graph sequential bandwidth growth against random bandwidth growth over the same period:

We know that disk sequential bandwidth growth lags the rest of the system. This graph shows that random IOPS bandwidth is growing even more slowly. Across the industry, we have a huge problem and the trend lines above make it crystal clear that the problem won’t be cost-effectively solved by disk alone. More detail on one dimension of the disk limits problem in: Why Disk Speeds aren’t Increasing.
Disks clearly aren’t the full solution. Ever larger memory sub-systems actually are part of the solution but the logarithmic (or worse) payback with linear cost and power consumption makes memory an expensive approach if we use it as the only tool. Many have argued for the last couple of years that solid state disks are the solution to filling the chasm between memory and disk random IOPS rates. Jim Gray was one of the first to make this observation in: Tape is Dead, Disk is Tape, Flash is Disk, Ram Locality is King.
The first generation, server-side SSDs were slow random write performers but we’re now seeing great components released to the market. See 100,000 IOPS and 1,000,000 IOPS. These are great performers but they are far from commodity pricing at this point. Intel has been doing some great work on SSDs and I really like this one: Intel X25-E Extreme SATA Solid-State Drive. It’s a step towards commodity pricing. Overall the industry now has great performing parts available and the price/performance equation is very rapidly improving since this is a semi-conductor component rather than a mechanical one.
When should we expect the crossover? At what price point are SSDs a win over HDDs? Unfortunately, it’s an application specific answer. It depends upon the I/O density of the workload: the number of I/Os per GB of data. Bob Fitzgerald has done a great job of analyzing different workloads to understand what level of application I/O heat (IOPS per GB) is needed to justify an SSD. Building on Fitz’s work, I have a quick test you can use to figure out how cheap an SSD will have to get before it is a win in your application.
My observation goes like this. Disks have an abundance of capacity and are short of IOPS so, on random IOPS intensive workloads, the limiting factor using HDDs will be IOPS. SSDs have an abundance of IOPS and are short of capacity, so the limiting factor using SSDs will be capacity. SSDs are cost effective for your application when the cost of the disk farm adequate to support the IOPS you need is more than the SSD farm required to support the capacity you need. As a formula:
current#hdd * hdd$ > CapacityNeeded / Capacity_ssd * ssd$
Let’s try an example. This example application is hosted on several hundred database servers and it’s a red hot transaction processing system. Each system has 53 disks of which 40 are used to store data, 8 for log, and a few for admin purposes. Leave the log on magnetic media since disk sequential bandwidth is cheaper than SSD sequential bandwidth. The database size on each server is 572GB. The disks used by this application are 15k RPM, 3½ inch disks that price out at $333 each. Understanding this, the disk budget per server for this application is 40 * $333 which is $13,320. We know we need 572GB and let’s assume we are trying out 64GB SSDs. Using that equation, 572/64 is 8.9 so we’ll need 9 SSDs to support this workload.
Taking the disk budget of $13,320 and dividing by the 9 SSDs we have computed we need, we can afford to pay up to $1,480 for each SSD. If the SSD cost is less than this, it’s worth doing. This model ignores the power savings (SSDs usually run under 1/5 the power of HDDs and fewer are needed) and other factors like service costs but it’s a quick check to see if SSDs are worth considering.
We also need more data on SSD longevity in high write-rate workloads. In the absence of historical data, ask your vendor to stand behind their product with full warranty in your usage model before jumping in.
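The quick check can be written down in a few lines. This is a sketch of the arithmetic in the example only; the function name and parameters are hypothetical and, like the model itself, it ignores power and service costs.

```python
import math


def ssd_break_even(data_disks, hdd_price, capacity_needed_gb, ssd_capacity_gb):
    """Return (hdd_budget, ssds_needed, max_price_per_ssd).

    HDD count is set by IOPS (the disks already deployed for data);
    SSD count is set by capacity, rounded up to whole devices.
    """
    hdd_budget = data_disks * hdd_price
    ssds_needed = math.ceil(capacity_needed_gb / ssd_capacity_gb)
    return hdd_budget, ssds_needed, hdd_budget / ssds_needed


# The example above: 40 data disks at $333, 572GB database, 64GB SSDs.
budget, count, max_price = ssd_break_even(40, 333, 572, 64)
```

With these inputs the budget is $13,320, nine SSDs are needed, and the break-even price is $1,480 per SSD, matching the figures in the text.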
Speaking of wear-out rates, for the next posting I’ll investigate client-side MLC NAND-flash wear out rates.
--jrh
Albert Greenberg and I missed Hotnets 2008 last week due to a conflicting meeting down in California but Ken Church was there to present our On Delivering Embarrassingly Distributed Cloud Services paper. I summarized the paper in a recent blog entry: Embarrassingly Distributed Cloud Services and the abstract from the paper follows:
Very large data centers are very expensive (servers, power/cooling, networking, physical plant.) Newer, geo-diverse, distributed or containerized designs offer a more economical alternative. We argue that a significant portion of cloud services are embarrassingly distributed – meaning there are high performance realizations that do not require massive internal communication among large server pools. We argue further that these embarrassingly distributed applications are a good match for realization in small distributed data center designs. We consider email delivery as an illustrative example. Geo-diversity in the design not only improves costs, scale and reliability, but also realizes advantages stemming from edge processing; in applications such as spam filtering, unwanted traffic can be blocked near the source to reduce transport costs.
The Hotnets agenda and all the papers presented are up at: Seventh ACM Workshop on Hot Topics in Networks.
The slides Ken presented are posted at:
· pptx form: http://conferences.sigcomm.org/hotnets/2008/slides/EmbarrassinglyDistributed6.pptx
· pdf form: EmbarrassinglyDistributedFinalSlides.pdf (724.17 KB)
Google has long enjoyed a reputation for running efficient data centers. I suspect this reputation is largely deserved but, since it has been completely shrouded in secrecy, that’s largely been a guess built upon respect for the folks working on the infrastructure team rather than anything that’s been published. However, some of the shroud of secrecy was lifted last week and a few interesting tidbits were released in Google Commitment to Sustainable Computing.
On server design (Efficient Servers), the paper documents the use of high-efficiency power supplies and voltage regulators, and the removal of components not relevant in a service-targeted server design. A key point is the use of efficient, variable-speed fans. I’ve seen servers that spend as much as 60W driving the fans alone. Using high efficiency fans running at the minimum speed necessary based upon current heat load can bring big savings. An even better approach is employed by Rackable Systems in their ICE Cube Modular Data Center design (First Containerized Data Center Announcement) where they eliminate server fans entirely.
The paper also argues for energy proportionality, a concept introduced by Luiz Barroso and Urs Holzle of Google. Energy proportionality is a call to the industry to produce servers where the amount of energy consumed is proportional to the server load. Sadly, many current server designs consume more than 60% of their full load power when idle. None of us will talk publicly about the average utilizations of our server farms but the quick summary is that achieving very high utilizations is incredibly difficult. Or, worded differently, most servers are on average closer to idle than to full load. Even small steps towards energy proportionality make a huge difference and, of course, getting utilization up remains the holy grail of the industry.
It’s good to see water conservation brought up beside energy efficiency. It’s the next big problem for our industry and the consumption rates are prodigious. To achieve efficiency, most centers have cooling towers which allow them to avoid the use of energy-intensive direct-expansion chillers except under unusually hot and humid conditions. This is great news from an energy efficiency perspective, but cooling towers consume water in two significant ways. The first is evaporative loss, which is hard to avoid in wet tower designs (other less water-intensive designs exist). The second is caused by the first. As water evaporates from the closed system, the concentrations of dissolved solids and other contaminants present in the supply water left behind by evaporation continue to rise. These high concentrations are dumped from the system to protect it and this dumping is referred to as blow-down water. Between make-up and blow-down water, a medium-sized, 10MW facility, built to current industry conventions, can go through ¼ to ½ million gallons of water a day.
The paper describes a plan to address this problem in the future by moving to recycled water sources. This is good to see but I argue the industry needs to reduce overall water consumption, whether the source is fresh or recycled. The combination of higher data center temperatures and aggressive use of air-side economization are both good steps in that direction and industry-wide we’re all working hard on new techniques and approaches to reduce water consumption.
The section on PUE is the most interesting in that they are documenting an at-scale facility running at a PUE of 1.13 during a quarter. Generally, you want full-year numbers since these numbers are very load and weather dependent. The best annual number quoted in the paper is 1.15 which is excellent. That means that for every watt delivered to servers 0.15W is lost in power distribution and cooling.
This number, with pure air-side cooling and good overall center design, is quite attainable. But, elsewhere in the document, they described the use of cooling towers. Attaining a PUE of 1.15 with a conventional water-based cooling system is considerably more difficult. On the power distribution side, conventional designs waste about 8% to 9% of the power delivered. A rough breakdown of where it goes is 3 transformers taking 115kV down to 13.2kV, down to 480V, and then down to 208V for delivery to the load. Good transformer designs run around 99.7% efficiency. The uninterruptible power supply can be as poor as 94%, and roughly 1% is lost in switching and conductors. That approach gets us to 8% lost in distribution. We can easily eliminate one layer of transformers and use a high efficiency bypass UPS. Let’s use 97% efficiency for the UPS. Those two changes will get us to 4% to 5% lost in distribution. Let’s assume we can reliably hit 5% power distribution losses. That leaves us with 10% for all the losses to the mechanical systems. Powering the Computer Room Air Handlers, the water pumps, etc. at only 10% overhead would be both difficult and impressive.
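The distribution loss estimate is just a product of stage efficiencies. The sketch below reproduces the rough numbers above; the function names are mine, and treating the mechanical budget as the PUE overhead minus the distribution loss is the same approximation used in the text, not an exact accounting.

```python
def distribution_loss(transformers, ups_efficiency,
                      transformer_efficiency=0.997, switching_loss=0.01):
    """Fraction of power lost between the utility feed and the load."""
    delivered = (transformer_efficiency ** transformers
                 * ups_efficiency
                 * (1.0 - switching_loss))
    return 1.0 - delivered


def mechanical_budget(pue, distribution_loss_fraction):
    """Overhead left for cooling once distribution losses are subtracted."""
    return (pue - 1.0) - distribution_loss_fraction


conventional = distribution_loss(transformers=3, ups_efficiency=0.94)  # ~7.8%
improved = distribution_loss(transformers=2, ups_efficiency=0.97)      # ~4.5%
cooling = mechanical_budget(1.15, 0.05)                                # ~10%
```

The conventional chain comes out at roughly 7.8% lost and the improved chain at roughly 4.5%, consistent with the "about 8%" and "4% to 5%" figures above.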
The 1.15 PUE with pure air-side economization in the right climate looks quite reasonable, but powering a conventional, high-scale, air and water, multi-conversion cooling system at this efficiency looks considerably harder to me. Unfortunately, there is no data published in the paper on the approach and whether it was simply attained by relying on favorable weather conditions and air-side economization with the water loops idle.
The paper closes with An Efficient and Clean Energy Future, a discussion of the Renewable Energy Less Than Coal (RE<C) project. The RE<C project isn’t part of the Google infrastructure team, and they aren’t building data centers, but it is perhaps the coolest project I’ve ever come across. It’s just amazing. The core premise of this project is to do research into renewable energy sources that can be harnessed less expensively than coal and then let capitalism take care of the rest. Environmental policy lags reality and is influenced by special interest groups. If renewable energy can be made less expensive than coal, the free market system will help eliminate the burning of coal. Why fight a powerful market force if an alternative may exist. This is great research and, the more I hear about it, the more I like it.
The paper concludes that “if all data centers operated at the same efficiency as ours, the U.S. alone would save enough electricity to power every household within the city limits of Atlanta, Los Angeles, Chicago, and Washington, D.C.”. This is hard to independently verify without much more information than offered by the paper. Most of the techniques employed are not discussed in the paper published last week. If the large service providers like Google, Microsoft, Yahoo, Baidu, Amazon and a handful of others don’t publish the details, the rest of the world’s data centers will never run as efficiently as described in the paper. Only high-scale datacenter users can afford the R&D programs needed to increase efficiency and eliminate water consumption. I’m arguing it’s up to all of us working in the industry to publish the details to allow smaller-scale deployments to operate at similar efficiency levels. If we don’t, it’ll continue to be the case that US data centers alone will be needlessly spending enough power to support every household in Atlanta, Los Angeles, Chicago, and Washington DC. Each day, every day.
--jrh
Thanks to Alex Mallet, Mike Neil, Janine Harrison, and many others who sent this article my way last week.
James Hamilton, Data Center Futures Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
An interesting file system study is at this year’s USENIX Annual Technical Conference. The paper Measurement and Analysis of Large-Scale Network File System Workloads looks at CIFS remote file system access patterns from two populations. The first is a large file store of 19TB serving 500 software developers and the second is a medium sized file store of 3TB used by 1,000 marketing, sales, and finance users.
The authors found that file access patterns have changed since previous studies and offer the following observations:
· Both workloads are more write-heavy than workloads studied previously
· Read-write [rather than pure read or pure write] access patterns are much more frequent compared to past studies
· Bytes are transferred in much longer sequential runs than in previous studies [the length of sequential runs is increasing but note that the percentage of random access is increasing]
· Bytes are transferred from much larger files than previous studies [files are getting bigger]
· Files live an order of magnitude longer than in previous studies
· Most files are not re-opened once they are closed
· If a file is re-opened, it is temporally related to the previous close
· A small fraction of the clients account for a large fraction of the activity
· Files are infrequently accessed by more than one client
· File sharing is rarely concurrent and mostly read-only
· Most file types do not have a single pattern of access
The comments in brackets above are mine. Some of the important points that spring out for me: the percentage of random access is increasing; for those accesses that are sequential, the runs are longer; file sizes are increasing; data is getting colder; file lifetimes are increasing; and client usage has very high skew.
Overall, file data has been getting colder and the write to read ratio has been increasing. The authors conclude that substantial increases in the client file caches are unlikely to help significantly based upon this data. But, since file metadata requests make up roughly 50% of all operations, larger metadata caches could be very beneficial. Log Structured File systems look increasingly like the write answer. Increasingly random access patterns make NAND flash an interesting approach. The authors didn’t directly mention it but log structured block stores (below the filesystem) are also interesting in that, like LFS, they are a write optimized organization. And, in addition, a log structured block store tends to sequentialize writes while randomizing reads which is ideal for NAND Flash.
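To make the "sequentialize writes while randomizing reads" point concrete, here is a minimal toy sketch in Python. It is not taken from the paper or from any real block store: the class name, method names, and block numbers are all hypothetical, and it leaves out log cleaning (garbage collection) entirely. It only shows that writes arriving in random logical order land at consecutive physical positions, while a logically sequential read ends up touching scattered positions.

```python
class LogStructuredBlockStore:
    """Toy log-structured block store (hypothetical, no log cleaning)."""

    def __init__(self):
        self.log = []        # physical log; list index = physical position
        self.block_map = {}  # logical block -> physical position of latest version

    def write(self, logical_block, data):
        # Always append at the tail, whatever logical block is written,
        # so physical writes are sequential.
        physical_pos = len(self.log)
        self.log.append((logical_block, data))
        self.block_map[logical_block] = physical_pos
        return physical_pos

    def physical_position(self, logical_block):
        # Where the latest version of this logical block lives in the log.
        return self.block_map[logical_block]

    def read(self, logical_block):
        # Reads follow the map, so they may jump around the log.
        return self.log[self.physical_position(logical_block)][1]


store = LogStructuredBlockStore()

# Writes arrive in random logical order (block 12 is overwritten once).
write_positions = [
    store.write(block, f"data-{block}".encode())
    for block in (907, 12, 455, 3, 12)
]

# A logically sequential read maps to scattered physical positions.
read_positions = [store.physical_position(block) for block in (3, 12, 455, 907)]

print("physical write positions:", write_positions)
print("physical read positions: ", read_positions)
```

The write positions come out strictly increasing while the read positions do not, which is why this organization suits media where random reads are cheap, such as NAND flash.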
Thanks to Vlad Sadovsky for sending this paper my way.
--jrh
Ken Church, Albert Greenberg, and I just finished On Delivering Embarrassingly Distributed Cloud Services, which has been accepted for presentation at ACM Hotnets 2008 in Calgary, Alberta, on October 6th and 7th. This paper followed from the discussion and debate around a blog entry that Ken and I did some time back, Diseconomies of scale, where we argue that the industry trend towards mega-datacenters needs to be questioned and, in many cases, is simply not cost effective.
There are times when mega-datacenters do make sense. Very large data analysis jobs and large, multi-server workloads with considerable inter-node communications traffic run best against large central data stores. MapReduce jobs are the classic example of this sort of workload. However, we argue that other types of workloads actually run better in distributed micro-datacenters. Highly partitionable applications with light inter-partition traffic can be better hosted in distributed micro-datacenters. Highly interactive applications such as Google Docs need to be close to their users. Network round trip latencies can make highly interactive applications frustrating to use. We collectively refer to applications that can be partitioned effectively and run close to the edge (the users) as Embarrassingly Distributed. Essentially, these are the easy applications when it comes to running them close to the edge.
In the paper, we argue that the class of applications that are embarrassingly distributed and therefore run well on distributed micro-datacenters is large, and we go on to show that distributed micro-datacenters can offer considerable advantage over mega-centers. Essentially the point is that you can run many applications over distributed micro-datacenters and, if you can, you should.
Micro-datacenters are made possible by containerization that I wrote about in a 2007 Conference on Innovative Data Systems Research paper: Architecture for Modular Data Centers. When that paper was published Rackable Systems had just shipped their first containerized design and Sun Microsystems had announced Black Box but it wasn’t yet shipping. Two years later, containerized designs are offered by most of the major datacenter server vendors:
· IBM Scalable modular data center
· Rackable ICE Cube™ Modular Data Center
· Sun Modular Datacenter S20 (project Blackbox)
· Dell Insight
· Verari Forest Container Solution
Microsoft recently announced the first containerized data center in Chicago: First Containerized Data Center Announcement. The Chicago announcement is a mega-center but it does show that containerized designs are now ready for primetime.
Mega-datacenters remain useful and aren’t going away any time soon but, in On Delivering Embarrassingly Distributed Cloud Services, we argue that distributed micro-datacenters are appropriate for many workloads and can reduce costs, improve the quality of service, and increase the speed of deployment.
--jrh