Thursday, June 23, 2011

Earlier this week, I was in Athens Greece attending annual conference of the ACM Machinery Special Interest Group on Management of Data. SIGMOD is one of the top two database events held each year attracting academic researchers and leading practitioners from industry.


I kicked off the conference with the Plenary keynote. In this talk I started with a short retrospection on the industry over the last 20 years. In my early days as a database developer, things were moving incredibly quickly. Customers were loving our products, the industry was growing fast and yet the products really weren’t all that good. You know you are working on important technology when customers are buying like crazy and the products aren’t anywhere close to where they should be.


In my first release as lead architect on DB2 20 years ago, we completely rewrote the DB2 database engine process model moving from a process-per-connected-user model to a single process where each connection only consumes a single thread supporting many more concurrent connections. It was a fairly fundamental architectural change completed in a single release. And in that same release, we improved TPC-A performance a booming factor of 10 and then did 4x more in the next release. It was a fun time and things were moving quickly.


From the mid-90s through to around 2005, the database world went through what I refer to as the dark ages. DBMS code bases had grown to the point where the smallest was more than 4 million lines of code, the commercial system engineering teams would no longer fit in a single building, and the number of database companies shrunk throughout the entire period down to only 3 major players. The pace of innovation was glacial and much of the research during the period was, in the words of Bruce Lindsay, “polishing the round ball”. The problem was that the products were actually passably good, customers didn’t have a lot of alternatives, and nothing slows innovation like large teams with huge code bases.


In the last 5 years, the database world has become exciting again. I’m seeing more opportunity in the database world now than any other time in the last 20 years. It’s now easy to get venture funding to do database products and the number of and diversity of viable products is exploding. My talk focused on what changed, why it happened, and some of the technical backdrop influencing.


A background thesis of the talk is that cloud computing solves two of the primary reasons why customers used to be stuck standardizing on a single database engine even though some of their workloads may have run poorly. The first is cost. Cloud computing reduces costs dramatically (some of the cloud economics argument: and charges by usage rather than via annual enterprise license. One of the favorite lock-ins of the enterprise software world is the enterprise license. Once you’ve signed one, you are completely owned and it’s hard to afford to run another product.  My fundamental rule of enterprise software is that any company that can afford to give you 50% to 80% reduction from “list price” is pretty clearly not a low margin operator. That is the way much of the enterprise computing world continues to work: start with a crazy price, negotiate down to a ½ crazy price, and then feel like a hero while you contribute to incredibly high profit margins.


Cloud computing charges by the use in small increments and any of the major database or open source offerings can be used at low cost. That is certainly a relevant reason but the really significant factor is the offloading of administrative complexity to the cloud provider.  One of the primary reasons to standardize on a single database is that each is so complex to administer, that it’s hard to have sufficient skill on staff to manage more than one. Cloud offerings like AWS Relational Database Service transfer much of the administrative work to the cloud provider making it easy to chose the database that best fits the application and to have many specialized engines in use across a given company.


As costs fall, more workloads become practical and existing workloads get larger.  For example, If analyzing three months of customer usage data has value to the business and it becomes affordable to analyze two years instead, customers correctly want to do it. The plunging cost of computing is fueling database size growth at a super-Moore pace requiring either partitioned (sharded) or parallel DB engines.


Customers now have larger and more complex data problems, they need the products always online, and they are now willing to use a wide variety of specialized solutions if needed. Data intensive workloads are growing quickly and never have there been so many opportunities and so many unsolved or incompletely solved problems. It’s a great time to be working on database systems.


·         The slides from the talk:

·         Proceedings extended abstract:

·         Video of talk: (select June 14th to get to the video)

The talk video is available but, unfortunately, only to ACM digital library subscribers (thanks to Simon Leinen for pointing out the availability of the video link above).



James Hamilton



b: /


Thursday, June 23, 2011 3:21:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Thursday, June 09, 2011

The Amazon Technology Open House was held Tuesday night at the Amazon South Lake Union Campus. I did a short presentation on the following:


       Quickening pace of infrastructure innovation

       Where does the money go?

       Power distribution infrastructure

       Mechanical systems

       Modular & Advanced Building Designs

       Sea Change in Networking


Slides and notes:




James Hamilton



b: /


Thursday, June 09, 2011 5:40:21 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, June 03, 2011

Earlier today Alex Mallet reminded me of the excellent writing of Atul Gawnade by sending me a pointer to the New Yorker coverage of Gawande’s commencement address at the Harvard Medical School: Cowboys and Pit Crews.


Four years ago I wrote a couple of blog entries on Gawande’s work but, at the time, my blog was company internal so I’ve not posted these notes here in the past:


As a follow-on to the posting I made on professional engineering (also posted externally Edwin Young sent me a link to the following talk by Atul Gawande: Outcomes are very Personal. It’s from another domain, medicine, but is a phenomenally good presentation by a surgeon and his core premise applies equally to software: practitioners work and the outcomes of that work are spread on a bell curve.  The truly great are much better than the average and often an order of magnitude better than the lowest performing.  His book and the presentation is about the personal attributes and approaches of those at the very top. It’s well worth a view:


In my view, it’s an insightful presentation by a surgeon who loves data, loves understanding why we do well and how we can do better and is relentless in pursuit himself of doing everything better.  Subsequent to watching the presentation, I read a book by the same author “Better: A Surgeon's Notes on Performance” ( 


Software, like surgery, is part art and part science and there is tremendous variability between the average and the best.  Gawande studies the best in different specializations to understand why the performance of some practitioners is way out there at the positive end of the bell curve and, through a series of essays, makes observations on how to do improve performance of the population Aoverall.  Understanding that human performance is distributed on the bell curve means that, for whatever it is you are doing, there are average performers, terrible performers, and truly gifted performers.  Gawade looks for what he calls positive deviance – it’s always there in a bell curve distributed phenomena – and tries to understand what they do differently.  Worth reading.


A sampling of Gawande’s work:

·         Book:

·         Video:

·         Commencement Speech Notes:




James Hamilton



b: /


Friday, June 03, 2011 7:24:06 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, May 31, 2011

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting by the UK Marine Accident Investigation Board. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.


I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.


Wanting to deeply understand unusual failure modes and especially wanting to understand the errors that humans make when managing systems under stress, I spend time reading about system failures. Considerable learning can be drawn from reading about the failures of engineered systems and people under stress. All disasters or near disasters yield some unique lessons and re-enforce some old ones.


The hard part for me is getting enough detail to really learn from the situation. The press reports are often light on details partly because general audiences are not necessarily that interested but there also may be legal or competitive constraints preventing broad publication. NASA, FAA, Coast Guard and some other government reports to get to excellent detail. One analysis of system failure I learned greatly from was Feynman’s analysis of the space shuttle Challenger disaster as part of the Rogers Commission Report.


I just came across another report that is not quite a Feynman  classic but it is an excellent, just-the-facts description of a large scale failure. This report, from IEEE Spectrum, titled What Went Wrong in Japan’s Nuclear Reactors outlines what happened in the eventually catastrophic disaster at Japan’s Fukushima Dai-1 nuclear facility following the Tohoku earthquake and subsequent tsunami. In this report, the terminal failures of 4 of the 6 reactors at the facility is described in more detail than other accounts of that event I’ve come across.

All disasters are unique in some dimensions. What makes Fukushima particularly unusual is these failures occurred over multiple weeks rather than the seconds to hours of many events.  This one was relatively slow to develop and even slower to be brought under control. Looking forward, I suspect Fukushima will share some characteristics with Chernobyl where mitigating the environmental damage is still nowhere close to complete nearly three decades later. In 1998 the Ukraine government obtained economic aid from the European Bank for Reconstruction and Development to rebuild the failing Chernobyl sarcophagus. It is expected that yet more work will need to be done to continue to contain dangerous radioactive substances from escaping.  Similarly, I expect the environmental impact of the Fukushima disaster will be fought for decades at great cost both economic and human.


In many ways Fukishima was a classic disaster where a not particularly surprising event, in this case an earthquake near Japan, started the failure and then cascading natural disaster, equipment failure, and human decisions followed to yield an outcome that every aspect of the system design sought to avoid.


I recommend reading the IEEE report linked below and my rough notes from the write-up follow:

·         On March 11 an earthquake registering 9.0 magnitude was experienced off the coast of Japan

·         The tsunami hit the plant destroying power distribution gear cutting off power to the Fukushima facility

·         Backup generators and switch gear were also disabled by the Tsunami

·         Reactor building integrity was maintained through earthquake and Tsunami and the three reactors that were active at that point where all shut down properly

·         Due to the power failure and the damage to distribution gear and generators, plant cooling systems were not operating at any of the reactors nor the spent fuel rod storage pools

·         Even though the nuclear reaction had been stopped in the three reactors that were operational when the tsunami hit (reactors 1, 2, & 3), considerable heat was still being created putting the reactors at risk of meltdown. Meltdown is a condition where reactor core over temperature occurs, the coolant is boiled off, the fuel rods melt and form a pool of very hot, highly radioactive fuel in the bottom of the reactor. This hot, radioactive fluid then rapidly breaks down steel and concrete in the containment vessel and possibly escapes to the environment.

·         Another area of risk from the failed cooling systems are the spent nuclear fuel rod storage pools. These pools are also housed inside the reactor buildings near the primary containment vessel where the active nuclear reaction actually takes place. Although the fuel rods are no longer contributing to a nuclear reaction, they are both highly radioactive and still producing sufficient heat that active cooling is required. Without cooling these rods can heat the storage pool to the point that it boils off the cooling water and present a risk similar to the active rods inside the primary storage vessel.

·         I find it surprising that both the spent rod storage and the shut down reactor cores don’t appear to fail safe and self-stabilize when cooling water is removed given the considerably higher than zero probability of power failure and the seriously negative impact of radioactive release to the environment.

·         Events at Reactor #1:

o   March 12, a day after the power failure, heat in the recently shutdown reactor built up until the (not circulating) cooling water began to be boiled off.

o   As the water level fell, the now exposed fuel rods reacted with the steam in the primary containment vessel, and began producing hydrogen gas

o   The pressure rose to dangerous levels in the primary containment vessel and operators decided to vent the primary containment vessel into the reactor building.

o   The vented hydrogen gas when exposed to the relatively oxygen-rich environment in the reactor building, exploded blowing the top off the reactor building

o   The explosion may have also damaged the primary containment vessel and definitely released radioactive material

o   The operators chose to pump seawater into the building in an effort to control the escalating temperature inside the reactor and to avoid core meltdown

o   March 29, radioactive water was found outside the reactor building

o   April 5, reactor core temperatures have begun to fall indicating the system is coming back into control

o   Radioactivity levels in the building are very high and operators are injecting nitrogen to reduce the likelihood of subsequent hydrogen explosions.

o   May 12, TEPCO officials confirmed that the reactor had suffered a core meltdown and the bottom of the reactor building may be leaking highly radioactive water into the environment.

·         Events at Reactor #3:

o   March 14, 3 days after the tsunami and 2 days after the roof was blown off the Reactor #1 containment building, the same thing happened on Reactor #3

o   This explosion occurred despite plant operators pumping large quantities of cooling sea water into the reactor building

o   March 17, steam begins billowing from the reactor building confirming that the primary containment vessel was damaged and releasing radioactive compounds.

o   Helicopters dumped water on the building and police water cannons were used to pour water down onto the building.

o   Water was sprayed on the building for days with some interruptions as radiations levels rose sufficiently high that work had to be stopped.

o   March 24, workers laying power cables attempting to restore power to Reactor #3 waded into highly radioactive water requiring hospitalization.

o   March 28, dangerous plutonium was detected in the environment near Reactor #3.

·         Events at Reactor #2:

o   March 15, 4 days after the tsunami, 3 days after the roof was blown off Reactor #1, and a day after the roof was blown off Reactor #3, a serious explosion occurred at Reactor #2.

o   Reactor #2 was later confirmed to have experienced at least a partial core meltdown

o   March 27, highly radioactive water discovered outside of reactor building #2.

·         Subsequently large quantities of uncontained radioactive water has been found throughout the multi-reactor plan and the turbine facilities are flooded as are the cabling tunnels between the buildings. Serious radioactive water leaks into the ocean have been detected and subsequently corrected in one case by injecting 6,000 liters of liquid glass into the ground near the leak.

·         April 4th, 11,500 tons of radioactive water is pumped into the ocean. This water is 100x above the legal safety limit but was pumped into the environment in the hope that the storage facilities can be used to contain waste water that is 10,000x time radioactive limit for environmental release.

·         The spent fuel pools at the inactive reactors 4, 5, & 6 were all slowly overheating as a consequence of there being no cooling water. The Reactor #4 cooling pool either boiled off its water or it leaked off as a result of earthquake damage. The spent fuel rods exposed to atmosphere without cooling lead to fires inside Reactor building #4

·         Outcome:

o   Fukushima now rated to be as serious as the Chernobyl having been classified as a a magnitude 7 event, the worst on the International Nuclear Event Scale. However it is still consider to have released only 5 to 10% of the radiation released by Chernobyl.

o   All residents within 20 km evacuated

o   Voluntary evacuation of all residents between 20 and 30 km.

o   Agricultural products including milk and vegetables from the region contaminated

o   Tokyo’s tap water declared unfit for infants for 1 day

o   Decades of cleanup and containment remain


The report: What Went Wrong in Japan's Nuclear Reactors:


We all wish the situation had been avoided and, those of us involved in engineering projects whether they be life critical systems or not, need to ensure that the lessons from this one are learned well and applied faithfully to new designs. I won’t speculate on human risk in the efforts spent to mitigate this disaster but, clearly, the workers that brought these systems back under control and continue to manage the environmental impact are heroes and deserve our collective thanks. 




James Hamilton



b: /



Tuesday, May 31, 2011 5:39:33 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
 Wednesday, May 25, 2011

The European Data Center Summit 2011 was held yesterday at SihlCity CinCenter in Zurich. Google Senior VP Urs Hoelzle kicked off the event talking about why data center efficiency was important both economically and socially.  He went on to point out that the oft quoted number that US data centers represent is 2% of total energy consumption is usually mis-understood. The actual data point is that 2% of the US energy budget is spent on IT of which the vast majority is client side systems. This is unsurprising but a super important clarification.  The full breakdown of this data:


·         2% of US power

o   Datacenters:              14%

o   Telecom:                     37%

o   Client Device:            50%


The net is that 14% of 2% or 0.28% of the US power budget is consumed in datacenters.  This is a far smaller but still a very relevant number. In fact, that is the primary motivator behind the conference: how to get the best practices from industry leaders in datacenter efficiency available more broadly .


To help understand why this is important,

·         Of the 0.28% energy consumption by datacenters:

o   Small:            41%

o   Medium:     31%

o   Large:            28%


This later set of statistics predictably shows that the very largest data centers consume 28% of the data center energy budget while small and medium centers consume 72%.  High scale datacenter operates have large staffs of experts focused on increasing energy efficiency but small and medium sized centers can’t afford this overhead at their scale. Urs’s point and the motivation behind the conference is we need to get industry best practices available to all data center operations.


The driving goal behind the conference is that extremely efficient datacenter operations are possible using only broadly understood techniques. No magic is required.  It is true that the very large operators will continue to enjoy even better efficiency but existing industry best practices can easily get even small operators with limited budgets to within a few points of the same efficiency levels.


Using Power Usage Effectiveness as the measure while the industry leaders are at 1.1 to 1.2 where 1.2 means that every watt delivered to the servers requires 1.2 watts to be deliverd from the utility. Effectively it is a measure of the overhead or efficiency of the datacenter infrastructure. Unfortunately the average remains in the 1.8 to 2.0 range and the worst facilities can be as poor as 3.0.


Summarizing: Datacenters consume 0.28% of the annual US energy budget. 72% of these centers are small and medium sized centers that tend towards the lower efficiency levels.


The Datacenter Efficiency conference focused on making cost effective techniques more broadly understood showing how a PUE of 1.5 is available to all without large teams of experts or huge expense. This is good for the environment and less expensive to operate.




James Hamilton



b: /


Wednesday, May 25, 2011 5:52:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
 Monday, May 23, 2011

Guido van Rossum was at Amazon a week back doing a talk. Guido presented 21 Years of Python: From Pet Project to Programming Language of the Year.


The slides are linked below and my rough notes follow:

·         Significant Python influencers:

o   Algol 60, Pascal, C

o   ABC

o   Modula0-2+ and 3

o   Lisp and Icon

·         ABC was the strongest language influencer of this set

·         ABC design goals:

o   Professionals but not professional programmers (lab personal, scientists, etc.)

o   Easy to teach, easy to learn, easy to use

·         Parts of ABC most liked by Guido:

o   Design iterations based on user testing

§  E.g. colon before indented blocks

o   Simple design: IF, WHILE, FOR, …

o   Indentation for grouping (Knuth, occam)

o   Tuples, lists, dictionaries (though changed)

o   Immutable data types

o   No limits

o   The >>> prompt

·         Parts of ABC that most needed improvement:

o   Monolithic design – not extensible

§  E.g. no graphics, not easily added

o   Invented non-standard terminology

§  E.g. “how-to” instead of “procedure”

o   ALL'CAPS keywords

o   No integration with rest of system

§  No file-based I/O (persistent variables instead)

·         The beginnings of Python:

o   Amoeba project at CWI

§  Writing apps in C and sh and wanting something in between

·         Python design philosophy:

o   Borrow ideas whenever it makes sense

o   As simple as possible, no simpler (Einstein)

o   Do one thing well (UNIX)

o   Don’t fret about performance (fix it later)

o   Go with the flow (don’t fight environment)

o   Perfection is the enemy of the good

o   Cutting corners is okay (get back to it later)

·         User Centric Design Philosophy:

o   Avoid platform ties, but not religiously

o   Don’t bother the user with details

o   Discourage but allow coding to the platform

o   Offer multiple levels of extensibility

o   Errors should not be fatal, if possible

o   Errors should never pass silently

o   Don’t blame the user for bugs in Python

·         Core language stabilized quickly in the 1990 to 1991 timeframe

·         Early days of active Python community:

o   1990 – internal at CWI

§  More internal use than ABC ever had

§  Internal contributors

o   1991 – first release;

o   1994 – USENET group comp.lang.python

o   1994 – first workshop (NIST)

o   1995-1999 – from workshops to conferences

o   1995 – Python Software Association

o   1997 – goes online

o   1999 – Python Consortium

§  Modeled after X Consortium

o   2001 – Python Software Foundation

§  Modeled after Apache Software Foundation

·         Present day Python community:

o   PSF runs largest annual Python conference

§  PyCon Atlanta in 2011: 1500 attendees

§  2012-2013: Toronto; 2014-2015: Bay area

§  Also sponsors regional PyCons world-wide

o   EuroPython since 2002

o   Many local events, user groups



o   Stackoverflow etc.

·         Python 2 vs Python 3

o   Fixing deep bugs intrinsic in the design

o   Avoid two extremes:

§  perpetual backwards compatibility (C++)

§  rewrite from scratch (Perl 6)

o   Our approach:

§  evolve the implementation gradually

§  some backwards incompatibilities

§  separate tools to help users cope


Thanks to Guido for doing the well received Python presentation.


Guido’s slides and blog URLS:

·         Slides:

·         Blog:




James Hamilton



b: /


Monday, May 23, 2011 9:42:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, May 20, 2011

I invited Nikhil Handigol to present at Amazon earlier this week. Nikhil is a Phd candidate at Stanford University working with networking legend Nick McKeown on the Software Defined Networking team. Software defined networking is an concept coined by Nick where the research team is separating the networking control plane from the data plane. The goal is a fast and dumb routing engine with the control plane factored out and supporting an open programming platform.


From Nikil’s presentation, we see the control plane hoisted up to a central, replicated network O/S configuring the distributed routing engines in each switch.


One implementation of software defined networking is OpenFlow where each router supports the OpenFlow protocol and a central OpenFlow Controller computes routing tables that are installed in each router:


What makes OpenFlow especially interesting is that it’s simple, easy to implement, and getting broad industry support with the Open Networking Foundation as the central organizing body.  The Open Networking Foundation’s primary mission is to advance software defined networking using OpenFlow as the protocol. Founding members of the Open Networking Foundation are Deutsche Telekom, Facebook, Google, Microsoft, Verizon, and Yahoo!.  Also included are networking equipment providers including: Broadcom, Dell, Cisco, Force10, HP, Juniper, Marvell, Mellanox, and many others.


Today, most networking equipment is shipped as a vertically integrated stack including both the control and data planes. There are many reasons why this is not good for the industry. The Stanford team argues it blocks innovation in that researches can’t try new protocols with a closed stack without a programming model.  I agree. This is a problem for both academia and industry but my dislike of the current model is much broader. In Networking: The Last Bastion of Mainframe Computing, I made the case that this vertically integrated approach is artificially holding prices high and slowing the pace of innovation. A quick summary of the argument:


When networking equipment is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.


Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges.


I’m excited about software defined networking because it provides a clean interface allowing switch providers to both innovate and compete. An additional benefit is that SDN allows innovation and experimentation at the network protocol layer.


In Nikil’s talk last week at Amazon, he explored integrating load balancing functionality into the network routing fabric. The team started with the hypothesis that load balancing is really just smart routing. They then implemented a distributed load balancing fabric by adding load balancing support to network routers using Software Defined Networking. Essentially they distribute the load balancing functionality throughout the network. What’s unusual here is that the ideas could be tested and tried over a 9 campus, North American wide network with only 500 lines of code. With conventional network protocol stacks, this research work would have been impossible in that vendors don’t open up protocol stacks. And, even if they did, it would have been complex and very time consuming.




James Hamilton



b: /


Friday, May 20, 2011 10:55:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Wednesday, May 04, 2011

This note looks at the Open Compute Project distributed Uninterruptable Power Supply (UPS) and server Power Supply Unit (PSU). This is the last in a series of notes looking at the Open Compute Project. Previous articles include:

·         Open Compute Project

·         Open Compute Server Design

·         Open Compute Mechanical Design


The open compute uses a semi-distributed uninterruptable power supply (UPS) system. Most data centers use central UPS systems where large the UPS is part of the central power distribution system. In this design, the UPS is in the 480 3 phase part of the central power distribution system prior to the step down to 208VAC. Typical capacities range from 750kVA to 1,000kVA. An alternative approach is a distributed UPS like that used by a previous generation Google server design.

In a distributed UPS, each server has its own 12VDC battery to serve as backup power. This design has the advantage of being very reliable with the batter directly connected to the server 12V rail. Another important advantage is the small fault containment zone (small “blast radius”) where a UPS failure will only impact a single server. With a central UPS, a failure could drop the load on 100 racks of servers or more. But, there are some downsides of distributed UPS. The first is that batteries are stored with the servers. Batteries take up considerable space, can emit corrosive gasses, don’t operate well at high temperature, and require chargers and battery monitoring circuits. As much as I like aspects of the distributed UPS design, it’s hard to cost effectively and, consequently, is very uncommon.


The Open Compute UPS design is a semi-distributed approach where each UPS is on the floor with servers but rather than having 1 UPS per server (distributed UPS) or 1 UPS per order 100 racks (central UPS with roughly 4,000 servers), they have 1 UPS per 6 racks (180 servers).




In this design the battery rack is the central rack flanked by two triple racks of servers. Like the server racks, the UPS is delivered 480VAC 3 phase directly. At the top of the battery rack, they have control circuitry, circuit breakers, and rectifiers to charge the battery banks.


What’s somewhat unusual in the output stage of the UPS doesn’t include inverters to convert the direct current back to the alternating current required by a standard server PSU. Instead the UPS output is 48V direct current which is delivered directly to the three racks on either side of the UPS. This has the upside of avoiding the final invert stage which increases efficiency. There is a cost to avoiding converting back to AC.  The most important downside is they need to effectively have two server power supplies where one accepts 277VAC and the other accepts 48VDC from the UPS. The second disadvantage is using 48V distribution is inefficient over longer distances due to conductor losses at high amperage.


The problem with power distribution efficiency is partially mitigated by keeping the UPS close to servers where the 6 racks its feeds are on either side of the UPS so the distances are actually quite short. And, since the UPS is only used during short periods of time between the start of a power failure and the generators taking over, the efficiency of distribution is actually not that important a factor. The second issue remains, each server power supply is effectively two independent PSUs.


The server PSU looks fairly conventional in that it’s a single box. But, included in the single box, is two independent PSUs and some control circuitry. This has the downside of forcing the use of a custom, non-commodity power supply. Lower volume components tend to cost more. However, the power supply is a small part of the cost of a server so this additional cost won’t have a substantially negative impact. And, it’s a nice, reliable design with a small fault containment zone which I really like.


The Open Compute UPS and power distribution system avoids one level of power conversion common in most data centers, delivers somewhat higher voltages (277VAC rather than 208VAC) close to the load, and has the advantage of a small fault zone.


James Hamilton



b: /


Wednesday, May 04, 2011 5:24:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, April 29, 2011

Google cordially invites you to participate in a European Summit on sustainable Data Centres. This event will focus on energy-efficiency best practices that can be applied to multi-MW custom-designed facilities, office closets, and everything in between. Google and other industry leaders will present case studies that highlight easy, cost-effective practices to enhance the energy performance of Data Centres.

The summit will also include a dedicated session on cooling. Presenters will detail climate-specific implementations of free cooling as well as novel ways to utilise locally -available opportunities. We will also debate climate-independent PUE targets.

The agenda includes presentations and panel discussions featuring Amazon, DeepGreen, eBay, Google, IBM, Microsoft, Norman Disney & Young, PlusServer, Telecity Group, The Green Grid, UK's Chartered Institute for IT, UBS and others.

Attendance is free. However, space is limited and we therefore encourage you to register online at your earliest convenience. Your participation will be confirmed.

We look forward to seeing you and your colleagues in Zurich!

Friday, April 29, 2011 5:18:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, April 20, 2011

Last Thursday Facebook announced the Open Compute Project where they released pictures and specifications for their Prineville Oregon datacenter and the servers and infrastructure that will populate that facility. In my last blog, Open Compute Mechanical System Design I walked through the mechanical system in some detail. In this posting, we’ll have a closer look at the Facebook Freedom Server design.


Chassis Design:

The first thing you’ll notice when looking at the Facebook chassis design is there are only 30 servers per rack. They are challenging one of the strongest held beliefs in the industry that is density is the primary design goal and more density is good. I 100% agree with Facebook and have long argued that density is a false god. See my rant Why Blade Servers aren’t the Answer to all Questions for more on this one.  Density isn’t a bad thing but paying more to get denser designs that cost more to cool is usually a mistake. This is what I’ve referred to in the past as the Blade Server Tax.


When you look closer at the Facebook design, you’ll note that the servers are more than 1 Rack Unit (RU) high but less than 2 RU. They choose a non-standard 1.5RU server pitch. The argument is that 1RU server fans are incredibly inefficient. Going with 60mm fans (fit in 1.5RU) dramatically increases their efficiency but moving further up to 2RU isn’t notably better. So, on that observation, they went with 60mm fans and a 1.5RU server pitch.

I completely agree that optimizing for density is a mistake and that 1RU fans should be avoided at all costs so, generally, I like this design point. One improvement worth considering is to move the fans out of the server chassis entirely and go with very large fans on the back of the rack. This allows a small gain in fan efficiency by going with larger still fans and allows a denser server configuration without loss of efficiency or additional cost. Density without cost is a fine thing and, in this case, I suspect 40 to 80 servers per rack could be delivered without loss of efficiency or additional cost so would be worth considering.


The next thing you’ll notice when studying the chassis above is that there is no server case.  All the components are exposed for easy service and excellent air flow. And, upon more careful inspection, you’ll note that all components are snap in and can be serviced without tools. Highlights:

·         1.5 RU pitch

·         1.2 MM stamped pre-plated steel

·         Neat, integrated cable management

·         4 rear mounted 60mm fans

·         Tool-less design with snap plungers holding all components

·         100% front cable access



The Open Compute project supports two motherboard designs where 1 uses an Intel processors and the other uses AMD.

Intel Motherboard:

AMD Motherboard:


Note that these boards are both 12V only designs. 


Power Supply:

The power supply (PSU) is an usual design in two dimensions: 1) it is a single output voltage 12v design and 2) it’s actually two independent power supplies in a single box. Single voltage supplies are getting more common but commodity server power supplies still usually deliver 12V, 5V, and 3.3V.  Even though processors and memory require somewhere between 1 and 2 volts depending upon the technology, both typically are fed by the 12V power rail through a Voltage Regulator Down (VRD) or Voltage Regulator Module (VRM). The Open Compute approach is to use deliver 12V only to the board and to produce all other required voltages via an Voltage Regulator Module on the mother board. This simplifies the power supply design somewhat and they avoid cabling by having the motherboard connecting directly to the server PSU.


The Open Compute Power Supply is has two power sources. The primary source is 277V alternating current (AC) and the backup power source is 48V direct current (DC). The output voltage from both supplies is the same 12V DC power rail that is delivered to the motherboard.


Essentially this supply is two independent PSUs with a single output rail. The choice of 277VAC is unusual with most high-scale data centers run on 208VAC. But 277 allows one power conversion stage to be avoided and is therefore more power efficient.


Most data centers have mid-voltage transformers(typically in the 13.2kv range but it can vary widely by location). This voltage is stepped down to 480V three phase power in North America and 400V 3 phase in much of the rest of the world.  The 480VAC 3p power is then stepped down to 208VAC for delivery to the servers.


The trick that Facebook is employing in their datacenter power distribution system is to avoid one power conversion by not doing the 480VAC to 208VAC conversion. Instead, they exploit the fact that each phase of 480 3p power is 277VAC between the phase and neutral.  This avoids a power transformation step which improves overall efficiency. The negatives of this approach are 1) commodity power supplies can’t be used (277VAC is beyond the range of commodity PSUs) and 2) the load on each of the three phases need to be balanced.  Generally, this is a good design tradeoff where the increase in efficiency justifies the additional cost and complexity.


An alternative but very similar approach that I like even better is to step down mid-voltage to 400VAC 3p and then play the same phase to neutral trick described above. This technique still has the advantage of avoiding 1 layer of power transformation. What is different is the resultant phase to neutral voltage delivered to the servers is 230VAC which allows commodity power supplies to be used.  The disadvantage of this design is that the mid-voltage to 400VAC 3p transformer is not in common use in North America. However this is a common transformer in other parts of the world so they are still fairly easily attainable.


Clearly, any design that avoids a power transformation stage is a substantial improvement over most current distribution systems. The ability to use commodity server power supplies unchanged makes the 400 3p to neutral trick look slightly better than the 480VAC 3p approach but all designs need to be considered in the larger context in which they operate. Since the Facebook power redundancy system requires the server PSU to accept both a primary alternating current input and a backup 48VDC input, special purpose build supplies need to be used. Since a custom PSU is needed for other reasons, going with 277VAC as the primary voltage makes perfect sense.


Overall a very efficient and elegant design that I’ve enjoyed studying. Thanks to Amir Michael of the Facebook hardware design team for the detail and pictures.




James Hamilton



b: /


Wednesday, April 20, 2011 6:56:09 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
 Saturday, April 09, 2011

Last week Facebook announced the Open Compute Project (Perspectives, Facebook). I linked to the detailed specs in my general notes on Perspectives and said I would follow up with more detail on key components and design decisions I thought were particularly noteworthy.  In this post we’ll go through the mechanical design in detail.


As long time readers of this blog will know, PUE has many issues (PUE is still broken and I still use it) and is mercilessly gamed in marketing literature (PUE and tPUE). The Facebook published literature predicts that this center will deliver a PUE of 1.07. Ignoring the power requirements of the mechanical systems, just delivering power to the servers at a 7% loss is a considerable challenge. I’m slightly skeptical of this number as a fully loaded, annual PUE number but I can say with confidence that it is one of the nicest mechanical designs I have come across. Hats off to Jay Park and the rest of the Facebook DC design team.


Perhaps the best way to understand the mechanical design is to first walk through the mechanical system diagram and then step through the actual deployment in pictures.

In the mechanical system diagram you will first note there is no process based cooling (no air conditioning) and no chilled water loop. It’s a 100% air cooled with all IT equipment cooling via outside air which is particularly effective with the favorable weather conditions experienced in central Oregon. 


The outside air is pulled in from outside, mixed with a controlled volume of return air to avoid over-cooling in winter, filtered, evaporative cooled, and run through a fan wall before being returned to the cold aisles below.  


Air Intake

The cooling system takes up the entire second floor of the facility with the IT equipment and office space on the first floor. In the picture below, you’ll see the building air intake on the left running the full width of the facility. This intake area in the picture is effectively “outside” so includes water drains in the floor. In the picture to the right towards the top, you’ll see the air path to the remainder of the air handling system. To the right at the bottom, you’ll see the white sheet metal sections covering the passage where hot exhaust air is brought up from the hot aisle below to be mixed with outside air to prevent over-cooling on cold days.


Mixing Plenum & Filtration:

In the next picture, you can see the next part of the full building air handling system. On the left the louvers at the top are bringing in outside air from the outside air intake room.  On the left at the bottom, there are thermostatically controlled louvers allowing hot exhaust air to come up from the IT equipment hot aisle. The hot and outside air are mixed in this full room plenum before passing through the filtration wall visible on the right side of the picture.


The facility grounds are still being worked upon so the filtration system includes an extra set of disposable paper filters to increase the life of the more aggressive filtration media not visible behind.


Misting Evaporative Cooling

In the next picture below you’ll see the temperature and humidity controlled high-pressure water misting system again running the full width of the facility. Most evaporative cooling systems used in data center applications use wet media. This system used high pressure water with stainless steel nozzles to produce an atomized mist. These designs produce excellent cooling (high delta-T) but are prone to calcification and maintenance without aggressive filtration which brings some expense. But they are very effective.


Just beyond the misters, you’ll see what looks like filtration media. This media is there to ensure no airborne water makes it out of the cooling room.


Exhaust System

In the final picture below, you’ll see we have now got to the other side of the building where the exhaust fans are found. We bring air in on one side, filter, cool it, and pump it, and then just before the exhaust fans visible in the picture, huge openings in the floor alow treated, cooled air to be brough down to the IT equipment cold aisle below.


The exhaust fans visible in the picture control pressure by exhausting air not needed back outside.


The Facebook facility has considerable similarity to the design EcoCooling mechanical design I posted last Monday (Example of Efficient Mechanical System Design). Both approaches use the entire building as the air ducting system and both designs use evaporative cooling. The notable differences are: 1) EcoCooling design is based upon wet media whereas Facebook is using a high pressure water misting system, and 2) the EcoCooling design is running the mechanical systems beside the compute floor whereas the Facebook design uses the second floor above the IT rooms for all mechanical system.


The most interesting aspects of the Facebook mechanical design: 1) full building ducting with huge plenum areas, 2) no-process based cooling, 3) mist-based evaporative cooling, 4) large, efficient impellers with variable frequency drive, and 4) full wall, low-resistance filtration.


I’ve been saying for years that mechanical systems are where the largest opportunities for improvement lie and are the area where most innovation is most required. The Facebook Prineville Facility is most of the way there and one of the most efficient mechanical system designs I’ve come across. Very elegant and very efficient.




James Hamilton



b: /


Saturday, April 09, 2011 9:43:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [17] - Trackback
 Thursday, April 07, 2011

The pace of innovation in data center design has been rapidly accelerating over the last 5 years driven by the mega-service operators. In fact, I believe we have seen more infrastructure innovation in the last 5 years than we did in the previous 15. Most very large service operators have teams of experts focused on server design, data center power distribution and redundancy, mechanical designs, real estate acquisition, and network hardware and protocols.  But, much of this advanced work is unpublished and practiced at a scale that  is hard to duplicate in a research setting.


At low scale with only a data or center or two, it would be crazy to have all these full time engineers and specialist focused on infrastructural improvements and expansion. But, at high scale with 10s of data centers, it would be crazy not to invest deeply in advancing the state of the art.


Looking specifically at cloud services, the difference between an unsuccessful cloud service and a profitable, self-sustaining business is the cost of the infrastructure. With continued innovation driving down infrastructure costs, there is investment capital available, services can be added and improved, and value can be passed on to customers through price reductions. Amazon Web Services, for example, has had 11 price reductions in 4 years. I don’t recall that happening in my first 20 years working on enterprise software. It really is an exciting time in our industry.


Facebook is a big business operating at high scale and they also have elected to invest in advanced infrastructure designs. Jonathan Heiliger and the Facebook infrastructure team have hired an excellent group of engineers over the past couple of years and are now bringing these designs to life in their new Prineville Oregon facility. I had the opportunity to visit this datacenter 6 weeks back just before it started taking production load. I had an excellent visit, got to catch up with some old friends, meet some new ones, and tour an impressive facility. I saw an unusually large number of elegant designs ranging from one of the cleanest mechanical systems I’ve come across, three phase 480VAC directly to the rack,  a low voltage direct current distributed uninterruptable power supply system, all the way through to custom server designs. But, what made this trip really unusual is that I’m actually able to talk about what I saw.


In fact, more than allowing me to talk about it, Facebook has decided to release most of the technical details surrounding these designs publically. In the past, I’ve seen some super interesting but top secret facilities and I’ve seen some public but not particularly advanced data centers. To my knowledge, this is the first time an industry leading design has been documented in detail and released publically.


The set of specifications Facebook is releasing are worth reading so I’m posting links to all below.  I encourage you to go through these in as much detail as you chose. In addition, I’ll also  post summary notes over the next couple of days explain aspects of the design I found most interesting or commenting upon the pros and cons of some of the approaches employed.


The specifications:

·         Data Center Design

·         Intel Motherboard

·         AMD Motherboard

·         Battery Cabinet (Distributed UPS)

·         Server PSU

·         Server Chassis and Triplet Hardware


My commendations to the specification authors Harry Li , Pierluigi Sarti, Steve Furuta, Jay Park and to the rest of the Facebook infrastructure team for releasing this work publically and for doing so in sufficient detail that others can build upon it. Well done.





·         Open Compute Web Site:

·         Live Blog of the Announcement:


James Hamilton



b: /



Thursday, April 07, 2011 9:30:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, April 04, 2011

A bit more than a year back, I published Computer Room Evaporative Cooling where I showed an evaporative cooling design from EcoCooling. Periodically, Alan Beresford sends me designs he’s working on. This morning he sent me a design they are working on for a 7MW data center in Ireland.


I like the design for a couple of reasons: 1) It’s a simple design and efficient design, and 2) it’s a nice example of a few important industry trends. The trends exemplified by this design are: 1) air-side economization, 2) evaporative cooling, 3) hot-aisle containment, and 4) very large plenums with controlled hot-air recycling. The diagrams follow and, for the most part, speak for themselves.



I expect mechanical designs with these broad characteristics are going to be showing up increasingly frequently over the next year or so primarily because it is a cost effective and environmentally sounds approach.




James Hamilton



b: /


Monday, April 04, 2011 2:53:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
 Saturday, March 26, 2011

Brad Porter is Director and Senior Principal engineer at Amazon. We work in different parts of the company but I have known him for years and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the Lisa paper On Designing and Deploying Internet-Scale Services.




Prioritizing the Principles in "On Designing and Deploying Internet-Scale Services"

By Brad Porter

James published what I consider to be the single best paper to come out of the highly-available systems world in many years. He gave simple practical advice for delivering on the promise of high-availability. James presented “On Designing and Deploying Internet-Scale Services at Lisa 2007.

A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.

1. Keep it simple

Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience complexity is the number one cause of human error.

Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I'm going to add three new sub-principles to this.

Have Well Defined Architectural Roles and Responsibilities: Robust systems are often described as having "good bones." The structural skeleton upon which the system has evolved and grown is solid. Good architecture starts from having a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a white-board.

Minimize Variance: Variance arises most often when engineering or operations teams use partitioning typically through branching or forking as a way to handle different use cases or requirements sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.

Clean-Up Cruft: Cruft can be defined as those things that clearly should be fixed, but no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low priority "bugs" that no one has fixed. Cleaning up cruft is a constant task, but it is a necessary to minimize complexity.

2. Expect failures

At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To "expect failures" is to recognize that "off" is always a valid state. A host or component may switch to the "off" state at any time without warning.

If you're willing to turn a component off at any time, you're immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.

3. Support version roll-back

Roll-back is similarly liberating. Many system problems are introduced on change-boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decreases dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.

4. Maintain forward-and-backward compatibility

Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn't possible as customers may be unable or unwilling to upgrade at the same time.

If you have forward-and-backwards compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. This also allows a subset of machines in the system to be upgraded and receive real production traffic as a last phase of the test cycle simultaneously with older versions of the component.

5. Give enough information to diagnose

Once you have the big ticket bugs out of the system, the persistent bugs will only happen one in a million times or even less frequently. These problems are almost impossible to reproduce cost effectively. With sufficient production data, you can perform forensic diagnosis of the issue. Without it, you're blind.

Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug and it gives you the tools to answer exactly what happened quickly rather than guessing based on the results of a multi-day or multi-week simulation.

I rank these five as the most important because they liberate you to continue to evolve the system as time and resource permit to address the other dimensions the paper describes. If you fail to do the first five, you'll be endlessly fighting operational overhead costs as you attempt to make forward progress. 

  • If you haven't kept it simple, then you'll spend much of your time dealing with system dependencies, arguing over roles & responsibilities, managing variants, or sleuthing through data/config or code that is difficult to follow.
  • If you haven't expected failures, then you'll be reacting when the system does fail. You may also be dealing with complicated change-management processes designed to keep the system up and running while you're attempting to change it.
  • If you haven't implemented roll-back, then you'll live in fear of your next upgrade. After one or two failures, you will hesitate to make any further system change, no matter how beneficial.
  • Without forward-and-backward compatibility, you'll spend much of your time trying to force dependent customers through migrations.
  • Without enough information to diagnose, you'll spend substantial amounts of time debugging or attempting to reproduce difficult-to-find bugs.

I'll end with another encouragement to read the paper if you haven't already: "On Designing and Deploying Internet-Scale Services"


Saturday, March 26, 2011 8:21:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
 Sunday, March 20, 2011

Back in early 2008, I noticed an interesting phenomena: some workloads run more cost effectively on low-cost, low-power processors. The key observation behind this phenomena is that CPU bandwidth consumption is going up faster than memory bandwidth.  Essentially, it’s a two part observation: 1) many workloads are memory bandwidth bound and will not run faster with a faster processor unless the faster processor comes with a much improved memory subsystem and 2) the number of memory bound workloads is going up overtime.


One solution is improve both the memory bandwidth and the processor speed and this really does work but it is expensive. Fast processors are expensive and faster memory subsystems bring more cost as well. The cost of the improved performance goes up non-linearly.  N times faster costs far more than N times as much. There will always be workloads where these additional investments make sense. The canonical scale-up workload is relational database (RDBMS). These workloads tend to be tightly coupled and not scale out well. The non-linear cost of more expensive, faster gear is both affordable and justified by there really not being much of an alternative. If you can’t easily scale out and the workload is important, scale it up.


The poor scaling of RDBMS workloads is driving three approaches in our industry: 1) considerable spending on SANs, SSDs, flash memory, and scale-up servers, 2) the noSQL movement rebelling against the high-function, high-cost, and poor scaling characteristics of RDBMS, and 3) deep investments in massively parallel processing SQL solutions (Teradata, Greenplum, Vertica, Paracel, Aster Data, etc.).  


All three approaches have utility and I’m interested in all three. However, there are workloads that really are highly parallel. For these workloads, the non-linear cost escalation of scale-up servers is less cost effective than more commodity servers. The banner workload in this class are simple web servers where there are potentially thousands of parallel requests and its more cost effective to turn these workloads over a fleet of commodity servers.


The entire industry has pretty much moved to deploying these highly parallelizable workloads over fleets of commodity servers. The will always be workloads that are tightly coupled and hard to run effectively over large number of commodity servers but, for those that can be, the gains are great: 1) far less expensive, 2) much smaller unit of failure, 3) cheap redundancy, 4) small, inexpensive unit of growth (avoid forklift upgrades), and 5) no hard limit at the scale-up limit.


This tells us two things: 1) where we can run a workloads over a large number of commodity servers we should, and 2) the number of workloads where parallel solutions have been found continues to increase. Even workloads like the social networking hairball problem (see hairball problem in we now have economic parallel solutions. Even some RDBMS workloads can now be run cost effectively on scale-out clusters. The question I started thinking through about back in 2008 is how far can we take this? What if we used client processors or even embedded device processors? How far down the low-cost, low-power spectrum make sense for highly parallelizable workloads? Cleary the volume economics client and device processors make them appealing from a price perspective and the multi-decade focus on power efficiencies in the device world offers impressive power efficiencies for some workloads.


For more on the low-cost, low-power trend see:

·         Very low-Cost, Low-Power Servers

·         Microslice Servers

·         The Case of Low-Cost, Low-Power Servers

Or the full paper on the approach: Cooperative Expendable, Microslice Servers


I’m particularly interested in the possible application of ARM processors to server workloads:

·         Linux/Apache on ARM processors

·         ARM Cortex-A9 SMP Design Announced


Intel is competing with ARM on some device workloads and offers the Atom which is not as power efficient as ARM but has the upside of running Intel Architecture instruction set. ARM also icensees recognize the trend I’ve outlined above – some server workloads will run very cost effectively on low-cost, low-power processors – and they want to compete in the server business. One excellent example is SeaMicro who is taking Intel Atom processors and competing for the server business: SeaMicro Releases Innovative Intel Atom Server.


Competition is a wonderful thing and drives much of the innovation the fruits of which we enjoy today. We have three interesting points of competition here: 1) Intel is competing for the device market with Atom, 2) ARM licensees are competing with Intel Xeon in the server market, and 3) Intel Atom is being used to compete in the server market but not with solid Intel backing.


Using Atom to compete with server processors is, on one hand, good for Intel in that it gives them a means of competing with ARM at the low end of the server market but, on the other hand, the Atom is much cheaper than Xeon so Intel will lose money on every workload it wins where that workloads used to be Xeon hosted. This risk is limited by Intel not giving Atom ECC memory (You Really do Need ECC Memory). Hobbling Atom is a dangerous tactic in that it protects Xeon but, at the same time, may allow ARM to gain ground in the server processor market. Intel could have ECC support on Atom in months – there is nothing technically hard about it. It’s the business implications that are more complex.


The first step in the right direction was made earlier this week where Intel announced an ECC capable Atom for 2012:

·         Intel Plans on Bringing Atom to Servers in 2012, 20W SNB for Xeons in 2011

·         Intel Flexes Muscles with New Processor in the Face of ARM Threat


This is potentially very good news for the server market but its only potentially good news. Waiting until 2012 suggests Intel wants to give the low power Xeons time in the market to see how they do before setting the price on the ECC equipped Atom. If the ECC supporting version of Atom is priced like a server part rather than a low-cost, high volume device part, then nothing changes. If the Atom with ECC comes in priced like a device processor, then it’s truly interesting. This announcement says that Intel is interested in the low-cost, low-power server market but still plans to delay entry into the lowest end of that market for another year. Still, this is progress and I’m glad to see it.


Thanks to Matt Corddry of Amazon for sending the links above my way.




James Hamilton



b: /


Sunday, March 20, 2011 1:19:09 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Tuesday, March 15, 2011

Two of the highest leverage datacenter efficiency improving techniques currently sweeping the industry are: 1) operating at higher ambient temperatures ( and air-side economization  with evaporative cooling (


The American Society of Heating and Refrigeration, and Air-Conditioning Engineers (ASHRAE) currently recommends that servers not be operated at inlet temperatures beyond 81F. Its super common to hear that every 10C increase in temperatures leads to 2x the failure – some statements get repeated so frequently they become “true” and no longer get questioned. See Exploring the limits of Data Center Temperature for my argument that this rule of thumb doesn’t apply over the full range operating temperatures.


Another one of those “you can’t do that” statements is around air-side economization also referred to as Outside Air (OA) cooling. Stated simply air-side economization is essentially opening the window. Rather than taking 110F exhaust air and cooling it down and recirculation it back to cool the servers, dump the exhaust and take in outside air to cool the servers.  If the outside air is cooler than the server exhaust, and it almost always will be, then air-side economization is a win.


The most frequently referenced document explaining why you shouldn’t do this is: Particulate and Gaseous  Contamination Guidelines for Data Centers again published by ASHRAE.  Even the document title sounds scary. Do you really want your servers operating in an environment of gaseous contamination? But, upon further reflection, is it really the case that servers really need better air quality than the people that use them? Really?


From the ASHRAE document:

The recent increase in the rate of hardware failures in data centers high in sulfur-bearing gases, highlighted by the number of recent publications on the subject, led to the need for this white paper that recommends that in addition to temperature-humidity control, dust and gaseous contamination should also be monitored and controlled. These additional environmental measures are especially important for data centers located near industries and/or other sources that pollute the environment.


Effects of airborne contaminations on data center equipment can be broken into three main categories: chemical effects, mechanical effects, and electrical effects. Two common chemical failure modes are copper creep corrosion on circuit boards and the corrosion of silver metallization in miniature surface-mounted components.


Mechanical effects include heat sink fouling, optical signal interference, increased friction, etc. Electrical effects include changes in circuit impedance, arcing, etc. It should be noted that the reduction of circuit board feature sizes and the miniaturization of components, necessary to improve hardware performance, also make the hardware more prone to attack by contamination in the data center environment, and manufacturers must continually struggle to maintain the reliability of their ever shrinking hardware.


It’s hard to read this document and not be concerned about the user of air-side economization. But, on the other hand, most leading operators are using it and experiencing no measure deleterious effects.  Let’s go get some more data.


Digging deeper, the Data Center Efficiency Summit had a session on exactly this topic title: Particulate and Corrosive Gas Measurements of Data Center Airside Economization: Data From the Field – Customer Presented Case Studies and Analysis. The title is a bit of tongue twister but the content is useful. Selecting from the slides:


·         From Jeff Stein of Taylor Engineering:

o   Anecdotal evidence of failures in non-economizer data centers in extreme environments in India or China or industrial facilities

o   Published data on corrosion in industrial environments

o   No evidence of failures in US data centers or any connection to economizers

o   Recommendations that gaseous contamination should be monitored and that gas phase filtration is necessary for US data centers are not supported

·         From Arman Shehabi of the UC Berkeley Department of Civil and Environmental Engineering:

o    Particle concerns should not dissuade economizer use

        More particles during economizer use with MERV 7 filters than during non-economizer periods, but still below many IT guidelines

        I/O ratios with MERV 14 filters and economizers were near (and often below!) levels using MERV 7 filters w/o economizers

o   Energy savings from economizer use greatly outweighed the fan energy increase from improved filtration

        MERV 14 filters increased fan power by about 10%, but the absolute increase (6 kW) was much smaller than the ~100 kW of chiller power savings during economizer use (in August!)

        The fan power increase is constant throughout the year while chiller savings during economizer should increase during cooler period


If you are interested in much more detail that comes to the same conclusions that Air-Side Economization is a good technique, see the excellent paper Should Data Center Owners be Afraid of Air-Side Economization Use? – A review of ASHRAE TC 9.9 White Paper titled Gaseous and Particulate Contamination Guidelines for Data Centers.


I urge you to read the full LBNL paper but I excerpt from the conclusions:

The TC 9.9 white paper brings up what may be an important issue for IT equipment in harsh environments but the references do not shed light on IT equipment failures and their relationship to gaseous corrosion.  While the equipment manufacturers are reporting an uptick in failures, they are not able to provide information on the types of failures, the rates of failures, or whether the equipment failures are in new equipment or equipment that may be pre-Rojas. Data center hardware failures are not documented in any of the references in the white paper. The only evidence for increased failures of electronic equipment in data centers is anecdotal and appears to be limited to aggressive environments such as in India, China, or severe industrial facilities. Failures that have been anecdotally September 2010 presented occurred in data centers that did not use air economizers. The white paper recommendation that gaseous contamination should be monitored and that gas phase filtration is necessary for data centers with high contamination levels is not supported.  


We are concerned that data center owners will choose to eliminate air economizers (or not operate them if installed) based upon the ASHRAE white paper since there are implications that contamination could be worse if air economizers are used.  This does not appear to be the case in practice, or from the information presented by the ASHRAE white paper authors. 


I’ve never been particularly good at accepting “you can’t do that” and I’ve been frequently rewarded for challenging widely held beliefs. A good many of these hard and fast rules end up being somewhere between useful guidelines not applying in all conditions to merely opinions. There is a large and expanding body of data supporting the use of air-side economization.




James Hamilton



b: /


Tuesday, March 15, 2011 10:35:57 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Saturday, March 05, 2011

Chris Page, Director of Climate & Energy Strategy at Yahoo! spoke at the 2010 Data Center Efficiency Summit and presented Yahoo! Compute Coop Design.

 The primary attributes of the Yahoo! design are: 1) 100% free air cooling (no chillers), 2) slab concrete floor, 3) use of wind power to augment air handling units, and 4) pre-engineered building for construction speed.


Chris reports the idea to orient the building such that the wind force on the external wall facing the dominant wind direction and use this higher pressure to assist the air handling units was taken from looking at farm buildings in the Buffalo, New York area. An example given was the use of natural cooling in chicken coops.

Chiller free data centers are getting more common (Chillerless Data Center at 95F) but the design approach is still far from common place. The Yahoo! team reports they run the facility at 75F and use evaporative cooling to attempt to hold the server approach temperatures down below 80F using evaporative cooling.  A common approach and the one used by Microsoft in the chillerless datacenter at 95F example is install low efficiency coolers for those rare days when they are needed on the logic that low efficiency is cheap and really fine for very rare use. The Yahoo! approach is to avoid the capital cost and power consumption of chillers entirely by allowing the cold aisle temperatures to rise to 85F to 90F when they are unable to hold the temperature lower. They calculate they will only do this 34 hours a year which is less than 0.4% of the year.


Reported results:

·         PUE at 1.08 with evaporative cooling and believe they could do better in colder climates

·         Save ~36 million gallons of water / yr, compared to previous water cooled chiller plant w/o airside economization design

·         Saves over ~8 million gallons of sewer discharge / yr (zero discharge design)

·         Lower construction cost (not quantified)

·         6 months from dirt to operating facility


Chris’ slides: (the middle slides of the 3 slide deck set)


Thanks to Vijay Rao for sending this my way.




James Hamilton



b: /


Saturday, March 05, 2011 1:58:04 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware | Services
 Sunday, February 27, 2011

Datacenter temperature has been ramping up rapidly over the last 5 years. In fact, leading operators have been pushing temperatures up so quickly that the American Society of Heating, Refrigeration, and Air-Conditioning recommendations have become a become trailing indicator of what is being done rather than current guidance. ASHRAE responded in January of 2009 by raising the recommended limit from 77F to 80.6F (HVAC Group Says Datacenters Can be Warmer). This was a good move but many of us felt it was late and not nearly a big enough increment.  Earlier this month, ASHRAE announced they are again planning to take action and raise the recommended limit further but haven’t yet announced by how much (ASHRAE: Data Centers Can be Even Warmer). 


Many datacenters are operating reliably well in excess even the newest ASHRAE recommended temp of 81F. For example, back in 2009 Microsoft announced they were operating at least one facility without chillers at 95F server inlet temperatures.


As a measure of datacenter efficiency, we often use Power Usage Effectiveness. That’s the ratio of the power that arrives at the facility divided by the power actually delivered to the IT equipment (servers, storage, and networking). Clearly PUE says nothing about the efficiency of servers which is even more important but, just focusing on facility efficiency, we see that mechanical systems are the largest single power efficiency overhead in the datacenter.


There are many innovative solutions to addressing the power losses to mechanical systems. Many of these innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.


The downsides of higher temperatures are 1) high semiconductor leakage losses, 2) higher server fan speed which increases the losses to air moving, and 3) higher server mortality rates. I’ve measured the former and, although these losses are inarguably present, these losses are measureable but have a very small impact at even quite high server inlet temperatures.  The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.


The net of these factors is fear of server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often quoted study eports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK 217F). This data point is incredibly widely used by the military, NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from is a linear model of failure to heat.


This linear failure model is helpful guidance but is clearly not correct across the entire possible temperatures range. We know that at low temperature, non-heat related failure modes dominate. The failure rate of gear at 60F (16C) is not ½ the rate of gear at 79F (26C). As the temperature ramps up, we expect the failure rate to increase non-linearly. Really what we want to find is the knee of the curve. When does the failure rate increase start to approach that of the power savings achieved at higher inlet temperatures?  Knowing that the rule of thumb that 10C increase double the failure rate is incorrect doesn’t really help us. What is correct?


What is correct is a difficult data point to get for two reasons: 1) nobody wants to bear the cost of the experiment – the failure case could run several hundred million dollars, and 2) those that have explored higher temperatures and have the data aren’t usually interested in sharing these results since the data has substantial commercial value.


I was happy to see a recent Datacenter Knowledge article titled What’s Next? Hotter Servers with ‘Gas Pedals’ (ignore the title – Rich just wants to catch your attention). This article include several interesting tidbits on high temperature operation. Most notable was the quote by Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, who says:


Take the data center in the desert. Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, shared an anecdote about a data center user in the Middle East that wanted to test server failure rates if it operated its data center at 45 degrees Celsius – that’s 115 degrees Fahrenheit.


Testing projected an annual equipment failure rate of 2.45 percent at 25 degrees C (77 degrees F), and then an increase of 0.36 percent for every additional degree. Thus, 45C would likely result in annual failure rate of 11.45 percent. “Even if they replaced 11 percent of their servers each year, they would save so much on air conditioning that they decided to go ahead with the project,” said Bapat. “They’ll go up to 45C using full air economization in the Middle East.”


This study caught my interest for a variety of reasons. First and most importantly, they studied failure rates at 45C and decided that, although they were high, it still made sense for them to operate at these temperatures. It is reported they are happy to pay the increased failure rate in order to save the cost of the mechanical gear. The second interesting data point from this work is that they have found a far greater failure rate for 15C of increased temperature than predicted by MIL-HDBK-217F. Consistent with the217F, they also report a linear relationship between temperature and failure rate. I almost guarantee that the failure rate increase between 40C and 45C was much higher than the difference between 25C and 30C. I don’t buy the linear relationship between failure rate and temperature based upon what we know of failure rates at lower temperatures. Many datacenters have raised temperatures between 20C (68F) and 25C (77F) and found no measureable increase in server failure rate. Some have raised their temperatures between 25C (77F) and 30C (86F) without finding the predicted 1.8% increase in failures but admitted the data remains sparse.


The linear relationship is absolutely not there across a wide temperature range. But I still find the data super interesting in two respects: 1) they did see a roughly a 9% increase in failure rate at 45C over the control case at 20C and 2) even with the high failure rate, they made an economic decision to run at that temperature. The article doesn’t say if the failure rate is a lifetime failure rate or an Annual Failure Rate (AFR). Assuming it is an AFR, many workloads and customers could not life with a 11.45% AFR nonetheless, the data is interesting and it good to see the first public report of an operator doing the study and reaching a conclusion that works for their workloads and customers.


The net of the whole discussion above is the current linear rules of thumb are wrong and don’t usefully predict failure rates in the temperature ranges we care about, there is very little public data out there, industry groups like ASHRAE are behind and it’s quite possible that cooling specialist may not be the first to recommend we stop cooling datacenters aggressively :-). Generally, it’s a sparse data environment out there but the potential economic and environmental gains are exciting so progress will continue to get made. I’m looking forward to efficient and reliable operation at high temperature becoming a critical data point in server fleet purchasing decisions. Let’s keep pushing.




James Hamilton



b: /


Sunday, February 27, 2011 10:25:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Sunday, February 20, 2011

Dileep Bhandarkar presented the keynote at Server design Summit last December.  I can never find the time to attend trade shows so I often end up reading slides instead. This one had lots of interesting tidbits so I’m posting a pointer to the talk and my rough notes here.


Dileep’s talk: Watt Matters in Energy Efficiency


My notes:

·         Microsoft Datacenter Capacities:

o   Quincy WA: 550k sq ft, 27MW

o   San Antonio Tx: 477k sq ft, 27MW

o   Chicago, Il: 707k sq ft, 60MW

§  Containers on bottom floor with “medium reliability”  (no generators) and standard rooms on top floor with full power redundancy

o   Dublin, Ireland: 570k sq ft, 27MW

·         Speeds and feeds from Microsoft Consumer Cloud Services:

o   Windows Live: 500M IDs

o   Live Hotmail: 355M Active Accounts

o   Live Messenger: 303M users

o   Bing: 4B Queries/month

o   Xbox Live: 25M users    

o   adCenter: 14B Ads served/month

o   Exchange Hosted Services: 2 to 4B emails/day

·         Datacenter Construction Costs

o   Land: <2%

o   Shell: 5 to 9%

o   Architectural: 4 to 7%

o   Mechanical & Electrical: 70 to 85%

·         Summarizing the above list, we get 80% of the costs scaling with power consumption and 10 to 20% scaling with floor space. Reflect on that number and you’ll understand why I think the industry is nuts to be focusing on density. See Why Blade Servers Aren’t the Answer to All Questions for more detail on this point – I think it’s a particularly important one.

·         Reports that datacenter build costs are $10M to $20M per MW and server TCO is the biggest single category. I would use the low end of this spectrum for a cost estimator with inexpensive facilities in the $9M to $10M/MW range. See Overall Datacenter Costs.

·         PUE is a good metric for evaluating datacenter infrastructure efficiency but Microsoft uses best server performance per watt per TCO$

o   Optimize the server design and datacenter together

·         Cost-Reduction Strategies:

o   Server cost reduction:

§  Right size server: Low Power Processors often offer best performance/watt (see Very Low-Power Server Progress for more)

§  Eliminate unnecessary components (very small gain)

§  Use higher efficiency parts

§  Optimize for server performance/watt/$ (cheap and low power tend to win at scale)

o   Infrastructure cost reduction:

§  Operate at higher temperatures

§  Use free air cooling & Eliminate chillers (More detail at: Chillerless Datacenter at 95F)

§  Use advanced power management with power capping to support power over-subscription with peak protection (see the 2007 paper for the early work on this topic)

·         Custom Server Design:

o   2 socket, half-width server design (6.3”W x 16.7”L)

o   4x SATA HDD connectors

o   4x DIMM slots per CPU socket

o   2x 1GigE NIC

o   1x 16x PCIe slot

·         Custom Rack Design:

o   480 VAC 3p power directly to the rack (higher voltage over a given conductor size reduces losses in distribution)

o   Very tall 56 RU rack (over 98” overall height)

o   12VDC distribution within the rack from two combined power supplies with distributed UPS

o   Power Supplies (PSU)

§  Input is 480VAC 3p

§  Output: 12V DC

§  Servers are 12VDC only boards

§  Each PSU is 4.5KW

§  2 PSUs/rack  so rack is 9.0KW max

o   Distributed UPS

§  Each PSU includes an UPS made up of 4 groups of 4013.2V batteries

§  Overall 160 discrete batteries per UPS

§  Technology not specified but based upon form factor and rough power estimate, I suspect they may be Li-ion 18650. See Commodity Parts for a note on these units.

o   By putting 2 PSUs per rack they avoid transporting low voltage (12VDC) further than 1/3 of a rack distance (under 1 yard) and only 4.5 KW is transported so moderately sized bus bars can be used.

o   Rack Networking:

§  Rack networking is interesting with 2 to 4 tor switches per rack. We know the servers are 1 GigE connected and there are up to 96 per rack which yields two possibilities: 1) they are using 24 port GigE switches or 2) they are using 48 port GigE and fat tree network topology (see VL2: A Scalable and Flexible Datacenter Network for more on a fat tree). 24 port TORs are not particularly cost effective, so I would assume 48x1GigE TORs in a fat tree design which is nice work.

o   Report that the CPUs can be lower power parts including Intel Atom, AMD Bobcat, and ARM (more on lower powered processors in server applications:

·         Power proportionality: Shows that server at 0% load consumes 32% of peak power (this is an amazingly good server – 45% to 60% is much closer to the norm)


James Hamilton



b: /


Sunday, February 20, 2011 7:00:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Friday, February 11, 2011

Ben Black always has interesting things on the go. He’s now down in San Francisco working on his startup Fastip which he describes as “an incredible platform for operating, exploring, and optimizing data networks.” A couple of days ago Deepak Singh sent me to a recent presentation of Ben’s I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.


The problem described in this talk was “Collect, index, and query trillions of high dimensionality

records with seconds of latency for ingestion and response.” What Ben is doing is collecting per flow networking data with tcp/ip 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, he is tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven’t really a clue on the details of what traffic is clogging it and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample it with solutions like Cisco's NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast and  I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it’s an interesting problem in that it’s both hard and very useful to solve.


Ben presented 3 possible solutions and why they don’t work before offering a solution. The failed approaches that couldn’t cope with high dimensionality and the sheer volume of the dataset:

1.       HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2.       Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3.       Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.


The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as “A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically eliminate expensive processing at query time, even at a cost of enormous storage consumption.” OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.


Ben draws from two papers to implement an OLAP based solution at this scale:

·         High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

·         Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche


The end solution is:

·         Insert records into Cassandra.

·         Materialize lower-dimensional cuboids using bitsets and then join as needed.

·         Perform all query steps directly in the database

Ben’s concluding advice:

·         Read the literature

·         Generic software is 90% wrong at scale, you just don’t know which 90%.

·         Iterate to discover and be prepared to start over


If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.




James Hamilton



b: /


Friday, February 11, 2011 5:41:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

<June 2011>

This Blog
Member Login
All Content © 2015, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton