Wednesday, May 04, 2011

This note looks at the Open Compute Project distributed Uninterruptable Power Supply (UPS) and server Power Supply Unit (PSU). This is the last in a series of notes looking at the Open Compute Project. Previous articles include:

·         Open Compute Project

·         Open Compute Server Design

·         Open Compute Mechanical Design


The Open Compute design uses a semi-distributed uninterruptable power supply (UPS) system. Most data centers use central UPS systems where the large UPS is part of the central power distribution system. In this design, the UPS sits in the 480V 3-phase portion of the central power distribution system, prior to the step down to 208VAC. Typical capacities range from 750kVA to 1,000kVA. An alternative approach is a distributed UPS like that used by a previous generation Google server design.

In a distributed UPS, each server has its own 12VDC battery to serve as backup power. This design has the advantage of being very reliable with the battery directly connected to the server 12V rail. Another important advantage is the small fault containment zone (small “blast radius”) where a UPS failure will only impact a single server. With a central UPS, a failure could drop the load on 100 racks of servers or more. But there are some downsides to distributed UPS. The first is that batteries are stored with the servers. Batteries take up considerable space, can emit corrosive gasses, don’t operate well at high temperature, and require chargers and battery monitoring circuits. As much as I like aspects of the distributed UPS design, it’s hard to do cost effectively and, consequently, is very uncommon.


The Open Compute UPS design is a semi-distributed approach where each UPS is on the floor with the servers but, rather than having 1 UPS per server (distributed UPS) or 1 UPS per roughly 100 racks (central UPS with roughly 4,000 servers), they have 1 UPS per 6 racks (180 servers).
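The fault-containment tradeoff between the three topologies can be sketched numerically using the server counts above:

```python
# Rough blast-radius comparison for the three UPS topologies discussed above.
# Numbers come from the text: ~30 servers per rack, 6 racks per Open Compute
# UPS, and roughly 100 racks behind a central UPS.

servers_per_rack = 30

topologies = {
    "distributed (1 UPS per server)": 1,
    "semi-distributed (1 UPS per 6 racks)": 6 * servers_per_rack,
    "central (1 UPS per ~100 racks)": 100 * servers_per_rack,
}

for name, blast_radius in topologies.items():
    print(f"{name}: a UPS failure drops {blast_radius} server(s)")
```

The semi-distributed design lands in the middle: a UPS failure takes out 180 servers rather than 1 or several thousand.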




In this design the battery rack is the central rack, flanked by two triple racks of servers. Like the server racks, the UPS is fed 480VAC 3-phase directly. At the top of the battery rack, they have control circuitry, circuit breakers, and rectifiers to charge the battery banks.


What’s somewhat unusual is that the output stage of the UPS doesn’t include inverters to convert the direct current back to the alternating current required by a standard server PSU. Instead the UPS output is 48V direct current, delivered directly to the three racks on either side of the UPS. This has the upside of avoiding the final inverter stage, which increases efficiency. But there is a cost to avoiding the conversion back to AC. The most important downside is that they effectively need two server power supplies, where one accepts 277VAC and the other accepts 48VDC from the UPS. The second disadvantage is that 48V distribution is inefficient over longer distances due to conductor losses at high amperage.


The power distribution efficiency problem is partially mitigated by keeping the UPS close to the servers: the 6 racks it feeds are on either side of the UPS, so the distances are actually quite short. And, since the UPS is only used during the short period between the start of a power failure and the generators taking over, the efficiency of distribution is actually not that important a factor. The second issue remains: each server power supply is effectively two independent PSUs.
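The conductor-loss point is easy to quantify: for the same delivered power, resistive loss scales with the square of the current. A quick sketch with assumed cable resistance and rack power (illustrative numbers, not Open Compute figures):

```python
# I^2*R loss comparison: delivering the same power at 48 VDC vs. 277 VAC.
# Cable resistance and rack power below are illustrative assumptions.

def conductor_loss_watts(power_w, voltage_v, resistance_ohms):
    current = power_w / voltage_v          # amps drawn at this voltage
    return current ** 2 * resistance_ohms  # resistive loss in the conductor

rack_power = 10_000          # assume a 10 kW rack
cable_resistance = 0.01      # assume 10 milliohms of round-trip conductor

loss_48 = conductor_loss_watts(rack_power, 48, cable_resistance)
loss_277 = conductor_loss_watts(rack_power, 277, cable_resistance)

print(f"48 VDC loss:  {loss_48:.0f} W")   # ~434 W
print(f"277 VAC loss: {loss_277:.0f} W")  # ~13 W
```

At roughly a sixth of the voltage, the 48V feed draws almost six times the current and so suffers on the order of thirty times the conductor loss over the same cable, which is why short runs matter so much in this design.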


The server PSU looks fairly conventional in that it’s a single box. But included in that single box are two independent PSUs and some control circuitry. This has the downside of forcing the use of a custom, non-commodity power supply. Lower volume components tend to cost more. However, the power supply is a small part of the cost of a server, so this additional cost won’t have a substantially negative impact. And it’s a nice, reliable design with a small fault containment zone, which I really like.


The Open Compute UPS and power distribution system avoids one level of power conversion common in most data centers, delivers somewhat higher voltages (277VAC rather than 208VAC) close to the load, and has the advantage of a small fault zone.


James Hamilton





Wednesday, May 04, 2011 5:24:34 AM (Pacific Standard Time, UTC-08:00)
 Friday, April 29, 2011

Google cordially invites you to participate in a European Summit on sustainable Data Centres. This event will focus on energy-efficiency best practices that can be applied to multi-MW custom-designed facilities, office closets, and everything in between. Google and other industry leaders will present case studies that highlight easy, cost-effective practices to enhance the energy performance of Data Centres.

The summit will also include a dedicated session on cooling. Presenters will detail climate-specific implementations of free cooling as well as novel ways to utilise locally-available opportunities. We will also debate climate-independent PUE targets.

The agenda includes presentations and panel discussions featuring Amazon, DeepGreen, eBay, Google, IBM, Microsoft, Norman Disney & Young, PlusServer, Telecity Group, The Green Grid, UK's Chartered Institute for IT, UBS and others.

Attendance is free. However, space is limited and we therefore encourage you to register online at your earliest convenience. Your participation will be confirmed.

We look forward to seeing you and your colleagues in Zurich!

Friday, April 29, 2011 5:18:40 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, April 20, 2011

Last Thursday Facebook announced the Open Compute Project where they released pictures and specifications for their Prineville Oregon datacenter and the servers and infrastructure that will populate that facility. In my last blog, Open Compute Mechanical System Design I walked through the mechanical system in some detail. In this posting, we’ll have a closer look at the Facebook Freedom Server design.


Chassis Design:

The first thing you’ll notice when looking at the Facebook chassis design is that there are only 30 servers per rack. They are challenging one of the most strongly held beliefs in the industry: that density is the primary design goal and more density is good. I 100% agree with Facebook and have long argued that density is a false god. See my rant Why Blade Servers aren’t the Answer to all Questions for more on this one. Density isn’t a bad thing, but paying more to get denser designs that cost more to cool is usually a mistake. This is what I’ve referred to in the past as the Blade Server Tax.


When you look closer at the Facebook design, you’ll note that the servers are more than 1 Rack Unit (RU) high but less than 2 RU. They chose a non-standard 1.5RU server pitch. The argument is that 1RU server fans are incredibly inefficient. Going with 60mm fans (which fit in 1.5RU) dramatically increases their efficiency, but moving further up to 2RU isn’t notably better. So, on that observation, they went with 60mm fans and a 1.5RU server pitch.

I completely agree that optimizing for density is a mistake and that 1RU fans should be avoided at all costs so, generally, I like this design point. One improvement worth considering is to move the fans out of the server chassis entirely and go with very large fans on the back of the rack. This allows a small gain in fan efficiency by going with still larger fans and allows a denser server configuration without loss of efficiency or additional cost. Density without cost is a fine thing and, in this case, I suspect 40 to 80 servers per rack could be delivered without loss of efficiency or additional cost, so it would be worth considering.
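The rack arithmetic behind the 1.5RU pitch is simple. Assuming 45U of usable rack height (an assumption on my part, not a figure from the spec), the pitch choices work out to:

```python
# Servers per rack at different server pitches, assuming 45U of usable
# rack height (an assumption; the Open Compute triplet racks may differ).

RU_MM = 44.45  # one rack unit in millimetres

usable_ru = 45
for pitch_ru in (1.0, 1.5, 2.0):
    servers = int(usable_ru // pitch_ru)
    fan_space_mm = pitch_ru * RU_MM
    print(f"{pitch_ru} RU pitch: {servers} servers, ~{fan_space_mm:.0f} mm of height for fans")
```

At 1.5RU the pitch gives about 67mm of height, comfortably fitting a 60mm fan, and 45U divided by 1.5RU yields exactly the 30 servers per rack observed in the chassis design.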


The next thing you’ll notice when studying the chassis above is that there is no server case.  All the components are exposed for easy service and excellent air flow. And, upon more careful inspection, you’ll note that all components are snap in and can be serviced without tools. Highlights:

·         1.5 RU pitch

·         1.2 MM stamped pre-plated steel

·         Neat, integrated cable management

·         4 rear mounted 60mm fans

·         Tool-less design with snap plungers holding all components

·         100% front cable access



The Open Compute project supports two motherboard designs, where one uses Intel processors and the other uses AMD.

Intel Motherboard:

AMD Motherboard:


Note that these boards are both 12V only designs. 


Power Supply:

The power supply (PSU) is an unusual design in two dimensions: 1) it is a single output voltage 12V design and 2) it’s actually two independent power supplies in a single box. Single voltage supplies are getting more common but commodity server power supplies still usually deliver 12V, 5V, and 3.3V. Even though processors and memory require somewhere between 1 and 2 volts depending upon the technology, both typically are fed by the 12V power rail through a Voltage Regulator Down (VRD) or Voltage Regulator Module (VRM). The Open Compute approach is to deliver 12V only to the board and to produce all other required voltages via a Voltage Regulator Module on the motherboard. This simplifies the power supply design somewhat, and they avoid cabling by having the motherboard connect directly to the server PSU.


The Open Compute Power Supply has two power sources. The primary source is 277V alternating current (AC) and the backup power source is 48V direct current (DC). The output from both supplies is the same 12V DC power rail that is delivered to the motherboard.


Essentially this supply is two independent PSUs with a single output rail. The choice of 277VAC is unusual, with most high-scale data centers running on 208VAC. But 277 allows one power conversion stage to be avoided and is therefore more power efficient.
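To see why dropping a conversion stage helps, note that end-to-end efficiency is the product of the per-stage efficiencies. A sketch with assumed (not measured) stage efficiencies:

```python
# Cumulative efficiency is the product of the per-stage efficiencies.
# The stage efficiencies below are illustrative assumptions, not
# measured Open Compute or industry numbers.

from functools import reduce
from operator import mul

def chain_efficiency(stages):
    return reduce(mul, stages, 1.0)

# Conventional: mid-voltage -> 480V -> 208V -> server PSU
conventional = chain_efficiency([0.98, 0.97, 0.94])

# Open Compute: mid-voltage -> 480V, take 277V phase-to-neutral -> server PSU
open_compute = chain_efficiency([0.98, 0.95])

print(f"conventional: {conventional:.1%}")
print(f"open compute: {open_compute:.1%}")
```

Even with modest per-stage losses, removing one transformation step in the chain yields a few points of end-to-end efficiency, which is substantial at data center scale.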


Most data centers have mid-voltage transformers (typically in the 13.2kV range but it can vary widely by location). This voltage is stepped down to 480V three phase power in North America and 400V 3 phase in much of the rest of the world. The 480VAC 3p power is then stepped down to 208VAC for delivery to the servers.


The trick that Facebook is employing in their datacenter power distribution system is to avoid one power conversion by not doing the 480VAC to 208VAC conversion. Instead, they exploit the fact that each phase of 480V 3p power is 277VAC between the phase and neutral. This avoids a power transformation step, which improves overall efficiency. The negatives of this approach are 1) commodity power supplies can’t be used (277VAC is beyond the range of commodity PSUs) and 2) the load on each of the three phases needs to be balanced. Generally, this is a good design tradeoff where the increase in efficiency justifies the additional cost and complexity.


An alternative but very similar approach that I like even better is to step down mid-voltage to 400VAC 3p and then play the same phase-to-neutral trick described above. This technique still has the advantage of avoiding one layer of power transformation. What is different is that the resultant phase-to-neutral voltage delivered to the servers is 230VAC, which allows commodity power supplies to be used. The disadvantage of this design is that the mid-voltage to 400VAC 3p transformer is not in common use in North America. However, this is a common transformer in other parts of the world, so they are still fairly easily obtainable.
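The phase-to-neutral arithmetic behind both variants is just a division by the square root of three:

```python
# Phase-to-neutral voltage of a balanced three-phase system is the
# line-to-line voltage divided by sqrt(3).
import math

def phase_to_neutral(line_to_line_v):
    return line_to_line_v / math.sqrt(3)

print(f"480V 3p -> {phase_to_neutral(480):.0f}V phase-to-neutral")  # ~277V
print(f"400V 3p -> {phase_to_neutral(400):.0f}V phase-to-neutral")  # ~231V (nominal 230V)
```

That is where the 277VAC in the Facebook design and the 230VAC in the 400V variant both come from.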


Clearly, any design that avoids a power transformation stage is a substantial improvement over most current distribution systems. The ability to use commodity server power supplies unchanged makes the 400V 3p phase-to-neutral trick look slightly better than the 480VAC 3p approach, but all designs need to be considered in the larger context in which they operate. Since the Facebook power redundancy system requires the server PSU to accept both a primary alternating current input and a backup 48VDC input, special purpose-built supplies need to be used. Since a custom PSU is needed for other reasons, going with 277VAC as the primary voltage makes perfect sense.


Overall a very efficient and elegant design that I’ve enjoyed studying. Thanks to Amir Michael of the Facebook hardware design team for the detail and pictures.




James Hamilton





Wednesday, April 20, 2011 6:56:09 PM (Pacific Standard Time, UTC-08:00)
 Saturday, April 09, 2011

Last week Facebook announced the Open Compute Project (Perspectives, Facebook). I linked to the detailed specs in my general notes on Perspectives and said I would follow up with more detail on key components and design decisions I thought were particularly noteworthy.  In this post we’ll go through the mechanical design in detail.


As long time readers of this blog will know, PUE has many issues (PUE is still broken and I still use it) and is mercilessly gamed in marketing literature (PUE and tPUE). The Facebook published literature predicts that this center will deliver a PUE of 1.07. Ignoring the power requirements of the mechanical systems, just delivering power to the servers at a 7% loss is a considerable challenge. I’m slightly skeptical of this number as a fully loaded, annual PUE number but I can say with confidence that it is one of the nicest mechanical designs I have come across. Hats off to Jay Park and the rest of the Facebook DC design team.
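As a reminder of what the 1.07 claim means, PUE is simply total facility power divided by IT equipment power; the load numbers below are illustrative, not Facebook's:

```python
# PUE is total facility power divided by IT equipment power.
# A PUE of 1.07 means only 7% of the facility's power goes to everything
# other than the IT load: distribution losses plus all mechanical systems.
# The load numbers below are illustrative assumptions.

def pue(total_facility_kw, it_kw):
    return total_facility_kw / it_kw

it_load = 1000   # assume 1 MW of IT load
overhead = 70    # distribution losses + mechanical, to hit the claimed 1.07
print(f"PUE = {pue(it_load + overhead, it_load):.2f}")  # 1.07
```

Seen this way, the claim is that power distribution losses and the entire mechanical plant together consume only 70kW per MW of servers, which is why a fully loaded, annualized 1.07 is such an aggressive number.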


Perhaps the best way to understand the mechanical design is to first walk through the mechanical system diagram and then step through the actual deployment in pictures.

In the mechanical system diagram you will first note there is no process-based cooling (no air conditioning) and no chilled water loop. It’s 100% air cooled, with all IT equipment cooled by outside air, which is particularly effective with the favorable weather conditions experienced in central Oregon.


The outside air is pulled in from outside, mixed with a controlled volume of return air to avoid over-cooling in winter, filtered, evaporative cooled, and run through a fan wall before being delivered to the cold aisles below.
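The return-air mixing that prevents over-cooling follows from a simple sensible-heat balance; a sketch with illustrative temperatures (this ignores humidity effects):

```python
# Fraction of hot return air to mix with outside air to reach a target
# mixed-air temperature. Simple sensible-heat balance; ignores humidity.
# The temperatures below are illustrative assumptions.

def return_air_fraction(t_outside, t_return, t_target):
    # t_mixed = f * t_return + (1 - f) * t_outside  =>  solve for f
    return (t_target - t_outside) / (t_return - t_outside)

f = return_air_fraction(t_outside=2.0, t_return=35.0, t_target=18.0)
print(f"mix {f:.0%} return air with {1 - f:.0%} outside air")
```

On a cold day, roughly half the supply air can be recycled hot-aisle exhaust, which is exactly what the thermostatically controlled louvers described below are modulating.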


Air Intake

The cooling system takes up the entire second floor of the facility, with the IT equipment and office space on the first floor. In the picture below, you’ll see the building air intake on the left running the full width of the facility. This intake area is effectively “outside” so it includes water drains in the floor. In the picture to the right towards the top, you’ll see the air path to the remainder of the air handling system. To the right at the bottom, you’ll see the white sheet metal sections covering the passage where hot exhaust air is brought up from the hot aisle below to be mixed with outside air to prevent over-cooling on cold days.


Mixing Plenum & Filtration:

In the next picture, you can see the next part of the full building air handling system. On the left the louvers at the top are bringing in outside air from the outside air intake room.  On the left at the bottom, there are thermostatically controlled louvers allowing hot exhaust air to come up from the IT equipment hot aisle. The hot and outside air are mixed in this full room plenum before passing through the filtration wall visible on the right side of the picture.


The facility grounds are still being worked upon so the filtration system includes an extra set of disposable paper filters to increase the life of the more aggressive filtration media not visible behind.


Misting Evaporative Cooling

In the next picture below you’ll see the temperature and humidity controlled high-pressure water misting system, again running the full width of the facility. Most evaporative cooling systems used in data center applications use wet media. This system uses high pressure water with stainless steel nozzles to produce an atomized mist. These designs produce excellent cooling (high delta-T) but are prone to calcification and maintenance issues without aggressive filtration, which brings some expense. But they are very effective.
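Evaporative cooling performance is often modelled as an approach to the wet-bulb temperature; a sketch with assumed conditions and cooler effectiveness (all numbers are illustrative, not Prineville data):

```python
# Evaporative cooling can at best bring air down to its wet-bulb
# temperature. Supply temperature is commonly modelled with a cooler
# "effectiveness" factor. Temperatures and effectiveness are assumptions.

def supply_temp(dry_bulb, wet_bulb, effectiveness):
    return dry_bulb - effectiveness * (dry_bulb - wet_bulb)

t = supply_temp(dry_bulb=32.0, wet_bulb=18.0, effectiveness=0.9)
print(f"supply air: {t:.1f} C")  # 19.4 C, a 12.6 C delta-T
```

The dry climate of central Oregon keeps the wet-bulb temperature low, which is what makes the high delta-T these misting systems deliver possible.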


Just beyond the misters, you’ll see what looks like filtration media. This media is there to ensure no airborne water makes it out of the cooling room.


Exhaust System

In the final picture below, you’ll see we have now reached the other side of the building where the exhaust fans are found. We bring air in on one side, filter it, cool it, and pump it, and then, just before the exhaust fans visible in the picture, huge openings in the floor allow treated, cooled air to be brought down to the IT equipment cold aisle below.


The exhaust fans visible in the picture control pressure by exhausting air not needed back outside.


The Facebook facility has considerable similarity to the EcoCooling mechanical design I posted last Monday (Example of Efficient Mechanical System Design). Both approaches use the entire building as the air ducting system and both designs use evaporative cooling. The notable differences are: 1) the EcoCooling design is based upon wet media whereas Facebook is using a high pressure water misting system, and 2) the EcoCooling design runs the mechanical systems beside the compute floor whereas the Facebook design uses the second floor above the IT rooms for all mechanical systems.


The most interesting aspects of the Facebook mechanical design: 1) full building ducting with huge plenum areas, 2) no process-based cooling, 3) mist-based evaporative cooling, 4) large, efficient impellers with variable frequency drives, and 5) full wall, low-resistance filtration.


I’ve been saying for years that mechanical systems are where the largest opportunities for improvement lie and are the area where most innovation is most required. The Facebook Prineville Facility is most of the way there and one of the most efficient mechanical system designs I’ve come across. Very elegant and very efficient.




James Hamilton





Saturday, April 09, 2011 9:43:40 AM (Pacific Standard Time, UTC-08:00)
 Thursday, April 07, 2011

The pace of innovation in data center design has been rapidly accelerating over the last 5 years, driven by the mega-service operators. In fact, I believe we have seen more infrastructure innovation in the last 5 years than we did in the previous 15. Most very large service operators have teams of experts focused on server design, data center power distribution and redundancy, mechanical designs, real estate acquisition, and network hardware and protocols. But much of this advanced work is unpublished and practiced at a scale that is hard to duplicate in a research setting.


At low scale, with only a data center or two, it would be crazy to have all these full time engineers and specialists focused on infrastructure improvements and expansion. But, at high scale with 10s of data centers, it would be crazy not to invest deeply in advancing the state of the art.


Looking specifically at cloud services, the difference between an unsuccessful cloud service and a profitable, self-sustaining business is the cost of the infrastructure. With continued innovation driving down infrastructure costs, there is investment capital available, services can be added and improved, and value can be passed on to customers through price reductions. Amazon Web Services, for example, has had 11 price reductions in 4 years. I don’t recall that happening in my first 20 years working on enterprise software. It really is an exciting time in our industry.


Facebook is a big business operating at high scale and they also have elected to invest in advanced infrastructure designs. Jonathan Heiliger and the Facebook infrastructure team have hired an excellent group of engineers over the past couple of years and are now bringing these designs to life in their new Prineville Oregon facility. I had the opportunity to visit this datacenter 6 weeks back just before it started taking production load. I had an excellent visit, got to catch up with some old friends, meet some new ones, and tour an impressive facility. I saw an unusually large number of elegant designs ranging from one of the cleanest mechanical systems I’ve come across, three phase 480VAC directly to the rack,  a low voltage direct current distributed uninterruptable power supply system, all the way through to custom server designs. But, what made this trip really unusual is that I’m actually able to talk about what I saw.


In fact, more than allowing me to talk about it, Facebook has decided to release most of the technical details surrounding these designs publicly. In the past, I’ve seen some super interesting but top secret facilities and I’ve seen some public but not particularly advanced data centers. To my knowledge, this is the first time an industry leading design has been documented in detail and released publicly.


The set of specifications Facebook is releasing is worth reading, so I’m posting links to all below. I encourage you to go through these in as much detail as you choose. In addition, I’ll also post summary notes over the next couple of days explaining the aspects of the design I found most interesting or commenting upon the pros and cons of some of the approaches employed.


The specifications:

·         Data Center Design

·         Intel Motherboard

·         AMD Motherboard

·         Battery Cabinet (Distributed UPS)

·         Server PSU

·         Server Chassis and Triplet Hardware


My commendations to the specification authors Harry Li, Pierluigi Sarti, Steve Furuta, Jay Park and to the rest of the Facebook infrastructure team for releasing this work publicly and for doing so in sufficient detail that others can build upon it. Well done.





·         Open Compute Web Site:

·         Live Blog of the Announcement:


James Hamilton






Thursday, April 07, 2011 9:30:02 AM (Pacific Standard Time, UTC-08:00)
 Monday, April 04, 2011

A bit more than a year back, I published Computer Room Evaporative Cooling where I showed an evaporative cooling design from EcoCooling. Periodically, Alan Beresford sends me designs he’s working on. This morning he sent me a design they are working on for a 7MW data center in Ireland.


I like the design for a couple of reasons: 1) it’s a simple and efficient design, and 2) it’s a nice example of a few important industry trends. The trends exemplified by this design are: 1) air-side economization, 2) evaporative cooling, 3) hot-aisle containment, and 4) very large plenums with controlled hot-air recycling. The diagrams follow and, for the most part, speak for themselves.



I expect mechanical designs with these broad characteristics will be showing up increasingly frequently over the next year or so, primarily because it is a cost effective and environmentally sound approach.




James Hamilton





Monday, April 04, 2011 2:53:33 PM (Pacific Standard Time, UTC-08:00)
 Saturday, March 26, 2011

Brad Porter is Director and Senior Principal Engineer at Amazon. We work in different parts of the company but I have known him for years and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the LISA paper On Designing and Deploying Internet-Scale Services.




Prioritizing the Principles in "On Designing and Deploying Internet-Scale Services"

By Brad Porter

James published what I consider to be the single best paper to come out of the highly-available systems world in many years. He gave simple practical advice for delivering on the promise of high availability. James presented “On Designing and Deploying Internet-Scale Services” at LISA 2007.

A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.

1. Keep it simple

Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience complexity is the number one cause of human error.

Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I'm going to add three new sub-principles to this.

Have Well Defined Architectural Roles and Responsibilities: Robust systems are often described as having "good bones." The structural skeleton upon which the system has evolved and grown is solid. Good architecture starts from having a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a white-board.

Minimize Variance: Variance arises most often when engineering or operations teams use partitioning typically through branching or forking as a way to handle different use cases or requirements sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.

Clean-Up Cruft: Cruft can be defined as those things that clearly should be fixed, but no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low priority "bugs" that no one has fixed. Cleaning up cruft is a constant task, but it is necessary to minimize complexity.

2. Expect failures

At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To "expect failures" is to recognize that "off" is always a valid state. A host or component may switch to the "off" state at any time without warning.

If you're willing to turn a component off at any time, you're immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.
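A minimal sketch of what treating "off" as a first-class state might look like in code; all names here are hypothetical illustrations, not from the paper:

```python
# Sketch of "off is always a valid state": the component exposes a health
# flag that a load balancer would poll; flipping it off drains traffic so
# the process can be stopped at any time. Names are hypothetical.

import threading

class Component:
    def __init__(self):
        self._accepting = True
        self._in_flight = 0
        self._lock = threading.Lock()

    def healthy(self):
        # A load balancer stops routing here as soon as this returns False.
        return self._accepting

    def handle(self, request):
        with self._lock:
            if not self._accepting:
                raise RuntimeError("component is off")
            self._in_flight += 1
        try:
            return f"handled {request}"
        finally:
            with self._lock:
                self._in_flight -= 1

    def turn_off(self):
        # Flip to "off"; once in-flight work drains, the process can be killed.
        self._accepting = False

c = Component()
print(c.handle("req-1"))  # handled req-1
c.turn_off()
print(c.healthy())        # False
```

The point is that nothing in the component's contract assumes it stays on: refusing new work and failing the health check are normal, expected transitions rather than error paths.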

3. Support version roll-back

Roll-back is similarly liberating. Many system problems are introduced on change-boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decreases dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.

4. Maintain forward-and-backward compatibility

Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn't possible as customers may be unable or unwilling to upgrade at the same time.

If you have forward-and-backwards compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. This also allows a subset of machines in the system to be upgraded and receive real production traffic as a last phase of the test cycle simultaneously with older versions of the component.
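One common way to get forward-and-backward compatibility is for readers to ignore unknown fields and default missing ones. A sketch with hypothetical field names, using JSON messages:

```python
# Forward-and-backward compatible message handling, sketched with JSON:
# readers ignore fields they don't know (forward compatibility) and
# default fields that are missing (backward compatibility).
# Field names here are hypothetical.

import json

KNOWN_FIELDS = {"user_id": None, "action": "view", "region": "us-east"}

def parse_event(raw):
    data = json.loads(raw)
    # Keep only known fields, defaulting any that are absent.
    return {k: data.get(k, default) for k, default in KNOWN_FIELDS.items()}

old_msg = '{"user_id": 7}'                                     # from an old writer
new_msg = '{"user_id": 8, "action": "buy", "newer_field": 1}'  # from a newer writer

print(parse_event(old_msg))  # defaults fill in the missing fields
print(parse_event(new_msg))  # unknown "newer_field" is ignored
```

With this discipline, old and new versions of a component can read each other's messages, which is what makes staged roll-out and roll-back safe.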

5. Give enough information to diagnose

Once you have the big ticket bugs out of the system, the persistent bugs will only happen one in a million times or even less frequently. These problems are almost impossible to reproduce cost effectively. With sufficient production data, you can perform forensic diagnosis of the issue. Without it, you're blind.

Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug and it gives you the tools to answer exactly what happened quickly rather than guessing based on the results of a multi-day or multi-week simulation.

I rank these five as the most important because they liberate you to continue to evolve the system as time and resources permit to address the other dimensions the paper describes. If you fail to do the first five, you'll be endlessly fighting operational overhead costs as you attempt to make forward progress.

  • If you haven't kept it simple, then you'll spend much of your time dealing with system dependencies, arguing over roles & responsibilities, managing variants, or sleuthing through data/config or code that is difficult to follow.
  • If you haven't expected failures, then you'll be reacting when the system does fail. You may also be dealing with complicated change-management processes designed to keep the system up and running while you're attempting to change it.
  • If you haven't implemented roll-back, then you'll live in fear of your next upgrade. After one or two failures, you will hesitate to make any further system change, no matter how beneficial.
  • Without forward-and-backward compatibility, you'll spend much of your time trying to force dependent customers through migrations.
  • Without enough information to diagnose, you'll spend substantial amounts of time debugging or attempting to reproduce difficult-to-find bugs.

I'll end with another encouragement to read the paper if you haven't already: "On Designing and Deploying Internet-Scale Services"


Saturday, March 26, 2011 8:21:16 AM (Pacific Standard Time, UTC-08:00)
 Sunday, March 20, 2011

Back in early 2008, I noticed an interesting phenomenon: some workloads run more cost effectively on low-cost, low-power processors. The key observation behind this phenomenon is that CPU bandwidth consumption is going up faster than memory bandwidth. Essentially, it’s a two part observation: 1) many workloads are memory bandwidth bound and will not run faster with a faster processor unless the faster processor comes with a much improved memory subsystem and 2) the number of memory bound workloads is going up over time.
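The memory-bandwidth-bound observation can be made concrete with a simple roofline-style check; the hardware numbers below are illustrative assumptions, not measurements:

```python
# Roofline-style check for whether a workload is memory-bandwidth bound:
# if the workload's arithmetic intensity (FLOPs per byte moved) is below
# the machine's FLOPs-to-bandwidth ratio, a faster CPU alone won't help.
# The hardware numbers below are illustrative assumptions.

def memory_bound(flops_per_byte, peak_gflops, mem_bw_gbs):
    machine_balance = peak_gflops / mem_bw_gbs  # FLOPs the CPU can do per byte delivered
    return flops_per_byte < machine_balance

peak_gflops = 100.0   # assumed peak compute
mem_bw_gbs = 25.0     # assumed memory bandwidth

# A streaming scan does on the order of 0.25 FLOPs per byte;
# dense matrix multiply does far more.
print(memory_bound(0.25, peak_gflops, mem_bw_gbs))   # True: memory bound
print(memory_bound(16.0, peak_gflops, mem_bw_gbs))   # False: compute bound
```

A memory-bound workload on this sketch machine leaves most of the CPU idle waiting on memory, which is exactly the situation where a slower, cheaper, lower-power processor can deliver the same throughput.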


One solution is to improve both the memory bandwidth and the processor speed, and this really does work, but it is expensive. Fast processors are expensive and faster memory subsystems bring more cost as well. The cost of the improved performance goes up non-linearly: N times faster costs far more than N times as much. There will always be workloads where these additional investments make sense. The canonical scale-up workload is the relational database (RDBMS). These workloads tend to be tightly coupled and don't scale out well. The non-linear cost of more expensive, faster gear is both affordable and justified by there really not being much of an alternative. If you can’t easily scale out and the workload is important, scale it up.


The poor scaling of RDBMS workloads is driving three approaches in our industry: 1) considerable spending on SANs, SSDs, flash memory, and scale-up servers, 2) the NoSQL movement rebelling against the high-function, high-cost, and poor scaling characteristics of RDBMS, and 3) deep investments in massively parallel processing SQL solutions (Teradata, Greenplum, Vertica, ParAccel, Aster Data, etc.).


All three approaches have utility and I’m interested in all three. However, there are workloads that really are highly parallel. For these workloads, the non-linear cost escalation of scale-up servers is less cost effective than more commodity servers. The banner workload in this class is the simple web server, where there are potentially thousands of parallel requests and it’s more cost effective to spread these workloads over a fleet of commodity servers.


The entire industry has pretty much moved to deploying these highly parallelizable workloads over fleets of commodity servers. There will always be workloads that are tightly coupled and hard to run effectively over large numbers of commodity servers but, for those that can be, the gains are great: 1) far less expensive, 2) much smaller unit of failure, 3) cheap redundancy, 4) small, inexpensive unit of growth (avoiding forklift upgrades), and 5) no hard limit at the scale-up ceiling.


This tells us two things: 1) where we can run a workload over a large number of commodity servers we should, and 2) the number of workloads where parallel solutions have been found continues to increase. Even workloads like the social networking hairball problem now have economic parallel solutions. Even some RDBMS workloads can now be run cost effectively on scale-out clusters. The question I started thinking through back in 2008 is how far can we take this? What if we used client processors or even embedded device processors? How far down the low-cost, low-power spectrum makes sense for highly parallelizable workloads? Clearly the volume economics of client and device processors make them appealing from a price perspective, and the multi-decade focus on power efficiency in the device world offers impressive power efficiencies for some workloads.


For more on the low-cost, low-power trend see:

·         Very low-Cost, Low-Power Servers

·         Microslice Servers

·         The Case for Low-Cost, Low-Power Servers

Or the full paper on the approach: Cooperative Expendable, Microslice Servers


I’m particularly interested in the possible application of ARM processors to server workloads:

·         Linux/Apache on ARM processors

·         ARM Cortex-A9 SMP Design Announced


Intel is competing with ARM on some device workloads and offers the Atom, which is not as power efficient as ARM but has the upside of running the Intel Architecture instruction set. ARM licensees also recognize the trend I've outlined above – some server workloads will run very cost effectively on low-cost, low-power processors – and they want to compete in the server business. One excellent example is SeaMicro, which is taking Intel Atom processors and competing for the server business: SeaMicro Releases Innovative Intel Atom Server.


Competition is a wonderful thing and drives much of the innovation the fruits of which we enjoy today. We have three interesting points of competition here: 1) Intel is competing for the device market with Atom, 2) ARM licensees are competing with Intel Xeon in the server market, and 3) Intel Atom is being used to compete in the server market but not with solid Intel backing.


Using Atom to compete with server processors is, on one hand, good for Intel in that it gives them a means of competing with ARM at the low end of the server market but, on the other hand, the Atom is much cheaper than Xeon, so Intel will lose money on every workload it wins where that workload used to be Xeon hosted. This risk is limited by Intel not giving Atom ECC memory (You Really do Need ECC Memory). Hobbling Atom is a dangerous tactic in that it protects Xeon but, at the same time, may allow ARM to gain ground in the server processor market. Intel could have ECC support on Atom in months – there is nothing technically hard about it. It's the business implications that are more complex.


The first step in the right direction came earlier this week when Intel announced an ECC-capable Atom for 2012:

·         Intel Plans on Bringing Atom to Servers in 2012, 20W SNB for Xeons in 2011

·         Intel Flexes Muscles with New Processor in the Face of ARM Threat


This is potentially very good news for the server market, but only potentially. Waiting until 2012 suggests Intel wants to give the low-power Xeons time in the market to see how they do before setting the price on the ECC-equipped Atom. If the ECC-supporting version of Atom is priced like a server part rather than a low-cost, high-volume device part, then nothing changes. If the Atom with ECC comes in priced like a device processor, then it's truly interesting. This announcement says that Intel is interested in the low-cost, low-power server market but still plans to delay entry into the lowest end of that market for another year. Still, this is progress and I'm glad to see it.


Thanks to Matt Corddry of Amazon for sending the links above my way.




James Hamilton





Sunday, March 20, 2011 1:19:09 PM (Pacific Standard Time, UTC-08:00)
 Tuesday, March 15, 2011

Two of the highest-leverage datacenter efficiency techniques currently sweeping the industry are: 1) operating at higher ambient temperatures, and 2) air-side economization with evaporative cooling.


The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) currently recommends that servers not be operated at inlet temperatures beyond 81F. It's super common to hear that every 10C increase in temperature leads to 2x the failure rate – some statements get repeated so frequently they become "true" and no longer get questioned. See Exploring the Limits of Data Center Temperature for my argument that this rule of thumb doesn't apply over the full range of operating temperatures.


Another one of those "you can't do that" statements is around air-side economization, also referred to as Outside Air (OA) cooling. Stated simply, air-side economization is essentially opening the window. Rather than taking 110F exhaust air, cooling it down, and recirculating it back to cool the servers, dump the exhaust and take in outside air to cool the servers. If the outside air is cooler than the server exhaust, and it almost always will be, then air-side economization is a win.
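The core decision can be sketched as a tiny control function. This is a simplified illustration only: real economizer controllers also weigh humidity, filtration, and hysteresis, all of which this sketch ignores.

```python
def use_outside_air(outside_f: float, exhaust_f: float,
                    max_inlet_f: float = 81.0) -> bool:
    """Simplified air-side economizer decision (illustrative only).

    Open the dampers whenever outside air is cooler than the server
    exhaust and within the facility's target inlet temperature.
    Humidity, filtration, and hysteresis are deliberately ignored.
    """
    return outside_f < exhaust_f and outside_f <= max_inlet_f

# A 70F day vs. 110F exhaust: outside air wins.
assert use_outside_air(70.0, 110.0)
# A 95F day is still cooler than the exhaust, but only usable
# if the facility tolerates 95F inlet temperatures.
assert not use_outside_air(95.0, 110.0)
assert use_outside_air(95.0, 110.0, max_inlet_f=95.0)
```

The asymmetry in the last two lines is the whole economics of the technique: the higher the inlet temperature the facility is designed to tolerate, the more hours of the year the dampers stay open.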


The most frequently referenced document explaining why you shouldn't do this is Particulate and Gaseous Contamination Guidelines for Data Centers, again published by ASHRAE. Even the document title sounds scary. Do you really want your servers operating in an environment of gaseous contamination? But, upon further reflection, is it really the case that servers need better air quality than the people that use them? Really?


From the ASHRAE document:

The recent increase in the rate of hardware failures in data centers high in sulfur-bearing gases, highlighted by the number of recent publications on the subject, led to the need for this white paper that recommends that in addition to temperature-humidity control, dust and gaseous contamination should also be monitored and controlled. These additional environmental measures are especially important for data centers located near industries and/or other sources that pollute the environment.


Effects of airborne contaminations on data center equipment can be broken into three main categories: chemical effects, mechanical effects, and electrical effects. Two common chemical failure modes are copper creep corrosion on circuit boards and the corrosion of silver metallization in miniature surface-mounted components.


Mechanical effects include heat sink fouling, optical signal interference, increased friction, etc. Electrical effects include changes in circuit impedance, arcing, etc. It should be noted that the reduction of circuit board feature sizes and the miniaturization of components, necessary to improve hardware performance, also make the hardware more prone to attack by contamination in the data center environment, and manufacturers must continually struggle to maintain the reliability of their ever shrinking hardware.


It's hard to read this document and not be concerned about the use of air-side economization. But, on the other hand, most leading operators are using it and experiencing no measurable deleterious effects. Let's go get some more data.


Digging deeper, the Data Center Efficiency Summit had a session on exactly this topic titled Particulate and Corrosive Gas Measurements of Data Center Airside Economization: Data From the Field – Customer Presented Case Studies and Analysis. The title is a bit of a tongue twister, but the content is useful. Selecting from the slides:


·         From Jeff Stein of Taylor Engineering:

o   Anecdotal evidence of failures in non-economizer data centers in extreme environments in India or China or industrial facilities

o   Published data on corrosion in industrial environments

o   No evidence of failures in US data centers or any connection to economizers

o   Recommendations that gaseous contamination should be monitored and that gas phase filtration is necessary for US data centers are not supported

·         From Arman Shehabi of the UC Berkeley Department of Civil and Environmental Engineering:

o    Particle concerns should not dissuade economizer use

§  More particles during economizer use with MERV 7 filters than during non-economizer periods, but still below many IT guidelines

§  I/O ratios with MERV 14 filters and economizers were near (and often below!) levels using MERV 7 filters w/o economizers

o   Energy savings from economizer use greatly outweighed the fan energy increase from improved filtration

§  MERV 14 filters increased fan power by about 10%, but the absolute increase (6 kW) was much smaller than the ~100 kW of chiller power savings during economizer use (in August!)

§  The fan power increase is constant throughout the year, while chiller savings from economizer use should increase during cooler periods


If you are interested in much more detail that comes to the same conclusions that Air-Side Economization is a good technique, see the excellent paper Should Data Center Owners be Afraid of Air-Side Economization Use? – A review of ASHRAE TC 9.9 White Paper titled Gaseous and Particulate Contamination Guidelines for Data Centers.


I urge you to read the full LBNL paper but I excerpt from the conclusions:

The TC 9.9 white paper brings up what may be an important issue for IT equipment in harsh environments but the references do not shed light on IT equipment failures and their relationship to gaseous corrosion.  While the equipment manufacturers are reporting an uptick in failures, they are not able to provide information on the types of failures, the rates of failures, or whether the equipment failures are in new equipment or equipment that may be pre-RoHS. Data center hardware failures are not documented in any of the references in the white paper. The only evidence for increased failures of electronic equipment in data centers is anecdotal and appears to be limited to aggressive environments such as in India, China, or severe industrial facilities. Failures that have been anecdotally presented occurred in data centers that did not use air economizers. The white paper recommendation that gaseous contamination should be monitored and that gas phase filtration is necessary for data centers with high contamination levels is not supported.


We are concerned that data center owners will choose to eliminate air economizers (or not operate them if installed) based upon the ASHRAE white paper since there are implications that contamination could be worse if air economizers are used.  This does not appear to be the case in practice, or from the information presented by the ASHRAE white paper authors. 


I've never been particularly good at accepting "you can't do that" and I've been frequently rewarded for challenging widely held beliefs. A good many of these hard-and-fast rules end up being somewhere between useful guidelines that don't apply in all conditions and mere opinions. There is a large and expanding body of data supporting the use of air-side economization.




James Hamilton





Tuesday, March 15, 2011 10:35:57 AM (Pacific Standard Time, UTC-08:00)
 Saturday, March 05, 2011

Chris Page, Director of Climate & Energy Strategy at Yahoo! spoke at the 2010 Data Center Efficiency Summit and presented Yahoo! Compute Coop Design.

 The primary attributes of the Yahoo! design are: 1) 100% free air cooling (no chillers), 2) slab concrete floor, 3) use of wind power to augment air handling units, and 4) pre-engineered building for construction speed.


Chris reports the idea of orienting the building so that the dominant wind presses on the facing external wall, using this higher pressure to assist the air handling units, was taken from looking at farm buildings in the Buffalo, New York area. An example given was the use of natural cooling in chicken coops.

Chiller-free data centers are becoming more common (Chillerless Data Center at 95F) but the design approach is still far from commonplace. The Yahoo! team reports they run the facility at 75F and use evaporative cooling to attempt to hold server approach temperatures below 80F.  A common approach, and the one used by Microsoft in the chillerless datacenter at 95F example, is to install low-efficiency chillers for those rare days when they are needed, on the logic that low efficiency is cheap and really fine for very rare use. The Yahoo! approach is to avoid the capital cost and power consumption of chillers entirely by allowing the cold aisle temperatures to rise to 85F to 90F when they are unable to hold the temperature lower. They calculate they will only do this 34 hours a year, which is less than 0.4% of the year.


Reported results:

·         PUE at 1.08 with evaporative cooling and believe they could do better in colder climates

·         Saves ~36 million gallons of water / yr compared to the previous water-cooled chiller plant design without airside economization

·         Saves ~8 million gallons of sewer discharge / yr (zero discharge design)

·         Lower construction cost (not quantified)

·         6 months from dirt to operating facility


Chris' slides: (the middle deck of the 3-deck slide set)


Thanks to Vijay Rao for sending this my way.




James Hamilton





Saturday, March 05, 2011 1:58:04 PM (Pacific Standard Time, UTC-08:00)
 Sunday, February 27, 2011

Datacenter temperatures have been ramping up rapidly over the last 5 years. In fact, leading operators have been pushing temperatures up so quickly that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommendations have become a trailing indicator of what is being done rather than current guidance. ASHRAE responded in January of 2009 by raising the recommended limit from 77F to 80.6F (HVAC Group Says Datacenters Can be Warmer). This was a good move but many of us felt it was late and not nearly a big enough increment.  Earlier this month, ASHRAE announced they are again planning to take action and raise the recommended limit further but haven't yet announced by how much (ASHRAE: Data Centers Can be Even Warmer).


Many datacenters are operating reliably well in excess of even the newest ASHRAE recommended temperature of 81F. For example, back in 2009 Microsoft announced they were operating at least one facility without chillers at 95F server inlet temperatures.


As a measure of datacenter efficiency, we often use Power Usage Effectiveness (PUE). That's the ratio of the power that arrives at the facility divided by the power actually delivered to the IT equipment (servers, storage, and networking). Clearly PUE says nothing about the efficiency of the servers themselves, which is even more important but, just focusing on facility efficiency, we see that mechanical systems are the largest single power efficiency overhead in the datacenter.
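As a quick sketch of the metric (the 1.08 figure is the Yahoo! Compute Coop result reported above; the other numbers are made up for illustration):

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total power arriving at the facility
    divided by the power delivered to the IT equipment (servers,
    storage, and networking). Lower is better; 1.0 is the ideal."""
    return total_facility_kw / it_load_kw

# A facility drawing 1,500 kW to deliver 1,000 kW to the IT load:
assert pue(1500.0, 1000.0) == 1.5
# At Yahoo!'s reported 1.08, only 8% of the power is overhead:
assert abs(pue(1080.0, 1000.0) - 1.08) < 1e-9
```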


There are many innovative solutions addressing the power lost to mechanical systems. Many of these innovations work well and others have great potential, but none is as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperature, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.


The downsides of higher temperatures are: 1) higher semiconductor leakage losses, 2) higher server fan speeds, which increase the losses to air moving, and 3) higher server mortality rates. I've measured the first and, although these losses are inarguably present, they have a very small impact at even quite high server inlet temperatures.  The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won't run faster. This is simply a server design decision, and good mechanical designs work well at higher server temperatures without increased power consumption. It's the third issue that remains the scary one: increased server mortality rates.


The net of these factors is that fear of server mortality is the prime factor slowing an even more rapid increase in datacenter temperatures. An often-quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK-217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I'm sure the work is excellent, but it is a very old study, wasn't focused on a large datacenter environment, and the rule of thumb that has emerged from it is a linear model of failure to heat.


This linear failure model is helpful guidance but is clearly not correct across the entire possible temperature range. We know that at low temperatures, non-heat-related failure modes dominate. The failure rate of gear at 60F (16C) is not ½ the rate of gear at 79F (26C). As the temperature ramps up, we expect the failure rate to increase non-linearly. Really what we want to find is the knee of the curve: when does the cost of the increased failure rate start to approach the power savings achieved at higher inlet temperatures?  Knowing that the rule of thumb that a 10C increase doubles the failure rate is incorrect doesn't really help us. What is correct?


What is correct is a difficult data point to get for two reasons: 1) nobody wants to bear the cost of the experiment – the failure case could run several hundred million dollars, and 2) those that have explored higher temperatures and have the data aren’t usually interested in sharing these results since the data has substantial commercial value.


I was happy to see a recent Datacenter Knowledge article titled What's Next? Hotter Servers with 'Gas Pedals' (ignore the title – Rich just wants to catch your attention). This article includes several interesting tidbits on high-temperature operation. Most notable was the quote from Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, who says:


Take the data center in the desert. Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, shared an anecdote about a data center user in the Middle East that wanted to test server failure rates if it operated its data center at 45 degrees Celsius – that’s 115 degrees Fahrenheit.


Testing projected an annual equipment failure rate of 2.45 percent at 25 degrees C (77 degrees F), and then an increase of 0.36 percent for every additional degree. Thus, 45C would likely result in annual failure rate of 11.45 percent. “Even if they replaced 11 percent of their servers each year, they would save so much on air conditioning that they decided to go ahead with the project,” said Bapat. “They’ll go up to 45C using full air economization in the Middle East.”


This study caught my interest for a variety of reasons. First and most importantly, they studied failure rates at 45C and decided that, although they were high, it still made sense for them to operate at these temperatures. It is reported they are happy to pay the increased failure rate in order to save the cost of the mechanical gear. The second interesting data point from this work is that they found a far greater failure rate increase for the increased temperature than predicted by MIL-HDBK-217F. Consistent with 217F, they also report a linear relationship between temperature and failure rate. I almost guarantee that the failure rate increase between 40C and 45C was much higher than the difference between 25C and 30C. I don't buy the linear relationship between failure rate and temperature based upon what we know of failure rates at lower temperatures. Many datacenters have raised temperatures between 20C (68F) and 25C (77F) and found no measurable increase in server failure rate. Some have raised their temperatures between 25C (77F) and 30C (86F) without finding the predicted 1.8% increase in failures, but admittedly the data remains sparse.


The linear relationship is absolutely not there across a wide temperature range. But I still find the data super interesting in two respects: 1) they did see roughly a 9% increase in failure rate at 45C over the control case at 20C, and 2) even with the high failure rate, they made an economic decision to run at that temperature. The article doesn't say if the failure rate is a lifetime failure rate or an Annual Failure Rate (AFR). Assuming it is an AFR, many workloads and customers could not live with an 11.45% AFR. Nonetheless, the data is interesting and it's good to see the first public report of an operator doing the study and reaching a conclusion that works for their workloads and customers.
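The quoted numbers pencil out as a simple linear model. Note they only reconcile if the 0.36%-per-degree increment spans 25 degrees, i.e. is anchored at the 20C control case rather than the 25C figure in the quote; the sketch below makes that assumption explicit:

```python
def linear_afr(temp_c: float, base_afr_pct: float = 2.45,
               base_temp_c: float = 20.0, pct_per_c: float = 0.36) -> float:
    """Linear annual failure rate (percent) implied by the quoted study.

    Assumption: the 2.45% baseline is anchored at the 20C control case,
    since only a 25-degree span reproduces the quoted 11.45% at 45C
    (2.45 + 0.36 * 25 = 11.45)."""
    return base_afr_pct + pct_per_c * (temp_c - base_temp_c)

assert abs(linear_afr(45.0) - 11.45) < 1e-9   # the quoted 45C figure
assert abs(linear_afr(20.0) - 2.45) < 1e-9    # the 20C control case
```

Which is exactly why the model deserves skepticism: a straight line anchored at low temperatures says nothing about where the knee of the curve actually sits at 40C and beyond.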


The net of the whole discussion above: the current linear rules of thumb are wrong and don't usefully predict failure rates in the temperature ranges we care about, there is very little public data out there, industry groups like ASHRAE are behind, and it's quite possible that cooling specialists may not be the first to recommend we stop cooling datacenters aggressively :-). Generally, it's a sparse data environment out there, but the potential economic and environmental gains are exciting so progress will continue to be made. I'm looking forward to efficient and reliable operation at high temperature becoming a critical data point in server fleet purchasing decisions. Let's keep pushing.




James Hamilton





Sunday, February 27, 2011 10:25:43 AM (Pacific Standard Time, UTC-08:00)
 Sunday, February 20, 2011

Dileep Bhandarkar presented the keynote at the Server Design Summit last December.  I can never find the time to attend trade shows so I often end up reading slides instead. This one had lots of interesting tidbits so I'm posting a pointer to the talk and my rough notes here.


Dileep’s talk: Watt Matters in Energy Efficiency


My notes:

·         Microsoft Datacenter Capacities:

o   Quincy WA: 550k sq ft, 27MW

o   San Antonio, TX: 477k sq ft, 27MW

o   Chicago, IL: 707k sq ft, 60MW

§  Containers on bottom floor with “medium reliability”  (no generators) and standard rooms on top floor with full power redundancy

o   Dublin, Ireland: 570k sq ft, 27MW

·         Speeds and feeds from Microsoft Consumer Cloud Services:

o   Windows Live: 500M IDs

o   Live Hotmail: 355M Active Accounts

o   Live Messenger: 303M users

o   Bing: 4B Queries/month

o   Xbox Live: 25M users    

o   adCenter: 14B Ads served/month

o   Exchange Hosted Services: 2 to 4B emails/day

·         Datacenter Construction Costs

o   Land: <2%

o   Shell: 5 to 9%

o   Architectural: 4 to 7%

o   Mechanical & Electrical: 70 to 85%

·         Summarizing the above list, we get 80% of the costs scaling with power consumption and 10 to 20% scaling with floor space. Reflect on that number and you’ll understand why I think the industry is nuts to be focusing on density. See Why Blade Servers Aren’t the Answer to All Questions for more detail on this point – I think it’s a particularly important one.

·         Reports that datacenter build costs are $10M to $20M per MW and server TCO is the biggest single category. I would use the low end of this spectrum for a cost estimator with inexpensive facilities in the $9M to $10M/MW range. See Overall Datacenter Costs.

·         PUE is a good metric for evaluating datacenter infrastructure efficiency but Microsoft uses best server performance per watt per TCO$

o   Optimize the server design and datacenter together

·         Cost-Reduction Strategies:

o   Server cost reduction:

§  Right size server: Low Power Processors often offer best performance/watt (see Very Low-Power Server Progress for more)

§  Eliminate unnecessary components (very small gain)

§  Use higher efficiency parts

§  Optimize for server performance/watt/$ (cheap and low power tend to win at scale)

o   Infrastructure cost reduction:

§  Operate at higher temperatures

§  Use free air cooling & Eliminate chillers (More detail at: Chillerless Datacenter at 95F)

§  Use advanced power management with power capping to support power over-subscription with peak protection (see the 2007 paper for the early work on this topic)

·         Custom Server Design:

o   2 socket, half-width server design (6.3”W x 16.7”L)

o   4x SATA HDD connectors

o   4x DIMM slots per CPU socket

o   2x 1GigE NIC

o   1x 16x PCIe slot

·         Custom Rack Design:

o   480 VAC 3p power directly to the rack (higher voltage over a given conductor size reduces losses in distribution)

o   Very tall 56 RU rack (over 98” overall height)

o   12VDC distribution within the rack from two combined power supplies with distributed UPS

o   Power Supplies (PSU)

§  Input is 480VAC 3p

§  Output: 12V DC

§  Servers are 12VDC only boards

§  Each PSU is 4.5KW

§  2 PSUs/rack  so rack is 9.0KW max

o   Distributed UPS

§  Each PSU includes a UPS made up of 4 groups of 40 3.2V batteries

§  Overall 160 discrete batteries per UPS

§  Technology not specified but based upon form factor and rough power estimate, I suspect they may be Li-ion 18650. See Commodity Parts for a note on these units.

o   By putting 2 PSUs per rack they avoid transporting low voltage (12VDC) further than 1/3 of a rack distance (under 1 yard) and only 4.5 KW is transported so moderately sized bus bars can be used.

o   Rack Networking:

§  Rack networking is interesting with 2 to 4 TOR switches per rack. We know the servers are 1GigE connected and there are up to 96 per rack, which yields two possibilities: 1) they are using 24-port GigE switches or 2) they are using 48-port GigE switches in a fat tree network topology (see VL2: A Scalable and Flexible Datacenter Network for more on fat trees). 24-port TORs are not particularly cost effective, so I would assume 48x1GigE TORs in a fat tree design, which is nice work.

o   Reports that the CPUs could be lower-power parts including Intel Atom, AMD Bobcat, and ARM

·         Power proportionality: shows a server that consumes 32% of peak power at 0% load (this is an amazingly good server – 45% to 60% is much closer to the norm)
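That 32% idle figure makes a simple first-order model worth sketching. A linear ramp from idle to peak power is a common approximation, not measured data:

```python
def power_fraction(utilization: float, idle_fraction: float = 0.32) -> float:
    """Server power draw as a fraction of peak, assuming power ramps
    linearly from idle to peak with utilization (a common first-order
    approximation, not a measurement)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return idle_fraction + (1.0 - idle_fraction) * utilization

# The server above burns 32% of peak power doing nothing:
assert power_fraction(0.0) == 0.32
# A more typical server idling at 50% of peak is far less power
# proportional at the low utilizations fleets actually run at:
assert abs(power_fraction(0.3, idle_fraction=0.50) - 0.65) < 1e-9
```

The gap between those two servers is exactly why idle power matters so much: most fleets spend most of their time well below 50% utilization.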


James Hamilton





Sunday, February 20, 2011 7:00:07 AM (Pacific Standard Time, UTC-08:00)
 Friday, February 11, 2011

Ben Black always has interesting things on the go. He's now down in San Francisco working on his startup Fastip, which he describes as "an incredible platform for operating, exploring, and optimizing data networks." A couple of days ago Deepak Singh sent me a recent presentation of Ben's I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.


The problem described in this talk was "Collect, index, and query trillions of high dimensionality records with seconds of latency for ingestion and response." What Ben is doing is collecting per-flow networking data with TCP/IP 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven't really a clue on the details of what traffic is clogging them and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample them with solutions like Cisco's NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast, and I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it's an interesting problem in that it's both hard and very useful to solve.


Ben presented 3 possible solutions and why they don't work before offering a fourth that does. The failed approaches couldn't cope with the high dimensionality and sheer volume of the dataset:

1.       HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2.       Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3.       Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.


The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as "A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically to eliminate expensive processing at query time, even at a cost of enormous storage consumption." OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.


Ben draws from two papers to implement an OLAP based solution at this scale:

·         High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

·         Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche


The end solution is:

·         Insert records into Cassandra.

·         Materialize lower-dimensional cuboids using bitsets and then join as needed.

·         Perform all query steps directly in the database
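The bitset step can be sketched with Python integers as bitmaps, one per distinct value per dimension; intersecting bitmaps answers a multi-dimension predicate without scanning records. The field names and records below are hypothetical, and real implementations (like the word-aligned bitmaps in the Lemire et al. paper) add compression this toy omits:

```python
from collections import defaultdict

class BitmapIndex:
    """Toy per-dimension bitmap index. Each distinct value in each
    dimension maps to a bitset (a Python int) of matching row numbers.
    Illustrative only; field names are not Fastip's actual schema."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(int))
        self.n = 0  # number of rows inserted so far

    def insert(self, record: dict) -> int:
        row = self.n
        self.n += 1
        for dim, value in record.items():
            self.index[dim][value] |= 1 << row  # set this row's bit
        return row

    def query(self, **predicates) -> list:
        """Rows matching every dim=value predicate, via bitwise AND."""
        result = (1 << self.n) - 1  # start with all rows set
        for dim, value in predicates.items():
            result &= self.index[dim][value]
        return [row for row in range(self.n) if result >> row & 1]

idx = BitmapIndex()
idx.insert({"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9"})
idx.insert({"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9"})
idx.insert({"src_ip": "10.0.0.1", "dst_ip": "10.0.0.7"})
assert idx.query(src_ip="10.0.0.1") == [0, 2]
assert idx.query(src_ip="10.0.0.1", dst_ip="10.0.0.9") == [0]
```

The minimal cubing idea from the Li, Han, and Gonzalez paper is the same trick one level up: pre-materialize only low-dimensional cuboids and combine them at query time rather than storing the full cube.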

Ben’s concluding advice:

·         Read the literature

·         Generic software is 90% wrong at scale, you just don’t know which 90%.

·         Iterate to discover and be prepared to start over


If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.




James Hamilton





Friday, February 11, 2011 5:41:36 AM (Pacific Standard Time, UTC-08:00)
 Sunday, February 06, 2011

Last week, Sudipta Sengupta of Microsoft Research dropped by the Amazon Lake Union campus to give a talk on the flash memory work that he and the team at Microsoft Research have been doing over the past year.  It's super interesting work. You may recall Sudipta as one of the co-authors on the VL2 paper (VL2: A Scalable and Flexible Data Center Network) I mentioned last October.


Sudipta’s slides for the flash memory talk are posted at Speeding Up Cloud/Server Applications With Flash Memory and my rough notes follow:

·         Technology has been used in client devices for more than a decade

·         Server-side usage is more recent, and the differences between hard disk drive and flash characteristics bring some challenges that need to be managed in the on-device Flash Translation Layer (FTL) or in the operating system or application layers.

·         Server requirements are more aggressive across several dimensions including required random I/O rates and higher reliability and durability (data life) requirements.

·         Key flash characteristics:

·         10x more expensive than HDD

·         10x cheaper than RAM

·         Multi Level Cell (MLC): ~$1/GB

·         Single Level Cell (SLC): ~$3/GB

·         Laid out as a linear array of flash blocks where a block is often 128k and a page is 2k

·         Unfortunately the unit of erasure is a full block while the unit of read or write is a 2k page, which makes the write-in-place technique used in disk drives unworkable.

·         Block erase is a fairly slow operation at 1500 usec whereas read or write is 10 to 100 usec.

·         Wear is an issue with SLC supporting O(100k) erases and MLC O(10k)     

·         The FTL is responsible for managing the mapping between logical pages and physical pages such that logical pages can be overwritten and hot page wear is spread relatively evenly over the device.

·         Roughly 1/3 the power consumption of a commodity disk and 1/6 the power of an enterprise disk

·         100x the ruggedness over disk drives when active
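To make the FTL’s role concrete, here is a toy sketch of log-structured page remapping: overwrites always go to a fresh physical page and only the logical-to-physical map changes, since updating in place would force a slow whole-block erase. All names, sizes, and structures below are my own illustration, not any real device’s.

```python
PAGES_PER_BLOCK = 64  # e.g. a 128KB block of 2KB pages

class ToyFTL:
    def __init__(self, num_blocks):
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.mapping = {}          # logical page -> (block, page)
        self.cursor = (0, 0)       # next free physical page (log-structured)

    def write(self, logical_page, data):
        block, page = self.cursor
        self.flash[block][page] = data              # program the free page
        self.mapping[logical_page] = (block, page)  # old copy is now stale
        page += 1
        if page == PAGES_PER_BLOCK:                 # advance to the next block
            block, page = block + 1, 0
        self.cursor = (block, page)

    def read(self, logical_page):
        block, page = self.mapping[logical_page]
        return self.flash[block][page]

ftl = ToyFTL(num_blocks=4)
ftl.write(7, "v1")
ftl.write(7, "v2")   # overwrite lands on a new physical page
print(ftl.read(7))   # -> v2
```

A real FTL layers garbage collection and wear leveling on top of this map, reclaiming blocks whose pages are mostly stale and spreading erases evenly across the device.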

·         Research Project: FlashStore

·         Use flash memory as a cache between RAM and HDD

·         Essentially a flash-aware store where they implement a log-structured block store (this is essentially what the FTL does in the device, here implemented at the application layer).

·         Changed pages are written through to flash sequentially and an in-memory index of pages is maintained so that pages can be found quickly on the flash device.

·         On failure the index structure can be recovered by reading the flash device

·         Recently unused pages are destaged asynchronously to disk

·         A key contribution of this work is a very compact form for the index into the flash cache

·         Performance results excellent and you can find them in the slides and the papers referenced below
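The FlashStore structure described in these notes — sequential writes to a flash log, a compact in-memory index, asynchronous destage of cold entries to disk — can be sketched as follows; the class and its structures are my own illustration, not FlashStore’s actual design.

```python
from collections import OrderedDict

class FlashCache:
    """Flash as a cache tier between RAM and HDD (illustrative sketch)."""
    def __init__(self, flash_capacity):
        self.flash_log = []            # append-only flash log of (key, value)
        self.index = OrderedDict()     # key -> log offset, kept in LRU order
        self.disk = {}                 # backing hard-disk store
        self.flash_capacity = flash_capacity

    def put(self, key, value):
        self.index[key] = len(self.flash_log)   # point at the newest copy
        self.flash_log.append((key, value))     # sequential flash write
        self.index.move_to_end(key)
        while len(self.index) > self.flash_capacity:
            old_key, off = self.index.popitem(last=False)  # evict coldest
            self.disk[old_key] = self.flash_log[off][1]    # destage to disk

    def get(self, key):
        if key in self.index:                   # flash hit
            self.index.move_to_end(key)
            return self.flash_log[self.index[key]][1]
        return self.disk.get(key)               # fall back to disk

cache = FlashCache(flash_capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)   # coldest entry ("a") destages to disk
print(cache.get("a"), cache.disk)  # -> 1 {'a': 1}
```

On failure, the real system rebuilds its index by scanning the flash log; here that would simply mean replaying `flash_log` into a fresh `index`.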

·         Research Project: ChunkStash

·         A very high performance, high throughput key-value store

·         Tested on two production workloads:

·         Xbox Live Primetime online gaming

·         Storage deduplication

·         The storage deduplication test is a good one in that dedupe is most effective with a large universe of objects to run deduplication over. But a large universe requires a large index. The most interesting challenge of deduplication is to keep the index size small through aggressive compaction.

·         The slides include a summary of how dedupe works and show the performance and compression ratios they have achieved with ChunkStash
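A minimal illustration of the hash-indexed deduplication that ChunkStash accelerates: data is split into chunks, each chunk’s hash is looked up in an index, and only previously-unseen chunks are stored. Fixed-size chunking is used here for brevity where real dedupe systems typically chunk on content.

```python
import hashlib

CHUNK_SIZE = 8  # illustrative; real systems use far larger chunks

def dedupe_store(data, store, index):
    """Store data, returning a recipe of chunk ids; duplicate chunks are skipped."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:        # new chunk: store it once
            index[digest] = len(store)
            store.append(chunk)
        recipe.append(index[digest])   # recipe references shared chunks
    return recipe

store, index = [], {}
recipe = dedupe_store(b"abcdefghabcdefgh", store, index)  # two identical chunks
print(len(store), recipe)  # -> 1 [0, 0]
```

The index is exactly the structure whose size the talk worries about: one entry per unique chunk, which is why compact index representations matter so much at scale.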


For those interested in digging deeper, the VLDB and USENIX papers are the right next stops:

· (FlashStore paper, VLDB 2010)

· (ChunkStash paper, USENIX ATC 2010)

·         Slides:


James Hamilton





 Sunday, January 16, 2011

NVIDIA has been an ARM licensee for quite some time now.  Back in 2008 they announced Tegra, an embedded client processor including an ARM core and NVIDIA graphics aimed at smartphones and mobile handsets. 10 days ago, they announced Project Denver where they are building high-performance ARM-based CPUs, designed to power systems ranging from “personal computers and servers to workstations and supercomputers”. This is interesting for a variety of reasons: first, they are entering the server CPU market; second, NVIDIA is joining Marvell and Calxeda (previously Smooth-Stone) in taking the ARM architecture and targeting server-side computing.


ARM is an interesting company in that they produce designs and these designs get adapted by licensees including Texas Instruments, Samsung, Qualcomm, and even unlikely players such as Microsoft. These licensees may fab their own parts or be fab-less design shops depending upon others for volume production. Unlike Intel, ARM doesn’t actually produce chips, focusing instead on design.


ARM has become an incredibly important instruction set architecture powering smartphones, low-end network routers, printers, copiers, tablets, and other embedded applications. But things are changing: ARM is now producing designs appropriate for server-side computing at the same time that power consumption is becoming a key measure of server-side computing cost. The ARM design team are masters of low-power design and generations of ARM processors have focused on power management. ARM has an impressively efficient design.


Linux powers many of the ARM-based devices mentioned above and consequently Linux runs well on ARM processors. Completing the picture, Microsoft has announced that the next version of Windows will also support ARM devices: Microsoft Announced Support of System on a Chip Architectures from Intel, AMD, and ARM for next Version of Windows.


Other articles on ARM and micro-slice computing:

·         CEMS: Low-Cost, Low-power Servers for Internet Scale Services

·         A Fast Array of Wimpy Nodes

·         Very Low-Cost, Low-Power Servers

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

Other articles on the NVIDIA announcement:

·         Nvidia Developing ARM Processors for Servers

·         Nvidia Turns to ARM for Server Chips to Kill Intel

A closely related article from Barry Evans and Karl Freund of ARM-powered server startup Calxeda:


We are on track for renewed competition in the server-side computing market segment and intense competition on power efficiency at the same time as internet-scale service operators are willing to run whatever processor is least expensive and most power efficient. With competition comes innovation and I see a good year coming.




James Hamilton





 Thursday, January 13, 2011

If you have experience in database core engine development either professionally, on open source, or at university send me your resume. When I joined the DB world 20 years ago, the industry was young and the improvements were coming ridiculously fast.  In a single release we improved DB2 TPC-A performance by a factor of 10x. Things were changing quickly industry-wide.  These days single-server DBs are respectably good. It’s a fairly well understood space. Each year more features are added and a few percent performance improvement may happen but the code bases are monumentally large, many of the development teams are over 1,000 engineers, and things are happening anything but quickly.


If you are an excellent engineer, have done systems or DB work in the past, and are interested in working on the next decade’s problems in database, drop me a note (




James Hamilton




 Sunday, January 09, 2011

Megastore is the data engine supporting the Google App Engine. It’s a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.


I wrote up some notes on it back in 2008 in Under the Covers of the App Engine Datastore and posted Phil Bernstein’s excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years until this year’s CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.


My rough notes from the paper:

·         Megastore is built upon BigTable

·         Bigtable supports fault-tolerant storage within a single datacenter

·         Synchronous replication based upon Paxos and optimized for long distance inter-datacenter links

·         Partitioned into a vast space of small databases each with its own replicated log

·         Each log stored across a Paxos cluster

·         Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition

·         3 billion writes and 20 billion read transactions per day

·         Support for consistency unusual for a NoSQL database but driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but …)

·         Data Model:

·         The data model is declared in a strongly typed schema

·         There are potentially many tables per schema

·         There are potentially many entities per table

·         There are potentially many strongly typed properties per entity

·         Repeating properties are allowed

·         Tables can be arranged hierarchically where child tables point to root tables

·         Megastore tables are either entity group root tables or child tables

·         The root table and all child tables are stored in the same entity group

·         Secondary indexes are supported

·         Local secondary indexes index a specific entity group and are maintained consistently

·         Global secondary indexes, which index across entity groups, are asynchronously updated and eventually consistent

·         Repeated indexes: supports indexing repeated values (e.g. photo tags)

·         Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column.

·         There are physical storage hints:

·         “IN TABLE” directs Megastore to store two tables in the same underlying BigTable

·         “SCATTER” attribute prepends a 2 byte hash to each key to cool hot spots on tables with monotonically increasing values like dates (e.g. a history table).

·         “STORING” clause on an index supports index-only-access by redundantly storing additional data in an index. This avoids the double access often required of doing a secondary index lookup to find the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn’t need to be accessed to obtain these properties.
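The hierarchical layout above maps naturally onto an ordered store like Bigtable: if child rows carry the root’s key as a prefix, an entire entity group occupies one contiguous key range and can be scanned together. A sketch under an assumed key encoding (not Megastore’s actual format):

```python
# Illustrative entity-group row keys: child rows share the root's prefix,
# so lexicographic ordering clusters each entity group contiguously.

def root_key(user_id):
    return f"User/{user_id}"

def child_key(user_id, photo_id):
    return f"User/{user_id}/Photo/{photo_id}"

rows = sorted([
    root_key("alice"),
    child_key("alice", "001"),
    child_key("alice", "002"),
    root_key("bob"),
])
print(rows)  # alice's root and photos sort together, ahead of bob's group
```

This clustering is what makes a per-entity-group transaction log practical: all the rows a transaction touches live in one key range.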

·         3 levels of read consistency:

·         Current: Last committed value

·         Snapshot: Value as of start of the read transaction

·         Inconsistent reads: used for cross entity group reads

·         Update path:

·         A transaction writes its mutations to the entity group’s write-ahead log and then applies the mutations to the data (write-ahead logging).

·         A write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends it to the log, which is maintained using Paxos.

·         Update rates within an entity group are seriously limited by:

·         When there is log contention, one wins and the rest fail and must be retried.

·         Paxos only accepts a very limited update rate (order 10^2 updates per second).

·         Paper reports that “limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts”

·         Implication: programmers must shard aggressively to get even moderate update rates, and consistent update across shards is only supported using two-phase commit, which is not recommended.

·         Cross entity group updates are supported by:

·         Two-phase commit with the fragility that it brings

·         Queueing and asynchronously applying the changes

·         Excellent support for backup and redundancy:

·         Synchronous replication to protect against media failure

·         Snapshots and incremental log backups
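The contention behavior in the update path can be sketched as optimistic concurrency on the log position: a writer reads the next position, prepares an entry, and commits only if no other writer claimed that position first; on contention it retries. This models only the single-winner-per-position behavior described above, not Paxos itself, and all names are illustrative.

```python
class EntityGroupLog:
    """One entity group's write-ahead log (illustrative sketch)."""
    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)      # the "current read" before writing

    def append_at(self, position, mutations):
        if position != len(self.entries):
            return False              # another writer won this position
        self.entries.append(mutations)
        return True

def commit(log, mutations, max_retries=5):
    for _ in range(max_retries):
        pos = log.next_position()
        # ... a real system would run Paxos consensus on this append ...
        if log.append_at(pos, mutations):
            return pos
    raise RuntimeError("too much log contention")

log = EntityGroupLog()
commit(log, {"balance": 10})
commit(log, {"balance": 20})
print(len(log.entries))  # -> 2
```

Because every committed write must win its log position, the per-entity-group update rate is bounded, which is exactly why the paper pushes programmers toward fine-grained entity groups.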


Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive although the limitation of not being able to update an entity group at more than a “few per second” is limiting.


The paper:


Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.




James Hamilton





 Thursday, December 16, 2010

Years ago I incorrectly believed special purpose hardware was a bad idea. What was a bad idea is high-markup, special purpose devices sold at low volume, through expensive channels. Hardware implementations are often best value measured in work done per dollar and work done per joule. The newest breed of commodity networking parts from Broadcom, Fulcrum, Dune (now Broadcom), and others is a beautiful example of Application Specific Integrated Circuits being the right answer for extremely hot code kernels that change rarely.


I’ve long been interested in highly parallel systems and in heterogeneous processing. General Purpose Graphics Processors are firmly hitting the mainstream with 17 of the Top 500 now using GPGPUs (Top 500: Chinese Supercomputer Reigns). You can now rent GPGPU clusters on EC2 at $2.10/server/hour where each server has dual NVIDIA Tesla M2050 GPUs delivering a TeraFLOP per node. For more on GPGPUs, see HPC in the Cloud with GPGPUs and GPU Clusters in 10 minutes.


Some time ago Zach Hill sent me a paper writing up Radix sort using GPGPUs. The paper shows how to achieve a better-than-3x speedup on NVIDIA GT200-hosted systems. For most of us, sort isn’t the most important software kernel we run, but I did find the detail behind the GPGPU-specific optimizations interesting. The paper is at and the abstract is below.




This paper presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).


As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique yields a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.
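Stripped of GPU specifics, the algorithm’s structure is a sequence of counting, prefix-scan, and scatter passes per digit; the paper’s contribution is in how those passes are parallelized on stream processors. A sequential CPU sketch of LSD radix sort over 32-bit keys:

```python
def radix_sort(keys, bits_per_pass=8):
    """Least-significant-digit radix sort for non-negative 32-bit ints."""
    buckets = 1 << bits_per_pass
    mask = buckets - 1
    for shift in range(0, 32, bits_per_pass):
        counts = [0] * buckets
        for k in keys:                       # counting pass
            counts[(k >> shift) & mask] += 1
        offsets = [0] * buckets
        total = 0
        for d in range(buckets):             # exclusive prefix scan
            offsets[d] = total
            total += counts[d]
        out = [0] * len(keys)
        for k in keys:                       # stable scatter pass
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

print(radix_sort([170, 45, 75, 90, 2, 802, 24, 66]))
# -> [2, 24, 45, 66, 75, 90, 170, 802]
```

The prefix scan is the piece the paper generalizes into a reusable parallel primitive; on a GPU each pass is performed cooperatively across thousands of threads rather than in these simple loops.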


James Hamilton





 Monday, December 06, 2010

Even working in Amazon Web Services, I’m finding the frequency of new product announcements and updates a bit dizzying. It’s amazing how fast the cloud is taking shape and the feature set is filling out. Utility computing has really been on fire over the last 9 months. I’ve never seen an entire new industry created and come fully to life this fast. Fun times.

Before joining AWS, I used to say that I had an inside line on what AWS was working on and what new features were coming in the near future.  My trick? I went to AWS customer meetings and just listened. AWS delivers what customers are asking for with such regularity that it’s really not all that hard to predict new product features soon to be delivered. This trend continues with today’s announcement. Customers have been consistently asking for a Domain Name Service and, today, AWS is announcing the availability of Route 53, a scalable, highly redundant and reliable, global DNS service.


The Domain Name System is essentially a global, distributed database that allows various pieces of information to be associated with a domain name.  In the most common case, DNS is used to look up the numeric IP address for a domain name. So, for example, I just looked up and found that one of the addresses being used to host is And, when your browser accessed this blog (assuming you came here directly rather than using RSS) it would have looked up to get an IP address. This mapping is stored in a DNS “A” (address) record. Other popular DNS records are CNAME (canonical name), MX (mail exchange), and SPF (Sender Policy Framework). A full list of DNS record types is at: Route 53 currently supports:

                    A (address record)

                    AAAA (IPv6 address record)

                    CNAME (canonical name record)

                    MX (mail exchange record)

                    NS (name server record)

                    PTR (pointer record)

                    SOA (start of authority record)

                    SPF (sender policy framework)

                    SRV (service locator)

                    TXT (text record)
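The A-record lookup described above can be exercised directly from the Python standard library; which resolver answers and what addresses come back depend entirely on the local environment, so the output shown is only typical.

```python
import socket

def lookup_a_records(hostname):
    """Resolve a hostname's IPv4 addresses (A records) via the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

print(lookup_a_records("localhost"))  # typically ['127.0.0.1']
```

A caching resolver in between means most such lookups never reach the authoritative servers; only cache misses travel back to a service like Route 53.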


DNS, on the surface, is fairly simple and is easy to understand. What is difficult with DNS is providing absolute rock-solid stability at scales ranging from a request per day on some domains to billions on others. Running DNS rock-solid, low-latency, and highly reliable is hard.  And it’s just the kind of problem that loves scale. Scale allows more investment in the underlying service and supports a wide, many-datacenter footprint.


The AWS Route 53 Service is hosted in a global network of edge locations including the following 16 facilities:

·         United States

                    Ashburn, VA

                    Dallas/Fort Worth, TX

                    Los Angeles, CA

                    Miami, FL

                    New York, NY

                    Newark, NJ

                    Palo Alto, CA

                    Seattle, WA

                    St. Louis, MO

·         Europe





·         Asia

                    Hong Kong




Many DNS lookups are resolved in local caches but, when there is a cache miss, the request needs to be routed back to the authoritative name server.  The right approach to answering these requests with low latency is to route to the nearest datacenter hosting an appropriate DNS server.  In Route 53 this is done using anycast. Anycast is a cool routing trick where the same IP address range is advertised to be at many different locations. Using this technique, the same IP address range is advertised as being in each of the world-wide fleet of datacenters. This results in the request being routed to the nearest facility from a network perspective.


Route 53 routes to the nearest datacenter to deliver low-latency, reliable results. This is good but Route 53 is not the only DNS service that is well implemented over a globally distributed fleet of datacenters. What makes Route 53 unique is that it’s a cloud service. Cloud means the price is advertised rather than negotiated.  Cloud means you make an API call rather than talking to a sales representative. Cloud means it’s a simple API and you don’t need professional services or a customer support contact. And cloud means it’s running NOW rather than tomorrow morning when the administration team comes in. Offering a rock-solid service is half the battle but it’s the cloud aspects of Route 53 that are most interesting.


Route 53 pricing is advertised and available to all:

·         Hosted Zones: $1 per hosted zone per month

·         Requests: $0.50 per million queries for the first billion queries per month and $0.25 per million queries over 1 billion queries per month
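As a worked example of the advertised tiers ($1 per hosted zone per month, $0.50 per million queries for the first billion, $0.25 per million beyond that), a quick cost sketch:

```python
def route53_monthly_cost(hosted_zones, queries):
    """Monthly cost in dollars under the tiered pricing described above."""
    first_tier = min(queries, 1_000_000_000)
    overflow = max(queries - 1_000_000_000, 0)
    return (hosted_zones * 1.00
            + first_tier / 1_000_000 * 0.50
            + overflow / 1_000_000 * 0.25)

# 5 zones, 2.5B queries: $5 + $500 (first 1B) + $375 (next 1.5B)
print(route53_monthly_cost(5, 2_500_000_000))  # -> 880.0
```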


You can have it running in less time than it took to read this posting. Go to: ROUTE 53 Details. You don’t need to talk to anyone, negotiate a volume discount, hire a professional service team, call the customer support group, or wait until tomorrow. Make the API calls to set it up and, on average, 60 seconds later you are fully operating.




James Hamilton





 Monday, November 29, 2010

I love high-scale systems and, more than anything, I love data from real systems. I’ve learned over the years that no environment is crueler, less forgiving, or harder to satisfy than real production workloads. Synthetic tests at scale are instructive but nothing catches my attention like data from real, high-scale, production systems. Consequently, I really liked the disk population studies from Google and CMU at FAST 2007 (Failure Trends in a Large Disk Population, Disk Failures in the Real World: What does an MTBF of 100,000 hours mean to you).  These two papers presented actual results from independent production disk populations of 100,000 each. My quick summary of these 2 papers is basically “all you have learned about disks so far is probably wrong.”


Disk failures are often the #1 or #2 failing component in a storage system, usually just ahead of memory. Occasionally fan failures lead disk but that isn’t the common case. We now have publicly available data on disk failures but not much has been published on other component failure rates and even less on the overall storage stack failure rates. Cloud storage systems are multi-tiered, distributed systems involving 100s to even 10s of thousands of servers and huge quantities of software. Modeling the failure rates of discrete components in the stack is difficult but, with the large amount of component failure data available to large fleet operators, it can be done. What’s much more difficult to model are correlated failures.


Essentially, there are two challenges encountered when attempting to model overall storage system reliability: 1) availability of component failure data and 2) correlated failures. The former is available to very large fleet owners but is often unavailable publicly. Two notable exceptions are the disk reliability data from the two FAST’07 conference papers mentioned above. Other than these two data points, there is little credible component failure data publicly available.  Admittedly, component manufacturers do publish MTBF data but these data are often owned by the marketing rather than engineering teams and they range between optimistic and epic works of fiction.


Even with good quality component failure data, modeling storage system failure modes and data durability remains incredibly difficult. What makes this hard is the second issue above: correlated failure. Failures don’t always happen alone, many are correlated, and certain types of rare failures can take down the entire fleet or large parts of it. Just about every model assumes failure independence and then works out data durability to many decimal points. It makes for impressive models with long strings of nines but the challenge is the model is only as good as the input. And one of the most important model inputs is the assumption of component failure independence which is violated by every real-world system of any complexity. Basically, these failure models are good at telling you when your design is not good enough but they can never tell you how good your design actually is nor whether it is good enough.


Where the models break down is in modeling rare events and non-independent failures. The best way to understand common correlated failure modes is to study storage systems at scale over longer periods of time. This won’t help us understand the impact of very rare events. For example, two thousand years of history would not have helped us model or predict that an airplane would be flown into the World Trade Center.  And certainly the odds of it happening again 16 minutes and 20 seconds later would be close to impossible. Studying historical storage system failure data will not help us understand the potential negative impact of very rare black swan events but it does help greatly in understanding the more common failure modes including correlated or non-independent failures.


Murray Stokely recently sent me Availability in Globally Distributed Storage Systems which is the work of a team from Google and Columbia University. They look at a high-scale storage system at Google that includes multiple clusters of Bigtable, which is layered over GFS, which is implemented as a user-mode application over the Linux file system. You might remember Stokely from a post I did back in March titled Using a Market Economy. In this more recent paper, the authors study 10s of Google storage cells, each of which is comprised of between 1,000 and 7,000 servers, over a 1 year period. The storage cells studied are from multiple datacenters in different regions being used by different projects within Google.


I like the paper because it is full of data on a high-scale production system and it reinforces many key distributed storage system design lessons including:

·         Replicating data across multiple datacenters greatly improves availability because it protects against correlated failures.

o    Conclusion: Two way redundancy in two different datacenters is considerably more durable than 4 way redundancy in a single datacenter.

·         Correlation among node failures dwarfs all other contributions to unavailability in production environments.

·         Disk failures can result in permanent data loss but transitory node failures account for the majority of unavailability.
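A back-of-envelope model shows why cross-datacenter replication wins: with an independent per-replica loss probability p and a correlated whole-datacenter loss probability d, data dies only when every copy is gone. The numbers below are illustrative assumptions, not figures from the paper.

```python
p = 0.01    # assumed chance a given replica is lost in some window
d = 0.001   # assumed chance an entire datacenter is lost in that window

# 4 replicas in one datacenter: the datacenter event alone kills all copies.
four_way_one_dc = d + (1 - d) * p**4

# 1 replica in each of 2 datacenters: both sites must independently lose
# the data (site failure or replica failure) in the same window.
per_dc_loss = d + (1 - d) * p
two_dcs = per_dc_loss**2

print(four_way_one_dc > two_dcs)  # the 2-datacenter layout is more durable
```

The single-datacenter loss probability is dominated by the correlated term d no matter how many local replicas are added, which is the intuition behind the paper’s conclusion above.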


To read more:


The abstract of the paper:

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.




James Hamilton






Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton