Saturday, October 01, 2011

I’ve been posting frequently on networking issues, with the key point being that the market is on the precipice of a massive change. There is a new model emerging.

·         Datacenter Networks are in my way

·         Networking: The Last Bastion of Mainframe Computing


We now have merchant silicon providers for the Application Specific Integrated Circuits (ASICs) at the core of network switches and routers, including Broadcom, Fulcrum (recently purchased by Intel), Marvell, and Dune (purchased by Broadcom). We have many competing offerings for the control processor that supports the protocol stack, including Freescale, ARM, and Intel. The ASIC providers build reference designs that get improved by many competing switch hardware providers including Dell, NEC, Quanta, Celestica, DNI, and many others. We have competition at all layers below the protocol stack. What’s needed is an open, broadly used, broadly invested-in networking stack. Credible options are out there, with Quagga perhaps being the strongest contender thus far. Xorp is another that has many users. But there still isn’t a protocol stack with the broad use and critical mass that has emerged in the server world with the wide variety of Linux distributions available.


Two recent additions to the community are 1) the Open Networking Foundation, and 2) the Open Source Routing Forum. More on each:

Open Networking Foundation:

Founded in 2011 by Deutsche Telekom, Facebook, Google, Microsoft, Verizon, and Yahoo!, the Open Networking Foundation (ONF) is a nonprofit organization whose goal is to rethink networking and quickly and collaboratively bring to market standards and solutions. ONF will accelerate the delivery and use of Software-Defined Networking (SDN) standards and foster a vibrant market of products, services, applications, customers, and users.


Open Source Routing Forum

OSR will establish a "platform" supporting the committers and communities behind the open source routing protocols to help deliver a mainstream, stable code base, beginning with Quagga, the most popular open source routing code base. This "platform" will provide capabilities such as regression testing, performance/scale testing, bug analysis, and more. With a stable, qualified routing code base and 24x7 support, service providers, academia, startup equipment vendors, and independent developers can accelerate existing projects like ALTO, OpenFlow, and software defined networks, and germinate new projects in service providers at a lower cost.


Want to be part of re-engineering datacenter networks at Amazon?

I need more help on a project I’m driving at Amazon where we continue to make big changes in our datacenter network to improve customer experience and drive down costs while, at the same time, deploying more gear into production each day than all of Amazon used back in 2000. It’s an exciting time and we have big changes happening in networking. If you enjoy and have experience in operating systems, networking protocol stacks, or embedded systems and you would like to work on one of the biggest networks in the world, send me your resume (




James Hamilton




 Tuesday, September 20, 2011

If you read this blog in the past, you’ll know I view cloud computing as a game changer (Private Clouds are not the Future) and spot instances as a particularly powerful innovation within cloud computing. Over the years, I’ve enumerated many of the advantages of cloud computing over private infrastructure deployments. A particularly powerful cloud computing advantage is driven by noting that when combining a large number of non-correlated workloads, the overall infrastructure utilization is far higher for most workload combinations.  This is partly because the reserve capacity to ensure that all workloads are able to support peak workload demands is a tiny fraction of what is required to provide reserve surge capacity for each job individually.


This factor alone is a huge gain but an even bigger gain can be found by noting that all workloads are cyclic and go through sinusoidal capacity peaks and troughs. Some cycles are daily, some weekly, some hourly, and some on different cycles but nearly all workloads exhibit some normal expansion and contraction over time. This capacity pumping is in addition to handling unusual surge requirements or increasing demand discussed above.


To successfully run a workload, sufficient hardware must be provisioned to support the peak capacity requirement for that workload.  Cost is driven by peak requirements but monetization is driven by the average. The peak to average ratio gives a view into how efficiently the workload can be hosted.  Looking at an extreme, a tax preparation service has to provision enough capacity to support their busiest day and yet, in mid-summer, most of this hardware is largely unused. Tax preparation services have a very high peak to average ratio so, necessarily, utilization in a fleet dedicated to this single workload will be very low.


By hosting many diverse workloads in a cloud, the aggregate peak to average ratio trends towards flat. The overall efficiency of hosting the aggregate workload will be far higher than for any individual workload on private infrastructure.  In effect, the workload capacity peak to trough differences get smaller as the number of combined diverse workloads goes up.  Since cost tracks the provisioned capacity required at peak but monetization tracks the capacity actually being used, flattening this out can dramatically improve costs by increasing infrastructure utilization.
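To make the peak-to-average argument concrete, here’s a minimal simulation sketch. The workload count, shapes, and random phases are illustrative assumptions, not AWS data; the point is only that the aggregate of many uncorrelated cyclic workloads has a much flatter peak-to-average ratio than any one of them.

```python
import math
import random

HOURS = 24 * 7          # one week of hourly samples
N_WORKLOADS = 50        # illustrative number of independent workloads

def workload(phase, period=24, base=100.0, amplitude=80.0):
    """One cyclic workload: capacity demand per hour."""
    return [base + amplitude * math.sin(2 * math.pi * (t / period) + phase)
            for t in range(HOURS)]

random.seed(1)
workloads = [workload(random.uniform(0, 2 * math.pi)) for _ in range(N_WORKLOADS)]

def peak_to_average(series):
    return max(series) / (sum(series) / len(series))

individual = sum(peak_to_average(w) for w in workloads) / N_WORKLOADS
aggregate_series = [sum(w[t] for w in workloads) for t in range(HOURS)]
aggregate = peak_to_average(aggregate_series)

print(f"average individual peak/average: {individual:.2f}")   # ~1.8
print(f"aggregate peak/average:          {aggregate:.2f}")    # much closer to 1.0
```

Since provisioned hardware tracks the peak and revenue tracks the average, the flatter aggregate curve is exactly what drives the higher utilization described above.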


This is one of the most important advantages of cloud computing. But there is still more that can be done. Here’s the problem: even with very large populations of diverse workloads, there is still some capacity that is only rarely used at peak. And, even in the limit, with an infinitely large aggregated workload where the peak to average ratio gets very near flat, there still must be some reserve capacity so that unexpected capacity increases, new customers, or new applications can be satisfied.  We can minimize the pool of rarely used hardware but we can’t eliminate it.


What we have here is yet another cloud computing opportunity. Why not sell the unused reserve capacity on the spot market? This is exactly what AWS is doing with Amazon EC2 Spot Instances. From the Spot Instance detail page:


Spot Instances enable you to bid for unused Amazon EC2 capacity. Instances are charged the Spot Price set by Amazon EC2, which fluctuates periodically depending on the supply of and demand for Spot Instance capacity. To use Spot Instances, you place a Spot Instance request, specifying the instance type, the Availability Zone desired, the number of Spot Instances you want to run, and the maximum price you are willing to pay per instance hour. To determine how that maximum price compares to past Spot Prices, the Spot Price history for the past 90 days is available via the Amazon EC2 API and the AWS Management Console. If your maximum price bid exceeds the current Spot Price, your request is fulfilled and your instances will run until either you choose to terminate them or the Spot Price increases above your maximum price (whichever is sooner).

It’s important to note two points:

1.    You will often pay less per hour than your maximum bid price. The Spot Price is adjusted periodically as requests come in and available supply changes. Everyone pays that same Spot Price for that period regardless of whether their maximum bid price was higher. You will never pay more than your maximum bid price per hour.

2.    If you’re running Spot Instances and your maximum price no longer exceeds the current Spot Price, your instances will be terminated. This means that you will want to make sure that your workloads and applications are flexible enough to take advantage of this opportunistic capacity. It also means that if it’s important for you to run Spot Instances uninterrupted for a period of time, it’s advisable to submit a higher maximum bid price, especially since you often won’t pay that maximum bid price.


Spot Instances perform exactly like other Amazon EC2 instances while running, and like other Amazon EC2 instances, Spot Instances can be terminated when you no longer need them. If you terminate your instance, you will pay for any partial hour (as you do for On-Demand or Reserved Instances). However, if the Spot Price goes above your maximum price and your instance is terminated by Amazon EC2, you will not be charged for any partial hour of usage.
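To make those rules concrete, here is a minimal sketch, assuming hour-granularity billing and made-up prices; real Spot Price history comes from the EC2 API or the AWS Management Console, not from this code.

```python
def spot_charges(max_bid, spot_prices_by_hour):
    """Apply the rules above: you pay the going Spot Price for each hour (never
    your max bid), and the instance is terminated once the Spot Price exceeds
    your bid, with no charge for that final partial hour."""
    total = 0.0
    for hour, price in enumerate(spot_prices_by_hour):
        if price > max_bid:
            return total, f"terminated by EC2 at hour {hour}"
        total += price            # everyone pays the current Spot Price
    return total, "still running"

# Illustrative prices only.
print(spot_charges(max_bid=0.10, spot_prices_by_hour=[0.04, 0.05, 0.07, 0.12, 0.06]))
# total is about $0.16; terminated at hour 3 when the price exceeded the bid
```

Note how the bid acts purely as a ceiling: three of the four billed hours cost well under the $0.10 maximum.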

Spot Instances effectively harvest unused infrastructure capacity. The servers, data center space, and network capacity are all sunk costs. Any workload worth more than the marginal cost of power is profitable to run. This is a great deal for customers because it allows non-urgent workloads to be run at very low cost.  Spot Instances are also a great deal for the cloud provider because they further drive up utilization, with the only additional cost being the power consumed by the spot workloads. From Overall Data Center Costs, you’ll recall that the cost of power is a small portion of overall infrastructure expense.


I’m particularly excited about Spot Instances because, while customers get incredible value, the feature is also a profitable one to offer.  It’s perhaps the purest win/win in cloud computing.


Spot Instances only work in a large market with many diverse customers. This is a lesson learned from the public financial markets. Without a broad number of buyers and sellers brought together, the market can’t operate efficiently. Spot requires a large customer base to operate effectively and, as the customer base grows, it continues to gain efficiency with increased scale.


I recently came across a blog posting that ties these ideas together: New CycleCloud HPC Cluster Is a Triple Threat: 30000 cores, $1279/Hour, & Grill monitoring GUI for Chef. What’s described in this blog posting is a mammoth computational cluster assembled in the AWS cloud. The speeds and feeds for the cluster:

·         C1.xlarge instances:           3,809

·         Cores:                                  30,472

·         Memory:                              26.7 TB


The workload was molecular modeling. The cluster was managed using the Condor job scheduler and deployment was automated using the increasingly popular Opscode Chef. Monitoring was done using a package that CycleComputing wrote that provides a nice graphical interface to this large cluster: Grill for CycleServer (very nice).

The cluster came to life without capital planning, there was no wait for hardware arrival, and no datacenter space needed to be built or bought. The cluster ran 154,116 Condor jobs totaling 95,078 compute hours of work and, when the project was done, was torn down without a trace.


What is truly eye opening for me in this example is that it’s a 30,000 core cluster for $1,279/hour. The cloud and Spot Instances change everything. $1,279/hour for 30k cores. Amazing.
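Breaking that headline number down with simple arithmetic on the figures above:

```python
instances = 3809
cores = 30472
cost_per_hour = 1279.0

print(f"cost per instance-hour: ${cost_per_hour / instances:.3f}")  # about $0.336
print(f"cost per core-hour:     ${cost_per_hour / cores:.4f}")      # about $0.042
```

Roughly four cents per core-hour, with no capital outlay and no commitment beyond the hours actually used.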


Thanks to Deepak Singh for sending the CycleServer example my way.




James Hamilton




 Tuesday, August 16, 2011

I got a chance to chat with Eric Baldeschwieler while he was visiting Seattle a couple of weeks back and catch up on what’s happening in the Hadoop world at Yahoo! and beyond. Eric recently started Hortonworks, whose tag line is “architecting the future of big data.” I’ve known Eric for years, from when he led the Hadoop team at Yahoo!, most recently as VP of Hadoop Engineering.  It was Eric’s team at Yahoo! that contributed much of the code in Hadoop, Pig, and ZooKeeper.


Many of that same group form the core of Hortonworks, whose mission is to revolutionize and commoditize the storage and processing of big data via open source. Hortonworks continues to supply Hadoop engineering to Yahoo!, and Yahoo! is a key investor in Hortonworks along with Benchmark Capital. Hortonworks intends to continue to leverage the large Yahoo! development, test, and operations team.  Yahoo! has over 1,000 Hadoop users and is running Hadoop on many clusters, the largest of which was 4,000 nodes back in 2010. Hortonworks will be providing level 3 support for Yahoo! engineering.


From Eric’s slides at the 2011 Hadoop Summit, the Hortonworks objectives:

      Make Apache Hadoop projects easier to install, manage & use

        Regular sustaining releases

        Compiled code for each project (e.g. RPMs)

        Testing at scale

      Make Apache Hadoop more robust

        Performance gains

        High availability

        Administration & monitoring

      Make Apache Hadoop easier to integrate & extend

        Open APIs for extension & experimentation


Hortonworks Technology Roadmap:

·         Phase 1: Making Hadoop Accessible (2011)

o   Release the most stable Hadoop version ever

o   Release directly usable code via Apache (RPMs, debs,…)

o   Frequent sustaining releases off of the stable branches

·         Phase 2: Next Generation Apache Hadoop (2012)

o   Address key product gaps (Hbase support, HA, Management, …)

o   Enable community and partner innovation via modular architecture & open APIs

o   Work with community to define integrated stack


Next generation Apache Hadoop:

·         Core

o   HDFS Federation

o   Next Gen MapReduce

o   New Write Pipeline (HBase support)

o   HA (no SPOF) and Wire compatibility

·         Data - HCatalog 0.3

o   Pig, Hive, MapReduce and Streaming as clients

o   HDFS and HBase as storage systems

o   Performance and storage improvements

·         Management & Ease of use

o   All components fully tested and deployable as a stack

o   Stack installation and centralized config management

o    REST and GUI for user tasks


Eric’s presentation from Hadoop Summit 2011 where he gave the keynote: Hortonworks: Architecting the Future of Big Data


James Hamilton



 Monday, August 01, 2011

It’s a clear sign that the Cloud Computing market is growing fast and the number of cloud providers is expanding quickly when startups begin to target cloud providers as their primary market. It’s not unusual for enterprise software companies to target cloud providers as well as their conventional enterprise customers but I’m now starting to see startups building products aimed exclusively at cloud providers. Years ago when there were only a handful of cloud services, targeting this market made no sense. There just weren’t enough buyers to make it an interesting market. And, many of the larger cloud providers are heavily biased to internal development further reducing the addressable market size.


A sign of the maturing of the cloud computing market is that there are now many companies interested in offering a cloud computing platform, not all of which have substantial systems software teams. There is now a much larger number of companies to sell to and many are eager to purchase off-the-shelf products. Cloud providers have become a viable market to target in that there are many providers of all sizes and the overall market continues to expand faster than any I’ve seen over the last 25 years.


An excellent example of this new trend of startups aiming to sell to the Cloud Computing market is SolidFire, which targets the high performance block storage market with what can be loosely described as a distributed Storage Area Network. Enterprise SANs are typically expensive, single-box, proprietary hardware.  Enterprise SANs are mostly uninteresting to cloud providers due to high cost and the hard scaling limits that come from scale-up solutions. SolidFire implements a virtual SAN over a cluster of up to 100 nodes. Each node is a commodity 1RU, 10-drive storage server.  They are focused on the most demanding random IOPS workloads such as databases, and all 10 drives in the SolidFire node are Solid State Storage devices. The nodes are interconnected by up to 2x 1GigE and 2x 10GigE networking ports.


Each node can deliver a booming 50,000 IOPS and the largest supported cluster of 100 nodes can support 5M IOPS in aggregate. The 100 node cluster scaling limit may sound like a hard service scaling limit but multiple storage clusters can be used to scale to any level.  Needing multiple clusters has the disadvantage of possibly fragmenting the storage but the advantage of dividing the fleet up into sub-clusters with rigid fault containment between them, limiting the negative impact of software problems. Reducing the “blast radius” of a failure makes moderate sized sub-clusters a very good design point.


Offering a distributed storage solution isn’t that rare – there are many out there. What caught my interest at SolidFire was 1) their exclusive use of SSDs, and 2) an unusually nice quality of service (QoS) approach. Going exclusively with SSDs makes sense for block storage systems aimed at high random IOPS workloads, but SSDs are not a great solution for storage-bound workloads. The storage for these workloads is normally more cost-effectively hosted on hard disk drives. For more detail on where SSDs are a win and where they are not:


·         When SSDs Make Sense in Server Applications

·         When SSDs Don’t Make Sense in Server Applications

·         When SSDs make sense in Client Applications (just about always)


The usual solution to this problem is to do both, but SolidFire wanted a single SSD-optimized solution that would be cost effective across all workloads.  For many cloud providers, especially the smaller ones, a single versatile solution has significant appeal.


The SolidFire approach is pretty cool. They exploit the fact that SSDs have abundant IOPS but are capacity constrained, and they trade off IOPS to get capacity. Dave Wright, the SolidFire CEO, describes the design goal as SSD performance at a spinning media price point. The key tricks employed:

·         Multi-Level Cell Flash: They use MLC flash memory since it is far cheaper than Single Level Cell; the slightly lower IOPS rate supported by MLC is still more than all but a handful of workloads require, and the accelerated wear issues of MLC can be addressed in the software layers above

·         Compression: Aggregate workload dependent gains estimated to be 30 to 70%

·         Data Deduplication: Aggregate workload dependent gains estimated to be 30 to 70%

·         Thin Provisioning: Only allocate blocks to a logical volume as they are actually written to. Many logical volumes never get close to the actual allocated size.

·         Performance Virtualization: Spread all volumes over many servers. Spreading the workload at a sub-volume level allows more control of meeting the individual volume performance SLA with good utilization and without negatively impacting other users


The combination of the capacity gains from thin provisioning, deduplication, and compression brings the dollars per GB of the SolidFire solution very near to some hard disk based solutions at nearly 10x the IOPS performance.
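A rough way to see how these multiplicative gains close the $/GB gap. The savings percentages below are assumptions drawn from the ranges quoted above, and the 10x raw $/GB premium of SSD over disk is an illustrative figure, not SolidFire data:

```python
def effective_cost_per_gb(raw_cost_per_gb, compression=0.5, dedup=0.5, thin_prov=0.3):
    """Each technique removes a fraction of the bytes that must physically be
    stored; the gains compound multiplicatively."""
    stored_fraction = (1 - compression) * (1 - dedup) * (1 - thin_prov)
    return raw_cost_per_gb * stored_fraction

# Illustrative: assume raw SSD capacity costs 10x raw disk capacity.
ssd_raw, disk_raw = 10.0, 1.0
print(effective_cost_per_gb(ssd_raw))   # 1.75 -- within ~2x of raw disk $/GB
```

Under these assumed gains the effective SSD cost lands within roughly 2x of raw disk, which is the spirit of the claim above.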


The QoS solution is elegant in that a few simple settings allow multiple classes of storage to be sold. Each logical volume has two QoS dimensions: 1) bandwidth, and 2) IOPS. Each has a min, max, and burst setting.  The min sets a hard floor where capacity is reserved to ensure the resource is always available. The burst is the hard ceiling that prevents a single user from consuming excess resource. The max is essentially the target: if you run below the max you build up credits that allow a certain time over the max. The burst limits the potential negative impact on other users of excursions above max.


This system can support workloads that need dead-reliable, never-changing I/O performance. It can also support a dead-reliable average case with rare excursions above it (e.g. during a database checkpoint). It’s also easy to support workloads that soak up whatever resources are left over after satisfying the most demanding workloads, without impacting other users. Overall, a nice, simple, and very flexible solution to a very difficult problem.
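For illustration, here is a minimal sketch of the min/max/burst idea applied to a single volume’s IOPS. The credit accounting is my own simplification of the behavior described above, not SolidFire’s implementation, and the limits are made-up numbers:

```python
class VolumeQoS:
    """Per-volume IOPS control: min is a guaranteed floor (reserved by the
    scheduler), max is the target rate, and burst is a hard ceiling reachable
    only by spending credits earned while running below max."""

    def __init__(self, min_iops, max_iops, burst_iops, max_credits=10000):
        self.min, self.max, self.burst = min_iops, max_iops, burst_iops
        self.credits, self.max_credits = 0, max_credits

    def allowed(self, requested_iops):
        if requested_iops <= self.max:
            # Running below target accrues credit for later bursts.
            self.credits = min(self.max_credits, self.credits + (self.max - requested_iops))
            return requested_iops
        # Above target: allowed up to burst, paid for with accumulated credits.
        grant = min(requested_iops, self.burst, self.max + self.credits)
        self.credits -= grant - self.max
        return grant

vol = VolumeQoS(min_iops=500, max_iops=2000, burst_iops=5000)
print(vol.allowed(1000))   # 1000: below target, earns credit
print(vol.allowed(4000))   # 3000: the max plus the credits just earned
```

The nice property is that an excursion (a database checkpoint, say) is absorbed out of credits the volume already earned, so it can’t persistently crowd out other tenants.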




James Hamilton



 Sunday, July 24, 2011

Great things are happening in the networking market. We’re transitioning from vertically integrated, mainframe-like economics to a model similar to what we have in the server world. In the server ecosystem, we have Intel, AMD, and others competing to provide the CPU. At the layer above, we have ZT Systems, HP, Dell DCS, SGI, IBM, and many others building servers based upon whichever CPU the customer chooses to use.  Above that layer, we have a wide variety of open source and proprietary software stacks that run on servers from any of these providers using any of the CPU providers’ silicon. There is competition at all layers in the stack.


The networking world is finally heading down the right path with competing merchant silicon providers for the core data plane processing engines used in networking gear (Application Specific Integrated Circuits or ASICs). These ASICs are supplied by Broadcom, Fulcrum, Marvell, and others.  A given ASIC will be built into switch gear by many competitors. Above that layer, there are open source networking protocol stacks (Quagga and Xorp) as well as ASIC-independent commercial protocol stacks (IP Infusion and Aricent) and also proprietary stacks.


This is all good news.  In fact, things are moving so fast we’re already starting to see some consolidation in the industry, with Broadcom acquiring Dune Networks in November 2009 and, just last week, Intel acquiring Fulcrum Microsystems. The Intel acquisition of Fulcrum is potentially a good thing for the industry in that Fulcrum may now have even more investment capital. However, Fulcrum was hardly starving before, having been through five rounds of venture funding netting a booming $102M.  Broadcom remains the biggest player in the networking merchant silicon market with a market capitalization of $19.1B.


Hopefully this major acquisition will even further ramp up the pace of innovation and competition in the networking world. I suspect it will even though it feels early for this market to be consolidating.


Related press articles:

·         Intel press release

·         The Register

·         CNET

Related blogs:

·         Andy Bechtolsheim: The March to Merchant Silicon in 10Gbe Cloud Networking

·         Datacenter Networks are in My Way




James Hamilton



 Thursday, June 23, 2011

Earlier this week, I was in Athens, Greece attending the annual conference of the ACM Special Interest Group on Management of Data. SIGMOD is one of the top two database events held each year, attracting academic researchers and leading practitioners from industry.


I kicked off the conference with the Plenary keynote. In this talk I started with a short retrospection on the industry over the last 20 years. In my early days as a database developer, things were moving incredibly quickly. Customers were loving our products, the industry was growing fast and yet the products really weren’t all that good. You know you are working on important technology when customers are buying like crazy and the products aren’t anywhere close to where they should be.


In my first release as lead architect on DB2 20 years ago, we completely rewrote the DB2 database engine process model moving from a process-per-connected-user model to a single process where each connection only consumes a single thread supporting many more concurrent connections. It was a fairly fundamental architectural change completed in a single release. And in that same release, we improved TPC-A performance a booming factor of 10 and then did 4x more in the next release. It was a fun time and things were moving quickly.


From the mid-90s through to around 2005, the database world went through what I refer to as the dark ages. DBMS code bases had grown to the point where the smallest was more than 4 million lines of code, the commercial system engineering teams would no longer fit in a single building, and the number of database companies shrunk throughout the entire period down to only 3 major players. The pace of innovation was glacial and much of the research during the period was, in the words of Bruce Lindsay, “polishing the round ball”. The problem was that the products were actually passably good, customers didn’t have a lot of alternatives, and nothing slows innovation like large teams with huge code bases.


In the last 5 years, the database world has become exciting again. I’m seeing more opportunity in the database world now than at any other time in the last 20 years. It’s now easy to get venture funding for database products and the number and diversity of viable products is exploding. My talk focused on what changed, why it happened, and some of the technical backdrop influencing these changes.


A background thesis of the talk is that cloud computing solves two of the primary reasons why customers used to be stuck standardizing on a single database engine even though some of their workloads may have run poorly on it. The first is cost. Cloud computing reduces costs dramatically (some of the cloud economics argument: ) and charges by usage rather than via an annual enterprise license. One of the favorite lock-ins of the enterprise software world is the enterprise license. Once you’ve signed one, you are completely owned and it’s hard to afford to run another product.  My fundamental rule of enterprise software is that any company that can afford to give you a 50% to 80% reduction from “list price” is pretty clearly not a low margin operator. That is the way much of the enterprise computing world continues to work: start with a crazy price, negotiate down to a ½ crazy price, and then feel like a hero while you contribute to incredibly high profit margins.


Cloud computing charges by use in small increments and any of the major database or open source offerings can be used at low cost. That is certainly a relevant reason, but the really significant factor is the offloading of administrative complexity to the cloud provider.  One of the primary reasons to standardize on a single database is that each is so complex to administer that it’s hard to have sufficient skill on staff to manage more than one. Cloud offerings like the AWS Relational Database Service transfer much of the administrative work to the cloud provider, making it easy to choose the database that best fits the application and to have many specialized engines in use across a given company.


As costs fall, more workloads become practical and existing workloads get larger.  For example, if analyzing three months of customer usage data has value to the business and it becomes affordable to analyze two years instead, customers correctly want to do it. The plunging cost of computing is fueling database size growth at a super-Moore pace, requiring either partitioned (sharded) or parallel DB engines.
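As a simple illustration of the partitioned (sharded) approach, here is a sketch of hash-based sharding that spreads a growing dataset across several database partitions. The shard names and the customer-id key are hypothetical, chosen only to show the routing idea:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical partitions

def shard_for(customer_id: str) -> str:
    """Route each customer's rows to a fixed partition so any single database
    only has to handle roughly 1/N of the total, ever-growing workload."""
    h = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("customer-42"))
```

The cost, of course, is that cross-shard queries and rebalancing become application problems, which is exactly why parallel DB engines remain the alternative path.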


Customers now have larger and more complex data problems, they need the products always online, and they are now willing to use a wide variety of specialized solutions if needed. Data intensive workloads are growing quickly and never have there been so many opportunities and so many unsolved or incompletely solved problems. It’s a great time to be working on database systems.


·         The slides from the talk:

·         Proceedings extended abstract:

·         Video of talk: (select June 14th to get to the video)

The talk video is available but, unfortunately, only to ACM digital library subscribers (thanks to Simon Leinen for pointing out the availability of the video link above).



James Hamilton



 Thursday, June 09, 2011

The Amazon Technology Open House was held Tuesday night at the Amazon South Lake Union Campus. I did a short presentation on the following:


       Quickening pace of infrastructure innovation

       Where does the money go?

       Power distribution infrastructure

       Mechanical systems

       Modular & Advanced Building Designs

       Sea Change in Networking


Slides and notes:




James Hamilton



 Friday, June 03, 2011

Earlier today Alex Mallet reminded me of the excellent writing of Atul Gawande by sending me a pointer to the New Yorker coverage of Gawande’s commencement address at the Harvard Medical School: Cowboys and Pit Crews.


Four years ago I wrote a couple of blog entries on Gawande’s work but, at the time, my blog was company internal so I’ve not posted these notes here in the past:


As a follow-on to the posting I made on professional engineering (also posted externally), Edwin Young sent me a link to the following talk by Atul Gawande: Outcomes are Very Personal. It’s from another domain, medicine, but it is a phenomenally good presentation by a surgeon and his core premise applies equally to software: practitioners’ work and the outcomes of that work are spread on a bell curve.  The truly great are much better than the average and often an order of magnitude better than the lowest performing.  His book and the presentation are about the personal attributes and approaches of those at the very top. It’s well worth a view:


In my view, it’s an insightful presentation by a surgeon who loves data, loves understanding why we do well and how we can do better, and is relentless in his own pursuit of doing everything better.  Subsequent to watching the presentation, I read a book by the same author, “Better: A Surgeon's Notes on Performance” (


Software, like surgery, is part art and part science and there is tremendous variability between the average and the best.  Gawande studies the best in different specializations to understand why the performance of some practitioners is way out at the positive end of the bell curve and, through a series of essays, makes observations on how to improve the performance of the population overall.  Understanding that human performance is distributed on a bell curve means that, for whatever it is you are doing, there are average performers, terrible performers, and truly gifted performers.  Gawande looks for what he calls positive deviance – it’s always there in a bell-curve-distributed phenomenon – and tries to understand what they do differently.  Worth reading.


A sampling of Gawande’s work:

·         Book:

·         Video:

·         Commencement Speech Notes:




James Hamilton



 Tuesday, May 31, 2011

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting of the UK Marine Accident Investigation Branch. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.


I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.


Wanting to deeply understand unusual failure modes, and especially wanting to understand the errors that humans make when managing systems under stress, I spend time reading about system failures. Considerable learning can be drawn from reading about the failures of engineered systems and of people under stress. All disasters or near disasters yield some unique lessons and reinforce some old ones.


The hard part for me is getting enough detail to really learn from the situation. The press reports are often light on details, partly because general audiences are not necessarily that interested, but there also may be legal or competitive constraints preventing broad publication. NASA, FAA, Coast Guard, and some other government reports do get to excellent detail. One analysis of system failure I learned greatly from was Feynman’s analysis of the space shuttle Challenger disaster as part of the Rogers Commission Report.


I just came across another report that is not quite a Feynman classic but it is an excellent, just-the-facts description of a large scale failure. This report, from IEEE Spectrum, titled What Went Wrong in Japan’s Nuclear Reactors, outlines what happened in the eventually catastrophic disaster at Japan’s Fukushima Dai-1 nuclear facility following the Tohoku earthquake and subsequent tsunami. In this report, the terminal failures of 4 of the 6 reactors at the facility are described in more detail than other accounts of the event I’ve come across.

All disasters are unique in some dimensions. What makes Fukushima particularly unusual is that these failures occurred over multiple weeks rather than the seconds to hours of many events.  This one was relatively slow to develop and even slower to be brought under control. Looking forward, I suspect Fukushima will share some characteristics with Chernobyl, where mitigating the environmental damage is still nowhere close to complete 25 years later. In 1998 the Ukrainian government obtained economic aid from the European Bank for Reconstruction and Development to rebuild the failing Chernobyl sarcophagus. It is expected that yet more work will need to be done to continue to prevent dangerous radioactive substances from escaping.  Similarly, I expect the environmental impact of the Fukushima disaster will be fought for decades at great cost, both economic and human.


In many ways Fukushima was a classic disaster where a not particularly surprising event, in this case an earthquake near Japan, started the failure, and then cascading natural disaster, equipment failure, and human decisions followed to yield an outcome that every aspect of the system design sought to avoid.


I recommend reading the IEEE report linked below and my rough notes from the write-up follow:

·         On March 11 an earthquake registering 9.0 magnitude was experienced off the coast of Japan

·         The tsunami hit the plant destroying power distribution gear cutting off power to the Fukushima facility

·         Backup generators and switch gear were also disabled by the Tsunami

·         Reactor building integrity was maintained through the earthquake and tsunami, and the three reactors that were active at that point were all shut down properly

·         Due to the power failure and the damage to distribution gear and generators, plant cooling systems were not operating at any of the reactors nor the spent fuel rod storage pools

·         Even though the nuclear reaction had been stopped in the three reactors that were operational when the tsunami hit (reactors 1, 2, & 3), considerable heat was still being produced, putting the reactors at risk of meltdown. Meltdown is a condition where the reactor core overheats, the coolant boils off, and the fuel rods melt and form a pool of very hot, highly radioactive fuel in the bottom of the reactor. This hot, radioactive fluid then rapidly breaks down the steel and concrete of the containment vessel and possibly escapes to the environment.

·         Another area of risk from the failed cooling systems is the spent nuclear fuel rod storage pools. These pools are also housed inside the reactor buildings near the primary containment vessel where the active nuclear reaction actually takes place. Although the fuel rods are no longer contributing to a nuclear reaction, they are both highly radioactive and still producing sufficient heat that active cooling is required. Without cooling, these rods can heat the storage pool to the point that it boils off the cooling water and present a risk similar to the active rods inside the primary containment vessel.

·         I find it surprising that both the spent rod storage and the shut down reactor cores don’t appear to fail safe and self-stabilize when cooling water is removed given the considerably higher than zero probability of power failure and the seriously negative impact of radioactive release to the environment.

·         Events at Reactor #1:

o   March 12, a day after the power failure, heat in the recently shutdown reactor built up until the (not circulating) cooling water began to be boiled off.

o   As the water level fell, the now exposed fuel rods reacted with the steam in the primary containment vessel, and began producing hydrogen gas

o   The pressure rose to dangerous levels in the primary containment vessel and operators decided to vent the primary containment vessel into the reactor building.

o   The vented hydrogen gas, when exposed to the relatively oxygen-rich environment in the reactor building, exploded, blowing the top off the reactor building

o   The explosion may have also damaged the primary containment vessel and definitely released radioactive material

o   The operators chose to pump seawater into the building in an effort to control the escalating temperature inside the reactor and to avoid core meltdown

o   March 29, radioactive water was found outside the reactor building

o   April 5, reactor core temperatures have begun to fall indicating the system is coming back into control

o   Radioactivity levels in the building are very high and operators are injecting nitrogen to reduce the likelihood of subsequent hydrogen explosions.

o   May 12, TEPCO officials confirmed that the reactor had suffered a core meltdown and the bottom of the reactor building may be leaking highly radioactive water into the environment.

·         Events at Reactor #3:

o   March 14, 3 days after the tsunami and 2 days after the roof was blown off the Reactor #1 containment building, the same thing happened on Reactor #3

o   This explosion occurred despite plant operators pumping large quantities of cooling sea water into the reactor building

o   March 17, steam begins billowing from the reactor building confirming that the primary containment vessel was damaged and releasing radioactive compounds.

o   Helicopters dumped water on the building and police water cannons were used to pour water down onto the building.

o   Water was sprayed on the building for days with some interruptions as radiation levels rose sufficiently high that work had to be stopped.

o   March 24, workers laying power cables attempting to restore power to Reactor #3 waded into highly radioactive water requiring hospitalization.

o   March 28, dangerous plutonium was detected in the environment near Reactor #3.

·         Events at Reactor #2:

o   March 15, 4 days after the tsunami, 3 days after the roof was blown off Reactor #1, and a day after the roof was blown off Reactor #3, a serious explosion occurred at Reactor #2.

o   Reactor #2 was later confirmed to have experienced at least a partial core meltdown

o   March 27, highly radioactive water discovered outside of reactor building #2.

·         Subsequently, large quantities of uncontained radioactive water have been found throughout the multi-reactor plant, and the turbine facilities are flooded, as are the cabling tunnels between the buildings. Serious radioactive water leaks into the ocean have been detected and subsequently corrected, in one case by injecting 6,000 liters of liquid glass into the ground near the leak.

·         April 4th, 11,500 tons of radioactive water were pumped into the ocean. This water was 100x above the legal safety limit but was pumped into the environment in the hope that the storage facilities could instead be used to contain waste water that is 10,000x the radioactive limit for environmental release.

·         The spent fuel pools at the inactive reactors 4, 5, & 6 were all slowly overheating as a consequence of there being no cooling water. The Reactor #4 cooling pool either boiled off its water or it leaked away as a result of earthquake damage. The spent fuel rods exposed to the atmosphere without cooling led to fires inside Reactor building #4

·         Outcome:

o   Fukushima is now rated to be as serious as Chernobyl, having been classified as a magnitude 7 event, the worst on the International Nuclear Event Scale. However, it is still considered to have released only 5 to 10% of the radiation released by Chernobyl.

o   All residents within 20 km evacuated

o   Voluntary evacuation of all residents between 20 and 30 km.

o   Agricultural products including milk and vegetables from the region contaminated

o   Tokyo’s tap water declared unfit for infants for 1 day

o   Decades of cleanup and containment remain


The report: What Went Wrong in Japan's Nuclear Reactors:


We all wish the situation had been avoided and those of us involved in engineering projects, whether they are life-critical systems or not, need to ensure that the lessons from this one are learned well and applied faithfully to new designs. I won’t speculate on the human risk taken in the efforts to mitigate this disaster but, clearly, the workers who brought these systems back under control and continue to manage the environmental impact are heroes and deserve our collective thanks.




James Hamilton



 Wednesday, May 25, 2011

The European Data Center Summit 2011 was held yesterday at SihlCity CinCenter in Zurich. Google Senior VP Urs Hoelzle kicked off the event talking about why data center efficiency is important both economically and socially.  He went on to point out that the oft-quoted number that US data centers represent 2% of total energy consumption is usually misunderstood. The actual data point is that 2% of the US energy budget is spent on IT, of which the vast majority is client-side systems. This is unsurprising but a super important clarification.  The full breakdown of this data:


·         2% of US power

o   Datacenters:              14%

o   Telecom:                     37%

o   Client Device:            50%


The net is that 14% of 2%, or 0.28%, of the US power budget is consumed in datacenters.  This is a far smaller but still very relevant number. In fact, that is the primary motivator behind the conference: how to make the best practices of the industry leaders in datacenter efficiency available more broadly.


To help understand why this is important, consider how that datacenter consumption breaks down by facility size:

·         Of the 0.28% energy consumption by datacenters:

o   Small:            41%

o   Medium:     31%

o   Large:            28%


This latter set of statistics predictably shows that the very largest data centers consume 28% of the data center energy budget while small and medium centers consume 72%.  High-scale datacenter operators have large staffs of experts focused on increasing energy efficiency but small and medium sized centers can’t afford this overhead at their scale. Urs’s point, and the motivation behind the conference, is that we need to make industry best practices available to all data center operators.


The driving goal behind the conference is that extremely efficient datacenter operations are possible using only broadly understood techniques. No magic is required.  It is true that the very large operators will continue to enjoy even better efficiency but existing industry best practices can easily get even small operators with limited budgets to within a few points of the same efficiency levels.


Using Power Usage Effectiveness (PUE) as the measure, the industry leaders are at 1.1 to 1.2, where 1.2 means that for every watt delivered to the servers, 1.2 watts must be delivered from the utility. Effectively it is a measure of the overhead, or efficiency, of the datacenter infrastructure. Unfortunately the average remains in the 1.8 to 2.0 range and the worst facilities can be as poor as 3.0.
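To put numbers on the difference, here is a quick calculation of utility power and infrastructure overhead at the PUE levels mentioned above; the 1 MW IT load is just an illustrative assumption:

```python
def utility_power(it_load_kw, pue):
    """PUE = total facility power / IT power, so overhead is (PUE - 1) x IT load."""
    total = it_load_kw * pue
    return total, total - it_load_kw

for pue in (1.1, 1.2, 1.5, 2.0, 3.0):
    total, overhead = utility_power(1000, pue)   # a 1 MW IT load
    print(f"PUE {pue:.1f}: {total:,.0f} kW from the utility, {overhead:,.0f} kW of overhead")
```

Moving an average facility from 2.0 to 1.5 cuts the infrastructure overhead in half for the same IT load, which is the size of the prize the conference was focused on.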


Summarizing: Datacenters consume 0.28% of the annual US energy budget, and 72% of that consumption is in small and medium sized centers that tend towards lower efficiency levels.


The Datacenter Efficiency conference focused on making cost effective techniques more broadly understood showing how a PUE of 1.5 is available to all without large teams of experts or huge expense. This is good for the environment and less expensive to operate.




James Hamilton



 Monday, May 23, 2011

Guido van Rossum was at Amazon a week back doing a talk. Guido presented 21 Years of Python: From Pet Project to Programming Language of the Year.


The slides are linked below and my rough notes follow:

·         Significant Python influencers:

o   Algol 60, Pascal, C

o   ABC

o   Modula0-2+ and 3

o   Lisp and Icon

·         ABC was the strongest language influencer of this set

·         ABC design goals:

o   Professionals but not professional programmers (lab personnel, scientists, etc.)

o   Easy to teach, easy to learn, easy to use

·         Parts of ABC most liked by Guido:

o   Design iterations based on user testing

§  E.g. colon before indented blocks

o   Simple design: IF, WHILE, FOR, …

o   Indentation for grouping (Knuth, occam)

o   Tuples, lists, dictionaries (though changed)

o   Immutable data types

o   No limits

o   The >>> prompt

·         Parts of ABC that most needed improvement:

o   Monolithic design – not extensible

§  E.g. no graphics, not easily added

o   Invented non-standard terminology

§  E.g. “how-to” instead of “procedure”

o   ALL-CAPS keywords

o   No integration with rest of system

§  No file-based I/O (persistent variables instead)

·         The beginnings of Python:

o   Amoeba project at CWI

§  Writing apps in C and sh and wanting something in between

·         Python design philosophy:

o   Borrow ideas whenever it makes sense

o   As simple as possible, no simpler (Einstein)

o   Do one thing well (UNIX)

o   Don’t fret about performance (fix it later)

o   Go with the flow (don’t fight environment)

o   Perfection is the enemy of the good

o   Cutting corners is okay (get back to it later)

·         User Centric Design Philosophy:

o   Avoid platform ties, but not religiously

o   Don’t bother the user with details

o   Discourage but allow coding to the platform

o   Offer multiple levels of extensibility

o   Errors should not be fatal, if possible

o   Errors should never pass silently

o   Don’t blame the user for bugs in Python

·         Core language stabilized quickly in the 1990 to 1991 timeframe

·         Early days of active Python community:

o   1990 – internal at CWI

§  More internal use than ABC ever had

§  Internal contributors

o   1991 – first release;

o   1994 – USENET group comp.lang.python

o   1994 – first workshop (NIST)

o   1995-1999 – from workshops to conferences

o   1995 – Python Software Association

o   1997 – goes online

o   1999 – Python Consortium

§  Modeled after X Consortium

o   2001 – Python Software Foundation

§  Modeled after Apache Software Foundation

·         Present day Python community:

o   PSF runs largest annual Python conference

§  PyCon Atlanta in 2011: 1500 attendees

§  2012-2013: Toronto; 2014-2015: Bay area

§  Also sponsors regional PyCons world-wide

o   EuroPython since 2002

o   Many local events, user groups



o   Stackoverflow etc.

·         Python 2 vs Python 3

o   Fixing deep bugs intrinsic in the design

o   Avoid two extremes:

§  perpetual backwards compatibility (C++)

§  rewrite from scratch (Perl 6)

o   Our approach:

§  evolve the implementation gradually

§  some backwards incompatibilities

§  separate tools to help users cope


Thanks to Guido for doing the well received Python presentation.


Guido’s slides and blog URLS:

·         Slides:

·         Blog:




James Hamilton



 Friday, May 20, 2011

I invited Nikhil Handigol to present at Amazon earlier this week. Nikhil is a PhD candidate at Stanford University working with networking legend Nick McKeown on the Software Defined Networking team. Software Defined Networking is a concept coined by Nick where the research team is separating the networking control plane from the data plane. The goal is a fast and dumb routing engine with the control plane factored out and supporting an open programming platform.


From Nikhil’s presentation, we see the control plane hoisted up to a central, replicated network O/S configuring the distributed routing engines in each switch.


One implementation of software defined networking is OpenFlow where each router supports the OpenFlow protocol and a central OpenFlow Controller computes routing tables that are installed in each router:
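To make the split concrete, here is a toy sketch in plain Python (not a real OpenFlow controller or the OpenFlow API): a central controller holds the topology, computes paths, and installs flow-table entries in otherwise dumb switches that only match and forward. Switch names, ports, and links are invented for illustration.

```python
from collections import deque

class Switch:
    """Data plane: just a flow table mapping destination -> output port."""
    def __init__(self, name):
        self.name, self.flow_table = name, {}
    def install_rule(self, dst, out_port):
        self.flow_table[dst] = out_port

class Controller:
    """Control plane: holds the global topology and computes routes centrally."""
    def __init__(self, links):
        self.links = links                      # {switch: {neighbor: out_port}}
        self.switches = {s: Switch(s) for s in links}

    def install_path(self, src, dst):
        # BFS shortest path over the controller's global view of the topology.
        prev = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nbr in self.links[node]:
                if nbr not in prev:
                    prev[nbr] = node
                    queue.append(nbr)
        # Walk back from dst and install a forwarding rule at each hop.
        node = dst
        while prev[node] is not None:
            hop = prev[node]
            self.switches[hop].install_rule(dst, self.links[hop][node])
            node = hop

links = {"s1": {"s2": 1}, "s2": {"s1": 1, "s3": 2}, "s3": {"s2": 1}}
ctl = Controller(links)
ctl.install_path("s1", "s3")
print(ctl.switches["s1"].flow_table)   # {'s3': 1} -- forward toward s2
```

The switches stay simple and fast; all the route computation, and any experimentation with new routing policy, happens in one replaceable piece of software.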


What makes OpenFlow especially interesting is that it’s simple, easy to implement, and getting broad industry support with the Open Networking Foundation as the central organizing body.  The Open Networking Foundation’s primary mission is to advance software defined networking using OpenFlow as the protocol. Founding members of the Open Networking Foundation are Deutsche Telekom, Facebook, Google, Microsoft, Verizon, and Yahoo!.  Also included are networking equipment providers including: Broadcom, Dell, Cisco, Force10, HP, Juniper, Marvell, Mellanox, and many others.


Today, most networking equipment is shipped as a vertically integrated stack including both the control and data planes. There are many reasons why this is not good for the industry. The Stanford team argues it blocks innovation in that researchers can’t try new protocols on a closed stack without a programming model.  I agree. This is a problem for both academia and industry, but my dislike of the current model is much broader. In Networking: The Last Bastion of Mainframe Computing, I made the case that this vertically integrated approach is artificially holding prices high and slowing the pace of innovation. A quick summary of the argument:


When networking equipment is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.


Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges.


I’m excited about software defined networking because it provides a clean interface allowing switch providers to both innovate and compete. An additional benefit is that SDN allows innovation and experimentation at the network protocol layer.


In Nikhil’s talk at Amazon, he explored integrating load balancing functionality into the network routing fabric. The team started with the hypothesis that load balancing is really just smart routing. They then implemented a distributed load balancing fabric by adding load balancing support to network routers using Software Defined Networking. Essentially they distribute the load balancing functionality throughout the network. What’s unusual here is that the ideas could be tested and tried over a 9-campus, North America-wide network with only 500 lines of code. With conventional network protocol stacks, this research would have been impossible in that vendors don’t open up protocol stacks. And, even if they did, it would have been complex and very time consuming.




James Hamilton



 Wednesday, May 04, 2011

This note looks at the Open Compute Project distributed Uninterruptable Power Supply (UPS) and server Power Supply Unit (PSU). This is the last in a series of notes looking at the Open Compute Project. Previous articles include:

·         Open Compute Project

·         Open Compute Server Design

·         Open Compute Mechanical Design


The Open Compute design uses a semi-distributed uninterruptable power supply (UPS) system. Most data centers use central UPS systems where the large UPS is part of the central power distribution system. In this design, the UPS is in the 480VAC 3-phase portion of the central power distribution system prior to the step down to 208VAC. Typical capacities range from 750kVA to 1,000kVA. An alternative approach is a distributed UPS like that used by a previous generation Google server design.

In a distributed UPS, each server has its own 12VDC battery to serve as backup power. This design has the advantage of being very reliable, with the battery directly connected to the server 12V rail. Another important advantage is the small fault containment zone (small “blast radius”) where a UPS failure will only impact a single server. With a central UPS, a failure could drop the load on 100 racks of servers or more. But, there are some downsides to a distributed UPS. The first is that batteries are stored with the servers. Batteries take up considerable space, can emit corrosive gasses, don’t operate well at high temperature, and require chargers and battery monitoring circuits. As much as I like aspects of the distributed UPS design, it’s hard to implement cost effectively and, consequently, it is very uncommon.


The Open Compute UPS design is a semi-distributed approach where each UPS is on the floor with the servers but, rather than having 1 UPS per server (distributed UPS) or 1 UPS per on the order of 100 racks (central UPS with roughly 4,000 servers), they have 1 UPS per 6 racks (180 servers).




In this design the battery rack is the central rack, flanked by two triple racks of servers. Like the server racks, the UPS is fed 480VAC 3-phase directly. At the top of the battery rack, they have control circuitry, circuit breakers, and rectifiers to charge the battery banks.


What’s somewhat unusual is that the output stage of the UPS doesn’t include inverters to convert the direct current back to the alternating current required by a standard server PSU. Instead the UPS output is 48V direct current, which is delivered directly to the three racks on either side of the UPS. This has the upside of avoiding the final inverter stage, which increases efficiency. There is a cost to avoiding converting back to AC.  The most important downside is they effectively need two server power supplies, where one accepts 277VAC and the other accepts 48VDC from the UPS. The second disadvantage is that 48V distribution is inefficient over longer distances due to conductor losses at high amperage.


The problem with power distribution efficiency is partially mitigated by keeping the UPS close to the servers: the 6 racks it feeds are on either side of the UPS so the distances are actually quite short. And, since the UPS is only used during the short period between the start of a power failure and the generators taking over, the efficiency of this distribution is actually not that important a factor. The second issue remains: each server power supply is effectively two independent PSUs.
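The conductor-loss point is easy to quantify. A quick sketch, considering only resistive I²R losses and an assumed round-trip conductor resistance (the load and resistance values are illustrative, not Open Compute figures):

```python
def conductor_loss_watts(load_watts, volts, wire_resistance_ohms):
    """For the same delivered power, current scales as 1/V and resistive loss
    scales as I^2 * R, so low-voltage distribution loses far more in the wire."""
    amps = load_watts / volts
    return amps ** 2 * wire_resistance_ohms

LOAD = 10_000          # 10 kW of servers
R = 0.01               # assumed 0.01 ohm round-trip conductor resistance
print(conductor_loss_watts(LOAD, 277, R))   # ~13 W lost at 277VAC
print(conductor_loss_watts(LOAD, 48, R))    # ~434 W lost at 48VDC -- why runs must stay short
```

Roughly a 33x difference in wire loss for the same load, which is why the 48VDC path is kept to a few feet and only carries power during the brief ride-through to the generators.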


The server PSU looks fairly conventional in that it’s a single box. But included in that single box are two independent PSUs and some control circuitry. This has the downside of forcing the use of a custom, non-commodity power supply. Lower volume components tend to cost more. However, the power supply is a small part of the cost of a server so this additional cost won’t have a substantially negative impact. And, it’s a nice, reliable design with a small fault containment zone, which I really like.


The Open Compute UPS and power distribution system avoids one level of power conversion common in most data centers, delivers somewhat higher voltages (277VAC rather than 208VAC) close to the load, and has the advantage of a small fault zone.


James Hamilton



b: /


Wednesday, May 04, 2011 5:24:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, April 29, 2011

Google cordially invites you to participate in a European Summit on sustainable Data Centres. This event will focus on energy-efficiency best practices that can be applied to multi-MW custom-designed facilities, office closets, and everything in between. Google and other industry leaders will present case studies that highlight easy, cost-effective practices to enhance the energy performance of Data Centres.

The summit will also include a dedicated session on cooling. Presenters will detail climate-specific implementations of free cooling as well as novel ways to utilise locally-available opportunities. We will also debate climate-independent PUE targets.

The agenda includes presentations and panel discussions featuring Amazon, DeepGreen, eBay, Google, IBM, Microsoft, Norman Disney & Young, PlusServer, Telecity Group, The Green Grid, UK's Chartered Institute for IT, UBS and others.

Attendance is free. However, space is limited and we therefore encourage you to register online at your earliest convenience. Your participation will be confirmed.

We look forward to seeing you and your colleagues in Zurich!

Friday, April 29, 2011 5:18:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, April 20, 2011

Last Thursday Facebook announced the Open Compute Project, where they released pictures and specifications for their Prineville, Oregon datacenter and the servers and infrastructure that will populate that facility. In my last blog, Open Compute Mechanical System Design, I walked through the mechanical system in some detail. In this posting, we’ll have a closer look at the Facebook Freedom Server design.


Chassis Design:

The first thing you’ll notice when looking at the Facebook chassis design is that there are only 30 servers per rack. They are challenging one of the most strongly held beliefs in the industry: that density is the primary design goal and more density is good. I 100% agree with Facebook and have long argued that density is a false god. See my rant Why Blade Servers aren’t the Answer to all Questions for more on this one. Density isn’t a bad thing, but paying more to get denser designs that cost more to cool is usually a mistake. This is what I’ve referred to in the past as the Blade Server Tax.


When you look closer at the Facebook design, you’ll note that the servers are more than 1 Rack Unit (RU) high but less than 2 RU. They chose a non-standard 1.5RU server pitch. The argument is that 1RU server fans are incredibly inefficient. Going with 60mm fans (which fit in 1.5RU) dramatically increases their efficiency, but moving further up to 2RU isn’t notably better. So, on that observation, they went with 60mm fans and a 1.5RU server pitch.

I completely agree that optimizing for density is a mistake and that 1RU fans should be avoided at all costs so, generally, I like this design point. One improvement worth considering is to move the fans out of the server chassis entirely and go with very large fans on the back of the rack. This allows a small gain in fan efficiency by going with still larger fans and permits a denser server configuration without loss of efficiency or additional cost. Density without cost is a fine thing and, in this case, I suspect 40 to 80 servers per rack could be delivered, so it would be worth considering.
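For those curious why small fans are so costly, the idealized fan affinity laws give a feel for it: at the same airflow, shaft power scales roughly with the fourth power of the diameter ratio. The sketch below uses assumed fan diameters (40mm for a typical 1RU server, 60mm for the 1.5RU pitch, 140mm for a hypothetical rack-rear fan) and ignores motor and blade-design differences.

```python
# Idealized fan-affinity-law estimate of why larger fans are more efficient.
# Airflow scales ~ D^3 * N and shaft power ~ D^5 * N^3 (D = diameter, N = speed),
# so for the SAME airflow a larger fan needs power ~ (D_small / D_large)^4.
# Real fans deviate from this; the numbers are purely illustrative.
def relative_power_same_airflow(d_small_mm, d_large_mm):
    return (d_small_mm / d_large_mm) ** 4

for small, large in [(40, 60), (40, 80), (60, 140)]:
    ratio = relative_power_same_airflow(small, large)
    print(f"{large}mm fan needs ~{ratio:.2f}x the power of a {small}mm fan "
          f"for the same airflow ({1 / ratio:.1f}x more efficient)")
```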


The next thing you’ll notice when studying the chassis above is that there is no server case.  All the components are exposed for easy service and excellent air flow. And, upon more careful inspection, you’ll note that all components are snap in and can be serviced without tools. Highlights:

·         1.5 RU pitch

·         1.2mm stamped pre-plated steel

·         Neat, integrated cable management

·         4 rear mounted 60mm fans

·         Tool-less design with snap plungers holding all components

·         100% front cable access



The Open Compute project supports two motherboard designs, where one uses Intel processors and the other uses AMD.

Intel Motherboard:

AMD Motherboard:


Note that these boards are both 12V only designs. 


Power Supply:

The power supply (PSU) is an unusual design in two dimensions: 1) it is a single output voltage 12V design, and 2) it’s actually two independent power supplies in a single box. Single-voltage supplies are getting more common, but commodity server power supplies still usually deliver 12V, 5V, and 3.3V. Even though processors and memory require somewhere between 1 and 2 volts depending upon the technology, both typically are fed by the 12V power rail through a Voltage Regulator Down (VRD) or Voltage Regulator Module (VRM). The Open Compute approach is to deliver 12V only to the board and to produce all other required voltages via a Voltage Regulator Module on the motherboard. This simplifies the power supply design somewhat, and they avoid cabling by having the motherboard connect directly to the server PSU.


The Open Compute power supply has two power sources. The primary source is 277V alternating current (AC) and the backup power source is 48V direct current (DC). The output from both supplies is the same 12V DC power rail delivered to the motherboard.


Essentially this supply is two independent PSUs with a single output rail. The choice of 277VAC is unusual; most high-scale data centers run on 208VAC. But 277VAC allows one power conversion stage to be avoided and is therefore more power efficient.


Most data centers have mid-voltage transformers (typically in the 13.2kV range, but it can vary widely by location). This voltage is stepped down to 480V 3-phase power in North America and 400V 3-phase in much of the rest of the world. The 480VAC 3-phase power is then stepped down to 208VAC for delivery to the servers.


The trick that Facebook is employing in their datacenter power distribution system is to avoid one power conversion by not doing the 480VAC to 208VAC conversion. Instead, they exploit the fact that each phase of 480VAC 3-phase power is 277VAC between phase and neutral. This avoids a power transformation step, which improves overall efficiency. The negatives of this approach are: 1) commodity power supplies can’t be used (277VAC is beyond the range of commodity PSUs), and 2) the load on each of the three phases needs to be balanced. Generally, this is a good design tradeoff where the increase in efficiency justifies the additional cost and complexity.


An alternative but very similar approach that I like even better is to step the mid-voltage down to 400VAC 3-phase and then play the same phase-to-neutral trick described above. This technique still has the advantage of avoiding one layer of power transformation. What is different is that the resultant phase-to-neutral voltage delivered to the servers is 230VAC, which allows commodity power supplies to be used. The disadvantage of this design is that the mid-voltage to 400VAC 3-phase transformer is not in common use in North America. However, it is a common transformer in other parts of the world, so they are still fairly easily attainable.
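The arithmetic behind both variants of the trick is just the line-to-line to phase-to-neutral relationship in a balanced wye system; a quick sketch:

```python
import math

# Line-to-line voltage divided by sqrt(3) gives the phase-to-neutral voltage
# in a balanced 3-phase wye system -- the "trick" described above.
def phase_to_neutral(line_to_line_volts):
    return line_to_line_volts / math.sqrt(3)

for v_ll in (480, 400):
    print(f"{v_ll}VAC 3-phase -> {phase_to_neutral(v_ll):.0f}VAC phase-to-neutral")
# 480VAC -> ~277VAC (the Open Compute approach, requires non-commodity PSUs)
# 400VAC -> ~231VAC (close enough to 230VAC that commodity PSUs can be used)
```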


Clearly, any design that avoids a power transformation stage is a substantial improvement over most current distribution systems. The ability to use commodity server power supplies unchanged makes the 400VAC 3-phase-to-neutral trick look slightly better than the 480VAC 3-phase approach, but all designs need to be considered in the larger context in which they operate. Since the Facebook power redundancy system requires the server PSU to accept both a primary alternating current input and a backup 48VDC input, special purpose-built supplies need to be used. Since a custom PSU is needed for other reasons, going with 277VAC as the primary voltage makes perfect sense.
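One way to see why dropping a conversion stage matters is to multiply the per-stage efficiencies of the distribution chain. The stage lists and efficiency numbers below are simplified, assumed round values, not measured data from either design:

```python
# Back-of-envelope illustration: overall distribution efficiency is the product
# of per-stage efficiencies, so removing a stage helps the whole chain.
# All stage efficiencies are assumed, illustrative values.
conventional = {
    "13.2kV -> 480V transformer": 0.98,
    "central UPS (double conversion)": 0.94,
    "480V -> 208V transformer": 0.98,
    "server PSU (208VAC -> 12VDC)": 0.92,
}
reduced_stage = {
    "13.2kV -> 480V transformer": 0.98,
    "server PSU (277VAC -> 12VDC)": 0.94,   # semi-distributed UPS sits out of the normal path
}

def chain_efficiency(stages):
    eff = 1.0
    for stage_eff in stages.values():
        eff *= stage_eff
    return eff

for name, stages in (("conventional", conventional), ("reduced-stage", reduced_stage)):
    print(f"{name}: {chain_efficiency(stages) * 100:.1f}% of utility power reaches the 12V rail")
```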


Overall a very efficient and elegant design that I’ve enjoyed studying. Thanks to Amir Michael of the Facebook hardware design team for the detail and pictures.




James Hamilton



b: /


Wednesday, April 20, 2011 6:56:09 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
 Saturday, April 09, 2011

Last week Facebook announced the Open Compute Project (Perspectives, Facebook). I linked to the detailed specs in my general notes on Perspectives and said I would follow up with more detail on key components and design decisions I thought were particularly noteworthy.  In this post we’ll go through the mechanical design in detail.


As long-time readers of this blog will know, PUE has many issues (PUE is still broken and I still use it) and is mercilessly gamed in marketing literature (PUE and tPUE). The published Facebook literature predicts that this center will deliver a PUE of 1.07. Ignoring the power requirements of the mechanical systems, just delivering power to the servers at a 7% loss is a considerable challenge. I’m slightly skeptical of 1.07 as a fully loaded, annual PUE, but I can say with confidence that it is one of the nicest mechanical designs I have come across. Hats off to Jay Park and the rest of the Facebook DC design team.
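For readers less familiar with the metric, PUE is simply total facility power divided by IT equipment power, so a 1.07 claim leaves very little headroom for everything that isn’t servers. A minimal sketch with a hypothetical IT load:

```python
# PUE = total facility power / IT equipment power. The IT load below is a
# hypothetical example purely to show what a 1.07 PUE implies.
def pue(total_facility_kw, it_kw):
    return total_facility_kw / it_kw

it_load_kw = 10_000                            # assumed IT load
claimed_pue = 1.07
overhead_kw = it_load_kw * (claimed_pue - 1)   # everything that isn't IT load

print(f"At {it_load_kw}kW of IT load and PUE {claimed_pue}, only {overhead_kw:.0f}kW "
      f"remains for distribution losses, fans, pumps, lighting, and office loads")
print(f"Check: PUE = {pue(it_load_kw + overhead_kw, it_load_kw):.2f}")
```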


Perhaps the best way to understand the mechanical design is to first walk through the mechanical system diagram and then step through the actual deployment in pictures.

In the mechanical system diagram you will first note there is no process-based cooling (no air conditioning) and no chilled water loop. It’s 100% air cooled, with all IT equipment cooled by outside air, which is particularly effective given the favorable weather conditions experienced in central Oregon.


The outside air is pulled in from outside, mixed with a controlled volume of return air to avoid over-cooling in winter, filtered, evaporatively cooled, and run through a fan wall before being delivered to the cold aisles below.


Air Intake

The cooling system takes up the entire second floor of the facility, with the IT equipment and office space on the first floor. In the picture below, you’ll see the building air intake on the left running the full width of the facility. This intake area is effectively “outside,” so it includes water drains in the floor. In the picture, to the right towards the top, you’ll see the air path to the remainder of the air handling system. To the right at the bottom, you’ll see the white sheet metal sections covering the passage where hot exhaust air is brought up from the hot aisle below to be mixed with outside air to prevent over-cooling on cold days.


Mixing Plenum & Filtration:

In the next picture, you can see the next part of the full building air handling system. On the left the louvers at the top are bringing in outside air from the outside air intake room.  On the left at the bottom, there are thermostatically controlled louvers allowing hot exhaust air to come up from the IT equipment hot aisle. The hot and outside air are mixed in this full room plenum before passing through the filtration wall visible on the right side of the picture.


The facility grounds are still being worked upon, so the filtration system includes an extra set of disposable paper filters to increase the life of the more aggressive filtration media behind them (not visible).


Misting Evaporative Cooling

In the next picture below you’ll see the temperature- and humidity-controlled high-pressure water misting system, again running the full width of the facility. Most evaporative cooling systems used in data center applications use wet media. This system uses high-pressure water with stainless steel nozzles to produce an atomized mist. These designs produce excellent cooling (high delta-T) but are prone to calcification and higher maintenance unless aggressive water filtration is used, which brings some expense. But they are very effective.
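Direct evaporative cooling performance is usually characterized by how closely the supply air approaches the outside wet-bulb temperature. The sketch below uses the standard effectiveness relationship with assumed temperatures and effectiveness values, purely for illustration:

```python
# A minimal sketch of direct evaporative cooling: the supply temperature
# approaches the outside wet-bulb temperature, scaled by the system's
# effectiveness. Temperatures and effectiveness values are assumed examples.
def evaporative_supply_temp(dry_bulb_c, wet_bulb_c, effectiveness):
    return dry_bulb_c - effectiveness * (dry_bulb_c - wet_bulb_c)

dry_bulb, wet_bulb = 35.0, 18.0   # a hot, dry high-desert afternoon (assumed)
for eff in (0.70, 0.90):          # e.g. wet media vs high-pressure misting (illustrative)
    supply = evaporative_supply_temp(dry_bulb, wet_bulb, eff)
    print(f"effectiveness {eff:.0%}: {dry_bulb}C outside air cooled to {supply:.1f}C")
```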


Just beyond the misters, you’ll see what looks like filtration media. This media is there to ensure no airborne water makes it out of the cooling room.


Exhaust System

In the final picture below, you’ll see we have now reached the other side of the building where the exhaust fans are found. We bring air in on one side, filter it, cool it, and pump it and then, just before the exhaust fans visible in the picture, huge openings in the floor allow the treated, cooled air to be brought down to the IT equipment cold aisle below.


The exhaust fans visible in the picture control building pressure by exhausting unneeded air back outside.


The Facebook facility has considerable similarity to the EcoCooling mechanical design I posted last Monday (Example of Efficient Mechanical System Design). Both approaches use the entire building as the air ducting system and both designs use evaporative cooling. The notable differences are: 1) the EcoCooling design is based upon wet media whereas Facebook is using a high-pressure water misting system, and 2) the EcoCooling design runs the mechanical systems beside the compute floor whereas the Facebook design uses the second floor above the IT rooms for all mechanical systems.


The most interesting aspects of the Facebook mechanical design: 1) full-building ducting with huge plenum areas, 2) no process-based cooling, 3) mist-based evaporative cooling, 4) large, efficient impellers with variable frequency drives, and 5) full-wall, low-resistance filtration.


I’ve been saying for years that mechanical systems are where the largest opportunities for improvement lie and are the area where innovation is most required. The Facebook Prineville facility is most of the way there and is one of the most efficient mechanical system designs I’ve come across. Very elegant and very efficient.




James Hamilton



b: /


Saturday, April 09, 2011 9:43:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [17] - Trackback
 Thursday, April 07, 2011

The pace of innovation in data center design has been rapidly accelerating over the last 5 years driven by the mega-service operators. In fact, I believe we have seen more infrastructure innovation in the last 5 years than we did in the previous 15. Most very large service operators have teams of experts focused on server design, data center power distribution and redundancy, mechanical designs, real estate acquisition, and network hardware and protocols.  But, much of this advanced work is unpublished and practiced at a scale that  is hard to duplicate in a research setting.


At low scale, with only a data center or two, it would be crazy to have all these full-time engineers and specialists focused on infrastructure improvement and expansion. But, at high scale with 10s of data centers, it would be crazy not to invest deeply in advancing the state of the art.


Looking specifically at cloud services, the difference between an unsuccessful cloud service and a profitable, self-sustaining business is the cost of the infrastructure. With continued innovation driving down infrastructure costs, there is investment capital available, services can be added and improved, and value can be passed on to customers through price reductions. Amazon Web Services, for example, has had 11 price reductions in 4 years. I don’t recall that happening in my first 20 years working on enterprise software. It really is an exciting time in our industry.


Facebook is a big business operating at high scale and they also have elected to invest in advanced infrastructure designs. Jonathan Heiliger and the Facebook infrastructure team have hired an excellent group of engineers over the past couple of years and are now bringing these designs to life in their new Prineville Oregon facility. I had the opportunity to visit this datacenter 6 weeks back just before it started taking production load. I had an excellent visit, got to catch up with some old friends, meet some new ones, and tour an impressive facility. I saw an unusually large number of elegant designs ranging from one of the cleanest mechanical systems I’ve come across, three phase 480VAC directly to the rack,  a low voltage direct current distributed uninterruptable power supply system, all the way through to custom server designs. But, what made this trip really unusual is that I’m actually able to talk about what I saw.


In fact, more than allowing me to talk about it, Facebook has decided to release most of the technical details surrounding these designs publically. In the past, I’ve seen some super interesting but top secret facilities and I’ve seen some public but not particularly advanced data centers. To my knowledge, this is the first time an industry leading design has been documented in detail and released publically.


The set of specifications Facebook is releasing is worth reading, so I’m posting links to all of them below. I encourage you to go through these in as much detail as you choose. In addition, I’ll post summary notes over the next couple of days explaining aspects of the design I found most interesting or commenting upon the pros and cons of some of the approaches employed.


The specifications:

·         Data Center Design

·         Intel Motherboard

·         AMD Motherboard

·         Battery Cabinet (Distributed UPS)

·         Server PSU

·         Server Chassis and Triplet Hardware


My commendations to the specification authors Harry Li, Pierluigi Sarti, Steve Furuta, and Jay Park and to the rest of the Facebook infrastructure team for releasing this work publically and for doing so in sufficient detail that others can build upon it. Well done.





·         Open Compute Web Site:

·         Live Blog of the Announcement:


James Hamilton



b: /



Thursday, April 07, 2011 9:30:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, April 04, 2011

A bit more than a year back, I published Computer Room Evaporative Cooling where I showed an evaporative cooling design from EcoCooling. Periodically, Alan Beresford sends me designs he’s working on. This morning he sent me a design they are working on for a 7MW data center in Ireland.


I like the design for a couple of reasons: 1) it’s a simple and efficient design, and 2) it’s a nice example of a few important industry trends. The trends exemplified by this design are: 1) air-side economization, 2) evaporative cooling, 3) hot-aisle containment, and 4) very large plenums with controlled hot-air recycling. The diagrams follow and, for the most part, speak for themselves.



I expect mechanical designs with these broad characteristics are going to show up increasingly frequently over the next year or so, primarily because it is a cost-effective and environmentally sound approach.




James Hamilton



b: /


Monday, April 04, 2011 2:53:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
 Saturday, March 26, 2011

Brad Porter is Director and Senior Principal Engineer at Amazon. We work in different parts of the company but I have known him for years and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the LISA paper On Designing and Deploying Internet-Scale Services.




Prioritizing the Principles in "On Designing and Deploying Internet-Scale Services"

By Brad Porter

James published what I consider to be the single best paper to come out of the highly-available systems world in many years. He gave simple, practical advice for delivering on the promise of high availability. James presented “On Designing and Deploying Internet-Scale Services” at LISA 2007.

A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.

1. Keep it simple

Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience complexity is the number one cause of human error.

Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I'm going to add three new sub-principles to this.

Have Well Defined Architectural Roles and Responsibilities: Robust systems are often described as having "good bones." The structural skeleton upon which the system has evolved and grown is solid. Good architecture starts from having a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a white-board.

Minimize Variance: Variance arises most often when engineering or operations teams use partitioning typically through branching or forking as a way to handle different use cases or requirements sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.

Clean-Up Cruft: Cruft can be defined as those things that clearly should be fixed, but no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low priority "bugs" that no one has fixed. Cleaning up cruft is a constant task, but it is necessary to minimize complexity.

2. Expect failures

At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To "expect failures" is to recognize that "off" is always a valid state. A host or component may switch to the "off" state at any time without warning.

If you're willing to turn a component off at any time, you're immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.

3. Support version roll-back

Roll-back is similarly liberating. Many system problems are introduced on change-boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decreases dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.

4. Maintain forward-and-backward compatibility

Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn't possible as customers may be unable or unwilling to upgrade at the same time.

If you have forward-and-backward compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. It also allows a subset of machines in the system to be upgraded and to receive real production traffic as the last phase of the test cycle, running simultaneously with older versions of the component.
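To make the idea concrete, here is a minimal sketch (my own illustration, not from the paper) of a consumer that tolerates both older and newer message versions; the message format and field names are hypothetical.

```python
# A minimal sketch of forward-and-backward compatible message handling,
# assuming a simple dict/JSON message format. Entirely illustrative; not
# drawn from any particular production system.
def handle_order(message: dict) -> None:
    # Backward compatible: fields added in newer versions get safe defaults
    # when an older producer omits them.
    priority = message.get("priority", "normal")
    # Forward compatible: unknown fields from newer producers are simply
    # ignored rather than treated as errors, so old consumers keep working.
    order_id = message["order_id"]
    print(f"processing {order_id} at priority {priority}")

# The same consumer handles an old-format and a new-format message.
handle_order({"order_id": "A1"})
handle_order({"order_id": "B2", "priority": "high", "new_field_v3": 42})
```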

5. Give enough information to diagnose

Once you have the big ticket bugs out of the system, the persistent bugs will only happen one in a million times or even less frequently. These problems are almost impossible to reproduce cost effectively. With sufficient production data, you can perform forensic diagnosis of the issue. Without it, you're blind.

Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug. It also gives you the means to answer exactly what happened quickly, rather than guessing based on the results of a multi-day or multi-week simulation.

I rank these five as the most important because they liberate you to continue to evolve the system as time and resources permit to address the other dimensions the paper describes. If you fail to do the first five, you'll be endlessly fighting operational overhead costs as you attempt to make forward progress.

  • If you haven't kept it simple, then you'll spend much of your time dealing with system dependencies, arguing over roles & responsibilities, managing variants, or sleuthing through data/config or code that is difficult to follow.
  • If you haven't expected failures, then you'll be reacting when the system does fail. You may also be dealing with complicated change-management processes designed to keep the system up and running while you're attempting to change it.
  • If you haven't implemented roll-back, then you'll live in fear of your next upgrade. After one or two failures, you will hesitate to make any further system change, no matter how beneficial.
  • Without forward-and-backward compatibility, you'll spend much of your time trying to force dependent customers through migrations.
  • Without enough information to diagnose, you'll spend substantial amounts of time debugging or attempting to reproduce difficult-to-find bugs.

I'll end with another encouragement to read the paper if you haven't already: "On Designing and Deploying Internet-Scale Services"


Saturday, March 26, 2011 8:21:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
 Sunday, March 20, 2011

Back in early 2008, I noticed an interesting phenomenon: some workloads run more cost effectively on low-cost, low-power processors. The key observation behind this phenomenon is that CPU bandwidth consumption is going up faster than memory bandwidth. Essentially, it’s a two-part observation: 1) many workloads are memory bandwidth bound and will not run faster with a faster processor unless the faster processor comes with a much improved memory subsystem, and 2) the number of memory-bound workloads is going up over time.


One solution is to improve both the memory bandwidth and the processor speed, and this really does work, but it is expensive. Fast processors are expensive, and faster memory subsystems bring more cost as well. The cost of the improved performance goes up non-linearly: N times faster costs far more than N times as much. There will always be workloads where these additional investments make sense. The canonical scale-up workload is the relational database (RDBMS). These workloads tend to be tightly coupled and do not scale out well. The non-linear cost of more expensive, faster gear is both affordable and justified by there really not being much of an alternative. If you can’t easily scale out and the workload is important, scale it up.
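To illustrate the non-linear cost point, here is a toy comparison of scale-up versus scale-out cost for a perfectly parallelizable workload; the cost exponent and server prices are invented purely for illustration:

```python
# Illustrative comparison of scale-up versus scale-out cost for a perfectly
# parallelizable workload. The super-linear exponent and prices are assumed
# values, chosen only to show "N times faster costs far more than N times as much".
def scale_up_cost(speedup, base_cost=2000, exponent=1.8):
    return base_cost * speedup ** exponent     # super-linear cost growth (assumed)

def scale_out_cost(speedup, commodity_cost=2000):
    return commodity_cost * speedup            # N commodity servers, linear cost

for n in (2, 4, 8, 16):
    print(f"{n}x throughput: scale-up ~${scale_up_cost(n):,.0f}, "
          f"scale-out ~${scale_out_cost(n):,.0f}")
```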


The poor scaling of RDBMS workloads is driving three approaches in our industry: 1) considerable spending on SANs, SSDs, flash memory, and scale-up servers, 2) the NoSQL movement rebelling against the high-function, high-cost, and poor scaling characteristics of RDBMS, and 3) deep investments in massively parallel processing SQL solutions (Teradata, Greenplum, Vertica, ParAccel, Aster Data, etc.).


All three approaches have utility and I’m interested in all three. However, there are workloads that really are highly parallel. For these workloads, the non-linear cost escalation of scale-up servers makes them less cost effective than more commodity servers. The banner workload in this class is simple web serving, where there are potentially thousands of parallel requests and it’s more cost effective to spread the work over a fleet of commodity servers.


The entire industry has pretty much moved to deploying these highly parallelizable workloads over fleets of commodity servers. There will always be workloads that are tightly coupled and hard to run effectively over large numbers of commodity servers but, for those that can be, the gains are great: 1) far less expensive, 2) much smaller unit of failure, 3) cheap redundancy, 4) small, inexpensive unit of growth (avoid forklift upgrades), and 5) no hard ceiling at the scale-up limit.


This tells us two things: 1) where we can run a workload over a large number of commodity servers, we should, and 2) the number of workloads where parallel solutions have been found continues to increase. Even workloads like the social networking hairball problem now have economic parallel solutions. Even some RDBMS workloads can now be run cost effectively on scale-out clusters. The question I started thinking through back in 2008 is how far can we take this? What if we used client processors or even embedded device processors? How far down the low-cost, low-power spectrum does it make sense to go for highly parallelizable workloads? Clearly, the volume economics of client and device processors make them appealing from a price perspective, and the multi-decade focus on power efficiency in the device world offers impressive efficiencies for some workloads.


For more on the low-cost, low-power trend see:

·         Very low-Cost, Low-Power Servers

·         Microslice Servers

·         The Case of Low-Cost, Low-Power Servers

Or the full paper on the approach: Cooperative Expendable, Microslice Servers


I’m particularly interested in the possible application of ARM processors to server workloads:

·         Linux/Apache on ARM processors

·         ARM Cortex-A9 SMP Design Announced


Intel is competing with ARM on some device workloads and offers the Atom, which is not as power efficient as ARM but has the upside of running the Intel Architecture instruction set. ARM licensees also recognize the trend I’ve outlined above – some server workloads will run very cost effectively on low-cost, low-power processors – and they want to compete in the server business. One excellent example is SeaMicro, which is taking Intel Atom processors and competing for the server business: SeaMicro Releases Innovative Intel Atom Server.


Competition is a wonderful thing and drives much of the innovation the fruits of which we enjoy today. We have three interesting points of competition here: 1) Intel is competing for the device market with Atom, 2) ARM licensees are competing with Intel Xeon in the server market, and 3) Intel Atom is being used to compete in the server market but not with solid Intel backing.


Using Atom to compete with server processors is, on one hand, good for Intel in that it gives them a means of competing with ARM at the low end of the server market but, on the other hand, the Atom is much cheaper than Xeon, so Intel will lose money on every workload it wins where that workload used to be Xeon hosted. This risk is limited by Intel not giving Atom ECC memory (You Really do Need ECC Memory). Hobbling Atom is a dangerous tactic in that it protects Xeon but, at the same time, may allow ARM to gain ground in the server processor market. Intel could have ECC support on Atom in months – there is nothing technically hard about it. It’s the business implications that are more complex.


The first step in the right direction was made earlier this week when Intel announced an ECC-capable Atom for 2012:

·         Intel Plans on Bringing Atom to Servers in 2012, 20W SNB for Xeons in 2011

·         Intel Flexes Muscles with New Processor in the Face of ARM Threat


This is potentially very good news for the server market, but it’s only potentially good news. Waiting until 2012 suggests Intel wants to give the low-power Xeons time in the market to see how they do before setting the price on the ECC-equipped Atom. If the ECC-supporting version of Atom is priced like a server part rather than a low-cost, high-volume device part, then nothing changes. If the Atom with ECC comes in priced like a device processor, then it’s truly interesting. This announcement says that Intel is interested in the low-cost, low-power server market but still plans to delay entry into the lowest end of that market for another year. Still, this is progress and I’m glad to see it.


Thanks to Matt Corddry of Amazon for sending the links above my way.




James Hamilton



b: /


Sunday, March 20, 2011 1:19:09 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
