Friday, November 07, 2014

In the Amazon Web Services world, this has always been a busy time of the year. Busy, because, although we aim for a  fairly even pace of new service announcements and new feature releases all year, invariably, somewhat more happens towards the end of the year than early on. And, busy, because the annual AWS re:Invent conference is in early November and this is an important time to roll out new services or important features. This year is no exception and, more than ever, there is a lot to announce at the conference. It should be fun.

I enjoy re:Invent because it’s a chance to talk to customers in more detail about what we have been building, learn how they are using it, and hear what we could do to make the services better. It’s always exciting to see what customers have done with AWS over the last year and I like the fact that customer presentations are a big part of the conference. It’s great to hear from an engineer how a service should work but almost always more interesting to see how customers are actually using the service. Each year there are more customers who have decided to move aggressively and go 100% cloud hosted and there are more and more stories of large, mission critical applications that have been moved to the cloud. It’s good to hear the details behind these decisions and how they were executed.


Andy Jassy, who leads AWS, will give the conference keynote Wednesday, November 12th at 9am. If you can’t be at re:Invent in person, it will also be available via live streaming. This session is always packed with new service and major feature announcements so, if you can only watch one session, this is it.


Later that day at 4:30, I’ll be presenting AWS Innovation at Scale, which will also be available via video streaming. In this session, I’ll be putting some numbers on current AWS growth rates. Then I’ll focus on two main areas: networking and database. Networking is interesting because it’s an area that is running counter-Moore and actually getting relatively more expensive while the rest of the industry is going the other way. And, at the same time that is happening, network traffic is increasing at rates faster than server deployments. We’ll lay out how we have fundamentally reengineered our network for more capacity, lower latency, and reduced jitter while lowering costs. Database is my second area of focus. It’s another area where we feel substantial new engineering investment can give developers better solutions at far lower costs, partly based upon open source, partly based upon shifting the bulk of the complexity of database administration to AWS, and partly based upon new engine development optimized for cloud deployments. There is lots to cover in both networking and database but I’ll also show how AWS processes tens of millions of records per second internally and how this solution has become an externally available cloud service.


I’ll post the slides I present once available to:

Live streaming for select sessions available at:


--James Hamilton


Friday, November 07, 2014 9:36:06 PM (Pacific Standard Time, UTC-08:00)
 Sunday, October 05, 2014

Waste heat reclamation in datacenters has long been viewed as hard because the heat released is low grade. What this means is that rather than having a great concentration of heat, it is instead spread out and, in fact, only warm. The more concentrated the heat, the easier it is to use. In fact, that is exactly how many power plants work. When the cooling medium carrying the heat is far cooler than burning fuels such as LNG, petroleum, or coal, extracting useful energy becomes challenging.
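
One way to put a number on why low grade heat is hard to use is the Carnot limit, the thermodynamic upper bound on how much of a heat flow can be turned into work between a hot source and a cold sink. The sketch below is illustrative only: the three temperatures are assumptions picked for the example, not measurements from any facility, and real equipment lands well below the bound.

```python
def carnot_efficiency(hot_c: float, cold_c: float) -> float:
    """Upper bound on the fraction of heat convertible to work (Carnot limit)."""
    hot_k = hot_c + 273.15   # convert Celsius to Kelvin
    cold_k = cold_c + 273.15
    if hot_k <= cold_k:
        raise ValueError("hot source must be warmer than the cold sink")
    return 1.0 - cold_k / hot_k


# All three temperatures are assumptions chosen for illustration.
AMBIENT_C = 20.0          # where the heat is finally rejected
SERVER_EXHAUST_C = 45.0   # low grade heat: warm data center exhaust
COMBUSTION_C = 1400.0     # high grade heat: burning fuel

low_grade = carnot_efficiency(SERVER_EXHAUST_C, AMBIENT_C)
high_grade = carnot_efficiency(COMBUSTION_C, AMBIENT_C)

print(f"data center exhaust: at most {low_grade:.1%} of the heat can become work")
print(f"fuel combustion:     at most {high_grade:.1%} of the heat can become work")
```

Under these assumed temperatures the bound for data center exhaust is under a tenth of the heat, while the bound for combustion is above four fifths, which is the gap the paragraph above describes.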


However, data center heat reclamation is clearly a problem well worth solving since just about 100% of the power that enters each facility is released as heat into the environment. Ironically, not only is almost all the power that comes into the data center released to the environment as heat energy but a considerable amount of energy is actually expended getting rid of this energy. The cooling plant, pumps, fans, water towers, etc. all take more power with their only contribution being to effectively remove all the power that we just finished buying.


Many ways have been proposed to greatly reduce the cost of datacenter cooling including the elimination of process-based cooling. A great example of this is the Facebook Mechanical System Design. This is good work that I still like today because it’s getting the cost of cooling down closer to zero. Similar approaches using water rather than air as the heat transport medium can be even more efficient. For example, the Deep Green Datacenter near Zurich uses cold water from a lake to cool the facility (46MW with Water Cooling at a PUE of 1.10). These systems get rid of the waste heat efficiently.

But, returning to waste heat reclamation, what about actually using the heat released from a data center rather than attempting to discard it at very low cost? Great effort has been expended attempting to generate power from the low grade waste heat from a data center but this hasn’t yielded great results thus far. What has been effective is using the heat in another useful process. In the past, I’ve proposed the use of data center waste heat in the growing of high value crops like tomatoes in climates ordinarily too cool for this crop. Another example of data center heat reclamation I recently pointed out was a proposal to use the heat from a datacenter to heat sea water prior to entry into a desalination plant. This increases the efficiency of the desalination plant while providing near free cooling for the data center (Data Center Cooling Done Differently). Yet another approach I recently came across that appears to have potential was a proposal by Teracool to use datacenter waste heat as part of the heating process in Liquefied Natural Gas (LNG) vaporization.


LNG is seeing a rapid increase in use in world energy markets, driven by it being much cleaner than coal and by its availability in large quantity thanks to hydraulic fracturing. LNG, like all petroleum products, is often shipped via large sea-going cargo ships. But, unlike petroleum, LNG is delivered for consumption in gaseous form although it is shipped in its liquid state. To liquefy it prior to shipment, it must be cooled to below -260F which, of course, requires considerable energy. On the delivery side, the liquid must be heated to return the product to its gaseous form. Ironically, this again requires considerable energy. In fact, some vaporization plants consume as much as 1.75% of the product in vaporization. What Teracool proposes is using this cold (absence of heat) to cool data centers. Going a bit further, they propose using the waste heat from the data center to drive vaporization and then, as the product expands 600x from liquid state to gaseous state, using the power of expansion to drive gas turbines to generate power, further reducing the energy loss and supplying at least part of the data center power requirements.


Research in the data center world has swung from doing excellent work in getting rid of waste heat to focusing on recovering this vast store of energy. I expect we are going to be seeing many interesting examples of how data centers can be run more efficiently by harvesting the energy released by the operation of the data center.


A related article on Teracool: Teracool’s Audacious Idea: Data Centers Next to LNG Plants

Teracool Presentation with more details: Recovering Cryogenic Refrigeration Energy


--James Hamilton


Sunday, October 05, 2014 6:04:16 PM (Pacific Standard Time, UTC-08:00)
 Monday, September 01, 2014

Dileep Bhandarkar put together a great presentation for the Computer History Museum a couple of weeks back. I have no idea how he got through the full presentation in under an hour – it covers a lot of material – but it’s an interesting walk through history. Over the years, Dileep has worked for Texas Instruments, Digital, Intel, Microsoft, and Qualcomm. As a consequence, he’s been near the early days of semiconductors, the rise and fall of the mini-computer during his 17 years at Digital, the rise to dominance of the microprocessor in his 12 years at Intel, the emergence of high-scale computing during his ½ decade at Microsoft, and he is now at Qualcomm.


I’ll touch on some of the high points but you should really read through the presentation in its entirety. Well worth it:


From 2,300 transistors to well over 1 billion in less than 40 years:



November 15, 1971, while I was still in high school, Intel released the world’s first microprocessor, the 4004.


Also in 1971, Intel delivered the 1103 DRAM memory and the move from core memory to DRAM had begun.

The IBM S/370 and the DEC PDP 10 were both iconic systems both of which I have used. In the case of the PDP 10, that was the system I used as an intern at the National Research Council in Ottawa. In 1986 working on an Ada Compiler for IBM in Toronto, I used a S/370 Model 600J the biggest water cooled S/370 IBM ever produced.


In 1974, Intel released the 8080, the processor that powered the Altair.

In 1977 Digital Equipment Corporation introduced the enormously successful VAX 11/780. I used that system running BSD as I completed my undergraduate degree at the University of Victoria. It was actually an 11/785. Essentially an 11/780 with a floating point unit. A correction from Dileep: The 780 was built using standard TTL SSI from TI and had a cycle time of 200ns. It included the FP780 (4 boards). The 785 was a quick release using Schottky TTL which allows us to reduce the cycle time to 133ns. Star became SuperStar.


In 1979 Motorola released the 68000, one of my favorite instruction set architectures of the time.


1985 marked the release of the venerable Intel 386, Intel’s first 32 bit processor, initially operating at 16MHz.

In 1987 Sun introduced the incredibly successful SPARC architecture.


Mobile processors are now where the volume is with over 7 billion smart phones expected to ship between 2013 and 2017.


It’s an excellent presentation and a super interesting slice through history. Check it out in full at: From Mainframe To SmartPhone and the video is at:


--James Hamilton


Monday, September 01, 2014 4:38:29 PM (Pacific Standard Time, UTC-08:00)
 Wednesday, August 13, 2014

Over the last 10 years, there has been considerable innovation in data center cooling. Large operators are now able to operate at a Power Usage Effectiveness (PUE) of 1.10 to 1.20. This means that less than 20% of the power delivered to the facility is lost to power distribution and cooling. These days, very nearly all of the power delivered to the facility is delivered to the servers.
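
The relationship between a PUE figure and the share of power lost to overhead is simple arithmetic, sketched below. PUE is total facility power divided by the power reaching the IT equipment, so the overhead share of facility power is 1 - 1/PUE. The function name is made up for this example, and the 2.00 row is included only as a contrast point.

```python
def overhead_fraction(pue: float) -> float:
    """Fraction of total facility power that never reaches the IT equipment.

    PUE = total facility power / IT equipment power, so the IT share of
    facility power is 1 / PUE and everything else is overhead.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue


for pue in (1.10, 1.20, 2.00):
    print(f"PUE {pue:.2f}: {overhead_fraction(pue):.1%} of facility power is overhead")
```

At a PUE of 1.20, roughly one sixth of the facility power goes to distribution and cooling, which is consistent with the "less than 20%" figure above.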


I would never say there is no more innovation coming in an area but most of the ideas I’ve been seeing recently in data center cooling designs are familiar. Good engineering and often somewhat more efficient than past approaches but still largely the same as previous work. However, in recent discussions with DeepWater Desal, I came across an idea that I really think has potential. In this approach the co-location of a desalination plant and data centers is used to reduce the power consumption of both.


DeepWater Desal plans to build a desalination plant  at Monterey Bay. Desalination produces drinking water from sea water. Given the abundance of sea water in the world and the shortage of drinking water in many parts of the world, these plants are becoming more common. They are fairly power intensive techniques but still used extensively throughout the world especially in the Middle East.

DeepWater Desal proposes to mitigate the power consumption of desalination in a very creative way. Rather than reduce the power required to desalinate water, they propose to co-locate up to 150MW of data center facilities on site and reduce the power required to cool the data center. Essentially the desalination plant and data centers would be symbiotic and the overall power consumption of the combination of the two plants together would be lower.


Here’s how it works. In order to avoid plankton and other life forms that plug up the plant’s filters and increase operating costs, the desalination plant will be drawing water from 100’ below the surface in Monterey Bay. This water will have upwelled from even deeper down the canyons of Monterey Bay and will be quite cold.


Taking water from lower in the bay reduces the potential for negative impact on the local ecosystem by putting the intake below the majority of it but this has the downside of sourcing much colder water. Cold water is less efficient to desalinate and, consequently, considerably more water will need to be pumped, which increases the pumping power expenses considerably. If the water is first run through the data center cooling heat exchanger, at very little increased pumping losses, the data center now gets cooled for essentially free (just the costs of circulating their cooling plant). And, as an additional upside, the desalination plant gets warmer feed water which can reduce pumping losses by millions of dollars annually. A pretty nice solution.
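
To get a feel for how much warmer the feed water could get, the heat balance Q = m * c_p * dT can be rearranged for the temperature rise. This is a rough sketch under stated assumptions: the 150MW figure is the data center capacity mentioned above, but the intake flow rate and the specific heat of sea water are assumed values for illustration, not figures from DeepWater Desal.

```python
SEAWATER_CP_J_PER_KG_K = 3990.0  # approximate specific heat of sea water (assumption)


def feed_water_temperature_rise(heat_watts: float, flow_kg_per_s: float) -> float:
    """Temperature rise (kelvin) of water absorbing heat_watts at a given mass flow.

    Rearranged from the steady-state heat balance Q = m_dot * c_p * dT.
    """
    if flow_kg_per_s <= 0:
        raise ValueError("flow must be positive")
    return heat_watts / (flow_kg_per_s * SEAWATER_CP_J_PER_KG_K)


DATA_CENTER_HEAT_W = 150e6        # up to 150MW of co-located data center load
ASSUMED_INTAKE_KG_PER_S = 5000.0  # hypothetical intake flow, for illustration only

rise = feed_water_temperature_rise(DATA_CENTER_HEAT_W, ASSUMED_INTAKE_KG_PER_S)
print(f"feed water warms by about {rise:.1f} K before reaching the desalination plant")
```

The same function can be turned around: a smaller intake flow gives a larger temperature rise for the same heat load, so the achievable feed water temperature depends directly on how much water the plant draws.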


There have been many examples in the past of data centers cooled by deep water cooling. For example: 46MW with Water Cooling at a PUE of 1.10. There have also been examples of data centers cooled using salt water: Google Opening Saltwater-cooled data center. What’s different and interesting in this case is someone else is covering most of the data center pumping costs and there are additional and quite substantial gains in delivering warmer water to the co-located desalination plant.


Since desalination, even when done efficiently, is a power intensive business, a new municipal utility is being created that will deliver power to the co-located data center facilities at 6 to 8 cents per kWh, which is higher than some geographies but is actually quite a good rate for data center commercial power in California.


If you are interested in siting a data center in Monterey that is better for the environment, cheaper to operate, and not a bad place to live, contact Grant Gordon, COO of DeepWater Desal.

--James Hamilton


Wednesday, August 13, 2014 5:12:09 PM (Pacific Standard Time, UTC-08:00)
 Wednesday, July 02, 2014

We all know that when designing and operating applications at scale, it is persistent state management that brings the most difficult challenges. Delivering state-free applications has always been (fairly) easy. But most interesting commercial and consumer applications need to manage persistent state. Advertising needs to be delivered, customer activity needs to be tracked, and products need to be purchased. Interesting applications that run at scale all have difficult persistent state problems. That’s why Amazon.com, other AWS customers, and even other AWS services make use of the various AWS database platform services. Delegating the challenge of managing high-performance, transactional, and distributed data management to underlying services makes applications more robust while reducing operational overhead and design complexity. But it makes these mega-scale database systems absolutely critical resources that need to be rock solid with very well understood scaling capabilities.

One such service that is heavily used by AWS, Amazon.com, and external services is DynamoDB. On the surface, DynamoDB looks fairly simple. It has only 13 APIs and has been widely adopted due to its ease of use and because it does the undifferentiated heavy lifting of dealing with disk failures, host issues, network partitions, etc. However, that external simplicity is built on a hidden substrate of complex distributed systems. Such complex internals are required to achieve high availability while running on cost-efficient infrastructure, and also to cope with rapid business growth. As an example of this growth, DynamoDB is handling millions of transactions every second in a single region while continuing to provide single digit millisecond latency even in the midst of disk failures, host failures and rack power disruptions.

Services like DynamoDB rely on fault-tolerant distributed algorithms for replication, consistency, concurrency control, provisioning, and auto scaling. There are many such algorithms in the literature, but combining them into a cohesive system is a significant challenge, as the algorithms usually need to be modified in order to interact properly in a real-world production system. In addition, we have found it necessary to invent algorithms of our own.

We all know that increasing complexity greatly magnifies the probability of human error in design, code, and operations. Errors in the core of the system could potentially cause service disruptions and impact our service availability goals. We work hard to avoid unnecessary complexity, but the complexity of the task remains high. Before launching a service or a new feature to a service like DynamoDB, we need to reach extremely high confidence that the core of the system is correct.

Historically, software companies do this using a combination of design documents, design reviews, static code analysis, code reviews, stress testing, fault injection testing and many other techniques. While these techniques are necessary they may not be sufficient. If you are building a highly concurrent replication algorithm that serves as the backbone for systems like DynamoDB, you want to be able to model partial failures in a highly concurrent distributed system. Moreover, you want to capture the failures at design level since these can be harder to find in testing.

This is where AWS teams and Amazon DynamoDB and transactional services teams embarked on a path of building precise designs for the services they are building.  While one can argue that traditional methods of writing design docs serve a similar purpose we found that design docs lack precision. This is because they are written in prose and diagrams and they are not easy to test in an automated fashion. At the other end of the spectrum, once we have implemented the system in code, it becomes much too complex to establish algorithm correctness or to debug subtle issues. To this end, we looked for a way to express designs that will eliminate hand waving and that has sufficient tools that can be applied to check for errors in the design.

We wanted a language that allowed us to express things like replication algorithms or distributed algorithms in hundreds of lines of code. We wanted the tool to have existing ecosystems that allowed us to test various failure conditions at design level quickly. We found what we were looking for in TLA+, a formal specification language invented by ACM Turing award winner, Leslie Lamport. TLA+ is based on simple discrete math, basic set theory and predicates with which all engineers are quite familiar. A TLA+ specification simply describes the set of all possible legal behaviors (execution traces) of a system. While TLA+ syntax was unfamiliar to our engineers, TLA+ is accompanied by PlusCal, a pseudo code language that is closer to a C-style programming language. Several engineers at Amazon have found they are more productive in PlusCal than TLA+. However, in other cases, the additional flexibility of plain TLA+ has been very useful. 
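
To give a feel for what checking a design against all of its possible behaviors means, here is a toy sketch in Python. It is not TLA+ or PlusCal and it is not how the DynamoDB designs are written; it only mimics, in miniature, the kind of exhaustive state exploration a model checker performs. The two-process counter and every name in it are invented for the example: each process reads a shared counter and then writes back the value plus one, as two separate steps, and the property checked is that the counter equals the number of processes once all have finished.

```python
from collections import deque

N_PROCS = 2  # hypothetical toy system: two processes incrementing one counter


def initial_state():
    # (shared counter, program counter per process, local copy per process)
    return (0, ("read",) * N_PROCS, (0,) * N_PROCS)


def next_states(state):
    """Yield (step label, successor state) for every process that can take a step."""
    counter, pcs, local = state
    for p in range(N_PROCS):
        if pcs[p] == "read":
            new_local = local[:p] + (counter,) + local[p + 1:]
            new_pcs = pcs[:p] + ("write",) + pcs[p + 1:]
            yield f"p{p} reads {counter}", (counter, new_pcs, new_local)
        elif pcs[p] == "write":
            new_counter = local[p] + 1
            new_pcs = pcs[:p] + ("done",) + pcs[p + 1:]
            yield f"p{p} writes {new_counter}", (new_counter, new_pcs, local)


def invariant(state):
    """Once every process is done, no increment may have been lost."""
    counter, pcs, _ = state
    if all(pc == "done" for pc in pcs):
        return counter == N_PROCS
    return True


def check():
    """Breadth-first search over all reachable states; return a violating trace or None."""
    start = initial_state()
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return list(reversed(trace))
        for label, nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


counterexample = check()
print("invariant holds" if counterexample is None else " -> ".join(counterexample))
```

The search finds an interleaving in which both processes read the counter before either writes, so one increment is lost. Stress tests can run for a long time without hitting an interleaving like that, whereas exhaustive exploration of a small design model reaches it immediately.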

PlusCal and TLA+ have proven very effective at establishing and maintaining the correctness through change of the fundamental components on which DynamoDB is based. We believe having these core components correct from day one has allowed the DynamoDB system to evolve more quickly and scale fast while avoiding the difficult times often experienced by engineers and customers early in a distributed systems life.

I’ve always been somewhat skeptical of formal methods in that some bring too much complexity to actually be applied to commercial systems while others tend to abstract much of the complexity but, in abstracting away complexity, give up precision. TLA+ and PlusCal appear to skirt these limitations and we believe that having the core and most fundamental algorithms on which DynamoDB is based provably correct helps speed innovation, ease the inherent complexity of scaling and, overall, improves both the customer experience and the experience of engineers working on the DynamoDB system at AWS. While this is an important part of our system, there are hundreds of innovations we do in building and operating robust scalable distributed systems on how we design, build, test, deploy, monitor and operate these large scale services.

If you are interested in reading more about the usage of TLA+ within AWS, this is a pretty good source: Use of Formal Methods at Amazon Web Services. And, if you are interested in joining the AWS database team and, especially if you have the experience, background, and interest to lead a team in our database group, drop me a note with your current resume. We’re expanding into new database areas quickly and growing existing services even faster.

--James Hamilton


Wednesday, July 02, 2014 4:04:52 PM (Pacific Standard Time, UTC-08:00)
 Monday, June 09, 2014

The internet and the availability of content broadly and uniformly to all users has driven the largest wave of innovation ever experienced in our industry. Small startups offering a service of value have the same access to customers as the largest and best funded incumbents. All customers have access to the same array of content regardless of their interests or content preferences. Some customers have faster access than others but, whatever the access speed, all customers have access to all content uniformly. Some countries have done an amazing job of getting high speed access to a very broad swath of the population. South Korea has done a notably good job on this measure. The US is nowhere close to the top by this measure nor does the US have anything approaching the best price/performing access. But it's nowhere close to the worst either. And, up until the most recent Federal Communications Commission (FCC) proposal, it’s always been the case that when a user buys internet connectivity they get access to the entire internet.

Given how many huge US companies have been built upon the availability of broad internet connectivity, it's at least a bit surprising that the US market isn't closer to the best connected. And given that the US still has the largest number of internet-based startups -- almost certainly some of the next home run startups will come from this group -- one would expect a very strong belief in the importance of maintaining broad access to all content and the importance of even the smallest startups having uniform access to customers. Surprisingly, this is not the case and the US Federal Communications Commission has proposed that network providers should be able to choose what content their customers get access to and at what speed.

Clearly large owners of eyeball networks like Comcast and Verizon would like to have a two-sided market. In this model, they would like to be able to charge customers to access the internet and at the same time charge content providers to have access to "their" customers. I abstractly understand why they would want to be allowed to charge both consumers and providers. There is no question that this will be highly profitable. In many ways, that's one of the wonderful things about the free market economy. Companies will be very motivated to do what is best for themselves and their shareholders and this has driven much innovation. As long as there is competition without collusion, it's mostly a good thing. But there are potential downsides. There need to be guardrails to prevent the free market from doing things that are bad for society. Society shouldn’t let companies use child labor, for example.
Charging content providers for the privilege of being able to provide services to customers that want access to their services and have paid Comcast, Verizon, Time-Warner etc. for access is another example of corporate behavior not in the best interests of society as a whole. We want customers who paid to access the internet to get access to all of it and not just the slice of content from providers that paid the most. Comcast should not be the arbiter of who customers get to buy services from. Surprisingly, that is what the FCC is currently proposing.

Allowing last-mile network providers to decide which services are available or usable by a large swath of customers without access to competing services is a serious mistake. Losing network neutrality is not good for customers, it's not good for content providers, it's not good for innovation but it's oh so very good for Verizon and Comcast. Just as using child labor isn't something we allow companies to do even if it could help them to be more profitable, we should not allow these companies to be able to hold content providers hostage. Without network neutrality content providers must pay last mile network providers or lose access to customers connecting to the internet using those networks. This is why Netflix has been forced to grudgingly pay the “protection” money required by the major eyeball networks.


Netflix is perhaps the best example of a provider that customers would like to have access to but has been forced to pay last mile network owners like Verizon even though Verizon customers have already paid for access to Netflix.  Some recent articles:


·         Netflix blames Sluggish streaming on Verizon, other service providers

·         Netflix Agrees to Pay Verizon for faster Internet, Too

·         Netflix is Still Mad at Verizon and Has the Charts to Prove It

·         Netflix got worse on Verizon even after Netflix agreed to pay Verizon

The open internet and network neutrality have helped to create a hotbed of innovation and have allowed new winners to emerge every day. The proposal to give them up is hard to understand and is open to cynical interpretations revolving around the power of lobbyists and campaign contributions over what is best for the constituents.

Network neutrality is a serious topic but comedian John Oliver has done an excellent job of pointing out how ludicrous this proposed legislation really is. In fact, Oliver did such a good job of appealing to the broader population to actively comment during the FCC comment period on this proposed legislation that the FCC web site failed under the load. This video is a bit long at just over 13 minutes but it really is worth watching John Oliver point out some of the harder to explain aspects of this legislation:

And, once you have seen the video, leave your feedback for the FCC on their proposal to give up network neutrality at
Enter your comments against the ironically titled "14-28 Protecting and Promoting the Open Internet." My comments follow:


The internet and the availability of content broadly and uniformly to all users (network neutrality) has driven the largest wave of innovation ever experienced in the technology sector. Small startups offering a service have the same access to customers as the largest and best funded incumbents. All customers have access to the same array of content regardless of their interests or content preferences. Some customers have faster access than others but, whatever the access speed, all customers have access to all content uniformly. Some of the most successful companies in the country have been built on this broad and uniform access to customers. The next wave of startups is coming and achieving incredible valuations in IPOs or acquisitions. Giving up network neutrality puts our technology industry at risk by making it harder for new companies to get access to customers and places far too much power in the hands of access network providers where we have few competitive alternatives.


--James Hamilton

Monday, June 09, 2014 3:07:15 PM (Pacific Standard Time, UTC-08:00)
 Tuesday, May 13, 2014

It’s difficult to adequately test complex systems. But what’s really difficult is keeping a system adequately tested. Creating systems that do what they are designed to do is hard but, even with the complexity of these systems, many life critical systems have the engineering and production testing investment behind them to be reasonably safe when deployed. It’s keeping them adequately tested over time, as conditions and the software system change, where we sometimes fail.


There are exceptions to the general observation that we can build systems that operate safely when inside reasonable expectations of expected operating conditions. One I’ve written about was the Fukushima Dai-1 nuclear catastrophe. Any reactor design that doesn’t anticipate total power failure, including backups, is ignoring considerable history. Those events, although rare, do happen and the life critical designs need to expect them and this fault condition needs to be tested in a live production environment. “Nearly expected” events shouldn’t bring down life critical systems. The more difficult to protect against are 1) the impact of black swan events and 2) the combination of vastly changing environmental conditions over time.


Another even more common cause of complex system failure is the impact of human operators. Most systems are exposed to human factors and human error will always be a leading cause of complex system failure. One I’ve written about and continue to follow is the Costa Concordia Grounding. It is well on its way to becoming a textbook case on how human error can lead to loss of life and how poor decisions can cascade to provide opportunity for yet more poor decisions by the same people that got the first one wrong and are now operating under incredible stress. It’s a super interesting situation that I will likely return to in the near future and summarize what has been learned over the last few years of investigation, salvage operation, and court cases.


Returning to the two other causes of complex system failure I mentioned earlier: 1) the impact of black swan events and 2) compounding changes in environmental conditions. Much time gets spent on how to mitigate the negative impact of rare and difficult to predict events. The former are, by definition, very difficult to adequately predict so most mitigations involve reducing the blast radius (localizing the negative impact as much as possible) and designing the system to fail in a degraded but non-life threatening state (degraded operations mode). The more mundane of these two conditions is the second, compounding changes in the environmental conditions in which the complex system is operating. These are particularly difficult in that, quite often, none of the changes are large and none happen fast but scale changes, traffic conditions change, workload mix changes, and the sum of all changes over a long period of time can put the system in a state far outside of those anticipated by the original designers and, consequently, never tested.


I came across an example of these latter failure modes in my recent flight through Los Angeles Airport on April 30th. While waiting for my flight from Los Angeles north to Vancouver, we were told there had been a regional air traffic control system failure and the entire Los Angeles area was down. These regional air traffic control facilities are responsible for the air space between airports. In the US, there are currently 22 Area Control Centers referred to by the FAA as Air Route Traffic Control Centers (ARTCC). Each ARTCC is responsible for a portion of the US air space outside of the inverted pyramid controlled by each airport. The number of ARTCCs has been increasing but, even with larger numbers, the negative impact of one going down is broad. In this instance traffic for the LAX, Burbank, Long Beach, Ontario, and Orange County airports was all brought to a standstill.


As I waited to board my flight to Vancouver, more and more aircraft were accumulating at LAX. Over the next hour or so the air space in the region drained as flights landed or were diverted but none departed. 50 flights were canceled and 428 flights were delayed. The impact of the cancelations and delays rippled throughout North America and probably world-wide. As a consequence of this delay, I missed my flight from Vancouver to Victoria much later in the evening and many other passengers passing through Vancouver were impacted even though it’s a long way away in a different country many hours later. An ARTCC going down for even a short time can lead to delays in the system that take nearly a day to fully resolve.


Having seen the impact personally, I got interested in what happened. What took down the ARTCC system controlling the entire Los Angeles region? It has taken some time and, in the interim, the attention of many news outlets has wandered elsewhere but this BBC article summarizes the failure: Air Traffic Control Memory Shortage Behind Air Chaos and this article has more detail: Fabled U-2 Spy Plane Begins Farewell Tour by Shutting Down Airports in the L.A. Region.


The failing system at the ARTCC was the En-Route Automation Modernization (ERAM) system, which can track up to 1,900 concurrent flights using data from many sensors including 64 RADAR deployment systems. The system was deployed in 2010 so it’s fairly new but we all know that, over time, all regions get busier. The airports get better equipment allowing them to move planes safely at greater rates, take-off and landing frequency goes up, some add new runways, most add new gates. And the software system itself gets changes, not all of which will improve scaling or maximum capability. Over time, the load keeps going up and the system moves further from the initial test conditions under which it was designed, developed, and tested. This happens to many highly complex systems and some end up operating at an order of magnitude higher load or a different load mix than originally anticipated. This is a common cause of complex system failure. We know that systems eventually go non-linear as the load increases so we need to constantly probe 10x beyond what is possible today to ensure that there remains adequate headroom between possible operating modes and the system failure point.
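To put a rough shape on the point that systems go non-linear as load increases, here is a minimal sketch using the textbook M/M/1 queueing model, where mean time in the system is 1/(mu - lambda). The model, the function names, and the 100 requests per second capacity are all assumptions chosen for illustration; nothing here describes ERAM or any FAA system.

```python
def mm1_latency(service_rate: float, arrival_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("load at or beyond capacity: the queue grows without bound")
    return 1.0 / (service_rate - arrival_rate)


def headroom(service_rate: float, arrival_rate: float) -> float:
    """How many times the current load the system could absorb before saturating."""
    return service_rate / arrival_rate


SERVICE_RATE = 100.0  # hypothetical capacity in requests per second

for load in (10.0, 50.0, 80.0, 90.0, 99.0):
    latency_ms = mm1_latency(SERVICE_RATE, load) * 1000
    print(f"load={load:5.1f}/s  latency={latency_ms:7.1f} ms  "
          f"headroom={headroom(SERVICE_RATE, load):5.2f}x")
```

In this toy model, going from 50 to 90 requests per second raises mean latency from 20 ms to 100 ms, and the last few percent of capacity cost far more than the first half. A real system will have a differently shaped curve, which is exactly why it has to be measured by probing rather than assumed.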


The FAA ERAM system is a critical life safety system so presumably it operates with more engineering headroom and safety margin than some commercial systems but yet it still went down hard and all backups failed in this situation. What happened in this case appears to have been a combination of slowly ramping load and a rare event. In this case a U-2 spy plane was making a high altitude pass over the Western US as part of its farewell tour. The U-2 is a spy plane with an operational ceiling of 70,000’ (13.2 miles above earth) and a range of 6,405 miles at a speed of 500 mph. It flies well above commercial air traffic. The U-2 is neither particularly fast nor long range but you have to remember its first flight was in 1955. It’s an aging technology but it was incredibly advanced when it was first produced by the famous Lockheed Skunk Works group led by Kelly Johnson. Satellite imagery and drones have largely replaced the U-2 and the booming $32,000 bill for each flight hour has led to the planned retirement of the series.


What brought the ERAM system down was the typically heavy air traffic in the LA region combined with an imprecise flight plan filed for the U-2 spy plane passing through the region on a farewell flight. It’s not clear why a plane flying well above the commercial flight ceilings was viewed as a collision threat by ERAM but the data stream driven by the U-2 flight ended up consuming massive amounts of memory, which brought down both the primary and secondary software systems, leaving the region without air traffic control.


The lessons here are at least twofold. First, as complex systems age, the environmental conditions under which they operate change dramatically. Workload typically goes up, workload mix changes over time, and there will be software changes made over time, some of which will change the bounds of what the system can reliably handle. Knowing this, we must always be retesting production systems with current workload mixes and we must probe the bounds well beyond any reasonable production workload. As these operating and environmental conditions evolve, the testing program must be updated. I have seen complex systems fail where, upon closer inspection, it’s found that the system has been operating for years beyond its design objectives and its original test envelope. Systems have to be constantly probed to failure so we know where they will operate stably and where they will fail.


The second lesson is that rare events will happen. I doubt a U-2 pass of the western US is all that rare but something about this one was unusual. We need to expect that complex systems will face unexpected environmental conditions and look hard for some form of degraded operations mode. Fault containment zones should be made as small as possible, and we want to look for ways to deliver the system such that some features may fail while others continue to operate. You don’t want to lose all functionality for the entire system in a failure. For all complex systems, we are looking for ways to divide up the problem such that a fault has a small blast radius (less impact) and we want the system to gracefully degrade to less functionality rather than have the entire system hard fail. With backup systems it’s particularly important that they are designed to fail down to less functionality rather than also hard failing. In this case, having the backup come up and require more flight separation and do fewer real time collision calculations might be the right answer. Generally, all systems need to have some way to operate at less than full functionality or less than full scale without completely hard failing. These degraded operation modes are hard to find but I’ve never seen a situation where we couldn’t find one.
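As a rough sketch of what "fail down to less functionality" can look like in code, here is a small example of a mode selector that sheds its most expensive feature when it runs out of memory budget rather than crashing. Everything in it is hypothetical: the `plan_for_load` function, the cost model where conflict prediction triples per-flight memory, and the doubled separation factor are invented for illustration and say nothing about how ERAM is actually built.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "full"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class Plan:
    mode: Mode
    conflict_prediction: bool  # the expensive real time collision calculations
    separation_factor: float   # multiplier applied to normal aircraft spacing
    tracked: int               # number of flights the system agrees to track


def plan_for_load(flights: int, memory_per_flight_mb: float,
                  memory_budget_mb: float) -> Plan:
    """Choose an operating mode that fits the memory budget instead of hard failing."""
    # Hypothetical cost model: conflict prediction triples per-flight memory.
    full_cost_mb = flights * memory_per_flight_mb * 3
    if full_cost_mb <= memory_budget_mb:
        return Plan(Mode.FULL, True, 1.0, flights)
    # Degraded mode: drop the expensive feature, widen spacing to compensate,
    # and cap tracked flights at what basic tracking can hold in memory.
    basic_capacity = int(memory_budget_mb // memory_per_flight_mb)
    return Plan(Mode.DEGRADED, False, 2.0, min(flights, basic_capacity))


print(plan_for_load(500, 1.0, 3000.0))   # fits within budget: full functionality
print(plan_for_load(1900, 1.0, 3000.0))  # over budget: degrade rather than crash
```

The design choice worth noting is that the over-budget path returns a valid, reduced plan rather than raising an error, so the caller always has something to operate with.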


It’s not really relevant but still ironic that the trigger for the fault, the U-2 spy plane, and the system brought down by the fault, ERAM, were both Lockheed products, although released 55 years apart.


Unfortunately, it’s not all about tuning the last bit of reliability out of complex systems. There is still a lot of low hanging fruit out there. Even some relatively simple systems can be poorly engineered and produce remarkably poor results when working well within their design parameters. As proof, on a 747-400 flying back to New Zealand from San Francisco, at least 3 passengers lost bags. My loss appears to be permanent at this point. It’s amazing that couriers can move 10s of millions of packages a year over far more complex routings and yet, in passenger air travel baggage handling, the results are several orders of magnitude worse. I suspect that if the operators were economically responsible for the full loss, the quality of air baggage tracking would come up to current technology levels fairly quickly.

James Hamilton,


Tuesday, May 13, 2014 5:05:57 PM (Pacific Standard Time, UTC-08:00)
 Friday, February 14, 2014

Most agree that cloud computing is inherently more efficient than on-premise computing in each of several dimensions. Last November, I went after two of the easiest to argue gains: utilization and the ability to sell excess capacity (Datacenter Renewable Power Done Right):


Cloud computing is a fundamentally more efficient way to operate compute infrastructure. The increases in efficiency driven by the cloud are many but a strong primary driver is increased utilization. All companies have to provision their compute infrastructure for peak usage. But, they only monetize the actual usage which goes up and down over time. What this leads to is incredibly low average utilization levels with 30% being extraordinarily high and 10 to 20% the norm. Cloud computing gets an easy win on this dimension. When non-correlated workloads from a diverse set of customers are hosted in the cloud, the peak to average flattens dramatically. Immediately, effective utilization skyrockets. Where 70 to 80% of the resources are usually wasted, utilization climbs rapidly with scale in cloud computing services as the peak to average ratio flattens.


To further increase the efficiency of the cloud, Amazon Web Services added an interesting innovation where they sell the remaining capacity not fully consumed by this natural flattening of the peak to average. These troughs are sold on a spot market and customers are often able to buy computing at less than the amortized cost of the equipment they are using (Amazon EC2 Spot Instances). Customers get clear benefit. And, it turns out, it’s profitable to sell unused capacity at any price over the marginal cost of power. This means the provider gets clear benefit as well. And, with higher utilization, the environment gets clear benefit as well.


Back in June, Lawrence Berkeley National Labs released a study that went after the same question quantitatively and across a much broader set of dimensions.  I first came across the report via coverage in Network Computing: Cloud Data Centers: Power Savings or Power drain? The paper was funded by Google which admittedly has an interest in cloud computing and high scale computing in general. But, even understanding that possible bias or influence, the paper is of interest. From Google’s summary of the findings (How Green is the Internet?):


Funded by Google, Lawrence Berkeley National Laboratory investigated the energy impact of cloud computing. Their research indicates that moving all office workers in the United States to the cloud could reduce the energy used by information technology by up to 87%.


These energy savings are mainly driven by increased data center efficiency when using cloud services (email, calendars, and more). The cloud supports many products at a time, so it can more efficiently distribute resources among many users. That means we can do more with less energy.


The paper attempts to quantify the gains achieved by moving workloads to the cloud by looking at all relevant dimensions of savings, some fairly small and some quite substantial. From my perspective, there is room to debate any of the data one way or the other but the case is sufficiently clear that it’s hard to argue that there aren’t substantial environmental gains.


Now, what I would really love to see is an analysis of an inefficient, poorly utilized private datacenter whose operators want to be green and need to compare the installation of a fossil fuel consuming fuel cell power system with dropping the same workload down onto one of the major cloud computing platforms :-).


The full paper is available at: The Energy Efficiency Potential of Cloud-Based Software: A US Case Study.


James Hamilton 


Friday, February 14, 2014 6:37:17 PM (Pacific Standard Time, UTC-08:00)
 Saturday, January 25, 2014

It’s an unusual time in our industry where many of the most interesting server, storage, and networking advancements aren’t advertised, don’t have a sales team, don’t have price lists, and actually are often never even mentioned in public. The largest cloud providers build their own hardware designs and, since the equipment is not for sale, it’s typically not discussed publicly.


A notable exception is Facebook. They are big enough that they do some custom gear but they don’t view their hardware investments as differentiating. That may sound a bit strange -- why spend on something if it is not differentiating? Their argument is that if some other social networking site adopts the same infrastructure that Facebook uses, it’s very unlikely to change social networking market share in any measurable way. Customers simply don’t care about infrastructure. I generally agree with that point. It’s pretty clear that if MySpace had adopted the same hardware platform as Facebook years ago, it really wouldn’t have changed the outcome. Facebook also correctly argues that OEM hardware just isn’t cost effective at the scale they operate. The core of this argument is that custom hardware is needed and that social networking customers go where their friends go whether or not the provider does a nice job on the hardware.


I love the fact that part of our industry is able to be open about the hardware they are building but I don’t fully agree that hardware is not differentiating in the social networking world. For example, maintaining a deep social graph is actually a fairly complex problem. In fact, I remember when tracking friend-of-a-friend over 10s of millions of users, a required capability of any social networking site today, was still just a dream. Nobody had found the means to do it without massive costs and/or long latencies. Lower cost hardware and software innovation made it possible and the social network user experience and engagement has improved as a consequence.


Looking at a more modern version of the same argument, it has not been cost effective to store full resolution photos and videos using today’s OEM storage systems. Consequently, most social networks haven’t done this at scale. It’s clear that storing full resolution images would be a better user experience and it’s another example where hardware innovation could be differentiating.


Of course the data storage problem is not restricted to social networks nor just to photo and video. The world is realizing the incredible value of data at the same time the costs of storage are plummeting. Most companies’ storage assets are growing quickly. Companies are hiring data scientists because even the most mundane bits of operational data can end up being hugely valuable. I’ve always believed in the value of data but more and more companies are starting to realize that data can be game changing to their businesses. The perceived value is going up fast while, at the same time, the industry is realizing that if you have weekly data, it is good. But daily is better, hourly is a lot better, 5 min is awesome but you really prefer 1 min granularity. This number keeps falling. The perceived value of data is climbing, the resolution of measures is becoming finer and, as a consequence, the amount of data being stored is skyrocketing. Most estimates have data volumes doubling on 12 to 18 month centers -- somewhat faster than Moore’s law. Since all operational data backs up to cold storage, cold storage is always going to be larger than any other data storage category.
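A quick back of envelope sketch makes the two multipliers in the paragraph above concrete: how many more samples finer granularity produces, and what a 12 to 18 month doubling time compounds to. The function names and the 5 year horizon are my own choices for illustration; the intervals and doubling periods come from the text.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

# Sampling intervals named in the text, expressed in minutes.
INTERVALS = {
    "weekly": MINUTES_PER_WEEK,
    "daily": 24 * 60,
    "hourly": 60,
    "5 min": 5,
    "1 min": 1,
}


def samples_per_week(interval_minutes: int) -> int:
    """Number of data points one metric produces in a week at this interval."""
    return MINUTES_PER_WEEK // interval_minutes


def growth_factor(months: float, doubling_months: float) -> float:
    """Volume multiplier after `months` when volume doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


for name, minutes in INTERVALS.items():
    print(f"{name:>6}: {samples_per_week(minutes):>6} samples per metric per week")

for doubling in (12, 18):
    print(f"doubling every {doubling} months: "
          f"{growth_factor(60, doubling):.1f}x in 5 years")
```

Moving a single metric from weekly to 1 minute granularity multiplies its sample count by 10,080, before any growth in the number of metrics being collected.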


Next week, Facebook will show work they have been doing in cold storage mostly driven by their massive image storage problem. At OCP Summit V an innovative low-cost archival storage hardware platform will be shown. Archival projects always catch my interest because the vast majority of the world’s data is cold, the percentage that is cold is growing quickly, and I find the purity of a nearly single dimensional engineering problem to be super interesting. Almost the only dimension of relevance in cold storage is cost. See Glacier: Engineering for Cold Data Storage in the Cloud for more on this market segment and how Amazon Glacier is addressing it in the cloud.


This Facebook hardware project is particularly interesting in that it’s based upon optical media rather than tape. Tape economics come from very low cost media combined with only a small number of fairly expensive drives. Robots move the tape back and forth between storage slots and the drives when needed. Facebook is taking the same basic approach of using robotic systems to allow a small number of drives to support a large media pool. But, rather than using tape, they are leveraging the high volume Blu-ray disk market with the volume economics driven by consumer media applications. Expect to see over a Petabyte of Blu-ray disks supplied by a Japanese media manufacturer housed in a rack built by a robotic systems supplier.


I’m a huge believer in leveraging consumer component volumes to produce innovative, low-cost server-side solutions. Optical is particularly interesting in this application and I’m looking forward to seeing more of the details behind the new storage platform. It looks like very interesting work.


James Hamilton 


Saturday, January 25, 2014 6:04:25 PM (Pacific Standard Time, UTC-08:00)
 Saturday, January 04, 2014

It’s not often I’m enthused about spending time in Las Vegas but this year’s AWS re:Invent conference was a good reason to be there. It’s exciting getting a chance to meet with customers who have committed their business to the cloud or are wrestling with that decision.


The pace of growth since last year was startling but what really caught my attention was the number of companies that had made the transition from testing on the cloud to committing their most valuable workloads to run there. I fully expected this to happen but I’ve seen these industry sweeping transitions before. They normally take a long, long time. The pace of the transition to the cloud is, at least in my experience, unprecedented.


I did a 15 min video interview with SiliconANGLE on Wednesday that has been posted:

·         Video:


I did a talk on Thursday that is posted as well:

·         Slides:

·         Video:


The talk focused mainly on the question of is the cloud really different?  Can’t I just get a virtualization platform and do a private cloud and enjoy the same advantages? To investigate that question, I started by looking at AWS scale since scale is the foundation on which all the other innovations are built. It’s scale that funds the innovations and it’s scale that makes even small gains worth pursuing since the multiplier is so large. I then  walked through some of the innovations happening in AWS and talked about why they probably wouldn’t make sense for an on-premise deployment.  Looking at the innovations I tried to show the breadth of what was possible by sampling from many different domains including custom high voltage power substations, custom server design, custom networking hardware, a networking protocol development team, purpose built storage rack designs, the AWS Spot market, and supply chain and procurement optimization.


On scale, I laid out several data points that we hadn’t been talking about externally in the past but still used my favorite that is worth repeating here: Every day AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual enterprise. This point more than any other underlines what I was saying earlier: big companies are moving to the cloud and this transition is happening faster than any industry transition I’ve ever seen.


Each day enough servers are deployed to support a $7B ecommerce company. It happens every day and the pace continues to accelerate. Just imagine the complexity of getting all that gear installed each day. Forget innovation. Just the logistical challenge is a pretty interesting problem.


The cloud is happening in a big way, the pace is somewhere between exciting and staggering, and any company that isn’t running at least some workloads in the cloud and learning is making a mistake.


If you have time to attend re:Invent next year, give it some thought. More than 1/3 of the sessions are from customers talking about what they have deployed and how they did it and it’s an interesting and educational week. Most of the re:Invent talks are posted at:


I’ve not been doing as good a job of posting talks as I should so here’s a cumulative update:

·         2013.11.14: Why Scale Matters and How the Cloud is Different (talk, video), re:Invent 2013, Las Vegas, NV

·         2013.11.13: AWS The Cube at AWS re:Invent 2013 with James Hamilton (video), re:Invent 2013, Las Vegas, NV

·         2013.01.22: Infrastructure Innovation Opportunities (talk), Ycombinator, Mountain View, CA

·         2012.11.29: Failures at Scale and how to Ride Through Them (talk, video), re:Invent 2012, Las Vegas, NV

·         2012.11.28: Building Web Scale Applications with AWS (talk, video), re:Invent 2012, Las Vegas, NV

·         2012.10.09: Infrastructure Innovation Opportunities (talk, video), First Round Capital CTO Summit, San Francisco, CA.

·         2012.09.25: Cloud Computing Driving Infrastructure Innovation (talk), Intel Distinguished Speaker Series, Hillsboro, OR


James Hamilton 


Saturday, January 04, 2014 10:53:08 AM (Pacific Standard Time, UTC-08:00)
 Saturday, November 30, 2013

Facebook Iowa Data Center

In 2007, the EPA released a study on datacenter power consumption at the request of the US Congress (EPA Report to Congress on Server and Data Center Efficiency). The report estimated that the power consumption of datacenters represented about 1.5% of the US Energy Budget in 2005 and that this number would double by 2010. In a way, this report was believable in that datacenter usage was clearly on the increase. What the report didn’t predict was the pace of innovation in datacenter efficiency during that period. Increased use spurred increased investment, which has led to a near 50% improvement in industry leading datacenter efficiency.


Also difficult to predict at the time of the report was the rapid growth of cloud computing. Cloud computing is a fundamentally more efficient way to operate compute infrastructure. The increases in efficiency driven by the cloud are many but a strong primary driver is increased utilization. All companies have to provision their compute infrastructure for peak usage. But, they only monetize the actual usage which goes up and down over time. What this leads to is incredibly low average utilization levels with 30% being extraordinarily high and 10 to 20% the norm. Cloud computing gets an easy win on this dimension. When non-correlated workloads from a diverse set of customers are hosted in the cloud, the peak to average flattens dramatically. Immediately, effective utilization skyrockets. Where 70 to 80% of the resources are usually wasted, utilization climbs rapidly with scale in cloud computing services as the peak to average ratio flattens.
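The flattening of peak to average is easy to see in a toy simulation. The sketch below is only an illustration under stated assumptions: the workloads are synthetic, each one idles at a base load and bursts to 10x at random, independent hours, and the burst probability, workload count, and function names are all invented rather than taken from any real fleet.

```python
import random


def utilization(series):
    """Average load divided by peak load: the share of peak-provisioned capacity in use."""
    return sum(series) / (len(series) * max(series))


def bursty_workload(rng, hours, base=1.0, burst=10.0, burst_prob=0.05):
    """One synthetic customer: mostly idle at `base`, occasionally spiking to `burst`."""
    return [burst if rng.random() < burst_prob else base for _ in range(hours)]


rng = random.Random(42)  # fixed seed so the run is repeatable
hours = 24 * 7
workloads = [bursty_workload(rng, hours) for _ in range(200)]

# Each customer provisions privately for its own peak.
solo = sum(utilization(w) for w in workloads) / len(workloads)

# The same workloads share one pool that is provisioned for the combined peak.
pooled_series = [sum(w[h] for w in workloads) for h in range(hours)]
pooled = utilization(pooled_series)

print(f"average utilization when provisioned separately: {solo:.0%}")
print(f"utilization when pooled:                         {pooled:.0%}")
```

Because the bursts are uncorrelated, they rarely line up, so the pooled peak is far smaller than the sum of the individual peaks. Correlated workloads (everyone peaking at the same hour) would not show the same gain, which is why the diversity of the customer set matters.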


To further increase the efficiency of the cloud, Amazon Web Services added an interesting innovation where they sell the remaining capacity not fully consumed by this natural flattening of the peak to average. These troughs are sold on a spot market and customers are often able to buy computing at less than the amortized cost of the equipment they are using (Amazon EC2 Spot Instances). Customers get clear benefit. And, it turns out, it’s profitable to sell unused capacity at any price over the marginal cost of power. This means the provider gets clear benefit as well. And, with higher utilization, the environment gets clear benefit as well.


It’s easy to see why the EPA datacenter power predictions were incorrect and, in fact, the most recent data shows just how far off the original estimates actually were. The EPA report estimated that in 2005, the datacenter industry was consuming just 1.5% of the nation’s energy budget but what really caused international concern was the projection that this percentage would surge to 3.0% in 2010. This 3% in 2010 number has been repeated so many times that it’s now considered a fact by many and it has to have been one of the most referenced statistics in the datacenter world.


The data on 2010 is now available and the power consumption of datacenters in 2010 is currently estimated to have been in the 1.1 to 1.5% range. Almost no change since 2005. Usage has skyrocketed. Most photos are ending up in the cloud, social networking is incredibly heavily used, computational science has exploded, machine learning is tackling business problems previously not cost effective to address, much of the world now has portable devices and a rapidly increasing percentage of these devices are cloud connected. Server side computing is exploding and yet power consumption is not.


However, the report caused real concern both in the popular press and across the industry and there have been many efforts not only to increase efficiency but also to increase the usage of clean power. The progress on efficiency has been stellar but the early efforts on green power have been less exciting.


I’ve mentioned some of the efforts to go after cleaner data center power in the past. The solar array at the Facebook Prineville datacenter is a good example: It’s a 100kW solar array “powering” a 25MW+ facility. Pure optics with little contribution to the datacenter power mix (I love solar but… ). Higher scale efforts such as the Apple Maiden North Carolina facility are getting much closer to being able to power the entire facility and go far beyond mere marketing programs. But, for me, the land consumed is disappointing with 171 acres of land cleared of trees in order to provide green power. There is no question that solar can augment the power at major datacenters but onsite generation with solar just doesn’t look practical for most facilities, particularly those in densely packed urban settings. In the article Solar at Scale: How Big is a Solar Array with 9MW Average Output, we see that a 9MW average output solar plant takes approximately 314 acres (13.6 million sq ft). This power density is not practical for onsite generation at most datacenters and may not be the best approach to clean power for datacenters.


The biggest issue with solar applied to datacenters is that the massive space requirements are not practical with most datacenters being located in urban centers, and it may not even be the right choice for rural centers. Another approach gaining rapid popularity is the use of fuel cells to power datacenters. At least partly because of highly skilled marketing, fuel cells are considered by many jurisdictions to be “green”. The good news is that some actually are green. A fuel cell running on waste biogas is indeed a green solution with considerable upside. This is what Apple plans for its Maiden North Carolina fuel cell farm and it’s a very nice solution (Apple’s Biogas Fuel Cell Plant Could Go Live by June).


The challenge with fuel cells in the datacenter application is that just about all of them are not biogas powered, relying instead on non-renewable energy. For example, the widely heralded eBay Salt Lake City fuel cell deployment is natural gas powered (New eBay Data Center Runs Almost Entirely on Bloom Fuel Cells). There have been many criticisms of the efficiency of these fuel cell-based power plants (884 may be Bloom Energy’s Fatal Number) and their cost effectiveness (Delaware Press Wakes Up to Bloom Energy Boondoggle).


I’m mostly ignoring these reports and focusing on the simpler question: how can running a low-scale power plant on a fossil fuel possibly be green and is on-site generation really going to be the most environmentally conscious means of powering data centers? Most datacenters lack space for large on-site generation facilities, most datacenter operators don’t have the skill, and generally the most efficient power generation plants are far bigger than a single datacenter could ever consume on its worst day.


The argument for on-site generation is to avoid the power lost in transmission. The US Energy Information Administration (EIA) estimates power lost in transmission to be 7%. That’s certainly a non-zero number and the loss is relevant but I still find myself skeptical that the world leaders in operating efficient and reliable power plants are going to be datacenter operators. And, even if that prospect does excite you, does having a single plant at one location as the sole power source for the datacenter sound like a good solution? I’m finding it hard to get excited. Maintenance, natural disasters, and human error will eventually yield some unintended consequences.


A solution that I really like is the Facebook Altoona Iowa facility (Facebook Says Its New Data Center Will Run Entirely on Wind). It is a standard grid connected datacenter without on-site power generation. But they have partnered to build a 138MW wind project in nearby Wellsburg Iowa. What I like about this approach is: 1) no clear cutting was required to prepare the land for generation and the land remains multi-use, 2) it’s not fossil fuel powered, 3) the facility will be run by a major power generation operator rather than as a sideline by the datacenter operator, and 4) far more clean power is being produced than will be actually used by the datacenter so they are actually adding more clean power to the grid than they are consuming by a fairly significant margin.


Hats off to Facebook for doing clean data center energy right. Nice work.


James Hamilton 


Saturday, November 30, 2013 1:38:31 PM (Pacific Standard Time, UTC-08:00)
 Saturday, November 09, 2013

I frequently get asked “why not just put solar panels on data center roofs and run them on that.” The short answer is datacenter roofs are just way too small. In a previous article (I Love Solar But…) I did a quick back of envelope calculation and, assuming a conventional single floor build with current power densities, each square foot of datacenter space would require roughly 362 sq ft of solar panels. The roof would only contribute roughly 1% of the facility requirements. Quite possibly still worth doing but there is simply no way a roof top array is going to power an entire datacenter.


There are other issues with roof top arrays, the two biggest of which are weight and strong wind protection. A roof requires significant reinforcement to support the weight of an array and the array needs to be strongly secured against strong winds. But both of these issues are fairly simple engineering problems and very solvable. The key problem with powering datacenters with solar is insufficient solar array power density. If we were to believe the data center lighting manufacturers’ estimates (Lighting is the Unsung Hero in Data Center Energy Efficiency), a roof top array wouldn’t be able to fully power the facility lighting system. Most modern data centers have moved to more efficient lighting systems but the difficult fact remains: a roof top array will not supply even 1% of the overall facility draw.


When last discussing the space requirements of a solar array sufficient to power a datacenter in I love Solar but …, a few folks took exception to my use of sq ft as the measure of solar array size. The legitimate criticism raised was that in using sq ft and then computing that a large datacenter would require 181 million sq ft of solar array, I made the solar farm look unreasonably large. Arguably 4,172 acres is a better unit and produces a much smaller number. All true.


The real challenge for most people trying to understand the practicality of solar to power datacenters is to get a reasonable feel for how big the land requirements actually would be. They sound big but data centers are big and everything associated with them is big. Large numbers aren’t remarkable. One approach to calibrating the “how big is it?” question is to go with a ratio. Each square foot of data center would require approximately 362 square feet of solar array, which is one way to get calibration of the true size requirements. That’s a ratio in the hundreds to one, which tells us that roof top solar power is a contributor but not a solution. It also helps explain why solar is not a practical solution in densely packed urban areas, especially where multi-floor facilities are in use.
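For anyone who wants to play with the ratio, here is a small sketch of the arithmetic. The 362:1 ratio and the 181 million sq ft array come from the text above; the simplifying assumptions are mine: the roof is taken to be exactly the footprint of one floor, fully covered in panels, and the function names are invented for illustration.

```python
SQ_FT_PER_ACRE = 43_560

ARRAY_SQ_FT_PER_DC_SQ_FT = 362      # array area needed per sq ft of datacenter (from the text)
LARGE_DC_ARRAY_SQ_FT = 181_000_000  # array size for a large datacenter (from the text)


def roof_fraction(floors: int = 1) -> float:
    """Share of facility power a fully covered roof supplies.

    Assumes the roof area equals the footprint of one floor, so each extra
    floor of datacenter space divides the roof's contribution further.
    """
    return 1.0 / (ARRAY_SQ_FT_PER_DC_SQ_FT * floors)


def sq_ft_to_acres(sq_ft: float) -> float:
    return sq_ft / SQ_FT_PER_ACRE


for floors in (1, 3):
    print(f"{floors} floor(s): roof supplies {roof_fraction(floors):.2%} of facility power")

print(f"{LARGE_DC_ARRAY_SQ_FT:,} sq ft is about {sq_ft_to_acres(LARGE_DC_ARRAY_SQ_FT):,.0f} acres")
```

Under these assumptions a single floor facility gets well under the round 1% figure from its roof, and a multi-floor urban facility gets proportionally less. A straight unit conversion of 181 million sq ft lands within about half a percent of the 4,172 acres quoted above; the small gap presumably comes from rounding in the square footage.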


The Kagoshima Nanatsujima Mega Solar Power Plant, which opened last week, is a large array at over 70MW and it gives us another scaling point on land requirements for large solar plants:

·         Name: Kagoshima Nanatsujima Mega Solar Power Plant

·         Location: 2 Nanatsujima, Kagoshima City, Kagoshima Prefecture, Japan

·         Annual output: Approx. 78,800MWh (projected)

·         Construction timeline: September 2012 through October 2013

·         Total Investment: Approx. 27 billion yen (approx. 275.5 million U.S. dollars)


This is an impressive deployment with 70MW peak output over 314 acres. That’s roughly 4.5 acres/MW which is fairly consistent with other solar plants I’ve written up in the past.  The estimated annual output for Kagoshima is 78,800MWh which is, on average, 9MW output. The reason the expected average output is only 13% of the peak output is solar plants aren’t productive 24x7. 
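The arithmetic behind these figures is easy to check. The following is a minimal sketch that uses only the numbers quoted above; the 8,760-hour year and the variable names are illustrative assumptions, not part of the plant's published data.

```python
# Back-of-envelope check of the Kagoshima figures quoted above.
# Assumption: a non-leap year of 8,760 hours.
HOURS_PER_YEAR = 365 * 24

PEAK_MW = 70.0          # peak output quoted for the plant
ANNUAL_MWH = 78_800.0   # projected annual output
AREA_ACRES = 314.0      # site area

# Average continuous output implied by the annual energy figure.
average_mw = ANNUAL_MWH / HOURS_PER_YEAR

# Fraction of peak output actually delivered over a year.
capacity_factor = average_mw / PEAK_MW

# Land needed per MW of peak capacity.
acres_per_peak_mw = AREA_ACRES / PEAK_MW

if __name__ == "__main__":
    print(f"average output:  {average_mw:.1f} MW")
    print(f"capacity factor: {capacity_factor:.0%}")
    print(f"land per peak MW: {acres_per_peak_mw:.1f} acres")
```

Because a datacenter is close to a constant load, the average output rather than the peak is the number to compare against facility draw.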


Datacenter power consumption tends to be fairly constant day and night, with night being somewhat lower due to typically lower cooling requirements. But with only 10 to 15% of the total power in a modern datacenter going to cooling, the difference between day and night is actually not that large.  Where there are big differences in datacenter power consumption is in the power consumption at full utilization vs partial. Idle critical loads can be as low as 50% of full loads so some variability in power consumption will be seen in a datacenter but datacenters are largely classified as base load (near constant) demand. Using these numbers, the Kagoshima solar facility is large enough to power ½ of a large datacenter.


The Kagoshima Nanatsujima Mega Solar Power Plant is a vast plant and anything at scale is interesting. It’s well worth learning more about the engineering behind this project but one aspect that caught my interest is there are some excellent pictures of this deployment that do a better job of scaling than my previous “several million sq ft” reference. The Kagoshima array is 314 acres which is 13.6 million sq ft or 1.27 million sq meters but these numbers aren’t that meaningful to most of us and pictures tell a far better story.



No matter how you look at it, the Kagoshima array is truly vast. I’m super interested in all forms of energy production and consider myself fairly familiar with numbers on this scale and yet I still find the size of this plant, when shown in its entirety, to be surprising. It really is big.


The cost of solar is falling fast due to deep R&D investment but a facility of this scope and scale is still fairly expensive. At US$275 million the solar plant is more expensive than the datacenter it would be able to power.


Despite the shortage of land space in Japan, expect to see continued investment in large scale solar arrays for at least another year or two. The driver is the Feed-in Tariff legislation, which requires Japanese power companies to buy all the power produced by any solar plant bigger than 10kW at a generous 20 year fixed rate of 42 yen/kWh, which at current exchange rates is US$0.42/kWh (Japan Creates Potential $9.6 Billion Solar Boom with FITs and also 1.8GW of New Solar Power for Japan in Q2). As a point of comparison, commercial power in the US, if purchased in large quantities, is often available at rates below $0.05/kWh.


Yano Research Institute predicts that the massive Japanese solar boom driven by the FIT legislation will grow until 2014 and then taper off due to land availability pressure: Japan Market for Large-Scale Solar to Drop After 2014, Yano Says.


One possible solution to both the solar space consumption problem and the inability to produce power in poor light conditions is orbital-based solar arrays (Japan Aims to Beam Solar Energy Down From Orbit). This technology is nowhere close to a reality today but it is still of interest.


More articles on the Kagoshima Nanatsujima Mega Solar Power Plant:

·         KYOCERA Starts Operation of 70MW Solar Power Plant, the Largest in Japan

·         Japan's biggest solar plant starts generating power in SW. Japan

·         Kyocera completes Japan’s largest offshore solar energy plant in Kagoshima


For those of you able to attend the AWS re:Invent conference in Vegas, I’ll see you there. I’ll be presenting Why Scale Matters and How the Cloud Really is Different Thursday at 4:15 in the Delfino room and I’ll be around the conference all week.


James Hamilton 


Saturday, November 09, 2013 2:29:22 PM (Pacific Standard Time, UTC-08:00)
 Tuesday, July 16, 2013

At the Microsoft World-Wide Partners Conference, Microsoft CEO Steve Ballmer announced that “We have something over a million servers in our data center infrastructure. Google is bigger than we are. Amazon is a little bit smaller. You get Yahoo! and Facebook, and then everybody else is 100,000 units probably or less.”


That’s a surprising data point for a variety of reasons. The most surprising is that the data point was released at all. Just about nobody at the top of the server world chooses to boast about server counts. Partly because it’s not all that useful a number but mostly because a single data point is open to a lot of misinterpretation by even skilled industry observers. Basically, it’s pretty hard to see the value of talking about server counts and it is very easy to see the many negative implications that follow from such a number.


The first question when thinking about this number is where does the comparative data actually come from?  I know for sure that Amazon has never released server count data. Google hasn’t either although estimates of their server footprint abound. Interestingly, estimates of Google server counts 5 years ago were 1,000,000 servers whereas current estimates have them only in the 900k to 1m range.


The Microsoft number is surprising when compared against past external estimates, data center build rates, or ramp rates from previous hints and estimates. But, as I said, little data has been released by any of the large players and what’s out there is typically nothing more than speculation. Counting servers is hard and credibly comparing server counts is close to impossible.


Assume that each server runs 150 to 300W including all server components with a weighted average of say 250W/server. And as a Power Usage Effectiveness estimator, we will use 1.2 (only 16.7% of the power is lost to datacenter cooling and power distribution losses). With these scaling points, the implied total power consumption is over 300MW. That is three hundred million watts or, looking at annual energy, a consumption of over 2,629,743 MWh or 2.6 terawatt hours – that’s a hefty slice of power even by my standards. As a point of scale since these are big numbers, the US Energy Information Administration reports that in 2011 the average US household consumed 11.28MWh. Using that data point, 2.6TWh is just a bit more than the power consumed by 230,000 US homes.
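The power estimate above can be reproduced in a few lines. This is a sketch under the stated assumptions only (1 million servers, 250W each, PUE of 1.2, 11.28MWh per household); the function name and the 365.25-day year are illustrative choices rather than anything Microsoft has published.

```python
# Rough fleet power estimate from a server count.
# All inputs are the assumptions stated in the text, not measured data.
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length


def fleet_power_mw(servers: int, watts_per_server: float, pue: float) -> float:
    """Total facility power in MW, including cooling and distribution overhead."""
    return servers * watts_per_server * pue / 1e6


SERVERS = 1_000_000
WATTS_PER_SERVER = 250.0
PUE = 1.2
HOUSEHOLD_MWH_PER_YEAR = 11.28  # 2011 US average quoted in the text

total_mw = fleet_power_mw(SERVERS, WATTS_PER_SERVER, PUE)
overhead_fraction = (PUE - 1.0) / PUE          # share of power not reaching servers
annual_mwh = total_mw * HOURS_PER_YEAR
equivalent_homes = annual_mwh / HOUSEHOLD_MWH_PER_YEAR

if __name__ == "__main__":
    print(f"total power:      {total_mw:.0f} MW")
    print(f"overhead share:   {overhead_fraction:.1%}")
    print(f"annual energy:    {annual_mwh / 1e6:.2f} TWh")
    print(f"equivalent homes: {equivalent_homes:,.0f}")
```

Changing the per-server wattage or the PUE shows how sensitive the total is to these guesses, which is part of why a bare server count is hard to interpret.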


Continuing through the data and thinking through what follows from “over 1M” servers, the capital expense of servers will be $1.45 billion assuming a very inexpensive $1,450 per server. Assuming a mix of different servers with an average cost of $2,000/server, the overall capital expense would be $2B before looking at the data center infrastructure and networking costs required to house them. With the overall power consumption computed above of 300MW which is 250MW of critical power (using a PUE of 1.2) and assuming a data center build cost at a low $9/W of critical load (Uptime Institute estimates numbers closer to $12/W), we would have a data center build cost of $2.25B.  The implication of “over 1M servers” is an infrastructure investment of $4.25B including the servers. That’s an athletic expense even for Microsoft but certainly possible.
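The capital estimate follows the same pattern. The sketch below simply restates the assumptions from the paragraph above ($2,000 per server, 300MW total at a PUE of 1.2, $9 per watt of critical load); the variable names are illustrative and networking costs are left out, as they are in the text.

```python
# Rough capital cost implied by "over 1M servers".
# Inputs are the assumptions from the text; networking gear is excluded.
SERVERS = 1_000_000
COST_PER_SERVER = 2_000.0        # dollars, assumed fleet average
TOTAL_POWER_W = 300e6            # total facility power computed earlier
PUE = 1.2
BUILD_COST_PER_CRITICAL_W = 9.0  # dollars per watt of critical load

server_capex = SERVERS * COST_PER_SERVER

# Critical load is the power that actually reaches the IT equipment.
critical_power_w = TOTAL_POWER_W / PUE
datacenter_capex = critical_power_w * BUILD_COST_PER_CRITICAL_W

total_capex = server_capex + datacenter_capex

if __name__ == "__main__":
    print(f"servers:     ${server_capex / 1e9:.2f}B")
    print(f"datacenters: ${datacenter_capex / 1e9:.2f}B")
    print(f"total:       ${total_capex / 1e9:.2f}B")
```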


How many datacenters would be implied by “more than one million servers?” Ignoring the small points of presence since they don’t move the needle, and focusing on the big centers, let’s assume 50,000 servers in each facility. That assumption would lead to 20 major facilities.  As a cross check, if we instead focus on power consumption as a way to compute facility count and assume a total datacenter power consumption of 20MW each and the previously computed 300MW total power consumption, we would have roughly 15 large facilities. Not an unreasonable number in this context.


The summary following from these data and the “over 1M servers” number:

·         Facilities: 15 to 20 large facilities

·         Capital expense: $4.25B

·         Total power: 300MW

·         Power Consumption: 2.6TWh annually


Over one million servers is a pretty big number even in a web scaled world.




Data Center Knowledge article: Ballmer: Microsoft has 1 Million Servers

Transcript of Ballmer presentation:  Steve Ballmer: World-Wide Partner Conference Keynote

Additional note: Todd Hoff of High Scalability made a fun observation in his post Steve Ballmer says Microsoft has over 1 million servers. From Todd: “The [Microsoft data center] power consumption is about the same as used by Nicaragua and the capital cost is about a third of what Americans spent on video games in 2012.”

 James Hamilton 




Tuesday, July 16, 2013 4:34:17 PM (Pacific Standard Time, UTC-08:00)
 Tuesday, June 18, 2013
Back in 2009, in Datacenter Networks are in my way, I argued that the networking world was stuck in the mainframe business model: everything vertically integrated. In most datacenter networking equipment, the core Application Specific Integrated Circuit (ASIC – the heart of a switch or router), the entire hardware platform for the ASIC including power and physical network connections, and the software stack including all the protocols all come from a single vendor and there is no practical mechanism to make different choices. This is how the server world operated 40 years ago and we get much the same result. Networking gear is expensive, interoperates poorly, is costly to manage and is almost always over-subscribed and constraining the rest of the equipment in the datacenter.

Further exacerbating what is already a serious problem, unlike the mainframe server world of 40 years back, networking equipment is also unreliable. Each has 10s of millions of lines of code under the hood forming frustratingly productive bug farms. Each fault is met with a request from the vendor to “install the latest version” – the new build is usually different than what was previously running but “different” isn’t always good when running production systems. It’s just a new set of bugs to start chasing.  The core problem is many customers ask for complex features they believe will help make their network easier to manage. The networking vendor knows delivering these features keeps the customer uniquely dependent upon that vendor’s single solution. One obvious downside is the vendor lock-in that follows from these helpful features but the larger problem is these extensions are usually not broadly used, not well tested in subsequent releases, and the overall vendor protocol stacks become yet more brittle as they become more complex aggregating all these unique customer requests. This is an area in desperate need of change.

Networking gear is complex and, despite all vendors implementing the same RFCs, equipment from different vendors (and sometimes the same vendor) still interoperates poorly. It’s very hard to deliver reliable networks at controllable administration costs when freely mixing and matching gear from multiple vendors. The customer is locked in, the vendors know it, and the network equipment prices reflect that realization.

Not only is networking gear expensive absolutely but the relative expense of networking is actually increasing over time. Tracking the cost of networking gear as a ratio of all the IT equipment (servers, storage, and networking) in a data center, a terrible reality emerges.  For a given spend on servers and storage, the required network cost has been going up each year I have been tracking it. Without a fundamental change in the existing networking equipment business model, there is no reason to expect this trend will change.

Many of the needed ingredients for change actually have been in place for more than half a decade now. We have very high function networking ASICs available from Broadcom, Marvell, Fulcrum (Intel), and many others. Each competes with the others driving much faster innovation and ensuring that cost decreases are passed on to customers rather than simply driving more profit margin. Each ASIC design house produces reference designs that are built by multiple competing Original Design Manufacturers each with their own improvements. Taking the widely used Broadcom ASIC as an example, routers based upon this same ASIC are made by Quanta, Accton, DNI, Foxconn, Celestica, and many others. Each competes with the others driving much faster innovation and ensuring that the cost decreases are passed on to customers rather than further padding networking equipment vendor margins.

What is missing is high quality control software, management systems, and networking protocol stacks that can run across a broad range of competing, commodity networking hardware. It’s still very hard to take merchant silicon ASICs packaged in ODM produced routers and deploy production networks. Very big datacenter operators actually do it but it’s sufficiently hard that this gear is largely unavailable to the vast majority of networking customers.  

One of my favorite startups, Cumulus Networks, has gone after exactly the problem of making ODM produced commodity networking gear available broadly with high quality software support. Cumulus supports a broad range of ODM produced routing platforms built upon Broadcom networking ASICs. They provide everything it takes above the bare metal router to turn an ODM platform into a production quality router.  Included is support for both layer 2 switching and layer 3 routing protocols including OSPF (v2 and v3) and BGP.  Because the Cumulus system includes and is hosted on a Linux distribution (Debian), many of the standard tools, management, and monitoring systems just work. For example, they support Puppet, Chef, collectd, SNMP, Nagios, bash, python, perl, and ruby.

Rather than implement a proprietary device with proprietary management as the big networking players typically do, or make it look like a Cisco router as many of the smaller players often do, Cumulus makes the switch look like a Linux server with high-performance routing optimizations. Essentially it’s just a routing optimized Linux server.
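To make the “it’s just a Linux server” point concrete, here is a minimal sketch of the kind of ordinary script that applies once a switch is managed like a server. It assumes switch ports appear as regular Linux network interfaces whose counters are exposed in the standard /proc/net/dev format. The interface names (swp1, swp2), the sample counters, and the function name are hypothetical and used only for illustration; this is not taken from Cumulus documentation.

```python
# Parse per-interface traffic counters in the standard Linux /proc/net/dev
# layout. SAMPLE is made-up data so the sketch runs anywhere; on a Linux
# host the same function could be fed open("/proc/net/dev").read().
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  swp1:    1000      10    0    0    0     0          0         0     2000      20    0    0    0     0       0          0
  swp2:    3000      30    0    0    0     0          0         0     4000      40    0    0    0     0       0          0
"""


def parse_proc_net_dev(text: str) -> dict:
    """Return {interface: {rx_bytes, rx_packets, tx_bytes, tx_packets}}."""
    counters = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:                # expect 8 receive + 8 transmit columns
            continue
        counters[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
        }
    return counters


if __name__ == "__main__":
    for port, stats in parse_proc_net_dev(SAMPLE).items():
        print(port, stats)
```

Nothing in the script is specific to networking gear, which is the point: the same monitoring and configuration tooling used for servers can be pointed at the switch.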

The business model is similar to Red Hat Linux where the software and support are available on a subscription model at a price point that makes a standard network support contract look like the hostage payout that it actually is. The subscription includes the entire turnkey stack with everything needed to take one of these ODM produced hardware platforms and deploy a production quality network. Subscriptions will be available directly from Cumulus and through an extensive VAR network.

Cumulus supported platforms include Accton AS4600-54T (48x1G & 4x10G), Accton AS5600-52x (48x10G & 4x40G), Agema (DNI brand) AG-6448CU (48x1G & 4x10G), Agema AG-7448CU (48x10G & 4x40G), Quanta QCT T1048-LB9 (48x1G & 4x10G), and Quanta QCT T-3048-LY2 (48x10G & 4x40G). Here’s a picture of many of these routing platforms from the Cumulus QA lab:

In addition to these single ASIC routing and switching platforms, Cumulus is also working on a chassis-based router to be released later this year:

This platform has all the protocol support outlined above and delivers 512 ports of 10G or 128 ports of 40G in a single chassis. High-port count chassis-based routers have always been exclusively available from the big, vertically integrated networking companies mostly because high-port count routers are expensive to design and are sold in lower volumes than the simpler, single ASIC designs commonly used as Top of Rack or as components of aggregation layer fabrics. Cumulus and their hardware partners are not yet ready to release more details on the chassis but the plans are exciting and the planned price point is game changing.  Expect to see this later in the year.

Cumulus Networks was founded by JR Rivers and Nolan Leake in early 2010. Both are phenomenal engineers and I’ve been a huge fan of their work since meeting them as they were first bringing the company together.  They raised seed funding in 2011 from Andreessen Horowitz, Battery Ventures, Peter Wagner, Gaurav Garg, Mendel Rosenblum, Diane Greene, and Ed Bugnion.  In mid-2012, they did an A-round from Andreessen Horowitz and Battery Ventures.

The pace of change continues to pick up in the networking world and I’m looking forward to the formal announcement of the Cumulus chassis-based router.


James Hamilton 

Tuesday, June 18, 2013 11:15:27 AM (Pacific Standard Time, UTC-08:00)
 Monday, February 04, 2013

In the data center world, there are few events taken more seriously than power failure and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But, a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America, eclipsing the final episode of M*A*S*H.  World-wide, the Super Bowl is only behind the European Cup (UEFA Champions League) which draws 178 million viewers.


When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.


At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half.  The power failure led to a 34 min delay to restore full lighting to the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But, NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.


What happened at 13:22 in the third quarter when much of the field lighting failed?  Entergy, the utility supplying power to the Superdome, reported their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.


The joint report from SMG, the company that manages the Superdome, and Entergy, the utility power provider, said:


A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.


Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.



Essentially, the utility circuit breaker detected an “anomaly” and opened the breaker. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is they are incredibly flexible and can be configured uniquely for each installation. The disadvantage of software systems is the wide variety of configurations they can support can be complex and the default configurations are used perhaps more often than they should. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.


The cause has not yet been reported and, quite often, the underlying root cause is never found. But it’s worth asking: is it possible to avoid long game outages and what would it cost?  As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.


Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through a $336 million renovation, the switch gear may have been relatively new and, even if it wasn’t, it was almost certainly recently maintained and inspected.


Where issues often arise is in configuration. Modern switch gear have an amazingly large number of parameters many of which interact with each other and, in total, can be difficult to fully understand. And, given the switch gear manufacturers know little about the intended end-use application of each switchgear sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than that of a breaker that fails to open. Consequently conservative settings are common.


Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.


The first, testing in a full production environment in a non-mission critical setting, is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load.  Both require synthetic load banks and skilled electricians and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).


Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by a utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.


The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds each protected by its own generator. Generators in the 2.5 to 3.0MW range are substantial V16 diesel engines the size of a mid-sized bus. And they are expensive, running just under $1M each, but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year round with a permanent installation. We would need 2 generators, the switchgear to connect them to the load, and uninterruptible power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy a third generator just in case there is a problem and one of the two generators doesn’t start. The generators are under $1M each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10M.  Looking at statistics from the 2012 event, a 30 second commercial costs just over $4M.


For the price of just over 60 seconds of commercials the facility could be protected against this fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities choose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage so I’m not a big fan of this “safety” configuration but it is a common choice.
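The sizing and cost comparison above can be sketched in a few lines. The inputs are the round figures from the text (4.6MW load, 2.5 to 3.0MW generators, an installation under $10M, a 30 second commercial at just over $4M); the function names and the single spare generator policy are illustrative assumptions.

```python
import math

# N+1 generator sizing and cost comparison using the round figures
# from the text. The cost inputs are bounds, not quotes.


def generators_needed(load_mw: float, generator_mw: float, spares: int = 1) -> int:
    """Generators required to carry the load, plus spare units."""
    return math.ceil(load_mw / generator_mw) + spares


def seconds_of_commercials(cost: float, ad_cost: float, ad_seconds: float = 30.0) -> float:
    """Express a cost as seconds of Super Bowl advertising time."""
    return cost / ad_cost * ad_seconds


LOAD_MW = 4.6
INSTALL_COST = 10e6   # text: "under $10M", so this is an upper bound
AD_COST_30S = 4e6     # text: "just over $4M", so this is a lower bound

if __name__ == "__main__":
    print("generators at 2.5MW each:", generators_needed(LOAD_MW, 2.5))
    print("generators at 3.0MW each:", generators_needed(LOAD_MW, 3.0))
    print("seconds of commercials:  ", seconds_of_commercials(INSTALL_COST, AD_COST_30S))
```

With these round numbers the bound works out to 75 seconds. Since the installation is under $10M and the commercial over $4M, the real figure is lower, in line with the estimate of just over 60 seconds.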


Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.


Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where neither of these mitigations works, the next defense is small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6.  And, if the lighting power is checkerboarded over the facility lights (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
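The difference between contiguous and checkerboarded feeds is easy to see in a toy model. The sketch below assumes, purely for illustration, 24 lights arranged in a single row and 4 feeds; a real stadium layout is two dimensional and the function names are hypothetical.

```python
# Toy model: assign lights in a row to power feeds, then measure the
# longest stretch of adjacent dark lights when one feed fails.


def contiguous(n_lights: int, n_feeds: int) -> list:
    """Each feed powers one contiguous block of lights."""
    return [i * n_feeds // n_lights for i in range(n_lights)]


def checkerboard(n_lights: int, n_feeds: int) -> list:
    """Feeds are interleaved so neighbouring lights use different feeds."""
    return [i % n_feeds for i in range(n_lights)]


def longest_dark_run(assignment: list, failed_feed: int) -> int:
    """Longest run of adjacent lights that go dark when failed_feed is lost."""
    longest = run = 0
    for feed in assignment:
        run = run + 1 if feed == failed_feed else 0
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    for name, layout in (("contiguous", contiguous(24, 4)),
                         ("checkerboard", checkerboard(24, 4))):
        lost = layout.count(0)
        print(f"{name}: {lost} of {len(layout)} lights lost, "
              f"longest dark run {longest_dark_run(layout, 0)}")
```

Both layouts lose the same quarter of the lights when a feed fails, but the interleaved layout never leaves two adjacent lights dark, which is what would make play under degraded lighting plausible.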


Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.


In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minutes checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recovery decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20?  The gas discharge lighting still favored at large sporting venues takes roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.


It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game and the relatively low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant designs and employing more aggressive pre-game testing.




James Hamilton 


Monday, February 04, 2013 11:16:06 AM (Pacific Standard Time, UTC-08:00)
 Monday, January 14, 2013

In the cloud there is nothing more important than customer trust. Without customer trust, a cloud business can’t succeed. When you are taking care of someone else’s assets, you have to treat those assets as more important than your own. Security has to be rock solid and absolutely unassailable. Data loss or data corruption has to be close to impossible and incredibly rare.  And all commitments to customers have to be respected through business changes. These are hard standards to meet but, without success against these standards, a cloud service will always fail. Customers can leave any time and, if they have to leave, they will remember you did this to them.


These are facts and anyone working in cloud services labors under these requirements every day. It’s almost reflexive and nearly second nature. What brought this up for me over the weekend was a note I got from one of my cloud service providers. It emphasized that it really is worth talking more about customer trust.


Let’s start with some history.  Many years ago, Michael Merhej and Tom Kleinpeter started a company called ByteTaxi that eventually offered a product called Foldershare. It was a simple service with a simple UI but it did peer-to-peer file sync incredibly well, it did it through firewalls, it did it without install confusion and, well, it just worked. It was a simple service but was well executed and very useful. In 2005, Microsoft acquired Foldershare and continued to offer the service. It didn’t get enhanced much for years but it remained useful.  Then Microsoft came up with a broader plan called Windows Live Mesh and the Foldershare service was renamed. Actually the core peer-to-peer functionality passed through an array of names and implementations from Foldershare, Windows Live Foldershare, Windows Live Sync and finally Windows Live Mesh.


During the early days at Microsoft, it was virtually uncared for and had little developer attention. As new names and implementations were announced and the feature actually had developer attention, it was getting enhanced but, ironically, it was also getting somewhat harder to use and definitely less stable. But, it still worked and the functionality lived on in Live Mesh. Microsoft has another service called Skydrive that does the same thing that all the other cloud sync services do: sync files to cloud hosted storage. Unfortunately, it doesn’t include the core peer-to-peer functionality of Live Mesh.  Reportedly 40% of the Live Mesh users also use Skydrive.


This is where we get back to customer trust. Over the weekend, Microsoft sent a note to all Mesh users confirming that the service will be shut off next month, following up on the December announcement that the service would be killed. They explained the reason to terminate the service and remove the peer-to-peer file sync functionality:


Currently 40% of Mesh customers are actively using SkyDrive and based on the positive response and our increasing focus on improving personal cloud storage, it makes sense to merge SkyDrive and Mesh into a single product for anytime and anywhere access for files.


Live Mesh is being killed without a replacement service.  It’s not a big deal but 2 months isn’t a lot of warning. I know that this sort of thing can happen to small startups anytime and, at any time, customers could get left unsupported. But, Microsoft seems well beyond the startup phase at this point.  I get that strategic decisions have to be made but there are times when I wonder how much thought went into the decision. I suspect it was something like “there are only 3 million Live Mesh customers so it’s really not worth continuing with it.” And, it actually may not be worth continuing the service. But, there is this customer trust thing. And I just hate to see it violated – it’s bad for all cloud providers when anyone in the industry makes a decision that raises the customer trust question.


Fortunately, there is a Mesh replacement service: Cubby. I’ve been using it since the early days when it was in controlled beta. Over the last month or so Cubby has moved to full, unrestricted production. It’s been solid for the period I’ve been using it and, like Foldershare, it’s simple and it works. I really like it. If you are a Mesh user, were a Foldershare user, or just would like to be able to sync your files between your different systems, try Cubby.  Cubby also adds support for Android and iOS devices without extra cost. Cubby is well executed and stable.


It must be Cloud Cleaning week at Microsoft.  A friend forwarded the note sent to the millions of active Microsoft Messenger customers this month: the service is being “retired” and users are recommended to consider Skype.


If you are interested in reading more on the Live Mesh service elimination, the following is the text of the note sent to all current Mesh users:

Dear Mesh customer,

Recently we released the latest version of SkyDrive, which you can use to:

  • Choose the files and folders on your SkyDrive that sync on each computer.
  • Access your SkyDrive using a brand new app for Android v2.3 or the updated apps for Windows Phone, iPhone, and iPad.
  • Collaborate online with the new Office Web apps, including Excel forms, co-authoring in PowerPoint and embeddable Word documents.

Currently 40% of Mesh customers are actively using SkyDrive and based on the positive response and our increasing focus on improving personal cloud storage, it makes sense to merge SkyDrive and Mesh into a single product for anytime and anywhere access for files. As a result, we will retire Mesh on February 13, 2013. After this date, some Mesh functions, such as remote desktop and peer to peer sync, will no longer be available and any data on the Mesh cloud, called Mesh synced storage or SkyDrive synced storage, will be removed. The folders you synced with Mesh will stop syncing, and you will not be able to connect to your PCs remotely using Mesh.

We encourage you to try out the new SkyDrive to see how it can meet your needs. During the transition period, we suggest that, in addition to using Mesh, you sync your Mesh files using SkyDrive. This way, you can try out SkyDrive without changing your existing Mesh setup. For tips on transitioning to SkyDrive, see SkyDrive for Mesh users on the Windows website. If you have questions, you can post them in the SkyDrive forums.

Mesh customers have been influential and your feedback has helped shape our strategy for Mesh and SkyDrive. We would not be here without your support and hope you continue to give us feedback as you use SkyDrive.


The Windows Live Mesh and SkyDrive teams


There is a real danger in thinking of customers as faceless aggregations of hundreds of thousands or even millions of users. We need to think through decisions one user at a time and make it work for them individually. If millions of active users are on Microsoft Messenger, what would it take to make them want to use Skype?  If 60% of the Windows Live Mesh users chose not to use Microsoft Skydrive, why is that? Considering customers one at a time is clearly the right thing for customers but, long haul, it’s also the right thing for the business. It builds the most important asset in the cloud, customer trust.




James Hamilton 

Monday, January 14, 2013 8:32:32 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
 Tuesday, December 11, 2012

Since 2008, I’ve been excited by, working on, and writing about Microservers. In those early days, some of the workloads I worked with were I/O bound and didn’t really need or use high single-thread performance. Replacing the server class processors that supported these applications with high-volume, low-cost client system CPUs yielded both better price/performance and power/performance. Fortunately, at that time, there were good client processors available with ECC enabled (see You Really DO Need ECC) and most embedded system processors also supported ECC.


I wrote up some of the advantages of these early microserver deployments and showed performance results from a production deployment in an internet-scale mail processing application in Cooperative, Expendable, Microslice, Servers: Low-Cost, Low-Power Servers for Internet-Scale Services.


Intel recognizes the value of low-power, low-cost processors for less CPU demanding applications and announced this morning the newest members of the Atom family, the S1200 series. These new processors support 2 cores and 4 threads and are available in variants of up to 2GHz while staying under 8.5 watts. The lowest power members of the family come in at just over 6W. Intel has demonstrated an S1200 reference board running spec_web at 7.9W including memory, SATA, networking, BMC, and other on-board components.


Unlike past Atom processors, the S1200 series supports full ECC memory. And all members of the family support hardware virtualization (Intel VT-x2), 64 bit addressing, and up to 8GB of memory. These are real server parts.


Centerton (S1200 series) features:

One of my favorite Original Design Manufacturers, Quanta Computer, has already produced a shared infrastructure rack design that packs 48 Atom S1200 servers into a 3 rack unit form factor (5.25”).


Quanta S900-X31A front and back view:



Quanta S900-X31a server drawer:


Quanta has done a nice job with this shared infrastructure rack. Using this design, they can pack a booming 624 servers into a standard 42 RU rack.
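
The per-rack figure follows from the chassis numbers above. The short sketch below works through the arithmetic; the constant and function names are my own, and the post does not say what the leftover rack units are used for, so that part is left as an open question rather than an assumption baked into the code.

```python
# Figures quoted in the post.
SERVERS_PER_CHASSIS = 48   # Atom S1200 servers per Quanta chassis
CHASSIS_HEIGHT_RU = 3      # chassis height in rack units
RACK_HEIGHT_RU = 42        # standard rack height
SERVERS_PER_RACK = 624     # per-rack figure quoted in the post


def chassis_needed(servers: int, per_chassis: int) -> int:
    """Number of chassis needed to hold `servers`, rounded up."""
    return -(-servers // per_chassis)


chassis = chassis_needed(SERVERS_PER_RACK, SERVERS_PER_CHASSIS)
used_ru = chassis * CHASSIS_HEIGHT_RU
spare_ru = RACK_HEIGHT_RU - used_ru

# Upper bound if every rack unit held servers (not what the post quotes).
max_if_full = (RACK_HEIGHT_RU // CHASSIS_HEIGHT_RU) * SERVERS_PER_CHASSIS

print(f"chassis per rack: {chassis}")
print(f"rack units used:  {used_ru} of {RACK_HEIGHT_RU} ({spare_ru} spare)")
print(f"servers if the rack were completely filled: {max_if_full}")
```

In other words, 624 servers corresponds to 13 chassis in 39 RU, which leaves 3 RU of the rack for something other than servers.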


I’m excited by the S1200 announcement because it’s both a good price/performer and power/performer and shows that Intel is serious about the microserver market. This new Atom gives customers access to microserver pricing without having to change instruction set architectures. The combination of low-cost, low-power, and the familiar Intel ISA with its rich tool chain and broad application availability is a compelling combination. It’s exciting to see the microserver market heating up and I like Intel’s roadmap looking forward.




Related Microserver focused postings:

·         Cooperative Expendable Microslice Servers: Low-cost, Low-power Servers for Internet Scale Services

·         The Case for Low-Cost, Low-Power Servers

·         Low Power Amdahl Blades for Data Intensive Computing

·         Microslice Servers

·         ARM Cortex-A9 Design Announced

·         2010 the Year of the Microslice Server

·         Very Low Power Server Progress

·         Nvidia Project Denver: ARM Powered Servers

·         ARM V8 Architecture

·         AMD Announced Server-Targeted ARM Part

·         Quanta S900-X31A


James Hamilton 


Tuesday, December 11, 2012 8:55:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
 Wednesday, November 28, 2012

I’ve worked in or near the database engine world for more than 25 years. And, ironically, every company I’ve ever worked at has been working on a massive-scale, parallel, clustered RDBMS system. The earliest variant was IBM DB2 Parallel Edition released in the mid-90s. It’s now called the Database Partitioning Feature.


Massive, multi-node parallelism is the only way to scale a relational database system, so these systems can be incredibly important. Very high-scale MapReduce systems are an excellent alternative for many workloads. But some customers and workloads want the flexibility and power of being able to run ad hoc SQL queries against petabyte-sized databases. These are the workloads targeted by massive, multi-node relational database clusters. There are now many solutions out there, with Oracle RAC being perhaps the most well-known, and many others including Vertica, GreenPlum, Aster Data, ParAccel, Netezza, and Teradata.


What’s common across all these products is that big databases are very expensive. Today, that is changing with the release of Amazon Redshift. It’s a relational, column-oriented, compressed, shared nothing, fully managed, cloud hosted, data warehouse. Each node can store up to 16TB of compressed data and up to 100 nodes are supported in a single cluster.


Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed.


The core node on which Redshift clusters are built includes 24 disk drives with an aggregate capacity of 16TB of local storage. Each node has 16 virtual cores and 120GB of memory and is connected via a high-speed, 10Gbps, non-blocking network. This is a meaty core node and Redshift supports up to 100 of these in a single cluster.
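
To get a feel for what a maximum-size cluster adds up to, here is a minimal sketch that multiplies out the per-node figures above. The `NodeSpec` class and function names are hypothetical, and the totals use decimal units (1PB = 1000TB, 1TB = 1000GB), which is my assumption about how the figures are meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources as described in the post."""
    storage_tb: int
    virtual_cores: int
    memory_gb: int


# The large node type described above; 100 is the stated cluster maximum.
CORE_NODE = NodeSpec(storage_tb=16, virtual_cores=16, memory_gb=120)
MAX_NODES = 100


def cluster_totals(node: NodeSpec, nodes: int) -> dict:
    """Aggregate resources for a cluster of `nodes` identical nodes."""
    return {
        "storage_pb": node.storage_tb * nodes / 1000,
        "virtual_cores": node.virtual_cores * nodes,
        "memory_tb": node.memory_gb * nodes / 1000,
    }


totals = cluster_totals(CORE_NODE, MAX_NODES)
for name, value in totals.items():
    print(f"{name}: {value}")
```

A full 100-node cluster works out to 1.6PB of compressed storage, 1,600 virtual cores, and 12TB of memory.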


There are many pricing options available (see for more detail) but the most favorable comes in at only $999 per TB per year. I find it amazing to think of having the services of an enterprise-scale data warehouse for under a thousand dollars per terabyte per year. And this is a fully managed system, so much of the administrative load is taken care of by Amazon Web Services.


Service highlights from:


Fast and Powerful – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. First, it uses columnar storage and data compression to reduce the amount of IO needed to perform queries. Second, it runs on hardware that is optimized for data warehousing, with local attached storage and 10GigE network connections between nodes. Finally, it has a massively parallel processing (MPP) architecture, which enables you to scale up or down, without downtime, as your performance and storage needs change.

You have a choice of two node types when provisioning your own cluster, an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. You can start with a single XL node and scale up to a 100 node eight extra large cluster. XL clusters can contain 1 to 32 nodes while 8XL clusters can contain 2 to 100 nodes.


Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse to improve performance or increase capacity, without incurring downtime. Amazon Redshift enables you to start with a single 2TB XL node and scale up to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Resize functionality is not available during the limited preview but will be available when the service launches.


Inexpensive – You pay very low rates and only for the resources you actually provision. You benefit from the option of On-Demand pricing with no up-front or long-term commitments, or even lower rates via our reserved pricing option. On-demand pricing starts at just $0.85 per hour for a two terabyte data warehouse, scaling linearly up to a petabyte and more. Reserved Instance pricing lowers the effective price to $0.228 per hour, under $1,000 per terabyte per year.

Fully Managed – Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse, from provisioning capacity to monitoring and backing up the cluster, and to applying patches and upgrades. By handling all these time consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business insights.


Secure – Amazon Redshift provides a number of mechanisms to secure your data warehouse cluster. It currently supports SSL to encrypt data in transit, includes web service interfaces to configure firewall settings that control network access to your data warehouse, and enables you to create users within your data warehouse cluster. When the service launches, we plan to support encrypting data at rest and Amazon Virtual Private Cloud (Amazon VPC).


Reliable – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically replaces any component, as necessary.


Compatible – Amazon Redshift is certified by Jaspersoft and Microstrategy, with additional business intelligence tools coming soon. You can connect your SQL client or business intelligence tool to your Amazon Redshift data warehouse cluster using standard PostgreSQL JDBC or ODBC drivers.


Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service and Amazon Elastic MapReduce coming soon.
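
The hourly rates quoted in the highlights above can be converted to the per-terabyte-per-year figure with a few lines of arithmetic. This is a rough sketch: the function name is mine, it assumes both hourly rates apply to the 2TB node, and it uses a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# Rates quoted in the service highlights (assumed to be for the 2TB XL node).
ON_DEMAND_USD_PER_HOUR = 0.85
RESERVED_USD_PER_HOUR = 0.228
NODE_STORAGE_TB = 2


def yearly_cost_per_tb(hourly_rate_usd: float, storage_tb: float) -> float:
    """Cost in USD of one terabyte for one year at the given hourly rate."""
    return hourly_rate_usd * HOURS_PER_YEAR / storage_tb


on_demand = yearly_cost_per_tb(ON_DEMAND_USD_PER_HOUR, NODE_STORAGE_TB)
reserved = yearly_cost_per_tb(RESERVED_USD_PER_HOUR, NODE_STORAGE_TB)

print(f"on-demand: ${on_demand:,.2f} per TB per year")
print(f"reserved:  ${reserved:,.2f} per TB per year")
```

The reserved rate lands just under $1,000 per terabyte per year, consistent with the $999 figure, while on-demand comes to a bit over $3,700.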


Petabyte-scale data warehouses no longer need to command retail prices of upwards of $80,000 per core. You don’t have to negotiate an enterprise deal and work hard to get the 60 to 80% discount that always seems magically possible in the enterprise software world.  You don’t even have to hire a team of administrators. Just load the data and get going. Nice to see.




James Hamilton 


Wednesday, November 28, 2012 9:37:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, October 29, 2012

I have been interested in, and writing about, microservers since 2007.  Microservers can be built using any instruction set architecture but I’m particularly interested in ARM processors and their application to server-side workloads. Today Advanced Micro Devices announced they are going to build an ARM CPU targeting the server market. This will be a 4-core, 64-bit part running at more than 2GHz that is expected to sample in 2013 and ship in volume in early 2014.


AMD is far from new to the microserver market. In fact, much of my past work on microservers has been AMD-powered. What’s different today is that AMD is applying their server processor skills while, at the same time, leveraging the massive ARM processor ecosystem. ARM processors power Apple iPhones, Samsung smartphones, tablets, disk drives, and applications you didn’t even know had computers in them.


The defining characteristic of server processor selection is to focus first and foremost on raw CPU performance and accept the high cost and high power consumption that follow from that goal. The defining characteristic of Microservers is that we leverage the high-volume client and connected device ecosystem and make a CPU selection on the basis of price/performance and power/performance with an emphasis on building balanced servers. The case for microservers is anchored upon these 4 observations:


·         Volume economics: Rather than draw on the small-volume economics of the server market, with Microservers we leverage the massive volume economics of the smart device world driven by cell phones, tablets, and clients. To give some scale to this observation, IDC reports that there were 7.6M server units sold in 2010. ARM reports that there were 6.1B ARM processors shipped last year. The connected and embedded device market volumes are roughly 1000x larger than that of the server market and the performance gap is shrinking rapidly. Semiconductor analyst Semicast estimates that by 2015 there will be 2 ARM processors for every person in the world. In 2010, ARM reported that, on average, there were 2.5 ARM-based processors in each smartphone.


Having watched and participated in our industry for nearly 3 decades, one reality seems to dominate all others: high-volume economics drives innovation and just about always wins. As an example, IBM mainframes ran just about every important server-side workload in the mid-80s. But, they were largely swept aside by higher-volume RISC servers running UNIX. At the time I loved RISC systems – databases systems would just scream on them and they offered customers excellent price/performance. But, the same trend played out again. The higher-volume X86 processors from the client world swept the superior raw performing RISC systems aside.


Invariably what we see happening about once a decade is a high-volume, lower-priced technology takes over the low end of the market. When this happens many engineers correctly point out that these systems can’t hold a candle to the previous generation server technology and then incorrectly believe they won’t get replaced. The new generation is almost never better in absolute terms but they are better price/performers so they first are adopted for the less performance critical applications.  Once this happens, the die is cast and the outcome is just about assured. The high-volume parts move up market and eventually take over even the most performance critical workloads of the previous generation. We see this same scenario play out roughly once a decade.
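
To put the unit volumes quoted under volume economics side by side, here is a quick sketch. The names are mine, and note that the two figures come from different sources and slightly different periods (2010 server units versus ARM shipments "last year"), so the ratio is only an order-of-magnitude indication.

```python
# Unit volumes quoted in the post.
SERVER_UNITS_SOLD = 7_600_000            # IDC, 2010
ARM_PROCESSORS_SHIPPED = 6_100_000_000   # ARM, "last year"


def volume_ratio(device_units: int, server_units: int) -> float:
    """How many device processors ship for every server unit sold."""
    return device_units / server_units


ratio = volume_ratio(ARM_PROCESSORS_SHIPPED, SERVER_UNITS_SOLD)
print(f"ARM processors shipped per server sold: {ratio:.0f}")
```

The ratio comes out at roughly 800, which is the same order of magnitude as the 1000x rule of thumb.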


·         Not CPU bound: Most discussion in our industry centers on the more demanding server workloads like databases but, in reality, many workloads are not pushing CPU limits and are instead storage, networking, or memory bound.  There are two major classes of workloads that don’t need or can’t fully utilize more CPU:

1.      Some workloads simply do not require the highest performing CPUs to achieve their SLAs.  You can pay more and buy a higher performing processor but it will achieve little for these applications. Some workloads just don’t require more CPU performance to meet their goals.

2.      This second class of workloads is characterized by being blocked on networking, storage, or memory. And by memory bound I don’t mean the memory is too small. In this case it isn’t the size of the memory that is the problem, but the bandwidth.  The processor looks to be fully utilized from an operating system perspective but the bulk of its cycles are spent waiting for memory. Disk and network bound systems are easy to detect: look for the resource that is running close to 100% utilization while the CPU load is far lower. Memory bound is more challenging to detect but it’s super common so it’s worth talking about.  Most server processors are super-scalar, which is to say they can retire multiple instructions each cycle. On many workloads, less than 1 instruction is retired each cycle (you can see this by monitoring instructions per cycle) because the processor is waiting for memory transfers.


If a workload is bound on network, storage, or memory, spending more on a faster CPU will not deliver results. The same is true for non-demanding workloads. They too are not bound on CPU so a faster part won’t help in this case either.
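
The instructions-per-cycle check described above can be sketched as a small helper. This is a rough heuristic under stated assumptions, not a definitive diagnostic: the function names and the 90% "busy" threshold are hypothetical, the counter values would come from hardware performance counters (for example, as reported by a tool such as `perf stat`), and a low IPC can have causes other than memory stalls.

```python
def instructions_per_cycle(instructions: int, cycles: int) -> float:
    """Average number of instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def looks_memory_bound(instructions: int, cycles: int, cpu_utilization: float,
                       ipc_threshold: float = 1.0,
                       busy_threshold: float = 0.9) -> bool:
    """Flag the pattern described in the text: the CPU appears busy to the
    operating system, yet fewer than one instruction retires per cycle."""
    ipc = instructions_per_cycle(instructions, cycles)
    return cpu_utilization >= busy_threshold and ipc < ipc_threshold


# Hypothetical counter samples, made up for illustration only.
stalled = looks_memory_bound(instructions=4_000_000_000,
                             cycles=10_000_000_000,
                             cpu_utilization=0.97)
healthy = looks_memory_bound(instructions=18_000_000_000,
                             cycles=10_000_000_000,
                             cpu_utilization=0.97)
print(f"stalled sample flagged: {stalled}")
print(f"healthy sample flagged: {healthy}")
```

A workload flagged this way is a candidate for a slower, cheaper processor, since a faster core would mostly spend its extra cycles waiting.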


·         Price/performance: Device price/performance is far better than current generation server CPUs. Because there is less competition in server processors, prices are far higher and price/performance is relatively low compared to the device world. Using server parts, performance is excellent but price is not.


Let’s use an example again: A server CPU is hundreds of dollars, sometimes approaching $1,000, whereas the ARM processor in an iPhone comes in at just under $15. My general rule of thumb in comparing ARM processors with server CPUs is they are capable of ¼ the processing rate at roughly 1/10th the cost. And, super important, the massive shipping volume of the ARM ecosystem feeds innovation and competition, and the performance gap shrinks with each processor generation. Each generational improvement captures more possible server workloads while further improving price/performance.
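
The rule of thumb above implies a specific price/performance advantage, which the sketch below works out with exact fractions. The names are mine, and the "cost to match one server CPU" line assumes the workload scales perfectly across four smaller parts, which real workloads may not.

```python
from fractions import Fraction

# Rule of thumb from the text, relative to a server CPU.
RELATIVE_PERFORMANCE = Fraction(1, 4)   # 1/4 the processing rate
RELATIVE_COST = Fraction(1, 10)         # roughly 1/10th the cost


def relative_price_performance(performance: Fraction, cost: Fraction) -> Fraction:
    """Performance per dollar relative to the server CPU baseline."""
    return performance / cost


advantage = relative_price_performance(RELATIVE_PERFORMANCE, RELATIVE_COST)

# Parts needed to match one server CPU, assuming perfect scaling.
parts_to_match = 1 / RELATIVE_PERFORMANCE
cost_to_match = parts_to_match * RELATIVE_COST

print(f"performance per dollar vs. server CPU: {float(advantage)}x")
print(f"parts needed to match one server CPU:  {parts_to_match}")
print(f"relative cost of those parts:          {float(cost_to_match)}")
```

Under these assumptions the smaller parts deliver 2.5 times the performance per dollar, or equivalently the same aggregate processing rate at 40% of the CPU cost.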


·         Power/performance: Most modern servers run over 200W, and many are well over 500W, while microservers can weigh in at 10 to 20W. Nowhere is power/performance more important than in portable devices, so the pace of power/performance innovation in the ARM world is incredibly strong. In fact, I’ve long used mobile devices as a window into future innovations coming to the server market. The technologies you see in the current generation of cell phones have a very high probability of being used in a future server CPU generation.


This is not the first ARM based server processor that has been announced.  And, even more announcements are coming over the next year. In fact, that is one of the strengths of the ARM ecosystem. The R&D investments can be leveraged over huge shipping volume from many producers to bring more competition, lower costs, more choice, and a faster pace of innovation.


This is a good day for customers, a good day for the server ecosystem, and I’m excited to see AMD help drive the next phase in the evolution of the ARM Server market. The pace of innovation continues to accelerate industry-wide and it’s going to be an exciting rest of the decade.


Past notes on Microservers:











James Hamilton 


Monday, October 29, 2012 1:33:22 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Saturday, October 20, 2012

When I come across interesting innovations or designs notably different from the norm, I love to dig in and learn the details. More often than not I post them here. Earlier this week, Google posted a number of pictures taken from their datacenters (Google Data Center Tech). The pictures are beautiful and of interest to just about anyone, somewhat more interesting to those working in technology, and worthy of detailed study for those working in datacenter design. My general rule with Google has always been that anything they show publicly is always at least one generation old and typically more. Nonetheless, the Google team does good work, so the older designs are still worth understanding and I always have a look.


Some examples of older but interesting Google data center technology:

·         Efficient Data Center Summit

·         Rough Notes: Data Center Efficiency Summit

·         Rough notes: Data Center Efficiency Summit (posting #3)

·         2011 European Data Center Summit


The set of pictures posted last week (Google Data Center Tech) is a bit unusual in that they are showing current pictures of current facilities running their latest work.  What was published was only pictures without explanatory detail but, as the old cliché says, a picture is worth a thousand words. I found the mechanical design to be most notable so I’ll dig into that area a bit but let’s start with showing a conventional datacenter mechanical design as a foil against which to compare the Google approach.

The conventional design has numerous issues, the most obvious being that any design that is 40 years old could probably use some innovation. Notable problems with the conventional design: 1) no hot aisle/cold aisle containment so there is air leakage and mixing of hot and cold air, 2) air is moved long distances between the Computer Room Air Handlers (CRAHs) and the servers and air is an expensive fluid to move, and 3) it’s a closed system and hot air is recirculated after cooling rather than released outside with fresh air brought in and cooled if needed.


An example of an excellent design that does a modern job of addressing most of these failings is the Facebook Prineville Oregon facility:


I’m a big fan of the Facebook facility. In this design they eliminate the chilled water system entirely, have no chillers (expensive to buy and power), have full hot aisle isolation, use outside air with evaporative cooling, and treat the entire building as a giant, high-efficiency air duct. More detail on the Facebook design at: Open Compute Mechanical System Design.


Let’s have a look at the Google Council Bluffs Iowa facility:



You can see that they have chosen a very large, single room approach rather than sub-dividing up into pods. As with any good, modern facility they have hot aisle containment which just about completely eliminates leakage of air around the servers or over the racks. All chilled air passes through the servers and none of the hot air leaks back prior to passing through the heat exchanger. Air containment is a very important efficiency gain and the single largest gain after air-side economization. Air-side economization is the use of outside air rather than taking hot server exhaust and cooling it to the desired inlet temperature (see the diagram above showing the Facebook use of full building ducting with air-side economization).


From the Council Bluffs picture, we see Google has taken a completely different approach. Rather than completely eliminate the chilled water system and use the entire building as an air duct, they have kept the piped water cooling system and focused on making it as efficient as possible and exploiting some of the advantages of water-based systems. This shot from the Google Hamina Finland facility shows the multi-coil heat exchanger at the top of the hot aisle containment system.


From inside the hot aisle, in this picture from the Mayes County data center, we can see the water is brought up from below the floor in the hot aisle using steel braided flexible chilled water hoses. These hoses bring cool water up to the top-of-hot-aisle heat exchangers that cool the server exhaust air before it is released above the racks of servers.


One of the key advantages of water cooling is that water is a cheaper fluid to move than air for a given thermal capacity. In the Google design, they exploit this fact by bringing water all the way to the rack. This isn’t an industry first but it is nicely executed in the Google design. IBM iDataPlex brought water directly to the back of the rack and many high power density HPC systems have done this as well.
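
A rough way to see why water is the cheaper fluid to move is to compare the volume of each fluid needed to carry away the same heat. The sketch below uses the standard relation Q = ρ · c_p · V · ΔT with approximate room-temperature properties for air and water; the 10kW rack load and 10K temperature rise are hypothetical values chosen for illustration and are not from the Google design. Volume moved is only a proxy: actual fan and pump energy also depends on pressure drop.

```python
# Approximate fluid properties near room temperature (about 20 C).
AIR_DENSITY = 1.2              # kg/m^3
AIR_SPECIFIC_HEAT = 1005.0     # J/(kg*K)
WATER_DENSITY = 998.0          # kg/m^3
WATER_SPECIFIC_HEAT = 4182.0   # J/(kg*K)


def volumetric_flow(heat_load_w: float, density: float,
                    specific_heat: float, delta_t_k: float) -> float:
    """Volume flow in m^3/s needed to absorb heat_load_w with a fluid
    temperature rise of delta_t_k, from Q = rho * c_p * V * dT."""
    return heat_load_w / (density * specific_heat * delta_t_k)


# Hypothetical operating point, for illustration only.
RACK_LOAD_W = 10_000.0
DELTA_T_K = 10.0

air_flow = volumetric_flow(RACK_LOAD_W, AIR_DENSITY,
                           AIR_SPECIFIC_HEAT, DELTA_T_K)
water_flow = volumetric_flow(RACK_LOAD_W, WATER_DENSITY,
                             WATER_SPECIFIC_HEAT, DELTA_T_K)

print(f"air:   {air_flow:.3f} m^3/s")
print(f"water: {water_flow * 1000:.3f} L/s")
print(f"air-to-water volume ratio: {air_flow / water_flow:.0f}x")
```

At this operating point the air side needs on the order of 0.8 cubic meters per second while the water side needs about a quarter of a liter per second, a volume ratio of a few thousand to one.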


I don’t see the value of the short stacks above the heat exchangers. I would think that any gain in air acceleration through the smoke stack effect would be dwarfed by the losses of having the passive air stacks as restrictions over the heat exchangers.


Bringing water directly to the rack is efficient but I still somewhat prefer air-side economization systems. Any system that can reject hot air outside and bring in outside air for cooling (if needed) for delivery to the servers is tough to beat (see the diagram at the top for an example approach). However, as server density climbs, we will eventually get to power densities sufficiently high that water is needed, either very near the server as Google has done or through direct water cooling as used by IBM mainframes in the 80s (thermal conduction module). One very nice contemporary direct liquid cooling system is the work by Green Revolution Cooling where they completely immerse otherwise unmodified servers in a bath of chilled oil.


Hats off to Google for publishing a very informative set of data center pictures. The pictures are well done and the engineering is very nice. Good work!

·         Here’s a very cool Google Street View based tour of the Google Lenoir NC Datacenter.

·         The detailed pictures released last week: Google Data Center Photo Album




James Hamilton 


Saturday, October 20, 2012 2:48:55 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton