Wednesday, August 01, 2012

In I/O Performance (no longer) Sucks in the Cloud, I said:


Many workloads have high I/O rate data stores at the core. The success of the entire application is dependent upon a few servers running MySQL, Oracle, SQL Server, MongoDB, Cassandra, or some other central database.


Last week a new Amazon Elastic Compute Cloud (EC2) instance type based upon SSDs was announced that delivers 120k reads per second and 10k to 85k writes per second. This instance type with direct attached SSDs is an incredible I/O machine ideal for database workloads, but most database workloads run on virtual storage today. The administrative and operational advantages of virtual storage are many. You can allocate more storage with a call of an API. Blocks are redundantly stored on multiple servers. It’s easy to checkpoint to S3. Server failures don’t impact storage availability.


The AWS virtual block storage solution is the Elastic Block Store (EBS).  Earlier today two key features were released to support high performance databases and other random I/O intensive workloads on EBS. The key observation is that these random I/O-intensive workloads need to have IOPS available whenever they are needed. When a database runs slowly, the entire application runs poorly. Best effort is not enough and competing for resources with other workloads doesn’t work. When high I/O rates are needed, they are needed immediately and must be there reliably.


Perhaps the best way to understand the two new features is to look at how demanding database workloads are often hosted on-premise. Typically large servers are used so the memory and CPU resources are available when needed. Because a high performance storage system is needed and because it is important to be able to scale the storage capacity and I/O rates during the life of the application, direct attached disk isn’t the common choice. Most enterprise customers put these workloads on Storage Area Network devices, which are typically connected to the server by a Fibre Channel network (a private communication channel used only for storage).


The aim of the announcement today is to apply some of what has been learned from 30+ years of on-premise storage evolution. Customers want virtualized storage but, at the same time, they need the ability to reserve resources for demanding workloads. In this announcement, we take some of the best aspects of what has emerged in on-premise storage solutions and give EC2 customers the ability to scale high-performance storage as needed, reserve and scale the available I/Os per second (IOPS) as needed, and reserve dedicated network bandwidth to the storage device. The latter is perhaps the most important, and the combination allows workloads to reserve both the IOPS rate at the storage and the network channel to get to the storage, and be assured both will be there when they need them.


The storage, IOPS, and network capacity are there even if you haven’t used them recently. They are there even if your neighbors are also busy using their respective reservations. They are even there if you are running full networking traffic load to the EC2 instance. Just as when an on-premise customer allocates a SAN volume with a Fibre Channel attach that doesn’t compete with other network traffic, allocated resources stay reserved and they stay available. Let’s look at the two features that deliver a low-jitter, virtual SAN solution in AWS.

Provisioned IOPS is a feature of Amazon Elastic Block Store. EBS has always allowed customers to allocate storage volumes of the size they need and to attach these virtual volumes to their EC2 instances. Provisioned IOPS allows customers to declare the I/O rate they need the volumes to be able to deliver, up to 1,000 I/Os per second (IOPS) per volume. Volumes can be striped together to achieve reliable, low-latency virtual volumes of 20,000 IOPS or more. The ability to reliably configure and reserve over 10,000 IOPS means the vast majority of database workloads can be supported. And, in the near future, this limit will be raised allowing increasingly demanding workloads to be hosted on EC2 using EBS.


EBS-Optimized EC2 instances are a feature of EC2 that is the virtual equivalent of installing a dedicated network channel to storage. Depending upon the instance type, from 500 Mbps up to a full 1 Gbps is allocated and dedicated for storage use only. This storage communications channel is in addition to the network connection to the instance. Storage and network traffic no longer compete and, on large instance types, you can drive full 1 Gbps line rate network traffic while, at the same time, also consuming 1 Gbps to storage. Essentially, EBS-Optimized instances have a dedicated storage channel that doesn’t compete with instance network traffic.
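The numbers above lend themselves to a quick sizing check. What follows is only a sketch, not an AWS tool: the function names are hypothetical, the 4 KiB I/O size is an assumption chosen for illustration, and the channel ceiling ignores protocol overhead, so real figures will be lower.

```python
import math

PER_VOLUME_IOPS = 1_000  # per-volume Provisioned IOPS limit quoted above


def volumes_needed(target_iops: int, per_volume_iops: int = PER_VOLUME_IOPS) -> int:
    """Number of volumes to stripe together to reach target_iops."""
    if target_iops <= 0:
        raise ValueError("target_iops must be positive")
    return math.ceil(target_iops / per_volume_iops)


def channel_iops_ceiling(channel_mbps: float, io_size_bytes: int = 4096) -> int:
    """Upper bound on the IOPS a dedicated storage channel can carry.

    Assumes every I/O moves io_size_bytes (an illustrative assumption)
    and ignores protocol overhead.
    """
    bytes_per_second = channel_mbps * 1_000_000 / 8
    return int(bytes_per_second // io_size_bytes)


if __name__ == "__main__":
    for target in (10_000, 20_000):
        print(target, "IOPS ->", volumes_needed(target), "striped volumes")
    for mbps in (500, 1000):
        print(mbps, "Mbps ->", channel_iops_ceiling(mbps), "IOPS at 4 KiB")
```

Under these assumptions a 1 Gbps channel carries roughly 30,000 4 KiB I/Os per second, so as I/O sizes grow the dedicated channel, rather than the provisioned IOPS, becomes the limit.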


From the EBS detail page:


EBS standard volumes offer cost effective storage for applications with light or bursty I/O requirements.  Standard volumes deliver approximately 100 IOPS on average with a best effort ability to burst to hundreds of IOPS.  Standard volumes are also well suited for use as boot volumes, where the burst capability provides fast instance start-up times.


Provisioned IOPS volumes are designed to deliver predictable, high performance for I/O intensive workloads such as databases.  With Provisioned IOPS, you specify an IOPS rate when creating a volume, and then Amazon EBS provisions that rate for the lifetime of the volume.  Amazon EBS currently supports up to 1,000 IOPS per Provisioned IOPS volume, with higher limits coming soon.  You can stripe multiple volumes together to deliver thousands of IOPS per Amazon EC2 instance to your application. 

To enable your Amazon EC2 instances to fully utilize the IOPS provisioned on an EBS volume, you can launch selected Amazon EC2 instance types as “EBS-Optimized” instances.  EBS-optimized instances deliver dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used.  When attached to EBS-Optimized instances, Provisioned IOPS volumes are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time.  See Amazon EC2 Instance Types to find out more about instance types that can be launched as EBS-Optimized instances. 


Providing scalable block storage at-scale, in 8 regions around the world is one of the most interesting combinations of distributed systems and storage problems we face. The problem has been well solved in high-cost on-premise solutions. We now get to apply what has been learned over the last 30+ years to solve the problem at cloud-scale with low-cost and 100s of thousands of concurrent customers. An incredible number of EC2 customers depend upon EBS for their virtual storage needs, the number is growing daily, and we are really only just getting started. If you want to be part of the engineering effort to make Elastic Block Store the virtual storage solution for the cloud, send us a note at


With the announcement today, EC2 customers now have access to two very high performance storage solutions. The first solution is the EC2 High I/O Instance type announced last week, which delivers a direct attached, SSD-powered 100k IOPS for $3.10/hour. In today’s announcement this direct attached storage solution is joined by a high-performance virtual storage solution. This new type of EBS storage allows the creation of striped storage volumes that can reliably deliver 10,000 to 20,000 IOPS across a dedicated virtual storage network.


Amazon EC2 customers now have both high-performance, direct attached storage and high-performance virtual storage with a dedicated virtual storage connection.




James Hamilton 
b: /


Wednesday, August 01, 2012 4:57:52 AM (Pacific Standard Time, UTC-08:00)
 Friday, July 20, 2012

Many workloads have high I/O rate data stores at the core. The success of the entire application is dependent upon a few servers running MySQL, Oracle, SQL Server, MongoDB, Cassandra, or some other central database.


The best design pattern for any highly reliable and scalable application, whether on-premise or cloud hosted, is to shard the database. You can’t be dependent upon a single server being able to scale sufficiently to hold the entire workload. Theoretically, that’s the solution and all workloads should run well on a sufficiently large fleet even if that fleet has low individual server I/O performance. Unfortunately, few workloads scale as badly as database workloads. Even scalable systems such as MongoDB or Cassandra need to have a per-server I/O rate that meets some minimum bar to host the workload cost effectively with stable I/O performance.


The easy solution is to depend upon a hosted service like DynamoDB that can transparently scale to order 10^6 transactions per second and deliver low jitter performance. For many workloads, that is the final answer. Take the complexity of configuring and administering a scalable database and give it to a team that focuses on nothing else 24x7 and does it well.


Unfortunately,  in the database world, One Size Does Not Fit All. DynamoDB is a great solution for some workloads but many workloads are written to different stores or depend upon features not offered in DynamoDB. What if you have an application written to run on sharded Oracle (or MySQL) servers and each database requires 10s of thousands of I/Os per second? For years, this has been the prototypical “difficult to host in the cloud” workload. All servers in the application are perfect for the cloud but the overall application won’t run unless the central database server can support the workload.


Consequently, these workloads have been difficult to host on the major cloud services. They are difficult to scale out to avoid needing very high single node I/O performance and they won’t yield a good customer experience unless the database has the aggregate IOPS needed. 


Yesterday an ideal EC2 instance type was announced. It’s the screamer needed by these workloads. The new EC2 High I/O Instance type is a born database machine. Whether you are running Relational or NoSQL, if the workload is I/O intense and difficult to cost effectively scale-out without bound, this instance type is the solution. It will deliver a booming 120,000 4k reads per second and between 10,000 and 85,000 4k random writes per second. The new instance type:

·         60.5 GB of memory

·         35 EC2 Compute Units (8 virtual cores with 4.4 EC2 Compute Units each)

·         2 SSD-based volumes each with 1024 GB of instance storage

·         64-bit platform

·         I/O Performance: 10 Gigabit Ethernet

·         API name: hi1.4xlarge


If you have a difficult to host I/O intensive workload, EC2 has the answer for you. 120,000 read IOPS and 10,000 to 85,000 write IOPS for $3.10/hour Linux on demand or $3.58/hour Windows on demand. Because these I/O workloads are seldom scaled up and down in real time, the Heavy Utilization Reserved instance is a good choice where the server capacity can be reserved for $10,960 for a three year term and usage is $0.482/hour.
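To see why the reserved option is attractive for a server that runs around the clock, the prices above can be compared directly. This is a back-of-the-envelope sketch: it assumes the instance runs every hour of the three year term, ignores leap days, and uses only the Linux prices quoted in the text. The function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760; leap days ignored

ON_DEMAND_HOURLY = 3.10     # $/hour, Linux on demand (from the text)
RESERVED_UPFRONT = 10_960   # $ for a three year term (from the text)
RESERVED_HOURLY = 0.482     # $/hour usage charge (from the text)
TERM_YEARS = 3


def reserved_effective_hourly(upfront: float = RESERVED_UPFRONT,
                              hourly: float = RESERVED_HOURLY,
                              years: int = TERM_YEARS) -> float:
    """Effective $/hour when the instance runs for the whole term."""
    term_hours = years * HOURS_PER_YEAR
    return upfront / term_hours + hourly


if __name__ == "__main__":
    term_hours = TERM_YEARS * HOURS_PER_YEAR
    effective = reserved_effective_hourly()
    print(f"reserved effective rate: ${effective:.3f}/hour")
    print(f"on-demand over term:     ${ON_DEMAND_HOURLY * term_hours:,.2f}")
    print(f"reserved over term:      ${effective * term_hours:,.2f}")
    print(f"saving vs on demand:     {1 - effective / ON_DEMAND_HOURLY:.0%}")
```

With these inputs the effective reserved rate works out to about $0.90/hour, against $3.10/hour on demand.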


·         Amazon EC2 detail page:

·         Amazon EC2 pricing page:

·         Amazon EC2 Instance Types:

·         Amazon Web Services Blog:

·         Werner Vogels:


Adrian Cockcroft of Netflix wrote an excellent blog on this instance type where he gave benchmarking results from Netflix: Benchmarking High Performance I/O with SSD for Cassandra on AWS.


You can now have 100k IOPS for $3.10/hour.




James Hamilton 


Friday, July 20, 2012 5:57:25 AM (Pacific Standard Time, UTC-08:00)
 Friday, July 13, 2012

Why are there so many data centers in New York, Hong Kong, and Tokyo? These urban centers have some of the most expensive real estate in the world. The  cost of labor is high. The tax environment is unfavorable. Power costs are high.  Construction is difficult to permit and expensive. Urban datacenters are incredibly expensive facilities and yet a huge percentage of the world’s computing is done in expensive urban centers.


One of my favorite examples is the 111 8th Ave data center in New York. Google bought this datacenter for $1.9B. They already have facilities on the Columbia River where the power and land are cheap. Why go to New York when neither is true? Google is innovating in cooling technologies in their Belgium facility where they are using waste water cooling. Why go to New York where the facility is conventional, the power is predominantly coal-sourced, and the opportunity for energy innovation is restricted by legacy design and the lack of real estate available in the area around the facility? It’s pretty clear that 111 8th Ave isn’t going to be wind farm powered. A solar array could likely be placed on the roof but that wouldn’t have the capacity to run the interior lights in this large facility (see I love Solar but … for more on the space challenges of solar power at data center power densities). There isn’t space to do anything relevant along these dimensions.


Google has some of the most efficient datacenters in the world, running on some of the cleanest power sources in the world, and custom engineered from the ground up to meet their needs. Why would they buy an old facility, in a very expensive metropolitan area, with a legacy design? Are they nuts? Of course not. Google is in New York because many millions of Google customers are in New York or nearby.


Companies site datacenters near the customers of those data centers. Why not serve the planet from Iceland where the power is both cheap and clean? When your latency budget to serve customers is 200 msec, you can’t give up ¾ of that time budget on speed of light delays traveling long distances. Just crossing the continent from California to New York is a 74 msec round trip time (RTT). New York to London is 70 msec RTT. The speed of light is unbending. Actually, it’s even worse than the speed of light in that the speed of light in a fiber is about 2/3 of the speed of light in a vacuum (see Communicating Beyond the Speed of Light).
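The physics in the paragraph above is easy to check. The sketch below computes the best-case round trip time over a fiber path of a given length, using the roughly 2/3 ratio mentioned above. The 4,000 km path length in the usage example is an illustrative round number, not a measured route, and the function names are hypothetical.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_FRACTION = 2 / 3           # approximate ratio for light in fiber


def min_fiber_rtt_ms(path_km: float,
                     fiber_fraction: float = FIBER_FRACTION) -> float:
    """Lower bound on round trip time over path_km of fiber, in msec.

    Counts propagation delay only: no routing detours, no switching
    or queuing delay, so measured RTTs will be higher.
    """
    one_way_s = path_km / (C_VACUUM_KM_PER_S * fiber_fraction)
    return 2 * one_way_s * 1000


def budget_fraction(rtt_ms: float, budget_ms: float = 200.0) -> float:
    """Share of a latency budget consumed by one network round trip."""
    return rtt_ms / budget_ms


if __name__ == "__main__":
    print(f"4,000 km of fiber: at least {min_fiber_rtt_ms(4000):.1f} msec RTT")
    print(f"74 msec RTT uses {budget_fraction(74):.0%} of a 200 msec budget")
```

Measured round trip times such as the 74 msec figure above sit well above this bound because real fiber routes are not straight lines and network equipment adds delay of its own.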


Because of the cruel realities of the speed of light, companies must site data centers where their customers are. That’s why companies selling world-wide often need to have datacenters all over the world. That’s why the Akamai content distribution network has over 1,200 points of presence world-wide. To serve customers competitively, you need to be near those customers. The reason datacenters are located in Tokyo, New York, London, Singapore and other expensive metropolitan locations is they need to be near customers or near data that is in those locations. It costs considerably more to maintain datacenters all over the world but there is little alternative.


Many articles recently have been quoting the Greenpeace open letter asking Ballmer, Bezos and Cook to “go to Iceland”. See for example Letter to Ballmer, Bezos, and Cook: Go to Iceland. Having come across many of these articles recently, it seemed worth stopping and reflecting on why this hasn’t already happened. It’s not like companies just love paying more or using less environmentally friendly power sources for their data centers.


Google is in New York because it has millions of customers in New York. If it were physically possible to serve these customers from an already built, hyper efficient datacenter like Google’s facility in The Dalles, they certainly would. But that facility is 70 msec round trip away from New York. What about Iceland? Roughly the same distance. It simply doesn’t work competitively. Companies build near their users because the physics of the speed of light is unbending and uncaring.


So, what can we do? It turns out that many workloads are not latency sensitive. The right strategy is to house latency sensitive workloads near customers or the data needed at low latency and house latency insensitive workloads optimizing on other dimensions. This is exactly what Google does but, to do that, you need to have many datacenters all over the world so the appropriate facility can be selected on a workload-by-workload basis. This isn’t a practical approach for many smaller companies with only 1 or 2 datacenters to choose from.


This is another area where cloud computing can help. Cloud computing can allow mid-sized and even small companies to have many different datacenters optimized for different goals all over the world. Using Amazon Web Services, a company can house workloads near customers in Singapore, Tokyo, Brazil, and Ireland to be close to their international customers. Being close to these customers makes a big difference in the overall quality of customer experience (see: The Cost of Latency for more detail on how much latency really matters). As well as allowing a company to cost effectively have an international presence, cloud computing also allows companies to make careful decisions on where they locate workloads in North America. Again using AWS as the example, customers can place workloads in Virginia to serve the east coast or use Northern California to serve the population dense California region. If the workloads are not latency sensitive or are serving customers near the Pacific Northwest, they can be housed in the AWS Oregon region where the workload can be hosted coal free and less expensively than in Northern California.
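The placement strategy described above can be expressed as a simple rule: among the facilities that meet a workload's latency bound, pick the one that is best on the other dimensions. The sketch below uses cost as the only other dimension. The `Region` type, the `place` function, and every number in the usage example are hypothetical placeholders, not AWS prices or measured latencies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str
    rtt_ms: float         # round trip time to this workload's users
    relative_cost: float  # lower is cheaper; placeholder units


def place(regions: Sequence[Region], max_rtt_ms: Optional[float]) -> Region:
    """Pick the cheapest region that meets the latency bound.

    max_rtt_ms=None marks a latency insensitive workload, which is
    free to go wherever cost is lowest.
    """
    candidates = [r for r in regions
                  if max_rtt_ms is None or r.rtt_ms <= max_rtt_ms]
    if not candidates:
        raise ValueError("no region meets the latency bound")
    return min(candidates, key=lambda r: (r.relative_cost, r.rtt_ms))


if __name__ == "__main__":
    # Placeholder values for illustration only.
    regions = [
        Region("near-users", rtt_ms=10, relative_cost=1.3),
        Region("far-but-cheap", rtt_ms=74, relative_cost=1.0),
    ]
    print(place(regions, max_rtt_ms=30).name)    # latency sensitive
    print(place(regions, max_rtt_ms=None).name)  # latency insensitive
```

A real decision would weigh more than cost (power source, data locality, regulatory constraints), but the structure is the same: filter on latency first, then optimize the rest.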


The reality is that physics is uncaring and many workloads do need to be close to users. Cloud computing allows all companies to have access to datacenters all over the world so they can target individual workloads to the facilities that most closely meet their goals and the needs of their customers. Some computing will have to stay in New York even though it is mostly coal powered, expensive, and difficult to expand. But some workload will run very economically in the AWS West (Oregon) region where there is no coal power, expansion is cheap, and power inexpensive.


Workload placement decisions are more complex than “move to Iceland.”




James Hamilton 


Friday, July 13, 2012 7:03:27 AM (Pacific Standard Time, UTC-08:00)
 Monday, July 09, 2012

Last night, Tom Klienpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan in that the executive summary runs 86 pages in length. Overall, it’s an interesting document but I only managed to read into the first page before starting to feel disappointed. What I was hoping for was a deep dive into why the reactors failed, the root causes of the failures, and what can be done to rectify them.


Because of the nature of my job, I’ve spent considerable time investigating hardware and software system failures and what I find most difficult and really time consuming is getting to the real details. It’s easy to say there was a tsunami and it damaged the reactor complex and loss of power caused radiation release. But why did loss of power cause radiation release? Why didn’t the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addressed. Good post mortems are detailed and get to the root cause, and it’s rare that a detailed investigation of any complex system doesn’t yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root cause both technical and operational, and making detailed recommendations.


On the second page of this report, the committee members were enumerated. The committee includes 1) a seismologist, 2) 2 medical doctors, 3) a chemist, 4) a journalist, 5) 2 lawyers, 6) a social system designer, 7) one politician, and 8) no nuclear scientists, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami were clearly the seed for the event but since we can’t prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why both the reactor and nuclear material storage designs were not stable in the presence of cooling system failure. It's weird that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically, we can’t stop earthquakes and tsunamis so we need to ensure that systems remain safe in the presence of them.


Obviously the investigative team is very qualified to deal with the follow-on events: assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, it feels like the core problem is that cooling system flow was lost and both the reactors and nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.


Personally, the largest part of my interest, were it my investigation, would be focused on achieving designs stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea but still less important than ensuring these reactors and others in the country fail into a safe state.


The report reads like a political document. It’s heavy on blame, light on root cause and the technical details of the root cause failure, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn't go after the core issues that led to the nuclear release. From my perspective, the key issues are 1) scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full load operational state without external input of power, cooling water, or administrative input.


The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time. They can't depend upon pumped water cooling and have to be 100% passive and stable for long periods without tending.


My third recommendation is arguably less important than my first two but applies to all systems: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …), detailed, and able to operate correctly with large parts of the system destroyed or inoperative.


My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.


The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:

Recommendation 1:

Monitoring of the nuclear regulatory body by the National Diet

A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:

1.       To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.

2.       To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.

3.       To continue investigations on other relevant issues.

4.       To make regular reports on their activities and the implementation of their recommendations.

Recommendation 2:

Reform the crisis management system

A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:

1.       A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.

2.       National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.

3.       The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.

Recommendation 3:

Government responsibility for public health and welfare

Regarding the responsibility to protect public health, the following must be implemented as soon as possible:

1.       A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.

2.       Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.

3.       The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options. 

Recommendation 4:

Monitoring the operators

TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.

1.       The government must set rules and disclose information regarding its relationship with the operators.

2.       Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.

3.       TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.

4.       All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.

Recommendation 5:

Criteria for the new regulatory body

The new regulatory organization must adhere to the following conditions. It must be:

1.       Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.

2.       Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.

3.       Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.

4.       Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.

5.       Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.

Recommendation 6:

Reforming laws related to nuclear energy

Laws concerning nuclear issues must be thoroughly reformed.

1.       Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.

2.       The roles for operators and all government agencies involved in emergency response activities must be clearly defined.

3.       Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.

4.       New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.

Recommendation 7:

Develop a system of independent investigation commissions

A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.


Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:

1.       Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.

2.       All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.

3.       The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.

4.       Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.
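For software systems, the fourth point can be made concrete with a small fault-injection test: deliberately fail the primary path and confirm the backup path actually runs. This is a minimal sketch with hypothetical names (`read_with_fallback`, `Unavailable`, `failing_source`) and is not taken from any particular system.

```python
class Unavailable(Exception):
    """Raised by a source that has failed."""


def read_with_fallback(primary, backup):
    """Return primary() if it works, otherwise fall back to backup()."""
    try:
        return primary()
    except Unavailable:
        return backup()


def failing_source():
    """Injected fault: a source that is always down."""
    raise Unavailable("injected failure")


def test_backup_path_is_exercised():
    calls = []

    def backup():
        calls.append("backup")
        return "from-backup"

    # Force the primary to fail and check the backup really ran.
    assert read_with_fallback(failing_source, backup) == "from-backup"
    assert calls == ["backup"]


def test_double_failure_is_visible():
    # When both paths fail, the failure must surface, not be swallowed.
    try:
        read_with_fallback(failing_source, failing_source)
    except Unavailable:
        return
    raise AssertionError("double failure was silently swallowed")


if __name__ == "__main__":
    test_backup_path_is_exercised()
    test_double_failure_is_visible()
    print("failure-mode tests passed")
```

The point of running such tests routinely is the same as above: a backup path that has never been exercised should be assumed not to work.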



The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at:


Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services:




James Hamilton 


Monday, July 09, 2012 6:39:00 AM (Pacific Standard Time, UTC-08:00)
 Tuesday, June 19, 2012

The NASCAR Sprint Cup Stock Car Series kicks its season off with a bang and, unlike other sports, starts the season off with the biggest event of the year rather than closing with it. Daytona Speed Weeks is a multi-week, many race event, the finale of which is the Daytona 500. The 500 starts with a huge field of 43 cars and is perhaps most famous for some of its massive multi-car wrecks. The 17 car pile-up of 2011 made a 43 car field look like the appropriate amount of redundancy just to get a car over the finish line at the end.


Watching 43 stock cars race for the green flag at the start of the race is an impressive show of power as 146,000 lbs of metal charge towards the start line at nearly 200 miles per hour running so close that they appear to be connected.  From the stands, the noise is deafening, the wall of air they are pushing can be felt 20 rows up and the air is hot from all the waste heat spilling off the field as they scream to the line. 


Imagine harnessing all the power of all the engines from the 43 cars heading towards the start line at Daytona in a single engine. In fact, let’s make it harder: imagine having all the power of all the cars that take the green flag at both Daytona Sprint Cup races each year. For safety reasons, NASCAR restricts engine output at the Daytona and Talladega superspeedways to approximately 430 hp, but let’s stick with the 750 hp they can produce when unrestricted. If we harnessed the power of those 86 cars into a single engine, we would have an unbelievable 64,500 hp. Last week Jennifer and I were invited to tour the Hanjin Oslo container ship, which happens to be single engine powered. Believe it or not, that single engine is more powerful than the aggregate horsepower of both Daytona starting fields. It has a single 74,700 hp engine.
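
For anyone who wants to check the comparison, here is a minimal sketch of the arithmetic using only the figures quoted above. The variable names are mine and purely illustrative.

```python
# Figures quoted in the post; names are illustrative only.
CARS_PER_FIELD = 43
DAYTONA_RACES_PER_YEAR = 2
UNRESTRICTED_HP = 750      # per-car output when unrestricted
OSLO_ENGINE_HP = 74_700    # single engine on the Hanjin Oslo

# Combined output of both Daytona starting fields.
combined_hp = CARS_PER_FIELD * DAYTONA_RACES_PER_YEAR * UNRESTRICTED_HP

print(f"Both starting fields: {combined_hp:,} hp")
print(f"Hanjin Oslo engine:   {OSLO_ENGINE_HP:,} hp")
print(f"Difference:           {OSLO_ENGINE_HP - combined_hp:,} hp")
```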



Last week Peter Kim who supervises the Hanjin shipping port at Terminal 46 invited us to tour the port facility and the Hanjin Oslo container ship.  I love technology, scale, and learning how well run operations work so I jumped on the opportunity.


Shortly after arriving, we watched the Oslo being brought into Terminal 46. The captain and pilot were both looking down from the bridge wing towering more than 100’ above us, giving commands to the tugs as the Oslo was eased into the dock. Even before the ship was tied off, the port was rapidly coming to life. Dock workers were scrambling to their stations, trucks were starting, container cranes were moving into position, Customs and Border Patrol was getting ready to board, and line handlers were preparing to tie the ship off. There were workers and heavy equipment moving into position throughout the terminal. And, over the next 12 hours, more than a thousand containers would be moved before the ship would be off to its next destination at 6:30am the following morning.



The Oslo is not the newest ship in the Hanjin fleet having been built in 1998. It’s not the biggest ship nor is it the most powerful. But it’s a great example of a well-run, super clean, and expertly maintained container ship. And, starting with the size, here’s the view from the bridge.



The ship truly is huge. What I find even more amazing is that, as large as the Oslo is, there are container ships out there with up to twice the cargo carrying capacity and as much as 45% more horsepower. In fact, the world’s most powerful diesel engine is deployed in a container ship. It’s a 14 cylinder, 3 floor high monster designed by the Finnish company Wärtsilä that produces 109,000 hp.



The Hanjin Oslo uses a (slightly) smaller inline 10 cylinder version of the same engine design. The key difference between it and the world’s largest diesel shown above is that the engine in the Oslo is 4 cylinders shorter at 10 cylinders inline rather than 14 and it produces proportionally less power. On the Oslo, the engine spans 3 decks so you can only see 1/3 of it at any one time. Here’s the view from the Hanjin Oslo engine room top deck, mid deck, and lower deck:





The engine is clearly notable for its size and power output. But what I find most surprising is that it’s a two-stroke engine. Two-stroke engines produce power at the beginning of the power stroke where the piston is heading down, dump the exhaust towards the end of that stroke, then bring in fresh air at the beginning of the next stroke as the piston begins heading back up, and then compress the air for the remainder of that stroke. Towards the end of the compression stroke, fuel is injected into the cylinder where it combusts, rapidly building pressure and pushing the piston back down on the power stroke. Four-stroke engines separate these functions into four strokes: 1) power going down, 2) exhaust going up, 3) intake going down, and then 4) compression going up.


Two-stroke engines are common in lawn mowers, chainsaws, and some very small outboards because of their high power to weight ratio and simplicity of design that makes very low cost engines possible. Larger diesel engines used in trucks and automobiles are almost exclusively 4 stroke engines. Ironically, the very highest output diesel engines found in large marine applications are also two strokes.



After spending an evening with the Hanjin team, I was super impressed. I love the technology, the scale was immense, everything was very well maintained, and they are clearly excellent operators. If I were moving goods between continents, I would look first to Hanjin.

Other pictures from our Hanjin Oslo visit:




James Hamilton 

Tuesday, June 19, 2012 3:51:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Monday, May 28, 2012

Cooling is the largest single non-IT (overhead) load in a modern datacenter. There are many innovative solutions to addressing the power losses in cooling systems. Many of these mechanical system innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.


The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speed which increases the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are measurable, they have a very small impact even at quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.


The net of these factors is that fear of higher server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often quoted study reports that the failure rate of electronics doubles with every 10C increase in temperature (MIL-HDBK 217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a very simple model relating failure to heat.
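
To get a feel for what is at stake, here is a small sketch of the rule of thumb as quoted. Taken literally, doubling every 10C compounds, so the predicted failure rate climbs quickly. The linear curve is a hypothetical comparison whose slope I picked so that the two agree at +10C. Neither function comes from MIL-HDBK 217F or from any measured data.

```python
def doubling_rule_multiplier(delta_c: float) -> float:
    """Failure-rate multiplier if the rate doubles with every 10C increase."""
    return 2.0 ** (delta_c / 10.0)


def linear_multiplier(delta_c: float, slope_per_c: float = 0.1) -> float:
    """Hypothetical linear alternative; the slope is an assumption chosen
    so that both curves predict a 2x rate at +10C."""
    return 1.0 + slope_per_c * delta_c


for delta in (0, 10, 20, 30):
    print(
        f"+{delta:2d}C  doubling rule: {doubling_rule_multiplier(delta):4.1f}x"
        f"   linear: {linear_multiplier(delta):4.1f}x"
    )
```

The gap between the two curves widens with every additional 10C, which is why it matters so much which shape the real data follows.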


A recent paper does an excellent job of methodically digging through the possible issues of high datacenter temperature and investigating each concern. I like Temperature Management in Data Centers: Why Some (Might) Like it Hot for two reasons: 1) it unemotionally works through the key issues and concerns, and 2) it draws from a substantial sample of 7 production data centers at Google so the results are credible.


From the introduction:


Interestingly, one key aspect in the thermal management of a data center is still not very well understood: controlling the setpoint temperature at which to run a data center’s cooling system. Data centers typically operate in a temperature range between 20C and 22C, some are as cold as 13C degrees [8, 29]. Due to lack of scientific data, these values are often chosen based on equipment manufacturers’ (conservative) suggestions. Some estimate that increasing the setpoint temperature by just one degree can reduce energy consumption by 2 to 5 percent [8, 9]. Microsoft reports that raising the temperature by two to four degrees in one of its Silicon Valley data centers saved $250,000 in annual energy costs [29]. Google and Facebook have also been considering increasing the temperature in their data centers [29].


The authors go on to observe that “the details of how increased data center temperatures will affect hardware reliability are not well understood and existing evidence is contradictory.” The remainder of the paper presents the data as measured in the 7 production datacenters under study and concludes each section with an observation. I encourage you to read the paper and I’ll cover just the observations here:


Observation 1: For the temperature range that our data covers with statistical significance (< 50C), the prevalence of latent sector errors increases much more slowly with temperature, than reliability models suggest. Half of our model/data center pairs show no evidence of an increase, while for the others the increase is linear rather than exponential.


Observation 2: The variability in temperature tends to have a more pronounced and consistent effect on Latent Sector Error rates than mere average temperature


Observation 3: Higher temperatures do not increase the expected number of Latent Sector Errors (LSEs) once a drive develops LSEs, possibly indicating that the mechanisms that cause LSEs are the same under high or low temperatures.


Observation 4: Within a range of 0-36 months, older drives are not more likely to develop Latent Sector Errors under temperature than younger drives.


Observation 5: High utilization does not increase Latent Sector Error rates under temperatures.


Observation 6: For temperatures below 50C, disk failure rates grow more slowly with temperature than common models predict. The increase tends to be linear rather than exponential, and the expected increase in failure rates for each degree increase in temperature is small compared to the magnitude of existing failure rates.


Observation 7: Neither utilization nor the age of a drive significantly affect drive failure rates as a function of temperature.


Observation 8: We do not observe evidence for increasing rates of uncorrectable DRAM errors, DRAM DIMM replacements or node outages caused by DRAM problems as a function of temperature (within the range of temperature our data comprises).


Observation 9: We observe no evidence that hotter nodes have a higher rate of node outages, node downtime or hardware replacements than colder nodes.


Observation 10: We find that high variability in temperature seems to have a stronger effect on node reliability than average temperature.


Observation 11: As ambient temperature increases, the resulting increase in power is significant and can be mostly attributed to fan power. In comparison, leakage power is negligible.


Observation 12: Smart control of server fan speeds is imperative to run data centers hotter. A significant fraction of the observed increase in power dissipation in our experiments could likely be avoided by more sophisticated algorithms controlling the fan speeds.


Observation 13: The degree of temperature variation across the nodes in a data center is surprisingly similar for all data centers in our study. The hottest 5% nodes tend to be more than 5C hotter than the typical node, while the hottest 1% nodes tend to be more than 8–10C hotter.


The paper under discussion:


Other notes on increased data center temperatures:

·         Exploring the Limits of Datacenter Temperature

·         Chillerless Data Center at 95F

·         Computer Room Evaporative Cooling

·         Next Point of Server Differentiation: Efficiency at Very High Temperature

·         Open Compute Mechanical System Design

·         Example of Efficient Mechanical Design

·         Innovative Datacenter Design: Ishikari Datacenter


James Hamilton 


Monday, May 28, 2012 7:48:11 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Wednesday, May 23, 2012

Urs Holzle did the keynote talk at the 2012 Open Networking Summit where he focused on Software Defined Networking in Wide Area Networking. Urs leads the Technical Infrastructure group at Google where he is Senior VP and Technical Fellow. Software defined networking (SDN) is the central management of network routing decisions rather than depending upon distributed routing algorithms running semi-autonomously on each router. Essentially what is playing out in the networking world is a replay of what we have seen in the server world across many dimensions. The dimension that is central to the SDN discussion is that the 10k to 50k servers in a datacenter are not managed individually by an administrator, and the nodes making up the networking fabric shouldn’t be either.


The key observations behind SDN are 1) if the entire system is under single administrative control, central routing control is possible, 2) at the scale of a single administrative domain, central control of networking routing decisions is practical, and 3) central routing control brings many advantages including faster convergence on failure, priority-based routing decisions when resource constrained, application-aware routing, and the ability for the same software system that manages application deployment to manage network configuration.
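
To make the idea of central routing control concrete, here is a toy sketch. A single controller holds the whole topology, computes paths with Dijkstra’s algorithm, and simply recomputes when a link fails. The class name, the four-node topology, and the link costs are all hypothetical. Real controllers, and the OpenFlow protocol itself, are far richer than this.

```python
import heapq


def shortest_path(topology, src, dst):
    """Dijkstra over a dict-of-dicts topology: {node: {neighbor: cost}}."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == dst:
            break
        for neighbor, cost in topology.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if dst not in dist:
        return None  # destination unreachable
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


class CentralController:
    """Hypothetical controller with a global view of every link."""

    def __init__(self, links):
        self.topology = {}
        for a, b, cost in links:
            self.topology.setdefault(a, {})[b] = cost
            self.topology.setdefault(b, {})[a] = cost

    def fail_link(self, a, b):
        # The controller learns of the failure and updates its global view.
        self.topology[a].pop(b, None)
        self.topology[b].pop(a, None)

    def route(self, src, dst):
        return shortest_path(self.topology, src, dst)


controller = CentralController(
    [("A", "B", 1), ("B", "C", 1), ("A", "D", 2), ("D", "C", 2)]
)
path_before = controller.route("A", "C")
controller.fail_link("B", "C")
path_after = controller.route("A", "C")
print(path_before, "->", path_after)
```

Because one component sees the whole graph, the new path is available as soon as the controller hears about the failure. There is no waiting for distributed routing protocols to converge.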


In Holzle’s talk, he motivated SDN by first talking about WAN economics:

·         Cost per bit/sec delivered should go down with scale rather than up (consider analogy in compute and storage)

·         However, cost/bit doesn’t naturally decrease with size due to:

o   Quadratic complexity in pairwise interactions

o   Manual management and configuration of individual elements

o   Complexity of automation due to non-standard vendor configuration APIs

·         Solution: Manage the WAN as a fabric rather than as a collection of individual boxes

·         Current equipment and protocols don’t support this:

o   Internet protocols are box-centric rather than fabric-centric

o   Little support for monitoring and operations

o   Optimized for “eventual  consistency” in networking

o   Little baseline support for low-latency routing and fast failover

·         Advantages of central traffic engineering:

o   Better networking utilization with a global view

o   Converges faster to target optimum on failure

o   Allows more control and to specify application intent:

§  Deterministic behavior simplifies planning vs overprovisioning for worst case variability

o   Can mirror production event streams for testing to support faster innovation and robust software development

o   Controller uses modern server hardware (50x better performance)

·         Testability matters:

o   Decentralized requires a full scale test bed of production network to test new traffic engineering features

o   Centralized can tap real production input to research new ideas and to test new implementations

·         SDN Testing Strategy:

o   Various logical modules enable testing in isolation

o   Virtual environment to experiment and test with the complete system end-to-end

§  Everything is real except the hardware

o   Allows use of tools to validate state across all devices after every update from central server

§  Enforce ‘make before break’ semantics

o   Able to simulate the entire back-bone with real monitoring and alerts

·         Google is using custom networking equipment with 100s of ports of 10GigE

o   Dataplane runs on merchant silicon routing ASICs

o   Control plane runs on Linux hosted on custom hardware

o   Supports OpenFlow

o   Quagga BGP and ISIS stacks

o   Only supports the protocols in use at Google

·         OpenFlow Deployment History:

o   The OpenFlow deployment was done on the Google internal (non-customer facing) network

o   Phase I: Spring 2010

§  Install OpenFlow-controlled switches but make them look like regular routers

§  BGP/ISIS/OSPF now interfaces with OpenFlow controller to program switch state

§  Installation procedure:

·         Pre-deploy gear at one site, take down 50% of bandwidth, perform upgrade, bring new equipment online and repeat with the remaining capacity

·         Repeat at other sites

o   Phase II: Mid 2011

§  Activate simple SDN without traffic engineering

§  Ramp traffic up on test network

§  Test transparent software rollouts

o   Phase III: Early 2012

§  All datacenter backbone traffic carried by new network

§  Rolled out central traffic engineering

·         Optimized routing based upon 7 application level priorities

·         Globally optimized flow placement

§  External copy scheduler works with the OpenFlow controller to implement deadline scheduling for large data copies

·         Google SDN Experience:

o   Much faster iteration: deployed production quality centralized traffic engineering in 2 months

§  Fewer devices to update

§  Much better testing prior to roll-out

o   Simplified high-fidelity test environment

o   No packet loss during upgrade

o   No capacity loss during upgrade

o   Most features don’t touch the switch

o   Higher network utilization

o   More stable

o   Unified view of entire network fabric (rather than router-by-router view)

o   Able to implement:

§  Traffic engineering with higher quality of service awareness and predictability

§  Latency, loss, bandwidth, and deadline sensitivity in routing decisions

o   Improved routing decisions:

§  Based upon a priori knowledge of network topology

§  Based upon L1 and L3 connectivity

o   Improved monitoring and alerts

·         SDN Challenges:

o   OpenFlow protocol barebones but good enough

o   Master election/control plane partition challenging to handle

o   What to leave on router and what to run centrally?

o   Flow programming can be slow for large networks

·         Conclusions:

o   OpenFlow is ready for real world use

o   SDN is ready for real world use

§  Enables rich feature deployment

§  Simplified network management

o   Google’s Datacenter WAN runs on OpenFlow

§  Largest production network at Google

§  Improved manageability

§  Lower cost
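
The talk notes mention routing optimized on application-level priorities but do not describe the allocation algorithm. As one simple possibility, here is a strict-priority sketch for a single constrained link. The function name, the traffic classes, and the bandwidth figures are hypothetical and not taken from the talk.

```python
def allocate_by_priority(capacity_gbps, demands):
    """Grant bandwidth in strict priority order on one constrained link.

    demands: list of (priority, name, wanted_gbps); a lower priority
    number means more important traffic. This is an illustrative
    policy, not the scheme described in the talk.
    """
    remaining = capacity_gbps
    allocation = {}
    for _priority, name, wanted in sorted(demands):
        granted = min(wanted, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


link_allocation = allocate_by_priority(
    10,
    [
        (2, "bulk-copy", 8),
        (0, "user-facing", 4),
        (1, "replication", 5),
    ],
)
print(link_allocation)
```

With 10 Gbps available and 17 Gbps requested, the two higher-priority classes are fully served and the bulk copy absorbs the shortfall. That is the kind of behavior a deadline-aware copy scheduler can then plan around.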


A video of Urs’ talk is available at: OpenFlow @ Google


James Hamilton 



Wednesday, May 23, 2012 6:28:21 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Tuesday, April 17, 2012

Most of the time I write about the challenges posed by scaling infrastructure.  Today, though, I wanted to mention some upcoming events that have to do with a different sort of scale.

In Amazon Web Services we are tackling lots of really hairy challenges as we build out one of the world’s largest cloud computing platforms.  From data center design, to network architecture, to data persistence, to high-performance computing and beyond, we have a virtually limitless set of problems needing to be solved.  Over the coming years AWS will be blazing new trails in virtually every aspect of computing and infrastructure.

In order to tackle these opportunities we are searching for innovative technologists to join the AWS team.  In other words we need to scale our engineering staff.  AWS has hundreds of open positions throughout the organization.  Every single AWS team is hiring including EC2, S3, EBS, EMR, CloudFront, RDS, DynamoDB and even the AWS-powered Amazon Silk web browser.

On May 17th and 18th we will be holding recruiting events in three cities: Houston, Minneapolis, and Nashville.  If you live near any of those cities and are passionate about defining and building the future of computing you will find more information at the following URL  You can also send your resume to and we will follow up with you.




James Hamilton 

Tuesday, April 17, 2012 11:44:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, March 29, 2012

I met Google’s Wolf-Dietrich Weber at the 2009 CIDR conference where he presented what is still one of my favorite datacenter power-related papers. I liked the paper because the gain was large, the authors weren’t confused or distracted by much of what is incorrectly written on datacenter power consumption, and the technique is actually practical. In Power Provisioning for a Warehouse-sized Computer, the authors argue that we should oversell power, the most valuable resource in a data center.  Just as airlines oversell seats, their key revenue producing asset, datacenter operators should oversell power.


Most datacenter operators take the critical power, the total power available to the data center less power distribution losses and mechanical system cooling loads, then reduce it by at least 10 to 20% to protect against the risk of overdraw which can draw penalty or power loss. Servers are then provisioned to this reduced critical power level. But, the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter but that’s another discussion) so there is close to no chance that all servers will draw their full load at the same time.  And, with some diversity of workloads, even with some services spiking to 100%, we can often exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than we actually have critical power.
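
A back-of-the-envelope sketch shows why this matters. Every figure below is hypothetical: the 10MW critical power, the 15% safety margin, the 500W nameplate draw, and especially the assumption that the observed fleet-wide peak works out to 350W per server. The paper has the real measurements.

```python
def servers_supported(critical_power_w, safety_margin_pct, per_server_w):
    """Servers that fit under critical power after holding back a margin.

    Integer arithmetic keeps the example exact.
    """
    budget_w = critical_power_w * (100 - safety_margin_pct) // 100
    return budget_w // per_server_w


CRITICAL_POWER_W = 10_000_000   # hypothetical 10MW of critical power
SAFETY_MARGIN_PCT = 15          # within the 10 to 20% range mentioned above
NAMEPLATE_W = 500               # hypothetical per-server peak draw
OBSERVED_PEAK_W = 350           # assumed per-server share of the fleet-wide peak

conservative = servers_supported(CRITICAL_POWER_W, SAFETY_MARGIN_PCT, NAMEPLATE_W)
oversold = servers_supported(CRITICAL_POWER_W, SAFETY_MARGIN_PCT, OBSERVED_PEAK_W)

print(f"Provisioned to nameplate peak: {conservative:,} servers")
print(f"Provisioned to observed peak:  {oversold:,} servers")
print(f"Additional servers hosted:     {oversold - conservative:,}")
```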


This is exactly what airlines do when selling seats. And, just as airlines need to be able to offer a free ticket to Hawaii in the unusual event that they find a flight over-subscribed, we need the same safety valve here. Some datacenter equivalents of a free ticket to Hawaii are: 1) delay all non-customer impacting workloads (administrative and operational batch jobs), 2) stop non-critical or best-effort workloads, and 3) force servers into lower power states. This last one is a favorite research topic but is almost never done in practice because it is the equivalent of solving the oversold airline seat problem by actually having two people sit in the same seat. It sort of works but isn’t safe and doesn’t make for happy customers. Option #3 reduces the resources available to all workloads, lowering overall quality of service. For most businesses this is not a good economic choice. The best answers are options 1 and 2 above.
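
Here is a minimal sketch of the safety valve as an ordered list of actions, applied only until the overdraw is covered. The wattage recovered by each action and the function name are hypothetical. A real system would measure these rather than assume them.

```python
def shed_load(overdraw_w, actions):
    """Apply actions in preference order until the overdraw is covered.

    actions: ordered list of (name, watts_recoverable).
    Returns the actions taken and any overdraw still uncovered.
    """
    taken = []
    for name, watts in actions:
        if overdraw_w <= 0:
            break
        taken.append(name)
        overdraw_w -= watts
    return taken, max(overdraw_w, 0)


# Preference order follows the options above; lower power states come last.
ACTIONS = [
    ("delay batch jobs", 30_000),
    ("stop best-effort workloads", 40_000),
    ("force lower power states", 60_000),
]

taken, uncovered_w = shed_load(50_000, ACTIONS)
print(taken, uncovered_w)
```

In this example the first two actions cover a 50kW overdraw, so the quality-of-service-reducing third option is never reached.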


One class of application that is particularly difficult to manage efficiently is online data-intensive (OLDI) workloads. Web search, advertising, and machine translation are examples of this workload type. In the note The Cost of Latency, we reviewed the importance of very rapid response in these workload types and in ecommerce systems. These workloads can be very profitable, so option #3 above, reducing the quality of service to save power, doesn’t make economic sense.


The best answer for these workloads is what Barroso and Hoelzle refer to as Energy Proportional Computing (The Case for Energy Proportional Computing). Essentially the goal of energy proportional computing is that a server at 10% load should consume 10% of the power of a server running at 100% load.  Clearly there is overhead and this goal will never be fully achieved but, the closer we get, the lower the cost and environmental impact for hosting OLDI workloads.


The good news is there has been progress. When energy proportional computing was first proposed, many servers at idle would consume 80% of the power they would consume at full load. Today, a good server can be as low as 45% at idle. We are nowhere close to where we want to be but good progress is being made. In fact, CPUs are quite good by this measure today -- the worst offenders are the other components in the server. Memory has big opportunities and the mobile consumer device world shows us what is possible. I expect we’ll continue to progress by stealing ideas from the cell phone industry and applying them to servers.
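
To put numbers on the gap, here is a sketch that assumes power rises linearly from idle to full load. The linear shape is my simplifying assumption, since real servers follow a less tidy curve. The 80% and 45% idle figures are the ones quoted above.

```python
def power_fraction(load, idle_fraction):
    """Fraction of peak power drawn at a given load (0.0 to 1.0),
    assuming a linear rise from idle power to peak power."""
    return idle_fraction + (1.0 - idle_fraction) * load


LOAD = 0.10  # a server running at 10% load

for label, idle in (("older server", 0.80), ("good server today", 0.45), ("ideal", 0.0)):
    print(f"{label:18s} idle={idle:.0%}  power at 10% load={power_fraction(LOAD, idle):.1%}")
```

Under this model, even a good server today draws roughly half of its peak power to do a tenth of the work, against the proportional ideal of a tenth.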


In Power Management of Online Data-Intensive Services, a research team from Google and the University of Michigan target the OLDI power proportionality problem focusing on Google search, advertising, and translation workloads. These workloads are difficult because the latency goals are achieved using large in-memory caches and, as workload moves from peak to valley, all these machines need to stay available in order to meet the application latency goals.  It is not an option to concentrate the workload on fewer servers – the cache size requires all the servers continue to be available so, as workload goes down towards idle, all the servers continue to have some small amount of workload so they can’t be dropped into full system low power states.


The data cache size requires the memory of all the servers so, as the workload volume goes down, each server gets progressively less busy but never actually hits idle. They always need to be online and available so the next request can be served at the required latency. The paper draws the following conclusions:

·         CPU active low-power modes provide the best single power-performance mechanism but, by themselves, cannot achieve power proportionality

·         There is a pressing need to improve idle low-power modes for shared caches and on-chip memory controllers

·         There is a substantial opportunity to save memory system power with low-power modes [mobile systems do this well today so the techniques are available]

·         Even with query-batching, full system idle low-power modes cannot provide acceptable latency-power tradeoffs

·         Coordinated, full-system active low-power modes hold the greatest promise to achieve energy proportionality with acceptable query latency


Summarizing the OLDI workload type as presented in the paper, the workload latency goals are achieved by spreading very large data caches over the operational servers.  As the workload goes from peak to trough, these servers all get less busy but never are actually at idle so can’t be dropped into a full system lower power state.  


I like to look at servers supporting these workloads as being in a two dimensional grid. Each row represents one entire copy of the cache spread over 100s of servers. A single row could serve the workload and successfully deliver on the application latency goals but a single row will not scale. To scale to workloads beyond that which can be served on a single row, more rows are added. When a search query comes into the system, it is sent to the 100s of systems in a single row but only to the servers in a single row. Looking at the workload this way, I would argue, we actually do have some ability to make OLDI workloads power proportional at a warehouse-scale. When the workload goes up towards peak, more rows are needed. When the workload reduces towards trough, fewer rows are used and the rows not currently in use can be used to support other workloads.


This row-level scaling technique produces very nearly full proportionality at the overall datacenter level with two problems: 1) the workload can’t scale down below a row for all the reasons outlined in the paper, and 2) if the workload is very dynamic and jumps from trough to peak quickly, more rows need to be kept ready in case they are needed, which further reduces the power proportionality of the technique.


If a workload is substantially higher scale than a single row and predictably swings from trough to peak, this per-row scaling technique produces very good results. It fails where workloads change dramatically or where less-than-single row scaling is needed.
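
The row-level technique can be sketched in a few lines. The function name, the per-row capacity, and the query rates below are all hypothetical. The point is only to show the floor of one row and the cost of keeping standby rows ready.

```python
import math


def rows_needed(load_qps, row_capacity_qps, standby_rows=0):
    """Rows to keep serving the workload.

    At least one full row is always required because each row holds
    one complete copy of the cache. Standby rows model capacity held
    ready for sudden jumps from trough to peak.
    """
    active = max(1, math.ceil(load_qps / row_capacity_qps))
    return active + standby_rows


ROW_CAPACITY_QPS = 1_000  # hypothetical capacity of a single row

peak_rows = rows_needed(10_000, ROW_CAPACITY_QPS)
trough_rows = rows_needed(2_500, ROW_CAPACITY_QPS)
floor_rows = rows_needed(100, ROW_CAPACITY_QPS)
bursty_rows = rows_needed(2_500, ROW_CAPACITY_QPS, standby_rows=2)

print(peak_rows, trough_rows, floor_rows, bursty_rows)
```

Between trough and peak the row count tracks the load closely. Below one row’s worth of traffic it stops tracking at all, and every standby row is power spent on capacity that may not be used.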


The two referenced papers:

·         Power Management of Online Data-Intensive Services

·         Power Provisioning for a Warehouse-sized Computer


Thanks to Alex Mallet for sending me Power Management of Online Data-Intensive Services.




James Hamilton 


Thursday, March 29, 2012 6:22:31 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Saturday, March 17, 2012

I love solar power, but in reflecting carefully on a couple of high profile datacenter deployments of solar power, I’m really developing serious reservations that this is the path to reducing data center environmental impact. I just can’t make the math work and find myself wondering if these large solar farms are really somewhere between a bad idea and pure marketing, where the environmental impact is purely optical.


Facebook Prineville

The first of my two examples is the high profile installation of a large solar array at the Facebook Prineville, Oregon facility. The installation of 100 kilowatts of solar power was the culmination of the unfriend coal campaign run by Greenpeace. Many in the industry believe the campaign worked.  In the purest sense, I suppose it did. But let’s look at the data more closely and make sure this really is environmental progress. What was installed in Prineville was a 100 kilowatt solar array at a more than 25 megawatt facility (Facebook Installs Solar Panels at new Data Center). Even though this is actually a fairly large solar array, it’s only providing 0.4% of the overall facility power.


Unfortunately, the actual numbers are further negatively impacted by weather and high latitude. Solar arrays produce far less than their rated capacity due to night duration, cloud cover, and other negative impacts from weather. I really don’t want to screw up my Seattle recruiting pitch too much but let’s just say that occasionally there are clouds in the Pacific Northwest :-). Clearly there are fewer clouds at 2,868’ elevation in the Oregon desert but, even at that altitude, the sun spends the bulk of the time poorly positioned for power generation.


Using this solar panel output estimator, we can see that the panels at this location and altitude yield an effective output of 13.75% of rated capacity. That means that, on average, this array will only put out 13.75 kilowatts. That would have this array contributing 0.055% of the facility power or, worded differently, it might run the lights in the datacenter but it has almost no measurable impact on the overall energy consumed. Although this is pointed to as an environmentally conscious decision, it really has close to no influence on the overall environmental impact of this facility. As a point of comparison, this entire solar farm produces approximately as much output as one high density rack of servers consumes. Just one rack of servers is not success, it doesn’t measurably change the coal consumption, and almost certainly isn’t good price/performance.
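
The Prineville numbers are easy to reproduce. This sketch uses only the figures quoted above and treats the facility as exactly 25 megawatts, which is the lower bound given. The variable names are illustrative.

```python
ARRAY_RATED_KW = 100      # rated capacity of the solar array
FACILITY_KW = 25_000      # facility power, using the 25MW lower bound
CAPACITY_FACTOR = 0.1375  # effective output from the estimator

rated_share = ARRAY_RATED_KW / FACILITY_KW
average_output_kw = ARRAY_RATED_KW * CAPACITY_FACTOR
average_share = average_output_kw / FACILITY_KW

print(f"Share of facility power at rated capacity: {rated_share:.1%}")
print(f"Average array output: {average_output_kw:.2f} kW")
print(f"Share of facility power on average: {average_share:.3%}")
```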


Having said that the Facebook solar array is very close to purely marketing expense, I hasten to add that Facebook is one of the most power-efficient and environmentally-focused large datacenter operators. Ironically, they are in fact very good environmental stewards, but the solar array isn’t really a material contributor to what they are achieving.


Apple iDataCenter, Maiden, North Carolina

The second example I wanted to look at is Apple’s facility at Maiden, North Carolina, often referred to as iDataCenter.  In the Facebook example discussed above, the solar array was so small as to have nearly no impact on the composition or amount of power consumed by the facility. However, in this example, the solar farm deployed at the Apple Maiden facility is absolutely massive. In fact, this photo voltaic deployment is reported to be the largest commercial deployment in the US at 20 megawatts. Given the scale of this deployment, it has a far better chance to work economically.


The Apple Maiden facility is reported to cost $1B for the 500,000 sq ft datacenter.  Apple wisely chose not to publicly announce their power consumption numbers but estimates have been as high as 100 megawatts. If you conservatively assume that only 60% of the square footage is raised floor and they are averaging a fairly low 200W/sq ft, the critical load would still be 60MW (the same as the 700,000 sq ft Microsoft Chicago datacenter).  At a moderate Power Usage Effectiveness (PUE) of 1.3, Apple Maiden would be at 78MW of total power. Even with these fairly conservative numbers for a modern datacenter build, that is a huge facility. The actual number is likely somewhat higher.


Apple elected to put in a 20MW solar array at this facility. Again, using the location and elevation data from Wikipedia and the solar array output model referenced above, we see that the Apple location is more solar friendly than Oregon. Using this model, we see that the 20MW photo voltaic deployment has an average output of 15.8% which yields 3.2MW.


The solar array requires 171 acres of land, which is 7.4 million sq ft. What if we were to build a solar array large enough to power the entire facility using these solar and land consumption numbers? If the solar farm were to supply all the power of the facility, it would need to be 24.4 times larger. It would be a 488 megawatt capacity array requiring 4,172 acres, which is 181 million sq ft. That means that a 500,000 sq ft facility would require 181 million sq ft of power generation or, converted to a ratio, each data center sq ft would require 362 sq ft of land.
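The arithmetic in the last three paragraphs can be reproduced in a few lines. This is only a sketch: every input is an estimate quoted in this post (60% raised floor, 200W/sq ft, PUE of 1.3, 15.8% average output, 171 acres) rather than an Apple-published figure, and the function and key names are my own. Intermediate values are rounded the same way the text rounds them, so the results land within rounding of the numbers above.

```python
SQ_FT_PER_ACRE = 43_560


def solar_land_estimate(
    facility_sq_ft=500_000,       # reported facility size
    raised_floor_fraction=0.60,   # assumption from the post
    watts_per_sq_ft=200,          # assumption from the post
    pue=1.3,                      # assumption from the post
    array_capacity_mw=20,         # reported array capacity
    capacity_factor=0.158,        # average output from the solar model
    array_acres=171,              # reported land use
):
    """Estimate how large the array would need to be to power the facility."""
    critical_load_mw = facility_sq_ft * raised_floor_fraction * watts_per_sq_ft / 1e6
    total_power_mw = critical_load_mw * pue
    # Round intermediate values to one decimal, as the text does.
    array_output_mw = round(array_capacity_mw * capacity_factor, 1)
    scale = round(total_power_mw / array_output_mw, 1)
    full_acres = array_acres * scale
    full_sq_ft = full_acres * SQ_FT_PER_ACRE
    return {
        "critical_load_mw": critical_load_mw,
        "total_power_mw": total_power_mw,
        "array_output_mw": array_output_mw,
        "scale": scale,
        "full_capacity_mw": array_capacity_mw * scale,
        "full_acres": full_acres,
        "full_sq_ft": full_sq_ft,
        "land_sq_ft_per_datacenter_sq_ft": full_sq_ft / facility_sq_ft,
    }


if __name__ == "__main__":
    for name, value in solar_land_estimate().items():
        print(f"{name}: {value:,.1f}")
```

Changing any single assumption (for example, a higher watts per sq ft figure) shows quickly how sensitive the land ratio is to the inputs.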


Do we really want to give up that much space at each data center? Most data centers are in highly populated areas, where a ratio of 1 sq ft of datacenter floor space requiring 362 sq ft of power generation space is ridiculous on its own and made close to impossible by the power generation space needing to be un-shadowed. There isn’t enough roof top space across all of NY to take this approach. It is simply not possible in that venue.


Let’s focus instead on large datacenters in rural areas where the space can be found. Apple is reported to have cleared trees off of 171 acres of land in order to provide photovoltaic power for 4% of their overall estimated data center consumption. Is that gain worth clearing and consuming 171 acres? In Apple Planning Solar Array Near iDataCenter, the author Rich Miller of Data Center Knowledge quotes local North Carolina media reporting that “local residents are complaining about smoke in the area from fires to burn off cleared trees and debris on the Apple property.”


I’m personally not crazy about clearing 171 acres in order to supply only 4% of the power at this facility. There are many ways to radically reduce aggregate data center environmental impact without as much land consumption. Personally, I look first to increasing the efficiency of power distribution, cooling, storage, networking, and servers, and to increasing overall utilization, as the best routes to lowering industry environmental impact.


Looking more deeply at the solar array at Apple Maiden, the panels are built by SunPower. SunPower is reportedly carrying $820m in debt and has received a $1.2B federal government loan guarantee. The panels are built on taxpayer guarantees and installed using taxpayer-funded tax incentives. It might possibly be a win for the overall economy but, as I work through the numbers, it seems less clear. And, after the spectacular failure of solar cell producer Solyndra, which ended in bankruptcy with a $535 million federal loan guarantee, it’s obvious there are large costs being carried by taxpayers in these deployments. Generally, as much as I like data centers, I’m not convinced that taxpayers should be paying to power them.


As I work through the numbers from two of the most widely reported upon datacenter solar array deployments, they just don’t seem to balance out positively without tax incentives. I’m not convinced that having the tax base fund datacenter deployments is a scalable solution. And, even if it could be shown that this will eventually become tax neutral, I’m not convinced we want to see datacenter deployments consuming 100s of acres of land on power generation. And, when trees are taken down to allow the solar deployment, it’s even harder to feel good about it.  From what I have seen so far, this is not heading in the right direction. If we had $x dollars to invest in lowering datacenter environmental impact and the marketing department was not involved in the decision, I’m not convinced the right next step will be solar.


James Hamilton 


Saturday, March 17, 2012 12:22:58 PM (Pacific Standard Time, UTC-08:00)
 Sunday, February 26, 2012

Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC motherboards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.

Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.

On one high scale, on-premise server product I worked upon, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps the corruption was introduced by hardware, perhaps firmware, or possibly software. It could even have been introduced by one of our previous releases when those pages were last written. Some of these pages may not have been written for years.
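To make the idea of a page checksum concrete, here is a minimal sketch. It is not the code from the product described above; the 4 KB page size, the header layout, and the function names are all assumptions chosen for illustration. A CRC32 of the page body is stored in the page header when the page is written and recomputed when the page is read back.

```python
import struct
import zlib

PAGE_SIZE = 4096                 # hypothetical page size
HEADER = struct.Struct("<I")     # 4-byte little-endian CRC32 at the front of the page


def seal_page(payload: bytes) -> bytes:
    """Build a full page: CRC32 header followed by the zero-padded payload."""
    body_size = PAGE_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload too large for one page")
    body = payload.ljust(body_size, b"\x00")
    return HEADER.pack(zlib.crc32(body)) + body


def verify_page(page: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the stored value."""
    if len(page) != PAGE_SIZE:
        return False
    (stored,) = HEADER.unpack_from(page)
    return zlib.crc32(page[HEADER.size:]) == stored


if __name__ == "__main__":
    page = seal_page(b"customer row data")
    print(verify_page(page))              # True: page is intact

    damaged = bytearray(page)
    damaged[100] ^= 0x01                  # simulate a single flipped bit on disk
    print(verify_page(bytes(damaged)))    # False: corruption is detected
```

The check costs one pass over the page on each read, and it catches corruption regardless of which layer below introduced it.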

I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating system has millions of lines of code. And our application had millions of lines of code. Any of us can screw up, each has an opportunity to corrupt, and it’s highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually currently running.

Another example. In this case, a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting. Over the course of several months, the result was somewhere between amazing and frightening. ECC is firing constantly.

The immediate lesson is that you absolutely do need ECC in server applications and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.

Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week and learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt, both of which roughly translate to Phobos Ground) was designed to, amongst other objectives, return soil samples from the Martian moon Phobos. This mission was launched atop the Zenit-2SB launch vehicle, taking off from the Baikonur Cosmodrome at 2:16am on November 9th, 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay has subsequently sent the satellite plunging to earth in a fiery end to what was a very expensive mission.

What went wrong aboard Phobos-Grunt? February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft "Phobos-Grunt". Of course, this document is released in Russian but Google Translate actually does a very good job with it. And, IEEE Spectrum Magazine reported on the failing as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary and the translated Russian article offers more detail if you are interested in digging deeper.

The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory failure. The computer was part of the newly-developed flight control system that had focused on dropping the mass of the flight control systems from 30 kgs (66 lbs) to 1.5 kgs (3.3 lbs). Less weight in flight control is more weight that can be in payload, so these gains are important. However, this new flight control system was blamed for the delay of the mission by 2 years and the eventual demise of the mission.

The two flight control computers are identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau (Phobos Grunt Design). The official postmortem reports that both computers suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMs are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512k memory by 32 bit wide access, “V” is the improvement mark, “20” for 20ns memory access time, “G24” is the package type, and “M” indicates it is a military grade part.
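The decoding above can be expressed as a small parser. This is a sketch that covers only the fields described in this post; the regular expression, function name, and dictionary keys are my own assumptions and not the vendor’s complete part numbering scheme.

```python
import re

# Pattern covers only the fields described in the text above (an assumption,
# not the manufacturer's full part-numbering specification).
PART_PATTERN = re.compile(
    r"^(?P<maker>W)(?P<family>S)(?P<organization>\d+K\d+)"
    r"(?P<mark>V)(?P<access_ns>\d+)(?P<package>G\d+)(?P<grade>M)$"
)


def decode_part(part: str) -> dict:
    """Split a part number such as WS512K32V20G24M into its described fields."""
    m = PART_PATTERN.match(part)
    if m is None:
        raise ValueError(f"unrecognized part number: {part!r}")
    depth, width = m["organization"].split("K")
    return {
        "manufacturer": "White Electronic Design",
        "type": "SRAM",
        "depth_k_words": int(depth),
        "width_bits": int(width),
        "improvement_mark": m["mark"],
        "access_time_ns": int(m["access_ns"]),
        "package": m["package"],
        "grade": "military",
    }


if __name__ == "__main__":
    for field, value in decode_part("WS512K32V20G24M").items():
        print(f"{field}: {value}")
```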

In the paper "Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices", Joe Benedetto reports that these SRAM packages are very susceptible to “latchup”, a condition which requires power cycling to return to operation and can be permanent in some cases. Steven McClure of NASA Jet Propulsion Laboratory is the leader of the Radiation Effects Group. He reports these SRAM parts would be very unlikely to be approved for use at JPL (Did Bad Memory Chips Down Russia’s Mars Probe).

It is rare that even two failures will lead to disaster and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system is able to stabilize itself in an extreme condition to allow flight control personnel back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation was likely fairly simple: restarting both computers (which probably happened automatically) and resuming the mission would likely have been sufficient.

Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.

I find this last fault fascinating. It seems that smart people could never make such an obvious mistake, and yet this sort of design flaw shows up all the time on large systems. Experts in each vertical area or component do good work. But the interactions across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may not get seen. Good people design good components and yet there often exist obvious fault modes across components that get missed.

Systems complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) very experienced, highly-skilled engineer(s) on the team focusing on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.

The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.

Summarizing some of the lessons from this loss: The SRAM chip probably was a poor choice. The computer systems should restart, scrub memory for faults, and be able to detect corrupt code and reload it from secondary locations before going into safe-mode. Safe-mode has to actually allow mitigating actions to be taken from a ground station or it is useless. Software systems should be constantly scrubbing memory for faults and check-summing the running software for corruption. A tiny amount of processor power spent on continuous, redundant checking and a few more lines of code to implement simple recovery paths when a fault is encountered may have saved the mission. Finally, we all have to remember the old adage “nothing works if it is not tested.” Every major fault has to be tested. Error paths are the ones most commonly left untested, so it is particularly important to focus on them. The general rule is to keep error paths simple, use the fewest possible, and test frequently.

Back in 2007, I wrote up a set of best practices on software design, testing, and operations of high scale systems: On Designing and Deploying Internet-Scale Services. This paper targets large-scale services but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follow from that complexity.

This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that flip bits and corrupt data, and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
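As an illustration of that last point, here is a minimal sketch of a fault-injection test. It assumes a hypothetical storage layer that keeps two copies of each block plus a CRC32; the function names are mine, not taken from any real system. The injector flips one random bit in the primary copy, and the brute-force recovery path simply returns the first copy whose checksum still matches.

```python
import random
import zlib


def checksum(block: bytes) -> int:
    return zlib.crc32(block)


def flip_random_bit(block: bytes, rng: random.Random) -> bytes:
    """Fault injector: return a copy of the block with one random bit inverted."""
    damaged = bytearray(block)
    bit = rng.randrange(len(damaged) * 8)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


def read_with_recovery(primary: bytes, secondary: bytes, expected: int) -> bytes:
    """Brute-force recovery: return the first copy whose checksum matches."""
    for copy in (primary, secondary):
        if checksum(copy) == expected:
            return copy
    raise IOError("all copies are corrupt")


def run_fault_injection(trials: int = 1000, seed: int = 0) -> int:
    """Corrupt the primary copy on every trial and count successful recoveries."""
    rng = random.Random(seed)
    good = bytes(rng.randrange(256) for _ in range(512))
    expected = checksum(good)
    recovered = 0
    for _ in range(trials):
        bad = flip_random_bit(good, rng)
        if read_with_recovery(bad, good, expected) == good:
            recovered += 1
    return recovered


if __name__ == "__main__":
    trials = 1000
    print(f"recovered {run_fault_injection(trials)} of {trials} injected faults")
```

The recovery path is deliberately dumb: the same few lines handle any corruption of the primary copy, which keeps the error path small enough to exercise on every test run.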

To dig deeper in the Phobos-Grunt loss:

James Hamilton

Sunday, February 26, 2012 10:48:54 AM (Pacific Standard Time, UTC-08:00)
 Tuesday, February 21, 2012

In the past, I’ve written about the cost of latency and how reducing latency can drive more customer engagement and increase revenue. Two examples of this are: 1) The Cost of Latency and 2) Economic Incentives applied to Web Latency. Nowhere is latency reduction more valuable than in high frequency trading applications. Because these trades can be incredibly valuable, the cost of the infrastructure on which they trade is more or less an afterthought. Good people at the major trading firms work hard to minimize costs but, if the cost of infrastructure were to double tomorrow, high frequency trading would continue unabated.


High frequency trading is very sensitive to latency and it is nearly insensitive to costs. That makes it an interesting application area and it’s one I watch reasonably closely. It’s a great domain to test ideas that might not yet make economic sense more broadly. Some of these ideas will never see more general use but many ideas get proved out in high frequency trading and can be applied to more cost sensitive application areas once the techniques have been refined or there is more volume.


One suggestion that comes up in jest on nearly every team upon which I have worked is the need to move bits faster than the speed of light. Faster than the speed of light communications would help cloud hosted applications and cloud computing in general but physics blocks progress in this area resolutely.


What if it really were possible to transmit data at roughly 33% faster than the speed of light? It turns out this is actually possible and may even make economic sense in high frequency trading. Before you cancel your RSS feed to this blog, let’s look more deeply at what is being sped up, how much, and why it really is possible to substantially beat today’s optical communication links. 


When you get into the details, every “law” is actually more complex than the simple statement that gets repeated over and over. This is one of the reasons I tell anyone who joins Amazon that the only engineering law around here is there are no unchallengeable laws. It’s all about understanding the details and applying good engineering judgment.


For example, the speed of light is 186,000 miles per second, right? Absolutely. But the fine print is that the speed of light is 186,000 miles/second in a vacuum. The actual speed of light is dependent upon the medium in which the light is propagating. In an optical fiber, the speed of light is actually roughly 33% slower than in a vacuum. More specifically, the index of refraction of most common optical fibers is 1.52. What this means is that the speed of light in a fiber is actually just over 122,000 miles/second.


The index of refraction of light in air is very close to 1, which is to say that the speed of light in air is just about the same as the speed of light in vacuum. This means that free space optics -- the use of light for data communications without a fiber wave guide -- is roughly 50% faster than the speed of light in a fiber. Unfortunately, this only matters over long distances but it’s only practical over short distances. There have been test deployments over metro-area distances – we actually have one where I work – but, generally, it’s a niche technology that hasn’t proven practical and widely applicable. On this approach, I’m not particularly excited.


Continuing this search for low refraction index data communications, we find that microwaves transmitted in air again have a refraction index near 1, which is to say that microwave is around 50% faster than light in a fiber. As before, this is only of interest over longer distances but, unlike free space optics, microwave is very practical over longer distances. On longer runs, it needs to be received and retransmitted periodically but this is practical, cost effective, and fairly heavily used in the telecom industry. What hasn’t been exploited in the past is that microwave is actually faster than the speed of light in a fiber.


The 50% speed-up of Microwave over fiber optics seems exploitable and an enterprising set of entrepreneurs are doing exactly that. This plan was outlined in the Gigaom article from yesterday titled Wall Street gains edge by trading over microwave.


In this approach, McKay Brothers are planning on linking New York City with Chicago using microwave transmission. This is a 790 mile distance but fiber seldom takes the most direct route. Let’s assume a fiber path distance of 850 miles, which will yield 6.9 msec propagation delay if there are no routers or other networking gear in the way. Given that both optical and microwave require repeaters, I’m not including their impact in this analysis. Covering the 790 miles using microwave will require 4.2 msec. Using these data, we would have the microwave link a full 2.7 msec faster. That’s a very substantial time difference and, in the world of high frequency trading, 2.7 msec is very monetizable. In fact, I’ve seen HFT customers extremely excited about very small portions of a msec. Getting 2.7 msec back is potentially a very big deal.
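The propagation delay numbers above follow directly from the refraction indexes. Here is a small sketch of the calculation; the function and constant names are my own, and the refraction index of air is approximated as exactly 1 (the true value is very slightly higher). As in the text, repeaters and networking gear are ignored.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_000   # in a vacuum, rounded as in the text
FIBER_REFRACTION_INDEX = 1.52            # typical optical fiber, per the text
AIR_REFRACTION_INDEX = 1.0               # approximation: air is very close to 1


def one_way_delay_msec(distance_miles: float, refraction_index: float) -> float:
    """Propagation delay only; ignores repeaters, routers, and serialization."""
    speed = SPEED_OF_LIGHT_MILES_PER_SEC / refraction_index
    return distance_miles / speed * 1000


if __name__ == "__main__":
    fiber = one_way_delay_msec(850, FIBER_REFRACTION_INDEX)      # assumed fiber route
    microwave = one_way_delay_msec(790, AIR_REFRACTION_INDEX)    # direct path
    print(f"fiber:     {fiber:.1f} msec")               # about 6.9
    print(f"microwave: {microwave:.1f} msec")           # about 4.2
    print(f"advantage: {fiber - microwave:.1f} msec")   # about 2.7
```

Note that part of the advantage comes from the shorter, more direct path and part from the faster medium; setting both distances to 790 miles isolates the medium’s contribution.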


From the McKay Brothers web site:


Profitability in High Frequency Trading (“HFT”) is about being the first to respond to market events. Events which occur in Chicago markets impact New York markets. The first to learn about this information in New York can take appropriate positions and benefit. There is nothing new in this principle. Paul Reuters, founder of the Reuters news agency, used carrier pigeons to fill a gap in the telegraph lines and bring financial news from Berlin to Paris. The groundbreaking idea of the time was to use an old technology – the carrier pigeon – to fill a gap. What Paul Reuters did 160 years ago is being done again.


Today, we are revisiting an old technology, microwave transmission, to connect Chicago and New York at speeds faster than fiber optic transmission will ever be able to deliver.


This technology is emerging just two years after Spread Networks is reported to have spent 300 million dollars developing a low latency fiber optic connection between Chicago and New York. Spread’s fiber connection will soon be much slower than routes available by microwave.


The Gigaom article is at: The McKay Brothers web site is at: Thanks to Amazon’s Alan Judge for pointing me to this one.




James Hamilton





Tuesday, February 21, 2012 3:41:23 PM (Pacific Standard Time, UTC-08:00)
Hardware | Services
 Saturday, February 11, 2012

Last week I wrote up Studying the Costa Concordia Grounding.  Many folks sent me mail with interesting perspectives. Two were sufficiently interesting that I wanted to repeat them here. The first was from someone who was actually on the ship on that final cruise. The latter is from a professional captain with over 35 years’ experience as a certified Ocean Master.


Experiences From a Costa Concordia Passenger


One of the engineers I work with at Amazon was actually on the Costa Concordia when it grounded. Rory Browne works in the Dublin office and he made an excellent and very detailed presentation on what took place that final trip. He asked me not to post his slides but was OK with me posting my notes from his presentation.


Here are my notes from Rory Browne’s experiences on the final cruise of the Costa Concordia:

·         Boarded the ship at 1400

·         Went to bed at 1700 (long trip from Ireland)

·         Woke at 2140 and started getting dressed

·         Fell towards mirror a few minutes later and the lights went out

·         The next hour:

o   Public address announcement stating that an electrical fault had been experienced but the situation was under control

o   Explored the ship and noticed some “foam or froth on one side of the boat” – thought it might be a maneuvering thruster but, in retrospect, this was likely the side of the boat that had been ripped open by the grounding

o   Noticed the crew had asked restaurant customers to put their dishes on the floor

o   Returned to cabin to get out of the way of the crew

·         Seven whistles were subsequently sounded, indicating abandon ship

o   Proceeded to muster station #4

o   People still blocking stairwells and pushing to get onto lifeboats

o   Lifeboat entrances were very crowded with long lines but I noticed a second lifeboat entrance with only a couple of people in line and was able to get on quickly

o   People on lifeboat didn’t move away from the entrance but it was easy to slip past them to the far corner

o   Estimate that they could easily have fit another 10 on the lifeboat (there were roughly 25 on it)

·         When lifeboat was lowered the roof hit something and the fiberglass roof was bashed in behind my head. Now slightly worried about lifeboat integrity


Observations from a Licensed Master


What follows is one of the more interesting notes I got after blogging the Costa Concordia incident. This one is from a professional captain. He’s given me permission to reprint it here but preferred not to include his name:


The original Letter:

Your January 29th blog discussing the Costa Concordia incident was an excellent presentation, and the links you provided were excellent as well.  Because of your boating experience you have an understanding better than most of what took place with the cruise ship.


Like you, I often look further into incidents and disasters in order to have a better understanding of what actually took place, primarily because I know press releases and news reports of an incident rarely if ever delve into the underlying facts of a case. More often than not the media isn’t interested in much beyond sensationalism. Costa Concordia is the perfect example, as was Fukushima Dai-1. The miracle of Costa Concordia of course was that more lives weren’t lost.


There were two points you made in your discussion that might not be correct.  Notice that I say, “might.”  The first point being your mention that Captain Schettino was “clearly very experienced,” and the second point you made was that Captain Schettino’s ship handling after the initial grounding “appeared excellent.”


Regarding the first point, I’m not sure much is known about the quality and extent of Schettino’s actual hands-on experience at sea. At this point I think about all we can safely assume is that Schettino’s personality and demeanor were well suited to representing the cruise line to the paying passengers. Beyond that, I think we know little. It’s one thing to sit for and obtain a Master’s license, but it’s quite another to have the practical experience to captain a 114,000 GT vessel.


An experienced captain of a vessel, no matter what size, would never approach landfall at night (or even in daylight with good visibility) without repeatedly checking his radar.  An experienced captain would know the maneuvering characteristics of his vessel, the turn radius, the advance and transfer when making a turn, the use and calculation of turn bearings, etc.  On the other hand, I’m not sure at this point we know which officer actually had command of the vessel during the interval leading up to the initial grounding. 


The second point you touched on was that Schettino’s handling of the vessel after the initial grounding “appeared excellent.” It’s well that you included the qualifier “appeared.” I’m not sure we know or will ever know what Schettino’s thinking was after the grounding, so at this point I believe all we can do is speculate about what he was doing based upon the available AIS data. Schettino might have been taking the action a prudent seaman would take, once propulsion power was lost; however, I’m not sure we know yet what effects the wind, the current and the attitude of the vessel were having. Perhaps there wasn’t enough force to overcome these and other outside influences on the maneuverability of the vessel, so perhaps the vessel once it went almost dead in the water was at the mercy of influences outside the control of the captain. Perhaps the vessel was simply lucky to have found itself grounded back on the island.


One of the many things that haven’t been explored fully regarding the Costa Concordia is the vessel’s stability, and in particular the stability after the ingress of the water began when she was initially holed on the port side. At some point in time there will be a computerized animation showing the progressive changes to her stability, the free surface effects, which compartments were impacted by the initial flooding, how the flooding progressed through the vessel, the effects of maintaining or not maintaining water tight integrity in her various compartments, the effects of wind and current, etc. That will be interesting.


I have over 35 years experience on the water and at sea and was a licensed oceans Master, so I have a little understanding of how this ship stuff works.


Again, I want to compliment you on your Costa Concordia blog. You did a super job.

My response:

A super interesting note. I really enjoyed your background points.


One point you raised was whether he had experience at anything beyond essentially being the front man for a 1,500 room hotel. Specifically you said “An experienced captain of a vessel, no matter what size, would never approach landfall at night (or even in daylight with good visibility) without repeatedly checking his radar. An experienced captain would know the maneuvering characteristics of his vessel, the turn radius, the advance and transfer when making a turn, the use and calculation of turn bearings, etc. On the other hand, I’m not sure at this point we know which officer actually had command of the vessel during the interval leading up to the initial grounding.”


It’s hard to not agree with your conclusion. Bringing that large a ship that near the rocks at over 15 kts is incredibly bad judgment. But, that is my point. Very experienced operators sometimes show catastrophically bad judgment, with lapses that are incredibly hard to explain. For example, the captain of the Washington State Ferry Elwha went on a 15 mile unauthorized pleasure cruise that ended in grounding. The captain of the Valdez was drunk, not at the helm, and trusting his 3rd mate to take the ship through the most dangerous part of their entire trip. I have been to Bligh Reef in Prince William Sound and it’s a LOOONG way from the shipping lanes. Even the 3rd mate had too much experience to have put the boat there. There are many, many stories of operators “buzzing the tower” even though they have experience and should absolutely know better.


My conclusion is that experience is not a cure. Perhaps bad judgment isn’t expressed frequently enough to get filtered out before the person has a significant command. Or perhaps the bad judgment actually comes from the over-confidence that experience can bring.


I’m not debating your point that it was crazy to head for the rocks at 15kts but I am arguing that very experienced people really do make some incredibly bad judgments.  


Your point on boat handling is well taken. It’s not possible to establish whether the captain made good decisions after his one catastrophically bad one. The helm orders appear correct for the conditions. The use of the thruster seemed to work. But, some have speculated the ship would have been better out in the channel so it could launch life rafts (they are speculating that it wouldn’t have developed the significant list so quickly). And, you are right, current conditions and other factors may have put the ship where it landed, with commands from the Captain not being the dominant influence. Certainly all possible.


My conclusion in the article was “pilot error” and my main point is that experience is either not a solution or perhaps it was a contributor to what was very poor judgment that led to loss of life.


Thanks for your observations from experience with commercial vessels.


James Hamilton





Saturday, February 11, 2012 3:36:19 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

 Sunday, January 29, 2012

Don't be a show-off. Never be too proud to turn back. There are old pilots and bold pilots, but no old, bold pilots.


I first heard the latter part of this famous quote made by US Airmail Pilot E. Hamilton Lee back when I raced cars. At that time, one of the better drivers in town, Gordon Monroe, used a variant of that quote (with pilots replaced by racers) when giving me driving advice. Gord’s basic message was that it is impossible to win a race if you crash out of it.


Nearly all of us have taken the odd chance and made some decisions that, in retrospect, just didn’t make sense from a risk vs reward perspective. Age and experience clearly help but mistakes still get made and none of us are exempt. Most people’s mistakes at work don’t have life safety consequences and their mistakes are not typically picked up widely by the world news services as was the case in the recent grounding of the Costa Concordia cruise ship. But, we all make mistakes.


I often study engineering disasters and accidents in the belief that understanding mistakes, failures, and accidents deeply is a much lower cost way of learning.  My last note on this topic was What Went Wrong at Fukushima Dai-1 where we looked at the nuclear release following the 2011 Tohoku Earthquake and Tsunami.


Living on a boat and cruising extensively (our boat blog is at makes me particularly interested in the Costa Concordia incident of January 13th 2012. The Concordia is a 114,137 gross ton floating city that cost $570m when it was delivered in 2006. It is 952’ long, has 17 decks, and is powered by 6 Wartsila diesel engines with a combined output of 101,400 horsepower. The ship is capable of 23 kts (26.5 mph) and has a service speed of 21 kts. At capacity, it carries 3,780 passengers with a crew of 1,100.




The Italian cruise ship Costa Concordia partially sank on Friday the 13th of January 2012 after hitting a reef off the Italian coast and running aground at Isola del Giglio, Tuscany, requiring the evacuation of 4,197 people on board. At least 16 people died, including 15 passengers and one crewman; 64 others were injured (three seriously) and 17 are missing. Two passengers and a crewmember trapped below deck were rescued.


The captain, Francesco Schettino, had deviated from the ship's computer-programmed route in order to treat people on Giglio Island to the spectacle of a close sail-past. He was later arrested on preliminary charges of multiple manslaughter, failure to assist passengers in need and abandonment of ship. First Officer Ciro Ambrosio was also arrested.


It is far too early to know exactly what happened on the Costa Concordia and, because there was loss of life and considerable property damage, the legal proceedings will almost certainly run for years. Unfortunately, rather than illuminating the mistakes and failures and helping us avoid them in the future, these proceedings typically focus on culpability and distributing blame. That’s not our interest here. I’m mostly focused on what happened and getting all the data I could find on the table to see what lessons the situation yields.


A fellow boater, Milt Baker, pointed me towards an excellent video that offers considerable data on exactly what happened in the final 1 hour and 30 min. You can find the video at: Grounding of Costa Concordia. Another interesting data source is the video commentary available at: John Konrad Narrates the Final Maneuvers of the Costa Concordia. In what follows, I’ve combined snapshots of the first video with data available from other sources including the second video.


The source data for the two videos above is a wonderful safety system called the Automatic Identification System (AIS). AIS is a safety system required on larger commercial craft and also used on many recreational boats. AIS works by frequently transmitting (up to every 2 seconds for fast moving ships) via VHF radio the ship’s GPS position, course, speed, name, and other pertinent navigational data. Receiving stations on other ships automatically plot transmitting AIS targets on electronic charts. Some receiving systems are also able to plot an expected target course and compute the time and location of the estimated closest point of approach. AIS is an excellent tool to help reduce the frequency of ship-to-ship collisions.
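The closest point of approach calculation can be sketched in a few lines. This is only an illustration under simplifying assumptions: positions are given as (east, north) offsets in nautical miles on a flat local plane rather than latitude and longitude, both vessels are assumed to hold course and speed, and the function names are made up for this example rather than taken from any AIS product.

```python
import math

def velocity(course_deg, speed_kts):
    """Convert course (degrees true, clockwise from north) and speed in knots
    into (east, north) velocity components in knots."""
    rad = math.radians(course_deg)
    return speed_kts * math.sin(rad), speed_kts * math.cos(rad)

def closest_point_of_approach(own_pos, own_course, own_speed,
                              tgt_pos, tgt_course, tgt_speed):
    """Return (hours_until_cpa, range_at_cpa_nm).

    Assumes a flat local plane and constant course and speed for both vessels.
    Positions are (east, north) in nautical miles (an assumption of this sketch).
    """
    ovx, ovy = velocity(own_course, own_speed)
    tvx, tvy = velocity(tgt_course, tgt_speed)
    # Position and velocity of the target relative to own ship.
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    vx, vy = tvx - ovx, tvy - ovy
    rel_speed_sq = vx * vx + vy * vy
    if rel_speed_sq == 0.0:
        # Identical velocities: the range never changes.
        return 0.0, math.hypot(rx, ry)
    # Time at which the range is smallest; clamp to zero if that moment has passed.
    t = max(0.0, -(rx * vx + ry * vy) / rel_speed_sq)
    return t, math.hypot(rx + vx * t, ry + vy * t)

# Two ships 10 nm apart, head-on, each making 10 kts.
hours, range_nm = closest_point_of_approach((0.0, 0.0), 0.0, 10.0,
                                            (0.0, 10.0), 180.0, 10.0)
```

A real plotter works from the latitude and longitude in the AIS messages and has to cope with stale or missing reports, so treat this as the core geometry only.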


Since AIS data is broadcast over VHF radio, it is widely available to both ships and land stations and this data can be used in many ways. For example, if you are interested in the boats in Seattle’s Elliott Bay, have a look at and enter “Seattle” as the port in the data entry box near the top left corner of the screen (you might see our boat Dirona there as well).


AIS data is often archived and, because of that, we have a very precise record of the Costa Concordia’s course as well as core navigational data as it proceeded towards the rocks. In the pictures that follow, the red images of the ship are at the ship’s position as transmitted by the Costa Concordia’s AIS system. The black line between these images is the interpolated course between these known locations. The video itself (Costa Concordia Interpolated.wmv) uses a roughly 5:1 time compression.
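As a rough illustration of what "interpolated course" means here, the sketch below linearly interpolates a position between two timestamped fixes and converts real elapsed time to video time at the roughly 5:1 compression mentioned above. The tuple layout, function names, and sample coordinates are assumptions made for this example; the video's actual interpolation method is not described in the source.

```python
def interpolate_position(fix_a, fix_b, t):
    """Linearly interpolate (lat, lon) at time t between two fixes.

    Each fix is (time_seconds, lat_degrees, lon_degrees). Straight-line
    interpolation in degrees is a reasonable approximation only over the
    short gaps between consecutive AIS reports.
    """
    ta, lat_a, lon_a = fix_a
    tb, lat_b, lon_b = fix_b
    if ta == tb or not (ta <= t <= tb):
        raise ValueError("t must lie between two distinct fix times")
    fraction = (t - ta) / (tb - ta)
    return (lat_a + fraction * (lat_b - lat_a),
            lon_a + fraction * (lon_b - lon_a))

def video_seconds(real_seconds, compression=5.0):
    """Playback time for a span of real time at the given time compression."""
    return real_seconds / compression

# Hypothetical fixes 10 seconds apart; the midpoint is halfway between them.
midpoint = interpolate_position((0.0, 42.0, 10.0), (10.0, 42.5, 11.0), 5.0)
```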

In this screen shot, you can see the Concordia already very close to the Italian Isola del Giglio. From the BBC report the Captain has said he turned too late (Costa Concordia: Captain Schettino ‘Turned Too Late’). From that article:


According to the leaked transcript quoted by Italian media, Capt Schettino said the route of the Costa Concordia on the first day of its Mediterranean cruise had been decided as it left the port of Civitavecchia, near Rome, on Friday.


The captain reportedly told the investigating judge in the city of Grosseto that he had decided to sail close to Giglio to salute a former captain who had a home on the Tuscan island. "I was navigating by sight because I knew the depths well and I had done this maneuver three or four times," he reportedly said.


"But this time I ordered the turn too late and I ended up in water that was too shallow. I don't know why it happened."


In this screen shot of the boat at 20:44:47 just prior to the grounding, you can see the boat turned to 348.8 degrees but the massive 114,137 gross ton vessel is essentially plowing sideways through the water on a course of 332.7 degrees. The Captain can and has turned the ship with the rudder but, at 15.6 kts, it does not follow the exact course steered, with inertia tending to widen and straighten the intended turn.
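The gap between where the bow points (heading) and where the hull is actually travelling (course) can be computed directly from the AIS figures. The small helper below is a sketch with a made-up function name; the only subtlety is wrapping the difference correctly when the two angles sit on opposite sides of north, as they do in the 8.9 degree heading and 356.2 degree course reported just after impact.

```python
def heading_minus_course(heading_deg, course_deg):
    """Signed difference heading - course, wrapped into [-180, 180).

    Positive means the bow points to starboard (clockwise) of the track.
    """
    return (heading_deg - course_deg + 180.0) % 360.0 - 180.0

before_impact = heading_minus_course(348.8, 332.7)  # about 16.1 degrees
after_impact = heading_minus_course(8.9, 356.2)     # about 12.7 degrees
```

In both cases the bow is well to starboard of the direction the ship is actually moving, which is the sideways slide described above.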


Given the speed of the boat and nearness of shore at this point, the die is cast and the ship is going to hit ground.


This screen shot was taken just past the point of impact. You will note that the ship has slowed to 14.0 kts. You might also notice the Captain is turning aggressively to starboard. He has the ship turned to an 8.9 degree heading whereas the ship’s actual course lags behind at 356.2 degrees.


This screen shot is only 44 seconds after the previous one but the boat has already slowed from 14.0 kts to 8.1 and is still slowing quickly.  Some of the slowing will have come from the grounding itself but passengers report that they heard the boat hard astern after the grounding.


You can also see the captain has swung the helm over from the starboard course he was steering while trying to avoid the rocks to a port course now that he has struck them. This is almost certainly in an effort to minimize damage. What makes this (possibly counter-intuitive) decision a good one is that the ship’s pivot point is approximately 1/3 of the way back from the bow so turning to port (towards the shore) will actually cause the stern to rotate away from the rocks they just struck.


The ship decelerated quickly to just under 6.0 knots but, in the two minutes prior to this screen shot, it has only slowed a further 0.9 kts down to 5.1. There were reports of a loss of power on the Concordia. Likely what happened is that the ship was hard astern taking off speed until a couple of minutes prior to this screen shot when water intrusion caused a power failure. The ship is diesel electric and likely lost power to its main prop due to rapid water ingress.


At 5 kts and very likely without main engine power, the Concordia is still going much too quickly to risk running into the mud and sand shore so the Captain now turns hard away from shore and he is heading back out into the open channel.


With the helm hard over to starboard and the likely assistance of the bow thrusters, the ship is turning hard, which is pulling speed off fairly quickly. It is now down to 3.0 kts and it continues to slow.


The Concordia is now down to 1.6 kts and the Captain is clearly using the bow thrusters heavily as the bow continues to rotate quickly. He has now turned to a 41 degree heading.


It now has been just over 29 min since the ship first struck the rocks. It has essentially stopped and the bow is being brought all the way back round using bow thrusters in an effort to drive the ship back in towards shore presumably because the Captain believes it is at risk of sinking so he is seeking shallow water.


The Captain continues to force the Concordia to shore under bow thruster power. In this video narrative (John Konrad Narrates the Final Maneuvers of the Costa Concordia), the commentator reported that the bow thrusters and the prevailing currents were being used in combination by the Captain to drive the boat into shore.


A further 11 min and 22 seconds have passed and the ship has now accelerated back up to 0.9 kts, now heading towards shore.


It has been more than an hour and 11 minutes since the original contact with the rocks and the Costa Concordia is now at rest in its final grounding point.


The Coast Guard transcript of the radio communications with the Captain is at Costa Concordia Transcript: Coastguard Orders Captain to return to Stricken Ship. In the following text De Falco is the Coast Guard Commander and Schettino is the Captain of the Costa Concordia:


De Falco: "This is De Falco speaking from Livorno. Am I speaking with the commander?"

Schettino: "Yes. Good evening, Cmdr De Falco."

De Falco: "Please tell me your name."

Schettino: "I'm Cmdr Schettino, commander."

De Falco: "Schettino? Listen Schettino. There are people trapped on board. Now you go with your boat under the prow on the starboard side. There is a pilot ladder. You will climb that ladder and go on board. You go on board and then you will tell me how many people there are. Is that clear? I'm recording this conversation, Cmdr Schettino …"

Schettino: "Commander, let me tell you one thing …"

De Falco: "Speak up! Put your hand in front of the microphone and speak more loudly, is that clear?"

Schettino: "In this moment, the boat is tipping …"

De Falco: "I understand that, listen, there are people that are coming down the pilot ladder of the prow. You go up that pilot ladder, get on that ship and tell me how many people are still on board. And what they need. Is that clear? You need to tell me if there are children, women or people in need of assistance. And tell me the exact number of each of these categories. Is that clear? Listen Schettino, that you saved yourself from the sea, but I am going to … really do something bad to you … I am going to make you pay for this. Go on board, (expletive)!"

Schettino: "Commander, please …"

De Falco: "No, please. You now get up and go on board. They are telling me that on board there are still …"

Schettino: "I am here with the rescue boats, I am here, I am not going anywhere, I am here …"

De Falco: "What are you doing, commander?"

Schettino: "I am here to co-ordinate the rescue …"

De Falco: "What are you co-ordinating there? Go on board! Co-ordinate the rescue from aboard the ship. Are you refusing?"

Schettino: "No, I am not refusing."

De Falco: "Are you refusing to go aboard, commander? Can you tell me the reason why you are not going?"

Schettino: "I am not going because the other lifeboat is stopped."

De Falco: "You go aboard. It is an order. Don't make any more excuses. You have declared 'abandon ship'. Now I am in charge. You go on board! Is that clear? Do you hear me? Go, and call me when you are aboard. My air rescue crew is there."

Schettino: "Where are your rescuers?"

De Falco: "My air rescue is on the prow. Go. There are already bodies, Schettino."

Schettino: "How many bodies are there?"

De Falco: "I don't know. I have heard of one. You are the one who has to tell me how many there are. Christ!"

Schettino: "But do you realize it is dark and here we can't see anything …"

De Falco: "And so what? You want to go home, Schettino? It is dark and you want to go home? Get on that prow of the boat using the pilot ladder and tell me what can be done, how many people there are and what their needs are. Now!"

Schettino: "… I am with my second in command."

De Falco: "So both of you go up then … You and your second go on board now. Is that clear?"

Schettino: "Commander, I want to go on board, but it is simply that the other boat here … there are other rescuers. It has stopped and is waiting …"

De Falco: "It has been an hour that you have been telling me the same thing. Now, go on board. Go on board! And then tell me immediately how many people there are there."

Schettino: "OK, commander."

De Falco: "Go, immediately!"


At least 16 died in the accident and 17 were still missing when this was written (Costa Concordia Disaster). The Captain of the Costa Concordia, Francesco Schettino, has been charged with manslaughter and abandoning ship.


At the time of the grounding, the ship was carrying 2,200 metric tons of heavy fuel oil and 185 metric tons of diesel, and environmental risk remains (Costa Concordia Salvage Experts Ready to Begin Pumping Fuel from Capsized Cruise Ship Off Coast of Italy). The 170 year old salvage firm Smit Salvage will be leading the operation.


All situations are complex and few disasters have only a single cause. However, the facts as presented so far point pretty strongly towards pilot error as the primary contributor in this event.  The Captain is clearly very experienced and his ship handling after the original grounding appears excellent. But, it’s hard to explain why the ship was that close to the rocks, the captain has reported that he turned too late, and public reports have him on the phone at or near the time of the original grounding.


What I take away from the data points presented here is that experience, ironically,  can be our biggest enemy. As we get increasingly proficient at a task, we often stop paying as much attention. And, with less dedicated focus on a task, over time, we run the risk of a crucial mistake that we probably wouldn’t have made when we were effectively less experienced and perhaps less skilled. There is danger in becoming comfortable.


The videos referenced in the above can be found at:

·         Grounding of Costa Concordia Interpolated

·         gCaptain’s John Konrad Narrates the Final Maneuvers of the Costa Concordia


If you are interested in reading more:













James Hamilton





Sunday, January 29, 2012 11:24:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
 Thursday, January 26, 2012

Ordinarily I focus this blog on areas of computing where I spend most of my time from high performance computing to database internals and cloud computing. An area that interests me greatly  but I’ve seldom written about is entrepreneurship and startups.


One of the Seattle area startups with which I stay in touch is Socrata. They are focused on enabling federal, state, and local governments to improve the reach, usability and social utility of their public information assets.  Essentially, they make public information available and useful to constituents. They are used by: the World Bank, the United Nations, the World Economic Forum, the US Data.Gov, Health & Human Services, Centers for Disease Control, several major cities including NYC, Seattle, Chicago, San Francisco and Austin, and many county and state governments. Even foreign governments like Kenya’s have adopted Socrata.


I first met Kevin Merritt, the founder and CEO of Socrata, back in 2005 when I was doing technical diligence for the Microsoft acquisition of the LA-based Frontbridge Technologies. I love doing diligence on startups because it’s an opportunity to dive in and spend a day or more digging deeply and understanding what smart people have produced, where things worked really well, and areas where things didn’t pan out as well as they could have. I’ve learned a lot in these roles and I’m  lucky to have been able to do many of them first at IBM, later at Microsoft, and now at Amazon.


What made this one a bit different is I got a call shortly after the deal closed asking if I wanted to be the General Manager of the Microsoft subsidiary that was formed in the acquisition. An opportunity to run a mid-sized business in its entirety. Development, test, operations, and customer support. Absolutely! I’ve never learned so much as I did in the first year or so at what would become Microsoft Exchange Hosted Services.


It was a great experience and I’ve been 100% focused on cloud services since that time. And, as a consequence of leading Frontbridge, I got to know Kevin Merritt well. He is an excellent strategic thinker and an even better operator. Whenever Kevin was involved, customers were happy and the service was rapidly improving and expanding.  Kevin eventually left to form Socrata and he and I have stayed in touch since then. He knows I’m a sucker for a beer and some wings :-).


Based in Seattle, Socrata is venture-backed with a small and talented engineering team.  They are enjoying strong customer demand and their market success is fueling growth in the engineering team. They are currently looking for a CTO and, if I didn’t already have one of the best jobs out there, I would seriously consider joining Kevin and the team.  If you are a technology leader interested in big data, cloud computing, architecture of distributed systems, ops automation, and the user experience of making data easy to find and use, you should send Kevin, their founder and CEO, a note at




James Hamilton




Thursday, January 26, 2012 5:13:24 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, January 18, 2012

Finally! I’ve been dying to talk about DynamoDB since work began on this scalable, low-latency, high-performance NoSQL service at AWS. This morning, AWS announced availability of DynamoDB: Amazon Web Services Launches Amazon DynamoDB – A New NoSQL Database Service Designed for the Scale of the Internet.


In a past blog entry, One Size Does Not Fit All, I offered a taxonomy of 4 different types of structured storage system, argued that Relational Database Management Systems are not sufficient, and walked through some of the reasons why NoSQL databases have emerged and continue to grow market share quickly. The four database categories I introduced were: 1) features-first, 2) scale-first, 3) simple structure storage, and 4) purpose-optimized stores. RDBMS own the first category.


DynamoDB targets workloads fitting into the Scale-First and Simple Structured storage categories where NoSQL database systems have been so popular over the last few years.  Looking at these two categories in more detail, Scale-First is:


Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.


And, Simple Structured Storage:


There are many applications that have a structured storage requirement but they really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key value store. A file system or BLOB-store is not sufficiently rich in that simple query and index access is needed but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.


More detail at: One Size Does Not Fit All.


The DynamoDB service is a unified purpose-built hardware platform and software offering. The hardware is based upon a custom server design using Flash Storage spread over a scalable high speed network joining multiple data centers.


DynamoDB supports a provisioned throughput model. A DynamoDB application programmer decides the number of database requests per second their application should be capable of supporting and DynamoDB automatically spreads the table over an appropriate number of servers. At the same time, it also reserves the required network, server, and flash memory capacity to ensure that request rate can be reliably delivered day and  night, week after week, and year after year.  There is no need to worry about a neighboring application getting busy or running wild and taking all the needed resources. They are reserved and there whenever needed.
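To make the idea of spreading a table over "an appropriate number of servers" concrete, here is a minimal sketch. The per-partition capacity figure, the function names, and the use of a hash of the key to pick a partition are all assumptions for illustration; the post does not describe how DynamoDB sizes or places partitions internally.

```python
import hashlib

def partitions_needed(provisioned_rps, per_partition_rps):
    """Smallest number of partitions whose combined capacity covers the
    provisioned request rate (integer ceiling division)."""
    if provisioned_rps <= 0 or per_partition_rps <= 0:
        raise ValueError("rates must be positive")
    return -(-provisioned_rps // per_partition_rps)

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash so the same key always
    lands on the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Hypothetical numbers: 10,000 requests/sec provisioned, and an assumed
# 1,000 requests/sec that one partition can reliably deliver.
count = partitions_needed(10_000, 1_000)
target = partition_for("user#42", count)
```

Reserving capacity per partition, rather than sharing it on a best effort basis, is what keeps a busy neighbor from eating into the rate an application has provisioned.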


The sharding techniques needed to achieve high request rates are well understood industry-wide but implementing them does take some work. Reliably reserving capacity so it is always there when you need it takes yet more work.  Supporting the ability to allocate more resources, or even fewer, while online and without disturbing the current request rate takes still more work. DynamoDB makes all this easy. It supports online scaling from very low transaction rates up to applications requiring millions of requests per second. No downtime and no disturbance to the currently configured application request rate while resharding. These changes are made online simply by changing the DynamoDB provisioned request rate up and down through an API call.
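As a sketch of what that API call can look like from Python today, the helper below builds the parameters for the `update_table` operation of the boto3 DynamoDB client, which takes a `TableName` and a `ProvisionedThroughput` map of read and write capacity units. boto3 postdates this post, and the table name and capacity numbers are invented for the example; the helper itself only builds a dictionary, so it runs without AWS credentials.

```python
def throughput_update_request(table_name, read_units, write_units):
    """Build keyword arguments for a provisioned throughput change.

    The result is shaped for boto3's DynamoDB client `update_table` call.
    """
    if read_units < 1 or write_units < 1:
        raise ValueError("capacity units must be at least 1")
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

# "Orders" is a hypothetical table name.
request = throughput_update_request("Orders", 500, 200)

# With boto3 installed and credentials configured, the change would be sent as:
#   import boto3
#   boto3.client("dynamodb").update_table(**request)
```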


In addition to supporting transparent, on-line scaling of provisioned request rates up and down over 6+ orders of magnitude with resource reservation, DynamoDB is also both consistent and multi-datacenter redundant. Eventual consistency is a fine programming model for some applications but it can yield confusing results under some circumstances. For example, if you set a value to 3 and then later set it to 4, then read it back, 3 can be returned. Worse, the value could be set to 4, verified to be 4 by reading it, and yet 3 could be returned later. It’s a tough programming model for some applications and it tends to be overused in an effort to achieve low-latency and high throughput.  DynamoDB avoids forcing this by supporting low-latency and high throughput while offering full consistency. It also offers eventual consistency at lower request cost for those applications that run well with that model. Both consistency models are supported.
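The "set it to 4, read back 3" behavior is easy to reproduce with a toy model. The class below is a deliberately simplified illustration, not a description of how DynamoDB replicates data: writes land on a primary copy, replicas only catch up when `replicate()` is called, and a read is either served by the primary (consistent) or by a possibly stale replica (eventually consistent). All names are made up for this sketch.

```python
class ReplicatedValue:
    """Toy single-value store with one primary and lagging replicas."""

    def __init__(self, replicas=2):
        self.primary = None
        self.replicas = [None] * replicas

    def write(self, value):
        # The write is acknowledged as soon as the primary has it;
        # replicas are not updated yet.
        self.primary = value

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        self.replicas = [self.primary] * len(self.replicas)

    def read(self, consistent=False, replica=0):
        # Consistent reads go to the primary; eventually consistent
        # reads may be served by a replica that has not caught up.
        return self.primary if consistent else self.replicas[replica]

store = ReplicatedValue()
store.write(3)
store.replicate()
store.write(4)                             # replicas still hold 3
verified = store.read(consistent=True)     # sees the latest write
stale = store.read()                       # may still see the old value
```

In this model the consistent read costs a trip to the primary every time, which mirrors the trade described above: the eventually consistent read is cheaper but can lag.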


It is not unusual for a NoSQL store to be able to support high transaction rates. What is somewhat unusual is to be able to scale the provisioned rate up and down while on-line. Achieving that while, at the same time, maintaining synchronous, multi-datacenter redundancy is where I start to get excited.


Clearly nobody wants to run the risk of losing data but NoSQL systems are scale-first by definition. If the only way to high throughput and scale is to run risk and not commit the data to persistent storage at commit time, that is exactly what is often done. This is where  DynamoDB really shines. When data is sent to DynamoDB, it is committed to persistent and reliable storage before the request is acknowledged. Again this is easy to do but doing it with average low single digit millisecond latencies is both harder and requires better hardware. Hard disk drives can’t do it and in-memory systems are not persistent so flash memory is the most cost effective solution.
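"Commit before acknowledge" on a single machine can be sketched as below: the record is appended, flushed from the process buffer, and pushed to the storage device with `os.fsync` before the function returns its acknowledgement. This is a generic single-node illustration with made-up names, not DynamoDB's commit path, and it says nothing about device write caches or replication.

```python
import os
import tempfile

def durable_append(path, record):
    """Append record (bytes) to a log file and return True only after the
    operating system reports the data has been written to the device."""
    with open(path, "ab") as log:
        log.write(record)
        log.flush()              # move data from Python's buffer to the OS
        os.fsync(log.fileno())   # ask the OS to push it to stable storage
    return True                  # the acknowledgement comes last

# Small demonstration using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "commit.log")
    acked = durable_append(log_path, b"key=42\n")
    with open(log_path, "rb") as log:
        stored = log.read()
```

The wait inside `os.fsync` is where storage latency shows up, which is why the choice of media matters so much for acknowledged-write latency.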


But what if the server to which the data was committed fails, or the storage fails, or the datacenter is destroyed? On most NoSQL systems you would lose your most recent changes.  On the better implementations, the data might be saved but could be offline and unavailable. With DynamoDB, if data is committed just as the entire datacenter burns to the ground, the data is safe, and the application can continue to run without negative impact at exactly the same provisioned throughput rate. The loss of an entire datacenter isn’t even inconvenient (unless you work at Amazon :-)) and has no impact on your running application performance.


Combining rock solid synchronous, multi-datacenter redundancy with average latency in the single digit milliseconds, and throughput scaling to the millions of requests per second is both an excellent engineering challenge and one often not achieved.


More information on DynamoDB:

·         Press Release:

·         DynamoDB detail Page:

·         DynamoDB Developer Guide:

·         Blog entries:

o     Werner:

o    Jeff Barr:

·         DynamoDB Frequently Asked Questions:

·         DynamoDB Pricing:

·         GigaOM:

·         eWeek:

·         Seattle Times:


Relational systems remain an excellent solution for applications requiring Feature-First structured storage. AWS Relational Database Service supports both the MySQL and Oracle relational database management systems:


Just as I was blown away when I saw it possible to create the world’s 42nd most powerful super computer with a few API calls to AWS (42: the Answer to the Ultimate Question of Life, the Universe and Everything), it is truly cool to see a couple of API calls to DynamoDB be all that it takes to get a scalable, consistent, low-latency, multi-datacenter redundant, NoSQL service configured, operational and online.




James Hamilton





Wednesday, January 18, 2012 1:00:06 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, January 16, 2012

Occasionally I come across a noteworthy datacenter design that is worth covering. Late last year a very interesting Japanese facility was brought to my attention by Mikio Uzawa, an IT consultant who authors the Agile Cat blog. I know Mikio because he occasionally translates Perspectives articles for publication in Japan.


Mikio pointed me to the Ishikari Datacenter in Ishikari City, Hokkaido, Japan. Phase I of this facility was just completed in November 2011. This facility is interesting for a variety of reasons but the design features I found most interesting are: 1) High voltage direct current power distribution, 2) whole building ductless cooling, and 3) aggressive free air cooling.


High Voltage Direct Current Power Distribution

I first came across the use of direct current when Annabel Pratt took me through the joint work Intel was doing with Lawrence Berkeley National Lab on datacenter HVDC distribution (Evaluation of Direct Current Distribution in Data Centers to Improve Energy Efficiency). In this approach they distribute 400V direct current rather than the more conventional 208V to 240V alternating current used in most facilities today.


High voltage direct current work in datacenters has been under way for about a decade and it is in extensive test at many facilities world-wide.  Many companies are 100% focused on HVDC design consulting with Validus being one of the better known.


The savings potential of HVDC is often shown to be very exciting with numbers beyond 30% frequently quoted. But the marketing material I’ve gone through in detail compares excellent HVDC designs with very poor AC designs. Predictably the savings are around 30%. Unfortunately, the difference between good AC and bad AC designs is also around 30% :-).


When I look closely at HVDC distribution, I see slight improvements in efficiency at around 3 to 5%, somewhat higher costs of equipment since it is less broadly used, less equipment availability and longer delivery times, and somewhat more complex jurisdictional issues with permitting and other approvals taking longer in some regions. Nonetheless, the picture continues to improve, the industry as a whole continues to learn, and I think there is a good chance that high voltage DC distribution will end up becoming a more common choice in modern datacenters.
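The baseline effect is easy to see with a little arithmetic. For a fixed IT load, input power is the load divided by distribution efficiency, so the fractional saving from moving to a better design is 1 - (baseline efficiency / improved efficiency). The efficiency values below are purely illustrative assumptions chosen to land in the ranges discussed above; they are not measurements from any facility or vendor.

```python
def input_power_savings(baseline_eff, improved_eff):
    """Fractional reduction in input power for the same IT load.

    Input power = load / efficiency, so the saving relative to the
    baseline is 1 - baseline_eff / improved_eff.
    """
    if not (0 < baseline_eff <= 1 and 0 < improved_eff <= 1):
        raise ValueError("efficiencies must be in (0, 1]")
    return 1.0 - baseline_eff / improved_eff

# Illustrative, assumed end-to-end distribution efficiencies.
HVDC_GOOD = 0.92
AC_GOOD = 0.88
AC_POOR = 0.64

vs_poor_ac = input_power_savings(AC_POOR, HVDC_GOOD)       # roughly 30%
vs_good_ac = input_power_savings(AC_GOOD, HVDC_GOOD)       # roughly 4%
good_vs_poor_ac = input_power_savings(AC_POOR, AC_GOOD)    # roughly 27%
```

With these assumed numbers, most of the headline saving comes from the weak baseline rather than from the switch to DC.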


The Ishikari facility is a high voltage DC distribution design. I’m looking forward to learning more about this aspect of the facility and watching how the system performs.


Whole Building Ductless Cooling

Air handling ducts cost money and restrict flow so why not recognize that the entire purpose of a datacenter shell is to keep the equipment dry and secure and to transport heat? Instead of installing extensive duct work, just treat the entire building as a very large air duct.


Perhaps the nicest mechanical design I’ve come across based upon ductless cooling is the Facebook Prineville facility. In this design, they use the entire second floor of the building for air handling and the lower floor for the server rooms.

The Ishikari design shares many design aspects with the Intel Jones Farms facility where the IT equipment is on the second floor and the electrical equipment is on the first.


Aggressive Free-Air Cooling

Looking at the air flow diagram above, you can see that the Ishikari Datacenter is making good use of the datacenter friendly climate of Japan and aggressively using free-air cooling. Free-air cooling, often called air side economization, is one of the most effective ways of driving down datacenter costs and substantially increasing overall efficiency. It’s good to see this design point spreading rapidly.


More information is available at:


Some datacenter designs I’ve covered in the past:

·         Facebook Prineville Mechanical Design

·         Facebook Prineville UPS & Power Supply

·         Example of Efficient Mechanical Design

·         46MW with Water Cooling at a PUE of 1.10

·         Yahoo! Compute Coop Design

·         Microsoft Gen 4 Modular Data Centers



James Hamilton





Monday, January 16, 2012 10:00:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, January 02, 2012

Years ago, Dave Patterson remarked that most server innovations were coming from the mobile device world. He’s right. Commodity system innovation is driven by volume and nowhere is there more volume than in the mobile device world.  The power management techniques applied fairly successfully over the last 5 years had their genesis in the mobile world.  And, as processor power efficiency improves, memory is on track to become the biggest power consumer in the data center. I expect the ideas to rein in memory power consumption will again come from the mobile device world. Just as Eskimos are reported (apparently incorrectly) to have 7 words for snow, mobile memory systems have a large array of low power states with subtly different power dissipations and recovery times. I expect the same techniques will arrive fairly quickly in the server world.


ARM processors are used extensively in cell phones and embedded devices. I’ve written frequently of the possible impact of ARM on the server-side computing world.

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

·         Very Low-Cost, Low-Power Servers

·         NVIDIA Project Denver: ARM Powered Servers


ARM processors remain power efficient while rapidly gaining the performance and features needed to run demanding server-side workloads. A key next step was made late last year when ARM announced the ARM V8 architecture. Key attributes of the new ARM architecture are:

·         64 bit virtual addressing

·         40 bit physical addresses

·         HW virtualization support

The first implementation of the ARM V8 architecture was announced the same day by Applied Micro Circuits Corporation (AppliedMicro, or APM). The APM design is available in an FPGA implementation for development work this month and is expected to be in final system-on-a-chip form in 2H2012. The APM X-Gene offers:


·         64-bit addressing

·         3 GHz

·         Up to 128 cores

·         Super-scalar, quad issue processor

·         CPU and I/O virtualization support

·         Out of order processing

·         80 GB/sec memory throughput

·         Integrated Ethernet and PCIe

·         Full LAMP software stack port


APM X-Gene announcement:

·         Press Release: AppliedMicro Showcases World’s First 64-bit ARM v8 Core

·         Slides: Applied Micro Announces X-Gene


More ARM and low power servers reading:

·         ARM V8 Press Release:

·         AnandTech:

·         Ars technica:

·         CIDR Paper on low power computing:

·         The Case for Energy Proportional Computing:

·         ARM V8 Architecture:


In the second half of 2012, we will have a very capable, 64-bit, server-targeted ARM processor implementation available to systems builders.




James Hamilton





Monday, January 02, 2012 9:21:02 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, December 14, 2011

If you work in the database world, you already know Phil Bernstein. He’s the author of Principles of Transaction Processing and has a long track record as a successful and prolific database researcher. Past readers of this blog may remember Phil’s guest blog posting on Google Megastore. Over the past few years, Phil has been working on an innovative NoSQL system based upon flash storage. I like the work because it pushes the limit of what can be done on a single server with transaction rates approaching 400,000 per second, leverages the characteristics of flash storage in a thought-provoking way, and employs interesting techniques such as log-only storage.


Phil presented Hyder at the Amazon ECS series a couple of weeks back (a past ECS presentation is covered in High Availability for Cloud Computing Database Systems).


In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time. Each transaction posts its after-image to the shared log. One core does OCC and rolls forward the log. The database is a binary search tree serialized into the log (a B-tree would work equally well in this application). Because the log is effectively a no-overwrite, log-only datastore, a changed node requires its parent to point to the new node, which forces the parent to be updated as well. Now its parent needs updating, and this cascading set of changes proceeds to the root on each update.


The tree is maintained via copy-on-write semantics where updates are written to the front of the log with references to unchanged tree nodes pointing back to the appropriate locations in the log. Whenever a node changes, the changed node is written to the front of the log. Consequently, every database change results in new copies of all nodes on the path from the changed node up to the root of the search tree.


This has the downside of requiring many tree nodes to be updated on each database update but has the upside of the writes all being sequential at the front of the log. Since it is a no-overwrite store, when an update is made, the old nodes remain, so transactional time travel is easy. The old search tree root still points to a complete tree that was current as of the point in time when that root was the current root of the search tree. As new nodes are written, some old nodes are no longer part of the current search tree and can be garbage collected over time.
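
To make the copy-on-write behavior concrete, here is a minimal sketch in Python. It is not Hyder's code: the class name AppendOnlyTree, the in-memory list standing in for the flash log, and the integer version numbers are all assumptions made for illustration. It shows the path from a changed node to the root being rewritten at the front of the log, and old roots remaining readable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional[int]   # log offset of the left child, or None
    right: Optional[int]  # log offset of the right child, or None


class AppendOnlyTree:
    """Binary search tree whose nodes live in a no-overwrite log (hypothetical)."""

    def __init__(self) -> None:
        self.log: List[Node] = []                 # stand-in for the shared log
        self.roots: List[Optional[int]] = [None]  # root offset after each update

    def _append(self, node: Node) -> int:
        self.log.append(node)
        return len(self.log) - 1

    def _insert(self, offset: Optional[int], key: int, value: str) -> int:
        # Returns the log offset of a freshly written copy of this subtree's root.
        if offset is None:
            return self._append(Node(key, value, None, None))
        node = self.log[offset]
        if key < node.key:
            child = self._insert(node.left, key, value)
            return self._append(Node(node.key, node.value, child, node.right))
        if key > node.key:
            child = self._insert(node.right, key, value)
            return self._append(Node(node.key, node.value, node.left, child))
        return self._append(Node(key, value, node.left, node.right))

    def put(self, key: int, value: str) -> int:
        """Write the changed path to the front of the log; return the new version."""
        self.roots.append(self._insert(self.roots[-1], key, value))
        return len(self.roots) - 1

    def get(self, key: int, version: int = -1) -> Optional[str]:
        """Look up a key as of a given version (default: the latest root)."""
        offset = self.roots[version]
        while offset is not None:
            node = self.log[offset]
            if key == node.key:
                return node.value
            offset = node.left if key < node.key else node.right
        return None
```

Nothing in the log is ever modified after it is appended, so reading with an older version number is the "time travel" described above. Garbage collection of unreachable nodes is left out of the sketch.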

Transactions are implemented by writing an intention log record to the front of the log containing all changes required by the transaction. The tree nodes in this record point either to other nodes within the intention record or to unchanged nodes further back in the log. This can be done quickly and all updates can proceed in parallel without need for locking or synchronization.


Before the transaction can be completed, it must be checked for conflicts using Optimistic Concurrency Control. If there are no conflicts, the root of the search tree is atomically moved to point to the new root and the transaction is acknowledged as successful. If the transaction is in conflict, it fails, the tree root is not advanced, and the intention record becomes garbage.
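
The commit step can be sketched as follows. This is a deliberately simplified backward validation over flat key sets, not the tree-melding algorithm described in the Hyder papers. The names Intention and Validator, and the use of a commit count as the snapshot identifier, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Intention:
    snapshot: int            # number of commits visible when the transaction started
    read_set: Set[str]       # keys the transaction read
    writes: Dict[str, str]   # after-image: key -> new value


class Validator:
    """Single sequential conflict checker that advances the committed state."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        # Write set of every committed transaction, in commit order.
        self.committed_writes: List[Set[str]] = []

    @property
    def version(self) -> int:
        return len(self.committed_writes)

    def try_commit(self, intent: Intention) -> bool:
        touched = intent.read_set | set(intent.writes)
        # Conflict if any transaction that committed after our snapshot
        # wrote a key we read or wrote.
        for later_writes in self.committed_writes[intent.snapshot:]:
            if later_writes & touched:
                return False  # state is not advanced; the intention is garbage
        self.state.update(intent.writes)
        self.committed_writes.append(set(intent.writes))
        return True
```

Building intention records can happen in parallel on any core, but try_commit is inherently serial, which is why the rate at which this check runs bounds the transaction rate of the whole system.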


Most of the transactional update work can be done concurrently without locks but two issues come to mind quickly:


1)      Garbage collection: because the system is constantly rewriting large portions of the search tree, old versions of the tree are spread throughout the log and need to be recovered.

2)      Transaction Rate: The transaction rate is limited by the rate at which conflicts can be checked and the tree root advanced.


The latter is the biggest concern and the rest of the presentation focuses on the rate at which this bottleneck can be processed. The presenter showed that rates of 400,000 transactions per second were obtained in performance testing, so this is a hard limit, but it is a fairly high hard limit. This design can go a long way before partitioning is required.


If you want to dig deeper, the Hyder presentation is at:


More detailed papers can be found at:


Philip A. Bernstein, Colin W. Reid, Sudipto Das: Hyder - A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20


Philip A. Bernstein, Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)


Colin W. Reid, Philip A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)


Mahesh Balakrishnan, Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement: Flash-Log - A High Throughput Log. DISC 2010: 401-403


James Hamilton





Wednesday, December 14, 2011 9:43:25 AM (Pacific Standard Time, UTC-08:00)
 Sunday, November 27, 2011

While at Microsoft I hosted a weekly talk series called the Enterprise Computing Series (ECS) where I mostly scheduled technical talks on server and high-scale service topics. I said “mostly” because the series occasionally roamed as far afield as having an ex-member of the Ferrari Formula 1 team present. Client-side topics are also occasionally on the list either because I particularly liked the work or technology behind it or thought it was a broadly relevant topic.


The Enterprise Computing Series has an interesting history. It was started by Jim Gray at Tandem. Pat Helland picked up the mantle from Jim and ran it for years before Pat moved to Andy Heller’s HAL Computer Systems. He continued the ECS at HAL and then brought it with him when he joined Microsoft, where he continued to run it for years. Pat eventually passed it to me and I hosted the ECS series for 8 or 9 years myself before moving to Amazon Web Services. Ironically, when I arrived at Amazon, I found that Pat Helland had again created a series in the same vein as the ECS called the Principals of Amazon (PoA) series.


The PoA series is excellent but it doesn’t include external speakers and is hosted on a fixed day of the week so I occasionally come across a talk that I would like to host at Amazon that doesn’t fit the PoA. For those occasions, the Enterprise Computing Series lives on!


In this ECS talk Ashraf Aboulnaga of the University of Waterloo presented High Availability for Database Systems in Cloud Computing Environments. Ashraf presented two topics: 1) RemusDB: Database high availability using virtualization, and 2) DBECS: Database high availability and scalability using eventually consistent cloud storage. The first topic was based upon the VLDB 2011 Best Paper Award winner “RemusDB: Transparent High Availability for Database Systems” by Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Ken Salem, and Andrew Warfield. The second topic is work that is not yet published nor as fully developed.


Focusing on the first paper, they built an active/standby database system using Remus. Remus implements transparent high availability for Xen VMs. It does this by reflecting all writes to memory in the active virtual machine to the non-active, backup VM.  Remus keeps the backup VM ready to take over with exactly the same memory state as the primary server. On failover, it can take over with the same memory contents including an already warm cache.

Remus is a simple and easy to understand approach to getting very fast takeover from a primary VM. The challenge is that memory write latencies are a fraction of network latencies, so any solution that turns memory write latencies into network write latencies simply will not perform adequately for most workloads. Remus tackles this problem using the expected solution: batching many requests in a single network transfer. By default, every 25 msec Remus suspends the primary VM, copies all changed pages to a Dom0 (hypervisor) buffer, and then allows the VM to continue. The Dom0 buffer is used to minimize the length of time that the guest VM needs to be suspended but comes at the expense of requiring sufficient Dom0 memory for the largest group of changed pages in 25 msec.


Once the guest machine's changed pages are copied to Dom0, the primary VM is released from the suspended state and the changes just copied to Dom0 are then transferred to the secondary system and applied to the ready-to-run backup VM.
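
The checkpoint cycle can be sketched in a few lines of Python. This is a toy model and not Remus code: the Primary and Backup classes, the dictionary of pages, and the helper staging_buffer_bytes are all hypothetical, and the dirty-page rate used in the usage note is an invented number chosen only to show the arithmetic.

```python
from typing import Dict, Set


class Primary:
    """Toy primary VM: tracks which pages were written since the last checkpoint."""

    def __init__(self, num_pages: int) -> None:
        self.pages: Dict[int, bytes] = {i: b"\x00" for i in range(num_pages)}
        self.dirty: Set[int] = set()

    def write(self, page_no: int, data: bytes) -> None:
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def checkpoint(self) -> Dict[int, bytes]:
        """Models the suspended window: copy dirty pages to a staging buffer."""
        staged = {n: self.pages[n] for n in self.dirty}
        self.dirty.clear()
        return staged  # the VM resumes as soon as this local copy is done


class Backup:
    """Toy backup VM: applies each staged checkpoint as it arrives."""

    def __init__(self, num_pages: int) -> None:
        self.pages: Dict[int, bytes] = {i: b"\x00" for i in range(num_pages)}

    def apply(self, staged: Dict[int, bytes]) -> None:
        self.pages.update(staged)


def staging_buffer_bytes(dirty_pages_per_sec: int, epoch_ms: int,
                         page_size: int = 4096) -> int:
    """Rough staging buffer size needed to hold one epoch of dirty pages."""
    return dirty_pages_per_sec * epoch_ms // 1000 * page_size
```

As an example of the sizing trade-off, a guest dirtying a hypothetical 100,000 distinct 4 KB pages per second would need staging_buffer_bytes(100_000, 25) = 10,240,000 bytes of buffer per 25 msec epoch. Writes made after the VM resumes are not in the staged copy, which is where the loss of up to one epoch of forward progress on failover comes from.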


The downsides to the Remus approach are 1) a potentially large Dom0 buffer is required, 2) up to 25 msec of forward progress can be lost on failover, and 3) the checkpoint work consumes considerable resources including time. The time to copy the changed pages may be acceptable but the other overheads are sufficiently high that it is very difficult to host demanding workloads like database workloads on Remus.


The authors tackle this problem by noting that Remus actually does more than is needed for database workloads. Or, worded differently, a Remus optimized for database workloads can dramatically reduce the implementation overhead. They introduced the following optimizations:

·         Asynchronous checkpoint compression: Maintain an LRU buffer of recent pages and only ship a delta of these pages. This optimization is based upon the assumption that DB systems modify some pages frequently and typically only change a small part of these pages between checkpoints.

·         Disk read tracking: Don’t mark pages read from disk as dirty since they are already available to the backup server via an I/O.

·         Memory deprotection: Allows the DB to declare regions of memory that don’t need to be replicated. This turned out not to be as powerful an optimization as the others and had the further downside of requiring database engine changes.

·         Network optimization/Commit protection: Remus buffers every outgoing network packet to ensure clients never see the results of unsafe execution, but this increases latency by not allowing any response back to the client until the next Remus checkpoint. Because DBs can fail and transactions can be aborted, the DB optimization is to send all packets back to the client in real time except for commit, abort, or other database transaction state changing operations. On failover, any client in an unprotected network state (changes have been sent since the last checkpoint) has the transaction failed. A correct client will re-run the transaction and proceed without issue.
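
The first optimization in the list, checkpoint compression, can be illustrated with a small sketch. The specifics here are my assumptions rather than details from the talk: the DeltaEncoder name, the use of XOR followed by zlib as the delta format, and the cache capacity are all stand-ins. A real implementation would also need the backup to hold the same prior page contents so it can reverse the delta.

```python
import zlib
from collections import OrderedDict
from typing import Optional, Tuple


class DeltaEncoder:
    """Keeps an LRU cache of recently shipped pages and ships deltas against them."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()

    def encode(self, page_no: int, page: bytes) -> Tuple[str, bytes]:
        previous = self.cache.get(page_no)
        self._remember(page_no, page)
        if previous is None or len(previous) != len(page):
            return ("full", page)  # no usable prior copy: ship the whole page
        # XOR leaves zeros wherever the page is unchanged, which compresses well.
        xor = bytes(a ^ b for a, b in zip(previous, page))
        return ("delta", zlib.compress(xor))

    def _remember(self, page_no: int, page: bytes) -> None:
        self.cache[page_no] = page
        self.cache.move_to_end(page_no)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used page


def decode(kind: str, payload: bytes, previous: Optional[bytes]) -> bytes:
    """Rebuild a page on the backup from a full copy or from a delta."""
    if kind == "full":
        return payload
    if previous is None:
        raise ValueError("delta received without a prior copy of the page")
    xor = zlib.decompress(payload)
    return bytes(a ^ b for a, b in zip(previous, xor))
```

The sketch leans on the same assumption the authors state: a database touches a small set of hot pages repeatedly and changes only a little of each between checkpoints, so most shipped pages hit the cache and shrink to a small delta.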


What was achieved is Remus-style fast-failover protection for database workloads with far lower replication overhead. The authors used the database transaction benchmark TPC-C to show that Remus with DB optimizations has all the protection of Remus but with roughly 1/10th the overhead.



                VLDB Paper:


I'm not 100% convinced Remus is the best solution to the database high availability problem but I like the solution, learned from the proposed optimizations, and enjoyed the talk. Thanks to Pradeep Madhavarapu, who leads part of the Amazon database kernel engineering team (and is hiring :-)), for organizing this talk and to  Ashraf Aboulnaga for doing it.




James Hamilton





Sunday, November 27, 2011 12:50:18 PM (Pacific Standard Time, UTC-08:00)

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.


All Content © 2015, James Hamilton