Back in 2007, the audacious RE<C project was started. The goal of RE<C was simple: make renewable energy less costly than coal and let economics do the hard work of converting the world’s energy producers to renewables. I blogged about the project in Solving World Problems With Economic Incentives, summarizing it with “the core idea is that, if renewable energy sources were cheaper than coal, economic forces would quickly make the right thing happen and we would actually stop burning coal. I love the approach but it is fiendishly difficult.”
Unfortunately, RE<C really was fiendishly difficult and the project was abandoned in 2011. But I still love it, partly because it was attempting to address an incredibly important world problem and partly because I like the use of economic incentives to solve world problems. Anytime we can make it financially advantageous for companies and countries to do the right thing, it’s far more likely to happen. Economic incentives are one of the best ways to influence governments of all sorts: democratic governments, autocratic regimes, and pretty much every form of government in between. With economic incentives aligned, lobby groups are far less likely to successfully block progress. Admittedly, even the obvious transitions with clear benefit still take time, but there are few forces more powerful at driving global change than economics.
Moving down from the global level to that of individual companies, I’ve long advocated the use of economic incentives to drive innovative uses of computing resources inside the company while preventing costs from spiraling out of control. Most IT departments control costs by keeping computing resources in short supply and buying more only slowly and with considerable care. Computing is effectively a scarce resource, so it gets used carefully. This limits IT cost growth and controls wastage, but it also limits overall corporate innovation and the gains from the experiments that need those additional resources.
I’m a big believer in making effectively infinite computing resources available internally and billing them back precisely to the teams that use them. Of course, each internal group needs to show the customer value of its resource consumption. Asking every group to effectively be a standalone profit center is, in some ways, complex in that the “product” of some groups is hard to measure quantitatively. But giving teams the resources they need to experiment, and then allowing successful experiments to progress rapidly into production, encourages innovation, makes for a more exciting place to work, and the improvements brought by successful experiments help the company be more competitive and better serve its customers.
I argue that all employees should be limited only by their ability rather than by an absence of resources or an inability to argue convincingly for more. This is one of the most important yet least discussed advantages of cloud computing: taking away artificial resource limitations in support of light-weight experimentation and rapid innovation. Making individual engineers and teams responsible for delivering more value for more resources consumed makes it possible to encourage experimentation without fear that costs will rise without sufficient value being produced. And, because cloud computing is so inexpensive and comes without a long term commitment, a single engineer can do a trial run of a 1,000 core analysis to improve supply chain logistics without appreciable financial risk. If it works, keep doing it and reap the economic gain. If it doesn’t work, little was spent; it may have been a failed experiment, but it was an inexpensive failed experiment. Economic systems are very powerful at driving innovation. That’s one of the reasons why the venture funded startup community has been successful at innovating faster than even some very well-funded big companies.
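As a back-of-envelope illustration of how inexpensive that kind of trial run can be, here’s a quick sketch. The per-core-hour price and the 4-hour run time are assumed figures for illustration, not quotes from any particular cloud provider:

```python
# Rough cost of an on-demand, large-scale experiment like the 1,000 core
# analysis described above. Price and duration are illustrative assumptions.
def experiment_cost(cores, hours, price_per_core_hour):
    """Cost of renting `cores` cores for `hours` hours on demand."""
    return cores * hours * price_per_core_hour

# 1,000 cores for an assumed 4 hour analysis at an assumed $0.05/core-hour:
cost = experiment_cost(cores=1_000, hours=4, price_per_core_hour=0.05)
print(f"${cost:,.2f}")  # → $200.00
```

At prices anywhere near that range, a failed experiment costs less than a team lunch, which is exactly why the economics favor trying things.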
Returning to the RE<C project, two engineers from the project recently wrote up what was learned for the IEEE Spectrum article “What It Would Really Take to Reverse Climate Change.” The article talks briefly about the project and its goals but, rather than digging into why it failed, the authors instead discuss why the project succeeding wouldn’t have achieved the overall goal of reversing climate change. That is surprising at first: how could producing energy less expensively than coal possibly fail to reverse climate change? Technically, the goal of RE<C was to drive the cost of renewable energy below the cost of coal rather than to reverse climate change (RE<C Initiative), so it isn’t 100% correct to speculate that it could succeed and yet simultaneously fail at that definitional goal. But the authors are arguing an important point: RE<C could have succeeded and yet still have failed to reverse climate change, which was clearly at least a motivator for the funding of the RE<C effort.
How could renewable energy less expensive than coal possibly fail to reverse climate change? There are two primary factors at play, the first of which is fairly obvious and the latter perhaps less so. Looking first at the more obvious reason, the RE<C project ended up focusing on solar power, and solar has the downside of not producing around the clock. Solar arrays produce far less on cloudy days, are generally poor producers in geographies with unfavorable weather patterns, and don’t produce at all during the night. Coal produces 24x7, so beating coal requires that the renewable energy either be produced on demand or be efficiently stored to support the load through lower production periods. Getting both power storage and power production costs below coal is an even harder problem than the simple RE<C goal. However, there is still no question in my mind that delivering on RE<C, even without a high-scale storage solution, would have had a phenomenally positive impact on the world climate problem. There is already an abundance of good work going on in utility scale energy storage using flywheels, Li-Ion batteries, compressed air, and pumped water, amongst other solutions.
The second reason why a successful RE<C might not have been sufficient to reverse climate change is the core focus of the article (What It Would Really Take to Reverse Climate Change). The authors present data and argue that the limit on carbon dioxide concentration in earth’s atmosphere needed to avoid global warming is around 350 parts per million (ppm). We are already at around 400 ppm and steadily worsening so, to reverse climate change, not only must we stop putting more carbon dioxide into the atmosphere but we must also remove roughly 13% of the CO2 already in the atmosphere.
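The arithmetic behind that removal figure is simple. The sketch below assumes, as a simplification, that the required reduction is just the fractional drop from the current concentration to the target:

```python
# Fraction of atmospheric CO2 to remove to get from ~400 ppm back to the
# ~350 ppm target, assuming (as a simplification) that concentration
# scales linearly with the amount of CO2 removed.
current_ppm = 400
target_ppm = 350

fraction_to_remove = (current_ppm - target_ppm) / current_ppm
print(f"{fraction_to_remove:.1%}")  # → 12.5%
```

That 12.5% is consistent with the roughly 13% figure above, given how approximate the 400 ppm starting point is.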
I enjoyed the article and generally found the data presented credible but I don’t look at the problem with quite so binary a perspective. Delivering RE<C but not finding a way to remove CO2 from the atmosphere would be a tremendous success. I would love to only have to face the carbon sequestration problem :-). I also don’t view research that drops the cost of renewable energy but fails to get it below the cost of coal as failure. The closer we get to these goals, even if they are not fully met, the more likely industry is to choose the cleaner solution. Good article, great topic, and, whether alive or dead, I still love the RE<C project.
· Summary of the RE<C Project: https://www.google.org/rec.html
· IEEE Spectrum article: http://spectrum.ieee.org/energy/renewables/what-it-would-really-take-to-reverse-climate-change
--James Hamilton, http://perspectives.mvdirona.com, firstname.lastname@example.org
The internet and the availability of content broadly and uniformly to all users has driven the largest wave of innovation ever experienced in our industry. Small startups offering a service of value have the same access to customers as the largest and best funded incumbents. All customers have access to the same array of content regardless of their interests or content preferences. Some customers have faster access than others but, whatever the access speed, all customers have access to all content uniformly. Some countries have done an amazing job of getting high speed access to a very broad swath of the population. South Korea has done a notably good job on this measure. The US is nowhere close to the top by this measure, nor does the US have anything approaching the best price/performing access. But it’s nowhere close to the worst either. And, up until the most recent Federal Communications Commission (FCC) proposal, it’s always been the case that when users buy internet connectivity they get access to the entire internet.
Given how many huge US companies have been built upon the availability of broad internet connectivity, it’s at least a bit surprising that the US market isn’t closer to the best connected. And given that the US still has the largest number of internet-based startups -- almost certainly some of the next home run startups will come from this group -- one would expect a very strong belief in the importance of maintaining broad access to all content and the importance of even the smallest startups having uniform access to customers. Surprisingly, this is not the case and the US Federal Communications Commission has proposed that network providers should be able to choose what content their customers get access to and at what speed.
Clearly, large owners of eyeball networks like Comcast and Verizon would like to have a two-sided market. In this model, they would charge customers to access the internet and, at the same time, charge content providers to have access to “their” customers. I abstractly understand why they would want to be allowed to charge both consumers and providers. There is no question that this would be highly profitable. In many ways, that’s one of the wonderful things about the free market economy: companies will be very motivated to do what is best for themselves and their shareholders, and this has driven much innovation. As long as there is competition without collusion, it’s mostly a good thing. But there are potential downsides. There need to be guardrails to prevent the free market from doing things that are bad for society. Society shouldn’t let companies use child labor, for example. Charging content providers for the privilege of being able to provide services to customers who want access to those services and have already paid Comcast, Verizon, Time-Warner etc. for access is another example of corporate behavior not in the best interests of society as a whole. We want customers who paid to access the internet to get access to all of it and not just the slice of content from providers that paid the most. Comcast should not be the arbiter of whom customers get to buy services from. Surprisingly, that is what the FCC is currently proposing.
Allowing last-mile network providers to decide which services are available or usable by a large swath of customers without access to competing services is a serious mistake. Losing network neutrality is not good for customers, it’s not good for content providers, it’s not good for innovation, but it is, oh, so very good for Verizon and Comcast. Just as using child labor isn’t something we allow companies to do even if it could help them be more profitable, we should not allow these companies to hold content providers hostage. Without network neutrality, content providers must pay last mile network providers or lose access to customers connecting to the internet using those networks. This is why Netflix has been forced to grudgingly pay the “protection” money required by the major eyeball networks.
Netflix is perhaps the best example of a provider that customers would like to have access to but has been forced to pay last mile network owners like Verizon even though Verizon customers have already paid for access to Netflix. Some recent articles:
· Netflix blames Sluggish streaming on Verizon, other service providers
· Netflix Agrees to Pay Verizon for faster Internet, Too
· Netflix is Still Mad at Verizon and Has the Charts to Prove It
· Netflix got worse on Verizon even after Netflix agreed to pay Verizon
The open internet and network neutrality have helped to create a hotbed of innovation and have allowed new winners to emerge every day. The proposal to give them up is hard to understand and is open to cynical interpretations revolving around the power of lobbyists and campaign contributions over what is best for the constituents.
Network neutrality is a serious topic but comedian John Oliver has done an excellent job of pointing out how ludicrous this proposal really is. In fact, Oliver did such a good job of appealing to the broader population to actively comment during the FCC comment period that the FCC web site failed under the load. This video is a bit long at just over 13 minutes but it really is worth watching John Oliver point out some of the harder to explain aspects of the proposal: http://www.theguardian.com/technology/2014/jun/03/john-oliver-fcc-website-net-neutrality?CMP=fb_us.
And, once you have seen the video, leave your feedback for the FCC on their proposal to give up network neutrality at http://fcc.gov/comments.
Enter your comments against the ironically titled "14-28 Protecting and Promoting the Open Internet." My comments follow:
The internet and the availability of content broadly and uniformly to all users (network neutrality) has driven the largest wave of innovation ever experienced in the technology sector. Small startups offering a service have the same access to customers as the largest and best funded incumbents. All customers have access to the same array of content regardless of their interests or content preferences. Some customers have faster access than others but, whatever the access speed, all customers have access to all content uniformly. Some of the most successful companies in the country have been built on this broad and uniform access to customers. The next wave of startups is coming and achieving incredible valuations in IPOs or acquisitions. Giving up network neutrality puts our technology industry at risk by making it harder for new companies to get access to customers and places far too much power in the hands of access network providers where we have few competitive alternatives.
--James Hamilton, email@example.com, http://perspectives.mvdirona.com
It’s difficult to adequately test complex systems. But what’s really difficult is keeping a system adequately tested. Creating systems that do what they are designed to do is hard but, even with the complexity of these systems, many life critical systems have the engineering and production testing investment behind them to be reasonably safe when deployed. It’s keeping them adequately tested over time, as conditions and the software system change, where we sometimes fail.
There are exceptions to the general observation that we can build systems that operate safely inside reasonable expectations of operating conditions. One I’ve written about was the Fukushima Dai-1 nuclear catastrophe. Any reactor design that doesn’t anticipate total power failure, including failure of the backups, is ignoring considerable history. Those events, although rare, do happen; life critical designs need to expect them and this fault condition needs to be tested in a live production environment. “Nearly expected” events shouldn’t bring down life critical systems. More difficult to protect against are 1) the impact of black swan events and 2) the combination of vastly changing environmental conditions over time.
Another even more common cause of complex system failure is the impact of human operators. Most systems are exposed to human factors, and human error will always be a leading cause of complex system failure. One I’ve written about and continue to follow is the Costa Concordia Grounding. It is well on its way to becoming a textbook case of how human error can lead to loss of life and how poor decisions can cascade to provide opportunity for yet more poor decisions by the same people who got the first one wrong and are now operating under incredible stress. It’s a super interesting situation that I will likely return to in the near future to summarize what has been learned over the last few years of investigation, salvage operation, and court cases.
Returning to the two other causes of complex system failure I mentioned earlier: 1) the impact of black swan events and 2) compounding changes in environmental conditions. Much time gets spent on how to mitigate the negative impact of rare and difficult to predict events. The former are, by definition, very difficult to adequately predict, so most mitigations involve reducing the blast radius (localizing the negative impact as much as possible) and designing the system to fail into a degraded but non-life threatening state (degraded operations mode). The more mundane of the two conditions is the second, compounding changes in the environment in which the complex system is operating. These are particularly difficult in that, quite often, none of the changes are large and none happen fast, but scale changes, traffic conditions change, workload mixes change, and the sum of all changes over a long period of time can put the system in a state far outside those anticipated by the original designers and, consequently, never tested.
I came across an example of these latter failure modes in my recent flight through Los Angeles Airport on April 30th. While waiting for my flight from Los Angeles north to Vancouver, we were told there had been a regional air traffic control system failure and the entire Los Angeles area was down. These regional air traffic control facilities are responsible for the air space between airports. In the US, there are currently 22 Area Control Centers, referred to by the FAA as Air Route Traffic Control Centers (ARTCC). Each ARTCC is responsible for a portion of the US air space outside of the inverted pyramid controlled by each airport. The number of ARTCCs has been increasing but, even with larger numbers, the negative impact of one going down is broad. In this instance, all traffic for the LAX, Burbank, Long Beach, Ontario, and Orange County airports was brought to a standstill.
As I waited to board my flight to Vancouver, more and more aircraft were accumulating at LAX. Over the next hour or so the air space in the region drained as flights landed or were diverted but none departed. 50 flights were canceled and 428 flights were delayed. The impact of the cancelations and delays rippled throughout North America and probably world-wide. As a consequence of this delay, I missed my flight from Vancouver to Victoria much later in the evening, and many other passengers passing through Vancouver were impacted even though it’s a long way away, in a different country, and many hours later. An ARTCC going down for even a short time can lead to delays in the system that take nearly a day to fully resolve.
Having seen the impact personally, I got interested in what happened. What took down the ARTCC system controlling the entire Los Angeles region? It has taken some time and, in the interim, the attention of many news outlets has wandered elsewhere but this BBC article summarizes the failure: Air Traffic Control Memory Shortage Behind Air Chaos and this article has more detail: Fabled U-2 Spy Plane Begins Farewell Tour by Shutting Down Airports in the L.A. Region.
The failing system at the ARTCC was the En-Route Automation Modernization (ERAM) system, which can track up to 1,900 flights simultaneously using data from many sensors including 64 RADAR systems. The system was deployed in 2010 so it’s fairly new, but we all know that over time all regions get busier. The airports get better equipment allowing them to move planes safely at greater rates, take-off and landing frequency goes up, some add new runways, most add new gates. And the software system itself gets changes, not all of which will improve scaling or maximum capacity. Over time, the load keeps going up and the system moves further from the conditions under which it was designed, developed, and tested. This happens to many highly complex systems, and some end up operating at an order of magnitude higher load or a different load mix than originally anticipated – this is a common cause of complex system failure. We know that systems eventually go non-linear as load increases, so we need to constantly probe 10x beyond what is possible today to ensure that there remains adequate headroom between possible operating modes and the system failure point.
The FAA ERAM system is a critical life safety system, so presumably it operates with more engineering headroom and safety margin than some commercial systems, yet it still went down hard and all backups failed in this situation. What happened appears to have been a combination of slowly ramping load and a rare event. In this case, a U-2 spy plane was making a high altitude pass over the Western US as part of its farewell tour. The U-2 is a spy plane with an operational ceiling of 70,000’ (13.2 miles above earth) and a range of 6,405 miles at a speed of 500 mph. It flies well above commercial air traffic. The U-2 is neither particularly fast nor long range, but you have to remember its first flight was in 1955. It’s an aging technology that was incredibly advanced when it was first produced by the famous Lockheed Skunk Works group led by Kelly Johnson. Satellite imagery and drones have largely replaced the U-2, and the $32,000 bill for each flight hour has led to the planned retirement of the series.
What brought the ERAM system down was the combination of the typically heavy air traffic in the LA region and an imprecise flight plan filed for the U-2 spy plane passing through the region on a farewell flight. It’s not clear why a plane flying well above the commercial flight ceiling was viewed as a collision threat by ERAM, but the data stream driven by the U-2 flight ended up consuming massive amounts of memory, which brought down both the primary and secondary software systems, leaving the region without air traffic control.
The lessons here are at least twofold. First, as complex systems age, the environmental conditions under which they operate change dramatically. Workload typically goes up, workload mix changes over time, and software changes made over time will shift the bounds of what the system can reliably handle. Knowing this, we must always be retesting production systems with current workload mixes, and we must probe the bounds well beyond any reasonable production workload. As operating and environmental conditions evolve, the testing program must be updated. I have seen complex systems fail where, upon closer inspection, it’s found that the system had been operating for years beyond its design objectives and its original test envelope. Systems have to be constantly probed to failure so we know where they will operate stably and where they will fail.
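One way to make "probe to failure" concrete is a harness that geometrically ramps a synthetic, production-like workload until a latency objective is breached, then reports how much headroom exists above today's peak. The sketch below is purely illustrative: `run_load_test`, the SLO threshold, and the simulated latency curve are assumptions, not anything from the real ERAM system:

```python
# A minimal sketch of probing a system to its failure point. Assumes a
# hypothetical run_load_test(rate) harness that replays a production-like
# workload mix at `rate` requests/sec and returns observed p99 latency (ms).
def find_failure_point(run_load_test, start_rate, slo_p99_ms, step=1.25):
    """Geometrically ramp load until the latency SLO is breached;
    return the last rate that still met the SLO."""
    rate = start_rate
    last_good = None
    while True:
        p99 = run_load_test(rate)
        if p99 > slo_p99_ms:
            return last_good
        last_good = rate
        rate *= step  # probe well beyond today's peak, not just to it

def headroom(failure_rate, production_peak):
    """How many multiples of today's peak the system absorbs before failing."""
    return failure_rate / production_peak

# Simulated harness: latency grows non-linearly as load approaches capacity.
def fake_harness(rate, capacity=10_000):
    return 50.0 if rate < 0.8 * capacity else 50.0 * (rate / (0.8 * capacity)) ** 8

limit = find_failure_point(fake_harness, start_rate=1_000, slo_p99_ms=200.0)
print(f"headroom: {headroom(limit, production_peak=1_000):.1f}x")  # → headroom: 9.3x
```

The point of re-running a probe like this continuously is that `limit` drifts as the workload mix and the software change; the number you measured at deployment time tells you little years later.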
The second lesson is that rare events will happen. I doubt a U-2 pass over the western US is all that rare, but something about this one was unusual. We need to expect that complex systems will face unexpected environmental conditions and look hard for some form of degraded operations mode. Fault containment zones should be made as small as possible; we want to look for ways to deliver the system such that some features may fail while others continue to operate. You don’t want to lose all functionality for the entire system in a failure. For all complex systems, we are looking for ways to divide up the problem such that a fault has a small blast radius (less impact), and we want the system to gracefully degrade to less functionality rather than have the entire system hard fail. With backup systems it’s particularly important that they are designed to fail down to less functionality rather than also hard failing. In this case, having the backup come up, require more flight separation, and do less real time collision calculation might be the right answer. Generally, all systems need some way to operate at less than full functionality or less than full scale without completely hard failing. These degraded operation modes are hard to find but I’ve never seen a situation where we couldn’t find one.
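A degraded operations mode can be as simple as shedding the most expensive features first rather than crashing under resource pressure. The sketch below illustrates the idea; the feature names, the 80% threshold, and the shedding order are all hypothetical, not drawn from the real ERAM design:

```python
# A sketch of graceful degradation: under memory pressure, shed expensive
# features one at a time instead of hard failing the whole system.
# Feature names and thresholds are invented for illustration.
class DegradableService:
    # Features ordered from most to least expendable.
    SHEDDABLE = ["realtime_collision_prediction", "track_history", "weather_overlay"]

    def __init__(self):
        self.disabled = set()

    def on_memory_pressure(self, used_fraction):
        """Shed one feature per call above 80% memory use instead of crashing."""
        if used_fraction > 0.8:
            for feature in self.SHEDDABLE:
                if feature not in self.disabled:
                    self.disabled.add(feature)
                    return f"degraded: shed {feature}"
            return "degraded: minimum mode, widening aircraft separation"
        return "nominal"

svc = DegradableService()
print(svc.on_memory_pressure(0.9))  # → degraded: shed realtime_collision_prediction
```

The final fallback here — keep running with wider separation and less computation — mirrors the suggestion above that a backup failing down to reduced functionality beats a backup that also hard fails.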
It’s not really relevant but still ironic that both the trigger for the fault, the U-2 spy plane, and the system brought down by the fault, ERAM, were both Lockheed produced products although released 55 years apart.
Unfortunately, it’s not all about wringing the last bit of reliability out of complex systems. There is still a lot of low hanging fruit out there. As proof that even some relatively simple systems can be poorly engineered and produce remarkably poor results while working well within their design parameters: on a 747-400 flying back to New Zealand from San Francisco, at least 3 passengers had lost bags. My loss appears to be permanent at this point. It’s amazing that couriers can move 10s of millions of packages a year over far more complex routings and yet, in passenger air travel baggage handling, the results are several orders of magnitude worse. I suspect that if the operators were economically responsible for the full loss, the quality of air baggage tracking would come up to current technology levels fairly quickly.
James Hamilton, firstname.lastname@example.org
In the data center world, there are few events taken more seriously than power failures, and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America, eclipsing the final episode of M*A*S*H. World-wide, the Super Bowl is behind only the European Cup (UEFA Champions League), which draws 178 million viewers.
When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.
At 13:22 in the third quarter, just 98 seconds into the second half, half of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure led to a 34 minute delay while full lighting to the field was restored and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage, with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting, and some will argue the power failure actually made the game more exciting. But NFL championships should be decided on the field and not impacted by the electrical system of the host stadium.
What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supplying power to the Superdome, reported that their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.
The joint report from SMG, the company that manages the Superdome, and Entergy, the utility power provider, said:
A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.
Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.
Essentially, the utility circuit breaker detected an “anomaly” and opened. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is that they are incredibly flexible and can be configured uniquely for each installation. The disadvantage is that the wide variety of configurations they support can be complex, and the default configurations are used perhaps more often than they should be. In a country where legal settlements can be substantial, the default configurations tend towards the conservative side. We don’t know if that was a factor in this event, but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.
The cause has not yet been reported and, quite often, the underlying root cause is never found. But it’s worth asking: is it possible to avoid long game outages, and what would it cost? As when looking at any system fault, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.
Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given that the Superdome just went through a $336 million renovation, the switchgear may have been relatively new and, even if it wasn’t, it almost certainly was recently maintained and inspected.
Where issues often arise is in configuration. Modern switchgear have an amazingly large number of parameters, many of which interact with each other and, in total, can be difficult to fully understand. And, since switchgear manufacturers know little about the intended end-use application of each unit sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (a breaker that opens when it shouldn’t) is far less than that of a breaker that fails to open. Consequently, conservative settings are common.
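To make the point concrete, here’s a hypothetical sketch of how a few protection settings might differ between shipped defaults and values tuned for a specific site. The parameter names and numbers are invented for illustration; real switchgear exposes far more, and more subtly interacting, settings:

```python
# Hypothetical protection settings, for illustration only. A default left
# in place trips sooner and more conservatively than the site requires.
conservative_defaults = {
    "ground_fault_pickup_amps": 50,   # trips on small imbalances
    "trip_delay_ms": 50,              # opens quickly
    "auto_reclose": False,            # stays open until manually reset
}

tuned_for_site = {
    "ground_fault_pickup_amps": 400,  # sized to this facility's measured load
    "trip_delay_ms": 200,             # rides through brief transients
    "auto_reclose": True,             # retries once before locking out
}

# Flag every parameter still sitting at its shipped default value:
for key, default in conservative_defaults.items():
    if tuned_for_site[key] != default:
        print(f"{key}: default {default} -> tuned {tuned_for_site[key]}")
```

A review like this, comparing deployed settings against the measured characteristics of the facility, is exactly the kind of configuration work that often gets skipped after installation.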
Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test just short of overload to ensure the equipment can still reliably support the full load even though the production design will never run it that close to the limit and, finally, test into overload to ensure that the breakers open on real faults.
The first, testing in a full production environment in a non-mission critical setting, is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skilled electricians, so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).
Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by a utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.
The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds, each protected by its own generator. Generators in the 2.5 to 3.0MW range are substantial V16 diesel engines the size of a mid-sized bus. And they are expensive, running just under $1M each, but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year-round with a permanent installation. We would need two generators, the switchgear to connect them to the load, and uninterruptible power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be extra safe, we’ll buy a third generator just in case one of the two primary generators doesn’t start. The generators are under $1M each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10M. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4M.
For the price of just over 60 seconds of commercials, the facility could be protected against this fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities choose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, the system is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage, so I’m not a big fan of this “safety” configuration, but it is a common choice.
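The arithmetic above is easy to sketch. The figures are the rough estimates from this post, not vendor quotes, and the split between generators and the rest of the gear is my assumption:

```python
# Rough cost model for full power redundancy at the Superdome,
# using the estimates from the text (not vendor quotes).
facility_load_mw = 4.6
generator_cost = 1_000_000           # ~$1M per 2.5-3.0MW diesel generator
generators_needed = 3                # 2 to carry the load + 1 spare
switchgear_and_ups = 7_000_000       # assumed remainder of the ~$10M total
total_redundancy_cost = generators_needed * generator_cost + switchgear_and_ups

commercial_cost_per_30s = 4_000_000  # 2012 Super Bowl rate
seconds_of_commercials = 30 * total_redundancy_cost / commercial_cost_per_30s
print(f"Redundant power: ${total_redundancy_cost:,}")
print(f"Equivalent airtime: {seconds_of_commercials:.0f} seconds of commercials")
```

However the $10M is split between generators, switchgear, and UPS, the whole installation prices out at roughly a minute of Super Bowl airtime.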
Minimize Fault Zones: The reason why only ½ the power to the Superdome went down is that the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.
Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense is small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checkerboarded over the facility lights (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and continue the game without waiting for light recovery.
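A quick sketch of the checkerboarding idea (the light counts and round-robin layout here are purely illustrative):

```python
# "Checkerboarding" stadium lights across utility feeds: adjacent lights
# draw from different feeds, so losing one feed dims the whole field
# evenly instead of blacking out one region.
def assign_feeds(num_lights, num_feeds):
    """Round-robin lights across feeds (hypothetical layout)."""
    return [light % num_feeds for light in range(num_lights)]

def lights_still_on(assignment, failed_feed):
    return [i for i, feed in enumerate(assignment) if feed != failed_feed]

feeds = assign_feeds(num_lights=24, num_feeds=4)
survivors = lights_still_on(feeds, failed_feed=2)
# With 4 feeds, any single feed failure leaves 3/4 of the lights on,
# spread evenly over the field rather than concentrated in one zone.
print(f"{len(survivors)} of 24 lights remain lit")
```

The same fraction is lost either way; the checkerboard just spreads the loss so the field stays playable.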
Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid the failure, used power redundancy to mask it, and used small fault zones to minimize its impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) where possible, automate recovery so it can happen in seconds rather than at the rate at which humans can act, and 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can quickly see what happened, and that they have trained extensively and are able to act quickly.
In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minutes checking system state before deciding it was safe to restore power. Generally, 15 minutes for a human-judgment-driven recovery decision isn’t bad. But the overall outage was 34 minutes. If the power was restored in 15 minutes, what happened during the next 20? The gas discharge lighting still favored at large sporting venues takes roughly 15 minutes to restart after even a momentary outage. Even a very short power interruption will suffer the same long recovery time. Newer lighting technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game and the relatively low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant designs and employing more aggressive pre-game testing.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
The last few weeks have been busy and it has been way too long since I have blogged. I’m currently thinking through the server tax and what’s wrong with the current server hardware ecosystem but don’t have anything ready to go on that just yet. But there are a few other things on the go. I did a talk at Intel a couple of weeks back and, last week, one at the First Round Capital CTO Summit. I’ve summarized what I covered below with pointers to slides.
In addition, I’ll be at the Amazon in Palo Alto event this evening and will do a talk there as well. If you are interested in Amazon in general or in AWS specifically, we have a new office open in Palo Alto and you are welcome to come down this evening to learn more about AWS, have a beer or the refreshment of your choice, and talk about scalable systems. Feel free to attend if you are interested:
Amazon in Palo Alto
October 11, 2012 at 5:00 PM - 9:00 PM
Pampas 529 Alma Street Palo Alto, CA 94301
First Round Capital CTO Summit:
I started this session by arguing that cost or value models are the right way to ensure you are working on the right problem. I come across far too many engineers and even companies that are working on interesting problems but fail the “top 10 problem” test. You never want to first have to explain the problem to a prospective customer before you get a chance to explain your solution. It is way more rewarding to be working on top 10 problems where the value of what you are doing is obvious and you only need to convince someone that your solution actually works.
Cost models are a good way to force yourself to really understand all aspects of what the customer is doing and know precisely what savings or advantage you bring. A 25% improvement on an 80% problem is way better than a 50% solution to a 5% problem. Cost or value models are a great way of keeping yourself honest on what the real savings or improvements of your approach actually are. And it’s quantifiable data that you can verify in early tests and prove in alpha or beta deployments.
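The arithmetic behind that 25%-vs-50% claim is worth making explicit:

```python
# A modest improvement to a dominant cost beats a dramatic improvement
# to a minor one. "cost_share" is the component's fraction of total cost.
def savings(cost_share, improvement):
    """Fraction of the customer's total cost saved by improving one component."""
    return cost_share * improvement

big_problem = savings(cost_share=0.80, improvement=0.25)    # 25% off an 80% cost
small_problem = savings(cost_share=0.05, improvement=0.50)  # 50% off a 5% cost
print(f"25% of an 80% problem saves {big_problem:.1%} of total cost")
print(f"50% of a 5% problem saves {small_problem:.1%} of total cost")
```

The 80% problem yields an 8x larger total saving, which is exactly why the cost model has to come before the solution.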
I then covered three areas of infrastructure where I see considerable innovation and showed how cost models helped drive me there:
· Networking: The networking eco-system is still operating on the closed, vertically integrated, mainframe model but the ingredients are now in place to change this. See Networking, the Last Bastion of Mainframe Computing for more detail. The industry is currently going through great change. Big change is a hard transition for the established high-margin industry players but it’s a huge opportunity for startups.
· Storage: The storage (and database) worlds are going through an unprecedented change where all high-performance random access storage is migrating from hard disk drives to flash storage. The early flash storage players have focused on performance over price so there is still considerable room for innovation. Another change happening in the industry is the explosion of cold storage (low I/O density storage that I jokingly refer to as write-only) due to falling prices, increasing compliance requirements, and an industry realization that data has great value. This explosion in cold storage is opening up much innovation and many startup opportunities. The AWS entrant in this market is Glacier where you can store seldom accessed data at one penny per GB per month (for more on Glacier, see Glacier: Engineering for Cold Storage in the Cloud).
· Cloud Computing: I used to argue that targeting cloud computing was a terrible idea for startups since the biggest cloud operators like Google and Amazon tend to do all custom hardware and software and purchase very little commercially. I may have been correct initially but, with the cloud market growing so incredibly fast, every telco is entering the market, each colo provider is entering, most hardware providers are entering, … the number of players is going from 10s to 1000s. And, at 1,000s, it’s a great market for a startup to target. Most of these companies are not going to build custom networking, server, and storage hardware but they do have the need to innovate with the rest of the industry.
Slides: First Round Capital CTO Summit
Intel Distinguished Speaker Series:
In this talk I started with how fast the cloud computing market segment is growing using examples from AWS. I then talked about why cloud computing is such an incredible customer value proposition. This isn’t just a short-term fad that will pass over time. I mostly focused on how that statement I occasionally hear can’t possibly be correct: “I can run my on-premise computing infrastructure less expensively than hosting it in the cloud”. I walked through some of the reasons why this statement can only be made with partial knowledge. There are reasons why some computing will be in the cloud and some will be hosted locally, and industry transitions absolutely do take time, but cost isn’t one of the reasons that some workloads aren’t in the cloud.
I then walked through 5 areas of infrastructure innovation and some of what is happening in each area:
· Power Distribution
· Mechanical Systems
· Data Center Building Design
Slides: Intel Distinguished Speaker Series
I hope to see you tonight at the Amazon Palo Alto event at Pampas (http://goo.gl/maps/dBZxb). The event starts at 5pm and I’ll do a short talk at 6:35.
Facebook recently released a detailed report on their energy consumption and carbon footprint: Facebook’s Carbon and Energy Impact. Facebook has always been super open with the details behind their infrastructure. For example, they invited me to tour the Prineville datacenter just prior to its opening:
· Open Compute Project
· Open Compute Mechanical System Design
· Open Compute Server Design
· Open Compute UPS & Power Supply
Reading through the Facebook Carbon and Energy Impact page, we see they consumed 532 million kWh of energy in 2011, of which 509 million kWh went to their datacenters. High scale data centers have fairly small daily variation in power consumption as server load goes up and down, and there are some variations in power consumption due to external temperature conditions since hot days require more cooling than chilly days. But highly efficient datacenters tend to be affected less by weather, spending only a tiny fraction of their total power on cooling. Assuming a flat consumption model, Facebook is averaging, over the course of the year, 58.07MW of total power delivered to its data centers.
Facebook reports an unbelievably good 1.07 Power Usage Effectiveness (PUE) which means that for every 1 Watt delivered to their servers they lose only 0.07W in power distribution and mechanical systems. I always take publicly released PUE numbers with a grain of salt in that there has been a bit of a PUE race going on between some of the large operators. It’s just about assured that there are different interpretations and different measurement techniques being employed in computing these numbers so comparing them probably doesn’t tell us much. See PUE is Still Broken but I Still use it and PUE and Total Power Usage Efficiency for more on PUE and some of the issues in using it comparatively.
Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook but they have excellent server designs (see Open Compute Server Design) so they likely average at or below 300W per server. Since 300W is an estimate, let’s also look at 250W and 350W per server:
· 250W/server: 217,080 servers
· 300W/server: 180,900 servers
· 350W/server: 155,057 servers
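These counts are easy to reproduce from the published energy and PUE figures (the per-server wattages remain guesses, not Facebook-published numbers):

```python
# Reproducing the Facebook server-count estimate from annual energy,
# PUE, and an assumed per-server draw.
def avg_power_mw(annual_kwh):
    """Annual kWh -> average MW, assuming flat consumption."""
    return annual_kwh / (365.25 * 24) / 1000

def server_count(total_mw, pue, watts_per_server):
    """Divide out PUE to get IT load, then divide by per-server draw."""
    it_load_w = total_mw * 1_000_000 / pue
    return int(it_load_w / watts_per_server)

total = avg_power_mw(509_000_000)   # 509 million kWh to the datacenters
for watts in (250, 300, 350):
    print(f"{watts}W/server: {server_count(total, 1.07, watts):,} servers")
```

The same two-line model works for any operator that publishes total consumption and PUE, which is what makes these disclosures so interesting.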
As a comparative data point, Google’s data centers consume 260MW in aggregate (Google Details, and Defends, Its Use of Electricity). Google reports their PUE is 1.14 so we know they are delivering 228MW to their IT infrastructure (servers and storage). Google is perhaps the most focused in the industry on low power consuming servers. They invest deeply in custom designs and are willing to spend considerably more to reduce energy consumption. Estimating their average server power draw at 250W and looking at +/-25W about that average consumption rate:
· 225W/server: 1,013,333 servers
· 250W/server: 912,000 servers
· 275W/server: 829,091 servers
I find the Google and Facebook server counts interesting for two reasons. First, Google was estimated to have 1 million servers more than 5 years ago. The number may have been high at the time but it’s very clear that they have been super focused on workload efficiency and infrastructure utilization. To grow search and advertising as much as they have without growing the server count at anywhere close to the same rate (if at all) is impressive. Continuing to add computationally expensive search features and new products and yet still being able to hold the server count near flat is even more impressive.
The second notable observation from this data is that the Facebook server count is growing fast. Back in October of 2009, they had 30,000 servers. In June of 2010 the count climbed to 60,000 servers. Today they are over 150k.
The NASCAR Sprint Cup Stock Car Series kicks its season off with a bang and, unlike other sports, starts the season with the biggest event of the year rather than closing with it. Daytona Speed Weeks is a multi-week, many race event, the finale of which is the Daytona 500. The 500 starts with a huge field of 43 cars and is perhaps most famous for some of its massive multi-car wrecks. The 17 car pile-up of 2011 made a 43 car field look like the appropriate amount of redundancy just to get a car over the finish line at the end.
Watching 43 stock cars race for the green flag at the start of the race is an impressive show of power as 146,000 lbs of metal charge towards the start line at nearly 200 miles per hour running so close that they appear to be connected. From the stands, the noise is deafening, the wall of air they are pushing can be felt 20 rows up and the air is hot from all the waste heat spilling off the field as they scream to the line.
Imagine harnessing all the power of all the engines from the 43 cars heading towards the start line at Daytona in a single engine. In fact, let’s make it harder: imagine having all the power of all the cars that take the green flag at both Daytona Sprint Cup races each year. Actually, for safety reasons, NASCAR restricts engine output at the Daytona and Talladega superspeedways to approximately 430 hp, but let’s stick with the 750 hp they can produce when unrestricted. If we harnessed all that power into a single engine, we would have an unbelievable 64,500 hp. Last week Jennifer and I were invited to tour the Hanjin Oslo container ship which happens to be single-engine powered. Believe it or not, that single engine is more powerful than the aggregate horsepower of both Daytona starting fields. It has a single 74,700 hp engine.
Last week Peter Kim who supervises the Hanjin shipping port at Terminal 46 invited us to tour the port facility and the Hanjin Oslo container ship. I love technology, scale, and learning how well run operations work so I jumped on the opportunity.
Shortly after arriving, we watched the Oslo being brought into terminal 46. The captain and pilot were both looking down from the bridge wing towering more than 100’ above us giving commands to the tugs as the Oslo is being eased into the dock. Even before the ship was tied off, the port was rapidly coming to life. Dock workers were scrambling to their stations, trucks were starting, container cranes were moving into position, Customs and Border Patrol was getting ready to board, and line handlers were preparing to tie the ship off. There were workers and heavy equipment moving into position throughout the terminal. And, over the next 12 hours, more than a thousand containers would be moved before the ship would be off to its next destination at 6:30am the following morning.
The Oslo is not the newest ship in the Hanjin fleet having been built in 1998. It’s not the biggest ship nor is it the most powerful. But it’s a great example of a well-run, super clean, and expertly maintained container ship. And, starting with the size, here’s the view from the bridge.
The ship truly is huge. What I find even more amazing is that, as large as the Oslo is, there are container ships out there with up to twice the cargo carrying capacity and as much as 45% more horsepower. In fact, the world’s most powerful diesel engine is deployed in a container ship. It’s a 14 cylinder, 3 floor high monster that produces 109,000 hp, designed by the Finnish company Wartsila.
The Hanjin Oslo uses a (slightly) smaller inline 10 cylinder version of the same engine design. The key difference between it and the world’s largest diesel described above is that the engine in the Oslo is 4 cylinders shorter at 10 cylinders inline rather than 14, and it produces proportionally less power. On the Oslo, the engine spans 3 decks so you can only see 1/3 of it at any one time. Here’s the view from the Hanjin Oslo engine room top deck, mid deck, and lower deck:
The engine is clearly notable for its size and power output. But, what I find most surprising is that it’s a two-stroke engine. Two-stroke engines produce power at the beginning of the power stroke as the piston heads down, dump the exhaust towards the end of that stroke, bring in fresh air at the beginning of the next stroke as the piston begins heading back up, and then compress the air for the remainder of that stroke. Towards the end of the compression stroke, fuel is injected into the cylinder where it combusts, rapidly building pressure and pushing the piston back down on the power stroke. Four-stroke engines separate these functions into four strokes: 1) power going down, 2) exhaust going up, 3) intake going down, and 4) compression going up.
Two-stroke engines are common in lawn mowers, chainsaws, and some very small outboards because of their high power-to-weight ratio and a simplicity of design that makes very low cost engines possible. Larger diesel engines used in trucks and automobiles are almost exclusively four-stroke engines. Ironically, the very highest output diesel engines, found in large marine applications, are also two-strokes.
Spending an evening with the Hanjin team, I was super impressed. I love the technology, the scale is immense, everything is very well maintained, and they are clearly excellent operators. If I were moving goods between continents, I would look first to Hanjin.
Most of the time I write about the challenges posed by scaling infrastructure. Today, though, I wanted to mention some upcoming events that have to do with a different sort of scale.
In Amazon Web Services we are tackling lots of really hairy challenges as we build out one of the world’s largest cloud computing platforms. From data center design, to network architecture, to data persistence, to high-performance computing and beyond, we have a virtually limitless set of problems needing to be solved. Over the coming years AWS will be blazing new trails in virtually every aspect of computing and infrastructure.
In order to tackle these opportunities we are searching for innovative technologists to join the AWS team. In other words, we need to scale our engineering staff. AWS has hundreds of open positions throughout the organization. Every single AWS team is hiring including EC2, S3, EBS, EMR, CloudFront, RDS, DynamoDB, and even the AWS-powered Amazon Silk.
On May 17th and 18th we will be holding recruiting events in three cities: Houston, Minneapolis, and Nashville. If you live near any of those cities and are passionate about defining and building the future of computing, you will find more information at the following URL: http://aws.amazon.com/careers/local-events/
You can also send your resume to email@example.com and we will follow up with you.
Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need application level network error detection?” Another frequent question is “non-ECC motherboards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
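To make the “trust nothing” point concrete, here’s a minimal sketch of application-level block checksumming. The 4-byte-CRC-plus-payload block format here is made up for illustration; real systems typically use stronger digests and per-page layouts:

```python
import zlib

# End-to-end check: the application attaches its own checksum to every
# block, trusting neither disk ECC nor TCP checksums to catch everything.
def write_block(payload: bytes) -> bytes:
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def read_block(block: bytes) -> bytes:
    stored = int.from_bytes(block[:4], "big")
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("checksum mismatch: corruption below the application")
    return payload

block = write_block(b"account balance: 1,337")
assert read_block(block) == b"account balance: 1,337"

# Simulate a single bit silently flipped somewhere in the stack below us.
corrupted = bytearray(block)
corrupted[10] ^= 0x01
try:
    read_block(bytes(corrupted))
except IOError as e:
    print(e)  # the application-level check fires
```

The lower layers may well have caught this flip on their own; the point is that the application no longer has to assume they did.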
Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.
On one high scale, on-premise server product I worked on, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps it was introduced by hardware, perhaps firmware, or possibly software. It could even have been corruption introduced by one of our previous releases when those pages were last written. Some of these pages may not have been written for years.
I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating system has millions of lines of code. And our application had millions of lines of code. Any of us can screw up, each has an opportunity to corrupt, and it’s highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually currently running.
Another example: a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting. Over the course of several months, the result was somewhere between amazing and frightening. ECC is firing constantly.
The immediate lesson is you absolutely do need ECC in server applications and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients. Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, is often worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
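For readers who haven’t looked inside ECC, here’s a toy single-error-correcting Hamming(7,4) code. Real DRAM ECC uses wider SECDED codes over 64-bit words, but the mechanism, parity bits whose failure pattern names the flipped bit, is the same:

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits. Any single
# flipped bit in the 7-bit codeword can be located and corrected.
def hamming74_encode(nibble):
    """4 data bits (int 0..15) -> 7-bit codeword, positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]          # d1..d4
    p1 = d[0] ^ d[1] ^ d[3]                            # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                            # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                            # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(word):
    """7-bit codeword -> corrected 4 data bits."""
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:                       # non-zero syndrome names the bad position
        bits[syndrome - 1] ^= 1
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d))

word = hamming74_encode(0b1011)
flipped = word ^ (1 << 3)              # a "cosmic ray" flips one bit
assert hamming74_decode(flipped) == 0b1011
```

On a non-ECC client, that flipped bit would simply have become a wrong value in memory.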
Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week, learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt, both of which roughly translate to Phobos Ground) was a space mission designed to, amongst other objectives, return soil samples from the Martian moon Phobos. The mission launched atop the Zenit-2SB launch vehicle, taking off from the Baikonur Cosmodrome at 2:16am on November 9th, 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay subsequently sent the satellite plunging to earth in a fiery end to what was a very expensive mission.
What went wrong aboard Phobos-Grunt? On February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft "Phobos-Grunt". Of course, this document was released in Russian but Google Translate actually does a very good job with it. IEEE Spectrum reported on the failure as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary and the translated Russian report offers more detail if you are interested in digging deeper.
The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory failure. The computer was part of a newly-developed flight control system that had focused on dropping the mass of the flight control systems from 30 kg (66 lbs) to 1.5 kg (3.3 lbs). Less weight in flight control is more weight that can be in payload, so these gains are important. However, this new flight control system was blamed both for delaying the mission by 2 years and for its eventual demise.
The two flight control computers are both identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau (Phobos Grunt Design). The official postmortem reports that both computers suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMs are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512k by 32 bit wide memory, “V” for the improvement mark, “20” for 20ns memory access time, “G24” for the package type, and “M” indicating a military grade part.
In the paper “Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices”, Joe Benedetto reports that these SRAM packages are very susceptible to “latchup”, a condition which requires power recycling to return to operation and can be permanent in some cases. Steven McClure of the NASA Jet Propulsion Laboratory, leader of the Radiation Effects Group, reports these SRAM parts would be very unlikely to be approved for use at JPL (Did Bad Memory Chips Down Russia’s Mars Probe).
It is rare that even two failures will lead to disaster and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system stabilizes itself in an extreme condition to allow flight control personnel back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation would likely have been fairly simple: restarting both computers (which probably happened automatically) and restarting the mission would likely have been sufficient.
Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.
I find this last fault fascinating. Smart people could never make such an obviously incorrect mistake, and yet this sort of design flaw shows up all the time in large systems. Experts in each vertical area or component do good work. But the interactions across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may not get seen. Good people design good components and yet there often exist obvious fault modes across components that get missed.
Systems sufficiently complex to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) very experienced, highly-skilled engineers on the team focused on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.
The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.
Summarizing some of the lessons from this loss: the SRAM chip probably was a poor choice. The computer systems should restart, scrub memory for faults, and be able to detect corrupted code and load it from secondary locations before going into safe mode. Safe mode has to actually allow mitigating actions to be taken from a ground station or it is useless. Software systems should be constantly scrubbing memory for faults and checksumming the running software for corruption. A tiny amount of processor power spent on continuous, redundant checking and a few more lines of code to implement simple recovery paths when a fault is encountered might have saved the mission. Finally, we all have to remember the old adage “nothing works if it is not tested.” Every major fault mode has to be tested. Error paths are the ones that commonly go untested so it is particularly important to focus on them. The general rule is to keep error paths simple, use as few as possible, and test them frequently.
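The scrub-and-recover advice can be sketched in a few lines. The class, region names, and recovery policy here are hypothetical, just an illustration of the pattern, not flight software:

```python
import hashlib

# Continuous scrubbing: every protected region keeps a digest and a
# redundant copy. The scrubber detects silent corruption and repairs
# the primary from the replica rather than trusting memory to be stable.
def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ScrubbedStore:
    def __init__(self, regions):
        self.primary = {name: bytearray(data) for name, data in regions.items()}
        self.replica = {name: bytes(data) for name, data in regions.items()}
        self.digests = {name: digest(data) for name, data in regions.items()}

    def scrub(self):
        """One scrub pass: verify every region, repair any mismatch."""
        repaired = []
        for name, data in self.primary.items():
            if digest(bytes(data)) != self.digests[name]:
                self.primary[name] = bytearray(self.replica[name])
                repaired.append(name)
        return repaired

store = ScrubbedStore({"flight_code": b"orient solar panels; await uplink"})
store.primary["flight_code"][5] ^= 0x40      # inject a silent bit flip
print(store.scrub())                          # -> ['flight_code']
```

Run on a timer, a loop like this turns a latent corruption into a logged, repaired event instead of a mission-ending fault.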
Back in 2007, I wrote up a set of best practices on software design, testing, and operations of high scale systems:
On Designing and Deploying Internet-Scale Services. This paper targets large-scale services but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follow from that complexity.
This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block, and have well-designed and well-tested failure modes for even unlikely events. Rather than writing complex recovery logic for the near-infinite number of possible faults, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that flip bits and corrupt data, and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
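A fault-injection harness for this can be very small. The sketch below (names and framing are mine, not from any particular system) pairs every stored block with a checksum, refuses to return a block whose checksum no longer matches, and includes a fault injector that flips a random bit so tests can prove the corruption is actually detected:

```python
import random
import zlib


def store(block: bytes) -> tuple[bytes, int]:
    """Store a data block alongside its CRC32 checksum."""
    return block, zlib.crc32(block)


def read(block: bytes, crc: int) -> bytes:
    """Return the block only if its checksum still matches."""
    if zlib.crc32(block) != crc:
        raise IOError("corrupt block detected")
    return block


def flip_random_bit(block: bytes, rng: random.Random) -> bytes:
    """Fault injector for tests: flip one randomly chosen bit."""
    data = bytearray(block)
    i = rng.randrange(len(data))
    data[i] ^= 1 << rng.randrange(8)
    return bytes(data)
```

Running the injector in the test environment is how you learn, before launch, whether the recovery paths actually fire.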
To dig deeper into the Phobos-Grunt loss:
Don't be a show-off. Never be too proud to turn back. There are old pilots and bold pilots, but no old, bold pilots.
I first heard the latter part of this famous quote, attributed to US Airmail pilot E. Hamilton Lee, back when I raced cars. At that time, one of the better drivers in town, Gordon Monroe, used a variant of it (with pilots replaced by racers) when giving me driving advice. Gord’s basic message was that it is impossible to win a race if you crash out of it.
Nearly all of us have taken the odd chance and made some decisions that, in retrospect, just didn’t make sense from a risk-versus-reward perspective. Age and experience clearly help, but mistakes still get made and none of us are exempt. Most people’s mistakes at work don’t have life-safety consequences, and their mistakes are not typically picked up widely by the world news services, as was the case in the recent grounding of the Costa Concordia cruise ship. But we all make mistakes.
I often study engineering disasters and accidents in the belief that understanding mistakes, failures, and accidents deeply is a much lower cost way of learning. My last note on this topic was What Went Wrong at Fukushima Dai-1, where we looked at the nuclear release following the 2011 Tohoku Earthquake and Tsunami.
Living on a boat and cruising extensively (our boat blog is at http://blog.mvdirona.com/) makes me particularly interested in the Costa Concordia incident of January 13th, 2012. The Concordia is a 114,137 gross ton floating city that cost $570m when it was delivered in 2006. It is 952’ long, has 17 decks, and is powered by six Wärtsilä diesel engines with a combined output of 101,400 horsepower. The ship is capable of 23 kts (26.5 mph) and has a service speed of 21 kts. At capacity, it carries 3,780 passengers with a crew of 1,100.
The Italian cruise ship Costa Concordia partially sank on Friday the 13th of January 2012 after hitting a reef off the Italian coast and running aground at Isola del Giglio, Tuscany, requiring the evacuation of 4,197 people on board. At least 16 people died, including 15 passengers and one crewman; 64 others were injured (three seriously) and 17 are missing. Two passengers and a crewmember trapped below deck were rescued.
The captain, Francesco Schettino, had deviated from the ship's computer-programmed route in order to treat people on Giglio Island to the spectacle of a close sail-past. He was later arrested on preliminary charges of multiple manslaughter, failure to assist passengers in need and abandonment of ship. First Officer Ciro Ambrosio was also arrested.
It is far too early to know exactly what happened on the Costa Concordia and, because there was loss of life and considerable property damage, the legal proceedings will almost certainly run for years. Unfortunately, rather than illuminating the mistakes and failures and helping us avoid them in the future, these proceedings typically focus on culpability and distributing blame. That’s not our interest here. I’m mostly focused on what happened and getting all the data I could find on the table to see what lessons the situation yields.
A fellow boater, Milt Baker, pointed me towards an excellent video that offers considerable insight into exactly what happened in the final hour and a half. You can find the video at: Grounding of Costa Concordia. Another interesting data source is the video commentary available at: John Konrad Narrates the Final Maneuvers of the Costa Concordia. In what follows, I’ve combined snapshots of the first video with data available from other sources including the second video.
The source data for the two videos above comes from a wonderful safety system called the Automatic Identification System. AIS is required on larger commercial craft and is used on many recreational boats as well. AIS works by frequently transmitting (up to every 2 seconds for fast-moving ships) via VHF radio the ship’s GPS position, course, speed, name, and other pertinent navigational data. Receiving stations on other ships automatically plot transmitting AIS targets on electronic charts. Some receiving systems are also able to plot an expected target course and compute the time and location of the estimated closest point of approach. AIS is an excellent tool to help reduce the frequency of ship-to-ship collisions.
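The closest-point-of-approach calculation those receivers perform is straightforward vector geometry. Here is a minimal sketch, assuming positions have already been projected onto a flat east/north plane in nautical miles with velocities in knots (the function name and interface are mine, not from any AIS implementation):

```python
import math


def closest_point_of_approach(p1, v1, p2, v2):
    """Given each vessel's position (nm east, nm north) and velocity
    (kts east, kts north), return (hours_to_cpa, cpa_distance_nm).

    Works in the frame of vessel 1: vessel 2's relative motion is a
    straight line, and the CPA is the point on that line nearest us."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]   # relative position
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]   # relative velocity
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0:                        # same course and speed
        return 0.0, math.hypot(rx, ry)       # separation never changes
    t = max(0.0, -(rx * vx + ry * vy) / speed_sq)
    return t, math.hypot(rx + vx * t, ry + vy * t)
```

For example, two ships 10 nm apart closing head-on at 10 kts each have a CPA of 0 nm in half an hour – exactly the kind of warning that lets a watchkeeper alter course early.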
Since AIS data is broadcast over VHF radio, it is widely available to both ships and land stations and this data can be used in many ways. For example, if you are interested in the boats in Seattle’s Elliott Bay, have a look at MarineTraffic.com and enter “Seattle” as the port in the data entry box near the top left corner of the screen (you might see our boat Dirona there as well).
AIS data is often archived and, because of that, we have a very precise record of the Costa Concordia’s course as well as core navigational data as it proceeded towards the rocks. In the pictures that follow, the red images of the ship are at the ship’s position as transmitted by the Costa Concordia’s AIS system. The black line between these images is the interpolated course between these known locations. The video itself (Costa Concordia Interpolated.wmv) uses a roughly 5:1 time compression.
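The interpolation used to draw that black line between known fixes is simple linear interpolation between timestamped positions. A minimal sketch, assuming fix intervals short enough that great-circle effects are negligible (the function name and fix layout are mine, illustrative only):

```python
def interpolate_fix(t, fix_a, fix_b):
    """Linearly interpolate a position between two timestamped AIS
    fixes, each given as (time_seconds, lat_degrees, lon_degrees)."""
    ta, lat_a, lon_a = fix_a
    tb, lat_b, lon_b = fix_b
    f = (t - ta) / (tb - ta)          # fraction of the way from a to b
    return (lat_a + f * (lat_b - lat_a),
            lon_a + f * (lon_b - lon_a))
```

With fixes arriving every few seconds, this straight-line assumption is very close to the truth even for a ship in a hard turn, which is why the reconstructed track can be trusted.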
In this screen shot, you can see the Concordia already very close to the Italian island Isola del Giglio. From the BBC report, the Captain has said he turned too late (Costa Concordia: Captain Schettino ‘Turned Too Late’). From that article:
According to the leaked transcript quoted by Italian media, Capt Schettino said the route of the Costa Concordia on the first day of its Mediterranean cruise had been decided as it left the port of Civitavecchia, near Rome, on Friday.
The captain reportedly told the investigating judge in the city of Grosseto that he had decided to sail close to Giglio to salute a former captain who had a home on the Tuscan island. "I was navigating by sight because I knew the depths well and I had done this maneuver three or four times," he reportedly said.
"But this time I ordered the turn too late and I ended up in water that was too shallow. I don't know why it happened."
In this screen shot of the boat at 20:44:47, just prior to the grounding, you can see the ship has turned to a heading of 348.8 degrees, but the massive 114,137 gross ton vessel is essentially plowing sideways through the water on a course of 332.7 degrees. The Captain can and has turned the ship with the rudder but, at 15.6 kts, it does not follow the exact course steered, with inertia tending to widen and straighten the intended turn.
Given the speed of the boat and nearness of shore at this point, the die is cast and the ship is going to hit ground.
This screen shot is from just past the point of impact. You will note that the ship has slowed to 14.0 kts. You might also notice the Captain is turning aggressively to starboard. He has the ship turned to an 8.9 degree heading whereas the actual ship’s course lags behind at 356.2 degrees.
This screen shot is only 44 seconds after the previous one, but the boat has already slowed from 14.0 kts to 8.1 and is still slowing quickly. Some of the slowing will have come from the grounding itself, but passengers report hearing the engines go hard astern after the grounding.
You can also see the Captain has swung the helm from starboard, where he was steering to avoid the rocks, over to port now that he has struck them. This is almost certainly in an effort to minimize damage. What makes this (possibly counter-intuitive) decision a good one is that the ship’s pivot point is approximately 1/3 of the way back from the bow, so turning to port (towards the shore) will actually cause the stern to rotate away from the rocks they just struck.
The ship decelerated quickly to just under 6.0 kts but, in the two minutes prior to this screen shot, it has only slowed a further 0.9 kts down to 5.1. There were reports of a loss of power on the Concordia. Likely what happened is the ship was running hard astern taking off speed until a couple of minutes prior to this screen shot, when water intrusion caused a power failure. The ship is a diesel-electric and likely lost power to its main prop due to rapid water ingress.
At 5 kts, and very likely without main engine power, the Concordia is still going much too quickly to risk running onto the mud and sand shore, so the Captain now turns hard away from shore, heading back out into the open channel.
With the helm hard over to starboard and the likely assistance of the bow thrusters, the ship is turning hard, which is pulling speed off fairly quickly. It is now down to 3.0 kts and continues to slow.
The Concordia is now down to 1.6 kts and the Captain is clearly using the bow thrusters heavily as the bow continues to rotate quickly. He has now turned to a 41 degree heading.
It has now been just over 29 minutes since the ship first struck the rocks. It has essentially stopped, and the bow is being brought all the way back around using bow thrusters in an effort to drive the ship back in towards shore, presumably because the Captain believes it is at risk of sinking and is seeking shallow water.
The Captain continues to force the Concordia to shore under bow thruster power. In this video narrative (John Konrad Narrates the Final Maneuvers of the Costa Concordia), the commentator reported that the combination of bow thrusters and the prevailing currents were being used by the Captain to drive the boat into shore.
A further 11 minutes and 22 seconds have passed and the ship has now accelerated back up to 0.9 kts, heading towards shore.
It has been more than an hour and 11 minutes since the original contact with the rocks and the Costa Concordia is now at rest in its final grounding point.
The Coast Guard transcript of the radio communications with the Captain is at Costa Concordia Transcript: Coastguard Orders Captain to Return to Stricken Ship. In the following text, De Falco is the Coast Guard Commander and Schettino is the Captain of the Costa Concordia:
De Falco: "This is De Falco speaking from Livorno. Am I speaking with the commander?"
Schettino: "Yes. Good evening, Cmdr De Falco."
De Falco: "Please tell me your name."
Schettino: "I'm Cmdr Schettino, commander."
De Falco: "Schettino? Listen Schettino. There are people trapped on board. Now you go with your boat under the prow on the starboard side. There is a pilot ladder. You will climb that ladder and go on board. You go on board and then you will tell me how many people there are. Is that clear? I'm recording this conversation, Cmdr Schettino …"
Schettino: "Commander, let me tell you one thing …"
De Falco: "Speak up! Put your hand in front of the microphone and speak more loudly, is that clear?"
Schettino: "In this moment, the boat is tipping …"
De Falco: "I understand that, listen, there are people that are coming down the pilot ladder of the prow. You go up that pilot ladder, get on that ship and tell me how many people are still on board. And what they need. Is that clear? You need to tell me if there are children, women or people in need of assistance. And tell me the exact number of each of these categories. Is that clear? Listen Schettino, that you saved yourself from the sea, but I am going to … really do something bad to you … I am going to make you pay for this. Go on board, (expletive)!"
Schettino: "Commander, please …"
De Falco: "No, please. You now get up and go on board. They are telling me that on board there are still …"
Schettino: "I am here with the rescue boats, I am here, I am not going anywhere, I am here …"
De Falco: "What are you doing, commander?"
Schettino: "I am here to co-ordinate the rescue …"
De Falco: "What are you co-ordinating there? Go on board! Co-ordinate the rescue from aboard the ship. Are you refusing?"
Schettino: "No, I am not refusing."
De Falco: "Are you refusing to go aboard, commander? Can you tell me the reason why you are not going?"
Schettino: "I am not going because the other lifeboat is stopped."
De Falco: "You go aboard. It is an order. Don't make any more excuses. You have declared 'abandon ship'. Now I am in charge. You go on board! Is that clear? Do you hear me? Go, and call me when you are aboard. My air rescue crew is there."
Schettino: "Where are your rescuers?"
De Falco: "My air rescue is on the prow. Go. There are already bodies, Schettino."
Schettino: "How many bodies are there?"
De Falco: "I don't know. I have heard of one. You are the one who has to tell me how many there are. Christ!"
Schettino: "But do you realize it is dark and here we can't see anything …"
De Falco: "And so what? You want to go home, Schettino? It is dark and you want to go home? Get on that prow of the boat using the pilot ladder and tell me what can be done, how many people there are and what their needs are. Now!"
Schettino: "… I am with my second in command."
De Falco: "So both of you go up then … You and your second go on board now. Is that clear?"
Schettino: "Commander, I want to go on board, but it is simply that the other boat here … there are other rescuers. It has stopped and is waiting …"
De Falco: "It has been an hour that you have been telling me the same thing. Now, go on board. Go on board! And then tell me immediately how many people there are there."
Schettino: "OK, commander."
De Falco: "Go, immediately!"
At least 16 died in the accident and 17 were still missing when this was written (Costa Concordia Disaster). The Captain of the Costa Concordia, Francesco Schettino, has been charged with manslaughter and abandoning ship.
At the time of the grounding, the ship was carrying 2,200 metric tons of heavy fuel oil and 185 metric tons of diesel, and the environmental risk remains (Costa Concordia Salvage Experts Ready to Begin Pumping Fuel from Capsized Cruise Ship Off Coast of Italy). The 170-year-old salvage firm Smit Salvage will be leading the operation.
All situations are complex and few disasters have only a single cause. However, the facts as presented to this point pretty strongly towards pilot error as the primary contributor in this event. The Captain is clearly very experienced and his ship handling after the original grounding appears excellent. But it’s hard to explain why the ship was that close to the rocks, the Captain has reported that he turned too late, and public reports have him on the phone at or near the time of the original grounding.
What I take away from the data points presented here is that experience, ironically, can be our biggest enemy. As we get increasingly proficient at a task, we often stop paying as much attention. And, with less dedicated focus on a task, over time, we run the risk of a crucial mistake that we probably wouldn’t have made when we were effectively less experienced and perhaps less skilled. There is danger in becoming comfortable.
The videos referenced in the above can be found at:
· Grounding of Costa Concordia Interpolated
· gCaptain’s John Konrad Narrates the Final Maneuvers of the Costa Concordia
If you are interested in reading more:
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com