In the data center world, there are few events taken more seriously than power failure and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But, a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America eclipsing the final episode of Mash. World-wide, the Super Bowl is only behind the European Cup (UEFA Champions Leaque) which draws 178 million viewers.
When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.
At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure let to a 34 min delay to restore full lighting the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But, NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.
What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supply power to the Superdome reported their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.
The joint report from SMG the company that manages the Superdome and Entergy, the utility power provider, said:
A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.
Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.
Essentially, the utility circuit breaker detected an “anomaly” and opened the breaker. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is they are incredibly flexible and can be configured uniquely for each installation. The disadvantage of software systems is the wide variety of configurations they can support can be complex and the default configurations are used perhaps more often than they should. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.
Because the cause has not yet been reported and, quite often, the underlying root cause is never found. But, it’s worth asking, is it possible to avoid long game outages and what would it cost? As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.
Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through $336 million renovation, the switch gear may have been relatively new and, even if it wasn’t, it likely was almost certainly recently maintained and inspected.
Where issues often arise are in configuration. Modern switch gear have an amazingly large number of parameters many of which interact with each other and, in total, can be difficult to fully understand. And, given the switch gear manufactures know little about the intended end-use application of each switchgear sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than a breaker that fails to open. Consequently conservative settings are common.
Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.
The first, testing in full production environment in non-mission critical setting is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skill electricians and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).
Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.
The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds each protected by its own generator. Generators in the 2.5 to 3.0 MW range are substantial V16 diesel engines the size of a mid-sized bus. And they are expensive running just under $1M each but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year around with a permanent installation. We would need 2 generators, the switchgear to connect it to the load and uninterruptable power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy third generator just in case there is a problem and one of the two generators don’t start. The generators are under $1m each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10m. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4m.
For the price of just over 60 seconds of commercials the facility could protected against fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities chose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during a some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage so I’m not a big fan of this “safety” configuration but it is a common choice.
Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.
Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense are small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checker boarded over the facility lights, (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.
In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minute checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recover decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20? The gas discharge lighting still favored at large sporting venues, take roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game, the relative low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant design and employing more aggressive pre-game testing.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
The last few weeks have been busy and it has been way too long since I have blogged. I’m currently thinking through the server tax and what’s wrong with the current server hardware ecosystem but don’t have anything yet ready to go on that just yet. But, there are a few other things on the go. I did a talk at Intel a couple of weeks back and last week at the First Round Capital CTO summit. I’ve summarized what I covered below with pointers to slides.
In addition, I’ll be at the Amazon in Palo Alto event this evening and will do a talk there as well. If you are interested in Amazon in general or in AWS specifically, we have a new office open in Palo Alto and you are welcome to come down this evening to learn more about AWS, have a beer or the refreshment of your choice, and talk about scalable systems. Feel free to attend if you are interested:
Amazon in Palo Alto
October 11, 2012 at 5:00 PM - 9:00 PM
Pampas 529 Alma Street Palo Alto, CA 94301
First Round Capital CTO Summit:
I started this session by arguing that cost or value models are the right way to ensure you are working on the right problem. I come across far too many engineers and even companies that are working on interesting problems but they fail at the “top 10 problem” test. You never want to first have to explain the problem to a perspective customer before you get a chance to explain your solution. It is way more rewarding to be working on top 10 problems where the value of what you are doing is obvious and you only need to convince someone that your solution actually works.
Cost models are a good way to force yourself to really understand all aspects of what the customer is doing and know precisely what savings or advantage you bring. A 25% improvement on an 80% problem is way better than 50% solution to a 5% problem. Cost or value models are a great way of keeping yourself honest on what the real savings or improvement of your approach actually are. And its quantifiable data that you can verify in early tests and prove in alpha or beta deployments.
I then covered three areas of infrastructure where I see considerable innovation and showed all the cost model helped drive me there:
· Networking: The networking eco-system is still operating on the closed, vertically integrated, mainframe model but the ingredients are now in place to change this. See Networking, the Last Bastion of Mainframe Computing for more detail. The industry is currently going through great change. Big change is a hard transition for the established high-margin industry players but it’s a huge opportunity for startups.
· Storage: The storage (and database) worlds are going through a unprecedented change where all high-performance random access storage is migrating from hard disk drives to flash storage. The early flash storage players have focused on performance over price so there is still considerable room for innovation. Another change happening in the industry is the explosion of cold storage (low I/O density storage that I jokingly refer to as write-only) due to falling prices, increasing compliance requirements, and an industry realization that data has great value. This explosion in cold storage is opening much innovation and many startup opportunities. The AWS entrant in this market is Glacier where you can store seldom accessed data at one penny per GB per month (for more on Glacier: Glacier: Engineering for Cold Storage in the Cloud.
· Cloud Computing: I used to argue that targeting cloud computing was a terrible idea for startups since the biggest cloud operators like Google and Amazon tend to do all custom hardware and software and purchase very little commercially. I may have been correct initially but, with the cloud market growing so incredibly fast, every teleco is entering the market, each colo provider is entering, most hardware providers are entering, … the number of players is going from 10s to 1000s. And, at 1,000s, it’s a great market for a startup to target. Most of these companies are not going to build custom networking, server, and storage hardware but they do have the need to innovate with the rest of the industry.
Slides: First Round Capital CTO Summit
Intel Distinguished Speaker Series:
In this talk I started with how fast the cloud computing market segment is growing using examples form AWS. I then talked about why cloud computing is such an incredible customer value proposition. This isn’t just a short term fad that will pass over time. I mostly focused on how that statement I occasionally hear just can’t be possibly be correct: “I can run my on-premise computing infrastructure less expensively then hosting it in the cloud”. I walk through some of the reasons why this statement can only be made with partial knowledge. There are reasons why some computing will be in the cloud and some will be hosted locally and industry transitions absolutely do take time but cost isn’t one of the reasons that some workloads aren’t in the cloud.
I think walked through 5 areas of infrastructure innovation and some of what is happening in each area:
· Power Distribution
· Mechanical Systems
· Data Center Building Design
Slides: Intel Distinguished Speaker Series
I hope to see you tonight at the Amazon Palo Alto event at Pampas (http://goo.gl/maps/dBZxb). The event starts at 5pm and I’ll do a short talk at 6:35.
Facebook recently released a detailed report on their energy consumption and carbon footprint: Facebook’s Carbon and Energy Impact. Facebook has always been super open with the details behind there infrastructure. For example, they invited me to tour the Prineville datacenter just prior to its opening:
· Open Compute Project
· Open Compute Mechanical System Design
· Open Compute Server Design
· Open Compute UPS & Power Supply
Reading through the Facebook Carbon and Energy Impact page, we see they consumed 532 million kWh of energy in 2011 of which 509m kWh went to their datacenters. High scale data centers have fairly small daily variation in power consumption as server load goes up and down and there are some variations in power consumption due to external temperature conditions since hot days require more cooling than chilly days. But, highly efficient datacenters tend to be effected less by weather spending only a tiny fraction of their total power on cooling. Assuming a flat consumption model, Facebook is averaging, over the course of the year, 58.07MW of total power delivered to its data centers.
Facebook reports an unbelievably good 1.07 Power Usage Effectiveness (PUE) which means that for every 1 Watt delivered to their servers they lose only 0.07W in power distribution and mechanical systems. I always take publicly released PUE numbers with a grain of salt in that there has been a bit of a PUE race going on between some of the large operators. It’s just about assured that there are different interpretations and different measurement techniques being employed in computing these numbers so comparing them probably doesn’t tell us much. See PUE is Still Broken but I Still use it and PUE and Total Power Usage Efficiency for more on PUE and some of the issues in using it comparatively.
Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook but they have excellent server designs (see Open Compute Server Design) so they likely average at or below as 300W per server. Since 300W is an estimate, let’s also look at 250W and 400W per server:
· 250W/server: 217,080 servers
· 300W/server: 180,900 servers
· 350W/server: 155,057 servers
As a comparative data point, Google’s data centers consume 260MW in aggregate (Google Details, and Defends, It’s use of Electricity). Google reports their PUE is 1.14 so we know they are delivering 228MW to their IT infrastructure (servers and storage). Google is perhaps the most focused in the industry on low power consuming servers. They invest deeply in custom designs and are willing to spend considerably more to reduce energy consumption. Estimating their average server power draw at 250W and looking at the +/-25W about that average consumption rate:
· 225W/server: 1,155,555 servers
· 250W/server: 1,040,000 servers
· 275W/server: 945,454 servers
I find the Google and Facebook server counts interesting for two reasons. First, Google was estimated to have 1 million servers more than 5 years ago. The number may have been high at the time but it’s very clear that they have been super focused on work load efficiency and infrastructure utilization. To grow the search and advertising as much as they have without growing the server count at anywhere close to the same rate (if at all) is impressive. Continuing to add computationally expensive search features and new products and yet still being able to hold the server count near flat is even more impressive.
The second notable observation from this data is that the Facebook server count is growing fast. Back in October of 2009, they had 30,000 servers. In June of 2010 the count climbed to 60,000 servers. Today they are over 150k.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
The NASCAR Sprint Cup Stock Car Series kicks its season off with a bang and, unlike other sports, starts the season off with the biggest event of the year rather than closing with it. Daytona Speed Weeks is a multi-week, many race event the finale of which is the Daytona 500. The 500 starts with a huge field of 43 cars and is perhaps famous for some of the massive multi-car wrecks. The 17 car pile-up of 2011, made a 43 card field look like the appropriate amount of redundancy just to get a car over the finish line at the end.
Watching 43 stock cars race for the green flag at the start of the race is an impressive show of power as 146,000 lbs of metal charge towards the start line at nearly 200 miles per hour running so close that they appear to be connected. From the stands, the noise is deafening, the wall of air they are pushing can be felt 20 rows up and the air is hot from all the waste heat spilling off the field as they scream to the line.
Imagine harnessing all the power of all the engines from the 43 cars heading towards the start line at Daytona in a single engine? In fact, let’s make it harder, imagine having all the power of all the cars that take the green flag at both Daytona Sprnt Cup races each year. That would be a single engine capable of putting out 64,500 hp. Actually, for safety reasons, NASCAR restricts engine output at the Daytona and Talladega superspeedways to approximately 430 hp but let’s stick with the 750 hp they can produce when unrestricted. If we harnessed that power into a single engine, we would have an unbelievable 64,500 HP. Last week Jennifer and I were invited to tour the Hanjin Oslo container ship which happens to be single engine powered. Believe it or not, that single engine is more powerful that the aggregate horsepower of both Daytona starting fields. It has a single 74,700 hp engine.
Last week Peter Kim who supervises the Hanjin shipping port at Terminal 46 invited us to tour the port facility and the Hanjin Oslo container ship. I love technology, scale, and learning how well run operations work so I jumped on the opportunity.
Shortly after arriving, we watched the Oslo being brought into terminal 46. The captain and pilot were both looking down from the bridge wing towering more than 100’ above us giving commands to the tugs as the Oslo is being eased into the dock. Even before the ship was tied off, the port was rapidly coming to life. Dock workers were scrambling to their stations, trucks were starting, container cranes were moving into position, Customs and Border Patrol was getting ready to board, and line handlers were preparing to tie the ship off. There were workers and heavy equipment moving into position throughout the terminal. And, over the next 12 hours, more than a thousand containers would be moved before the ship would be off to its next destination at 6:30am the following morning.
The Oslo is not the newest ship in the Hanjin fleet having been built in 1998. It’s not the biggest ship nor is it the most powerful. But it’s a great example of a well-run, super clean, and expertly maintained container ship. And, starting with the size, here’s the view from the bridge.
The ship truly is huge. What I find even more amazing is that, as large as the Oslo is, there are container ships out there with up to twice the cargo carrying capacity and as much as 45% more horse power. In fact, the world’s most powerful diesel engine is deployed in a container ship. It’s a 14 cylinder, 3 floor high monster that produces 109,000 hp designed by the Finnish company Wartsila.
The Hanjin Oslo uses a (slightly) smaller inline 10 cylinder version of the same engine design. The key difference between it and the world’s largest diesel shown above is that the engine in the Oslo is 4 cylinders shorter at 10 cylinders inline rather than 14 and it produces proportionally less power. On the Oslo, the engine spans 3 decks so you can only see 1/3 of it at any one time. Here’s the view from the Hanjin Oslo engine room top deck, mid deck, and lower deck:
The engine is clearly notable for its size and power output. But, what I find most surprising is it’s a two stroke engine. Two stroke engines produce power at the beginning of the power stroke where the piston is heading down, dump the exhaust towards the end of that stroke, then bring in fresh air at the beginning of the next stroke as the piston begins heading back up, and then compresses the air for the remainder of that stroke. Towards the end of the compression stroke, fuel is injected into the cylinder where it combusts rapidly building pressure and pushing the piston back down on the power stroke. Four stroke engines separate these functions into four strokes: 1) power going down, 2) exhaust going up, 3) intake going down, and then 4) compression going up.
Two-stroke engines are common in lawn mowers, chainsaws, and some very small outboards because of their high power to weight ratio and simplicity of design that makes very low cost engines possible. Larger diesel engines used in trucks and automobiles are almost exclusively 4 stroke engines. Ironically, the very highest output diesel engines found in large marine applications are also two strokes.
From Spending an evening with the Hanjin team, I was super impressed. I love the technology, the scale was immense, everything was very well maintained, and they are clearly excellent operators. If I was moving goods between continents, I would look first to Hanjin.
Most of the time I write about the challenges posed by
scaling infrastructure. Today, though, I wanted mention some upcoming
events that have to do with a different sort of scale.
In Amazon Web Services we are tackling lots of really hairy
challenges as we build out one the world’s largest cloud computing platforms.
From data center design, to network architecture, to data persistence, to
high-performance computing and beyond we have a virtually limitless set
of problems needing to be solved. Over the coming years AWS will be
blazing new trails in virtually every aspect of computing and infrastructure.
In order to tackle these opportunities we are searching for
innovative technologists to join the AWS team. In other words we need to
scale our engineering staff. AWS has hundreds of open positions
throughout the organization. Every single AWS team is hiring including
EC2, S3, EBS, EMR, CloudFront, RDS, DynamoDB and even the AWS-powered Amazon Silk
On May 17th and 18th we will be holding recruiting events in
three cities: Houston, Minneapolis, and Nashville. If you live near any
of those cities and are passionate about defining and building the future of
computing you will find more information at the following URL http://aws.amazon.com/careers/local-events/
You can also send your resume to firstname.lastname@example.org
and we will follow up with you.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.
On one high scale, on-premise server product I worked upon, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps it was introduced by hardware, perhaps firmware, or possibly software. It could have even been corruption introduced by one of our previous release when those pages where last written. Some of these pages may not have been written for years.
I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating systems has millions of lines of code. And our application had millions of lines of code. Any of us can screw-up, each has an opportunity to corrupt, and its highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually currently running.
Another example. In this case, a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting. Over the course of several months, the result was somewhere between amazing and frightening. ECC is firing constantly.
The immediate lesson is you absolutely do need ECC in server application and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week and learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt both of which roughly translate to Phobos Ground) was a space mission designed to, amongst other objectives, return soil samples from the Martian moon Phobos. This mission was launched atop the Zenit-2SB launch vehicle taking off from the Baikonur Cosmodrome 2:16am on November 9th 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay has subsequently sent the satellite plunging to earth in a fiery end of what was a very expensive mission.
What went wrong aboard Phobos-Grunt? February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft "Phobos-Grunt". Of course, this document is released in Russian but Google Translate actually does a very good job with it. And, IEEE Spectrum Magazine reported on the failing as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary and the translated Russian article offers more detail if you are interested in digging deeper.
The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory failure. The computer was part of the newly-developed flight control system that had focused on dropping the mass of the flight control systems from 30 kgs (66 lbs) to 1.5 kgs (3.3 lbs). Less weight in flight control is more weight that can be in payload, so these gains are important. However, this new flight control system was blamed for the delay of the mission by 2 years and the eventual demise of the mission.
The two flight control computers are both identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau
Phobos Grunt Design). The official postmortem reports that both computers
suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMS are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512k memory by 32 bit wide access, “V” is the improvement mark, “20” for 20ns memory access time, “G24” is the package type, and “M” indicates it is a military grade part.
In the paper "
Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices"
Joe Benedetto reports that these SRAM packages are very susceptible to “latchup”, a condition which requires power recycling to return to operation and can be
permanent in some cases. Steven McClure of NASA Jet Propulsion Laboratory is the leader of the Radiation Effects Group.
He reports these SRAM parts would be very unlikely to be approved for use at JPL
(Did Bad Memory Chips Down Russia’s Mars Probe).
It is rare that even two failures will lead to disaster and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system is able to stabilize itself in an extreme condition to allow flight control personal back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation is likely fairly simple in just restarting both computers (which probably happened automatically) and restarting the mission would likely have been sufficient.
Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.
I find this last fault fascinating. Smart people could never make such an obviously incorrect mistake, and yet this sort of design flaw shows up all the time on large systems. Experts in each vertical area or component do good work. But the interaction across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may not get seen. Good people design good components and yet there often exist obvious fault modes across components that get missed.
Systems sufficiently complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) and very experienced, highly-skilled engineer(s) on the team focusing on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.
The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.
Summarizing some of the lessons from this loss: The SRAM chip probably was a poor choice. The computer systems should restart, scrub memory for faults, and be able to detect and load corrupt code from secondary locations before going into safe-mode. Safe-mode has to actually allow mitigating actions to be taken from a ground station or it is useless. Software systems should be constantly scrubbing memory for faults and check-summing the running software for corruption. A tiny amount of processor power spent on continuous, redundant checking and a few more lines of code to implement simple recovery paths when fault is encountered may have saved the mission. Finally we have to all remember the old adage “nothing works if it is not tested.” Every major fault has to be tested. Error paths are the common ones to not be tested so it is particularly important to focus on them. The general rule is to keep error paths simple, use the fewest possible, and test frequently.
Back in 2007, I wrote up a set of best practices on software design, testing, and operations of high scale systems:
On Designing and Deploying Internet-Scale Services. This paper targets large-scale services but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follow from that complexity.
This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
To dig deeper in the Phobos-Grunt loss:
b: http://blog.mvdirona.com /
Don't be a show-off. Never be too proud to turn back. There are old pilots and bold pilots, but no old, bold pilots.
I first heard the latter part of this famous quote made by US Airmail Pilot E. Hamilton Lee back when I raced cars. At that time, one of the better drivers in town, Gordon Monroe, used a variant of that quote (with pilots replaced by racers) when giving me driving advice. Gord’s basic message was that it is impossible to win a race if you crash out of it.
Nearly all of us have taken the odd chance and made some decisions that, in retrospect, just didn’t make sense from a risk vs reward perspective. Age and experience clearly helps but mistakes still get made and none of us are exempt. Most people’s mistakes at work don’t have life safety consequences and their mistakes are not typically picked up widely by the world news services as was the case in the recent grounding of the Costa Concordia cruise ship. But, we all make mistakes.
I often study engineering disasters and accidents in the belief that understanding mistakes, failures, and accidents deeply is a much lower cost way of learning. My last note on this topic was What Went Wrong at Fukushima Dai-1 where we looked at the nuclear release following the 2011 Tohuku Earthquake and Tsunami.
Living on a boat and cruising extensively (our boat blog is at http://blog.mvdirona.com/) makes me particularly interested in the Costa Concordia incident of January 13th 2012. The Concordia is a 114,137 gross ton floating city that cost $570m when it was delivered in 2006. It is 952’ long, has 17 decks, and is power by 6 Wartsila diesel engines with a combined output of 101,400 horse power. The ship is capable of 23 kts (26.5 mph) and has a service speed of 21 kts. At capacity, it carries 3,780 passengers with a crew of 1,100.
The Italian cruise ship Costa Concordia partially sank on Friday the 13th of January 2012 after hitting a reef off the Italian coast and running aground at Isola del Giglio, Tuscany, requiring the evacuation of 4,197 people on board. At least 16 people died, including 15 passengers and one crewman; 64 others were injured (three seriously) and 17 are missing. Two passengers and a crewmember trapped below deck were rescued.
The captain, Francesco Schettino, had deviated from the ship's computer-programmed route in order to treat people on Giglio Island to the spectacle of a close sail-past. He was later arrested on preliminary charges of multiple manslaughter, failure to assist passengers in need and abandonment of ship. First Officer Ciro Ambrosio was also arrested.
It is far too early to know exactly what happened on the Costa Concordia and, because there was loss of life and considerable property damage, the legal proceedings will almost certainly run for years. Unfortunately, rather than illuminating the mistakes and failures and helping us avoid them in the future, these proceedings typically focus on culpability and distributing blame. That’s not our interest here. I’m mostly focused on what happened and getting all the data I could find on the table to see what lessons the situation yields.
A fellow boater, Milt Baker pointed me towards an excellent video that offers considerable data into exactly what happened in the final 1 hour and 30 min. You can find the video at: Grounding of Costa Concordia. Another interesting data source is the video commentary available at: John Konrad Narrates the Final Maneuvers of the Costa Concordia. In what follows, I’ve combined snapshots of the first video intermixed with data available from other sources including the second video.
The source data for the two videos above is a wonderful safety system called Automatic Identification System. AIS is a safety system required on larger commercial craft and also used on many recreational boats as well. AIS works by frequently transmitting (up to every 2 seconds for fast moving ships) via VHF radio the ships GPS position, course, speed, name, and other pertinent navigational data. Receiving stations on other ships automatically plot transmitting AIS targets on electronic charts. Some receiving systems are also able to plot an expected target course and compute the time and location of the estimated closest point of approach. AIS an excellent tool to help reduce the frequency of ship-to-ship collisions.
Since AIS data is broadcast over VHF radio, it is widely available to both ships and land stations and this data can be used in many ways. For example, if you are interested in the boats in Seattle’s Elliott Bay, have a look at MarineTraffic.com and enter “Seattle” as the port in the data entry box near the top left corner of the screen (you might see our boat Dirona there as well).
AIS data is often archived and, because of that, we have a very precise record of the Costa Concordia’s course as well as core navigational data as it proceeded towards the rocks. In the pictures that follow, the red images of the ship are at the ship’s position as transmitted by the Costa Concordia’s AIS system. The black line between these images is the interpolated course between these known locations. The video itself (Costa Concordia Interpolated.wmv) uses a roughly 5:1 time compression.
In this screen shot, you can see the Concordia already very close to the Italian Isol del Giglio. From the BBC report the Captain has said he turned too late (Costa Concordia: Captain Schettino ‘Turned Too Late’). From that article:
According to the leaked transcript quoted by Italian media, Capt Schettino said the route of the Costa Concordia on the first day of its Mediterranean cruise had been decided as it left the port of Civitavecchia, near Rome, on Friday.
The captain reportedly told the investigating judge in the city of Grosseto that he had decided to sail close to Giglio to salute a former captain who had a home on the Tuscan island. "I was navigating by sight because I knew the depths well and I had done this maneuver three or four times," he reportedly said.
"But this time I ordered the turn too late and I ended up in water that was too shallow. I don't know why it happened."
In this screen shot of the boat at 20:44:47 just prior to the grounding, you can see the boat turned to 348.8 degrees but the massive 114,137 gross ton vessel is essentially plowing sideways through the water on a course of 332.7 degrees. The Captain can and has turned the ship with the rudder but, at 15.6 kts, it does not follow the exact course steered with inertia tending to widen and straiten the intended turn.
Given the speed of the boat and nearness of shore at this point, the die is cast and the ship is going to hit ground.
This screen shot was taken is just past the point of impact. You will note that it has slowed to 14.0 kts. You might also notice the Captain is turning aggressively to the starboard. He has the ship turned to a 8.9 degrees heading whereas the actual ships course lags behind at 356.2 degrees.
This screen shot is only 44 seconds after the previous one but the boat has already slowed from 14.0 kts to 8.1 and is still slowing quickly. Some of the slowing will have come from the grounding itself but passengers report that they heard the boat hard astern after the grounding.
You can also see the captain has swung the helm over from the starboard course he was steering trying to avoid the rocks over to port course now that he has struck them. This is almost certainly in an effort to minimize damage. What makes this (possibly counter-intuitive) decision a good one is the ships pivot point is approximately 1/3 of the way back from the bow so turning to port (towards the shore) will actually cause the stern to rotate away from the rocks they just struck.
The ship decelerated quickly to just under 6.0 knots but, in the two minutes prior to this screen shot, it has only slowed a further 0.9 kts down to 5.1. There were reports of a loss of power on the Concordia. Likely what happened is ship was hard astern taking off speed until a couple of minutes prior to this screen shot when water intrusion caused a power failure. The ship is a diesel electric and likely lost power to its main prop due to rapid water ingress.
At 5 kts and very likely without main engine power, the Concordia is still going much too quickly to risk running into the mud and sand shore so the Captain now turns hard away from shore and he is heading back out into the open channel.
With the helm hard over the starboard with the likely assistance of the bow thrusters the ship is turning hard which is pulling speed off fairly quickly. It is now down to 3.0 kts and it continues to slow.
The Concordia is now down to 1.6 kts and the Captain is clearly using the bow thrusters heavily as the bow continues to rotate quickly. He has now turned to a 41 degree heading.
It now has been just over 29 min since the ship first struck the rocks. It has essentially stopped and the bow is being brought all the way back round using bow thrusters in an effort to drive the ship back in towards shore presumably because the Captain believes it is at risk of sinking so he is seeking shallow water.
The Captain continues to force the Concordia to shore under bow thruster power. In this video narrative (John Konrad Narrates the Final Maneuvers of the Costa Concordia), the commentator reported that the combination of bow thrusters and the prevailing currents where being used in combination by the Captain to drive the boat into shore.
A further 11 min and 22 seconds have past and the ship has now accelerated back up to 0.9 kts now heading towards shore.
It has been more than an hour and 11 minutes since the original contact with the rocks and the Costa Concordia is now at rest in its final grounding point.
The Coast Guard transcript of the radio communications with the Captain are at Costa Concordia Transcript: Coastguard Orders Captain to return to Stricken Ship. In the following text De Falco is the Coast Guard Commander and Schettino is the Captain of the Costa Concordia:
De Falco: "This is De Falco speaking from Livorno. Am I speaking with the commander?"
Schettino: "Yes. Good evening, Cmdr De Falco."
De Falco: "Please tell me your name."
Schettino: "I'm Cmdr Schettino, commander."
De Falco: "Schettino? Listen Schettino. There are people trapped on board. Now you go with your boat under the prow on the starboard side. There is a pilot ladder. You will climb that ladder and go on board. You go on board and then you will tell me how many people there are. Is that clear? I'm recording this conversation, Cmdr Schettino …"
Schettino: "Commander, let me tell you one thing …"
De Falco: "Speak up! Put your hand in front of the microphone and speak more loudly, is that clear?"
Schettino: "In this moment, the boat is tipping …"
De Falco: "I understand that, listen, there are people that are coming down the pilot ladder of the prow. You go up that pilot ladder, get on that ship and tell me how many people are still on board. And what they need. Is that clear? You need to tell me if there are children, women or people in need of assistance. And tell me the exact number of each of these categories. Is that clear? Listen Schettino, that you saved yourself from the sea, but I am going to … really do something bad to you … I am going to make you pay for this. Go on board, (expletive)!"
Schettino: "Commander, please …"
De Falco: "No, please. You now get up and go on board. They are telling me that on board there are still …"
Schettino: "I am here with the rescue boats, I am here, I am not going anywhere, I am here …"
De Falco: "What are you doing, commander?"
Schettino: "I am here to co-ordinate the rescue …"
De Falco: "What are you co-ordinating there? Go on board! Co-ordinate the rescue from aboard the ship. Are you refusing?"
Schettino: "No, I am not refusing."
De Falco: "Are you refusing to go aboard, commander? Can you tell me the reason why you are not going?"
Schettino: "I am not going because the other lifeboat is stopped."
De Falco: "You go aboard. It is an order. Don't make any more excuses. You have declared 'abandon ship'. Now I am in charge. You go on board! Is that clear? Do you hear me? Go, and call me when you are aboard. My air rescue crew is there."
Schettino: "Where are your rescuers?"
De Falco: "My air rescue is on the prow. Go. There are already bodies, Schettino."
Schettino: "How many bodies are there?"
De Falco: "I don't know. I have heard of one. You are the one who has to tell me how many there are. Christ!"
Schettino: "But do you realize it is dark and here we can't see anything …"
De Falco: "And so what? You want to go home, Schettino? It is dark and you want to go home? Get on that prow of the boat using the pilot ladder and tell me what can be done, how many people there are and what their needs are. Now!"
Schettino: "… I am with my second in command."
De Falco: "So both of you go up then … You and your second go on board now. Is that clear?"
Schettino: "Commander, I want to go on board, but it is simply that the other boat here … there are other rescuers. It has stopped and is waiting …"
De Falco: "It has been an hour that you have been telling me the same thing. Now, go on board. Go on board! And then tell me immediately how many people there are there."
Schettino: "OK, commander."
De Falco: "Go, immediately!"
At least 16 died in the accident and 17 were still missing when this was written (Costa Concordia Disaster).The Captain of the Costa Concordia, Francesco Schettino, has been charged with manslaughter and abandoning ship.
At the time of the grounding, the ship was carrying 2,200 metric tons of heavy fuel oil and 185 metric tons of diesel and remains environmental risk remains (Costa Concordia Salvage Experts Ready to Begin Pumping Fuel from Capsized Cruise Ship Off Coast of Italy). The 170 year old salvage firm Smit Salvage will be leading the operation.
All situations are complex and few disasters have only a single cause. However, the facts as presented to this point pretty strongly towards pilot error as the primary contributor in this event. The Captain is clearly very experienced and his ship handling after the original grounding appear excellent. But, it’s hard to explain why the ship was that close to the rocks, the captain has reported that he turned too late, and public reports have him on the phone at or near the time of the original grounding.
What I take away from the data points presented here is that experience, ironically, can be our biggest enemy. As we get increasingly proficient at a task, we often stop paying as much attention. And, with less dedicated focus on a task, over time, we run the risk of a crucial mistake that we probably wouldn’t have made when we were effectively less experienced and perhaps less skilled. There is danger in becoming comfortable.
The videos referenced in the above can be found at:
· Grounding of Costa Concordia Interpolated
· gCaptain’s John Konrad Narrates the Final Maneuvers of the Costa Concordia
If you are interested in reading more:
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Ordinarily I focus this blog on areas of computing where I spend most of my time from high performance computing to database internals and cloud computing. An area that interests me greatly but I’ve seldom written about is entrepreneurship and startups.
One of the Seattle areas startups with which I stay in touch is Socrata. They are focused on enabling federal, state, and local governments to improve the reach, usability and social utility of their public information assets. Essentially making public information available and useful to their constituents. They are used by: the World Bank, the United Nations, the World Economic Forum, the US Data.Gov, Health & Human Services, Centers for Disease Control, several most major cities including NYC, Seattle, Chicago, San Francisco and Austin and many county and state governments. Even foreign governments like the Country of Kenya have adopted Socrata.
I first met Kevin Merritt, the founder and CEO of Socrata, back in 2005 when I was doing technical diligence for the Microsoft acquisition of the LA-based Frontbridge Technologies. I love doing diligence on startups because it’s an opportunity to dive in and spend a day or more digging deeply and understanding what smart people have produced, where things worked really well, and areas where things didn’t pan out as well as they could have. I’ve learned a lot in these roles and I’m lucky to have been able to do many of them first at IBM, later at Microsoft, and now at Amazon.
What made this one a bit different is I got a call shortly after the deal closed asking if I wanted to be the General Manager of the Microsoft subsidiary that was formed in the acquisition. An opportunity to run mid-sized business in its entirety. Development, test, operations, and customer support. Absolutely! I’ve never learned so much as I did in the first year or so at what would become Microsoft Exchange Hosted Services.
It was a great experience and I’ve been 100% focused on cloud services since that time. And, as a consequence of leading Frontbridge, I got to know Kevin Merritt well. He is an excellent strategic thinker and an even better operator. Whenever Kevin was involved, customers were happy and the service was rapidly improving and expanding. Kevin eventually left to form Socrata and he and I have stayed in touch since then. He knows I’m a sucker for a beer and some wings :-).
Based in Seattle, Socrata is venture-backed with a small and talented engineering team. They are enjoying strong customer demand and their market success is fueling growth in the engineering team. They are currently looking for a CTO and, if I didn’t already have one of the best job out there, I would seriously considering joining Kevin and the team. If you are a technology leader interested in big data, cloud computing, architecture of distributed systems, ops automation, and the user experience of making data easy to find and use, you should send Kevin, their founder and CEO, a note at email@example.com.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Last week I got to participate in one of my favorite days each year, serving on the judging panel for the AWS Startup Challenge. The event is a fairly intense day where our first meeting starts at 6:45am and the event closes at 9pm that evening. But it is an easy day to love in that the entire day is spent with innovative startups who have built their companies on cloud computing.
I’m a huge believer in the way cloud computing is changing the computing landscape and that’s all I’ve worked on for many years now. But I have still not tired of hearing “Without AWS, we wouldn’t have even been able to think about launching this business.”
Cloud computing is allowing significant businesses to be conceived and delivered at scale with only tiny amounts of seed funding or completely bootstrapped. Many of the finalist we looked at last week’s event had taken less than $200k of seed funding and yet had already had thousands of users. That simply wouldn’t have been possible 10 years ago and I just love to see it.
The finalist for this year’s AWS Startup Challenge were:
Booshaka - United States (Sunnyvale, California)
Booshaka simplifies advocacy marketing for brands and businesses by making sense of large amounts of social data and providing an easy to use software-as-a-service solution. In an era where people are bombarded by media, advertisers face significant challenges in reaching and engaging their customers. Booshaka combines the social graph and big data technology to help advertisers turn customers into their best marketers.
Deputy.com - Australia (Sydney)
Deputy.com is an online business management solution specifically addressing the HR department. The powerful online and mobile platform engages all staff across an enterprise, builds positive culture and drives business growth.
Fantasy Shopper - UK (Exeter)
Fantasy Shopper is a social shopping game. The shopping platform centralizes, socializes and “gamifies” online shopping to provide a real-world experience.
Flixlab - United States (Palo Alto, California)
With Flixlab, people can instantly and automatically transform raw videos and photos from their smartphone or their friends’ smartphones, into fun, compelling stories with just a few taps and immediately share them online. After creation, viewers can then interact with these movies by remixing them and creating personally relevant movies from the shared pictures and videos.
Getaround - United States (San Francisco, California)
Getaround is a peer-to-peer car sharing marketplace that enables car owners to rent their cars to qualified drivers by the hour, day, or week. Getaround facilitates payment, provide 24/7 roadside assistance, and provide complete insurance backed by Berkshire Hathaway with each rental.
Intervention Insights - United States (Grand Rapids, Michigan)
Intervention Insights provides a medical information service that combines cutting edge bioinformatics tools with disease information to deliver molecular insights to oncologists describing an individual’s unique tumor at a genomic level. The company then provides a report with an evidenced-based list of therapies that target the unique molecular basis of the cancer.
Localytics - United States (Cambridge, Massachusetts)
Localytics is a real-time mobile application analytics service that provides app developers with tools to measure usage, engagement and custom events in their applications. All data is stored at a per-event level instead of as aggregated counts. This allows app publishers, for example, to create more accurately targeted advertising and promotional campaigns from detailed segmentation of their dedicated customers.
Judging this year’s competition was even more difficult than last year because of the high quality of the field. Rather than a clear winner just jumping out, nearly all the finalist were viable winners and each clearly led in some dimensions.
As I write this and reflect on the field of finalist, some notable aspects of the list: 1) it is truly international in that there are several very strong entrants from outside the US and more than ½ of the finalists come from outside of Silicon Valley – the combination of two trends is powerful: first the economics of cloud computing supports successful startups without venture funding and, second, the spread of venture and angel funding throughout the world. Both trends make for a very strong field. Continuing on the notable attributes list, 2) very early stage startups are getting traction incredibly quickly – cloud computing allows companies to go to beta without having to grow a huge company. And, 3) Diversity. There were consumer offerings, developer offerings, and services aimed at highly skilled professionals.
The winner of the AWS Startup Challenge this year was Fantasy Shopper from Exeter, United Kingdom. Fantasy Shopper is a small, mostly bootstrapped startup led by CEO Chris Prescott and CTO Dan Noz with two other engineers. Fantasy Shopper is a social shopping game. They just went into beta on October 18th and already have thousands of incredibly engaged users. My favorite example of which is this video blog posted to YouTube November 6th: http://www.youtube.com/watch?v=h_sKDgdEexk. Watch the first 60 to 90 seconds and you’ll see what I mean.
Congratulations to Chris, Dan, Brendan, and Findlay at Fantasy Shopper and keep up the great work.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com