It’s difficult to adequately test complex systems. But what’s really difficult is keeping a system adequately tested. Creating systems that do what they are designed to do is hard but, even with the complexity of these systems, many life-critical systems have the engineering and production-testing investment behind them to be reasonably safe when deployed. It’s keeping them adequately tested over time, as conditions and the software system change, where we sometimes fail.
There are exceptions to the general observation that we can build systems that operate safely within reasonable expectations of operating conditions. One I’ve written about was the Fukushima Dai-1 nuclear catastrophe. Any reactor design that doesn’t anticipate total power failure, including failure of the backups, is ignoring considerable history. Those events, although rare, do happen; life-critical designs need to expect them, and this fault condition needs to be tested in a live production environment. “Nearly expected” events shouldn’t bring down life-critical systems. More difficult to protect against are 1) the impact of black swan events and 2) the combination of vastly changing environmental conditions over time.
Another even more common cause of complex system failure is the impact of human operators. Most systems are exposed to human factors, and human error will always be a leading cause of complex system failure. One I’ve written about and continue to follow is the Costa Concordia Grounding. It is well on its way to becoming a textbook case of how human error can lead to loss of life, and of how poor decisions can cascade to provide opportunity for yet more poor decisions by the same people who got the first one wrong and are now operating under incredible stress. It’s a super interesting situation that I will likely return to in the near future to summarize what has been learned over the last few years of investigation, salvage operation, and court cases.
Returning to the two other causes of complex system failure I mentioned earlier: 1) the impact of black swan events and 2) compounding changes in environmental conditions. Much time gets spent on how to mitigate the negative impact of rare and difficult-to-predict events. These are, by definition, very difficult to adequately predict, so most mitigations involve being able to reduce the blast radius (localize the negative impact as much as possible) and to design the system to fail into a degraded but non-life-threatening state (degraded operations mode). The more mundane of the two is the second, compounding changes in the environmental conditions in which the complex system is operating. These are particularly difficult in that, quite often, none of the changes are large and none happen fast. But scale changes, traffic conditions change, the workload mix changes, and the sum of all of these changes over a long period of time can put the system in a state far outside of anything anticipated by the original designers and, consequently, never tested.
I came across an example of these failure modes on my recent flight through Los Angeles Airport on April 30th. While waiting for my flight from Los Angeles north to Vancouver, we were told there had been a regional air traffic control system failure and the entire Los Angeles area was down. These regional air traffic control facilities are responsible for the air space between airports. In the US, there are currently 22 Area Control Centers, referred to by the FAA as Air Route Traffic Control Centers (ARTCC). Each ARTCC is responsible for a portion of the US air space outside of the inverted pyramid controlled by each airport. The number of ARTCCs has been increasing but, even with larger numbers, the negative impact of one going down is broad. In this instance, all traffic for the LAX, Burbank, Long Beach, Ontario, and Orange County airports was brought to a standstill.
As I waited to board my flight to Vancouver, more and more aircraft were accumulating at LAX. Over the next hour or so the air space in the region drained as flights landed or were diverted, but none departed. 50 flights were canceled and 428 flights were delayed. The impact of the cancelations and delays rippled throughout North America and probably world-wide. As a consequence of this delay, I missed my flight from Vancouver to Victoria much later in the evening, and many other passengers passing through Vancouver were impacted even though Vancouver is a long way away, in a different country, and many hours later. An ARTCC going down for even a short time can lead to delays in the system that take nearly a day to fully resolve.
Having seen the impact personally, I got interested in what happened. What took down the ARTCC system controlling the entire Los Angeles region? It has taken some time and, in the interim, the attention of many news outlets has wandered elsewhere, but this BBC article summarizes the failure: Air Traffic Control Memory Shortage Behind Air Chaos, and this article has more detail: Fabled U-2 Spy Plane Begins Farewell Tour by Shutting Down Airports in the L.A. Region.
The failing system at the ARTCC was the En-Route Automation Modernization (ERAM) system, which can track up to 1,900 flights simultaneously using data from many sensors including 64 RADAR systems. The system was deployed in 2010, so it’s fairly new, but we all know that over time all regions get busier. The airports get better equipment allowing them to move planes safely at greater rates, take-off and landing frequency goes up, some add new runways, and most add new gates. And the software system itself gets changed, not all of which improves scaling or maximum capacity. Over time, the load keeps going up and the system moves further from the initial conditions under which it was designed, developed, and tested. This happens to many highly complex systems, and some end up operating at an order of magnitude higher load, or a different load mix, than originally anticipated; this is a common cause of complex system failure. We know that systems eventually go non-linear as the load increases, so we need to constantly probe 10x beyond what is possible today to ensure that there remains adequate headroom between possible operating modes and the system failure point.
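To make the idea of probing for headroom concrete, here’s a minimal sketch of a load probe that ramps synthetic traffic against a test instance and reports the last load level before response times go non-linear. The `submit_tracks` callable, the load levels, and the degradation threshold are all hypothetical; a real probe against something like ERAM would replay recorded sensor feeds rather than make synthetic calls.

```python
# Minimal sketch of a headroom probe: ramp load on a test instance and find
# the last level where behavior is still roughly flat. All names here
# (submit_tracks, load levels, thresholds) are hypothetical.
import time


def p99_latency(submit_tracks, load, samples=50):
    """Drive `load` concurrent synthetic tracks `samples` times; return ~p99 latency in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        submit_tracks(load)                       # hypothetical call into the system under test
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]


def find_headroom(submit_tracks, start, stop, step, degradation=3.0):
    """Return the highest load where p99 latency stays within `degradation`x of baseline."""
    baseline = p99_latency(submit_tracks, start)
    last_good = start
    for load in range(start + step, stop + 1, step):
        if p99_latency(submit_tracks, load) > degradation * baseline:
            break                                 # the system has gone non-linear
        last_good = load
    return last_good


# Example: probe to 10x today's peak (say 1,900 tracks) in steps of 1,900.
# headroom = find_headroom(submit_tracks, start=1900, stop=19000, step=1900)
```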
The FAA ERAM system is a critical life-safety system, so presumably it operates with more engineering headroom and safety margin than some commercial systems, yet it still went down hard and all backups failed in this situation. What happened in this case appears to have been a combination of slowly ramping load and a rare event. In this case a U-2 spy plane was making a high-altitude pass over the Western US as part of its farewell tour. The U-2 has an operational ceiling of 70,000’ (13.2 miles above earth) and a range of 6,405 miles at a speed of 500 mph. It flies well above commercial air traffic. The U-2 is neither particularly fast nor long range, but you have to remember its first flight was in 1955. It’s an aging technology, and it was incredibly advanced when it was first produced by the famous Lockheed Skunk Works group led by Kelly Johnson. Satellite imagery and drones have largely replaced the U-2, and the $32,000 bill for each flight hour has led to the planned retirement of the series.
What brought the ERAM system down was the combination of the typically heavy air traffic in the LA region and an imprecise flight plan filed for the U-2 spy plane passing through on its farewell flight. It’s not clear why a plane flying well above the commercial flight ceilings was viewed as a collision threat by ERAM, but the data stream driven by the U-2 flight ended up consuming massive amounts of memory, which brought down both the primary and secondary software systems, leaving the region without air traffic control.
The lessons here are at least twofold. First, as complex systems age, the environmental conditions under which they operate change dramatically. Workload typically goes up, the workload mix changes over time, and there will be software changes made over time, some of which will change the bounds of what the system can reliably handle. Knowing this, we must always be retesting production systems with current workload mixes, and we must probe the bounds well beyond any reasonable production workload. As these operating and environmental conditions evolve, the testing program must be updated. I have seen complex systems fail where, upon closer inspection, it’s found that the system has been operating for years beyond its design objectives and its original test envelope. Systems have to be constantly probed to failure so we know where they will operate stably and where they will fail.
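As a sketch of keeping the test envelope current, the fragment below compares observed production peaks against the last load level the system was actually validated at, and flags when the remaining margin has shrunk. The names and the 2x margin are assumptions for illustration, not anything from the FAA or ERAM.

```python
# Minimal sketch of an envelope check: alert when production load approaches
# the last point the system was actually tested to. Names and numbers are
# hypothetical.
from dataclasses import dataclass


@dataclass
class TestEnvelope:
    validated_peak_load: float      # highest load the system passed testing at
    validated_on: str               # when that test was last run


def headroom_ok(observed_peak: float, envelope: TestEnvelope, margin: float = 2.0) -> bool:
    """True if the last validated load still exceeds the observed peak by `margin`x."""
    return envelope.validated_peak_load >= margin * observed_peak


# Example: the last full-load test passed at 5,000 tracks, but production now
# peaks at 2,800 tracks, so the 2x margin is gone and a re-probe is due.
envelope = TestEnvelope(validated_peak_load=5000, validated_on="2013-06-01")
if not headroom_ok(observed_peak=2800, envelope=envelope):
    print("Production has outgrown the test envelope; schedule a re-probe.")
```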
The second lesson is that rare events will happen. I doubt a U-2 pass over the western US is all that rare, but something about this one was unusual. We need to expect that complex systems will face unexpected environmental conditions and look hard for some form of degraded operations mode. Fault containment zones should be made as small as possible, and we want to look for ways to deliver the system such that some features may fail while others continue to operate. You don’t want to lose all functionality for the entire system in a failure. For all complex systems, we are looking for ways to divide up the problem such that a fault has a small blast radius (less impact), and we want the system to gracefully degrade to less functionality rather than have the entire system hard fail. With backup systems, it’s particularly important that they are designed to fail down to less functionality rather than also hard failing. In this case, having the backup come up, require more flight separation, and do fewer real-time collision calculations might be the right answer. Generally, all systems need to have some way to operate at less than full functionality or less than full scale without completely hard failing. These degraded operations modes are hard to find, but I’ve never seen a situation where we couldn’t find one.
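As an illustration of failing down rather than failing off, here’s a small sketch of a fallback chain that prefers the primary, then the secondary, and only then drops to a degraded policy with wider separation and fewer tracked flights instead of halting everything. The class, the health_check interface, and the specific separation numbers are all hypothetical, not how ERAM or its backups actually work.

```python
# Minimal sketch of a degraded operations mode: prefer the primary, then the
# secondary, and fall back to a reduced-capability policy rather than a hard
# stop. The interfaces and numbers are hypothetical.
from dataclasses import dataclass


@dataclass
class SeparationPolicy:
    lateral_nm: float           # required lateral separation (nautical miles)
    max_flights: int            # concurrent flights the mode can safely handle


FULL_SERVICE = SeparationPolicy(lateral_nm=5.0, max_flights=1900)
DEGRADED = SeparationPolicy(lateral_nm=10.0, max_flights=600)


def select_mode(primary, secondary):
    """Return (policy, source): full service if any automated tier is healthy, else degraded."""
    for name, tier in (("primary", primary), ("secondary", secondary)):
        try:
            tier.health_check()             # hypothetical health probe on the tier
            return FULL_SERVICE, name
        except Exception:
            continue                        # this tier is down; try the next one
    # Both automated tiers are down: fewer flights and wider spacing, but the
    # airspace keeps operating rather than being drained.
    return DEGRADED, "degraded"
```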
It’s not really relevant, but still ironic, that both the trigger for the fault, the U-2 spy plane, and the system brought down by the fault, ERAM, were Lockheed products, although released 55 years apart.
Unfortunately, it’s not all about tuning the last bit of reliability out of complex systems. There is still a lot of low-hanging fruit out there. As proof that even some relatively simple systems can be poorly engineered and produce remarkably poor results while working well within their design parameters: on a 747-400 flying back to New Zealand from San Francisco, at least 3 passengers on that flight had lost bags. My loss appears to be permanent at this point. It’s amazing that couriers can move tens of millions of packages a year over far more complex routings and yet, in passenger air travel baggage handling, the results are several orders of magnitude worse. I suspect that if the operators were economically responsible for the full loss, the quality of air baggage tracking would come up to current technology levels fairly quickly.
James Hamilton, jrh@mvdirona.com
Hi, I’m looking for the possible effects or problems caused by ATC system failure for the controllers. If you know any related journals or articles on this, please let me know. Thanks!
If I’m understanding you correctly, you are interested in the effect of air traffic control system failure on the air traffic controllers rather than on passengers, pilots, and other users of the system. I’ve not seen anything, but I suspect there will be some studies out there on the effect of system failure on the operators of life-critical systems. You may have better luck if you expand your search beyond air traffic controllers and focus on operators of any life-critical system and the impact of control system failures on them.
Does cloud backup help to avoid this problem? I am a master’s student in the USA doing my project on “Enterprise Network Design and Implementation for Airports”. In this project, I have decided to provide the entire network with two ISPs in order to keep cloud recovery working whenever there is a failure in the system. Can you provide me with information about that, or tell me whether my approach is correct? Thanks a lot and regards
The way that AWS customers (including Amazon.com) avoid this problem is by having three availability zones in the same region. Each availability zone is a separate data center. These separate availability zones are chosen to be far enough apart that they do not suffer correlated failures from fire, weather, flood, etc. But they are close enough together that data can be committed to multiple data centers synchronously. What this allows is a fairly simple transaction and recovery model where data is committed to at least 2 of 3 data centers and the system can operate seamlessly through the loss of an entire data center.
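A minimal sketch of that 2-of-3 commit, assuming a hypothetical replica object with a synchronous, durable append call: the write succeeds only once a majority of availability zones has acknowledged it, so losing any single zone loses no committed data.

```python
# Minimal sketch of a 2-of-3 synchronous commit across availability zones.
# The replica interface (append) is hypothetical.
def quorum_write(replicas, record, quorum=2):
    """Replicate `record` synchronously; succeed only if `quorum` replicas acknowledge."""
    acks = 0
    errors = []
    for replica in replicas:                # typically one replica per availability zone
        try:
            replica.append(record)          # hypothetical durable, synchronous append
            acks += 1
        except Exception as err:
            errors.append(err)              # a single-zone failure is tolerated
    if acks < quorum:
        raise RuntimeError(f"commit failed: {acks}/{len(replicas)} acknowledgements ({errors})")
    return acks
```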
Because the system operates well with a data center removed, if you are uncertain and think there may be a fault, you can bring an entire facility down to investigate quickly. This is how Amazon.com and AWS get excellent availability even while the code is being changed frequently.
The more common approach is to run cross-region redundancy where there are two data centers in different parts of the country. However, this model will not support synchronous commits; the redundancy is asynchronous and, as a consequence, there is a small window of lost data on failover. Because failover is a messy event that loses data, and it’s very hard to fail back, this model is just about never tested and consequently is hard to make reliable.
Multiple independent data centers close enough together to support synchronous replication is a model that really works well, and it’s one of the advantages of the AWS model. Of course, it can be combined with cross-region replication as well, and some of our customers do exactly that.
Hi Frank, good to hear from you.
Isn’t it time for a big change in your work life? It’s past time you considered coming to AWS and chewing on some of the thorny infrastructure challenges we have on the go. You would have a ball.
I actually think that failing back from ERAM to a backup system that can’t sustain as many flights in the air concurrently would be a perfectly good solution. In this case, they appear to have made the decision to drain the entire airspace. The problem with failing to manual mode, especially a manual mode with less capacity and higher risk, is that it just about never gets tested in production (nobody wants to run at higher risk and lower capacity just to ensure the backups work), so it ends up being more dangerous due to never being used.
An approach I like is to fail over quickly to a frequently used secondary system. Most faults are corrected there, and the ARTCCs do have this capability. If the secondary system starts experiencing faults, as it did in this case, failing down to a degraded operations mode that can’t support as many concurrent flights and requires more air separation would be better than failing to "off". Draining the skies has a massive economic impact, and the upwards of an hour required to do this exposes us to considerable risk if the manual backups can’t deal with the capacity.
My take is that asking people to take over when they have no practice, and when they are not capable of safely running with the same air separation as the automated systems, is not where we want to be.
Back when the ERAM system was still just being tested, many folks in aviation (pilots & ATC) expressed concerns that ATC would become too dependent on the ERAM automation, and not practice manual radar / flight-progress-strip work enough to work well as a fallback. I guess that was both right & wrong, as during this event no one got hurt because of the outage, but OTOH ATC couldn’t handle the normal load.