It’s difficult to adequately test complex systems. But what’s really difficult is keeping a system adequately tested. Creating systems that do what they are designed to do is hard but, even with the complexity of these systems, many life critical systems have the engineering and production testing investment behind them to be reasonably safe when deployed. It’s keeping them adequately tested over time, as conditions and the software system change, where we sometimes fail.
There are exceptions to the general observation that we can build systems that operate safely within reasonable expectations of operating conditions. One I’ve written about was the Fukushima Dai-1 nuclear catastrophe. Any reactor design that doesn’t anticipate total power failure, including backups, is ignoring considerable history. Those events, although rare, do happen; life critical designs need to expect them, and this fault condition needs to be tested in a live production environment. “Nearly expected” events shouldn’t bring down life critical systems. The more difficult to protect against are 1) the impact of black swan events and 2) the combination of vastly changing environmental conditions over time.
Another even more common cause of complex system failure is the impact of human operators. Most systems are exposed to human factors, and human error will always be a leading cause of complex system failure. One I’ve written about and continue to follow is the Costa Concordia grounding. It is well on its way to becoming a textbook case on how human error can lead to loss of life, and how poor decisions can cascade to provide opportunity for yet more poor decisions by the same people that got the first one wrong and are now operating under incredible stress. It’s a super interesting situation that I will likely return to in the near future and summarize what has been learned over the last few years of investigation, salvage operation, and court cases.
Returning to the two other causes of complex system failure I mentioned earlier: 1) the impact of black swan events and 2) compounding changes in environmental conditions. Much time gets spent on how to mitigate the negative impact of rare and difficult-to-predict events. The former are, by definition, very difficult to adequately predict, so most mitigations involve being able to reduce the blast radius (localize the negative impact as much as possible) and design the system to fail into a degraded but non-life-threatening state (degraded operations mode). The more mundane of the two conditions is the second, compounding changes in the environmental conditions in which the complex system is operating. These are particularly difficult in that, quite often, none of the changes are large and none happen fast, but scale changes, traffic conditions change, the workload mix changes, and the sum of all changes over a long period of time can put the system in a state far outside of those anticipated by the original designers and, consequently, never tested.
I came across an example of these latter failure modes in my recent flight through Los Angeles Airport on April 30th. While waiting for my flight from Los Angeles north to Vancouver, we were told there had been a regional air traffic control system failure and the entire Los Angeles area was down. These regional air traffic control facilities are responsible for the air space between airports. In the US, there are currently 22 Area Control Centers, referred to by the FAA as Air Route Traffic Control Centers (ARTCCs). Each ARTCC is responsible for a portion of the US air space outside of the inverted pyramid controlled by each airport. The number of ARTCCs has been increasing but, even with larger numbers, the negative impact of one going down is broad. In this instance, all traffic for the LAX, Burbank, Long Beach, Ontario, and Orange County airports was brought to a standstill.
As I waited to board my flight to Vancouver, more and more aircraft were accumulating at LAX. Over the next hour or so the air space in the region drained as flights landed or were diverted, but none departed. 50 flights were canceled and 428 flights were delayed. The impact of the cancelations and delays rippled throughout North America and probably world-wide. As a consequence of this delay, I missed my flight from Vancouver to Victoria much later in the evening, and many other passengers passing through Vancouver were impacted, even though it is far away, in a different country, and many hours later. An ARTCC going down for even a short time can lead to delays in the system that take nearly a day to fully resolve.
Having seen the impact personally, I got interested in what happened. What took down the ARTCC system controlling the entire Los Angeles region? It has taken some time and, in the interim, the attention of many news outlets has wandered elsewhere, but this BBC article summarizes the failure: Air Traffic Control Memory Shortage Behind Air Chaos, and this article has more detail: Fabled U-2 Spy Plane Begins Farewell Tour by Shutting Down Airports in the L.A. Region.
The failing system at the ARTCC was the En-Route Automation Modernization (ERAM) system, which can track up to 1,900 flights simultaneously using data from many sensors including 64 RADAR systems. The system was deployed in 2010, so it’s fairly new, but we all know that, over time, all regions get busier. The airports get better equipment allowing them to move planes safely at greater rates, take-off and landing frequency goes up, some add new runways, and most add new gates. And the software system itself gets changed, not all of which will improve scaling or maximum capability. Over time, the load keeps going up and the system moves further from the conditions under which it was designed, developed, and tested. This happens to many highly complex systems, and some end up operating at an order of magnitude higher load or a different load mix than originally anticipated – this is a common cause of complex system failure. We know that systems eventually go non-linear as the load increases, so we need to constantly probe 10x beyond what is possible today to ensure that there remains adequate headroom between possible operating modes and the system failure point.
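The probing idea above can be sketched with a toy model. Nothing here reflects ERAM’s actual design; the queueing-style latency model, the latency SLO, and the step sizes are invented for illustration (the 1,900-track capacity figure is the only number taken from the article). The point is that offered load is stepped well past today’s peak until the simulated system goes non-linear:

```python
# Hypothetical load-probe sketch: step offered load far past today's peak
# and record where a simulated system goes non-linear. The latency model,
# SLO, and numbers are illustrative only.

def simulated_latency_ms(load, capacity=1900):
    """Toy queueing-style model: latency explodes as load nears capacity."""
    if load >= capacity:
        return float("inf")
    return 5.0 / (1.0 - load / capacity)  # M/M/1-like blowup near saturation

def probe_headroom(current_peak, slo_ms=100.0, step=100):
    """Raise load in steps up to 10x current peak; return the highest load
    that still met the latency SLO."""
    last_good = 0
    for load in range(step, 10 * current_peak + 1, step):
        if simulated_latency_ms(load) > slo_ms:
            break
        last_good = load
    return last_good

peak = 500  # today's observed peak concurrent tracks (illustrative)
safe = probe_headroom(peak)
print(f"meets SLO up to {safe} concurrent tracks; "
      f"headroom over current peak: {safe / peak:.1f}x")
```

The point of the sketch: the safe operating bound is discovered by probing, not assumed, and the ratio of that bound to the current peak is the headroom to keep re-measuring as conditions drift.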
The FAA ERAM system is a critical life safety system, so presumably it operates with more engineering headroom and safety margin than some commercial systems, yet it still went down hard and all backups failed in this situation. What happened in this case appears to have been a combination of slowly ramping load and a rare event. In this case, a U-2 spy plane was making a high-altitude pass over the Western US as part of its farewell tour. The U-2 is a spy plane with an operational ceiling of 70,000’ (13.2 miles above earth) and a range of 6,405 miles at a speed of 500 mph. It flies well above commercial air traffic. The U-2 is neither particularly fast nor long range, but you have to remember its first flight was in 1955. It’s an aging technology, and it was incredibly advanced when it was first produced by the famous Lockheed Skunk Works group led by Kelly Johnson. Satellite imagery and drones have largely replaced the U-2, and the $32,000 bill for each flight hour has led to the planned retirement of the series.
What brought the ERAM system down was the combination of the typically heavy air traffic in the LA region and an imprecise flight plan filed for the U-2 spy plane passing through the region on a farewell flight. It’s not clear why a plane flying well above the commercial flight ceilings was viewed as a collision threat by ERAM, but the data stream driven by the U-2 flight ended up consuming massive amounts of memory, which brought down both the primary and secondary software systems, leaving the region without air traffic control.
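One hedged illustration of a defense against this failure mode: cap the resources any single flight plan may consume, shedding detail rather than letting one pathological input exhaust memory process-wide. This is not how ERAM actually works; the class, the waypoint budget, and the flight ID below are all invented for the sketch:

```python
# Illustrative defense against the unbounded-memory failure mode described
# above: cap what any single flight plan may consume, shedding detail rather
# than crashing the whole system. Names and limits are hypothetical.

MAX_WAYPOINTS_PER_FLIGHT = 1000  # hypothetical per-flight memory budget

class FlightTrack:
    def __init__(self, flight_id):
        self.flight_id = flight_id
        self.waypoints = []
        self.truncated = False

    def add_waypoint(self, wp):
        """Accept waypoints up to the budget; beyond it, degrade (drop
        detail) instead of letting one flight grow without bound."""
        if len(self.waypoints) >= MAX_WAYPOINTS_PER_FLIGHT:
            self.truncated = True  # flag for operators; keep system alive
            return False
        self.waypoints.append(wp)
        return True

# A pathological flight plan stops growing at the cap rather than
# bringing the whole process down with it.
track = FlightTrack("U2-FAREWELL")
accepted = sum(track.add_waypoint((i, i)) for i in range(5000))
print(accepted, track.truncated)
```

The design choice being illustrated: a per-input budget converts a system-wide hard failure into a localized, visible degradation on the offending input.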
The lessons here are at least twofold. First, as complex systems age, the environmental conditions under which they operate change dramatically. Workload typically goes up, the workload mix changes over time, and there will be software changes made over time, some of which will change the bounds of what the system can reliably handle. Knowing this, we must always be retesting production systems with current workload mixes, and we must probe the bounds well beyond any reasonable production workload. As these operating and environmental conditions evolve, the testing program must be updated. I have seen complex systems fail where, upon closer inspection, it’s found that the system has been operating for years beyond its design objectives and its original test envelope. Systems have to be constantly probed to failure so we know where they will operate stably and where they will fail.
The second lesson is that rare events will happen. I doubt a U-2 pass over the western US is all that rare, but something about this one was unusual. We need to expect that complex systems will face unexpected environmental conditions and look hard for some form of degraded operations mode. Fault containment zones should be made as small as possible, and we want to look for ways to deliver the system such that some features may fail while others continue to operate. You don’t want to lose all functionality for the entire system in a failure. For all complex systems, we are looking for ways to divide up the problem such that a fault has a small blast radius (less impact), and we want the system to gracefully degrade to less functionality rather than have the entire system hard fail. With backup systems, it’s particularly important that they are designed to fail down to less functionality rather than also hard failing. In this case, having the backup come up requiring more flight separation and doing fewer real-time collision calculations might be the right answer. Generally, all systems need to have some way to operate at less than full functionality or less than full scale without completely hard failing. These degraded operation modes are hard to find, but I’ve never seen a situation where we couldn’t find one.
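The “fail down, not hard” behavior described above can be sketched as a fallback path: when the full conflict computation can’t run, switch to a cheaper and far more conservative check rather than crashing. The separation minima and function names are hypothetical, and the simulated MemoryError merely stands in for resource exhaustion:

```python
# Sketch of failing down to a degraded mode: if the full real-time conflict
# computation can't run (overload, backup mode), fall back to a cheap,
# conservative check with much larger separation minima. All thresholds
# and function names are invented for illustration.

FULL_SEPARATION_NM = 5       # normal lateral separation (illustrative)
DEGRADED_SEPARATION_NM = 20  # degraded mode demands far more spacing

def full_conflict_check(a, b):
    """Stand-in for an expensive trajectory-projection computation."""
    raise MemoryError("simulated resource exhaustion")

def coarse_conflict_check(a, b, minimum_nm):
    """Cheap straight-line distance check used in degraded mode."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (dx * dx + dy * dy) ** 0.5 < minimum_nm

def check_separation(a, b):
    try:
        return full_conflict_check(a, b)
    except MemoryError:
        # Degraded operations: keep controlling traffic, just more
        # conservatively, instead of hard failing.
        return coarse_conflict_check(a, b, DEGRADED_SEPARATION_NM)

# Two aircraft 12 nm apart: fine under normal minima, but flagged as a
# conflict under the conservative degraded-mode minima.
print(check_separation((0, 0), (12, 0)))
```

The trade being illustrated: the degraded path gives up throughput (more spacing means fewer aircraft in the air space) to preserve the safety-critical function, rather than losing the whole system.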
It’s not really relevant, but still ironic, that both the trigger for the fault, the U-2 spy plane, and the system brought down by the fault, ERAM, are Lockheed products, although released 55 years apart.
Unfortunately, it’s not all about squeezing the last bit of reliability out of complex systems. There is still a lot of low-hanging fruit out there. As proof that even some relatively simple systems can be poorly engineered and produce remarkably poor results even when working well within their design parameters: on a 747-400 flying back to New Zealand from San Francisco, at least 3 passengers had lost bags. My loss appears to be permanent at this point. It’s amazing that couriers can move tens of millions of packages a year over far more complex routings and yet, in passenger air travel baggage handling, the results are several orders of magnitude worse. I suspect that if the operators were economically responsible for the full loss, the quality of air baggage tracking would come up to current technology levels fairly quickly.
James Hamilton, firstname.lastname@example.org