In the data center world, there are few events taken more seriously than power failure and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But, a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America eclipsing the final episode of Mash. World-wide, the Super Bowl is only behind the European Cup (UEFA Champions Leaque) which draws 178 million viewers.
When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.
At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure let to a 34 min delay to restore full lighting the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But, NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.
What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supply power to the Superdome reported their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.
The joint report from SMG the company that manages the Superdome and Entergy, the utility power provider, said:
A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.
Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.
Essentially, the utility circuit breaker detected an “anomaly” and opened the breaker. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is they are incredibly flexible and can be configured uniquely for each installation. The disadvantage of software systems is the wide variety of configurations they can support can be complex and the default configurations are used perhaps more often than they should. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.
Because the cause has not yet been reported and, quite often, the underlying root cause is never found. But, it’s worth asking, is it possible to avoid long game outages and what would it cost? As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.
Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through $336 million renovation, the switch gear may have been relatively new and, even if it wasn’t, it likely was almost certainly recently maintained and inspected.
Where issues often arise are in configuration. Modern switch gear have an amazingly large number of parameters many of which interact with each other and, in total, can be difficult to fully understand. And, given the switch gear manufactures know little about the intended end-use application of each switchgear sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than a breaker that fails to open. Consequently conservative settings are common.
Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.
The first, testing in full production environment in non-mission critical setting is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skill electricians and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).
Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.
The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds each protected by its own generator. Generators in the 2.5 to 3.0 MW range are substantial V16 diesel engines the size of a mid-sized bus. And they are expensive running just under $1M each but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year around with a permanent installation. We would need 2 generators, the switchgear to connect it to the load and uninterruptable power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy third generator just in case there is a problem and one of the two generators don’t start. The generators are under $1m each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10m. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4m.
For the price of just over 60 seconds of commercials the facility could protected against fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities chose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during a some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage so I’m not a big fan of this “safety” configuration but it is a common choice.
Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.
Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense are small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checker boarded over the facility lights, (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.
In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minute checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recover decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20? The gas discharge lighting still favored at large sporting venues, take roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game, the relative low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant design and employing more aggressive pre-game testing.
James Hamilton e: firstname.lastname@example.org w: http://www.mvdirona.com b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.