Monday, February 04, 2013

In the data center world, few events are taken more seriously than power failures, and considerable effort is spent to make them rare. When a data center experiences a power failure, it’s a really big deal for all involved. But a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the most watched television show in North America, eclipsing the final episode of M*A*S*H. Worldwide, the Super Bowl is behind only the European Cup (UEFA Champions League), which draws 178 million viewers.

 

When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second-half opening kickoff for a touchdown and they were dominating the game with a 28 to 6 lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.

 

At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure led to a 34 min delay to restore full lighting to the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage, with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire, where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.

 

What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supplying power to the Superdome, reported that their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.

 

The joint report from SMG, the company that manages the Superdome, and Entergy, the utility power provider, said:

 

A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.

 

Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.

 

 

Essentially, the utility circuit breaker detected an “anomaly” and opened. Modern switchgear has many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is that they are incredibly flexible and can be configured uniquely for each installation. The disadvantage is that the wide variety of configurations they support can be complex, and the default configurations are used perhaps more often than they should be. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event, but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.

 

The cause has not yet been reported and, quite often, the underlying root cause is never found. But it’s worth asking: is it possible to avoid long game outages and what would it cost? As when looking at any system fault, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.

 

Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through a $336 million renovation, the switchgear may have been relatively new and, even if it wasn’t, it was almost certainly recently maintained and inspected.

 

Where issues often arise is in configuration. Modern switchgear has an amazingly large number of parameters, many of which interact with each other and, in total, can be difficult to fully understand. And, given that switchgear manufacturers know little about the intended end-use application of each unit sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than that of a breaker that fails to open. Consequently, conservative settings are common.

 

Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.

 

The first, testing in a full production environment in a non-mission critical setting, is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skilled electricians, and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).

 

Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by a utility substation failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use yesterday’s Superdome fault as an example.

 

The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds, each protected by its own generator. Generators in the 2.5 to 3.0MW range are substantial V16 diesel engines the size of a mid-sized bus. They are expensive, running just under $1m each, but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer, but let’s ignore that and look at what it would cost to protect the Superdome year-round with a permanent installation. We would need 2 generators, the switchgear to connect them to the load, and uninterruptible power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy a third generator just in case there is a problem and one of the two generators doesn’t start. The generators are under $1m each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10m. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4m.

 

For the price of just over 60 seconds of commercials, the facility could be protected against this fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities choose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage, so I’m not a big fan of this “safety” configuration, but it is a common choice.
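
The cost comparison above is simple enough to sketch in a few lines. This is only a rough illustration using the approximate figures quoted in this post; the rental cost and the 15 year equipment lifetime are assumptions added for the example, and none of the numbers are vendor quotes.

```python
# Rough cost comparison using the approximate figures quoted in the post.
# RENTAL_COST and LIFETIME_YEARS are assumptions made for this sketch.

AD_COST_PER_30S = 4.0e6   # "just over $4m" for a 30 second commercial (2012)
PERMANENT_COST = 10.0e6   # "under $10m" for generators, switchgear, and UPS
RENTAL_COST = 3.0e6       # assumed upper bound for a short-term rental
LIFETIME_YEARS = 15       # assumed service life of a permanent installation


def seconds_of_ad_time(cost, ad_cost_per_30s=AD_COST_PER_30S):
    """Express a dollar cost as seconds of Super Bowl commercial time."""
    return 30.0 * cost / ad_cost_per_30s


def cost_per_year(cost, lifetime_years=LIFETIME_YEARS):
    """Spread a one-time capital cost evenly over the equipment lifetime."""
    return cost / lifetime_years


if __name__ == "__main__":
    print(f"permanent: {seconds_of_ad_time(PERMANENT_COST):.1f} s of ads")
    print(f"rental:    {seconds_of_ad_time(RENTAL_COST):.1f} s of ads")
    print(f"permanent, amortized: ${cost_per_year(PERMANENT_COST):,.0f} per year")
```

At the $10m upper bound the permanent installation works out to 75 seconds of ad time, and a build-out closer to $8m would land at the 60 seconds mentioned above. Either way, the protection costs about as much as a couple of commercial slots.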

 

Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.

 

Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense is small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checkerboarded over the facility lights (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
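
To make the checkerboarding idea concrete, here is a minimal sketch. It assumes a simplified model in which the lights are numbered in order around the stadium and a "region" is a contiguous run of lights; the function names and the counts (240 lights, 4 feeds, 4 regions) are hypothetical and chosen only for illustration.

```python
def assign_feeds(num_lights, num_feeds, checkerboard):
    """Return the feed index powering each light, in physical order."""
    if checkerboard:
        # Neighboring lights alternate between feeds.
        return [i % num_feeds for i in range(num_lights)]
    # Contiguous layout: each feed powers one block of adjacent lights.
    block = -(-num_lights // num_feeds)  # ceiling division
    return [i // block for i in range(num_lights)]


def worst_region_loss(feeds, failed_feed, num_regions):
    """Largest fraction of lights lost in any one contiguous region."""
    size = len(feeds) // num_regions
    worst = 0.0
    for r in range(num_regions):
        region = feeds[r * size:(r + 1) * size]
        lost = sum(1 for f in region if f == failed_feed)
        worst = max(worst, lost / len(region))
    return worst


if __name__ == "__main__":
    for layout in (False, True):
        feeds = assign_feeds(240, 4, checkerboard=layout)
        name = "checkerboard" if layout else "contiguous"
        print(name, worst_region_loss(feeds, failed_feed=0, num_regions=4))
```

In this toy model, losing one of four feeds with a contiguous layout leaves one region of the field completely dark, while the checkerboarded layout dims every region by one quarter. The total number of lights lost is the same in both cases; only the distribution changes.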

 

Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.

 

In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minutes checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recovery decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20? The gas discharge lighting still favored at large sporting venues takes roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
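
The recovery timeline above can be sketched as two sequential delays: the time to decide it is safe to re-energize, and the time for the lights to restart. The 15 minute figures come from the discussion above; the 0.5 minute automated decision time and the zero restart time for instant-on lighting are assumptions for illustration, and the few remaining minutes of the actual 34 minute delay are not modeled.

```python
def outage_minutes(decision_min, lamp_restart_min):
    """Total time without full lighting: decide to restore, then relight."""
    return decision_min + lamp_restart_min


# (decision time, lamp restart time) in minutes for each scenario.
SCENARIOS = {
    "manual recovery, gas discharge lamps": (15.0, 15.0),
    "manual recovery, instant-on lamps": (15.0, 0.0),        # assumed 0 restart
    "automated recovery, gas discharge lamps": (0.5, 15.0),  # assumed 0.5 min
    "automated recovery, instant-on lamps": (0.5, 0.0),
}

if __name__ == "__main__":
    for name, (decision, restart) in SCENARIOS.items():
        print(f"{name}: {outage_minutes(decision, restart):.1f} min")
```

Under these assumptions, fixing only one of the two delays still leaves an outage of roughly 15 minutes. Both fast switching and fast-restarting lights are needed before the interruption shrinks to something a game could play through.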

 

It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure, but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game and the relatively low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant designs and employing more aggressive pre-game testing.

 

                                                                --jrh

 

James Hamilton 
e: jrh@mvdirona.com 
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, February 04, 2013 11:16:06 AM (Pacific Standard Time, UTC-08:00)  #    Comments [18] - Trackback
Hardware | Ramblings
Monday, February 04, 2013 11:44:42 AM (Pacific Standard Time, UTC-08:00)
Baltimore Colts? Been a few years since they've been around.
Brad
Monday, February 04, 2013 12:12:48 PM (Pacific Standard Time, UTC-08:00)
Yikes, thanks. I guess I'm a bit more focused on the technology than the game :-).

Fixed. Thanks for pointing it out.

--jrh
Monday, February 04, 2013 2:03:00 PM (Pacific Standard Time, UTC-08:00)
Good article - but I think you missed load reduction. Had they used more energy efficient lighting, then there would be massive safety in their subsystems and unlikely to reach anywhere near rated load on their circuitry. Data centers are going much more efficient to reduce energy consumption, and one clear benefit is lower cost of backup systems etc.
Bryan
Monday, February 04, 2013 2:38:33 PM (Pacific Standard Time, UTC-08:00)
Load reduction is good for the environment and it enables increased levels of redundancy without buying more capacity. But load reduction by itself won't help. If you have a 4kVA circuit, and you are only drawing 1kVA vs 3.5kVA, the failure mode is the same. If the utility breaker opens, the load is dropped.

Just changing to high efficiency lighting and dropping the load levels alone does not protect them from an outage. But I 100% agree it's a good idea and, as you pointed out, smaller loads are much cheaper to make redundant.

--jrh
Tuesday, February 05, 2013 1:27:35 PM (Pacific Standard Time, UTC-08:00)
It appears that they had back-up generators in line. Given what little detail has been provided, this instance seems to show the difference between source redundancy and pathway redundancy. Only way to sustain a distribution fault (i.e. switch-gear fuse, PLC breaker, shorted conductor, etc.) would be full 2N distribution pathways (or "like-kind") - common topology for data center providers, but unlikely that any stadium would design to that level. However, this instance may change that paradigm given the potential economic impacts you mentioned. I’m sure UPS vendors would like to think that 30 stadiums @ 5MW per = 150MW of new “mission critical” lighting was on the market.

Great points on PLC, switch, & breaker “default” settings. You don’t pay much attention to breaker coordination until it eats your lunch. I especially like the origin thesis on litigation defining OCPD default settings – never made that connection before, but it fits.

I look forward to seeing who gets the blame/bill for this one. Post mortems make for a great read.
tyler
Tuesday, February 05, 2013 1:57:44 PM (Pacific Standard Time, UTC-08:00)
I read the joint report stated the system operated as it was designed to operate. Your observation on "issues arising in the configuration" would be one explanation for the observed behaviour of the system, however one should not rule out abnormal load conditions. Here power monitoring and diagnostic tools could have confirmed the nature of the abnormality.

Your considerations for addressing system failures, i.e. maximizing availability are interesting, however one aspect that was not clear during the 34 minute outage is whether this resulted in loss of revenue or not.
Mike Evans
Tuesday, February 05, 2013 4:32:10 PM (Pacific Standard Time, UTC-08:00)
James
Part of fault tolerant strategy is to bypass on containment, right? So the original spec needs to be designed for 1.5 times the load and alternate paths around single points of failure would provide better fault tolerance.

Good article.
tk
TK
Wednesday, February 06, 2013 10:05:18 AM (Pacific Standard Time, UTC-08:00)
Tyler said "It appears that they had back-up generators in line." I'm pretty sure that they don't have generator backup for the stadium lights (of course they do for emergency lighting). But, even if they did, it would have still yielded a 15 min outage while the lights restarted unless they also installed UPS protection. I agree it's a sizable opportunity for UPS providers.

--jrh
Wednesday, February 06, 2013 10:12:43 AM (Pacific Standard Time, UTC-08:00)
Mike Evans pointed out "Your considerations for addressing system failures, i.e. maximizing availability are interesting, however one aspect that was not clear during the 34 minute outage is whether this resulted in loss of revenue or not." Modern customers have a remarkably short attention span and even short outage or small additional lateness can lose customers. However, in the case of the NFL Super Bowl, I'm inclined to agree that they probably didn't lose any viewers or revenue. Had the 49ers come back to win after the outage, there would have been a significant outcry. Generally, having fans believing that an event outside of the game influenced the outcome is not good for the business. Remember all the outrage around the less accurate play calling of the substitute officials.

If the game had ended differently, it would have had a negative reputation impact on the NFL which could be revenue impacting. If it's rare and doesn't impact the outcome, I agree the negative impact is small. Given the revenue at stake and the small cost of backup power, running the risk doesn't seem worth it.

--jrh


Thursday, February 07, 2013 8:26:34 AM (Pacific Standard Time, UTC-08:00)
Would it be better for the event to be powered by the generators, and have grid power as the backup, rather than the usual setup? That way, you don't have the generator start-up time to deal with, and minimal UPS requirements. If the generator fails, utility power steps in instantly.
SC
Thursday, February 07, 2013 8:34:17 AM (Pacific Standard Time, UTC-08:00)
SC asked if it makes sense to run the generators as primary power and use the grid as backup to avoid the UPS. Unfortunately, you have the same problem either way in that you can't have the generators and utility online at the same time. So the switch is break before make where, on transition, the primary power is disconnected prior to connecting the secondary power. Because of this, it's best to have a short duration UPS protecting the load.

Although it would be a tiny part of the overall environmental impact of the event, running the generators full time would likely get some unnecessary negative press. A small UPS is a pretty economic solution and it makes the entire system much more robust.

Renting all the equipment for a couple of weeks is under $3m and buying and installing it with an expected lifetime of 15 years of protection is an easy to justify expense.

--jrh
Thursday, February 07, 2013 11:28:50 AM (Pacific Standard Time, UTC-08:00)
I know this post was largely about the Superdome, but we ought to consider its implications for the data center industry. Today hardware redundancy is the conservative choice for a lot of data center operators. As we've seen with Hurricane Sandy and other super storms though, hardware will periodically fail, people will make mistakes, and software will have bugs. Rather than treating these as abnormal conditions, we should operationalize them and design systems that will shorten the mean time to recovery instead of the mean time between failures. This can be done by building resiliancy into the service layer of the application such that in the event of a power loss or a network disruption, the user experience remains unaffected because a) there are multiple instances of the application spread across fault domains, and b) the service or monitoring system is intelligent enough to move the load elsewhere when these disruptions occur. Sure, it requires additional upfront planning, but the saving realized by not having to order equipment with redundant power supplies, emergency generators, etc is well worth it.

Jeremy
Friday, February 08, 2013 9:11:05 AM (Pacific Standard Time, UTC-08:00)
Totally agree with Jeremy. He argues the right place to add redundancy in software services is up in the application stack. The longer argument goes like this: You can make a single data center very reliable using the techniques described here but you can't get it to the 5th nine. To get the fifth nine of reliability, you need cross data center redundancy and the best solutions are across more than 2 data centers. Ironically, once you have the ability to survive an incredibly unlikely event like data center failure, you actually don't need the constituent data centers to be all that reliable.

Application level redundancy across data centers is the best way to achieve super high application reliability. It's a more difficult programming model so most applications and services are not written this way but it is the right model for applications. But, for the Super Bowl, playing the game over two facilities isn't an elegant solution :-).

For the big game, we need a rock solid power distribution system with generators and uninterruptible power supplies.

--jrh
Friday, February 08, 2013 5:07:44 PM (Pacific Standard Time, UTC-08:00)
Good call on the switch mis-configuration - it sounds like that was the root cause:

http://www.cnn.com/2013/02/08/us/superdome-power-outage/index.html

"But the relay's manufacturer, Chicago-based S&C Electric Co., says it believes it knows why the problem happened: The relay, it says, wasn't operated at the proper setting.
System operators essentially put the relay's trip setting too low, S&C vice president Michael Edmonds wrote to CNN in an e-mail. The electrical load exceeded the trip setting, so the relay triggered, he said."
Dave Wright
Friday, February 08, 2013 8:57:19 PM (Pacific Standard Time, UTC-08:00)
Great hearing from you Dave. Yeah, the evidence strongly points to mis-configuration.

Everyone will pile on Entergy but, in the end, not having backup power for the Super Bowl is simply nuts. A couple of million will rent enough mobile generation and UPS to protect the facility and under $10m will protect it for 15+ years.

--jrh
Saturday, February 09, 2013 7:13:27 AM (Pacific Standard Time, UTC-08:00)
Great discussion! I can only suggest a kinetic energy UPS solutions might make more economic sense, given the periodic nature of only bridging loads during critical events. Current capacity specifications would require building several more fault control areas.
Robert Okrie
Sunday, February 17, 2013 2:24:31 PM (Pacific Standard Time, UTC-08:00)
Consider timezones this must of been the most exactly calculated 34 minutes in the history of mankind. suppose my decision to buy came before the power outtage or one minute after america africa japan, etc? exchange reaction for the redundant risk. no safety no risk, all power!!!!!
Cordell Harris
Tuesday, February 19, 2013 7:59:47 PM (Pacific Standard Time, UTC-08:00)
Big fan of your work Sir, cheers from Sri Lanka.
Mahilal Peiris

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton