At Scale, Rare Events aren’t Rare

I’m a connoisseur of failure. I love reading about engineering failures of all forms and, unsurprisingly, I’m particularly interested in data center faults. It’s not that I delight in engineering failures. My interest is driven by believing that the more faults we all understand, the more likely we can engineer systems that don’t suffer from these weaknesses.

It’s interesting, at least to me, that even fairly poorly-engineered data centers don’t fail all that frequently and really well-executed facilities might go many years between problems. So why am I so interested in understanding the cause of faults even in facilities where I’m not directly involved? Two big reasons: 1) the negative impact of a fault is disproportionately large and avoiding just one failure could save millions of dollars and 2) at extraordinary scale, even very rare faults can happen more frequently.
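The second reason can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1-in-100 chance per facility per year and the fleet sizes are assumed numbers rather than measured data, and faults are assumed to be independent across facilities.

```python
def prob_at_least_one(p_per_facility_year: float, facilities: int, years: float = 1.0) -> float:
    """Probability that at least one fault occurs somewhere in the fleet.

    Assumes each facility independently sees the fault with probability
    p_per_facility_year in any given year (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_per_facility_year) ** (facilities * years)


if __name__ == "__main__":
    p = 0.01  # hypothetical: a "1 in 100 years" fault per facility
    for fleet in (1, 10, 100):
        print(f"{fleet:>3} facilities: {prob_at_least_one(p, fleet):.1%} chance per year")
```

Under these assumptions, a fault that a single-facility operator would expect to see about once a century becomes more likely than not, in any given year, for an operator running 100 facilities.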

Today’s example is from a major US airline last summer and it is a great example of “rare events happen dangerously frequently at scale.” I’m willing to bet this large airline had never before seen the particular fault under discussion and yet, operating at much higher scale, I’ve personally encountered it twice in my working life. This example is a good one because the negative impact is high, the fault mode is well-understood, and, although it is a relatively rare event, there are multiple public examples of this failure mode.

Before getting into the details of what went wrong, let’s look at the impact of this failure on customers and the business. In this case, 1,000 flights were canceled on the day of the event but the negative impact continued for two more days with 775 flights canceled the next day and 90 on the third day. The Chief Financial Officer reported that $100m of revenue or roughly 2% of the airline’s world-wide monthly revenue was lost in the fall-out of this event. It’s more difficult to measure the negative impact on brand and customer future travel planning, but presumably there would have been impact on these dimensions as well.

It’s rare that the negative impact of a data center failure will be published, but the magnitude of this particular fault isn’t surprising. Successful companies are automated and, when a systems failure brings them down, the revenue impact can be massive.

What happened? The report was “switch gear failed and locked out reserve generators.” To understand the fault, it’s best to understand what the switch gear normally does and how faults are handled and then dig deeper into what went wrong in this case.

In normal operation, the utility power feeding a data center flows in from the mid-voltage transformers through the switch gear and then to the uninterruptible power supplies, which eventually feed the critical load (servers, storage, and networking equipment). During this time, the switch gear is just monitoring power quality.

If the utility power goes outside of acceptable quality parameters or simply fails, the switch gear waits a few seconds since, in the vast majority of the cases, the power will return before further action needs to be taken. If the power does not return after a predetermined number of seconds (usually less than 10), the switch gear will signal the backup generators to start. The generators start, run up to operating RPM, and are usually given a very short period to stabilize. Once the generator power is within acceptable parameters, the load is switched to the generator. During the few seconds required to switch to generator power, the UPS has been holding the critical load and the switch to generators is transparent. When the utility power returns and is stable, the load is switched back to utility and the generators are brought back down.
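The sequence above can be sketched as a small state machine. This is a simplified illustration and not any vendor's switch gear logic: the state names, the `next_state` function, and the 10-second default are assumptions chosen to mirror the description, and the lockout behaviour is deliberately left out.

```python
from enum import Enum, auto


class PowerState(Enum):
    ON_UTILITY = auto()           # normal operation, monitoring power quality
    WAITING_FOR_UTILITY = auto()  # utility out of spec, UPS holding the load
    STARTING_GENERATOR = auto()   # generators running up and stabilizing
    ON_GENERATOR = auto()         # load transferred to generator


def next_state(state: PowerState, utility_ok: bool, seconds_waited: float,
               generator_stable: bool, wait_limit_s: float = 10.0) -> PowerState:
    """Return the next state for one evaluation of the transfer logic.

    utility_ok means utility power is back within acceptable parameters and
    stable; wait_limit_s is the hypothetical "predetermined number of seconds".
    """
    if state is PowerState.ON_UTILITY:
        return state if utility_ok else PowerState.WAITING_FOR_UTILITY
    if state is PowerState.WAITING_FOR_UTILITY:
        if utility_ok:
            return PowerState.ON_UTILITY  # power returned on its own
        if seconds_waited >= wait_limit_s:
            return PowerState.STARTING_GENERATOR
        return state
    if state is PowerState.STARTING_GENERATOR:
        # Transfer the load only once generator power is within parameters.
        return PowerState.ON_GENERATOR if generator_stable else state
    if state is PowerState.ON_GENERATOR:
        # Switch back once utility power has returned and is stable.
        return PowerState.ON_UTILITY if utility_ok else state
    raise ValueError(f"unknown state: {state}")
```

Throughout the waiting and starting states the UPS is carrying the critical load, which is why the transfer is transparent as long as the generator comes online before the UPS discharges.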

The utility failure sequence described above happens correctly almost every time. In fact, it occurs exactly as designed so frequently that most facilities will never see the fault mode we are looking at today. The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out.

This same fault mode caused the 34-minute outage at the 2013 Super Bowl (Super Bowl XLVII): The Power Failure Seen Around the World.

Backup generators run around 3/4 of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect that some customers would want it that way, I’ve never worked for one of those customers and the airline hit by this fault last summer certainly isn’t one of them either.

There are likely many possible causes of a power anomaly of sufficient magnitude to cause switch gear lockout, but the two events I’ve been involved with were both caused by cars colliding with aluminum street light poles that subsequently fell across two phases of the utility power. Effectively an excellent conductor landed across two phases of a high voltage utility feed.

One of the two times this happened, I was within driving distance of the data center and everyone I was with was getting massive numbers of alerts warning of a discharging UPS. We sped to the ailing facility and arrived just as servers were starting to go down as the UPSs discharged. With the help of the switch gear manufacturer and going through the event logs, we were able to determine what happened. What surprised me is that the switch gear manufacturer was unwilling to make the change to eliminate this lockout condition even if we were willing to accept all equipment damage that resulted from that decision.

What happens if the generator is brought into the load rather than locking out? In the vast majority of situations, and in 100% of those I’ve looked at, the fault is outside of the building and so the lockout has no value. If there was a ground fault in the facility, the impacted branch circuit breaker would open and the rest of the facility would continue to operate on generator and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact. If the fault was much higher in the power distribution system and without breaker protection or the breaker failed to open, I suspect a generator might take damage but I would rather put just under $1m at risk than be guaranteed that the load will be dropped. If just one customer could lose $100m, saving the generator just doesn’t feel like the right priority.
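The trade-off can be put into rough numbers. The sketch below uses the figures quoted in this post (a generator at about $0.75m and an outage at about $100m), but the probability that connecting the generator actually damages it is a purely hypothetical input. It also assumes pessimistically that a damaged generator means the load is dropped anyway, ignoring redundant generators.

```python
def expected_loss_lockout(outage_cost: float) -> float:
    """Loss once lockout has triggered: the load is dropped with certainty."""
    return outage_cost


def expected_loss_no_lockout(p_damage: float, generator_cost: float,
                             outage_cost: float) -> float:
    """Expected loss if the generator is brought online regardless.

    p_damage is the (hypothetical) probability that the fault really is an
    unprotected internal short. Worst case assumed: the generator is lost
    and the load is dropped as well.
    """
    return p_damage * (generator_cost + outage_cost)


def break_even_probability(generator_cost: float, outage_cost: float) -> float:
    """Damage probability at which both policies have equal expected loss."""
    return outage_cost / (generator_cost + outage_cost)


if __name__ == "__main__":
    generator, outage = 0.75e6, 100e6  # figures quoted in the post
    print(f"lockout:    ${expected_loss_lockout(outage):,.0f}")
    print(f"no lockout: ${expected_loss_no_lockout(0.05, generator, outage):,.0f} (assuming p_damage = 5%)")
    print(f"break-even p_damage: {break_even_probability(generator, outage):.2%}")
```

Even under the pessimistic assumption, skipping the lockout has the lower expected loss unless the fault is almost certain to be a real internal short, which matches the intuition that protecting the generator is the wrong priority.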

I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense so we solved this particular fault mode some years back. In our approach, we implemented custom control firmware such that we can continue to multi-source industry switch gear but it is our firmware that makes the load transfer decisions and, consequently, we don’t lock out.


64 comments on “At Scale, Rare Events aren’t Rare”

Greg says:

September 19, 2017 at 3:02 pm

9-19-2017
Hi James
Can you believe the vision of a super computer small enough to wear on one’s wrist is finally within our grasp??

Greg says:

September 18, 2017 at 7:34 pm

9-18-2017
Regarding your above article – In practice, the non-energized cable certainly could’ve been pulled through non-conductive, indestructible conduit rated for ground burial. Putting greed for money before good sense always ends up costing more in waste over the long haul, along with the related disasters that follow. Be well

Greg says:

September 18, 2017 at 6:46 pm

9-18-2017
Hi
Your above article is right on! The human is indeed the most dangerous animal walking the face of the earth! We have been a self-destructing species since the beginning of time. If one had to put a job security tag on something, engineering failures would be a delightful start.

Azamat says:

July 1, 2017 at 9:31 am

Hi, James. Can you please clarify about your suggestions that could fix problem (not taking risk for generator) or advise design engineering decision to prevent such problem?

- James Hamilton says:
  
  July 1, 2017 at 4:55 pm
  
  Azamat asked “Can you please clarify about your suggestions that could fix problem (not taking risk for generator) or advise design engineering decision to prevent such problem?”
  We don’t know with certainty what went wrong in this case but most of the reports indicate an operator error that caused a UPS to be disabled. The fast answer is that avoidance is easy: better training and higher skill levels. But it’s actually just about impossible to eliminate human error, so the better approach is to make the system resilient in the presence of human error.
  The classic availability architecture is to have the primary systems running in one data center and have a backup configuration in a different data center running active/passive. To ensure availability after even very large events, the second facility is usually 100s of miles away. Because of the distance, the secondary system runs as near to synced as possible but it can’t be fully synchronous. The latency between them is too great to run synchronous replication (can’t commit a write to both before proceeding).
  As a consequence, there is some transaction loss when failing over. Usually the system leaves enough tracks that administrators can bring the system back into sync without data loss but it takes work. The work to recover after failover is substantial and, even if no data is ultimately lost, there is a period of time when some data is missing and the customer experience isn’t great. Because of the work and the customer impact, it’s almost never tested. Because it’s (at most) rarely tested, there is a high probability that a config change was made on the primary and not the secondary. Or some software upgrade was made on one but not the other. Or the network topology is different. Or, since the secondary system isn’t always taking load, there is an accumulation of entropy and things aren’t running properly.
  In active/passive systems, someone has to make the tough call between failing over and taking the temporary data loss, some risk of worse, and a customer-visible event, or not failing over and working as fast as possible to recover the primary systems. In the heat of the moment, the wrong call will be made.
  The modern availability architecture is to run the application over two or preferably three data centers in active/active config. In this model, the facilities are only 10s of miles apart. They are very independent on power, cooling, and network but still able to synchronously commit, so all data is committed everywhere. There are sufficient resources available to completely shut down a data center without customer impact. Testing failure and taking a full data center down is not customer visible, so you can test. Because you can test, it works.
  Running apps over 2 to 3 (or more) data centers in active/active mode allows testing and allows aggressive failover whenever something doesn’t look right. It’s tested and so it works. Ironically, some of the world’s most mission critical workloads are still running active/passive while many less important workloads are running on the modern active/active model.
  
  If you use cloud computing, you don’t even need large scale to be able to run active/active in multiple data centers.
  
  Classic active/passive goes for years and sometimes forever without an ugly event but they happen and, when they do, they are really ugly and recovery takes days and sometimes longer. The risk/cost equation is super clear but these old architectures live on. I’ve seen two airlines report $100m outages this year.
  
  - peter X says:
    
    September 20, 2017 at 10:35 am
    
    Hi James. After a data center construction is completed, how long should the generator integrated adjustment testing take? 8 hours or 12 hours? What are the factors?
    
    - James Hamilton says:
      
      September 20, 2017 at 4:56 pm
      
      If you don’t test all systems and all backup systems, they won’t work when customers arrive. So, however long it takes to test every primary system and to force failure and test every secondary system.
      
- Fatih Mutlu says:
  
  July 23, 2017 at 9:55 pm
  
  There is also one more issue that has to be considered: the diesel fuel stored over the medium to long term for generators. Diesel fuel is an extremely unstable liquid, especially after the ULS regulations. In case of a power failure, if the fuel is not at optimum specs, the generators cannot kick on. This has been experienced world-wide in data centers a couple of times so far, and still there are no concrete preventive actions taken at TIER III or IV data centers. TIA and Uptime Institute suggest having fuel maintenance and purification systems but it is still “should”, not “shall”.
  
  - James Hamilton says:
    
    July 24, 2017 at 11:46 am
    
    I agree. Generators need frequent testing and periodic testing at full load. Fuel needs treatment, filtration, and testing to be trusted.
    
Brian Bulkowski says:

June 2, 2017 at 7:47 pm

James – your fascinating post is now being referred to vis a vis the British Airways fiasco.

In reference to that, I wonder about the difference between planning that minimizes a multi-day outage, as it appears BA had. All the public info seems to refer to “power surge related damage” being so terribly extensive to cause a multi-day recovery. In the BA case, it would seem that the goal shouldn’t be a UPS as much as the kind of isolation system you’re talking about that requires an electrical engineer on site before bringing equipment back online?

The “root cause” that seems to be discussed more is “why did it fail”, where I am very interested in “why did it fail for 20+ hours”.

( PS. Distributed systems software engineer here, with some background in practical wiring and fire effects )

- Chasm says:
  
  June 2, 2017 at 11:37 pm
  
  The BA incident is still quite unclear; there are two reports in the media.
  A single contractor working on switch gear and/or UPS systems disconnected the whole data center causing the initial outage.
  A single sysadmin restarted the servers too fast causing another outage, this time with “catastrophic physical damage” to various systems. This is again a contractor, this time brought in to pick up the pieces.
  
  Somewhere along the way data corruption happened, probably more than once.
  Obviously no word about management failure. After all it’s per every report a 1980s data center -and power system- what could possibly go wrong now that did not already fail in the past? ;)
  
  - James Hamilton says:
    
    June 3, 2017 at 6:48 am
    
    Yeah, a lot has been learned over the last couple of decades on highly reliable software architectures. Ironically, some of the most mission critical systems in the world continue to run active/passive over two data centers with async replication between them. Failover loses in flight transactions and generally takes a week to fully recover so it simply never gets tested. When something goes wrong, it’s a mess which is just about assured if you write a complex system and never test fault modes.
    
    These systems are fragile, small hardware problems can kill them, inevitable operational errors kill them, and once they go down, they are incredibly slow to recover and data loss/corruption is common. The better model is to configure the application as many independent application “slices” running active/active over multiple data centers with sufficient capacity that an entire building can be lost without availability impact.
    
  - Marcel says:
    
    June 3, 2017 at 11:29 am
    
    I noticed a lot of anger in the UK about the outsourcing by BA to India. Would not be surprised if a single anonymous source told a newspaper about this human error.
    The contracting firm responsible for the DC denied a human error. To me a human error seems very unlikely.
    
  - Marcel says:
    
    June 5, 2017 at 7:17 pm
    
    On June 5, the BBC reported that the CEO of IAG, parent company of BA, confirmed the power shutdown was a human error.
    “an electrical engineer disconnected the uninterruptible power supply which shut down BA’s data centre.”
    Ooooops
    http://www.bbc.com/news/business-40159202
    
    - James Hamilton says:
      
      June 8, 2017 at 5:38 am
      
      All things in engineering fail but humans are underrepresented in failure data. Nobody wants to say, “I had a deep look into the reason why my team lost the company $10s of millions and concluded that the team is insufficiently trained and we lack adequate operational procedures possibly due to poor leadership on my part.” Multiple capacitor failure due to a component batch problem coupled with over temperature just sounds a ton better and seems more actionable so you tend to hear more of the latter. Clearly both happen but human error is underrepresented in industry data.
      
- James Hamilton says:
  
  June 3, 2017 at 6:40 am
  
  You are onto a great point here, Brian. Mean time to failure is important but mean time to recovery is a much bigger lever when trying to improve availability. Systems do have failures so it’s way more important to recover them very quickly.
  
  An even more important lever to improve availability is to partition the system such that it fails in highly independent components and place those components in independent data centers where all are active. Most of these legacy “highly available” systems run active/passive across two data centers and they almost never go through failure testing. When something goes wrong it’s a mess. The best approach is many independent application “slices” running active/active over multiple data centers with sufficient capacity that an entire building can be lost without availability impact.
  
- Marcel says:
  
  June 3, 2017 at 11:27 am
  
  I spent a lot of time documenting the British Airways datacenter failure. The result is here
  http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/
  
  BA did not say much about what happened. As a result, the British press has come up with a lot of stories; most are based on a single source and most journalists do not understand what a UPS is.
  
  It seems BA was using a DRUPS (unconfirmed but BA is listed as Hitec customer on its website). In 2007 a Hitec DRUPS failed because of a bug in the Detroit Diesel Electronic Controller.
  
Marcel says:

June 1, 2017 at 9:07 pm

This is a very interesting presentation by Yahoo with many examples of datacenter failures. https://www.youtube.com/watch?v=iO2z3ttlpi4

Marcel says:

May 31, 2017 at 9:41 pm

I love to find out the root cause of datacenter failures as well. On May 27, 2017, one of British Airways’ datacenters went all black because of a UPS failure. I described in detail what happened: http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

- James Hamilton says:
  
  June 1, 2017 at 6:47 am
  
  I’ve read the British Airways problem was a software issue. By far the most common “data center” problems are software and human error. It’s the infrastructure at fault only rarely.
  
  - Marcel says:
    
    June 1, 2017 at 3:08 pm
    
    James, it was not a software fault. At least not software which runs on common servers. The UPS failed for some reason. I assume this is a dynamic UPS. For some reason, soon after power was restored, a power surge damaged many components like servers and networking equipment. See my blog for the details.
    
    - James Hamilton says:
      
      June 1, 2017 at 7:06 pm
      
      I was reading general press and the note was only a passing comment that they have a new software system that isn’t working and they have been down 5 times already in 2017. I didn’t dig deeper but it sounds like you did. Thanks for passing along your analysis, Marcel.
      
      - Marcel says:
        
        June 1, 2017 at 8:04 pm
        
        I was even quoted by BBC here.
        http://www.bbc.com/news/technology-40118386
        
        As a coincidence, also on May 27, another UK datacenter had power issues.
        https://forums.theregister.co.uk/forum/3/2017/05/26/major_incident_at_capita_data_centre/
        
        Capita’s Chief Executive Andy Parker wrote this email to clients
        
        “I am very sorry for the service experience and difficulties you have had in accessing the Teachers Pensions website. The system issues that recently impacted Teachers Pensions were caused by a failure of the power to re-instate cleanly back from our emergency generator which had been operating as required in one of our data centres following a power outage in the local area, this then caused damage across connectivity and other elements requiring replacement and recovery of systems. Our IT team have worked continuously since the incident with full support from Capita’s senior management to fix the issue and restore all Teacher Pensions services as safely as possible. Our number one priority was to ensure payment systems were enabled to process payments to teachers and this was successfully restored to ensure all payments due to members were made as expected. Our attention then moved to ensure that the website and other services were restored. All services are now restored and functioning satisfactorally [sic].”
        
        
        James Hamilton says:
        
        June 2, 2017 at 5:59 am
        
        Thanks Marcel.
        
        
        Marcel says:
        
        June 2, 2017 at 7:15 am
        
        Hi James. In the comments section of my blog someone posted a very detailed explanation of what could have gone wrong with the DRUPS.
        
        
        James Hamilton says:
        
        June 2, 2017 at 9:37 am
        
        DRUPS are a case in point where highly integrated complex systems are poor choices for the last line of defense. They just have so many different ways to drop the load and, for many of these failures, the manufacturer points out that if they were configured differently, they wouldn’t have failed. The problem is they have so many ways to fail and are tested so rarely that my conclusion has been I want a simpler last line of defense.
        
        I try very hard never to depend upon DRUPS as the last line of defense and, as a consequence, some may criticize my observation that they are too complex for a last line of load protection. I’ve heard that the load would have been held if this configuration was different or if this installation was different but, in the end, I just note that I’ve seen a disproportionately large number of failures in these units. Battery backed systems are simpler, easier to get right, and, on average, seem to produce better results. I do like all the upsides that come with DRUPS including power conditioning but, in the end, the final test is “did the system drop the load?” In DRUPS with mechanical clutches, I’ve seen clutch failure multiple times where the diesel fails to take the load and the UPS quickly exhausts. I’ve seen large MTU engines throw rods and dump the entire oil supply. Clearly the same failure mode can happen with generators but I would rather have a generator go down than an integrated UPS/Generator. Another one is flywheel vibration requiring the system to be shut down for service after it took the load from the utility. That service should have been done prior to the emergency but, in otherwise well maintained facilities, DRUPS seem to see more of these issues. Another nasty failure mode for at least some rotating UPS is a deep power sag that is not quite deep enough to disconnect the utility from the UPS. In this mode, the UPS supplies power to the grid, which exhausts the stored rotating energy incredibly quickly. Again, after weeks of investigation the manufacturer recommends a different config. Great and I appreciate the effort but, in the end, the load was dropped.
        
        I’m sure good engineering can make a rotating UPS a good choice but, at least for me, the simplicity and independence of batteries and separate generators with redundancy is appealing. For backup systems, I like simple, non-integrated components, with components that fail independently.
        
      - Matthias Fouquet-Lapar says:
        
        June 2, 2017 at 12:48 pm
        
        I just read in the Times
        
        “A power supply unit at the centre of last weekend’s British Airways fiasco was in perfect working order but was deliberately shut down in a catastrophic blunder, The Times has learnt.
        
        An investigation into the incident, which disrupted the travel plans of 75,000 passengers, is likely to centre on human error rather than any equipment failure at BA, it emerged.”
        
        which would confirm James’ theory that human error (however you twist the definition of error in this case) is one of the top root causes together with SW.
        
        
        Marcel says:
        
        June 2, 2017 at 1:07 pm
        
        A bit later the contracting firm denied human error https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim?CMP=share_btn_tw
        
Alan Brown says:

May 30, 2017 at 7:39 pm

Over the years I’ve seen a number of supply faults – not just at data centres – which have caused outages despite generators and UPSes

One that sticks in my mind is a single phase 11kV failure caused the generators to start but didn’t allow the cutovers to work, resulting in a large mountaintop TV transmitter burning up its supply transformer and a bunch of lesser radio/TV transmitters on the same site being cut off or being fed 480V on their 240V inputs depending on which phase they were fed from. That took several weeks to sort out and for the communities involved was highly visible.

Dual phase failures like you’ve described have caused similar problems. If there’s an overhead feed along the way then a phase to phase short should be part of the failure planning (as it is in the electrical grid).

There’s a _lot_ to be said for monitoring the quality of the incoming power and dropping a crowbar across the lot if it’s too far out of spec. Polyphase supplies should always be treated as having the potential to become wildly imbalanced as well as simply “dead”.

I get why the DC electrical installation outfits want to “play it safe”, but the reality is that they haven’t adequately modelled all supply mode failures and as such they need to be hauled across the coals. The “smaller” flywheel-based 2MW system at my current site can certainly cope with this kind of issue and has done on multiple occasions.

- James Hamilton says:
  
  May 31, 2017 at 8:07 am
  
  I agree with you Alan that many power system failure modes are not modeled and just about never tested.
  
  On your use of DRUPS to avoid problems, they too have problems and, perhaps I’m just too conservative, but I’ve actually come to strongly prefer the simplicity of battery based systems. I likely don’t have the experience that many have with rotating UPSs but I’ve seen a disproportionately large number of failures in these units. In systems with mechanical clutches, I’ve seen clutch failure multiple times where the diesel fails to take the load and the UPS quickly exhausts. I’ve seen large MTU engines throw rods and dump the entire oil supply. Clearly the same failure mode can happen with generators but I would rather have a generator go down than an integrated UPS/Generator. Another one is flywheel vibration requiring the system to be shut down for service after it took the load from the utility. Another nasty failure mode for at least some rotating UPS is a deep power sag that is not quite deep enough to disconnect the utility from the UPS. In this mode, the UPS supplies power to the grid, which exhausts the stored rotating energy incredibly quickly.
  
  I’m sure good engineering can make a rotating UPS a good choice but, at least for me, the simplicity and independence of batteries and separate generators with redundancy is appealing. For backup systems, I like simple, non-integrated components, with components that fail independently.
  
Robert Gusciora says:

April 13, 2017 at 1:47 am

James, I’m not entirely sure we understand the “lockout” as described by the major airline. Low Voltage Circuit Breakers (< 600V) usually have integral trip units that cause the breaker to open under a short circuit, overload or ground fault (downstream of them). Medium Voltage Breakers (1000V – 35,000V) use a type of external trip sensor called a protective relay. They usually are more intelligent than a low voltage integral trip unit, but basically serve the same purpose if the generator is not intended to be paralleled or connected to the Utility Service. If the generator is intended to be connected to the utility service, then, in the case as described, the generator would correctly not be allowed to be connected to a faulted utility service.

However, in the much more common and likely scenario of “open transition” transfer to generator, it is somewhat technically debatable if a generator (or other source of electrical power) should be allowed to connected to a faulted bus or load. Here, I agree with you on this point, as well as does the NEC with respect to ground fault detection that it can transfer to generator. A “standard & common ” automatic transfer switch (ATS) does not have this lockout functionality.

However, with a fault upstream of the Utility breaker, there is no reason whatsoever that this switchgear should “lockout”. I highly doubt the switchgear automation was designed to operate this way. If what is described is what happened, then one needs to investigate why the switchgear “locked out”. This could be due to:
a) improper facility wide or switchgear grounding and/or bonding,
b) a faulty trip unit or protection relay performing this function
c) did the programmable logic controller (plc) go down due to the outage and not perform the generator transfer function?,
d) is there a bug in the plc program?
e) did some other wiring, connection or control device fail?
f) generators “self protect” and lock out due to many conditions such as “over crank”, “over speed”, low oil pressure, low coolant level and many others. This condition could have been annunciated on the switchgear.

I just don’t see how one can conclude, based on what was published, that the switchgear locked out due to intentional transfer logic programming.

A thorough investigation by an Engineer and switchgear/controls technician would be required if no “smoking gun” fingerprints were found.

- James Hamilton says:
  
  April 13, 2017 at 11:01 am
  
  I’ve been involved with a couple of these events that got investigated in detail with both the utility and switch gear provider engineering teams and both events were caused by the issue you described as “open transition”. Your paragraph on this fault mode:
  
  “….in the much more common and likely scenario of “open transition” transfer to generator, it is somewhat technically debatable if a generator (or other source of electrical power) should be allowed to connected to a faulted bus or load. Here, I agree with you on this point, as well as does the NEC with respect to ground fault detection that it can transfer to generator.”
  
  You speculated that the switchgear would not be intentionally programmed to produce this behaviour. The problem the switch gear manufacturers face is that they want to detect shorts to ground inside the facility but they are not able to do this without some false positives. These events, which I’m referring to as lockout and you are referring to as “open transition”, are unavoidable while the switchgear is detecting the potential for a direct short to ground inside the facility. This PLC programming is intentional and, although the false positives are not really wanted, the switch gear providers report there is no economic way to reliably avoid them so the lockout issue remains.
  
  Reply
Steve Mushero says:

April 11, 2017 at 12:43 pm

Great article and analysis – I’ve seen this in factories, too – one time underground 4KV lines were just wet enough for ground-fault systems to sense it and trip, in part due to some badly-engineered (but operating for years prior) ground routing that, like your 2-out-of-3 short, was too sensitive to phase-imbalanced floating ground references.

We also lacked on-site HiPot gear to test, but we had a factory down (twice) and experience told us it was okay, so we manually threw 10MW 34.5KV switches while most of the team hid under cars hoping the house-sized transformers wouldn’t blow up if there was a real short.

One of a few actual or nearly explosive power situations I was involved in . . .

Reply
- James Hamilton says:
  
  April 11, 2017 at 12:51 pm
  
  I always give power engineers “room to work” when they are switching breakers.
  
  Reply
- James Hamilton says:
  
  April 11, 2017 at 3:04 pm
  
  I would have offered to stand in a completely different room while you test-engaged 10MW switch gear :-).
  
  Reply
Rick says:

April 10, 2017 at 9:48 pm

I still remember how a major data center in Colorado Springs had redundant power lines into the prem, but both of the power leads looped around into a parallel structure 30 feet outside the prem, and the backhoe got them both about 10 feet later.

Reply
- James Hamilton says:
  
  April 11, 2017 at 12:32 pm
  
  Yeah, weird things do happen. I was involved with a critical facility that had dual network feeds on diverse paths with good distance between them. A neighboring construction project managed to cut one network link and, before the first issue had been corrected, they somehow managed to cut the second network link. At that point, I was close to hoping they might find the utility feed :-).
  
  Reply
John Duffin says:

April 9, 2017 at 3:05 am

Hi James, this may be of interest to you:
https://journal.uptimeinstitute.com/examining-and-learning-from-complex-systems-failures/

Reply
- James Hamilton says:
  
  April 9, 2017 at 9:55 am
  
  Thanks John. It’s a pretty interesting article with lots of examples. The core premise is a simple one: complex systems often run in degraded operations mode, one where the safety margins are partially consumed by less than ideal management decisions and these are at least partly responsible for the subsequent systems failure.
  
  Reply
Angel Castillo says:

April 8, 2017 at 5:42 pm

Should datacenters have enough UPS capacity to allow for troubleshooting power issues further up the line? Like in the switch gear lockout event, a UPS with one hour capacity could’ve allowed for a manual switch without customers being affected. A 5 minute UPS seems to be cutting it pretty close to me in case of a genset failure, phasing issues or the lockout issue.
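A rough way to see what longer ride-through implies: the stored energy a UPS must deliver grows linearly with the runtime it has to cover at a given load. The sketch below assumes a hypothetical 1 MW critical load and ignores inverter losses and battery aging; the function name and all numbers are illustrative, not taken from any real facility.

```python
def ups_energy_kwh(load_kw, ride_through_min):
    """Energy (kWh) needed to carry load_kw for ride_through_min minutes.

    Idealized: no conversion losses, no derating for battery age.
    """
    return load_kw * ride_through_min / 60.0


if __name__ == "__main__":
    LOAD_KW = 1000.0  # hypothetical 1 MW critical load
    baseline = ups_energy_kwh(LOAD_KW, 5)
    for minutes in (5, 10, 60):
        energy = ups_energy_kwh(LOAD_KW, minutes)
        print(f"{minutes:>3} min: {energy:7.1f} kWh "
              f"({energy / baseline:.0f}x the 5 minute design)")
```

Under these assumptions, moving from a 5 minute to a one hour design means storing twelve times the energy for the same load.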

Reply
- James Hamilton says:
  
  April 9, 2017 at 9:48 am
  
  Your logic is sound that 5 to 10 min is not close to enough for a human solution. The problem is that doubling the capacity to 10 to 20 min is nearly twice the price and still doesn’t give much time. Extend the capacity by 10 times out to 50 to 100 min and you have a chance of fixing the problem but, even then, problems that can be safely addressed in 1 hour in power distribution systems and teams with all the right skills on staff 24×7 aren’t that common.
  
  Whenever I look at solving problems with longer UPS ride through times, I end up concluding it’s linear in expense but not all that effective at improving availability so I argue the industry is better off with redundancy and automated failover. It ends up being both more effective and more economic.
  
  An interesting side effect of the automated redundancy approach is, once you have that in place, you can ask an interesting question. Rather than 5 to 10 min of UPS, what do I give up in going with 2 to 3 min? Everything that is going to be done through automation is done in the first 2 to 3 min (with lots of safety margin) so why bother powering for longer? Shorter UPS times are getting to be a more common choice.
  
  Reply
  - Donough Roche says:
    
    April 12, 2017 at 1:58 pm
    
    In the vast majority of data center designs, the HVAC (cooling) is not supported by the UPS systems. The IT systems will experience some elevation of temperatures within the five minutes of UPS run time, but the overall design takes this into account and is matched to the runtime of your UPS. However, if you continue to provide power to the IT systems for an extended time while generator/switchgear issues are being investigated and fixed, increasing temperatures within the data center will take down your IT equipment. Adding UPS support to HVAC systems adds significant capital and operational costs that are superfluous when your upstream switchgear is designed, programmed, and tested to work when you need it to work.
    
    Reply
    - James Hamilton says:
      
      April 12, 2017 at 2:07 pm
      
      Yes, I agree. Another good reason why adding UPS ride through time isn’t the most effective approach.
      
      Reply
Anon says:

April 8, 2017 at 12:15 pm

Can you fix this horrible layout on this site? Text right up to the edge of the screen? Are you kidding me? It’s painful to read.

Reply
- James Hamilton says:
  
  April 8, 2017 at 12:18 pm
  
  Looks fine on all devices I’ve got around here. What device, operating system, and browser are you using?
  
  Reply
  - Tester says:
    
    April 10, 2017 at 4:03 pm
    
    The Engineer’s reply to a designer type. Hilarious.
    
    Reply
  - Kevin says:
    
    April 10, 2017 at 5:56 pm
    
    Windows 7, Google Chrome 57.0.2987.133, 1920 x1200
    Windows 7, Internet Explorer 11.0.9600.18617IS, 1920 x1200
    Windows 10, Edge, 38.14393.0.0, 1920 x1080
    
    Reply
    - James Hamilton says:
      
      April 10, 2017 at 8:04 pm
      
      Yeah, OK. I use the same blog software on both perspectives.mvdirona.com and mvdirona.com and on mvdirona.com there is a lot of content to show with both a blog and real time location display. I mostly access the site on a Nexus 9 and it actually has pretty reasonable margins. Looking at it on Windows under Chrome and IE, I generally get your point: the site would look better with more margin. I don’t really focus much on user interface (I’m more of an infrastructure guy) but I’ll add more margins on both mvdirona.com sites. Thanks.
      
      Reply
Vermont Fearer says:

April 5, 2017 at 8:05 pm

Really enjoyed this post and learned a lot about how data centers are designed. I had read about a similar case involving a rare event at an airline data center in Arizona that you might find interesting: https://www.azcourts.gov/Portals/45/16Summaries/November82016CV160027PRUS%20Airwayfinal.pdf

Reply
- James Hamilton says:
  
  April 6, 2017 at 1:17 pm
  
  Thanks for sending the Arizona Supreme Court filing on Qwest vs US Air. US Air lost but the real lesson here is they need a better supplier. One cut fiber bundle shouldn’t isolate a data center.
  
  Here’s a funny one I’ve not talked about before. Years ago I was out in my backyard (back when we still had a backyard) getting the garden ready for some spring planting when I found an old damaged black cable with some exposed conductors buried 6″ below the surface. Weird I hadn’t found it before but I cut it at the entry and exit of the garden and proceeded. Later that day I noticed our phone didn’t work but didn’t put the two events together. The next day I learned that the entire area had lost phone coverage the previous day and was still down. Really, wonder what happened there? :-)
  
  Reply
  - Ron P. says:
    
    April 8, 2017 at 5:09 pm
    
    I’m glad you are not “gardening” in my neighborhood!! :)
    
    That is the most hilarious admission from someone partially responsible for perhaps billions of dollars of communications infrastructure!!!!
    
    Reply
    - James Hamilton says:
      
      April 9, 2017 at 9:40 am
      
      I would argue that Qwest putting the entire neighborhood cable through my back yard, only burying it six inches, and not using conduit is kind of irresponsible. But, yes, it was my shovel that found it. I’m just glad it wasn’t the power company :-).
      
      Reply
Mark "employee" Evans says:

April 5, 2017 at 3:48 pm

Dear Sir Hamilton,
thanks for the article. Is it in reference to the 8/8/2016 failure (read “fire”) of a 22 year old “kit” in Delta’s Atlanta HQ, America’s second largest airline? Has the North American Electric Reliability Corporation (NERC) any implements to ensure switchgear manufacturers become more flexible?

Reply
- James Hamilton says:
  
  April 6, 2017 at 1:09 pm
  
  If NERC has taken a position on this failure mode, I’ve not seen it but they publish a wonderful resource in lessons learned to help operators and contractors understand different faults and how to avoid them.
  
  Reply
Florian Seidl-Schulz says:

April 5, 2017 at 9:24 am

You offered to take all legal responsibilities, if the generator went online, no matter what?
As in, humans harmed by continuous power supply to a short and contract damages upon prolonged outage due to generator destruction?

Reply
- James Hamilton says:
  
  April 5, 2017 at 10:32 am
  
  You raised the two important issues but human safety is the dominant one. Legal data centers have to meet jurisdictional electrical safety requirements and these standards have evolved over more than 100 years of electrical system design and usage. The standards have become pretty good, reflect the industry focus on safety first, and industry design practices usually exceed these requirements on many dimensions.
  
  The switch gear lockout is not required by electrical standards but it’s still worth looking at lockout and determining whether it adds to safety or reduces it for the benefit of the equipment. When a lockout event occurs, an electrical professional will have to investigate and they have two choices, with the first being by far the most common: they can try re-engaging the breaker to see if it was a transient event, or they can probe the conductors looking for faults. All well designed and compliant data centers have many redundant breakers between the load and the generator. Closing that breaker via automation allows the investigating electrical professional to have more data when they investigate the event.
  
  Equipment damage is possible but, again, well designed facilities are redundant with concurrent maintainability which means they have equipment off line for maintenance and then have an electrical fault and still safely hold the load. Good designs need to have more generators than required to support the entire facility during a utility fault. A damaged generator represents a cost but it should not lead to outage in a well designed facility.
  
  Human safety is a priority in all facilities. Equipment damage is not something any facility wants but, for many customers, availability is more important than possible generator repair cost avoidance.
  
  Reply
Ruprecht Schmidt says:

April 5, 2017 at 7:22 am

Loved the piece! I’m now inspired to ask someone about the switch gear situation at the data center where we host the majority of our gear. Thanks!

Reply
- James Hamilton says:
  
  April 5, 2017 at 10:15 am
  
  It’s actually fairly complex to chase down the exact details of the precise triggering events that cause different switch modes to be entered. Only the switch gear engineering teams really know the details, the nuances, and the edge cases that cause given switch modes to be entered. It’s hard data to get with precision and completeness.
  
  In many ways it’s worth asking about this fault mode but, remember, this really is a rare event. It’s a super unusual facility where this fault mode is anywhere close to the most likely fault to cause down time. Just about all aspects of the UPSs need scrutiny first.
  
  Reply
A says:

April 5, 2017 at 5:13 am

The 2102 Super Bowl sounds very interesting, we’ll have to see how it goes.

Back on topic, if there was such a fault, I have to wonder if more than mere equipment damage might be at stake in some cases. I suspect they also want to limit their own liability for doing something dangerous.

Reply
- James Hamilton says:
  
  April 5, 2017 at 10:07 am
  
  Thanks, I fixed the 2102 typo.
  
  Human risk factors are the dominant concern for the data center operators and equipment operators. Data centers have high concentrations of power but these concerns are just as important in office buildings, apartment buildings, and personal homes and that’s why we have jurisdictional electrical standards designed to reduce the risk directly to occupants and operators and indirectly through fire. The safeguards in place are important and required by all jurisdictions, but these safety regulations do not include switch lock out. All data centers have 5+ breakers between the generator and the load. There are breakers at the generator, the switch gear itself, the UPS, and downstream in some form of remote power panel and, depending upon the design, many more locations. As an industry we have lots of experience in electrical safety and the designs operate well even when multiple faults are present because they all have many layers of defense.
  
  Let’s assume that the switch gear lockout is a part of this multi-layered human defense system even though not required by electrical codes. Is it possible that this implementation is an important part of why modern electrical systems have such an excellent safety record? With the lockout design, the system goes dark and professional electrical engineers are called. Many critical facilities have electrical engineers on premise at all times but, even then, it’ll likely take more time than the UPS discharge time to get to the part of the building that is faulting. The first thing a professional engineer will do when investigating an electrical system switch gear lockout is to re-engage the breaker and see if it was a transient event or is an actual on-premise issue. Another investigative possibility is to probe the system for ground fault but most professionals choose to engage the breaker first. That seems like a prudent first choice: probing potentially hot conductors is not a task best taken on under time pressure and, in 99 percent of cases, the event is outside the facility and just re-engaging the breaker is safer than probing.
  
  Doing this first level test of re-engaging the open breaker through automation has the advantage of 1) not dropping the load in the common case, and 2) not requiring a human to be at the switch gear to engage it in a test. I hate closing 3,000A breakers and, if I personally have to do it, I always stand beside them rather than in front of the breaker. As safe as it is, it’s hard to feel totally comfortable with loads that high. Doing the first level investigation in automation reduces human risk and puts more information on the table for the professional engineer who will investigate the issue. Of course, all issues, whether resolved through automation or not, still need full root cause investigation.
  
  Reply
David says:

April 5, 2017 at 3:56 am

“the 2102 super bowl…” So you ARE from the future… Busted!

Reply
- James Hamilton says:
  
  April 5, 2017 at 9:37 am
  
  I like to think of myself as forward thinking but I might have gotten carried away in predicting the 2102 Super Bowl :-). Thanks, I fixed that one.
  
  Reply
Denis Altudov says:

April 4, 2017 at 11:29 pm

A more, ahem, “pedestrian” story along the same lines.

Electric bikes have batteries which may overheat under certain conditions such as weather, load, manufacturing defects, motor problems, shorts, etc. The battery controller in this case is programmed to cut the power, saving the battery from overheating and/or catching fire. Fair enough, right? A fire is avoided, the equipment is saved, and the user just coasts along for a while and comes to a safe stop.

Fast forward to the “hoverboard” craze – the self-balancing boards with two wheels, one on each side. The batteries and controllers have been repurposed in a hurry to serve the new hot market. When a battery overheats the controller cuts the power to the motor, the self-balancing feature turns off and the user face-plants into the pavement. But the $100 battery is saved!
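The contrast can be sketched in a few lines. This is only an illustration of the behaviour described above, not real controller firmware: the 60 °C cutoff, the function names, and the outcome strings are all hypothetical.

```python
def motor_power_allowed(battery_temp_c, cutoff_c=60.0):
    """Hypothetical controller rule: cut motor power at or above the cutoff."""
    return battery_temp_c < cutoff_c


def rider_outcome(stable_without_power, battery_temp_c):
    """What happens to the rider when the same rule runs on different vehicles."""
    if motor_power_allowed(battery_temp_c):
        return "keeps riding"
    # Same controller decision, very different consequence.
    if stable_without_power:
        return "coasts to a safe stop"   # e-bike: still a bicycle without power
    return "falls"                       # hoverboard: balance depends on the motor


if __name__ == "__main__":
    for name, stable in (("e-bike", True), ("hoverboard", False)):
        print(name, "->", rider_outcome(stable, battery_temp_c=75.0))
```

The controller rule is identical in both cases; only the stability of the vehicle without power changes the result.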

Sadly, I don’t have Amazon’s scale to rewrite the battery controller firmware, which is why I led my post with the word “pedestrian”. Off I go for a stroll.

Cheers.

Reply
- James Hamilton says:
  
  April 5, 2017 at 10:47 am
  
  Hey Denis. Good to hear from you. Lithium-Ion batteries have massive power density and just over the last few months have been in the news frequently with reports of headphones suffering from an explosive discharge and cell phones catching fire. The hoverboard mishaps have included both the permanent battery lockout you describe and also fire from not isolating faulty cells.
  
  Safety around Li-Ion batteries, especially large ones, is super important. Good battery designs include inter-cell fusing, some include a battery wide fuse, and include charge/discharge monitoring firmware that includes temperature monitoring. Some of the more elaborate designs include liquid cooling. Tesla has been particularly careful in their design partly since they couldn’t afford to have a fault in the early days but mostly because they were building a positively massive battery with 1,000s of cells.
  
  Good Li-ion battery designs use stable chemistries and have fail-safe monitoring with inter-cell fusing and often battery-wide fusing. These safety systems will cause the odd drone to crash and may cause sudden drive loss on hoverboards. In the case of hoverboards, because the basic system is unstable without power, a good design would have to ensure that there is sufficient reserve energy to safely shut down on battery fault. I’m sure this could be done but, as you point out, it usually wasn’t.
  
  My take is that transportation vehicles that are unstable or unsafe when not powered are probably just not ideal transportation vehicles. I’m sure hoverboards could be designed with sufficient backup power to allow them to shut down safely but the easy mitigation is get an electric bike :-)
  
  Reply
