Observations on Errors, Corrections, & Trust of Dependent Systems

Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need application-level network error detection?” Another frequent question is “non-ECC motherboards are much cheaper — do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere, and absolutely nobody and nothing can be trusted.
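To make the “trust nothing below you” point concrete, here is a minimal sketch of application-level, end-to-end checksumming. It is illustrative only — the record format and function names are invented for the example, not taken from any particular product: store a checksum with the data when it is written and verify it on every read, regardless of what the disk, controller, or network stack already claim to guarantee.

```python
# Minimal sketch of end-to-end, application-level checksumming: store a CRC32
# with each record on write, verify it on every read. Record format and names
# are invented purely for illustration.
import json
import zlib


def write_record(path: str, payload: bytes) -> None:
    record = {"crc32": zlib.crc32(payload) & 0xFFFFFFFF, "data": payload.hex()}
    with open(path, "w") as f:
        json.dump(record, f)


def read_record(path: str) -> bytes:
    with open(path) as f:
        record = json.load(f)
    payload = bytes.fromhex(record["data"])
    if (zlib.crc32(payload) & 0xFFFFFFFF) != record["crc32"]:
        # Something below us -- disk, firmware, driver, OS, or our own earlier
        # code -- corrupted the data. Fail loudly rather than return garbage.
        raise IOError(f"checksum mismatch in {path}: latent corruption detected")
    return payload
```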

Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.

On one high-scale, on-premise server product I worked on, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps the corruption was introduced by hardware, perhaps firmware, or possibly software. It could even have been introduced by one of our previous releases when those pages were last written. Some of these pages may not have been written for years.

I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating system has millions of lines of code. And our application had millions of lines of code. Any of us can screw up, each layer has an opportunity to corrupt, and it’s highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually running.

Another example: a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting errors. Over the course of several months, the result was somewhere between amazing and frightening. ECC fires constantly.

The immediate lesson is you absolutely do need ECC in server applications, and it is just about crazy to even contemplate running valuable applications without it. The extension of that lesson is to ask what is really different about clients. Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, is often worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems, including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded-systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given that the cost at scale is low, ECC memory should be part of all client systems.
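For readers who have not looked at how ECC actually corrects errors, the toy Hamming(7,4) sketch below makes the mechanism concrete. Real server ECC is typically a SECDED code over 64-bit words (for example, a 72,64 code), and nothing here describes any particular memory controller, but the principle of computing a syndrome that points at the flipped bit is the same.

```python
# Toy Hamming(7,4) single-error-correcting code, to illustrate the idea behind
# DRAM ECC. Positions 1, 2, and 4 hold parity bits; 3, 5, 6, 7 hold data bits.
def hamming74_encode(d):                     # d: list of 4 data bits
    c = [0] * 8                              # 1-indexed codeword; c[0] unused
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]                # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]                # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]                # parity over positions 4,5,6,7
    return c[1:]


def hamming74_correct(code):                 # code: list of 7 bits
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4          # position of the flipped bit; 0 = clean
    if syndrome:
        c[syndrome] ^= 1                     # correct the single-bit error
    return [c[3], c[5], c[6], c[7]], syndrome


# Flip any single bit of a codeword and the decoder still recovers the data:
word = hamming74_encode([1, 0, 1, 1])
word[2] ^= 1                                 # simulate a bit flip at position 3
data, syndrome = hamming74_correct(word)
assert data == [1, 0, 1, 1] and syndrome == 3
```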

Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week, learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt, both of which roughly translate to Phobos Ground) was designed to, amongst other objectives, return soil samples from the Martian moon Phobos. The mission was launched atop a Zenit-2SB launch vehicle, lifting off from the Baikonur Cosmodrome at 2:16am on November 9th, 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay subsequently sent the spacecraft plunging back to earth, a fiery end to what was a very expensive mission.

What went wrong aboard Phobos-Grunt? On February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft “Phobos-Grunt”. Of course, the document was released in Russian, but Google Translate actually does a very good job with it. And IEEE Spectrum reported on the failure as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary, and the translated Russian report offers more detail if you are interested in digging deeper.

The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially, both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory (SRAM) failure. The computers were part of a newly developed flight control system that had focused on dropping the mass of the flight control system from 30 kg (66 lbs) to 1.5 kg (3.3 lbs). Less weight in flight control is more weight that can go to payload, so these gains are important. However, this new flight control system was blamed for delaying the mission by two years and for the eventual demise of the mission.

The two flight control computers are identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau. The official postmortem reports that both computers suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMs are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for 512K words of memory with 32-bit-wide access, “V” for the improvement mark, “20” for 20ns memory access time, “G24” for the package type, and “M” indicating a military-grade part.

In the paper “Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices”, Joe Benedetto reports that these SRAM devices are very susceptible to “latchup”, a condition which requires power cycling to return to operation and can be permanent in some cases. Steven McClure of the NASA Jet Propulsion Laboratory, leader of the Radiation Effects Group, reports these SRAM parts would be very unlikely to be approved for use at JPL (Did Bad Memory Chips Down Russia’s Mars Probe).

It is rare that even two failures will lead to disaster, and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode”, where the vehicle attempts to stay stable in low earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system stabilizes itself in an extreme condition to allow flight control personnel back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation would likely have been fairly simple: restarting both computers (which probably happened automatically) and resuming the mission should have been sufficient.

Unfortunately, there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially, if the system needs to go into safe mode while it is still in earth orbit, the mission is lost, because ground control will never be able to command it out of safe mode.
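As a toy illustration of the invariant this design fault violated (this is obviously not flight software, and every name below is invented), the safe-mode pattern only works if the command path survives the transition: safe mode may shed nearly everything else, but ground control must still be able to command the vehicle back out.

```python
# Toy sketch of the safe-mode invariant: shed non-essential loads on a double
# fault, but never the command uplink. All names are invented for illustration.
from enum import Enum, auto


class Mode(Enum):
    NOMINAL = auto()
    SAFE = auto()


class FlightComputer:
    def __init__(self) -> None:
        self.mode = Mode.NOMINAL
        self.command_uplink_enabled = True

    def on_double_memory_fault(self) -> None:
        # Enter safe mode: stabilize, point solar arrays at the sun, shed
        # non-essential loads -- but keep the command uplink alive.
        self.mode = Mode.SAFE
        assert self.command_uplink_enabled, "safe mode must remain commandable"

    def on_ground_command(self, command: str) -> None:
        if not self.command_uplink_enabled:
            raise RuntimeError("vehicle unreachable in safe mode: mission lost")
        if command == "restart":
            self.mode = Mode.NOMINAL


# Ground can still recover the vehicle after a double fault:
fc = FlightComputer()
fc.on_double_memory_fault()
fc.on_ground_command("restart")
assert fc.mode is Mode.NOMINAL
```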

I find this last fault fascinating. It seems that smart people could never make such an obvious mistake, and yet this sort of design flaw shows up all the time in large systems. Experts in each vertical area or component do good work. But the interactions across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may go unseen. Good people design good components, and yet there often exist obvious fault modes across components that get missed.

Systems complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows its component well, but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) very experienced, highly skilled engineers on the team focused on understanding inter-component interactions and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.

The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.

Summarizing some of the lessons from this loss:

- The SRAM chip probably was a poor choice.
- The computer systems should restart, scrub memory for faults, and be able to detect corrupted code and reload it from secondary locations before going into safe mode.
- Safe mode has to actually allow mitigating actions to be taken from a ground station or it is useless.
- Software systems should be constantly scrubbing memory for faults and checksumming the running software for corruption (a rough sketch of this idea follows below). A tiny amount of processor power spent on continuous, redundant checking, and a few more lines of code to implement simple recovery paths when a fault is encountered, might have saved the mission.
- Finally, we all have to remember the old adage “nothing works if it is not tested.” Every major fault mode has to be tested. Error paths are the ones most commonly left untested, so it is particularly important to focus on them. The general rule is to keep error paths simple, use as few as possible, and test them frequently.
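As a rough sketch of the scrubbing lesson above (the names, the repair policy, and the choice of SHA-256 are all illustrative assumptions, not a description of any real flight or server system), the core loop is small: keep a checksum and a redundant known-good copy of each critical region, re-verify continuously, and repair from the copy on any mismatch.

```python
# Rough sketch of continuous scrubbing: keep a checksum and a redundant copy
# of each critical region, re-verify on every pass, and repair on mismatch.
import hashlib
import time


def scrub(regions: dict, golden: dict, rounds: int = 10, interval_s: float = 1.0):
    """regions: name -> bytearray (live memory); golden: name -> bytes (known-good copy)."""
    expected = {name: hashlib.sha256(bytes(golden[name])).digest() for name in regions}
    for _ in range(rounds):
        for name, live in regions.items():
            if hashlib.sha256(bytes(live)).digest() != expected[name]:
                live[:] = golden[name]        # repair from the redundant copy
                print(f"scrubber: repaired corrupted region '{name}'")
        time.sleep(interval_s)
```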

Back in 2007, I wrote up a set of best practices on the software design, testing, and operations of high-scale systems: On Designing and Deploying Internet-Scale Services. The paper targets large-scale services, but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follows from that complexity.

This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block, and have well-designed and well-tested failure paths for even unlikely events. Rather than building complex recovery logic for the near-infinite number of possible faults, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that flip bits and corrupt data, and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
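One way to act on the “flip bits and corrupt data” advice is a small fault-injection harness along the lines sketched below (the harness and its names are invented for illustration; a real one would also inject faults at the block-device, network, and memory layers): corrupt protected data at random and confirm the verification path actually fires.

```python
# Sketch of a bit-flip fault-injection test: corrupt CRC-protected blocks at
# random and confirm every injected single-bit fault is detected.
import random
import zlib


def inject_bit_flip(payload: bytes) -> bytes:
    corrupted = bytearray(payload)
    bit = random.randrange(len(corrupted) * 8)
    corrupted[bit // 8] ^= 1 << (bit % 8)     # flip one randomly chosen bit
    return bytes(corrupted)


for _ in range(10_000):
    block = random.randbytes(4096)            # requires Python 3.9+
    crc = zlib.crc32(block)
    assert zlib.crc32(inject_bit_flip(block)) != crc, "corruption went undetected"
print("all injected single-bit faults were detected")
```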

To dig deeper into the Phobos-Grunt loss, see the official accident report, The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft “Phobos-Grunt” (in Russian), and the IEEE Spectrum article Did Bad Memory Chips Down Russia’s Mars Probe, both referenced above.


James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

10 comments on “Observations on Errors, Corrections, & Trust of Dependent Systems”
  1. Marco, thanks for posting the SOSP summary (http://www.sigops.org/sosp/sosp11/posters/summaries/sosp11-final43.pdf). The paper is fairly brief so it would be good to get more data on what you are doing but it sounds like an interesting approach. Thanks for pointing it out.

    –jrh

  2. Vikram, I agree that forcing inter-component failures and introducing data packet corruption are effective tools in testing complex, multi-component systems. Automated systems that can be run at scale are a great investment.

    –jrh

  3. This post and Steve’s slides point out an important aspect that too many system engineers ignore: using hardware-level error detection (e.g. ECC and RAID) is good to prevent silent data corruptions, but not sufficient. One should not assume that the hardware/firmware/drivers/OS stack is completely reliable. Beyond unreliable disks, there are many SW and HW components in the data path, and each of them may introduce corruptions.

    1) A large-scale study on memory errors at Google (Schroeder et al., SIGMETRICS’09) shows that equal memory modules can have significantly different error rates depending on server configuration (i.e. board, HDD) and on the CPU load.

    2) Silent data corruptions may be caused by design faults, not only by hardware glitches. Even motherboards can have such problems (see for example http://don.blogs.smugmug.com/2007/07/25/silent-data-corruption-on-amd-servers/). ECC does not help here.

    My take is that applications should be designed to guarantee end-to-end protection against silent data corruptions, but systematically placing checks in SW is difficult and error-prone. You might be interested in a new library we have developed to execute systematic state integrity and control flow checks automatically. It prevents error propagation among processes and is based on a formal fault model of data corruptions called ASC, for Arbitrary State Corruptions. ASC-hardening does not assume trusted SW or HW components, has good performance (significantly better than BFT) and it does not require replication. Some preliminary results have been presented at SOSP’11 (see http://www.sigops.org/sosp/sosp11/posters/summaries/sosp11-final43.pdf). We are going to open source the library very soon.

  4. Vikram says:

    James, agreed. I think BFT in its general form is pretty complex, but it’s intriguing to note how what Lamport wrote in 82 is so relevant in modern scale-out datacenters simply because failure is the norm, not an exception at that scale. IOW, *despite* all the error checks you may have in place, you still need a method to trust your own system. I don’t have references handy and you probably already do this at AWS, but one of the areas BFT is useful is monitoring the integrity of the system, or verify your verification methods if you will. Introduce controlled failures and run some sort of consensus algorithms to verify if the system can converge. IMO, forced failures and the ability to recover from them in large datacenters are the equivalent of introducing bit errors and verifying if ECC works as desired. Interesting stuff!

  5. Thanks for the comment, Vikram. You were asking about BFT. You are right that BFT is the logical extension of what I’m arguing for in the article. But it’s a big extension and in many cases is more extreme than feels practical. I’m not saying I’m right, but the conclusion I’ve personally come to is that "trust nobody and nothing" is a balance across several dimensions. I look at the frequency of a fault, the possible negative impact of a fault, and the cost to protect against it. BFT in its purest form feels extreme to me, but these are judgement calls and each is circumstance dependent.

    I smiled when you asked about the S3 outage of 4 years ago. That was a long time ago and there are now 3/4 of a trillion objects stored. It’s pretty cool that the last memorable outage was back in 2008. The team is super focused on hardening the services against even the most unlikely events. Because S3 is so widely trusted, failures are very visible.

    The 2008 S3 event is an excellent example in that the problem was caused by a corrupted control plane message. At the time, most data exchanges in S3 were checksum protected, but this control message protocol was not, and that is what led to the problem. As you point out, it’s a good example for this discussion. The post mortem is fairly detailed: http://status.aws.amazon.com/s3-20080720.html.

    –jrh

  6. pedant says:

    "on-premises" please! "on-premise" leads to the question of what premise the product was on – hopefully the premise that it works?

  7. Vikram says:

    Thanks for the valuable insights and references, makes for a great read! Would have loved to see the more general reference to Byzantine Fault Tolerance, as well as your thoughts on more practical BFT. Wouldn’t hurt to hear more about the solutions behind the S3 outage :-)

  8. Steve, thanks for the comment and sorry about the transient site error.

    The set of slides Steve pointed to are quite good: http://www.slideshare.net/steve_l/did-you-reallywantthatdata

    –jrh

  9. A blog comment that was emailed to me due to blog site error:

    Sent: Monday, February 27, 2012 4:19 AM
    To: jrh@mvdirona.com
    Subject: error correction

    James, I tried to add a comment on your blog, but there was an error …

    1. I agree with all your points on ECC, esp. client side.
    You should read (or if you have, point to) Nightingale et al., Cycles, Cells and Platters

    2. Here are my most recent slides on the issues in the DC and the implications for Hadoop
    http://www.slideshare.net/steve_l/did-you-reallywantthatdata

    Even with ECC and replication, the data is at risk if you aren’t careful, and everyone is complacent about networking.

    On a more positive note, the latest CRC opcode built in to the new Xeons makes CRC checks & verifies very, very fast, so there’s no reason not to add it to application-side protocols like Protobuf and Thrift

    -Steve
