Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.
On one high scale, on-premise server product I worked upon, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps it was introduced by hardware, perhaps firmware, or possibly software. It could have even been corruption introduced by one of our previous release when those pages where last written. Some of these pages may not have been written for years.
I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating systems has millions of lines of code. And our application had millions of lines of code. Any of us can screw-up, each has an opportunity to corrupt, and its highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually currently running.
Another example. In this case, a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting. Over the course of several months, the result was somewhere between amazing and frightening. ECC is firing constantly.
The immediate lesson is you absolutely do need ECC in server application and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week and learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt both of which roughly translate to Phobos Ground) was a space mission designed to, amongst other objectives, return soil samples from the Martian moon Phobos. This mission was launched atop the Zenit-2SB launch vehicle taking off from the Baikonur Cosmodrome 2:16am on November 9th 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay has subsequently sent the satellite plunging to earth in a fiery end of what was a very expensive mission.
What went wrong aboard Phobos-Grunt? February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft "Phobos-Grunt". Of course, this document is released in Russian but Google Translate actually does a very good job with it. And, IEEE Spectrum Magazine reported on the failing as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary and the translated Russian article offers more detail if you are interested in digging deeper.
The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory failure. The computer was part of the newly-developed flight control system that had focused on dropping the mass of the flight control systems from 30 kgs (66 lbs) to 1.5 kgs (3.3 lbs). Less weight in flight control is more weight that can be in payload, so these gains are important. However, this new flight control system was blamed for the delay of the mission by 2 years and the eventual demise of the mission.
The two flight control computers are both identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau
Phobos Grunt Design). The official postmortem reports that both computers
suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMS are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512k memory by 32 bit wide access, “V” is the improvement mark, “20” for 20ns memory access time, “G24” is the package type, and “M” indicates it is a military grade part.
In the paper "
Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices"
Joe Benedetto reports that these SRAM packages are very susceptible to “latchup”, a condition which requires power recycling to return to operation and can be
permanent in some cases. Steven McClure of NASA Jet Propulsion Laboratory is the leader of the Radiation Effects Group.
He reports these SRAM parts would be very unlikely to be approved for use at JPL
(Did Bad Memory Chips Down Russia’s Mars Probe).
It is rare that even two failures will lead to disaster and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system is able to stabilize itself in an extreme condition to allow flight control personal back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation is likely fairly simple in just restarting both computers (which probably happened automatically) and restarting the mission would likely have been sufficient.
Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.
I find this last fault fascinating. Smart people could never make such an obviously incorrect mistake, and yet this sort of design flaw shows up all the time on large systems. Experts in each vertical area or component do good work. But the interaction across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may not get seen. Good people design good components and yet there often exist obvious fault modes across components that get missed.
Systems sufficiently complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) and very experienced, highly-skilled engineer(s) on the team focusing on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.
The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.
Summarizing some of the lessons from this loss: The SRAM chip probably was a poor choice. The computer systems should restart, scrub memory for faults, and be able to detect and load corrupt code from secondary locations before going into safe-mode. Safe-mode has to actually allow mitigating actions to be taken from a ground station or it is useless. Software systems should be constantly scrubbing memory for faults and check-summing the running software for corruption. A tiny amount of processor power spent on continuous, redundant checking and a few more lines of code to implement simple recovery paths when fault is encountered may have saved the mission. Finally we have to all remember the old adage “nothing works if it is not tested.” Every major fault has to be tested. Error paths are the common ones to not be tested so it is particularly important to focus on them. The general rule is to keep error paths simple, use the fewest possible, and test frequently.
Back in 2007, I wrote up a set of best practices on software design, testing, and operations of high scale systems:
On Designing and Deploying Internet-Scale Services. This paper targets large-scale services but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follow from that complexity.
This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
To dig deeper in the Phobos-Grunt loss:
b: http://blog.mvdirona.com /
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.