In past posts such as Web Search Using Small Cores I’ve said “Atom is a wonderful processor but current memory managers on Atom boards don’t support Error Correcting Codes (ECC) nor greater than 4 gigabytes of memory. I would love to use Atom in server designs but all the data I’ve gathered argues strongly that no server workload should be run without ECC.” And, in Linux/Apache on ARM Processors I said “unlike Intel Atom based servers, this ARM-based solution has the full ECC Memory support we want in server applications (actually you really want ECC in all applications from embedded through client to servers”
An excellent paper was just released that puts hard data behind this point and shows conclusively that ECC is absolutely needed. In DRAM Errors in the Wild: A Large Scale Field Study, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber show conclusively that you really do need ECC memory in server applications. Wolf was also an author of the excellent Power Provisioning in a Warehouse-Sized Computer that I mentioned in my blog post Slides From Conference on Innovative Data Systems Research where the authors described a technique to over-sub subscribe data center power.
I continue to believe that client systems should also be running ECC and strongly suspect that a great many kernel and device driver failures are actually the result of memory fault. We don’t have the data to prove it conclusively from a client population but I’ve long suspected that the single most effective way for Windows to reduce their blue screen rate would be to require ECC as a required feature for Windows Hardware Certification.
Returning to servers, in Kathy Yelick’s ISCA 2009 keynote, she showed a graph that showed ECC recovery rates (very common) and noted that the recovery times are substantial and the increased latency of correction is substantially slowing the computation (ISCA 2009 Keynote I: How to Waste a Parallel Computer – Kathy Yelick).
This more recent data further supports Kathy’s point, includes wonderfully detailed analysis and concludes with:
· Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778 –25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. The conclusion we draw is that error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors. In fact, we found that platforms with more powerful error codes (chipkill versus SECDED) were able to reduce uncorrectable error rates by a factor of 4–10 over the less powerful codes. Nonetheless, the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.
· Conclusion 2: Memory errors are strongly correlated. We observe strong correlations among correctable errors within the same DIMM. A DIMM that sees a correctable error is 13–228 times more likely to see another correctable error in the same month, compared to a DIMM that has not seen errors. There are also correlations between errors at time scales longer than a month. The autocorrelation function of the number of correctable errors per month shows significant levels of correlation up to 7 months. We also observe strong correlations between correctable errors and uncorrectable errors. In 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or the previous month, and the presence of a correctable error increases the probability of an uncorrectable error by factors between 9–400. Still, the absolute probabilities of observing an uncorrectable error following a correctable error are relatively small, between 0.1–2.3% per month, so replacing a DIMM solely based on the presence of correctable errors would be attractive only in environments where the cost of downtime is high enough to outweigh the cost of the expected high rate of false positives.
· Conclusion 3: The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to replacements). Given that DRAM DIMMs are devices without any mechanical components, unlike for example hard drives, we see a surprisingly strong and early effect of age on error rates. For all DIMM types we studied, aging in the form of increased CE rates sets in after only 10–18 months in the field. On the other hand, the rate of incidence of uncorrectable errors continuously declines starting at an early age, most likely because DIMMs with UEs are replaced (survival of the fittest).
· Conclusion 4: There is no evidence that newer generation DIMMs have worse error behavior. There has been much concern that advancing densities in DRAM technology will lead to higher rates of memory errors in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with newer generations. In fact, the DIMMs used in the three most recent platforms exhibit lower CE rates, than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling.
· Conclusion 5: Within the range of temperatures our production systems experience in the field, temperature has a surprisingly low effect on memory errors. Temperature is well known to increase error rates. In fact, artificially increasing the temperature is a commonly used tool for accelerating error rates in lab studies. Interestingly, we find that differences in temperature in the range they arise naturally in our fleet’s operation (a difference of around 20C between the 1st and 9th temperature decile) seem to have a marginal impact on the incidence of memory errors, when controlling for other factors, such as utilization.
· Conclusion 6: Error rates are strongly correlated with utilization.
· Conclusion 7: Error rates are unlikely to be dominated by soft errors. We observe that CE rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the data path. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation. Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors. Conclusion 7 might also explain the significantly higher rates of memory errors we observe compared to previous studies.
Based upon this data and others, I recommend against non-ECC servers. Read the full paper at: DRAM Errors in the Wild: A Large Scale Field Study. Thanks for Cary Roberts for pointing me to this paper.