Wednesday, October 07, 2009

In past posts such as Web Search Using Small Cores I’ve said “Atom is a wonderful processor but current memory managers on Atom boards don’t support Error Correcting Codes (ECC) nor greater than 4 gigabytes of memory. I would love to use Atom in server designs but all the data I’ve gathered argues strongly that no server workload should be run without ECC.” And, in Linux/Apache on ARM Processors I said “unlike Intel Atom based servers, this ARM-based solution has the full ECC Memory support we want in server applications (actually you really want ECC in all applications from embedded through client to servers).”

 

An excellent paper was just released that puts hard data behind this point. In DRAM Errors in the Wild: A Large Scale Field Study, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber show conclusively that you really do need ECC memory in server applications. Wolf was also an author of the excellent Power Provisioning for a Warehouse-Sized Computer that I mentioned in my blog post Slides From Conference on Innovative Data Systems Research, where the authors described a technique to over-subscribe data center power.

 

I continue to believe that client systems should also be running ECC and strongly suspect that a great many kernel and device driver failures are actually the result of memory faults. We don’t have the data to prove it conclusively from a client population, but I’ve long suspected that the single most effective way for Windows to reduce its blue screen rate would be to make ECC a required feature for Windows Hardware Certification.

 

Returning to servers, in her ISCA 2009 keynote, Kathy Yelick showed a graph of ECC recovery rates (recoveries are very common) and noted that the recovery times are substantial and that the added latency of correction substantially slows the computation (ISCA 2009 Keynote I: How to Waste a Parallel Computer – Kathy Yelick).


This more recent data further supports Kathy’s point, includes wonderfully detailed analysis and concludes with:

 

·         Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778 –25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. The conclusion we draw is that error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors. In fact, we found that platforms with more powerful error codes (chipkill versus SECDED) were able to reduce uncorrectable error rates by a factor of 4–10 over the less powerful codes. Nonetheless, the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.

·         Conclusion 2: Memory errors are strongly correlated. We observe strong correlations among correctable errors within the same DIMM. A DIMM that sees a correctable error is 13–228 times more likely to see another correctable error in the same month, compared to a DIMM that has not seen errors. There are also correlations between errors at time scales longer than a month. The autocorrelation function of the number of correctable errors per month shows significant levels of correlation up to 7 months. We also observe strong correlations between correctable errors and uncorrectable errors. In 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or the previous month, and the presence of a correctable error increases the probability of an uncorrectable error by factors between 9–400. Still, the absolute probabilities of observing an uncorrectable error following a correctable error are relatively small, between 0.1–2.3% per month, so replacing a DIMM solely based on the presence of correctable errors would be attractive only in environments where the cost of downtime is high enough to outweigh the cost of the expected high rate of false positives.

·         Conclusion 3: The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to replacements). Given that DRAM DIMMs are devices without any mechanical components, unlike for example hard drives, we see a surprisingly strong and early effect of age on error rates. For all DIMM types we studied, aging in the form of increased CE rates sets in after only 10–18 months in the field. On the other hand, the rate of incidence of uncorrectable errors continuously declines starting at an early age, most likely because DIMMs with UEs are replaced (survival of the fittest).

·         Conclusion 4: There is no evidence that newer generation DIMMs have worse error behavior. There has been much concern that advancing densities in DRAM technology will lead to higher rates of memory errors in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with newer generations. In fact, the DIMMs used in the three most recent platforms exhibit lower CE rates, than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling.

·         Conclusion 5: Within the range of temperatures our production systems experience in the field, temperature has a surprisingly low effect on memory errors. Temperature is well known to increase error rates. In fact, artificially increasing the temperature is a commonly used tool for accelerating error rates in lab studies. Interestingly, we find that differences in temperature in the range they arise naturally in our fleet’s operation (a difference of around 20C between the 1st and 9th temperature decile) seem to have a marginal impact on the incidence of memory errors, when controlling for other factors, such as utilization.

·         Conclusion 6: Error rates are strongly correlated with utilization.

·         Conclusion 7: Error rates are unlikely to be dominated by soft errors. We observe that CE rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with  memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the data path. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation. Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors. Conclusion 7 might also explain the significantly higher rates of memory errors we observe compared to previous studies.
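To get a feel for what the FIT rates in Conclusion 1 mean for a single DIMM, here is a rough back-of-the-envelope sketch. The 1 GB DIMM size and the 10,000-server fleet are illustrative assumptions of mine, not figures from the paper, and the function names are made up for this example. Only the FIT range and the 1.3% annual per-machine uncorrectable error incidence come from the conclusions quoted above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def errors_per_dimm_year(fit_per_mbit: float, dimm_gbytes: float) -> float:
    """Average correctable errors per DIMM per year.

    FIT is failures per billion hours of operation, quoted here per Mbit,
    so we scale by the DIMM capacity in Mbit and by the hours in a year.
    """
    mbits = dimm_gbytes * 1024 * 8  # GB -> MB -> Mbit
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9


def machines_with_uncorrectable_error(fleet_size: int,
                                      annual_rate: float = 0.013) -> float:
    """Expected machines per year seeing at least one uncorrectable error.

    The 1.3% default is the per-machine incidence quoted in Conclusion 1.
    """
    return fleet_size * annual_rate


if __name__ == "__main__":
    # Assumed DIMM size: 1 GB (illustrative only).
    for fit in (25_000, 75_000):
        rate = errors_per_dimm_year(fit, dimm_gbytes=1)
        print(f"{fit:>6} FIT/Mbit -> {rate:,.0f} correctable errors per DIMM-year")
    # Assumed fleet size: 10,000 servers (illustrative only).
    print(f"{machines_with_uncorrectable_error(10_000):.0f} machines per year with a UE")
```

For the assumed 1 GB DIMM, 25,000 FIT per Mbit works out to roughly 1,800 correctable errors per DIMM per year on average. Keep in mind that this is a mean: only about 8% of DIMMs saw any correctable error in a year, so the average is driven by a minority of DIMMs with very large error counts.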

 

Based upon this data and others, I recommend against non-ECC servers. Read the full paper at: DRAM Errors in the Wild: A Large Scale Field Study. Thanks to Cary Roberts for pointing me to this paper.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, October 07, 2009 5:33:50 AM (Pacific Standard Time, UTC-08:00)
Hardware
Wednesday, October 07, 2009 10:36:05 AM (Pacific Standard Time, UTC-08:00)
These are actually scary results. I wonder if they can correlate results over time to sunspots or the phase of the moon?

This reminds me of something...

Many, many years ago (30, to be precise) I helped to develop a travel agency automation system. The system was hosted on an Ohio Scientific 6502 machine with dual floppy disk drives, 48K of RAM, and up to 3 ADM-3A CRT displays. We built the entire software suite from the OS on up, and had full control of everything but the boot ROM.

The travel agency package included a double-entry accounting system. Occasional single bit memory errors would actually corrupt one of the two entries before it was written to disk, resulting in accounts that would never balance!

We were committed to (i.e., stuck with) the hardware, so we simply had to make it more reliable. I hit on the idea of doing real-time memory tests. We wired up the OS to call my code as part of its scheduling loop. My code then had 30 milliseconds to run a test. The details escape me now, but I think I managed to exhaustively test 1 byte and to make sure that it didn't change any other bytes in the same 256-byte page. The OS would simply halt (after displaying an error message) if an error was detected. A more modern system with a memory manager could definitely do something a lot more intelligent.

I'm surprised that this concept hasn't resurfaced (as far as I know) in the 30 years since I first implemented it.
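A minimal sketch of that kind of incremental test, written in Python against a simulated memory rather than real RAM. Everything here is an assumption made for illustration: the class names, the one-byte-per-call granularity, and the stuck-bit fault model. It is not a reconstruction of the original 6502 code, whose details are not given above.

```python
PAGE_SIZE = 256  # page size mentioned in the comment above


class MemoryFault(Exception):
    """Raised when the tested byte or one of its page neighbours misbehaves."""


class IncrementalMemoryTester:
    """Tests one byte per call so the work fits in a scheduler time slice."""

    def __init__(self, memory):
        self.memory = memory  # any object with len(), int indexing and slicing
        self.cursor = 0       # next address to test

    def step(self) -> int:
        mem, addr = self.memory, self.cursor
        page_start = addr - (addr % PAGE_SIZE)
        page_end = min(page_start + PAGE_SIZE, len(mem))
        offset = addr - page_start
        saved = mem[addr]
        before = bytes(mem[page_start:page_end])
        try:
            for pattern in range(256):  # exhaustively try every byte value
                mem[addr] = pattern
                if mem[addr] != pattern:
                    raise MemoryFault(f"byte {addr:#06x} failed pattern {pattern:#04x}")
                after = bytes(mem[page_start:page_end])
                # Every other byte in the page must be untouched by the write.
                if (before[:offset] != after[:offset]
                        or before[offset + 1:] != after[offset + 1:]):
                    raise MemoryFault(f"write to {addr:#06x} disturbed its page")
        finally:
            mem[addr] = saved  # put the original contents back
        self.cursor = (addr + 1) % len(mem)
        return addr


class StuckBitMemory:
    """Hypothetical faulty RAM: bit 0 of one address is stuck at 1."""

    def __init__(self, size: int, bad_addr: int):
        self._cells = bytearray(size)
        self._bad_addr = bad_addr

    def __len__(self):
        return len(self._cells)

    def __getitem__(self, index):
        return self._cells[index]

    def __setitem__(self, index, value):
        if index == self._bad_addr:
            value |= 0x01  # the stuck bit always reads back as 1
        self._cells[index] = value
```

A scheduler loop would call `step()` once per pass and halt, or retire the page, when `MemoryFault` is raised. Note that a test like this only finds hard faults that are present at the moment the byte is tested. It cannot catch a transient bit flip in data that is sitting in memory, which is the case ECC covers.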
Wednesday, October 07, 2009 12:45:47 PM (Pacific Standard Time, UTC-08:00)
Great story Jeff. Your "memory cleaning" idea does have modern day implementations: Memory scrubbing (http://en.wikipedia.org/wiki/Memory_scrubbing) coupled with high quality error correcting codes like chipkill. In memory scrubbing the memory controller steals some memory bandwidth and periodically touches all memory looking for correctable faults.

It works great but I've not seen it on commodity servers yet.
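To make the scrubbing idea concrete, here is a toy sketch in Python. It uses an (8,4) extended Hamming code, which corrects single-bit errors and detects double-bit errors (SECDED), and a scrub pass that walks a simulated memory and repairs what it can. The code width, the function names, and the list-of-words memory model are assumptions for illustration only. Real memory controllers do this in hardware with much wider codes (SECDED DIMMs typically protect 64 data bits with 8 check bits), and chipkill is stronger still.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # bit positions holding data; 1, 2, 4 hold parity


def encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit SECDED codeword (bit 0 = overall parity)."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> k) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p and i != p) % 2
    bits[0] = sum(bits[1:]) % 2
    return sum(b << i for i, b in enumerate(bits))


def check(word: int):
    """Return (status, word), where status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for i in range(1, 8):
        if bits[i]:
            syndrome ^= i
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        return "ok", word
    if overall == 1:
        # Odd number of flips: treat as a single-bit error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        return "corrected", word ^ (1 << syndrome)
    return "uncorrectable", word  # even number of flips: detected, not fixable


def extract_data(word: int) -> int:
    """Pull the 4 data bits back out of a codeword."""
    return sum(((word >> pos) & 1) << k for k, pos in enumerate(DATA_POSITIONS))


def scrub(memory: list) -> tuple:
    """One scrub pass: repair correctable words in place, count both error kinds."""
    corrected = uncorrectable = 0
    for addr, word in enumerate(memory):
        status, fixed = check(word)
        if status == "corrected":
            memory[addr] = fixed
            corrected += 1
        elif status == "uncorrectable":
            uncorrectable += 1
    return corrected, uncorrectable
```

The point of running `scrub` periodically is to repair single-bit errors before a second flip lands in the same word and turns a correctable error into an uncorrectable one.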

Thanks for passing that along.

James Hamilton
jrh@mvdirona.com
Thursday, October 08, 2009 8:56:50 AM (Pacific Standard Time, UTC-08:00)
That DRAM errors paper out of Google is very interesting, James, and a great read. Thanks for pointing it out.

One minor thing. Most of your discussions on this blog explore the cost and benefit of various configurations in the data center. While I have little doubt that we would easily show that ECC is worth the small cost premium given the large numbers of errors avoided, a more explicit discussion on the cost of ECC (total increase in price per server) and the benefit (downtime avoided, reduced need for error correction in software, or reduction in servers needed) might be both fun and useful.
Thursday, October 08, 2009 10:15:32 AM (Pacific Standard Time, UTC-08:00)
I hate it when data gets in the way of what I want to do! Thanks for the link.
William Casperson
Friday, October 09, 2009 8:43:12 AM (Pacific Standard Time, UTC-08:00)
I totally agree Greg. The challenge is that most reports on ECC are studies from non-production workloads and they don't reflect the failure rates seen in large cloud deployments. The Google study is one of the first to release data publicly from a large population.

Generally, ECC reminds me of disk MTBF. The numbers from the labs have just about no bearing on what we can expect to see in production.

From watching failures in high scale software and hardware systems for many years I'm 100% convinced that, at scale, everything fails and it fails much more frequently than any model predicts. Every time we add checksums or error checking to any part of the system, we find more issues. My belief from watching this for years is that all pages and buffers should be end to end checksummed and, in memory and data paths where we can't efficiently have software checking, we need hardware checking.
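As a minimal sketch of what end to end checksumming of a buffer can look like in software, the Python below appends a CRC-32 to each payload when it is produced and verifies it when it is consumed. The function names and the 4-byte trailer layout are assumptions for illustration, and CRC-32 is just one reasonable choice of checksum.

```python
import struct
import zlib


class ChecksumError(Exception):
    """Raised when a buffer fails verification at the consuming end."""


def seal(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload (assumed trailer layout)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unseal(sealed: bytes) -> bytes:
    """Verify the trailing CRC-32 and return the payload, or raise."""
    if len(sealed) < 4:
        raise ChecksumError("buffer too short to hold a checksum")
    payload = sealed[:-4]
    (stored,) = struct.unpack("<I", sealed[-4:])
    if zlib.crc32(payload) != stored:
        raise ChecksumError("payload does not match its checksum")
    return payload
```

A check like this catches corruption anywhere between `seal` and `unseal`: on disk, on the wire, or in a buffer sitting in memory. What it cannot catch is a bit that flips before `seal` runs or after `unseal` returns, which is why the memory and data paths still need hardware checking.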

On cost there is debate. System designers argue that it's a "substantial" cost but the facts don't bear this out. All AMD processors whether server, embedded, or client have ECC. All ARMs have ECC. In fact, I find it fascinating that in the most cost conscious market in the world, embedded, almost all processors have ECC. There are two interesting data points in there. The cost conscious embedded market believes they need it and are willing to pay for it. And, the cost uplift is fairly small in embedded applications.

All chip suppliers have ECC capable memory managers. In volume, the cost of ECC will trend to the incremental cost of memory.

If we put it in client systems, client system reliability will increase and server ECC will be less cost effective, and we will be free to use client and embedded parts in server applications.
Friday, October 09, 2009 8:55:50 AM (Pacific Standard Time, UTC-08:00)
Hey William, good hearing from you. And, yeah, I totally agree that needing ECC is an unfortunate constraint in server purchases. But, there are two ways to get rid of the constraint: 1) develop systems that can operate correctly in the presence of massive datapath and memory errors or 2) convince client system buyers and builders they need ECC for the same reasons that server and embedded markets have gone that route. The former is hard to do both well and cost effectively. We really need hardware support. The latter looks like the right approach. Let's get ECC on client systems and the cost of ECC will decline to the marginal cost of memory. This is especially important for Microsoft: blue screens are always blamed on the O/S provider when, in fact, uncorrected memory faults are clearly part of the problem. ECC will improve the customer experience and Microsoft should insist on it as part of the Windows h/w cert program.

All AMD parts I've come across whether client, embedded, or server include ECC. Hats off to AMD for offering that support across the board. That's one of the reasons this work was based upon AMD: http://perspectives.mvdirona.com/2009/01/15/TheCaseForLowCostLowPowerServers.aspx.

For those that don't know William, he's responsible for Bing hardware selections and presides over one of the biggest fleets there is.

James Hamilton
jrh@mvdirona.com

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton