I love high-scale systems and, more than anything, I love data from real systems. I’ve learned over the years that no environment is crueler, less forgiving, or harder to satisfy than real production workloads. Synthetic tests at scale are instructive, but nothing catches my attention like data from real, high-scale production systems. Consequently, I really liked the disk population studies from Google and CMU at FAST 2007 (Failure Trends in a Large Disk Population, Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?). These two papers presented actual results from independent production disk populations of 100,000 each. My quick summary of these two papers is basically “all you have learned about disks so far is probably wrong.”
Disks are often the #1 or #2 failing component in a storage system, usually just ahead of memory. Occasionally fan failures lead disk failures but that isn’t the common case. We now have publicly available data on disk failures but not much has been published on other component failure rates and even less on overall storage stack failure rates. Cloud storage systems are multi-tiered, distributed systems involving hundreds to even tens of thousands of servers and huge quantities of software. Modeling the failure rates of discrete components in the stack is difficult but, with the large amount of component failure data available to large fleet operators, it can be done. What’s much more difficult to model are correlated failures.
Essentially, there are two challenges encountered when attempting to model overall storage system reliability: 1) availability of component failure data and 2) correlated failures. The former is available to very large fleet owners but is often unavailable publicly. Two notable exceptions are the disk reliability data sets from the two FAST’07 conference papers mentioned above. Other than these two data points, there is little credible component failure data publicly available. Admittedly, component manufacturers do publish MTBF data but these data are often owned by the marketing rather than engineering teams and they range between optimistic and epic works of fiction.
Even with good quality component failure data, modeling storage system failure modes and data durability remains incredibly difficult. What makes this hard is the second issue above: correlated failure. Failures don’t always happen alone. Many are correlated, and certain types of rare failures can take down the entire fleet or large parts of it. Just about every model assumes failure independence and then works out data durability to many decimal points. It makes for impressive models with long strings of nines but the challenge is that the model is only as good as its inputs. And one of the most important model inputs is the assumption of component failure independence, which is violated by every real-world system of any complexity. Basically, these failure models are good at telling you when your design is not good enough but they can never tell you how good your design actually is nor whether it is good enough.
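To make the point about long strings of nines concrete, here is a minimal sketch of the kind of calculation I’m describing. It is my own toy model, not one taken from any of the papers, and every number in it is a made-up assumption: `P_REPLICA` is a hypothetical chance of losing one replica in some period and `P_COMMON` is a hypothetical chance of a single shared event that defeats all replicas at once.

```python
import math


def loss_prob_independent(p_replica: float, replicas: int) -> float:
    """P(all replicas lost) when replica failures are fully independent."""
    return p_replica ** replicas


def loss_prob_with_common_cause(p_replica: float, replicas: int, p_common: float) -> float:
    """Same model plus one shared event (hypothetical) that takes out every replica."""
    return p_common + (1.0 - p_common) * p_replica ** replicas


def nines(p_loss: float) -> float:
    """Number of nines of durability implied by a loss probability."""
    return -math.log10(p_loss)


# Hypothetical inputs, chosen only for illustration.
P_REPLICA = 0.01   # assumed chance that one replica is lost in the period
P_COMMON = 1e-4    # assumed chance of a shared event that defeats all replicas

for n in (2, 3, 4):
    ind = nines(loss_prob_independent(P_REPLICA, n))
    cor = nines(loss_prob_with_common_cause(P_REPLICA, n, P_COMMON))
    print(f"{n} replicas: {ind:.1f} nines if independent, {cor:.1f} nines with a common cause")
```

Under these assumed inputs the independent model keeps adding two nines with every extra replica, while the version with a common cause stalls at about four nines no matter how many replicas are added. The shared event sets the floor, and that is exactly the input the independence assumption leaves out.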
Where the models break down is in modeling rare events and non-independent failures. The best way to understand common correlated failure modes is to study storage systems at scale over longer periods of time. This won’t help us understand the impact of very rare events. For example, two thousand years of history would not have helped us model or predict that an airplane would be flown into the World Trade Center. And certainly the odds of it happening again 16 minutes and 20 seconds later would have seemed close to impossible. Studying historical storage system failure data will not help us understand the potential negative impact of very rare black swan events but it does help greatly in understanding the more common failure modes including correlated or non-independent failures.
Murray Stokely recently sent me Availability in Globally Distributed Storage Systems, which is the work of a team from Google and Columbia University. They look at a high-scale storage system at Google that includes multiple clusters of Bigtable, which is layered over GFS, which is implemented as a user-mode application over the Linux file system. You might remember Stokely from a post I did back in March titled Using a Market Economy. In this more recent paper, the authors study tens of Google storage cells, each of which comprises between 1,000 and 7,000 servers, over a one-year period. The storage cells studied are from multiple datacenters in different regions and are used by different projects within Google.
I like the paper because it is full of data on a high-scale production system and it reinforces many key distributed storage system design lessons including:
· Replicating data across multiple datacenters greatly improves availability because it protects against correlated failures.
o Conclusion: Two-way redundancy in two different datacenters is considerably more durable than four-way redundancy in a single datacenter.
· Correlation among node failures dwarfs all other contributions to unavailability in production environments.
· Disk failures can result in permanent data loss but transitory node failures account for the majority of unavailability.
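The first lesson above can be illustrated with a back-of-the-envelope sketch. This is not the statistical model from the paper. It is a toy model of my own in which each node is independently out with an assumed probability `P_NODE`, each datacenter suffers a correlated outage with an assumed probability `P_DC`, and datacenters are assumed to fail independently of one another. Both numbers are hypothetical.

```python
def unavail_single_dc(p_node: float, p_dc: float, replicas: int) -> float:
    """All replicas in one datacenter: a datacenter-wide event defeats every copy."""
    return p_dc + (1.0 - p_dc) * p_node ** replicas


def unavail_across_dcs(p_node: float, p_dc: float, datacenters: int) -> float:
    """One replica per datacenter; datacenters are assumed to fail independently."""
    # A single replica is out if its datacenter is out or its own node is out.
    p_replica = p_dc + (1.0 - p_dc) * p_node
    return p_replica ** datacenters


# Hypothetical inputs, chosen only for illustration.
P_NODE = 0.01    # assumed chance that one node is out
P_DC = 0.001     # assumed chance of a correlated, datacenter-wide outage

four_way_one_dc = unavail_single_dc(P_NODE, P_DC, replicas=4)
two_way_two_dcs = unavail_across_dcs(P_NODE, P_DC, datacenters=2)

print(f"4-way, one datacenter:  {four_way_one_dc:.2e}")
print(f"2-way, two datacenters: {two_way_two_dcs:.2e}")
```

With these made-up inputs, two copies in two datacenters are roughly eight times less likely to be out of reach at the same moment than four copies in one datacenter, because the single-datacenter design can never do better than `P_DC`. The ordering does depend on the inputs: if datacenter-wide events are assumed to be rare enough relative to node failures, the single-datacenter layout comes out ahead in this toy model, which is why measured correlation data like that in the paper matters.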
To read more: http://research.google.com/pubs/pub36737.html
The abstract of the paper:
Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.