Past experience suggests that disks and memory are the most commonly failing server components, but what about power supplies and motherboards? Amaya Souarez of Global Foundation Services pulled the data on component replacements for the last six months of 2007, and we saw this distribution:
1. Disks: 59.0%
2. Memory: 23.1%
3. Disk Controller: 5.0% (Fiber HBAs and array controllers)
4. Power Supply: 3.4%
5. Fan: 1.1%
6. NIC: 1.0%
After disk and memory, the numbers fade into the noise fairly quickly. Not a big surprise. What I did find quite surprising was the percentage of systems requiring service. Some systems will require service more than once, and some will have multiple components replaced in a single service. Ignoring these factors and treating each logged component replacement as a service event, we found 192 service events per 1,000 servers in the six-month sample we looked at. Making the reasonable assumption that this data is not sensitive to the time of year, that works out to 384 service events per 1,000 servers per year.
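The annualization above is simple enough to check by hand, but a small sketch makes the assumption explicit. The function and variable names here are illustrative, not part of the original data pull, and the per-component breakdown assumes each component's share of replacements applies uniformly to the annualized rate.

```python
def annualize(events, servers, months_observed):
    """Scale an observed event count to events per 1,000 servers per year.

    Assumes the event rate does not vary with the time of year.
    """
    return events * 1000 * 12 / (servers * months_observed)


# Observed: 192 service events per 1,000 servers over six months.
annual_rate = annualize(events=192, servers=1000, months_observed=6)
print(annual_rate)  # 384.0

# Share of component replacements, in percent, from the list above.
shares = {
    "Disks": 59.0,
    "Memory": 23.1,
    "Disk Controller": 5.0,
    "Power Supply": 3.4,
    "Fan": 1.1,
    "NIC": 1.0,
}

# Rough annual events per 1,000 servers attributable to each component.
per_component = {name: annual_rate * pct / 100 for name, pct in shares.items()}
for name, rate in per_component.items():
    print(f"{name}: {rate:.1f}")
```

The listed components do not sum to 100%; the remainder is whatever was replaced but not broken out above.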
The good news is that server service is fairly inexpensive. Nonetheless, these data reinforce the argument I’ve been making for the last couple of years: the service-free, fail-in-place model is where we should be going over the longer term. I wrote this up in http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_CIDR.doc, but the basic observation is that the cost per server continues to decline while people costs don’t.
Going to a service-free model can save service costs but, even more interesting, in this model the servers can be packaged in designs optimized for cooling efficiency without regard to human access requirements. If technicians need to go into a space, then the space needs to be safe for humans and meet multiple security regulations (a growing concern), and there needs to be room for them to work. I believe we will get to a model where servers are treated like flash memory blocks: you have a few thousand in a service-free module, over time some fail and are shut off, and the overall module capacity diminishes. When server work done per watt improves sufficiently, or when the module capacity degrades far enough, the entire module is replaced and returned to the manufacturer for recycling.
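To get a feel for the capacity-degradation side of that replacement decision, here is a minimal sketch. It assumes a constant fraction of the surviving servers fails and is shut off each year; the 5% failure rate and 80% capacity threshold are hypothetical numbers chosen for illustration, not figures from the data above, and the sketch does not model the work-per-watt trigger.

```python
def years_to_threshold(annual_failure_rate, min_capacity_fraction):
    """Return the first whole year in which module capacity drops below
    the threshold, assuming failed servers are shut off and never serviced.

    Both arguments are hypothetical planning inputs, each strictly
    between 0 and 1.
    """
    if not 0 < annual_failure_rate < 1:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    if not 0 < min_capacity_fraction < 1:
        raise ValueError("min_capacity_fraction must be between 0 and 1")

    capacity = 1.0  # fraction of the original servers still running
    years = 0
    while capacity >= min_capacity_fraction:
        capacity *= 1 - annual_failure_rate  # this year's failures shut off
        years += 1
    return years


# Hypothetical module: 5% of surviving servers fail per year,
# and the module is replaced once capacity falls below 80%.
print(years_to_threshold(0.05, 0.80))  # 5
```

Under these made-up inputs, capacity is still about 81% after four years and falls to about 77% in the fifth, so the module would be swapped in year five unless efficiency gains justified replacing it sooner.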