Datacenter temperatures have been ramping up rapidly over the last 5 years. In fact, leading operators have been pushing temperatures up so quickly that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommendations have become a trailing indicator of what is being done rather than current guidance. ASHRAE responded in January of 2009 by raising the recommended limit from 77F to 80.6F (HVAC Group Says Datacenters Can be Warmer). This was a good move but many of us felt it was late and not nearly a big enough increment. Earlier this month, ASHRAE announced they are again planning to take action and raise the recommended limit further but haven’t yet announced by how much (ASHRAE: Data Centers Can be Even Warmer).
Many datacenters are operating reliably well in excess of even the newest ASHRAE recommended temperature of 81F. For example, back in 2009 Microsoft announced they were operating at least one facility without chillers at 95F server inlet temperatures.
As a measure of datacenter efficiency, we often use Power Usage Effectiveness (PUE). That’s the ratio of the power that arrives at the facility to the power actually delivered to the IT equipment (servers, storage, and networking). Clearly PUE says nothing about the efficiency of the servers, which is even more important but, just focusing on facility efficiency, we see that mechanical systems are the largest single power efficiency overhead in the datacenter.
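As a quick illustration of the PUE ratio described above, here is a minimal Python sketch. The function names and the 1,500 kW / 1,000 kW figures are made up for the example and are not measurements from any facility.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: facility power divided by IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("facility power cannot be less than IT power")
    return total_facility_kw / it_equipment_kw


def overhead_fraction(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Share of facility power that never reaches the IT equipment."""
    return 1.0 - 1.0 / pue(total_facility_kw, it_equipment_kw)


# Hypothetical facility: 1,500 kW at the meter, 1,000 kW delivered to IT gear.
print(pue(1500.0, 1000.0))                          # 1.5
print(round(overhead_fraction(1500.0, 1000.0), 3))  # 0.333
```

A PUE of 1.5 means a third of the power arriving at the facility goes to cooling, power distribution, and other overheads rather than to servers, storage, and networking.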
There are many innovative approaches to reducing the power lost to mechanical systems. Many of these innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.
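To make the economization point concrete, the sketch below counts what fraction of outside-air temperature samples fall at or below a given inlet target. The sample temperatures are invented for illustration, and the `approach_c` margin (an allowance for air warming up between intake and server inlet) is an assumption of this sketch, not something taken from a real facility.

```python
def economizer_fraction(outside_temps_c, inlet_target_c, approach_c=0.0):
    """Fraction of samples where outside air alone can meet the inlet target."""
    temps = list(outside_temps_c)
    if not temps:
        raise ValueError("need at least one temperature sample")
    usable = sum(1 for t in temps if t + approach_c <= inlet_target_c)
    return usable / len(temps)


# Made-up outside-air samples in degrees C (not real weather data).
samples = [12, 18, 24, 27, 31, 35, 38, 29, 22, 15]

for target_c in (25, 27, 35, 45):
    print(target_c, economizer_fraction(samples, target_c))
```

With these invented samples, the usable fraction climbs from half of the time at a 25C target to all of the time at 45C, which is the effect described above.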
The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speeds, which increase the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are inarguably present and measurable, they have a very small impact at even quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.
The net of these factors is that fear of server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK-217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a linear model of failure to heat.
This linear failure model is helpful guidance but is clearly not correct across the entire possible temperature range. We know that at low temperatures, non-heat related failure modes dominate. The failure rate of gear at 60F (16C) is not ½ the rate of gear at 79F (26C). As the temperature ramps up, we expect the failure rate to increase non-linearly. Really what we want to find is the knee of the curve. When does the cost of the failure rate increase start to approach that of the power savings achieved at higher inlet temperatures? Knowing that the rule of thumb that a 10C increase doubles the failure rate is incorrect doesn’t really help us. What is correct?
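For reference, here is what the 10C rule of thumb predicts if taken literally. This is only a sketch of the rule as quoted above, not an endorsement of it; the function name and parameters are made up for the example.

```python
def doubling_rule_multiplier(temp_c, reference_c, doubling_step_c=10.0):
    """Failure-rate multiplier relative to reference_c under the 10C rule."""
    return 2.0 ** ((temp_c - reference_c) / doubling_step_c)


# Taken literally, the rule says gear at 16C fails half as often as at 26C,
# which is exactly the prediction argued above to be wrong at low temperature.
print(doubling_rule_multiplier(16.0, 26.0))  # 0.5
print(doubling_rule_multiplier(45.0, 25.0))  # 4.0
```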
What is correct is a difficult data point to get for two reasons: 1) nobody wants to bear the cost of the experiment – the failure case could run several hundred million dollars, and 2) those that have explored higher temperatures and have the data aren’t usually interested in sharing these results since the data has substantial commercial value.
I was happy to see a recent Datacenter Knowledge article titled What’s Next? Hotter Servers with ‘Gas Pedals’ (ignore the title: Rich just wants to catch your attention). This article includes several interesting tidbits on high temperature operation. Most notable was the quote by Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, who says:
Take the data center in the desert. Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, shared an anecdote about a data center user in the Middle East that wanted to test server failure rates if it operated its data center at 45 degrees Celsius – that’s 115 degrees Fahrenheit.
Testing projected an annual equipment failure rate of 2.45 percent at 25 degrees C (77 degrees F), and then an increase of 0.36 percent for every additional degree. Thus, 45C would likely result in annual failure rate of 11.45 percent. “Even if they replaced 11 percent of their servers each year, they would save so much on air conditioning that they decided to go ahead with the project,” said Bapat. “They’ll go up to 45C using full air economization in the Middle East.”
This study caught my interest for a variety of reasons. First and most importantly, they studied failure rates at 45C and decided that, although they were high, it still made sense for them to operate at these temperatures. It is reported they are happy to pay the increased failure rate in order to save the cost of the mechanical gear. The second interesting data point from this work is that they have found a far greater failure rate for 15C of increased temperature than predicted by MIL-HDBK-217F. Consistent with 217F, they also report a linear relationship between temperature and failure rate. I almost guarantee that the failure rate increase between 40C and 45C was much higher than the difference between 25C and 30C. I don’t buy the linear relationship between failure rate and temperature based upon what we know of failure rates at lower temperatures. Many datacenters have raised temperatures from 20C (68F) to 25C (77F) and found no measurable increase in server failure rate. Some have raised their temperatures from 25C (77F) to 30C (86F) without finding the predicted 1.8% increase in failures but admitted the data remains sparse.
The linear relationship is absolutely not there across a wide temperature range. But I still find the data super interesting in two respects: 1) they did see roughly a 9% increase in failure rate at 45C over the control case at 20C and 2) even with the high failure rate, they made an economic decision to run at that temperature. The article doesn’t say if the failure rate is a lifetime failure rate or an Annual Failure Rate (AFR). Assuming it is an AFR, many workloads and customers could not live with an 11.45% AFR. Nonetheless, the data is interesting and it is good to see the first public report of an operator doing the study and reaching a conclusion that works for their workloads and customers.
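The linear projection quoted above is easy to reproduce. The sketch below uses the quoted 2.45 percent base rate and 0.36 points per degree. The quoted figures do not pin down the reference temperature unambiguously (2.45 percent is quoted at 25C, while 11.45 percent at 45C implies a 25-degree span), so `reference_c` is left as a parameter. The function name is made up for the example.

```python
def linear_afr_percent(temp_c, base_afr=2.45, per_degree=0.36, reference_c=25.0):
    """Projected failure rate (percent) under the quoted linear model."""
    return base_afr + per_degree * (temp_c - reference_c)


# Predicted rise for a 5-degree step, e.g. 25C to 30C.
print(round(linear_afr_percent(30.0) - linear_afr_percent(25.0), 2))  # 1.8

# Projection at 45C with the baseline taken at 25C versus at 20C.
print(round(linear_afr_percent(45.0, reference_c=25.0), 2))  # 9.65
print(round(linear_afr_percent(45.0, reference_c=20.0), 2))  # 11.45
```

Whichever baseline is used, the model adds the same 0.36 points for every degree, which is exactly the assumption questioned above for the low and high ends of the range.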
The net of the whole discussion above is that the current linear rules of thumb are wrong and don’t usefully predict failure rates in the temperature ranges we care about, there is very little public data out there, industry groups like ASHRAE are behind, and it’s quite possible that cooling specialists may not be the first to recommend we stop cooling datacenters aggressively :-). Generally, it’s a sparse data environment out there but the potential economic and environmental gains are exciting so progress will continue to be made. I’m looking forward to efficient and reliable operation at high temperature becoming a critical data point in server fleet purchasing decisions. Let’s keep pushing.