Temperature Management in Data Centers

Cooling is the largest single non-IT (overhead) load in a modern datacenter. There are many innovative approaches to reducing the power lost in cooling systems. Many of these mechanical system innovations work well and others have great potential, but none is as powerful as simply increasing server inlet temperatures. Obviously, less cooling is cheaper than more. And the higher the target inlet temperature, the greater the percentage of time a facility can spend running on outside air (air-side economization) without process-based cooling.
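To make the economization point concrete, here is a minimal sketch of how the usable fraction of hours grows with the inlet target. The hourly readings are made up for illustration, and the `approach_c` margin (heat picked up between the intake and the server inlet) is a hypothetical parameter, not a figure from any real facility.

```python
def economizer_fraction(outside_temps_c, inlet_target_c, approach_c=2.0):
    """Fraction of hours in which outside air alone can meet the inlet target.

    approach_c is an assumed margin for heat gained between the air intake
    and the server inlet; it is illustrative, not a measured value.
    """
    if not outside_temps_c:
        raise ValueError("need at least one hourly reading")
    limit = inlet_target_c - approach_c
    usable = sum(1 for t in outside_temps_c if t <= limit)
    return usable / len(outside_temps_c)


# Made-up hourly outside temperatures for one day, in degrees C.
hourly_c = [14, 13, 12, 12, 11, 12, 14, 17, 20, 23, 26, 28,
            30, 31, 31, 30, 28, 26, 23, 21, 19, 17, 16, 15]

for target_c in (22, 27, 35):
    fraction = economizer_fraction(hourly_c, target_c)
    print(f"inlet target {target_c}C: {fraction:.0%} of hours on outside air")
```

With these made-up readings, the usable share rises from 13 of 24 hours at a 22C target to all 24 hours at a 35C target. A real estimate would use a full year of local weather data and would also account for humidity.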

The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speeds, which increase the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are inarguably present and measurable, they have a very small impact even at quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision, and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.
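To see why fan speed matters so much, consider the idealized fan affinity law, under which fan power scales with the cube of fan speed. This is a textbook approximation rather than anything measured in this post, and real servers will deviate from it, so treat the sketch below as a rough guide only.

```python
def fan_power_ratio(speed_ratio):
    """Relative fan power under the idealized affinity law (power ~ speed**3)."""
    if speed_ratio < 0:
        raise ValueError("speed ratio must be non-negative")
    return speed_ratio ** 3


for pct in (10, 20, 50):
    ratio = fan_power_ratio(1 + pct / 100)
    print(f"+{pct}% fan speed -> {ratio:.2f}x fan power")
```

Under this approximation, a 20% increase in fan speed costs roughly 73% more fan power, which is why a server that does not need to spin its fans faster at higher inlet temperatures has such an advantage.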

The net of these factors is that fear of higher server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often-quoted study reports that the failure rate of electronics doubles with every 10C increase in temperature (MIL-HDBK 217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent, but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a simple exponential model of failure rate versus temperature.
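Read literally, "doubles with every 10C" implies a failure rate multiplier of 2^(ΔT/10) for a temperature increase of ΔT degrees. The short sketch below only works out that arithmetic. The function name and the `doubling_interval_c` parameter are illustrative, and nothing here says the rule actually holds in a datacenter.

```python
def failure_rate_multiplier(delta_t_c, doubling_interval_c=10.0):
    """Failure rate multiplier implied by the 'doubles every 10C' rule of thumb."""
    return 2.0 ** (delta_t_c / doubling_interval_c)


for delta_c in (0, 5, 10, 20):
    multiplier = failure_rate_multiplier(delta_c)
    print(f"+{delta_c}C -> {multiplier:.2f}x failure rate")
```

Under this rule, a 20C increase would quadruple the failure rate, which is exactly the kind of prediction that makes operators nervous about running hotter.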

A recent paper does an excellent job of methodically digging through the possible issues with high datacenter temperatures and investigating each concern. I like Temperature Management in Data Centers: Why Some (Might) Like it Hot for two reasons: 1) it unemotionally works through the key issues and concerns, and 2) it draws on data from 7 production data centers at Google, so the results are credible and come from a substantial sample.

From the introduction:

Interestingly, one key aspect in the thermal management of a data center is still not very well understood: controlling the setpoint temperature at which to run a data center’s cooling system. Data centers typically operate in a temperature range between 20C and 22C, some are as cold as 13C degrees [8, 29]. Due to lack of scientific data, these values are often chosen based on equipment manufacturers’ (conservative) suggestions. Some estimate that increasing the setpoint temperature by just one degree can reduce energy consumption by 2 to 5 percent [8, 9]. Microsoft reports that raising the temperature by two to four degrees in one of its Silicon Valley data centers saved $250,000 in annual energy costs [29]. Google and Facebook have also been considering increasing the temperature in their data centers [29].
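To get a feel for the 2 to 5 percent per degree estimate quoted above, here is a minimal sketch of what a multi-degree raise might save. It assumes the per-degree saving compounds (each degree trims the same percentage off what remains), which is an assumption made for illustration. The quoted estimate does not say whether the savings compound or simply add.

```python
def energy_after_raise(degrees, saving_per_degree):
    """Remaining energy use, as a fraction of baseline, after raising the setpoint.

    Assumes the per-degree saving compounds; this is an illustrative
    assumption, not something the quoted estimate specifies.
    """
    if not 0 <= saving_per_degree < 1:
        raise ValueError("saving_per_degree must be in [0, 1)")
    return (1.0 - saving_per_degree) ** degrees


for saving in (0.02, 0.05):
    remaining = energy_after_raise(4, saving)
    print(f"{saving:.0%} per degree, +4 degrees -> {1 - remaining:.1%} saved")
```

Under that assumption, a four degree raise saves roughly 8 to 19 percent of the affected energy use.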

The authors go on to observe that “the details of how increased data center temperatures will affect hardware reliability are not well understood and existing evidence is contradictory.” The remainder of the paper presents the data as measured in the 7 production datacenters under study and concludes each section with an observation. I encourage you to read the paper and I’ll cover just the observations here:

Observation 1: For the temperature range that our data covers with statistical significance (< 50C), the prevalence of latent sector errors increases much more slowly with temperature, than reliability models suggest. Half of our model/data center pairs show no evidence of an increase, while for the others the increase is linear rather than exponential.

Observation 2: The variability in temperature tends to have a more pronounced and consistent effect on Latent Sector Error rates than mere average temperature.

Observation 3: Higher temperatures do not increase the expected number of Latent Sector Errors (LSEs) once a drive develops LSEs, possibly indicating that the mechanisms that cause LSEs are the same under high or low temperatures.

Observation 4: Within a range of 0-36 months, older drives are not more likely to develop Latent Sector Errors under temperature than younger drives.

Observation 5: High utilization does not increase Latent Sector Error rates under temperatures.

Observation 6: For temperatures below 50C, disk failure rates grow more slowly with temperature than common models predict. The increase tends to be linear rather than exponential, and the expected increase in failure rates for each degree increase in temperature is small compared to the magnitude of existing failure rates.

Observation 7: Neither utilization nor the age of a drive significantly affect drive failure rates as a function of temperature.

Observation 8: We do not observe evidence for increasing rates of uncorrectable DRAM errors, DRAM DIMM replacements or node outages caused by DRAM problems as a function of temperature (within the range of temperature our data comprises).

Observation 9: We observe no evidence that hotter nodes have a higher rate of node outages, node downtime or hardware replacements than colder nodes.

Observation 10: We find that high variability in temperature seems to have a stronger effect on node reliability than average temperature.

Observation 11: As ambient temperature increases, the resulting increase in power is significant and can be mostly attributed to fan power. In comparison, leakage power is negligible.

Observation 12: Smart control of server fan speeds is imperative to run data centers hotter. A significant fraction of the observed increase in power dissipation in our experiments could likely be avoided by more sophisticated algorithms controlling the fan speeds.

Observation 13: The degree of temperature variation across the nodes in a data center is surprisingly similar for all data centers in our study. The hottest 5% nodes tend to be more than 5C hotter than the typical node, while the hottest 1% nodes tend to be more than 8–10C hotter.
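Several of the observations above turn on temperature spread rather than average temperature. As a minimal sketch of how one might summarize that spread for a fleet, the code below computes how much hotter the hottest 5% and 1% of nodes run than the median node, plus the standard deviation. The node readings are invented for illustration and the function name is hypothetical. The paper's own methodology may differ.

```python
import statistics


def excess_over_median(temps_c, top_fraction):
    """Degrees C by which the hottest `top_fraction` of nodes exceed the median.

    Returns the gap between the coolest node inside the hot group and the
    median node, so every node in that group is at least this much hotter.
    """
    if not temps_c or not 0 < top_fraction <= 1:
        raise ValueError("need readings and a fraction in (0, 1]")
    ordered = sorted(temps_c)
    hot_count = max(1, round(len(ordered) * top_fraction))
    threshold = ordered[len(ordered) - hot_count]
    return threshold - statistics.median(ordered)


# Invented inlet readings for 100 nodes, in degrees C (illustration only).
node_temps_c = ([28.0] * 20 + [30.0] * 55 + [32.0] * 20
                + [36.0, 36.5, 37.0, 38.0, 40.0])

print("std dev:", round(statistics.pstdev(node_temps_c), 2))
print("hottest 5% exceed median by at least",
      excess_over_median(node_temps_c, 0.05), "C")
print("hottest 1% exceed median by at least",
      excess_over_median(node_temps_c, 0.01), "C")
```

Tracking these numbers over time, rather than only the room average, is one way to spot the hot spots and temperature swings that the observations suggest matter most.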

The paper under discussion: http://www.cs.toronto.edu/~nosayba/temperature_cam.pdf.

Other notes on increased data center temperatures:

· Exploring the Limits of Datacenter Temperature

· Chillerless Data Center at 95F

· Computer Room Evaporative Cooling

· Next Point of Server Differentiation: Efficiency at Very High Temperature

· Open Compute Mechanical System Design

· Example of Efficient Mechanical Design

· Innovative Datacenter Design: Ishikari Datacenter

James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com


2 comments on “Temperature Management in Data Centers”

James Hamilton says:

May 30, 2012 at 1:37 am

Sorry Mighty, I don’t have an answer for you. This blog isn’t the best place to get official Amazon views on pretty much any topic.

–jrh

Mighty says:

May 30, 2012 at 12:29 am

What is Amazon’s contribution to this subject? What do you guys think about hotter DCs?
