Sunday, February 27, 2011

Datacenter temperature has been ramping up rapidly over the last 5 years. In fact, leading operators have been pushing temperatures up so quickly that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommendations have become a trailing indicator of what is being done rather than current guidance. ASHRAE responded in January of 2009 by raising the recommended limit from 77F to 80.6F (HVAC Group Says Datacenters Can be Warmer). This was a good move but many of us felt it was late and not nearly a big enough increment. Earlier this month, ASHRAE announced they are again planning to take action and raise the recommended limit further but haven’t yet announced by how much (ASHRAE: Data Centers Can be Even Warmer).

 

Many datacenters are operating reliably well in excess of even the newest ASHRAE recommended temp of 81F. For example, back in 2009 Microsoft announced they were operating at least one facility without chillers at 95F server inlet temperatures.

 

As a measure of datacenter efficiency, we often use Power Usage Effectiveness (PUE). That’s the power that arrives at the facility divided by the power actually delivered to the IT equipment (servers, storage, and networking). Clearly PUE says nothing about the efficiency of the servers themselves, which is even more important, but, just focusing on facility efficiency, we see that mechanical systems are the largest single power efficiency overhead in the datacenter.
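
To make the ratio concrete, here is a minimal Python sketch of the PUE calculation. The function name and the sample power figures are illustrative assumptions, not measurements from any real facility.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: facility power divided by IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical example: 1,500 kW arrives at the facility and
# 1,000 kW reaches servers, storage, and networking.
example_pue = pue(1500.0, 1000.0)
# Share of the incoming power that never reaches the IT equipment.
overhead_fraction = (example_pue - 1.0) / example_pue
print(f"PUE = {example_pue:.2f}, overhead = {overhead_fraction:.0%}")
```

A PUE of 1.0 would mean every watt entering the building reaches the IT gear; anything above that is cooling, power distribution, and other facility overhead.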

 

There are many innovative solutions to addressing the power losses to mechanical systems. Many of these innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.

 

The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speed which increases the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are measurable, they have a very small impact at even quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.

 

The net of these factors is that fear of server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK-217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a linear model of failure to heat.

 

This linear failure model is helpful guidance but is clearly not correct across the entire possible temperature range. We know that at low temperature, non-heat related failure modes dominate. The failure rate of gear at 60F (16C) is not ½ the rate of gear at 79F (26C). As the temperature ramps up, we expect the failure rate to increase non-linearly. Really what we want to find is the knee of the curve. When does the cost of the increased failure rate start to approach the power savings achieved at higher inlet temperatures? Knowing that the rule of thumb that a 10C increase doubles the failure rate is incorrect doesn’t really help us. What is correct?
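
To see what the rule of thumb actually predicts, it helps to write it out. The sketch below assumes the rule is applied literally (the failure rate doubles for each 10C rise); the function name is illustrative and not taken from the handbook.

```python
def doubling_rule_multiplier(t_from_c: float, t_to_c: float,
                             step_c: float = 10.0) -> float:
    """Failure-rate multiplier if the rate doubles for every `step_c` degrees C."""
    return 2.0 ** ((t_to_c - t_from_c) / step_c)


# Cooling from 26C to 16C: the rule predicts half the failures,
# which is the prediction that does not hold up at low temperatures.
print(doubling_rule_multiplier(26, 16))
# Warming from 25C to 45C: the rule predicts four times the failures.
print(doubling_rule_multiplier(25, 45))
```

The multiplier depends only on the temperature difference, so the rule claims the same benefit from cooling an already cool room as from cooling a hot one. That is exactly where it parts ways with what operators observe.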

 

What is correct is a difficult data point to get for two reasons: 1) nobody wants to bear the cost of the experiment – the failure case could run several hundred million dollars, and 2) those that have explored higher temperatures and have the data aren’t usually interested in sharing these results since the data has substantial commercial value.

 

I was happy to see a recent Datacenter Knowledge article titled What’s Next? Hotter Servers with ‘Gas Pedals’ (ignore the title – Rich just wants to catch your attention). This article includes several interesting tidbits on high temperature operation. Most notable was the quote by Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, who says:

 

Take the data center in the desert. Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, shared an anecdote about a data center user in the Middle East that wanted to test server failure rates if it operated its data center at 45 degrees Celsius – that’s 115 degrees Fahrenheit.

 

Testing projected an annual equipment failure rate of 2.45 percent at 25 degrees C (77 degrees F), and then an increase of 0.36 percent for every additional degree. Thus, 45C would likely result in annual failure rate of 11.45 percent. “Even if they replaced 11 percent of their servers each year, they would save so much on air conditioning that they decided to go ahead with the project,” said Bapat. “They’ll go up to 45C using full air economization in the Middle East.”
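
The arithmetic behind the quoted model is simple enough to check directly. The sketch below uses only the numbers in the quote; the function and parameter names are assumptions made for illustration. Taken exactly as quoted (2.45 percent at 25C plus 0.36 points per degree), the model gives 9.65 percent at 45C. The quoted 11.45 percent corresponds to 25 degrees of increase, which would fit a 20C baseline instead.

```python
def linear_failure_rate(temp_c: float, base_temp_c: float = 25.0,
                        base_rate_pct: float = 2.45,
                        slope_pct_per_c: float = 0.36) -> float:
    """Annual failure rate (percent) under the quoted linear model."""
    return base_rate_pct + slope_pct_per_c * (temp_c - base_temp_c)


# Baseline exactly as quoted: 2.45% at 25C.
for temp in (25, 30, 35, 40, 45):
    print(f"{temp}C: {linear_failure_rate(temp):.2f}%")

# Same slope, but assuming the 2.45% baseline applies at 20C.
print(f"45C from a 20C baseline: {linear_failure_rate(45, base_temp_c=20.0):.2f}%")
```

Either way, each 5C step adds 1.8 points to the projected failure rate under this model.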

 

This study caught my interest for a variety of reasons. First and most importantly, they studied failure rates at 45C and decided that, although they were high, it still made sense for them to operate at these temperatures. It is reported they are happy to pay the increased failure rate in order to save the cost of the mechanical gear. The second interesting data point from this work is that they have found a far greater failure rate for 15C of increased temperature than predicted by MIL-HDBK-217F. Consistent with 217F, they also report a linear relationship between temperature and failure rate. I almost guarantee that the failure rate increase between 40C and 45C was much higher than the difference between 25C and 30C. I don’t buy the linear relationship between failure rate and temperature based upon what we know of failure rates at lower temperatures. Many datacenters have raised temperatures from 20C (68F) to 25C (77F) and found no measurable increase in server failure rate. Some have raised their temperatures from 25C (77F) to 30C (86F) without finding the predicted 1.8% increase in failures but admitted the data remains sparse.

 

The linear relationship is absolutely not there across a wide temperature range. But I still find the data super interesting in two respects: 1) they did see roughly a 9% increase in failure rate at 45C over the control case at 20C and 2) even with the high failure rate, they made an economic decision to run at that temperature. The article doesn’t say if the failure rate is a lifetime failure rate or an Annual Failure Rate (AFR). Assuming it is an AFR, many workloads and customers could not live with an 11.45% AFR. Nonetheless, the data is interesting and it is good to see the first public report of an operator doing the study and reaching a conclusion that works for their workloads and customers.

 

The net of the whole discussion above is that the current linear rules of thumb are wrong and don’t usefully predict failure rates in the temperature ranges we care about, there is very little public data out there, industry groups like ASHRAE are behind, and it’s quite possible that cooling specialists may not be the first to recommend we stop cooling datacenters aggressively :-). Generally, it’s a sparse data environment out there but the potential economic and environmental gains are exciting so progress will continue to be made. I’m looking forward to efficient and reliable operation at high temperature becoming a critical data point in server fleet purchasing decisions. Let’s keep pushing.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 27, 2011 10:25:43 AM (Pacific Standard Time, UTC-08:00)
Monday, February 28, 2011 10:21:11 PM (Pacific Standard Time, UTC-08:00)
Server manufacturers have failure rate versus temperature statistics, but won't release them.

As you posted earlier, manufacturer temperature "knees" are documented in their specifications, e.g. Dell and HP at 35C, SGI/Rackable at 40C, and HP is pushing 35C data center containers. Those are probably the temperatures just before failures create too many repairs for their profit models. Not the same as failure rate versus power, but maybe close.

Very large customers could move server manufacturers to support higher temperatures by creating demand. How much extra would they pay per server to run reliably at 45C?

Maybe higher temperature support would trickle down to smaller customers. Think WalMart and CFL light bulbs.

I had to educate HVAC contractors and other corporate tenants on high-temperature server operations while designing my latest server room. There are many obstacles to running at higher temperature.
Tuesday, March 01, 2011 12:53:19 AM (Pacific Standard Time, UTC-08:00)
James,

Two questions:
Will your organisation disclose 'temperature and failure rates'?

I have seen the DatacenterKnowledge article and other blogs on 'operating environment and IT failure' (remember the Microsoft datacenter outdoor tent (2008) and the Intel Air Economizer POC (2008)), but they did not release the research documents and data behind the tests.
Who should take the future lead in research on this? Who would be able to execute this?

Regards,

Jan.
Tuesday, March 01, 2011 5:48:05 AM (Pacific Standard Time, UTC-08:00)
Rocky said "server manufacturers have failure rates versus temperature stats but won't release them." In a logical world, you would be 100% correct. However, most server manufacturers only test and do failure analysis within the supported temperature range and some don't even do that. Many don't even know their "supported" temp ranges. I had one very large server supplier insist that 35C was unacceptable even though all specs on the server said 35C. Because commercial servers aren't typically run at high temperatures, there really isn't a lot of data.

You are right that large customers can influence the market and Rackable Systems (now SGI but I'm still having trouble calling it SGI) did the CloudRack C2 which supports 45C. Good for Rackable.

I think you are 100% right that these techniques will trickle down over time. There is no question that will happen. Thanks for the comment.

--jrh
Tuesday, March 01, 2011 5:53:03 AM (Pacific Standard Time, UTC-08:00)
Jan asked if I remember the servers in tents work done at Microsoft. Absolutely! That was Christian Belady and it's wonderful work. It wasn't 100% serious -- mostly he was challenging the assumption that servers needed incredibly closely controlled temperature and humidity. It was a good challenge and wakeup call for the industry.

--jrh
Tuesday, March 01, 2011 4:16:31 PM (Pacific Standard Time, UTC-08:00)
James

Just to add to your comments if I may.

Temperature Changes

In a previous life I had a chance to run an experiment on a small DC room. What we did was to turn off the cooling water to see how long it would take for the room to overheat to validate the engineering consultants’ figures. This room was at 60% capacity but the application was not live. Most of the servers and storage were in S&V testing at the time (we didn’t tell the application people we were running the test and they didn’t notice). As I don’t have access to the report any more, the following is from memory.

As the temperature in the room increased, so did the power consumption, around 5% from memory. This (in my opinion) was due to the increase in fan power (to double the air flow you need 4 times the power). Once the server inlet temperature went over 27°C (80°F) you could hear the increase in fan speed and this is when the power consumption really increased. Testing was stopped when air inlet temperature reached 32°C (90°F) as this was the maximum recommended operating temperature of some of the IT.
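
For reference, the idealized fan affinity laws relate fan power to airflow by a cube law, which is steeper than the factor of 4 recalled above: doubling airflow takes roughly 8 times the fan power. A minimal sketch, assuming an ideal fan on a fixed system curve (real server fans and chassis will deviate); the function name is illustrative.

```python
def fan_power_ratio(airflow_ratio: float) -> float:
    """Ideal fan affinity law: power scales with the cube of airflow."""
    if airflow_ratio < 0:
        raise ValueError("airflow ratio must be non-negative")
    return airflow_ratio ** 3


# How fan power grows as airflow is increased over a baseline of 1.0.
for ratio in (1.0, 1.25, 1.5, 2.0):
    print(f"airflow x{ratio:.2f} -> fan power x{fan_power_ratio(ratio):.2f}")
```

The steepness of the curve is why even a modest rise in fan speed becomes audible and shows up on the power meter.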

So from my experience if you let the air-inlet temperatures go above 27°C (80°F) you will see a power increase and an increase in PUE (OK, a lot less than the chillers).

Another problem with running large amounts of outside air is that you must ensure that you control the temperature rate-of-change. Components like hard disks do not like temperature fluctuations (rapid heating and cooling), and this is not easy to control without some type of preconditioning, which is very expensive.

Reliability

I have been attempting to find recent reliability data for electronic equipment for some time. Most of what I have found wasn’t applicable to the DC environment. I did find one common citation in many papers, that was to the “Arrhenius equation” (see below) and the effect on any component based on a chemical reaction. In most cases that will be Electrolytic Capacitors & Batteries. While DC operators will tell you that running Batteries at higher temperatures will reduce their life, those who know what an Electrolytic Capacitor is didn’t know it was based on a chemical reaction.

As Electrolytic Capacitors are very common within most IT, you can expect a reduction in life due to the increased chemical reaction. One of their most common causes of failure is “drying out” due to increased operating temperature. While the designers of IT equipment are well aware of the shortcomings of these components, the fact remains that they will age quicker.

While I will continue my search for reliability data on electronics, I suspect that all the manufacturers want to keep such data in house; we live in a very litigious society now and there is no incentive to publish. As most servers, depending on the application, are end-of-life in 3 to 5 years there is no incentive for the IT manufacturers to design their hardware to run any longer than necessary. In fact there is an incentive for the manufacturers to get IT departments to churn their hardware as often as possible.

That’s all for now

Regards

Anthony

Here is a definition of the Arrhenius Equation from Wiki
“A historically useful generalization supported by the Arrhenius equation is that, for many common chemical reactions at room temperature, the reaction rate doubles for every 10 degree Celsius increase in temperature.”



Anthony Drew
Wednesday, March 02, 2011 2:15:15 PM (Pacific Standard Time, UTC-08:00)
Thanks for your detailed post and sending along your experiences Anthony.

On the increased power you saw at higher temperatures, there are two factors in play: 1) increased semiconductor leakage at higher temperature, and 2) higher fan power consumption. I've measured the former; it is there but it is not a significant factor. You correctly point out the second factor can be relevant. This depends upon the fan speed settings and the mechanical design used by the server. Some servers are incredibly badly designed from a thermal perspective and almost all play it safe and cool more aggressively than needed. Jay Park of Facebook reports they avoided the fan speed problem by changing the temperature to RPM set points. It's nice work: http://www.facebook.com/note.php?note_id=448717123919.

I suspect you are right that very fast rates of temperature change are a problem for server reliability. This is managed partly by the fact that nature moves slowly -- very rapid temperature changes are not common. The other technique to mitigate is to mix hot inside air with (potentially) cold outside air.

Anthony reports that PUE goes up with temperature. I could imagine conditions where this could be true but, in general, it is not. Raising temperature remains one of the most powerful and cheapest levers to drive down PUE.

On reliability, the questions you raised are good ones. We empirically know for sure that the pure linear relationship that some experts report between server failure rates and temperature is simply not true at all temperatures. For example, raising temps from 18C (64F) to 28C (82F) has zero measurable impact on overall server reliability.

Anthony also mentioned the Arrhenius equation as a predictor of the relationship between component failure and operating temperature. This equation is very heavily used in accelerated failure mode testing and it may be very predictive at higher temperatures. But remember the equation is used to relate component operating temps with component failure rates -- this is an equation applied to components and most components used in servers have a recommended operating range that tops out in the 45C and above range: processors are 60C to 70C, hard drives 50C to 60C, memory at 85C to 105C. Component temperatures are dependent upon the quality of the server mechanical design and air speed -- any reasonable inlet temp is less than the component specified operating range so it really comes down to how good the server thermal design is.
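
For readers who want to experiment with the Arrhenius relationship, the acceleration factor between two component temperatures is exp((Ea/k)(1/T_low - 1/T_high)), with temperatures in kelvin. The sketch below is illustrative only: the activation energy of 0.55 eV is an assumed value, chosen because it roughly reproduces the doubling-per-10C generalization near room temperature. It is not a figure for any specific component.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_low_c: float, t_high_c: float) -> float:
    """Ratio of reaction (or failure) rates at t_high_c versus t_low_c."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low_k - 1.0 / t_high_k)
    return math.exp(exponent)


# Assumed activation energy (hypothetical); real components have their own values.
ASSUMED_EA_EV = 0.55

# The same 10C step at room temperature and at a hotter component temperature.
print(arrhenius_acceleration(ASSUMED_EA_EV, 25, 35))
print(arrhenius_acceleration(ASSUMED_EA_EV, 60, 70))
```

With a fixed activation energy, the same 10C step gives a smaller multiplier at higher absolute temperatures, so "doubles every 10C" is only an approximation around one operating point. The temperatures that matter here are component temperatures, not server inlet temperatures.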

--jrh

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton