High data center temperatures are the next frontier for server competition (see pages 16 through 22 of my Data Center Efficiency Best Practices talk: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf and 32C (90F) in the Data Center). At higher temperatures the differences between good and sloppy mechanical designs are much more pronounced and need to be a purchasing criterion.
The infrastructure efficiency gains of running at higher temperatures are obvious. In a typical data center 1/3 of the power arriving at the property line is consumed by cooling systems. Large operational expenses can be avoided by raising the temperature set point. In most climates, raising data center set points to the 95F range will allow a facility to move to a pure air-side economizer configuration, eliminating 10% to 15% of the overall capital expense, with the latter number being the more typical.
These savings are substantial and exciting. But, there are potential downsides: 1) increased server mortality, 2) higher semiconductor leakage current at higher temperatures, 3) increased air movement costs driven by higher fan speeds at higher temperatures. The first, increased server mortality, has very little data behind it. I’ve seen some studies that confirm higher failure rates at higher temperature and I’ve seen some that actually show the opposite. For all servers there clearly is some maximum temperature beyond which failure rates will increase rapidly. What’s unclear is what that temperature point actually is.
We also know that the knee of the curve where failures start to get more common is heavily influenced by the server components chosen and the mechanical design. Designs that cool more effectively will operate without negative impact at higher temperatures. We could try to understand all details of each server and try to build a failure prediction model for different temperatures, but this task is complicated by the diversity of servers and components and the near-complete lack of data at higher temperatures.
So, not being able to build a model, I chose to lean on a different technique that I’ve come to prefer: incent the server OEMs to produce the models themselves. If we ask the server OEMs to warrant the equipment at the planned operating temperature, we’re giving the modeling problem to the folks that have both the knowledge and the skills to model the problem faithfully and, much more importantly, they have the ability to change designs if they aren’t faring well in the field. The technique of transferring the problem to the party most capable of solving it and financially incenting them to solve it will bring success.
My belief is that this approach of transferring the risk, failure modeling, and field result tracking to the server vendor will control point 1 above (increased server mortality rate). We also know that the Telecom world has been operating at 40C (104F) for years (see NEBS), so clearly equipment can be designed to operate correctly at these temperatures and last longer than current servers are kept in service. This issue looks manageable.
The second issue raised above was increased semiconductor current leakage at higher temperatures. This principle is well understood and certainly measurable. However, in the crude measurements I’ve seen, the increased leakage is lost in the noise of higher fan power losses. And, the semiconductor leakage costs are dependent upon semiconductor temperature rather than air inlet temperature. Better cooling designs or higher air volumes can help prevent substantial increases in actual semiconductor temperatures. Early measurements with current servers suggest that this issue is minor so I’ll set it aside as well.
The final issue is hugely important and certainly not lost in the noise. As server temperatures go up, the required cooling air flow will increase. Moving more air consumes more power and, as it turns out, air is an incredibly inefficient fluid to move. More fan speed is a substantial and very noticeable cost. What this tells us is that the savings of higher temperature will get eaten up, slowly at first and more quickly as the temperature increases, until some crossover point where fan power increases dominate conventional cooling system operational costs.
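To make the shape of that curve concrete, here is a rough toy model in Python. It is a sketch under stated assumptions, not a measurement: it assumes the server's heat load is constant, that the fans hold exhaust air at a fixed limit (so required airflow scales with 1/(exhaust limit - inlet temperature)), and that fan power follows the fan affinity laws (power grows with the cube of airflow). The function names and every number below (10 W baseline fan power at 25C inlet, 55C exhaust limit, 100 W of avoided cooling power per server) are hypothetical, chosen only for illustration.

```python
def fan_power_watts(inlet_c, base_inlet_c=25.0, base_fan_w=10.0, max_exhaust_c=55.0):
    """Estimate fan power at a given inlet temperature (toy model).

    Assumptions (all hypothetical): constant heat load, a fixed exhaust
    temperature limit, and fan power proportional to airflow cubed.
    """
    base_delta = max_exhaust_c - base_inlet_c
    delta = max_exhaust_c - inlet_c
    if delta <= 0:
        raise ValueError("inlet air must be cooler than the exhaust limit")
    # Same heat removed with a smaller temperature rise needs more airflow.
    airflow_ratio = base_delta / delta
    # Fan affinity laws: power scales with the cube of airflow.
    return base_fan_w * airflow_ratio ** 3


def crossover_inlet_c(cooling_saved_w, start_c=25.0, stop_c=50.0, step_c=0.5):
    """Return the first inlet temperature (scanning upward) at which the
    extra fan power exceeds the cooling power saved, or None if it never does."""
    baseline = fan_power_watts(start_c)
    t = start_c
    while t <= stop_c:
        if fan_power_watts(t) - baseline > cooling_saved_w:
            return t
        t += step_c
    return None


if __name__ == "__main__":
    for inlet in (25.0, 30.0, 35.0, 40.0, 45.0):
        print(f"{inlet:.0f}C inlet: {fan_power_watts(inlet):.1f} W of fan power")
    print("crossover at", crossover_inlet_c(cooling_saved_w=100.0), "C")
```

Under these made-up numbers, fan power grows slowly over the first few degrees and then climbs steeply, which is the "slowly at first and more quickly" behavior described above. A better mechanical design corresponds to a lower baseline fan power or a higher tolerable exhaust limit, both of which push the crossover to a higher temperature.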
Where is the knee of the curve where increased fan power crosses over and dominates the operational savings of running at higher temperatures? Well, like many things in engineering, the answer is “it depends.” But, it depends in very interesting ways. Even poor mechanical designs built by server manufacturers who think mechanical engineers are a waste of money will be able to run perfectly well at 95F. Even I’m a good enough mechanical engineer to pass this bar. The trick is to put a LARGE fan in the chassis and move lots of air. This approach is very inefficient and wastes much power but it’ll work perfectly well at cooling the server. The obvious conclusion is that points 1 and 2 above really don’t matter. We clearly CAN use 95F approach air to cool servers and maintain them at the same temperature they run today, which eliminates server mortality issues and potential semiconductor leakage issues. But, eliminating these two issues with a sloppy mechanical design will be expensive and waste much power.
A well-designed server with careful part placement, good mechanical design, and careful impeller selection and control will perform incredibly differently from a poor design. The combination of good mechanical engineering and intelligent component selection can allow a server to run at 95F at a nominal increase in power due to higher air movement requirements. A poorly designed system will be expensive to run at elevated temperatures. This is a good thing for the server industry because it’s a chance for them to differentiate and compete on engineering talent rather than all building the same thing and chasing the gray box server cost floor.
In past postings, I’ve said that server purchases should be made on the basis of work done per dollar and work done per joule (see slides at http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf). Measure work done using your workload or a kernel of your workload or a benchmark you feel is representative of your workload. When measuring work done per dollar and work done per joule (one watt for one second), do it at your planned data center air supply temperature. Higher temperatures will save you big operational costs and, at the same time, measuring and comparing servers at high temperatures will show much larger differentiation between server designs. Good servers will be very visibly better than poor designs. And, if we all measure work done per joule (or just power consumption under load) at high inlet temperatures, we’ll quickly get efficient servers that run reliably at high temperature.
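As a minimal sketch of how such a comparison might be tabulated, the Python below computes both metrics from a benchmark run. The class name, field names, and all figures are hypothetical, and two simplifications are assumed: "dollars" means purchase price only, and energy is average power under load multiplied by run duration. The measurements are assumed to have been taken at the planned air supply temperature.

```python
from dataclasses import dataclass


@dataclass
class ServerRun:
    """One benchmark run of one server model (all fields are illustrative)."""
    name: str
    price_dollars: float    # purchase price only; an assumed simplification
    work_units: float       # e.g. requests completed by your workload kernel
    avg_power_watts: float  # average power under load during the run
    duration_s: float       # length of the run in seconds

    def work_per_dollar(self) -> float:
        return self.work_units / self.price_dollars

    def work_per_joule(self) -> float:
        # One joule is one watt for one second.
        return self.work_units / (self.avg_power_watts * self.duration_s)


def rank_by_work_per_joule(runs):
    """Return runs sorted from most to least energy-efficient."""
    return sorted(runs, key=lambda r: r.work_per_joule(), reverse=True)


if __name__ == "__main__":
    # Hypothetical results for two servers tested at a 95F inlet.
    runs = [
        ServerRun("sloppy-design", 1800.0, 1_000_000, 400.0, 1000.0),
        ServerRun("careful-design", 2000.0, 1_000_000, 250.0, 1000.0),
    ]
    for r in rank_by_work_per_joule(runs):
        print(f"{r.name}: {r.work_per_joule():.2f} work/J, "
              f"{r.work_per_dollar():.0f} work/$")
```

In this invented example the cheaper server wins slightly on work per dollar but loses badly on work per joule once its fans are working hard, which is the kind of differentiation that testing at high inlet temperature is meant to expose.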
Make the server suppliers compete for work done per joule at 95F approach temperatures and the server world will evolve quickly. It’s good for the environment and is perhaps the largest and easiest-to-obtain cost reduction on the horizon.
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144 | W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | firstname.lastname@example.org
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.