Sometime back I whined that Power Usage Effectiveness (PUE) is a seriously abused term: PUE and Total Power Usage Efficiency. But I continue to use it because it gives us a rough way to compare the efficiency of different data centers. It’s a simple metric that takes the total power delivered to a facility (total power) and divides it by the amount of power delivered to the servers (critical power or IT load). A PUE of 1.35 is very good today. Some datacenter owners have claimed to be as good as 1.2. Conventionally designed data centers operated conservatively are in the 1.6 to 1.7 range. Unfortunately, most of the industry has a PUE of over 2.0, some are as bad as 3.0, and the EPA reports the industry average is 2.0 (Report to Congress on Server and Data Center Energy Efficiency). A PUE of 2.0 means that for each watt delivered to the IT load (servers, net gear, and storage), one watt is lost in cooling and in power distribution.
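The metric itself is trivial to compute. A quick sketch with illustrative numbers (the wattages below are made up for the example):

```python
# PUE = total facility power / power delivered to the IT load.
def pue(total_facility_watts, it_load_watts):
    return total_facility_watts / it_load_watts

# A PUE of 2.0: for every watt reaching the servers, one more watt
# is consumed by cooling and power distribution.
print(pue(2_000_000, 1_000_000))  # 2.0

# A conservatively run conventional facility lands around 1.65:
print(round(pue(1_650_000, 1_000_000), 2))  # 1.65
```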
Whenever a metric becomes important, managers ask about it and marketing people use it. Eventually we start seeing data points that are impossibly good. The recent Red Sky installation is one of these events. Sandia National Lab’s Red Sky supercomputer is reported to be delivering a PUE of 1.035 in a system without waste heat recovery. In Red Sky at Night, Sandia’s New Computer Might it is reported “The power usage effectiveness of Red Sky is an almost unheard-of 1.035”. The video referenced below also reports Red Sky at a 1.035 PUE. In response to the claimed PUE of 1.035, Rich Miller of Data Center Knowledge astutely asked “How’s this possible?” (see Red Sky: Supercomputing and Efficiency Meet).
The Data Center Knowledge article links to a blog posting, Building Red Sky by Marc Hamilton, which includes a wonderful time-lapse video showing the building of Red Sky: http://www.youtube.com/watch?v=mNW9cYY4tqc. You should watch the 4-minute, 51-second video and I’ll include my notes and observations from the video below. But, before we get to the video, let’s look more closely at the widely reported 1.035 PUE and what it would mean.
A PUE of 1.035 implies that for each 1 watt delivered to the servers, only 0.035 watts are lost in power distribution and mechanical systems. For a facility of this size, I suspect they will have high voltage delivered in the 115kV range. In a conventional power distribution design, they will take 115kV and transform it to mid-voltage (13kV range), then to 480V 3p, then to 208V to be delivered to the servers. In addition to all these conversions, there is some loss in the conductors themselves. And there is considerable loss in even the very best uninterruptible power supply (UPS) systems. In fact, a UPS alone with 3.5% loss is excellent. Excellent power distribution designs will avoid 1 or perhaps 2 of the conversions above and will use a full bypass UPS. But getting these excellent power distribution designs to even within a factor of 2 of the reported 3.5% loss is incredibly difficult, and I’m very skeptical that they are going to get much below 6% to 7%. In fact, if anyone knows how to get below 6% loss in the power distribution system measured fully, I’m super interested and would love to see what you have done, buy you lunch, and take a datacenter tour.
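To see how the conversions compound, here is a sketch multiplying per-stage efficiencies. Every number below is an assumed, round-figure value for illustration, not a measured figure for any particular gear:

```python
# Hypothetical per-stage efficiencies for the conventional chain
# described above (115kV -> 13kV -> 480V -> 208V plus UPS and wiring).
stages = {
    "115kV -> 13kV transformer": 0.995,
    "13kV -> 480V transformer":  0.99,
    "480V -> 208V transformer":  0.98,
    "UPS (3.5% loss, excellent)": 0.965,
    "conductors":                0.99,
}

efficiency = 1.0
for name, eff in stages.items():
    efficiency *= eff

# Even with each stage individually quite good, the compounded loss
# lands in the 6% to 8% range, not 3.5%.
print(f"end-to-end efficiency: {efficiency:.3f}, loss: {1 - efficiency:.1%}")
```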
A 6% loss in power distribution would limit the PUE to nothing lower than 1.06. But, we still have the cooling system to account for. Air is an expensive fluid to move long distances. Consequently, Red Sky brings the water to the server racks using Sun Cooling Door Systems (similar to the IBM iDataPlex Rear Door Cooling system).
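The bound above is simple addition if each overhead is expressed as a fraction of the IT load; the cooling overhead below is a hypothetical figure chosen only to show the shape of the calculation:

```python
# PUE floor given overheads expressed as fractions of the IT load.
def pue_floor(dist_loss_fraction, cooling_overhead_fraction=0.0):
    return 1.0 + dist_loss_fraction + cooling_overhead_fraction

# Distribution alone at 6% already sets the floor at 1.06:
print(round(pue_floor(0.06), 2))  # 1.06

# Add a (hypothetical) 25% cooling overhead and the total lands
# in the 1.3 territory:
print(round(pue_floor(0.06, 0.25), 2))  # 1.31
```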
The Sun Cooling Door System is a nice design that will significantly improve PUE over more conventional CRAC-based mechanical designs. Generally, bringing water close to the heat load in systems that use water (rather than aggressive free-air only designs) is a good approach. The Sun advertising material credibly reports that “A highly efficient datacenter utilizing a holistic design for closely coupled cooling using Sun Cooling Door Systems can reach a PUE of 1.3”.
I know of no way to circulate air through a heat exchanger, pump water to the outside of the building, and then cool the water using any of the many technologies available that can be done at only a 3.5% loss. Which is to say that a PUE of 1.035 can’t be done with the Red Sky mechanical system design even if power distribution losses were ignored completely. I like Red Sky but suspect we’re looking at a 1.35 PUE system rather than the reported 1.035. But that’s OK: 1.35 is quite good and, for a top-10 supercomputer, it’s GREAT.
Note that a PUE of 1.035 is technically possible with waste heat recovery and, in fact, even less than 1.0 can be achieved with waste heat recovery. See the “PUE less than 1.0” section of PUE and Total Power Usage Efficiency for more data on waste heat recovery. Remember, this is “technically possible” rather than achieved in production today; doing it cost-effectively is the challenge. I have seen it applied to related domains that also have large quantities of low-grade heat. For example, a city in Norway is experimenting with waste heat recovery from sewage: Flush the loo, warm your house.
My notes from the Red Sky Video follow:
· 47,232 cores of Intel EM64T Xeon X55xx (Nehalem-EP) 2930 MHz (11.72 GFlops)
o 553 Teraflops
· Infiniband QDR interconnect
o 1,440 cables totaling 9.1 miles
· Operating System: CentOS
· Main Memory: 22,104 GB
· 266 VA [jrh: this is clearly incorrect unless they are talking about each server]
o Each rack is 32kW
· 96 JBOD enclosures
o 2,304 1TB disks
· 12 GB RAM/node & 70TB total
· PUE 1.035 [jrh: I strongly suspect they meant 1.35]
· 328 tons cooling
· 7.3million gallons of water per year
The video is worth watching, although if you play with cross-referencing the numbers above, there appear to be many mistakes: Red Sky Time Lapse. Thanks to Jeff Barr for sending this one my way.
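Some of that cross-referencing can be done quickly. The checks below use only figures from the notes above; the node-count inference is mine and only shows where the numbers disagree:

```python
# Cross-checking figures from the Red Sky video notes.
cores = 47_232
gflops_per_core = 11.72
print(cores * gflops_per_core / 1000)  # ~553.6 TF, matches the 553 TF claim

jbods = 96
disks = 2_304
print(disks / jbods)  # 24 disks per JBOD enclosure, plausible

# The memory figures don't reconcile: 22,104 GB total main memory at
# 12 GB RAM/node would imply ~1,842 nodes, so at least one of the
# listed memory numbers is likely wrong.
print(22_104 / 12)  # 1842.0
```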
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Thanks for the analysis Craig.
Thanks for the pointer to the video, which nicely shows a number of improvements in energy, space, and cost efficiency, relative to previous HPC installations on this scale.
Following the earlier comment, waste heat moves from air to refrigerant via the Sun rack rear doors, and thence to water via the Liebert XDP refrigerant-pumping units along the back wall. I’d be interested to know whether anything innovative happens to the waste heat after it reaches the chilled-water loop.
Following your comments on the power supply chain, some efficiency is gained by using APC’s autotransformer-based PDUs at the ends of rows to go directly from 480VAC-3p to 400VAC-3p (230VAC-1p line-to-neutral) feeding the blade-server and switch chassis. A whitepaper describing the architecture is at http://www.apcmedia.com/salestools/NRAN-6CN8PK_R0_EN.pdf.
With power, cooling, and network all routed overhead, leaving only chilled-water piping underneath, it would have been interesting to see them take the next step and also eliminate the traditional raised floor. I can also recommend an older whitepaper on some disadvantages of raised floors: http://www.apcmedia.com/salestools/SADE-5TNQYN_R1_EN.pdf.
Based only on what is actually shown, I’d agree that PUE of 1.35 seems plausible.
You’re more generous than I. That would be one amazingly efficient heat recovery system to get them down to 1.035.
I’ve had trouble getting heat pumps to an overall economic win when I’ve played through the efficiencies, but I’ll check out what Sun has done. Thanks for the pointer.
Minor correction: Sun offers 2 types of heat exchanger doors, one using water and the other using R134a; Red Sky has the second type deployed. This door works with Liebert heat pump systems, so maybe they really are recovering waste heat here.