Over the last 10 years, there has been considerable innovation in data center cooling. Large operators are now able to operate at a Power Usage Effectiveness (PUE) of 1.10 to 1.20. This means that less than 20% of the power delivered to the facility is lost to power distribution and cooling. These days, very nearly all of the power delivered to the facility is delivered to the servers.
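As a quick check on those numbers: PUE is total facility power divided by the power that reaches the IT equipment, so the share of facility power lost to overhead is 1 - 1/PUE. The sketch below is only illustrative arithmetic, and the function name is my own.

```python
def overhead_share(pue: float) -> float:
    """Fraction of total facility power that never reaches the IT equipment.

    PUE = total facility power / IT power, so the IT share is 1 / PUE.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue


for pue in (1.10, 1.20):
    print(f"PUE {pue:.2f}: {overhead_share(pue):.1%} of facility power is overhead")
```

At a PUE of 1.10 to 1.20 this works out to roughly 9% to 17% of facility power, which is in line with the "less than 20%" figure.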
I would never say there is no more innovation coming, but most of the ideas I’ve been seeing recently in data center cooling designs are familiar. They are good engineering and often somewhat more efficient than past approaches but still largely the same as previous work. However, in recent discussions with DeepWater Desal, I came across an idea that I really think has potential. In this approach, the co-location of a desalination plant and data centers is used to reduce the power consumption of both.
DeepWater Desal plans to build a desalination plant at Monterey Bay. Desalination produces drinking water from sea water. Given the abundance of sea water in the world and the shortage of drinking water in many parts of the world, these plants are becoming more common. They are fairly power intensive but are still used extensively throughout the world, especially in the Middle East.
DeepWater Desal proposes to mitigate the power consumption of desalination in a very creative way. Rather than reduce the power required to desalinate water, they propose to co-locate up to 150MW of data center facilities on site and reduce the power required to cool the data center. Essentially, the desalination plant and data centers would be symbiotic and the overall power consumption of the two plants together would be lower.
Here’s how it works. In order to avoid plankton and other life forms that plug up the plant’s filters and increase operating costs, the desalination plant will draw water from 100’ below the surface in Monterey Bay. This water will have upwelled from even deeper down the canyons of Monterey Bay and will be quite cold.
Taking water from lower in the bay reduces the potential for negative impact on the local ecosystem by putting the intake below the majority of it, but this has the downside of sourcing much colder water. Cold water is less efficient to desalinate and, consequently, considerably more water will need to be pumped, which increases the pumping power expenses considerably. If the water is first run through the data center cooling heat exchanger, at very little increased pumping loss, the data center gets cooled essentially for free (just the cost of circulating its cooling plant). And, as an additional upside, the desalination plant gets warmer feed water, which can reduce pumping losses by millions of dollars annually. A pretty nice solution.
There have been many examples in the past of data centers cooled by deep water. For example: 46MW with Water Cooling at a PUE of 1.10. There have also been examples of data centers cooled using salt water: Google Opening Saltwater-cooled data center. What’s different and interesting in this case is that someone else is covering most of the data center pumping costs and there are additional and quite substantial gains in delivering warmer water to the co-located desalination plant.
Since desalination, even when done efficiently, is a power intensive business, a new municipal utility is being created that will deliver power to the co-located data center facilities at 6 to 8 cents per kWh, which is higher than some geographies but is actually quite a good rate for data center commercial power in California.
If you are interested in siting a data center in Monterey that is better for the environment, cheaper to operate, and not a bad place to live, contact Grant Gordon, COO of DeepWater Desal (email@example.com).
--James Hamilton, firstname.lastname@example.org / http://perspectives.mvdirona.com
It’s an unusual time in our industry where many of the most interesting server, storage, and networking advancements aren’t advertised, don’t have a sales team, don’t have price lists, and actually are often never even mentioned in public. The largest cloud providers build their own hardware designs and, since the equipment is not for sale, it’s typically not discussed publicly.
A notable exception is Facebook. They are big enough that they do some custom gear but they don’t view their hardware investments as differentiating. That may sound a bit strange -- why spend on something if it is not differentiating? Their argument is that if some other social networking site adopts the same infrastructure that Facebook uses, it’s very unlikely to change social networking market share in any measurable way. Customers simply don’t care about infrastructure. I generally agree with that point. It’s pretty clear that if MySpace had adopted the same hardware platform as Facebook years ago, it really wouldn’t have changed the outcome. Facebook also correctly argues that OEM hardware just isn’t cost effective at the scale they operate. The core of this argument is that custom hardware is needed and social networking customers go where their friends go whether or not the provider does a nice job on the hardware.
I love the fact that part of our industry is able to be open about the hardware they are building but I don’t fully agree that hardware is not differentiating in the social networking world. For example, maintaining a deep social graph is actually a fairly complex problem. In fact, I remember when tracking friend-of-a-friend over 10s of millions of users, a required capability of any social networking site today, was still just a dream. Nobody had found the means to do it without massive costs and/or long latencies. Lower cost hardware and software innovation made it possible and the social network user experience and engagement has improved as a consequence.
Looking at a more modern version of the same argument, it has not been cost effective to store full resolution photos and videos using today’s OEM storage systems. Consequently, most social networks haven’t done this at scale. It’s clear that storing full resolution images would be a better user experience and it’s another example where hardware innovation could be differentiating.
Of course the data storage problem is not restricted to social networks nor just to photo and video. The world is realizing the incredible value of data at the same time the costs of storage are plummeting. Most companies’ storage assets are growing quickly. Companies are hiring data scientists because even the most mundane bits of operational data can end up being hugely valuable. I’ve always believed in the value of data but more and more companies are starting to realize that data can be game changing to their businesses. The perceived value is going up fast while, at the same time, the industry is realizing that if you have weekly data, it is good. But daily is better, hourly is a lot better, 5 min is awesome but you really prefer 1 min granularity. This number keeps falling. The perceived value of data is climbing, the resolution of measures is becoming finer and, as a consequence, the amount of data being stored is skyrocketing. Most estimates have data volumes doubling every 12 to 18 months -- somewhat faster than Moore’s law. Since all operational data backs up to cold storage, cold storage is always going to be larger than any other data storage category.
Next week, Facebook will show work they have been doing in cold storage mostly driven by their massive image storage problem. At OCP Summit V an innovative low-cost archival storage hardware platform will be shown. Archival projects always catch my interest because the vast majority of the world’s data is cold, the percentage that is cold is growing quickly, and I find the purity of a nearly single dimensional engineering problem to be super interesting. Almost the only dimension of relevance in cold storage is cost. See Glacier: Engineering for Cold Data Storage in the Cloud for more on this market segment and how Amazon Glacier is addressing it in the cloud.
This Facebook hardware project is particularly interesting in that it’s based upon optical media rather than tape. Tape economics come from very low cost media combined with only a small number of fairly expensive drives. Robots move the tape back and forth between storage slots and the drives when needed. Facebook is taking the same basic approach of using robotic systems to allow a small number of drives to support a large media pool. But, rather than using tape, they are leveraging the high volume Blu-ray disc market with the volume economics driven by consumer media applications. Expect to see over a Petabyte of Blu-ray discs supplied by a Japanese media manufacturer housed in a rack built by a robotic systems supplier.
I’m a huge believer in leveraging consumer component volumes to produce innovative, low-cost server-side solutions. Optical is particularly interesting in this application and I’m looking forward to seeing more of the details behind the new storage platform. It looks like very interesting work.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
I frequently get asked “why not just put solar panels on data center roofs and run them on that.” The short answer is datacenter roofs are just way too small. In a previous article (I Love Solar But…) I did a quick back of envelope calculation and, assuming a conventional single floor build with current power densities, each square foot of datacenter space would require roughly 362 sq ft of solar panels. The roof would contribute less than 1% of the facility requirements. Quite possibly still worth doing but there is simply no way a roof top array is going to power an entire datacenter.
There are other issues with roof top arrays, the two biggest of which are weight and strong wind protection. A roof requires significant reinforcement to support the weight of an array and the array needs to be firmly secured against strong winds. But both of these issues are fairly simple engineering problems and very solvable. The key problem with powering datacenters with solar is insufficient solar array power density. If we were to believe the data center lighting manufacturers’ estimates (Lighting is the Unsung Hero in Data Center Energy Efficiency), a roof top array wouldn’t be able to fully power the facility lighting system. Most modern data centers have moved to more efficient lighting systems but the difficult fact remains: a roof top array will not supply even 1% of the overall facility draw.
When last discussing the space requirements of a solar array sufficient to power a datacenter in I Love Solar But…, a few folks took exception to my use of sq ft as the measure of solar array size. The legitimate criticism raised was that, in using sq ft and then computing that a large datacenter would require 181 million sq ft of solar array, I made the solar farm look unreasonably large. Arguably, 4,172 acres is a better unit and produces a much smaller number. All true.
The real challenge for most people in trying to understand the practicality of solar to power datacenters is to get a reasonable feel for how big the land requirements actually would be. They sound big but data centers are big and everything associated with them is big. Large numbers aren’t remarkable. One approach to calibrating the “how big is it?” question is to go with a ratio. Each square foot of data center would require approximately 362 square feet of solar array. A ratio of hundreds to one tells us that roof top solar power is a contributor but not a solution. It also helps explain why solar is not a practical solution in densely packed urban areas, especially where multi-floor facilities are in use.
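To make the ratio concrete, here is a small sketch of how much of the facility demand a roof top array could cover. It assumes the roof has the same footprint as one floor of data center space and that every floor has the same power density; the function and parameter names are mine.

```python
def roof_share(array_sqft_per_dc_sqft: float = 362.0, floors: int = 1) -> float:
    """Fraction of facility power a full-coverage roof array could supply.

    Assumes the roof matches the footprint of one floor and that each floor
    needs array_sqft_per_dc_sqft of solar array per sq ft of floor space.
    """
    if floors < 1:
        raise ValueError("floors must be at least 1")
    return 1.0 / (array_sqft_per_dc_sqft * floors)


for floors in (1, 2, 4):
    share = roof_share(floors=floors)
    print(f"{floors} floor(s): roof array covers {share:.2%} of demand")
```

For a single floor build the roof covers a bit under 0.3% of demand, and each added floor divides that share again, which is why multi-floor urban facilities fare even worse.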
The Kagoshima Nanatsujima Mega Solar Power Plant, which opened last week, is a large array at over 70MW and it gives us another scaling point for the land requirements of large solar plants:
· Name: Kagoshima Nanatsujima Mega Solar Power Plant
· Location: 2 Nanatsujima, Kagoshima City, Kagoshima Prefecture, Japan
· Annual output: Approx. 78,800MWh (projected)
· Construction timeline: September 2012 through October 2013
· Total Investment: Approx. 27 billion yen (approx. 275.5 million U.S. dollars)
This is an impressive deployment with 70MW peak output over 314 acres. That’s roughly 4.5 acres/MW, which is fairly consistent with other solar plants I’ve written up in the past. The estimated annual output for Kagoshima is 78,800MWh which is, on average, 9MW output. The reason the expected average output is only 13% of the peak output is that solar plants aren’t productive 24x7.
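The arithmetic behind those figures is easy to reproduce. The sketch below uses only the numbers quoted above (70MW peak, 78,800MWh per year, 314 acres) and assumes a non-leap year of 8,760 hours; the function names are mine.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def average_output_mw(annual_mwh: float) -> float:
    """Average continuous output implied by a yearly energy total."""
    return annual_mwh / HOURS_PER_YEAR


def capacity_factor(annual_mwh: float, peak_mw: float) -> float:
    """Average output as a fraction of peak output."""
    return average_output_mw(annual_mwh) / peak_mw


peak_mw = 70.0
annual_mwh = 78_800.0
acres = 314.0

print(f"Land use: {acres / peak_mw:.1f} acres/MW")
print(f"Average output: {average_output_mw(annual_mwh):.1f} MW")
print(f"Capacity factor: {capacity_factor(annual_mwh, peak_mw):.0%}")
```

This gives about 4.5 acres/MW, 9.0MW average output, and a 13% capacity factor.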
Datacenter power consumption tends to be fairly constant day and night, with night being somewhat lower due to typically lower cooling requirements. But with only 10 to 15% of the total power in a modern datacenter going to cooling, the difference between day and night is actually not that large. Where there are big differences in datacenter power consumption is between full and partial utilization. Idle critical loads can be as low as 50% of full loads, so some variability in power consumption will be seen in a datacenter, but datacenters are largely classified as base load (near constant) demand. Using these numbers, the Kagoshima solar facility is large enough to power ½ of a large datacenter.
The Kagoshima Nanatsujima Mega Solar Power Plant is a vast plant and anything at scale is interesting. It’s well worth learning more about the engineering behind this project but one aspect that caught my interest is that there are some excellent pictures of this deployment that do a better job of conveying scale than my previous “several million sq ft” reference. The Kagoshima array is 314 acres, which is 13.7 million sq ft or 1.27 million sq meters, but these numbers aren’t that meaningful to most of us and pictures tell a far better story.
Above Photos from: http://global.kyocera.com/news/2013/1101_nnms.html
Photo Credit: http://www.sma.de/en/products/references/kagoshima-nanatsujima-mega-solar-power-plant.html
No matter how you look at it, the Kagoshima array is truly vast. I’m super interested in all forms of energy production and consider myself fairly familiar with numbers on this scale and yet I still find the size of this plant, when shown in its entirety, to be surprising. It really is big.
The cost of solar is falling fast due to deep R&D investment but a facility of this scope and scale is still fairly expensive. At US$275 million the solar plant is more expensive than the datacenter it would be able to power.
Despite the shortage of land in Japan, expect to see continued investment in large scale solar arrays for at least another year or two. This is driven by the Feed-in Tariff (FIT) legislation, which requires Japanese power companies to buy all the power produced by any solar plant bigger than 10kW at a generous 20 year fixed rate of 42 yen/kWh, which at current exchange rates is US$0.42/kWh (Japan Creates Potential $9.6 Billion Solar Boom with FITs and also 1.8GW of New Solar Power for Japan in Q2). As a point of comparison, commercial power in the US, if purchased in large quantities, is often available at rates below $0.05/kWh.
Yano Research Institute predicts that the massive Japanese solar boom driven by the FIT legislation will grow until 2014 and then taper off due to land availability pressure: Japan Market for Large-Scale Solar to Drop After 2014, Yano Says.
One possible solution to both the solar space consumption problem and the inability to produce power in poor light conditions is orbital solar arrays (Japan Aims to Beam Solar Energy Down From Orbit). This technology is nowhere close to a reality today but it is still of interest.
More articles on the Kagoshima Nanatsujima Mega Solar Power Plant:
· KYOCERA Starts Operation of 70MW Solar Power Plant, the Largest in Japan
· Japan's biggest solar plant starts generating power in SW. Japan
· Kyocera completes Japan’s largest offshore solar energy plant in Kagoshima
For those of you able to attend the AWS re:Invent conference in Vegas, I’ll see you there. I’ll be presenting Why Scale Matters and How the Cloud Really is Different Thursday at 4:15 in the Delfino room and I’ll be around the conference all week.
Back in 2009, in Datacenter Networks are in my way, I argued that the networking world was stuck in the mainframe business model: everything vertically integrated. In most datacenter networking equipment, the core Application Specific Integrated Circuit (ASIC, the heart of a switch or router), the entire hardware platform for the ASIC including power and physical network connections, and the software stack including all the protocols all come from a single vendor and there is no practical mechanism to make different choices. This is how the server world operated 40 years ago and we get much the same result. Networking gear is expensive, interoperates poorly, is expensive to manage, and is almost always over-subscribed and constraining the rest of the equipment in the datacenter.
Further exacerbating what is already a serious problem, unlike the mainframe server world of 40 years back, networking equipment is also unreliable. Each device has 10s of millions of lines of code under the hood forming frustratingly productive bug farms. Each fault is met with a request from the vendor to “install the latest version.” The new build is usually different from what was previously running but “different” isn’t always good when running production systems. It’s just a new set of bugs to start chasing. The core problem is that many customers ask for complex features they believe will help make their network easier to manage. The networking vendor knows delivering these features keeps the customer uniquely dependent upon that vendor’s single solution. One obvious downside is the vendor lock-in that follows from these helpful features but the larger problem is that these extensions are usually not broadly used and not well tested in subsequent releases, and the overall vendor protocol stacks become yet more brittle as they become more complex, aggregating all these unique customer requests. This is an area in desperate need of change.
Networking gear is complex and, despite all of it implementing the same RFCs, equipment from different vendors (and sometimes the same vendor) still interoperates poorly. It’s very hard to deliver reliable networks at controllable administration costs while freely mixing and matching equipment from multiple vendors. The customer is locked in, the vendors know it, and the network equipment prices reflect that realization.
Not only is networking gear expensive absolutely but the relative expense of networking is actually increasing over time. Tracking the cost of networking gear as a ratio of all the IT equipment (servers, storage, and networking) in a data center, a terrible reality emerges. For a given spend on servers and storage, the required network cost has been going up each year I have been tracking it. Without a fundamental change in the existing networking equipment business model, there is no reason to expect this trend will change.
Many of the needed ingredients for change actually have been in place for more than half a decade now. We have very high function networking ASICs available from Broadcom, Marvell, Fulcrum (Intel), and many others. Each competes with the others, driving much faster innovation and ensuring that cost decreases are passed on to customers rather than simply driving more profit margin. Each ASIC design house produces reference designs that are built by multiple competing Original Design Manufacturers, each with their own improvements. Taking the widely used Broadcom ASIC as an example, routers based upon this same ASIC are made by Quanta, Accton, DNI, Foxconn, Celestica, and many others. Each competes with the others, driving much faster innovation and ensuring that the cost decreases are passed on to customers rather than further padding networking equipment vendor margins.
What is missing is high quality control software, management systems, and networking protocol stacks that can run across a broad range of competing, commodity networking hardware. It’s still very hard to take merchant silicon ASICs packaged in ODM produced routers and deploy production networks. Very big datacenter operators actually do it but it’s sufficiently hard that this gear is largely unavailable to the vast majority of networking customers.
One of my favorite startups, Cumulus Networks, has gone after exactly the problem of making ODM produced commodity networking gear available broadly with high quality software support. Cumulus supports a broad range of ODM produced routing platforms built upon Broadcom networking ASICs. They provide everything it takes above the bare metal router to turn an ODM platform into a production quality router. Included is support for both layer 2 switching and layer 3 routing protocols including OSPF (v2 and v3) and BGP. Because the Cumulus system includes and is hosted on a Linux distribution (Debian), many of the standard tools, management, and monitoring systems just work. For example, they support Puppet, Chef, collectd, SNMP, Nagios, bash, python, perl, and ruby.
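To give a feel for what "standard tools just work" can mean in practice, here is a minimal Python sketch that reports link state the way one would on any Linux server. It assumes the switch ports appear as ordinary Linux network interfaces under /sys/class/net; that layout is standard Linux sysfs behavior and my assumption about the switch, not something taken from Cumulus documentation. The function name is hypothetical.

```python
from pathlib import Path


def port_states(sysfs_net: str = "/sys/class/net") -> dict[str, str]:
    """Map each network interface name to its operational state.

    Reads the standard Linux sysfs tree, where each interface is a directory
    containing an 'operstate' file with values such as 'up' or 'down'.
    Returns an empty dict if the sysfs path does not exist.
    """
    root = Path(sysfs_net)
    if not root.is_dir():
        return {}
    states = {}
    for iface in sorted(root.iterdir()):
        state_file = iface / "operstate"
        if state_file.is_file():
            states[iface.name] = state_file.read_text().strip()
    return states


if __name__ == "__main__":
    for name, state in port_states().items():
        print(f"{name}: {state}")
```

The same few lines would run unchanged on a server, which is the point: monitoring and automation written for Linux hosts needs no switch-specific API.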
Rather than implement a proprietary device with proprietary management as the big networking players typically do, or make it look like a Cisco router as many of the smaller players often do, Cumulus makes the switch look like a Linux server with high-performance routing optimizations. Essentially it’s just a routing optimized Linux server.
The business model is similar to Red Hat Linux where the software and support are available on a subscription model at a price point that makes a standard network support contract look like the hostage payout that it actually is. The subscription includes the entire turnkey stack with everything needed to take one of these ODM produced hardware platforms and deploy a production quality network. Subscriptions will be available directly from Cumulus and through an extensive VAR network.
Cumulus supported platforms include Accton AS4600-54T (48x1G & 4x10G), Accton AS5600-52x (48x10G & 4x40G), Agema (DNI brand) AG-6448CU (48x1G & 4x10G), Agema AG-7448CU (48x10G & 4x40G), Quanta QCT T1048-LB9 (48x1G & 4x10G), and Quanta QCT T-3048-LY2 (48x10G & 4x40G). Here’s a picture of many of these routing platforms from the Cumulus QA lab:
In addition to these single ASIC routing and switching platforms, Cumulus is also working on a chassis-based router to be released later this year:
This platform has all the protocol support outlined above and delivers 512 ports of 10G or 128 ports of 40G in a single chassis. High-port count chassis-based routers have always been exclusively available from the big, vertically integrated networking companies mostly because high-port count routers are expensive to design and are sold in lower volumes than the simpler, single ASIC designs commonly used as Top of Rack or as components of aggregation layer fabrics. Cumulus and their hardware partners are not yet ready to release more details on the chassis but the plans are exciting and the planned price point is game changing. Expect to see this later in the year.
Cumulus Networks was founded by JR Rivers and Nolan Leake in early 2010. Both are phenomenal engineers and I’ve been a huge fan of their work since meeting them as they were first bringing the company together. They raised seed funding in 2011 from Andreessen Horowitz, Battery Ventures, Peter Wagner, Gurav Garv, Mendel Rosenblum, Diane Greene, and Ed Bugnion. In mid-2012, they did an A-round from Andreessen Horowitz and Battery Ventures.
The pace of change continues to pick up in the networking world and I’m looking forward to the formal announcement of the Cumulus chassis-based router.
In the data center world, there are few events taken more seriously than power failures and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America, eclipsing the final episode of Mash. World-wide, the Super Bowl is only behind the European Cup (UEFA Champions League) which draws 178 million viewers.
When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.
At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure led to a 34 min delay to restore full lighting to the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.
What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supplying power to the Superdome, reported their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.
The joint report from SMG, the company that manages the Superdome, and Entergy, the utility power provider, said:
A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.
Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.
Essentially, the utility circuit breaker detected an “anomaly” and opened the breaker. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is they are incredibly flexible and can be configured uniquely for each installation. The disadvantage of software systems is that the wide variety of configurations they can support can be complex and the default configurations are used perhaps more often than they should be. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.
The cause has not yet been reported and, quite often, the underlying root cause is never found. But it’s worth asking: is it possible to avoid long game outages and what would it cost? As when looking at any system fault, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.
Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through a $336 million renovation, the switchgear may have been relatively new and, even if it wasn’t, it was almost certainly recently maintained and inspected.
Where issues often arise is in configuration. Modern switchgear have an amazingly large number of parameters, many of which interact with each other and, in total, can be difficult to fully understand. And, given the switchgear manufacturers know little about the intended end-use application of each switchgear sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than that of a breaker that fails to open. Consequently, conservative settings are common.
Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.
The first, testing in a full production environment in a non-mission critical setting, is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skilled electricians and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).
Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by a utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.
The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds each protected by its own generator. Generators in the 2.5 to 3.0MW range are substantial V16 diesel engines the size of a mid-sized bus. They are expensive, running just under $1M each, but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year round with a permanent installation. We would need 2 generators, the switchgear to connect them to the load, and uninterruptible power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy a third generator just in case there is a problem and one of the two generators doesn’t start. The generators are under $1M each and the entire redundant power configuration with the extra generator could be had for under $10M. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4M.
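The value of that third generator can be sketched with a simple "at least k of n" calculation. This is only an illustration: the 98% per-generator start probability is a hypothetical figure I picked for the example, and the model assumes generators start or fail independently of each other.

```python
from math import comb


def prob_enough_start(needed: int, installed: int, p_start: float) -> float:
    """Probability that at least `needed` of `installed` generators start.

    Assumes each generator starts independently with probability p_start.
    """
    return sum(
        comb(installed, k) * p_start**k * (1.0 - p_start) ** (installed - k)
        for k in range(needed, installed + 1)
    )


P_START = 0.98  # hypothetical per-generator start reliability, not from any data

for installed in (2, 3):
    p = prob_enough_start(needed=2, installed=installed, p_start=P_START)
    print(f"{installed} generators installed: {p:.2%} chance that at least 2 start")
```

Under these assumed numbers, adding one spare moves the chance of carrying the full load from about 96% to about 99.9%, which is the reasoning behind buying one more generator than the load strictly needs.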
For the price of just over 60 seconds of commercials the facility could be protected against this fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities choose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage so I’m not a big fan of this “safety” configuration but it is a common choice.
Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.
Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense is small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checkerboarded over the facility lights (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
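The checkerboarding idea can be sketched in a few lines of Python. This is an illustration under assumptions of my own, not a description of any real stadium: the 48 lights, 4 feeds, and 12-light regions are hypothetical numbers, and lights are modeled as a simple row in physical order.

```python
def assign_feeds(num_lights, num_feeds, checkerboard):
    """Return the feed index powering each light, lights listed in physical order."""
    if checkerboard:
        # Interleave feeds so neighbouring lights are on different feeds.
        return [i % num_feeds for i in range(num_lights)]
    # Contiguous layout: each feed powers one solid block of neighbouring lights.
    block = num_lights // num_feeds
    return [min(i // block, num_feeds - 1) for i in range(num_lights)]


def worst_region_loss(feeds, failed_feed, region_size):
    """Largest fraction of lights that go dark in any one contiguous region."""
    worst = 0.0
    for start in range(0, len(feeds), region_size):
        region = feeds[start:start + region_size]
        dark = sum(1 for f in region if f == failed_feed)
        worst = max(worst, dark / len(region))
    return worst


# Hypothetical layout: 48 lights, 4 feeds, regions of 12 neighbouring lights.
contiguous = assign_feeds(48, 4, checkerboard=False)
interleaved = assign_feeds(48, 4, checkerboard=True)

print(worst_region_loss(contiguous, failed_feed=0, region_size=12))
print(worst_region_loss(interleaved, failed_feed=0, region_size=12))
```

In both layouts one quarter of the lights are lost in total. The difference is that the contiguous layout puts one whole region in the dark while the interleaved layout dims every region by the same quarter.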
Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.
In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minutes checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recovery decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20? The gas discharge lighting still favored at large sporting venues takes roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game and the relatively low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant designs and employing more aggressive pre-game testing.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Since 2008, I’ve been excited by, working on, and writing about Microservers. In these early days, some of the workloads I worked with were I/O bound and didn’t really need or use high single-thread performance. Replacing the server class processors that supported these applications with high-volume, low-cost client system CPUs yielded both better price/performance and power/performance. Fortunately, at that time, there were good client processors available with ECC enabled (see You Really DO Need ECC) and most embedded system processors also supported ECC.
I wrote up some of the advantages of these early microserver deployments and showed performance results from a production deployment in an internet-scale mail processing application in Cooperative, Expendable, Microslice, Servers: Low-Cost, Low-Power Servers for Internet-Scale Services.
Intel recognizes the value of low-power, low-cost processors for less CPU demanding applications and announced this morning the newest members of the Atom family, the S1200 series. These new processors support 2 cores and 4 threads and are available in variants of up to 2GHz while staying under 8.5 watts. The lowest power members of the family come in at just over 6W. Intel has demonstrated an S1200 reference board running spec_web at 7.9W including memory, SATA, networking, BMC, and other on-board components.
Unlike past Atom processors, the S1200 series supports full ECC memory. And all members of the family support hardware virtualization (Intel VT-x2), 64 bit addressing, and up to 8GB of memory. These are real server parts.
Centerton (S1200 series) features:
One of my favorite Original Design Manufacturers, Quanta Computer, has already produced a shared infrastructure rack design that packs 48 Atom S1200 servers into a 3 rack unit form factor (5.25”).
Quanta S900-X31A front and back view:
Quanta S900-X31a server drawer:
Quanta has done a nice job with this shared infrastructure rack. Using this design, they can pack a booming 624 servers into a standard 42 RU rack.
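The rack density numbers are easy to sanity check. In the Python sketch below, the 48 servers per 3 RU drawer, the 624 servers per rack, and the 42 RU rack come from the text. Applying Intel's 7.9W reference board figure to every server in the Quanta rack is my assumption, since drawer-level overheads (fans, power conversion, switching) are not given here, and what the leftover rack units are used for is not stated.

```python
SERVERS_PER_DRAWER = 48    # from the text
DRAWER_RU = 3              # from the text
RACK_RU = 42               # from the text
SERVERS_PER_RACK = 624     # from the text
REFERENCE_BOARD_W = 7.9    # Intel reference board figure; applying it here is an assumption


def drawers_needed(servers, per_drawer):
    """Number of drawers required to hold the given server count (rounded up)."""
    return -(-servers // per_drawer)


drawers = drawers_needed(SERVERS_PER_RACK, SERVERS_PER_DRAWER)
ru_used = drawers * DRAWER_RU
ru_left = RACK_RU - ru_used
rack_watts = SERVERS_PER_RACK * REFERENCE_BOARD_W

print(f"{drawers} drawers use {ru_used} RU, leaving {ru_left} RU free")
print(f"server power at the reference figure: {rack_watts / 1000:.1f} kW")
```

So 624 servers corresponds to 13 drawers in 39 RU, with 3 RU left over in a 42 RU rack.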
I’m excited by the S1200 announcement because it’s both a good price/performer and power/performer and shows that Intel is serious about the microserver market. This new Atom gives customers access to microserver pricing without having to change instruction set architectures. The combination of low-cost, low-power, and the familiar Intel ISA with its rich tool chain and broad application availability is a compelling combination. It’s exciting to see the microserver market heating up and I like Intel’s roadmap looking forward.
Related Microserver focused postings:
· Cooperative Expendable Microslice Servers: Low-cost, Low-power Servers for Internet Scale Services
· The Case for Low-Cost, Low-Power Servers
· Low Power Amdahl Blades for Data Intensive Computing
· Microslice Servers
· ARM Cortex-A9 Design Announced
· 2010 the Year of the Microslice Server
· Very Low Power Server Progress
· Nvidia Project Denver: ARM Powered Servers
· ARM V8 Architecture
· AMD Announced Server-Targeted ARM Part
· Quanta S900-X31A
I have been interested in, and writing about, microservers since 2007. Microservers can be built using any instruction set architecture but I’m particularly interested in ARM processors and their application to server-side workloads. Today Advanced Micro Devices announced they are going to build an ARM CPU targeting the server market. This will be a 4-core, 64-bit part running at more than 2GHz that is expected to sample in 2013 and ship in volume in early 2014.
AMD is far from new to the microserver market. In fact, much of my past work on microservers has been AMD-powered. What’s different today is that AMD is applying their server processor skills while, at the same time, leveraging the massive ARM processor ecosystem. ARM processors power Apple iPhones, Samsung smartphones, tablets, disk drives, and applications you didn’t even know had computers in them.
The defining characteristic of server processor selection is to focus first and most on raw CPU performance and accept the high cost and high-power consumption that follows from that goal. The defining characteristic of Microservers is we leverage the high-volume client and connected device ecosystem and make a CPU selection on the basis of price/performance and power/performance with an emphasis on building balanced servers. The case for microservers is anchored upon these 4 observations:
· Volume economics: Rather than draw on the small-volume economics of the server market, with Microservers we leverage the massive volume economics of the smart device world driven by cell phones, tablets, and clients. To give some scale to this observation, IDC reports that there were 7.6M server units sold in 2010. ARM reports that there were 6.1B ARM processors shipped last year. The connected and embedded device market volumes are on the order of 1000x larger than that of the server market and the performance gap is shrinking rapidly. Semiconductor analyst Semicast estimates that by 2015 there will be 2 ARM processors for every person in the world. In 2010, ARM reported that, on average, there were 2.5 ARM-based processors in each smartphone.
Having watched and participated in our industry for nearly 3 decades, one reality seems to dominate all others: high-volume economics drives innovation and just about always wins. As an example, IBM mainframes ran just about every important server-side workload in the mid-80s. But, they were largely swept aside by higher-volume RISC servers running UNIX. At the time I loved RISC systems – database systems would just scream on them and they offered customers excellent price/performance. But, the same trend played out again. The higher-volume x86 processors from the client world swept the superior raw performing RISC systems aside.
Invariably what we see happening about once a decade is a high-volume, lower-priced technology takes over the low end of the market. When this happens many engineers correctly point out that these systems can’t hold a candle to the previous generation server technology and then incorrectly believe they won’t get replaced. The new generation is almost never better in absolute terms but they are better price/performers so they first are adopted for the less performance critical applications. Once this happens, the die is cast and the outcome is just about assured. The high-volume parts move up market and eventually take over even the most performance critical workloads of the previous generation. We see this same scenario play out roughly once a decade.
· Not CPU bound: Most discussion in our industry centers on the more demanding server workloads like databases but, in reality, many workloads are not pushing CPU limits and are instead storage, networking, or memory bound. There are two major classes of workloads that don’t need or can’t fully utilize more CPU:
1. Some workloads simply do not require the highest performing CPUs to achieve their SLAs. You can pay more and buy a higher performing processor but it will achieve little for these applications. Some workloads just don’t require more CPU performance to meet their goals.
2. This second class of workloads is characterized by being blocked on networking, storage, or memory. And by memory bound I don’t mean the memory is too small. In this case it isn’t the size of the memory that is the problem, but the bandwidth. The processor looks to be fully utilized from an operating system perspective but the bulk of its cycles are spent waiting for memory. Disk and network bound systems are easy to detect by looking for the resource that is running close to 100% utilization while the CPU load is much lower. Memory bound is more challenging to detect but it’s super common so it’s worth talking about. Most server processors are super-scalar, which is to say they can retire multiple instructions each cycle. On many workloads, less than 1 instruction is retired each cycle (you can see this by monitoring instructions per cycle) because the processor is waiting for memory transfers.
If a workload is bound on network, storage, or memory, spending more on a faster CPU will not deliver results. The same is true for non-demanding workloads. They too are not bound on CPU so a faster part won’t help in this case either.
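The detection logic described above can be sketched as a small classifier. This is a minimal Python illustration and not a real profiling tool: the 90% "busy" threshold and the IPC floor of 1.0 are assumptions chosen to mirror the text, and the function names are hypothetical. The instruction and cycle counts would come from hardware performance counters (on Linux, `perf stat` reports both), and the utilization figures from ordinary OS monitoring.

```python
def instructions_per_cycle(instructions_retired, cpu_cycles):
    """IPC from two hardware counter readings taken over the same interval."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return instructions_retired / cpu_cycles


def likely_bottleneck(cpu_util, disk_util, net_util, ipc, busy=0.90, low_ipc=1.0):
    """Rough guess at what a workload is bound on (utilizations are 0.0 to 1.0)."""
    # Disk or network pegged while the CPU has headroom: the easy cases.
    if disk_util >= busy and cpu_util < busy:
        return "storage"
    if net_util >= busy and cpu_util < busy:
        return "network"
    # CPU looks fully utilized: low IPC suggests it is mostly waiting on memory.
    if cpu_util >= busy:
        return "memory (suspected)" if ipc < low_ipc else "cpu"
    return "not resource bound"


# Hypothetical counter readings: the CPU looks busy but retires 0.5 instructions per cycle.
ipc = instructions_per_cycle(4.0e11, 8.0e11)
print(ipc, likely_bottleneck(cpu_util=0.97, disk_util=0.20, net_util=0.10, ipc=ipc))
```

Only the last case, a busy CPU with healthy IPC, is one where a faster processor is likely to help.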
· Price/performance: Device price/performance is far better than current generation server CPUs. Because there is less competition in server processors, prices are far higher and price/performance is relatively low compared to the device world. Using server parts, performance is excellent but price is not.
Let’s use an example again: A server CPU is hundreds of dollars, sometimes approaching $1,000, whereas the ARM processor in an iPhone comes in at just under $15. My general rule of thumb in comparing ARM processors with server CPUs is they are capable of ¼ the processing rate at roughly 1/10th the cost. And, super important, the massive shipping volume of the ARM ecosystem feeds innovation and competition, and the performance gap shrinks with each processor generation. Each generational improvement captures more possible server workloads while further improving price/performance.
· Power/performance: Most modern servers run over 200W, and many are well over 500W, while microservers can weigh in at 10 to 20W. Nowhere is power/performance more important than in portable devices, so the pace of power/performance innovation in the ARM world is incredibly strong. In fact, I’ve long used mobile devices as a window into future innovations coming to the server market. The technologies you see in the current generation of cell phones have a very high probability of being used in a future server CPU generation.
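The price/performance rule of thumb above (¼ the processing rate at roughly 1/10th the cost) is worth working through. The Python sketch below uses only those two ratios from the text. The scale-out calculation assumes the workload divides perfectly across several smaller processors and ignores the cost of the rest of each server, which is a simplification on my part.

```python
import math

ARM_RELATIVE_PERF = 0.25   # rule of thumb from the text: 1/4 the processing rate
ARM_RELATIVE_COST = 0.10   # rule of thumb from the text: roughly 1/10th the cost


def price_performance_advantage(relative_perf, relative_cost):
    """Work delivered per dollar relative to the server CPU (1.0 means parity)."""
    return relative_perf / relative_cost


def scale_out_cost(relative_perf, relative_cost):
    """Parts needed to match one server CPU and their combined relative cost.

    Assumes perfect scale-out, which real workloads only approximate.
    """
    parts = math.ceil(1.0 / relative_perf)
    return parts, parts * relative_cost


advantage = price_performance_advantage(ARM_RELATIVE_PERF, ARM_RELATIVE_COST)
parts, cost = scale_out_cost(ARM_RELATIVE_PERF, ARM_RELATIVE_COST)

print(f"price/performance advantage: {advantage:.1f}x")
print(f"{parts} parts match one server CPU at {cost:.0%} of its price")
```

Under these assumptions the low-cost part delivers about 2.5x the work per CPU dollar, which is why it wins on workloads that are not bound on single-thread performance.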
This is not the first ARM based server processor that has been announced. And, even more announcements are coming over the next year. In fact, that is one of the strengths of the ARM ecosystem. The R&D investments can be leveraged over huge shipping volume from many producers to bring more competition, lower costs, more choice, and a faster pace of innovation.
This is a good day for customers, a good day for the server ecosystem, and I’m excited to see AMD help drive the next phase in the evolution of the ARM Server market. The pace of innovation continues to accelerate industry-wide and it’s going to be an exciting rest of the decade.
Past notes on Microservers:
When I come across interesting innovations or designs notably different from the norm, I love to dig in and learn the details. More often than not I post them here. Earlier this week, Google posted a number of pictures taken from their datacenters (Google Data Center Tech). The pictures are beautiful and of interest to just about anyone, somewhat more interesting to those working in technology, and worthy of detailed study for those working in datacenter design. My general rule with Google has always been that anything they show publicly is always at least one generation old and typically more. Nonetheless, the Google team does good work and the older designs are still worth understanding, so I always have a look.
Some examples of older but interesting Google data center technology:
· Efficient Data Center Summit
· Rough Notes: Data Center Efficiency Summit
· Rough notes: Data Center Efficiency Summit (posting #3)
· 2011 European Data Center Summit
The set of pictures posted last week (Google Data Center Tech) is a bit unusual in that they are showing current pictures of current facilities running their latest work. What was published was only pictures without explanatory detail but, as the old cliché says, a picture is worth a thousand words. I found the mechanical design to be most notable so I’ll dig into that area a bit but let’s start with showing a conventional datacenter mechanical design as a foil against which to compare the Google approach.
The conventional design has numerous issues, the most obvious being that any design that is 40 years old could probably use some innovation. Notable problems with the conventional design: 1) no hot aisle/cold aisle containment so there is air leakage and mixing of hot and cold air, 2) air is moved long distances between the Computer Room Air Handlers (CRAHs) and the servers and air is an expensive fluid to move, and 3) it’s a closed system and hot air is recirculated after cooling rather than released outside with fresh air brought in and cooled if needed.
An example of an excellent design that does a modern job of addressing most of these failings is the Facebook Prineville Oregon facility:
I’m a big fan of the Facebook facility. In this design they eliminate the chilled water system entirely, have no chillers (expensive to buy and power), have full hot aisle isolation, use outside air with evaporative cooling, and treat the entire building as a giant, high-efficiency air duct. More detail on the Facebook design at: Open Compute Mechanical System Design.
Let’s have a look at the Google Council Bluffs Iowa Facility:
You can see that they have chosen a very large, single room approach rather than sub-dividing up into pods. As with any good, modern facility they have hot aisle containment which just about completely eliminates leakage of air around the servers or over the racks. All chilled air passes through the servers and none of the hot air leaks back prior to passing through the heat exchanger. Air containment is a very important efficiency gain and the single largest gain after air-side economization. Air-side economization is the use of outside air rather than taking hot server exhaust and cooling it to the desired inlet temperature (see the diagram above showing the Facebook use of full building ducting with air-side economization).
From the Council Bluffs picture, we see Google has taken a completely different approach. Rather than completely eliminate the chilled water system and use the entire building as an air duct, they have kept the piped water cooling system and focused on making it as efficient as possible and exploiting some of the advantages of water based systems. This shot from the Google Hamina Finland facility shows the multi-coil heat exchanger at the top of the hot aisle containment system.
From inside the hot aisle, in this picture from the Mayes County data center, we can see the water is brought up from below the floor in the hot aisle using steel braided flexible chilled water hoses. These hoses bring cool water up to the top-of-hot-aisle heat exchangers that cool the server exhaust air before it is released above the racks of servers.
One of the key advantages of water cooling is that water is a cheaper fluid to move than air for a given thermal capacity. In the Google design, they exploit this fact by bringing water all the way to the rack. This isn’t an industry first but it is nicely executed in the Google design. IBM iDataPlex brought water directly to the back of the rack and many high power density HPC systems have done this as well.
I don’t see the value of the short stacks above the heat exchangers. I would think that any gain in air acceleration through the smoke stack effect would be dwarfed by the losses of having the passive air stacks as restrictions over the heat exchangers.
Bringing water directly to the rack is efficient but I still somewhat prefer air-side economization systems. Any system that can reject hot air outside and bring in outside air for cooling (if needed) for delivery to the servers is tough to beat (see the diagram at the top for an example approach). However, as server density climbs, we will eventually get to power densities sufficiently high that water is needed, either very near the server as Google has done or as direct water cooling as used by IBM mainframes in the 80s (thermal conduction module). One very nice contemporary direct liquid cooling system is the work by Green Revolution Cooling where they completely immerse otherwise unmodified servers in a bath of chilled oil.
Hats off to Google for publishing a very informative set of data center pictures. The pictures are well done and the engineering is very nice. Good work!
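To put a rough number on why water is the cheaper fluid to move, the Python sketch below compares the volume of air and water needed to carry away the same heat using Q = ρ · cp · V · ΔT. The fluid properties are approximate room-temperature textbook values. The 10kW rack and the 10K temperature rise are hypothetical figures picked for illustration and are not taken from the Google design. Actual fan and pump power also depend on pressure drop, which is not modeled here.

```python
# Approximate room-temperature properties: density in kg/m^3, specific heat in J/(kg*K).
WATER = {"density": 997.0, "cp": 4186.0}
AIR = {"density": 1.2, "cp": 1005.0}


def volumetric_flow_m3_per_s(heat_w, delta_t_k, fluid):
    """Volume flow needed to carry heat_w with a delta_t_k temperature rise."""
    return heat_w / (fluid["density"] * fluid["cp"] * delta_t_k)


RACK_HEAT_W = 10_000.0   # hypothetical 10kW rack
DELTA_T_K = 10.0         # hypothetical temperature rise across the heat exchanger

air_flow = volumetric_flow_m3_per_s(RACK_HEAT_W, DELTA_T_K, AIR)
water_flow = volumetric_flow_m3_per_s(RACK_HEAT_W, DELTA_T_K, WATER)

print(f"air:   {air_flow:.3f} m^3/s")
print(f"water: {water_flow * 1000:.3f} L/s")
print(f"air needs roughly {air_flow / water_flow:.0f}x the volume")
```

The ratio comes out in the low thousands, which is the underlying reason a small hose to the rack can replace a very large volume of moving air.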
· Here’s a very cool Google Street view based tour of the Google Lenoir NC Datacenter.
· The detailed pictures released last week: Google Data Center Photo Album
Last night, Tom Klienpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan in that the executive summary runs 86 pages in length. Overall, it’s an interesting document but I only got as far as the first page before starting to feel disappointed. What I was hoping for is a deep dive into why the reactors failed, the root causes of the failures, and what can be done to rectify them.
Because of the nature of my job, I’ve spent considerable time investigating hardware and software system failures and what I find most difficult and really time consuming is getting to the real details. It’s easy to say there was a tsunami and it damaged the reactor complex and loss of power caused radiation release. But why did loss of power cause radiation release? Why didn’t the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addressed. Good post mortems are detailed, get to the root cause, and it’s rare that a detailed investigation of any complex system doesn’t yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root cause both technical and operational, and making detailed recommendations.
On the second page of this report, the committee members were enumerated. The committee includes 1) a seismologist, 2) 2 medical doctors, 3) a chemist, 4) a journalist, 5) 2 lawyers, 6) a social system designer, 7) one politician, and 8) no nuclear scientists, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami were clearly the seed for the event but since we can’t prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why both the reactor and nuclear material storage designs were not stable in the presence of cooling system failure. It's weird that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically we can’t stop earthquakes and tsunamis so we need to ensure that systems remain safe in the presence of them.
Obviously the investigative team is very qualified to deal with the follow-on events: assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, it feels like the core problem is that cooling system flow was lost and both the reactors and nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.
Personally, were it my investigation, the largest part of my interest would be focused on achieving designs that are stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea but still less important than ensuring these reactors and others in the country fail into a safe state.
The report reads like a political document. It’s heavy on blame, light on root cause and the technical details of the root cause failure, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn't go after the core issues that led to the nuclear release. From my perspective, the first key issue is 1) scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full load operational state without external input of power, cooling water, or administrative input.
The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time. They can't depend upon pumped water cooling and have to be 100% passive and stable for long periods without tending.
My third recommendation is arguably less important than my first two but applies to all systems: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …), detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.
The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:
Monitoring of the nuclear regulatory body by the National Diet
A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:
1. To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.
2. To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.
3. To continue investigations on other relevant issues.
4. To make regular reports on their activities and the implementation of their recommendations.
Reform the crisis management system
A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:
1. A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.
2. National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.
3. The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.
Government responsibility for public health and welfare
Regarding the responsibility to protect public health, the following must be implemented as soon as possible:
1. A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.
2. Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.
3. The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options.
Monitoring the operators
TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.
1. The government must set rules and disclose information regarding its relationship with the operators.
2. Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.
3. TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.
4. All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.
Criteria for the new regulatory body
The new regulatory organization must adhere to the following conditions. It must be:
1. Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.
2. Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.
3. Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.
4. Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.
5. Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.
Reforming laws related to nuclear energy
Laws concerning nuclear issues must be thoroughly reformed.
1. Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.
2. The roles for operators and all government agencies involved in emergency response activities must be clearly defined.
3. Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.
4. New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.
Develop a system of independent investigation commissions
A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.
Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:
1. Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.
2. All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.
3. The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
4. Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.
The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at: http://naiic.go.jp/wp-content/uploads/2012/07/NAIIC_report_lo_res2.pdf.
Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Cooling is the largest single non-IT (overhead) load in a modern datacenter. There are many innovative approaches to reducing the power losses in cooling systems. Many of these mechanical system innovations work well and others have great potential, but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And the higher the target inlet temperature, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.
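To make the economization point concrete, here is a minimal sketch of how the usable fraction of hours grows with the inlet target. The function name, the `approach_c` parameter (an assumed temperature rise between outside air and the server inlet), and the temperature samples are all hypothetical and for illustration only; a real analysis would use measured weather data and would also need to account for humidity.

```python
def economizer_fraction(outside_temps_c, inlet_target_c, approach_c=0.0):
    """Fraction of samples where outside air alone can meet the inlet target.

    approach_c is an assumed temperature rise between the outside air and
    the server inlet. It is a hypothetical parameter, not from any source.
    """
    if not outside_temps_c:
        raise ValueError("need at least one temperature sample")
    usable = sum(1 for t in outside_temps_c if t + approach_c <= inlet_target_c)
    return usable / len(outside_temps_c)


# Hypothetical outside-air temperatures in C. These are NOT measured data.
sample_hours_c = [12, 15, 18, 21, 24, 27, 30, 33, 29, 22, 17, 13]

for target_c in (25, 30, 35):
    fraction = economizer_fraction(sample_hours_c, target_c)
    print(f"inlet target {target_c}C: {fraction:.0%} of samples on outside air")
```

With these made-up samples, each step up in the inlet target pulls more hours under the threshold, which is the whole argument for running hotter.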
The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speeds, which increase the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are inarguably present and measurable, they have a very small impact at even quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision, and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.
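Fan speed matters so much because of how fan power scales. As a rough sketch, the idealized fan affinity law says power grows with the cube of speed. The function below assumes that law holds exactly, which real server fans only approximate, and its name is mine rather than anything from a vendor tool.

```python
def fan_power_ratio(speed_ratio):
    """Relative fan power for a relative fan speed, per the idealized
    fan affinity law (power scales with the cube of speed).

    This is an approximation. Real fans and chassis deviate from it.
    """
    if speed_ratio < 0:
        raise ValueError("speed_ratio must be non-negative")
    return speed_ratio ** 3


for speed in (1.0, 1.1, 1.2, 1.5):
    extra = fan_power_ratio(speed) - 1.0
    print(f"{speed:.1f}x speed -> {extra:+.0%} fan power")
```

Under this assumption, running fans 20% faster costs roughly 73% more fan power, which is why a design that holds fan speed steady at a higher inlet temperature is worth the effort.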
The net of these factors is that fear of higher server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often-quoted study reports that the failure rate of electronics doubles with every 10C increase in temperature (MIL-HDBK-217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent, but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a very simple model relating failure rate to heat.
A recent paper does an excellent job of methodically digging through the possible issues of high datacenter temperature and investigating each concern. I like Temperature Management in Data Centers: Why Some (Might) Like it Hot for two reasons: 1) it unemotionally works through the key issues and concerns, and 2) it draws from a sample of 7 production data centers at Google, so the results are credible and from a substantial sample.
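The doubling rule of thumb is easy to write down, and doing so shows why it is so frightening. This is a sketch of the rule as commonly quoted, not of the handbook’s actual models; the function and parameter names are mine.

```python
def failure_rate_multiplier(delta_c, doubling_interval_c=10.0):
    """Failure-rate multiplier implied by the 'doubles every 10C' rule.

    delta_c is the temperature increase in degrees C. Because the rate
    doubles over each fixed interval, the multiplier compounds.
    """
    return 2.0 ** (delta_c / doubling_interval_c)


for delta_c in (0, 5, 10, 20, 30):
    print(f"+{delta_c}C -> {failure_rate_multiplier(delta_c):.2f}x failure rate")
```

Taken at face value, a 20C increase implies 4x the failures and a 30C increase implies 8x. If the rule held in practice, nobody would raise inlet temperatures at all.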
From the introduction:
Interestingly, one key aspect in the thermal management of a data center is still not very well understood: controlling the setpoint temperature at which to run a data center’s cooling system. Data centers typically operate in a temperature range between 20C and 22C, some are as cold as 13C degrees [8, 29]. Due to lack of scientific data, these values are often chosen based on equipment manufacturers’ (conservative) suggestions. Some estimate that increasing the setpoint temperature by just one degree can reduce energy consumption by 2 to 5 percent [8, 9]. Microsoft reports that raising the temperature by two to four degrees in one of its Silicon Valley data centers saved $250,000 in annual energy costs. Google and Facebook have also been considering increasing the temperature in their data centers.
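For a feel of what the 2 to 5 percent per degree estimate could mean over several degrees, here is a small sketch. The quoted passage does not say how the savings combine across multiple degrees, so the compounding used below is my assumption, and the function name is hypothetical.

```python
def savings_range(degrees, low_per_degree=0.02, high_per_degree=0.05):
    """Return (low, high) fractional energy savings for a setpoint increase.

    Assumes each additional degree saves the same fraction of the energy
    still being consumed (compounding). That is an assumption, not
    something the quoted estimate states.
    """
    low = 1.0 - (1.0 - low_per_degree) ** degrees
    high = 1.0 - (1.0 - high_per_degree) ** degrees
    return low, high


for degrees in (1, 2, 4):
    low, high = savings_range(degrees)
    print(f"+{degrees} degrees: {low:.1%} to {high:.1%} estimated savings")
```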
The authors go on to observe that “the details of how increased data center temperatures will affect hardware reliability are not well understood and existing evidence is contradictory.” The remainder of the paper presents the data as measured in the 7 production datacenters under study and concludes each section with an observation. I encourage you to read the paper and I’ll cover just the observations here:
Observation 1: For the temperature range that our data covers with statistical significance (< 50C), the prevalence of latent sector errors increases much more slowly with temperature, than reliability models suggest. Half of our model/data center pairs show no evidence of an increase, while for the others the increase is linear rather than exponential.
Observation 2: The variability in temperature tends to have a more pronounced and consistent effect on Latent Sector Error rates than mere average temperature.
Observation 3: Higher temperatures do not increase the expected number of Latent Sector Errors (LSEs) once a drive develops LSEs, possibly indicating that the mechanisms that cause LSEs are the same under high or low temperatures.
Observation 4: Within a range of 0-36 months, older drives are not more likely to develop Latent Sector Errors under temperature than younger drives.
Observation 5: High utilization does not increase Latent Sector Error rates under temperatures.
Observation 6: For temperatures below 50C, disk failure rates grow more slowly with temperature than common models predict. The increase tends to be linear rather than exponential, and the expected increase in failure rates for each degree increase in temperature is small compared to the magnitude of existing failure rates.
Observation 7: Neither utilization nor the age of a drive significantly affect drive failure rates as a function of temperature.
Observation 8: We do not observe evidence for increasing rates of uncorrectable DRAM errors, DRAM DIMM replacements or node outages caused by DRAM problems as a function of temperature (within the range of temperature our data comprises).
Observation 9: We observe no evidence that hotter nodes have a higher rate of node outages, node downtime or hardware replacements than colder nodes.
Observation 10: We find that high variability in temperature seems to have a stronger effect on node reliability than average temperature.
Observation 11: As ambient temperature increases, the resulting increase in power is significant and can be mostly attributed to fan power. In comparison, leakage power is negligible.
Observation 12: Smart control of server fan speeds is imperative to run data centers hotter. A significant fraction of the observed increase in power dissipation in our experiments could likely be avoided by more sophisticated algorithms controlling the fan speeds.
Observation 13: The degree of temperature variation across the nodes in a data center is surprisingly similar for all data centers in our study. The hottest 5% nodes tend to be more than 5C hotter than the typical node, while the hottest 1% nodes tend to be more than 8–10C hotter.
The paper under discussion: http://www.cs.toronto.edu/~nosayba/temperature_cam.pdf.
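Several of the observations turn on temperature variability and on how much hotter the hottest nodes run than a typical one. Both are simple to compute from node temperature readings. The sketch below uses the coefficient of variation as one common variability measure and the median as the "typical" node; both choices, the function names, and all of the sample readings are my assumptions rather than the paper’s method or data.

```python
import statistics


def coefficient_of_variation(temps_c):
    """Population standard deviation divided by the mean.

    Only meaningful here because the readings are well above 0C.
    """
    return statistics.pstdev(temps_c) / statistics.mean(temps_c)


def hot_node_excess(node_temps_c, top_fraction):
    """Degrees by which the coolest node within the hottest `top_fraction`
    of nodes exceeds the median node."""
    ordered = sorted(node_temps_c)
    k = max(1, round(len(ordered) * top_fraction))
    return ordered[-k] - statistics.median(ordered)


# Hypothetical readings: same average temperature, different variability.
steady_c = [30, 30, 30, 30]
swinging_c = [20, 40, 20, 40]
print(coefficient_of_variation(steady_c), coefficient_of_variation(swinging_c))

# Hypothetical fleet of 100 nodes. These are NOT measured data.
fleet_c = [25] * 90 + [28] * 5 + [31] * 4 + [36]
print(hot_node_excess(fleet_c, 0.05), hot_node_excess(fleet_c, 0.01))
```

The two made-up series share a 30C average, so a monitoring system that only tracks the mean would see them as identical. If variability really is the stronger predictor, it needs to be measured directly.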
Other notes on increased data center temperatures:
· Exploring the Limits of Datacenter Temperature
· Chillerless Data Center at 95F
· Computer Room Evaporative Cooling
· Next Point of Server Differentiation: Efficiency at Very High Temperature
· Open Compute Mechanical System Design
· Example of Efficient Mechanical Design
· Innovative Datacenter Design: Ishikari Datacenter