There is a growing gap between memory bandwidth and CPU power and this growing gap makes low power servers both more practical and more efficient than current designs. Per-socket processor performance continues to increase much more rapidly than memory bandwidth and this trend applies across the application spectrum from mobile devices, through client, to servers. Essentially we are getting more compute than we have memory bandwidth to feed.
We can attempt to address this problem in two ways: 1) more memory bandwidth and 2) slower processors. The former will be used, and Intel Nehalem is a good example, but costs increase non-linearly so the effectiveness of this technique will be bounded. The second technique has great promise to reduce both cost and power consumption.
For more detail on this trend:
· The Case for Low-Cost, Low-Power Servers
· 2010 the Year of the MicroSlice Servers
· Linux/Apache on ARM Processors
· ARM Cortex-A9 SMP Design Announced
This morning GigaOM reported that SeaMicro has just obtained a $9.3M Department of Energy grant to improve data center efficiency (SeaMicro’s Secret Server Changes Computing Economics). SeaMicro is a Santa Clara based start-up that is building a 512 processor server based upon Intel Atom. Also mentioned was Smooth Stone, which is designing a high-scale server based upon ARM processors. ARM processors are incredibly power efficient, commonly used in embedded devices, and by far the most common processor used in cell phones.
Over the past year I’ve met with both Smooth Stone and SeaMicro frequently and it’s great to see more information about both available broadly. The very low power server trend is real and advancing quickly. When purchasing servers, it needs to be all about work done per dollar and work done per joule.
Congratulations to SeaMicro on the DoE grant.
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
The networking world remains one of the last bastions of the mainframe computing design point. Back in 1987 Garth Gibson, Dave Patterson, and Randy Katz showed we could aggregate low-cost, low-quality commodity disks into storage subsystems far more reliable and much less expensive than the best purpose-built storage subsystems (Redundant Array of Inexpensive Disks). The lesson played out yet again where we learned that large aggregations of low-cost, low-quality commodity servers are far more reliable and less expensive than the best purpose-built scale up servers. However, this logic has not yet played out in the networking world.
The networking equipment world looks just like the mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are single sourced and vertically integrated. Just as you couldn’t run IBM MVS on a Burroughs computer, you can’t run Cisco IOS on Juniper equipment.

When networking gear is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others. Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.
Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges. In the networking world, we have a long way to go but small steps are being made. Broadcom, Fulcrum, Marvell, Dune (recently purchased by Broadcom), Fujitsu and others all produce ASICs (the data plane CPU of the networking world). These ASICs are available for any hardware designer to pick up and use. Unfortunately, there is no standardization and hardware designs based upon one part can’t easily be adapted to use another.
In the X86 world, the combination of the X86 ISA, hardware platform, and the BIOS forms a de facto standard interface. Any server supporting this low level interface can host the wide variety of different Linux systems, Windows, and many embedded O/Ss. The existence of this layer allows software innovation above and encourages nearly unconstrained hardware innovation below. New hardware designs work with existing software. New software extensions and enhancements work with all the existing hardware platforms. Hardware producers get a wider variety of good quality operating systems. Operating systems authors get a broad install base of existing hardware to target. Both get bigger effective markets. High volumes encourage greater investment and drive down costs.
This standardized layer hasn’t existed in the networking ecosystem as it has in the commodity server world. As a consequence, we don’t have high quality networking stacks able to run across a wide variety of networking devices. A potential solution is near: OpenFlow. This work originated in the Stanford networking team and is driven by Nick McKeown. It is a low-level, hardware-independent interface for updating network routing tables. It is sufficiently rich to support current routing protocols and it can also support research protocols optimized for high-scale data center networking systems such as VL2 and PortLand. Current OpenFlow implementations exist on X86 hardware running Linux, Broadcom, NEC, NetFPGA, Toroki, and many others.
The ingredients of an open stack are coming together. We have merchant silicon ASICs from Broadcom, Fulcrum, Dune and others. We have commodity, high-radix routers available from Broadcom (shipped by many competing OEMs), Arista, and others. We have the beginnings of industry momentum behind OpenFlow which has a very good chance of being that low level networking interface we need. A broadly available, low-level interface may allow a high-quality, open source networking stack to emerge. I see the beginnings of the right thing happening.
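To make the idea of a hardware-independent interface for updating forwarding state a little more concrete, here is a toy match/action flow table. This is only an illustrative model and not the OpenFlow wire protocol or any real controller API: the `FlowEntry` and `FlowTable` names, the field names, and the action strings are all made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FlowEntry:
    match: dict        # header field -> required value; absent fields act as wildcards
    action: str        # illustrative action string, e.g. "output:2" or "drop"
    priority: int = 0  # higher priority entries are checked first


@dataclass
class FlowTable:
    entries: list = field(default_factory=list)

    def add_flow(self, entry):
        """Install an entry, keeping the table ordered by descending priority."""
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.priority, reverse=True)

    def lookup(self, packet):
        """Return the action of the first entry whose match fields all agree."""
        for entry in self.entries:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.action
        # No entry matched: hand the packet to the control plane.
        return "send_to_controller"


table = FlowTable()
table.add_flow(FlowEntry({"ip_dst": "10.0.0.5"}, "output:2", priority=10))
table.add_flow(FlowEntry({"ip_dst": "10.0.0.5", "tcp_dst": 23}, "drop", priority=20))

print(table.lookup({"ip_dst": "10.0.0.5", "tcp_dst": 80}))
print(table.lookup({"ip_dst": "10.0.0.5", "tcp_dst": 23}))
print(table.lookup({"ip_dst": "10.0.0.9", "tcp_dst": 80}))
```

The point of the sketch is the separation of concerns: whatever software decides the entries (a routing protocol or a research protocol) only needs the `add_flow` style interface, and the device underneath is free to implement the lookup however it likes.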
· OpenFlow web site: http://www.openflowswitch.org/
· OpenFlow paper: Enabling Innovation in Campus Networks
· My Stanford Clean Slate Talk Slides: DC Networks are in my way
For several years I’ve been interested in PUE<1.0 as a rallying cry for the industry around increased efficiency. From PUE and Total Power Usage Efficiency (tPUE) where I talked about PUE<1.0:
In the Green Grid document [Green Grid Data Center Power Efficiency Metrics: PUE and DCiE], it says that “the PUE can range from 1.0 to infinity” and goes on to say “… a PUE value approaching 1.0 would indicate 100% efficiency (i.e. all power used by IT equipment only)”. In practice, this is approximately true. But PUEs better than 1.0 are absolutely possible and even a good idea. Let’s use an example to better understand this. I’ll use a 1.2 PUE facility in this case. Some facilities are already exceeding this PUE and there is no controversy on whether it’s achievable.
Our example 1.2 PUE facility is dissipating 16% of the total facility power in power distribution and cooling. Some of this heat may be in transformers outside the building but we know for sure that all the servers are inside which is to say that at least 83% of the dissipated heat will be inside the shell. Let’s assume that we can recover 30% of this heat and use it for commercial gain. For example, we might use the waste heat to warm crops and allow tomatoes or other high value crops to be grown in climates that would not normally favor them. Or we can use the heat as part of the process to grow algae for bio-diesel. If we can transport this low grade heat and net only 30% of the original value, we can achieve a 0.90 PUE. That is to say if we are only 30% effective at monetizing the low-grade waste heat, we can achieve a better than 1.0 PUE.
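The arithmetic in this example can be checked with a short sketch. It assumes one particular accounting convention, namely that the recovered energy is credited directly against total facility power, and it normalizes the IT load to 1.0. The function name is just for illustration.

```python
def pue_with_heat_recovery(pue, recovered_fraction):
    """Effective PUE when a fraction of the IT heat is recovered and credited.

    Power is normalized so the IT load is 1.0, which makes total facility
    power equal to `pue`. The recovered energy is subtracted from the total
    (an accounting assumption, not a measured result).
    """
    it_load = 1.0
    total_power = pue * it_load
    credit = recovered_fraction * it_load
    return (total_power - credit) / it_load


overhead_share = (1.2 - 1.0) / 1.2  # share of facility power lost in distribution and cooling
it_share = 1.0 / 1.2                # share of facility power that reaches the IT load

print(f"overhead share of facility power: {overhead_share:.1%}")
print(f"IT share of facility power:       {it_share:.1%}")
print(f"effective PUE at 30% recovery:    {pue_with_heat_recovery(1.2, 0.30):.2f}")
```

Under this convention, any recovery fraction above 20% of the IT load pushes the 1.2 PUE facility below 1.0.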
PUEs of less than 1.0 are possible and I would love to rally the industry around achieving a less than 1.0 PUE. In the database world years ago, we rallied around achieving 1,000 transactions per second. The High Performance Transactions Systems conference was originally conceived with a goal of achieving this (at the time) incredible result. 1,000 TPS was eclipsed decades ago but HPTS remains a fantastic conference. We need to do the same with PUE and aim to get below 1.0 before 2015. A PUE less than 1.0 is hard but it can and will be done.
So, a PUE of less than 1.0 is totally possible but doing it efficiently and economically has proven elusive so far. The challenge is finding a process that can make use of the very low grade heat produced by data centers and turn it into an economic gain that exceeds the combined capital and operational expense of recovering that energy.
In the posting Is Sandia National Lab's Red Sky Really Able to Deliver a PUE of 1.035?, I pointed to an innovative sewage waste heat reclamation system in Norway: Flush the loo, warm your house. In this system, heat pumps are used to reclaim waste heat from sewage and convert it to home heat.
Another possible application of waste heat is heating greenhouses to allow the growth of valuable crops in adverse climates. See Vertical Farming for the most radical extension of these ideas. Another possible approach is to produce biodiesel from microbes and use the low grade heat as a heat source for the culture. See A Better Biofuel for an example of this approach.
Yesterday, I came across an interesting application of waste heat reclamation from datacenters from Helsingin Energia (Helsinki public energy company).

In this proof of concept datacenter that will come on line next month, they have a conventional datacenter water cooling design but rather than releasing the waste heat to the atmosphere via a cooling tower or related technique, they run it through a heat pump to add heat to a heating loop to heat homes in the Finnish capital. The data center is located in an unused bomb shelter.
In a conversation I had earlier today, the project manager Sipilia Juha said:
We provide facilities for datacenter operators including underground property, electricity and cooling. We can capture almost 100% of the heat that comes out of the datacenter and put it in to the district heating system to heat buildings in Helsinki. Our customers make the detailed planning inside the premises and bring their own IT-equipment.
The cooling costs for the customer range from 7€ to 20€ per MWh depending on the size of the center and the time of year. We can do it very ecologically and economically.
Computerworld also talked to Juha: Green Data Center Recycles Waste Heat.
I’ve been unable to get the details on the capital cost, the operational costs and the estimated cost recovery time and model used. The facility won’t be live until January so, even with good cost models, they wouldn’t yet be calibrated by real operational experience.
They are aiming for a PUE of around 1.0 and it’s quite conceivable they will get there:
The energy efficiency of computer halls is quantified by the so-called efficiency factor which expresses the ratio of the total energy consumption and the energy used for actual computing. The efficiency factor of ordinary computer halls is between 1.5 and 2, with the figure for computer halls deemed to be extremely ecoefficient possibly under 1.5. The efficiency factor of Academica's and Helsingin Energia's hall is around one, and it is possible to get even below this figure.
The next test is to see if this level of efficiency can be achieved in an economically positive way or at least without loss. It’s an interesting project. I’ll continue to watch this and similar proof of concept facilities closely.
A brochure from Helsingin Energia is at: Hel_En_Eco-efficient_computer_hall.pdf (1.93 MB).
--jrh
Very low-power scale-out servers -- it’s an idea whose time has come. A few weeks ago Intel announced it was doing Microslice servers: Intel Seeks new ‘microserver’ standard. Rackable Systems (I may never manage to start calling them ‘SGI’ – remember the old MIPS-based workstation company?) was on this idea even earlier: Microslice Servers. The Dell Data Center Solutions team has been on a similar path: Server under 30W.
Rackable has been talking about very low power servers as physicalization: When less is more: the basics of physicalization. Essentially they are arguing that rather than buying more-expensive scale-up servers and then virtualizing the workload onto those fewer servers, you should buy many smaller servers. This saves the virtualization tax, which can run 15% to 50% in I/O intensive applications, and smaller, low-scale servers can produce more work done per joule and better work done per dollar. I’ve been a believer in this approach for years and wrote it up for the Conference on Innovative Data Systems Research last year in The Case for Low-Cost, Low-Power Servers.
I’ve recently been very interested in the application of ARM processors to web-server workloads:
· Linux/Apache on ARM Processors
· ARM Cortex-A9 SMP Design Announced
ARMs are an even more radical application of the Microslice approach.
Scale-down servers easily win on many workloads when looking at work done per dollar and work done per joule and I claim, if you are looking at single dimensional metrics, like performance, you aren’t looking hard enough. However, there are workloads where scale-up wins. They are absolutely required when the workload won’t partition and scale near linearly. Database workloads are classic examples of partition-resistant workloads that really do often run better on more-expensive, scale-up servers.
The other limit is administration. Non-automated IT shops believe they are better off with fewer, more-expensive servers although they often achieve this goal by running many operating system images on a single server. Given that the bulk of administration is spent on the software stack, it’s not clear that this approach of running the same number of O/S images and software stacks on a single server is a substantial savings. However, I do agree that administration costs are important at low-scale. If, at high-scale, admin costs are over 10% of overall operational costs, go fix it rather than buying bigger, more expensive servers.
When do scale-up servers win economically? 1) very low-scale workloads where administration costs dominate, and 2) workloads that partition poorly and suffer highly sub-linear scale-out. Simple web workloads and other partition-tolerant applications should look to scale-down servers. Make sure your admin costs are sub-10% and don’t scale with server count. Then use work done per dollar and work done per joule and you’ll be amazed to see scale-down gets more done at lower cost and lower power consumption.
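As a minimal sketch of how the two metrics can be compared, the snippet below computes work done per dollar and work done per joule for two server options. Every number here is hypothetical and chosen only to show the calculation; none of it is measured data, and the `ServerOption` type is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ServerOption:
    name: str
    requests_per_sec: float  # sustained throughput on the target workload
    price_dollars: float     # purchase price
    power_watts: float       # power draw under that load

    def work_per_dollar(self):
        """Throughput bought per dollar of server cost."""
        return self.requests_per_sec / self.price_dollars

    def work_per_joule(self):
        """Requests served per joule (requests/sec divided by joules/sec)."""
        return self.requests_per_sec / self.power_watts


# Hypothetical figures for illustration only.
scale_up = ServerOption("scale-up", requests_per_sec=10_000, price_dollars=8_000, power_watts=400)
scale_down = ServerOption("scale-down", requests_per_sec=1_500, price_dollars=600, power_watts=30)

for s in (scale_up, scale_down):
    print(f"{s.name:10s} work/$: {s.work_per_dollar():.2f}  work/J: {s.work_per_joule():.1f}")
```

The comparison is only meaningful when the workload partitions well enough that several small servers really can stand in for one large one, which is exactly the caveat above.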
2010 is the year of the low-cost, scale-down server.
--jrh
Sometime back I whined that Power Usage Efficiency (PUE) is a seriously abused term: PUE and Total Power Usage Efficiency. But I continue to use it because it gives us a rough way to compare the efficiency of different data centers. It’s a simple metric that takes the total power delivered to a facility (total power) and divides it by the amount of power delivered to the servers (critical power or IT load). A PUE of 1.35 is very good today. Some datacenter owners have claimed to be as good as 1.2. Conventionally designed data centers operated conservatively are in the 1.6 to 1.7 range. Unfortunately most of the industry has a PUE of over 2.0, some are as bad as 3.0, and the EPA reports the industry average is 2.0 (Report to Congress on Server Data Center Efficiency). A PUE of 2.0 means that for each watt delivered to the IT load (servers, net gear, and storage), one watt is lost in cooling and in power distribution.
Whenever a metric becomes important, managers ask about it and marketing people use it. Eventually we start seeing data points that are impossibly good. The recent Red Sky installation is one of these events. Sandia National Lab’s Red Sky supercomputer is reported to be delivering a PUE of 1.035 in a system without waste heat recovery. In Red Sky at Night, Sandia’s New Computer Might it is reported “The power usage effectiveness of Red Sky is an almost unheard-of 1.035”. The video referenced below also reports Red Sky at a 1.035 PUE. In response to the claimed PUE of 1.035, Rich Miller of Data Center Knowledge astutely asked “How’s this possible?” (see Red Sky: Supercomputing and Efficiency Meet).
The Data Center Knowledge article links to a blog posting Building Red Sky by Marc Hamilton which includes a wonderful time lapse video showing the building of Red Sky: http://www.youtube.com/watch?v=mNW9cYY4tqc. You should watch the 4 min and 51 second video and I’ll include my notes and observations from the video below. But, before we get to the video, let’s look more closely at the widely reported 1.035 PUE and what it would mean.
A PUE of 1.035 implies that for each 1 watt delivered to the servers, 0.035 watts are lost in power distribution and mechanical systems. For a facility of this size, I suspect they will get delivered high voltage in the 115kV range. In a conventional power distribution design, they will take 115kV and transform it to mid-voltage (13kV range), then to 480V 3p, then to 208V to be delivered to the servers. In addition to all these conversions, there is some loss in the conductors themselves. And there is considerable loss in even the very best uninterruptible power supply (UPS) systems. In fact, a UPS alone with 3.5% loss is excellent. Excellent power distribution designs will avoid 1 or perhaps 2 of the conversions above and will use a full bypass UPS. But, getting these excellent power distribution designs to even within a factor of 2 of the reported 3.5% loss is incredibly difficult and I’m very skeptical that they are going to get much below 6% to 7%. In fact, if anyone knows how to get down below 6% loss in the power distribution system measured fully, I’m super interested and would love to see what you have done, buy you lunch, and do a datacenter tour.
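A rough sketch of how a distribution loss translates into a floor on PUE, before any cooling is counted. Two conventions are shown because they give slightly different answers: loss expressed relative to the power delivered to the servers (the convention used in the paragraph above) and loss expressed relative to the power entering the facility. The per-stage efficiencies in the example chain are hypothetical placeholders, not measurements of any real facility.

```python
def pue_floor_from_loss(loss, relative_to="it_load"):
    """Lowest PUE possible given only a power distribution loss.

    relative_to="it_load": loss is a fraction of the power delivered to servers.
    relative_to="input":   loss is a fraction of the power entering the facility.
    """
    if relative_to == "it_load":
        return 1.0 + loss
    if relative_to == "input":
        return 1.0 / (1.0 - loss)
    raise ValueError("relative_to must be 'it_load' or 'input'")


def chain_efficiency(stage_efficiencies):
    """Overall efficiency of conversion stages connected in series."""
    overall = 1.0
    for eff in stage_efficiencies:
        overall *= eff
    return overall


print(f"6% loss relative to IT load: PUE floor {pue_floor_from_loss(0.06):.3f}")
print(f"6% loss relative to input:   PUE floor {pue_floor_from_loss(0.06, 'input'):.3f}")

# Hypothetical per-stage efficiencies: 115kV->13kV, 13kV->480V, UPS, 480V->208V, conductors.
stages = [0.995, 0.99, 0.965, 0.98, 0.99]
overall = chain_efficiency(stages)
print(f"hypothetical chain efficiency: {overall:.3f}, PUE floor {1.0 / overall:.3f}")
```

Because the stages multiply, removing a conversion step or improving the UPS helps, but several individually good stages in series still add up to a meaningful loss.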
A 6% loss in power distribution would limit the PUE to nothing lower than 1.06. But, we still have the cooling system to account for. Air is an expensive fluid to move long distances. Consequently, Red Sky brings the water to the server racks using Sun Cooling Door Systems (similar to the IBM iDataPlex Rear Door Cooling system).
The Sun Cooling Door System is a nice design that will significantly improve PUE over more conventional CRAC-based mechanical designs. Generally, bringing water close to the heat load in systems that use water (rather than aggressive free-air only designs) is a good approach. The Sun advertising material credibly reports that “A highly efficient datacenter utilizing a holistic design for closely coupled cooling using Sun Cooling Door Systems can reach a PUE of 1.3”.
I know of no way to circulate air through a heat exchanger, pump water to the outside of the building, and then cool the water using any of the many technologies available that can be done at only a 3.5% loss. Which is to say that a PUE of 1.035 can’t be done with the Red Sky mechanical system design even if power distribution losses were ignored completely. I like Red Sky but suspect we’re looking at a 1.35 PUE system rather than the reported 1.035. But, that’s OK, 1.35 is quite good and, for a top 10 supercomputer, it’s GREAT.
Note that a PUE of 1.035 is technically possible with waste heat recovery and, in fact, even less than 1.0 can be achieved with waste heat recovery. See the “PUE less than 1.0” section of PUE and Total Power Usage Efficiency for more data on waste heat recovery. Remember this is “technically possible” rather than achieved in production today. It’s certainly possible to do today but doing it cost effectively is the challenge. I have seen it applied to related domains that also have large quantities of low grade heat. For example, a city in Norway is experimenting with waste heat recovery from sewage: Flush the loo, warm your house.
My notes from the Red Sky Video follow:
· 47,232 cores of Intel EM64T Xeon X55xx (Nehalem-EP) 2930 MHz (11.72 GFlops)
o 553 Teraflops
· Infiniband QDR interconnect
o 1,440 cables totaling 9.1 miles
· Operating System: CentOS
· Main Memory: 22,104 GB
· 266 VA [jrh: this is clearly incorrect unless they are talking about each server]
o Each rack is 32kW
· 96 JBOD enclosures
o 2,304 1TB disks
· 12 GB RAM/node & 70TB total
· PUE 1.035 [jrh: I strongly suspect they meant 1.35]
· 328 tons cooling
· 7.3 million gallons of water per year
The video is worth watching although, if you play with cross referencing the numbers above, there appear to be many mistakes: Red Sky time Lapse. Thanks to Jeff Barr for sending this one my way.
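A few of the cross-checks can be done mechanically. The sketch below uses only the figures from the notes, plus two labeled assumptions: 8 cores per node (dual-socket, quad-core parts) and the standard conversion of roughly 3.517 kW per ton of refrigeration.

```python
# Figures taken from the video notes above.
cores = 47_232
gflops_per_core = 11.72
disks, enclosures = 2_304, 96
ram_per_node_gb = 12
main_memory_gb = 22_104
cooling_tons = 328

# Assumptions not stated in the notes.
CORES_PER_NODE = 8   # assumed: two quad-core sockets per node
KW_PER_TON = 3.517   # one ton of refrigeration in kW (approximate)

peak_tflops = cores * gflops_per_core / 1000
disks_per_enclosure = disks / enclosures
nodes = cores // CORES_PER_NODE
total_ram_gb = nodes * ram_per_node_gb
nodes_implied_by_main_memory = main_memory_gb / ram_per_node_gb
cooling_kw = cooling_tons * KW_PER_TON

print(f"peak: {peak_tflops:.2f} TFlops (notes say 553)")
print(f"disks per JBOD enclosure: {disks_per_enclosure:.0f}")
print(f"nodes at {CORES_PER_NODE} cores each: {nodes}, total RAM {total_ram_gb:,} GB (notes say 70TB)")
print(f"nodes implied by 'Main Memory: 22,104 GB': {nodes_implied_by_main_memory:.0f}")
print(f"cooling capacity: {cooling_kw:.0f} kW")
```

Under the 8-cores-per-node assumption, the core count, the per-core GFlops, and the 70TB total agree with each other, while the 22,104 GB main memory figure does not fit the same node count.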
--jrh
I’ve attached below my rough notes from Andy Bechtolsheim’s talk this morning at High Performance Transactions Systems 2009. The agenda for HPTS 2009 is up at: http://www.hpts.ws/agenda.html.
Andy is a founder of Sun Microsystems and of Arista Networks and is an incredibly gifted hardware designer. He’s consistently able to innovate and push the design limits while delivering reliable systems that really do work. It was an excellent talk. Unfortunately, he’s not planning to post the slides so you’ll have to make do with my rough notes below.
· Memory Technologies for Data Intensive Computing
· Speaker: Andy Bechtolsheim, Sun Microsystems
· Flash density is increasing faster than Moore’s law and this is expected to continue
o Expect 100x improvement over the next 10 or 12 years
· Emerging technologies are coming
o Carbon Nano-Tube, Phase-change, Nano-ionic, …
o But new technologies take time so flash for now
· Expect: in 2022
o 64x cores but only 500W
o Would need 2.5 TB/s to memory and 250 GB/s to memory
· We would have been at 10GHz in 2022 but power density limits make this impractical
o Power = clock * capacitance * Vdd^2
· Most saving will be packaging innovations: Multi-Chip 3D packaging (stacking cpu and many memory chips)
o More bandwidth through more channels without having to drive more pins (power issue)
· Expect no more memory per core than today and it could be worse
o Expect deeper multi-tier memories
· 10Gbps shipping today but expect 25GB in 2012
· Disks are SOOOOO slow in this context
o Forget disk for all but sequential and archival storage
· Sun Flash DIMM
o 30,000 Read IOPS, 20,000 writes
o Oracle did 7,717,510 tpmC using 24 sun flash devices
· Not easy to get 10^6 IOPS
o Limit is disk interface
o Answer is to go direct to the PCIe bus [jrh: this is what FusionIO does]
· Flash very different from DRAM: 100usec to read flash. About 1000x slower than DRAM.
· Enterprise flash coming:
o Rather than power optimized 33 MHz transfers, run 133 MHz or better
· Flash Summary:
o Expect the price of flash to halve each year and the density to double each year
o Access times will fall by 50% per year
o Throughput will double each year
o Controllers are rapidly improving
o Interface moving from SATA to PCIe
· Most promising new technologies are stacked chips (through-silicon via stacking) and flash
· Expect optic volumes to go up and price to go down driven by client side volumes:
o Intel Light Peak announced $5/client with on board chips
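The dynamic power relation in the notes above (Power = clock * capacitance * Vdd^2) is enough to see why clock rates stalled. The sketch below applies it as a ratio between two design points. The 3GHz baseline and the 20% voltage increase are hypothetical values picked for illustration and were not part of the talk.

```python
def dynamic_power(clock_hz, capacitance_f, vdd_volts):
    """Dynamic switching power in watts: clock * capacitance * Vdd^2."""
    return clock_hz * capacitance_f * vdd_volts ** 2


def relative_power(clock_ratio, vdd_ratio, capacitance_ratio=1.0):
    """Power of a new design point relative to a baseline, from the same relation."""
    return clock_ratio * capacitance_ratio * vdd_ratio ** 2


# Hypothetical baseline of 3GHz scaled to 10GHz at unchanged voltage and capacitance.
same_voltage = relative_power(clock_ratio=10 / 3, vdd_ratio=1.0)

# Same clock increase, assuming the supply voltage must also rise 20% (hypothetical).
higher_voltage = relative_power(clock_ratio=10 / 3, vdd_ratio=1.2)

print(f"10GHz vs 3GHz, same Vdd:   {same_voltage:.2f}x power")
print(f"10GHz vs 3GHz, Vdd up 20%: {higher_voltage:.2f}x power")
```

Because voltage enters as a square, any voltage increase needed to reach a higher clock compounds the linear cost of the clock itself, all dissipated in the same die area.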
Generally Andy is incredibly positive on the continuation of Moore’s law and expects this pace of advancement to continue for at least another 10 years. He argues that disk is only useful for cold and sequential workloads and that flash is the future. Phase Change Memory and other new technologies may eventually replace flash but he points out these changes always take longer than predicted.
Expect flash to stay strong and relevant for the near term and expect it to be PCIe connected rather than SATA attached.
I attended the Stanford Clean Slate CTO Summit last week. It was a great event organized by Guru Parulkar. Here’s the agenda:
· 12:00: State of Clean Slate -- Nick McKeown, Stanford
· 12:30: Software defined data center networking -- Martin Casado, Nicira
· 1:00: Role of OpenFlow in data center networking -- Stephen Stuart, Google
· 2:30: Data center networks are in my way -- James Hamilton, Amazon
· 3:00: Virtualization and Data Center Networking -- Simon Crosby, Citrix
· 3:30: RAMCloud: Scalable Datacenter Storage Entirely in DRAM -- John Ousterhout, Stanford
· 4:00: L2.5: Scalable and reliable packet delivery in data centers -- Balaji Prabhakar, Stanford
· 4:45: Panel: Challenges of Future Data Center Networking -- Panelists: James Hamilton, Stephen Stuart, Andrew Lambeth (VMWare), Marc Kwiatkowski (Facebook)
I presented Networks are in my Way. My basic premise is that networks are both expensive and poor power/performers. But, much more important, they are in the way of other even more important optimizations. Specifically, because most networks are heavily oversubscribed, the server workload placement problem ends up being seriously over-constrained. Server workloads need to be near storage, near app tiers, distant from redundant instances, near customer, and often on a specific subnet due to load balancer or VM migration restrictions. Getting networks out of the way so that servers can be even slightly better utilized will have considerably more impact than many direct gains achieved by optimizing networks.
Providing cheap 10Gbps to the host gets networks out of the way by enabling the hosting of many data intensive workloads such as data warehousing, analytics, commercial HPC, and MapReduce workloads. Simply providing more and cheaper bandwidth could potentially have more impact than many direct networking innovations.
Networking power/performance is unquestionably poor. I often refer to net gear as the SUV of the data center. However, the biggest gain in power efficiency that networks could enable isn’t in reducing networking power but in getting out of the way and allowing better server utilization. Networking is under 4% of the power consumption in a typical high-scale data center whereas servers are responsible for 44%. I’m arguing that the best networking power innovations are the ones that help make the use of servers more efficient.
Looking at networking cost, we see we actually do have a direct problem there. At scale, networking gear represents a full 18% of the cost of all infrastructure (shell, power, power distribution, mechanical systems, servers, & networking). For every $2.50 spent on servers, roughly $1 is spent on networking. Over time, the ratio of networking gear to servers continues to worsen. I look at this phenomenon in more detail in It's the Eco System Stupid where the commodity server ecosystem is compared to the current networking equipment ecosystem. In my view, the industry needs a competitive, multi-source networking hardware and software stack.
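A small sketch of why the server side dominates. It uses the two power shares from the paragraph above (treating "under 4%" as 4%) and compares a hypothetical 50% cut in networking power against a hypothetical 10% cut in server power for the same work. Both reduction percentages are illustrative assumptions.

```python
NETWORK_SHARE = 0.04  # share of data center power; "under 4%" treated as an upper bound
SERVER_SHARE = 0.44   # share of data center power consumed by servers


def facility_savings(component_share, component_reduction):
    """Fraction of total facility power saved by shrinking one component."""
    return component_share * component_reduction


# Hypothetical improvements chosen only to compare orders of magnitude.
halve_network = facility_savings(NETWORK_SHARE, 0.50)
trim_servers = facility_savings(SERVER_SHARE, 0.10)

print(f"halving networking power saves {halve_network:.1%} of facility power")
print(f"10% less server power saves    {trim_servers:.1%} of facility power")
```

Even an aggressive improvement in networking power is worth less than a modest improvement in how efficiently the servers are used.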
--jrh
I love Solid State Disks and have written about them extensively.
And, being a lover of SSDs, I know they are a win in many situations including power savings. However, try as I might (and I have tried hard), I can’t find a way to justify using SSDs on power savings alone. In When SSDs Make Sense in Server Applications I walked through where flash SSDs are a clear win. They are great for replacing disks in VERY high IOPS workloads. They are great for workloads that can’t go to disk and must be held entirely in main-memory caches. In this usage mode, SSDs can allow data that previously had to be memory resident to be moved to SSD. This can be a huge win in that memory is very power intensive and, as much as memory prices have fallen, it’s still expensive relative to disk and flash.
In this recently released article, MySpace Replaces all Server Hard Disks with Flash Drives, it was announced that MySpace has completely stopped using hard disks. The article said “MySpace said the solid state storage uses less than 1% of the power and cooling costs that their previous hard drive-based server infrastructure had and that they were able to remove all of their server racks because the ioDrives are embedded directly into even its smallest servers.” Presumably they meant “remove all of the storage racks” rather than “remove all the server racks.”
I totally believe that MySpace can and should move some content that currently must live in memory and shift it out to SSDs and I like the savings that will come from doing this. No debate. I fully expect MySpace has some workloads that drive incredibly high IOPS rates and these can be replaced by SSDs. But in every company I’ve ever worked at or visited, the vast majority of the persistent disk resident data is cold. Security and audit logs, backups, boot disks, archival copies, debugging information, rarely accessed large objects. Putting cold data without extremely aggressive access time SLAs on flash just doesn’t make sense. These data are capacity bound rather than IOPS bound and flash is an expensive way to get capacity.
The argument made in the article is that the power savings of flash over disk justify the cost per GB delta. I can’t make that math even get close to working. In Annual Fully Burdened Cost of Power we looked at the cost of fully provisioned power and found that if we include power distribution costs, cooling costs, and power, power costs $2.12 per watt per year. Given that a commodity disk (where cold data belongs) consumes 6 to 10W and can store well over a TB, this math simply can’t be made to work. I have come across folks that think the power savings justify the technology change even for cold data but I’ve never seen a case where the assertion stood the test of scrutiny.
SSDs are wonderful for many applications and I would certainly replace some disks with SSDs but replacing ALL disks doesn’t make sense in the workload mixes found in most data centers. In this case, I suspect it was a communication error and MySpace has not really replaced all of their disks with SSDs.
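The math can be sketched directly from the $2.12 per watt per year figure and the 6 to 10W disk range. The flash price premium used in the break-even example is a hypothetical placeholder since the article gives no prices, and the sketch generously assumes the replacement SSD draws no power at all.

```python
COST_PER_WATT_YEAR = 2.12  # fully burdened cost of power, dollars per watt per year


def annual_power_cost(watts, cost_per_watt_year=COST_PER_WATT_YEAR):
    """Fully burdened yearly cost of running a device that draws `watts`."""
    return watts * cost_per_watt_year


def years_to_break_even(price_premium, watts_saved, cost_per_watt_year=COST_PER_WATT_YEAR):
    """Years of power savings needed to repay a purchase price premium."""
    return price_premium / (watts_saved * cost_per_watt_year)


for disk_watts in (6, 10):
    print(f"{disk_watts}W disk: ${annual_power_cost(disk_watts):.2f} per year in power")

# Hypothetical: flash costs $1,000 more than a disk of the same capacity,
# and the SSD is assumed to draw zero watts (best case for flash).
hypothetical_premium = 1_000
print(f"break-even: {years_to_break_even(hypothetical_premium, watts_saved=10):.1f} years")
```

The best-case annual saving per disk is bounded by the disk's own power cost, so the premium that power alone can justify over a realistic service life is small.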
Thanks to Tom Klienpeter who sent this one my way.
--jrh
In past posts such as Web Search Using Small Cores I’ve said “Atom is a wonderful processor but current memory managers on Atom boards don’t support Error Correcting Codes (ECC) nor greater than 4 gigabytes of memory. I would love to use Atom in server designs but all the data I’ve gathered argues strongly that no server workload should be run without ECC.” And, in Linux/Apache on ARM Processors I said “unlike Intel Atom based servers, this ARM-based solution has the full ECC Memory support we want in server applications (actually you really want ECC in all applications from embedded through client to servers).”
An excellent paper was just released that puts hard data behind this point and shows conclusively that ECC is absolutely needed. In DRAM Errors in the Wild: A Large Scale Field Study, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber show conclusively that you really do need ECC memory in server applications. Wolf was also an author of the excellent Power Provisioning in a Warehouse-Sized Computer that I mentioned in my blog post Slides From Conference on Innovative Data Systems Research where the authors described a technique to over-subscribe data center power.
I continue to believe that client systems should also be running ECC and strongly suspect that a great many kernel and device driver failures are actually the result of memory faults. We don’t have the data to prove it conclusively from a client population but I’ve long suspected that the single most effective way for Windows to reduce its blue screen rate would be to make ECC a required feature for Windows Hardware Certification.
Returning to servers, in Kathy Yelick’s ISCA 2009 keynote, she showed a graph that showed ECC recovery rates (very common) and noted that the recovery times are substantial and the increased latency of correction is substantially slowing the computation (ISCA 2009 Keynote I: How to Waste a Parallel Computer – Kathy Yelick).
This more recent data further supports Kathy’s point, includes wonderfully detailed analysis and concludes with:
· Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778 –25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. The conclusion we draw is that error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors. In fact, we found that platforms with more powerful error codes (chipkill versus SECDED) were able to reduce uncorrectable error rates by a factor of 4–10 over the less powerful codes. Nonetheless, the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.
· Conclusion 2: Memory errors are strongly correlated. We observe strong correlations among correctable errors within the same DIMM. A DIMM that sees a correctable error is 13–228 times more likely to see another correctable error in the same month, compared to a DIMM that has not seen errors. There are also correlations between errors at time scales longer than a month. The autocorrelation function of the number of correctable errors per month shows significant levels of correlation up to 7 months. We also observe strong correlations between correctable errors and uncorrectable errors. In 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or the previous month, and the presence of a correctable error increases the probability of an uncorrectable error by factors between 9–400. Still, the absolute probabilities of observing an uncorrectable error following a correctable error are relatively small, between 0.1–2.3% per month, so replacing a DIMM solely based on the presence of correctable errors would be attractive only in environments where the cost of downtime is high enough to outweigh the cost of the expected high rate of false positives.
· Conclusion 3: The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to replacements). Given that DRAM DIMMs are devices without any mechanical components, unlike for example hard drives, we see a surprisingly strong and early effect of age on error rates. For all DIMM types we studied, aging in the form of increased CE rates sets in after only 10–18 months in the field. On the other hand, the rate of incidence of uncorrectable errors continuously declines starting at an early age, most likely because DIMMs with UEs are replaced (survival of the fittest).
· Conclusion 4: There is no evidence that newer generation DIMMs have worse error behavior. There has been much concern that advancing densities in DRAM technology will lead to higher rates of memory errors in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with newer generations. In fact, the DIMMs used in the three most recent platforms exhibit lower CE rates, than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling.
· Conclusion 5: Within the range of temperatures our production systems experience in the field, temperature has a surprisingly low effect on memory errors. Temperature is well known to increase error rates. In fact, artificially increasing the temperature is a commonly used tool for accelerating error rates in lab studies. Interestingly, we find that differences in temperature in the range they arise naturally in our fleet’s operation (a difference of around 20C between the 1st and 9th temperature decile) seem to have a marginal impact on the incidence of memory errors, when controlling for other factors, such as utilization.
· Conclusion 6: Error rates are strongly correlated with utilization.
· Conclusion 7: Error rates are unlikely to be dominated by soft errors. We observe that CE rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with memory scrubbers, this observation leads us to the conclusion that a significant fraction of errors is likely due to mechanisms other than soft errors, such as hard errors or errors induced on the data path. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation. Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors. Conclusion 7 might also explain the significantly higher rates of memory errors we observe compared to previous studies.
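To get a feel for what the rates in Conclusion 1 mean in practice, here is a small sketch that converts FIT per Mbit into expected correctable errors per DIMM per year, and the 1.3% per-machine uncorrectable error incidence into an expected count for a fleet. The FIT figures and the 1.3% incidence come from the quoted conclusions; the 1 GiB DIMM size, the 10,000-machine fleet, and the function names are hypothetical values I picked for illustration. Keep in mind that these are averages: the paper notes that errors per DIMM are highly variable and that most DIMMs see no errors at all in a year, so the mean is driven by a minority of bad DIMMs.

```python
# Back-of-envelope conversion of the quoted error rates.
# From the paper's Conclusion 1: 25,000-75,000 FIT per Mbit (average) and
# a 1.3% annual incidence of uncorrectable errors per machine.
# Assumptions (not from the paper): a 1 GiB DIMM and a 10,000-machine fleet.

HOURS_PER_YEAR = 24 * 365   # 8,760 hours
FIT_HOURS = 1e9             # FIT = failures per billion device-hours


def errors_per_year(fit_per_mbit: float, dimm_mbit: float) -> float:
    """Expected errors per DIMM per year at a given FIT-per-Mbit rate."""
    return fit_per_mbit * dimm_mbit * HOURS_PER_YEAR / FIT_HOURS


def expected_affected(fleet_size: int, annual_incidence: float) -> float:
    """Expected number of units seeing at least one event in a year."""
    return fleet_size * annual_incidence


if __name__ == "__main__":
    dimm_mbit = 1024 * 8  # hypothetical 1 GiB DIMM = 8,192 Mbit
    for fit in (25_000, 75_000):
        rate = errors_per_year(fit, dimm_mbit)
        print(f"{fit:,} FIT/Mbit: about {rate:,.0f} correctable errors "
              f"per DIMM per year (fleet average)")
    machines = expected_affected(10_000, 0.013)
    print(f"10,000 machines: about {machines:.0f} see an uncorrectable error each year")
```

Even with ECC absorbing the correctable errors, a fleet of this hypothetical size would still see an uncorrectable error on more than a hundred machines a year, which is why the paper calls a crash-tolerant application layer indispensable.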
Based upon this and other data, I recommend against non-ECC servers. Read the full paper at: DRAM Errors in the Wild: A Large Scale Field Study. Thanks to Cary Roberts for pointing me to this paper.
--jrh
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com