In the data center world, there are few events taken more seriously than power failure and considerable effort is spent to make them rare. When a datacenter experiences a power failure, it’s a really big deal for all involved. But a big deal in the infrastructure world still really isn’t a big deal on the world stage. The Super Bowl absolutely is a big deal by any measure. On average over the last couple of years, the Super Bowl has attracted 111 million viewers and is the number 1 most watched television show in North America, eclipsing the final episode of M*A*S*H. World-wide, the Super Bowl is only behind the European Cup (UEFA Champions League) which draws 178 million viewers.
When the 2013 Super Bowl power event occurred, the Baltimore Ravens had just run back the second half opening kick for a touchdown and they were dominating the game with a 28 to 6 point lead. The 49ers had already played half the game and failed to get a single touchdown. The Ravens were absolutely dominating and they started the second half by tying the record for the longest kickoff return in NFL history at 108 yards. The game momentum was strongly with Baltimore.
At 13:22 in the third quarter, just 98 seconds into the second half, ½ of the Superdome lost primary power. Fortunately it wasn’t during the runback that started the second half. The power failure led to a 34 min delay to restore full lighting to the field and, when the game restarted, the 49ers were on fire. The game was fundamentally changed by the outage with the 49ers rallying back to a narrow defeat of only 3 points. The game ended 34 to 31 and it really did come down to the wire where either team could have won. There is no question the game was exciting and some will argue the power failure actually made the game more exciting. But, NFL championships should be decided on the field and not impacted by the electrical system used by the host stadium.
What happened at 13:22 in the third quarter when much of the field lighting failed? Entergy, the utility supplying power to the Superdome, reported their “distribution and transmission feeders that serve the Superdome were never interrupted” (Before Game Is Decided, Superdome Goes Dark). It was a problem at the facility.
The joint report from SMG, the company that manages the Superdome, and Entergy, the utility power provider, said:
A piece of equipment that is designed to monitor electrical load sensed an abnormality in the system. Once the issue was detected, the sensing equipment operated as designed and opened a breaker, causing power to be partially cut to the Superdome in order to isolate the issue. Backup generators kicked in immediately as designed.
Entergy and SMG subsequently coordinated start-up procedures, ensuring that full power was safely restored to the Superdome. The fault-sensing equipment activated where the Superdome equipment intersects with Entergy’s feed into the facility. There were no additional issues detected. Entergy and SMG will continue to investigate the root cause of the abnormality.
Essentially, the utility circuit breaker detected an “anomaly” and opened the breaker. Modern switchgear have many sensors monitored by firmware running on a programmable logic controller. The advantage of these software systems is they are incredibly flexible and can be configured uniquely for each installation. The disadvantage is that the wide variety of configurations they support can be complex, and the default configurations are used perhaps more often than they should be. The default configurations in a country where legal settlements can be substantial tend towards the conservative side. We don’t know if that was a factor in this event but we do know that no fault was found and the power was stable for the remainder of the game. This was almost certainly a false trigger.
The cause has not yet been reported and, quite often, the underlying root cause is never found. But it’s worth asking: is it possible to avoid long game outages and what would it cost? As when looking at any system fault, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.
Fault avoidance: Avoidance starts with using good quality equipment, configuring it properly, maintaining it well, and testing it frequently. Given the Superdome just went through a $336 million renovation, the switch gear may have been relatively new and, even if it wasn’t, it was almost certainly recently maintained and inspected.
Where issues often arise are in configuration. Modern switch gear have an amazingly large number of parameters, many of which interact with each other and, in total, can be difficult to fully understand. And, given the switch gear manufacturers know little about the intended end-use application of each switchgear sold, they ship conservative default settings. Generally, the risk and potential negative impact of a false positive (breaker opens when it shouldn’t) is far less than that of a breaker that fails to open. Consequently conservative settings are common.
Another common cause of problems is lack of testing. The best way to verify that equipment works is to test at full production load in a full production environment in a non-mission critical setting. Then test it just short of overload to ensure that it can still reliably support the full load even though the production design will never run it that close to the limit, and finally, test it into overload to ensure that the equipment opens up on real faults.
The first, testing in a full production environment in a non-mission critical setting, is always done prior to a major event. But the latter two tests are much less common: 1) testing at rated load, and 2) testing beyond rated load. Both require synthetic load banks and skilled electricians and so these tests are often not done. You really can’t beat testing in a non-mission critical setting as a means of ensuring that things work well in a mission critical setting (game time).
Redundancy: If we can’t avoid a fault entirely, the next best thing is to have redundancy to mask the fault. Faults will happen. The electrical fault at the Monday Night Football game back in December of 2011 was caused by a utility sub-station failing. These faults are unavoidable and will happen occasionally. But is protection against utility failure possible and affordable? Sure, absolutely. Let’s use the Superdome fault yesterday as an example.
The entire Superdome load is only 4.6MW. This load would be easy to support on two 2.5 to 3.0MW utility feeds each protected by its own generator. Generators in the 2.5 to 3.0 MW range are substantial V16 diesel engines the size of a mid-sized bus. And they are expensive, running just under $1M each, but they are also available in mobile form and inexpensive to rent. The rental option is a no-brainer but let’s ignore that and look at what it would cost to protect the Superdome year around with a permanent installation. We would need 2 generators, the switchgear to connect them to the load, and uninterruptible power supplies to hold the load during the first few seconds of a power failure until the generators start up and are able to pick up the load. To be super safe, we’ll buy a third generator just in case there is a problem and one of the two generators doesn’t start. The generators are under $1m each and the overall cost of the entire redundant power configuration with the extra generator could be had for under $10m. Looking at statistics from the 2012 event, a 30 second commercial costs just over $4m.
For the price of just over 60 seconds of commercials the facility could be protected against the fault. And, using rental generators, less than 30 seconds of commercials would provide the needed redundancy to avoid impact from any utility failure. Given how common utility failures are and the negative impact of power disruptions at a professional sporting event, this looks like good value to me. Most sports facilities choose to avoid this “unnecessary” expense and I suspect the Superdome doesn’t have full redundancy for all of its field lighting. But even if it did, this failure mode can sometimes cause the generators to be locked out and not pick up the load during some power events. In this failure mode, when a utility breaker incorrectly senses a ground fault within the facility, it is frequently configured to not put the generator at risk by switching it into a potential ground fault. My take is I would rather run the risk of damaging the generator and avoid the outage so I’m not a big fan of this “safety” configuration but it is a common choice.
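The generator count in this paragraph follows a simple N+1 rule: enough units to carry the load, plus one spare. The sketch below reproduces that arithmetic using the 4.6MW load and the low end of the 2.5 to 3.0MW generator range quoted above. The function name and the choice of a single spare are illustrative assumptions, not a sizing method used by any facility.

```python
import math

def generators_needed(load_mw: float, unit_mw: float, spares: int = 1) -> int:
    """Units required to carry the load, plus spare units (N + spares)."""
    return math.ceil(load_mw / unit_mw) + spares

SUPERDOME_LOAD_MW = 4.6  # facility load quoted in the text
UNIT_MW = 2.5            # low end of the generator range quoted in the text

# Two units carry the 4.6MW load; the third is the "just in case" spare.
print(generators_needed(SUPERDOME_LOAD_MW, UNIT_MW))  # 3
```

The same count results at the 3.0MW end of the range, since a single 3.0MW unit still cannot carry 4.6MW on its own.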
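The comparison with commercial time can be written out as a quick calculation. This sketch uses only the bounds given in the text (a permanent installation for under $10m, and a 30 second spot at just over $4m), so the result is an upper bound rather than an exact figure. The constant and function names are made up for illustration.

```python
AD_COST_USD = 4_000_000          # "just over $4m" per 30 second spot
AD_LENGTH_S = 30
PERMANENT_COST_USD = 10_000_000  # upper bound: "under $10m"

def ad_seconds(cost_usd: float) -> float:
    """Seconds of commercial time with the same price as cost_usd."""
    return cost_usd * AD_LENGTH_S / AD_COST_USD

# Upper bound on the commercial time needed to pay for permanent redundancy.
print(ad_seconds(PERMANENT_COST_USD))  # 75.0
```

Because the real installation cost is below $10m and the real spot price is above $4m, the true figure falls below 75 seconds, which is consistent with the "just over 60 seconds" estimate.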
Minimize Fault Zones: The reason why only ½ the power to the Superdome went down was because the system installed at the facility has two fault containment zones. In this design, a single switchgear event can only take down ½ of the facility.
Clearly the first choice is to avoid the fault entirely. And, if that doesn’t work, have redundancy take over and completely mask the fault. But, in the rare cases where none of these mitigations work, the next defense are small fault containment zones. Rather than using 2 zones, spend more on utility breakers and have 4 or 6 and, rather than losing ½ the facility, lose ¼ or 1/6. And, if the lighting power is checker boarded over the facility lights, (lights in a contiguous region are not all powered by the same utility feed but the feeds are distributed over the lights evenly), rather than losing ¼ or 1/6 of the lights in one area of the stadium, we would lose that fraction of the lights evenly over the entire facility. Under these conditions, it might be possible to operate with slightly degraded field lighting and be able to continue the game without waiting for light recovery.
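A small model makes the checker-boarding argument concrete. In this sketch the lights are treated as a simple row of 24 fixtures on 4 feeds, which is a hypothetical layout chosen only for illustration. Either way a failed feed takes out the same fraction of lights; what changes is whether the dark lights are adjacent.

```python
def assign_contiguous(num_lights: int, num_feeds: int) -> list[int]:
    """Each feed powers one contiguous block of lights."""
    per_feed = num_lights // num_feeds
    return [min(i // per_feed, num_feeds - 1) for i in range(num_lights)]

def assign_checkerboard(num_lights: int, num_feeds: int) -> list[int]:
    """Feeds are interleaved so neighbouring lights use different feeds."""
    return [i % num_feeds for i in range(num_lights)]

def longest_dark_run(assignment: list[int], failed_feed: int) -> int:
    """Longest stretch of adjacent lights that go dark when one feed fails."""
    longest = current = 0
    for feed in assignment:
        current = current + 1 if feed == failed_feed else 0
        longest = max(longest, current)
    return longest

lights, feeds = 24, 4  # hypothetical layout
print(longest_dark_run(assign_contiguous(lights, feeds), 0))    # 6
print(longest_dark_run(assign_checkerboard(lights, feeds), 0))  # 1
```

Both layouts lose 6 of 24 lights when feed 0 fails, but the contiguous layout leaves a dark block of 6 while the interleaved layout never has two dark lights side by side.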
Fast Recovery: Before we get to this fourth option, fast recovery, we have tried hard to avoid failure, then we have used power redundancy to mask the failure, then we have used small fault zones to minimize the impact. The next best thing we can do is to recover quickly. Fast recovery depends broadly on two things: 1) if possible automate recovery so it can happen in seconds rather than the rate at which humans can act, 2) if humans are needed, ensure they have access to adequate monitoring and event recording gear so they can see what happened quickly and they have trained extensively and are able to act quickly.
In this particular event, the recovery was not automated. Skilled electrical technicians were required. They spent nearly 15 minutes checking system states before deciding it was safe to restore power. Generally, 15 min on a human judgment driven recovery decision isn’t bad. But the overall outage was 34 min. If the power was restored in 15 min, what happened during the next 20? The gas discharge lighting still favored at large sporting venues takes roughly 15 minutes to restart after a momentary outage. Even a very short power interruption will still suffer the same long recovery time. Newer light technologies are becoming available that are both more power efficient and don’t suffer from these long warm-up periods.
It doesn’t appear that the final victor of Super Bowl XLVII was changed by the power failure but there is no question the game was broadly impacted. If the light failure had happened during the kickoff return starting the third quarter, the game may have been changed in a very fundamental way. Better power distribution architectures are cheap by comparison. Given the value of the game and the relatively low cost of power redundancy equipment, I would argue it’s time to start retrofitting major sporting venues with more redundant design and employing more aggressive pre-game testing.
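The recovery time here is the sum of two serial steps: the human decision to restore power and the lamp restart that follows. The sketch below is a rough model under that assumption, using the approximate 15 minute figures from the text. It is not a reconstruction of the actual event timeline, and the few minutes between this total and the 34 minute delay are not accounted for.

```python
def outage_minutes(decision_min: float, lamp_restart_min: float) -> float:
    """Dark time when the lamp restart can only begin after power is restored."""
    return decision_min + lamp_restart_min

# Figures from the text: about 15 min to decide, about 15 min lamp restart.
print(outage_minutes(15, 15))  # 30

# Fully automated recovery still leaves the lamp restart time.
print(outage_minutes(0, 15))   # 15

# Instant-on lighting with the same human decision time.
print(outage_minutes(15, 0))   # 15
```

Under this model, automating the restore decision and moving to lighting without a long warm-up each remove roughly half of the delay, and neither alone eliminates it.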
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
In the cloud there is nothing more important than customer trust. Without customer trust, a cloud business can’t succeed. When you are taking care of someone else’s assets, you have to treat those assets as more important than your own. Security has to be rock solid and absolutely unassailable. Data loss or data corruption has to be close to impossible and incredibly rare. And all commitments to customers have to be respected through business changes. These are hard standards to meet but, without success against these standards, a cloud service will always fail. Customers can leave any time and, if they have to leave, they will remember you did this to them.
These are facts and anyone working in cloud services labors under these requirements every day. It’s almost reflexive and nearly second nature. What brought this up for me over the weekend was a note I got from one of my cloud service providers. It emphasized that it really is worth talking more about customer trust.
Let’s start with some history. Many years ago, Michael Merhej and Tom Kleinpeter started a company called ByteTaxi that eventually offered a product called Foldershare. It was a simple service with a simple UI but it did peer-to-peer file sync incredibly well, it did it through firewalls, it did it without install confusion and, well, it just worked. It was a simple service but was well executed and very useful. In 2005, Microsoft acquired Foldershare and continued to offer the service. It didn’t get enhanced much for years but it remained useful. Then Microsoft came up with a broader plan called Windows Live Mesh and the Foldershare service was renamed. Actually the core peer-to-peer functionality passed through an array of names and implementations from Foldershare, Windows Live Foldershare, Windows Live Sync and finally Windows Live Mesh.
During the early days at Microsoft, it was virtually uncared for and had little developer attention. As new names and implementations were announced and the feature actually had developer attention, it was getting enhanced but, ironically, it was also getting somewhat harder to use and definitely less stable. But, it still worked and the functionality lived on in Live Mesh. Microsoft has another service called Skydrive that does the same thing that all the other cloud sync services do: sync files to cloud hosted storage. Unfortunately, it doesn’t include the core peer-to-peer functionality of Live Mesh. Reportedly 40% of the Live Mesh users also use Skydrive.
This is where we get back to customer trust. Over the weekend, Microsoft sent out a note to all Mesh users confirming it will be shut off next month as a follow up to their announcement that the service will be killed that went out in December. They explained the reason to terminate the service and remove the peer-to-peer file sync functionality:
Currently 40% of Mesh customers are actively using SkyDrive and based on the positive response and our increasing focus on improving personal cloud storage, it makes sense to merge SkyDrive and Mesh into a single product for anytime and anywhere access for files.
Live Mesh is being killed without a replacement service. It’s not a big deal but 2 months isn’t a lot of warning. I know that this sort of thing can happen to small startups anytime and, at any time, customers could get left unsupported. But, Microsoft seems well beyond the startup phase at this point. I get that strategic decisions have to be made but there are times when I wonder how much thought went into the decision. I suspect it was something like “there are only 3 million Live Mesh customers so it’s really not worth continuing with it.” And, it actually may not be worth continuing the service. But, there is this customer trust thing. And I just hate to see it violated – it’s bad for all cloud providers when anyone in the industry makes a decision that raises the customer trust question.
Fortunately, there is a Mesh replacement service: http://www.cubby.com/. I’ve been using it since the early days when it was in controlled beta. Over the last month or so Cubby has moved to full, unrestricted production. It’s been solid for the period I’ve been using it and, like Foldershare, it’s simple and it works. I really like it. If you are a Mesh user, were a Foldershare user, or just would like to be able to sync your files between your different systems, try Cubby. Cubby also adds support for Android and iOS devices without extra cost. Cubby is well executed and stable.
It must be Cloud Cleaning week at Microsoft. A friend forwarded the note sent to the millions of active Microsoft Messenger customers this month: the service is being “retired” and users are recommended to consider Skype.
If you are interested in reading more on the Live Mesh service elimination, the following is the text of the note sent to all current Mesh users:
Dear Mesh customer,
Recently we released the latest version of SkyDrive, which you can use to:
- Choose the files and folders on your SkyDrive that sync on each computer.
- Access your SkyDrive using a brand new app for Android v2.3 or the updated apps for Windows Phone, iPhone, and iPad.
- Collaborate online with the new Office Web apps, including Excel forms, co-authoring in PowerPoint and embeddable Word documents.
Currently 40% of Mesh customers are actively using SkyDrive and based on the positive response and our increasing focus on improving personal cloud storage, it makes sense to merge SkyDrive and Mesh into a single product for anytime and anywhere access for files. As a result, we will retire Mesh on February 13, 2013. After this date, some Mesh functions, such as remote desktop and peer to peer sync, will no longer be available and any data on the Mesh cloud, called Mesh synced storage or SkyDrive synced storage, will be removed. The folders you synced with Mesh will stop syncing, and you will not be able to connect to your PCs remotely using Mesh.
We encourage you to try out the new SkyDrive to see how it can meet your needs. During the transition period, we suggest that, in addition to using Mesh, you sync your Mesh files using SkyDrive. This way, you can try out SkyDrive without changing your existing Mesh setup. For tips on transitioning to SkyDrive, see SkyDrive for Mesh users on the Windows website. If you have questions, you can post them in the SkyDrive forums.
Mesh customers have been influential and your feedback has helped shape our strategy for Mesh and SkyDrive. We would not be here without your support and hope you continue to give us feedback as you use SkyDrive.
The Windows Live Mesh and SkyDrive teams
There is real danger of thinking of customers as faceless aggregations of hundreds of thousands or even millions of users. We need to think through decisions one user at a time and make it work for them individually. If millions of active users are on Microsoft Messenger, what would it take to make them want to use Skype? If 60% of the Windows Live Mesh users chose not to use Microsoft Skydrive, why is that? Considering customers one at a time is clearly the right thing for customers but, long haul, it’s also the right thing for the business. It builds the most important asset in the cloud, customer trust.
Since 2008, I’ve been excited by, working on, and writing about Microservers. In these early days, some of the workloads I worked with were I/O bound and didn’t really need or use high single-thread performance. Replacing the server class processors that supported these applications with high-volume, low-cost client system CPUs yielded both better price/performance and power/performance. Fortunately, at that time, there were good client processors available with ECC enabled (see You Really DO Need ECC) and most embedded system processors also supported ECC.
I wrote up some of the advantages of these early microserver deployments and showed performance results from a production deployment in an internet-scale mail processing application in Cooperative, Expendable, Microslice, Servers: Low-Cost, Low-Power Servers for Internet-Scale Services.
Intel recognizes the value of low-power, low-cost processors for less CPU demanding applications and announced this morning the newest members of the Atom family, the S1200 series. These new processors support 2 cores and 4 threads and are available in variants of up to 2GHz while staying under 8.5 watts. The lowest power members of the family come in at just over 6W. Intel has demonstrated an S1200 reference board running spec_web at 7.9W including memory, SATA, Networking, BMC, and other on-board components.
Unlike past Atom processors, the S1200 series supports full ECC memory. And all members of the family support hardware virtualization (Intel VT-x2), 64 bit addressing, and up to 8GB of memory. These are real server parts.
Centerton (S1200 series) features:
One of my favorite Original Design Manufacturers, Quanta Computer, has already produced a shared infrastructure rack design that packs 48 Atom S1200 servers into a 3 rack unit form factor (5.25”).
Quanta S900-X31A front and back view:
Quanta S900-X31a server drawer:
Quanta has done a nice job with this shared infrastructure rack. Using this design, they can pack a booming 624 servers into a standard 42 RU rack.
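The 624 figure can be checked against the drawer density given earlier. The sketch below uses the 48 servers per 3 RU drawer and the 42 RU rack from the text. The observation that 624 servers leaves 3 RU unused is derived here, and what that space holds (top-of-rack networking, for example) is an assumption since the text does not say.

```python
SERVERS_PER_DRAWER = 48  # from the text
DRAWER_HEIGHT_RU = 3     # from the text
RACK_HEIGHT_RU = 42      # from the text
QUOTED_SERVERS = 624     # from the text

drawers = QUOTED_SERVERS // SERVERS_PER_DRAWER
used_ru = drawers * DRAWER_HEIGHT_RU
spare_ru = RACK_HEIGHT_RU - used_ru

print(drawers)   # 13
print(used_ru)   # 39
print(spare_ru)  # 3 (use of this space is not stated in the text)

# A rack filled entirely with drawers would hold one more drawer.
max_servers = (RACK_HEIGHT_RU // DRAWER_HEIGHT_RU) * SERVERS_PER_DRAWER
print(max_servers)  # 672
```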
I’m excited by the S1200 announcement because it’s both a good price/performer and power/performer and shows that Intel is serious about the microserver market. This new Atom gives customers access to microserver pricing without having to change instruction set architectures. The combination of low-cost, low-power, and the familiar Intel ISA with its rich tool chain and broad application availability is a compelling combination. It’s exciting to see the microserver market heating up and I like Intel’s roadmap looking forward.
Related Microserver focused postings:
· Cooperative Expendable Microslice Servers: Low-cost, Low-power Servers for Internet Scale Services
· The Case for Low-Cost, Low-Power Servers
· Low Power Amdahl Blades for Data Intensive Computing
· Microslice Servers
· ARM Cortex-A9 Design Announced
· 2010 the Year of the Microslice Server
· Very Low Power Server Progress
· Nvidia Project Denver: ARM Powered Servers
· ARM V8 Architecture
· AMD Announced Server-Targeted ARM Part
· Quanta S900-X31A
I’ve worked in or near the database engine world for more than 25 years. And, ironically, every company I’ve ever worked at has been working on a massive-scale, parallel, clustered RDBMS system. The earliest variant was IBM DB2 Parallel Edition released in the mid-90s. It’s now called the Database Partitioning Feature.
Massive, multi-node parallelism is the only way to scale a relational database system so these systems can be incredibly important. Very high-scale MapReduce systems are an excellent alternative for many workloads. But some customers and workloads want the flexibility and power of being able to run ad hoc SQL queries against petabyte sized databases. These are the workloads targeted by massive, multi-node relational database clusters and there are now many solutions out there with Oracle RAC being perhaps the most well-known but there are many others including Vertica, GreenPlum, Aster Data, ParAccel, Netezza, and Teradata.
What’s common across all these products is that big databases are very expensive. Today, that is changing with the release of Amazon Redshift. It’s a relational, column-oriented, compressed, shared nothing, fully managed, cloud hosted, data warehouse. Each node can store up to 16TB of compressed data and up to 100 nodes are supported in a single cluster.
Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed.
The core node on which the Redshift clusters are built includes 24 disk drives with an aggregate capacity of 16TB of local storage. Each node has 16 virtual cores and 120 Gig of memory and is connected via a high speed 10Gbps, non-blocking network. This is a meaty core node and Redshift supports up to 100 of these in a single cluster.
There are many pricing options available (see http://aws.amazon.com/redshift for more detail) but the most favorable comes in at only $999 per TB per year. I find it amazing to think of having the services of an enterprise scale data warehouse for under a thousand dollars per terabyte per year. And, this is a fully managed system so much of the administrative load is taken care of by Amazon Web Services.
Service highlights from: http://aws.amazon.com/redshift
Fast and Powerful – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. First, it uses columnar storage and data compression to reduce the amount of IO needed to perform queries. Second, it runs on hardware that is optimized for data warehousing, with local attached storage and 10GigE network connections between nodes. Finally, it has a massively parallel processing (MPP) architecture, which enables you to scale up or down, without downtime, as your performance and storage needs change.
You have a choice of two node types when provisioning your own cluster, an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. You can start with a single XL node and scale up to a 100 node eight extra large cluster. XL clusters can contain 1 to 32 nodes while 8XL clusters can contain 2 to 100 nodes.
Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse to improve performance or increase capacity, without incurring downtime. Amazon Redshift enables you to start with a single 2TB XL node and scale up to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Resize functionality is not available during the limited preview but will be available when the service launches.
Inexpensive – You pay very low rates and only for the resources you actually provision. You benefit from the option of On-Demand pricing with no up-front or long-term commitments, or even lower rates via our reserved pricing option. On-demand pricing starts at just $0.85 per hour for a two terabyte data warehouse, scaling linearly up to a petabyte and more. Reserved Instance pricing lowers the effective price to $0.228 per hour, under $1,000 per terabyte per year.
Fully Managed – Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse, from provisioning capacity to monitoring and backing up the cluster, and to applying patches and upgrades. By handling all these time consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business insights.
Secure – Amazon Redshift provides a number of mechanisms to secure your data warehouse cluster. It currently supports SSL to encrypt data in transit, includes web service interfaces to configure firewall settings that control network access to your data warehouse, and enables you to create users within your data warehouse cluster. When the service launches, we plan to support encrypting data at rest and Amazon Virtual Private Cloud (Amazon VPC).
Reliable – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically replaces any component, as necessary.
Compatible – Amazon Redshift is certified by Jaspersoft and Microstrategy, with additional business intelligence tools coming soon. You can connect your SQL client or business intelligence tool to your Amazon Redshift data warehouse cluster using standard PostgreSQL JDBC or ODBC drivers.
Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service and Amazon Elastic MapReduce coming soon.
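The pricing and capacity figures in the highlights above can be cross-checked with a few lines of arithmetic. This sketch assumes the quoted hourly rates apply to a single 2TB XL node, which the text implies but does not state outright, and it uses 8,760 hours per year and decimal units (1PB = 1000TB). The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def usd_per_tb_year(hourly_usd: float, node_tb: float) -> float:
    """Annual cost per terabyte for a node billed at hourly_usd."""
    return round(hourly_usd * HOURS_PER_YEAR / node_tb, 2)

XL_NODE_TB = 2  # assumption: quoted hourly rates are for one 2TB XL node

print(usd_per_tb_year(0.228, XL_NODE_TB))  # 998.64 (reserved)
print(usd_per_tb_year(0.85, XL_NODE_TB))   # 3723.0 (on-demand)

# Largest cluster: 100 nodes of 16TB each, expressed in petabytes.
print(100 * 16 / 1000)  # 1.6
```

The reserved figure lands just under the $999 per TB per year quoted for the most favorable pricing option.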
Petabyte-scale data warehouses no longer need to command retail prices of upwards of $80,000 per core. You don’t have to negotiate an enterprise deal and work hard to get the 60 to 80% discount that always seems magically possible in the enterprise software world. You don’t even have to hire a team of administrators. Just load the data and get going. Nice to see.
I have been interested in, and writing about, microservers since 2007. Microservers can be built using any instruction set architecture but I’m particularly interested in ARM processors and their application to server-side workloads. Today Advanced Micro Devices announced they are going to build an ARM CPU targeting the server market. This will be a 4-core, 64 bit, more than 2GHz part that is expected to sample in 2013 and ship in volume in early 2014.
AMD is far from new to the microserver market. In fact, much of my past work on microservers has been AMD-powered. What’s different today is that AMD is applying their server processor skills while, at the same time, leveraging the massive ARM processor ecosystem. ARM processors power Apple iPhones, Samsung smartphones, tablets, disk drives, and applications you didn’t even know had computers in them.
The defining characteristic of server processor selection is to focus first and most on raw CPU performance and accept the high cost and high-power consumption that follows from that goal. The defining characteristic of Microservers is we leverage the high-volume client and connected device ecosystem and make a CPU selection on the basis of price/performance and power/performance with an emphasis on building balanced servers. The case for microservers is anchored upon these 4 observations:
· Volume economics: Rather than draw on the small-volume economics of the server market, with Microservers we leverage the massive volume economics of the smart device world driven by cell phones, tablets, and clients. To give some scale to this observation, IDC reports that there were 7.6M server units sold in 2010. ARM reports that there were 6.1B ARM processors shipped last year. The connected and embedded device market volumes are 1000x larger than that of the server market and the performance gap is shrinking rapidly. Semiconductor analyst Semicast estimates that by 2015 there will be 2 ARM processors for every person in the world. In 2010, ARM reported that, on average, there were 2.5 ARM-based processors in each Smartphone. The connected and embedded device market is 1000x that of the server world.
Having watched and participated in our industry for nearly 3 decades, one reality seems to dominate all others: high-volume economics drives innovation and just about always wins. As an example, IBM mainframes ran just about every important server-side workload in the mid-80s. But, they were largely swept aside by higher-volume RISC servers running UNIX. At the time I loved RISC systems – database systems would just scream on them and they offered customers excellent price/performance. But, the same trend played out again. The higher-volume X86 processors from the client world swept the superior raw performing RISC systems aside.
Invariably what we see happening about once a decade is a high-volume, lower-priced technology takes over the low end of the market. When this happens many engineers correctly point out that these systems can’t hold a candle to the previous generation server technology and then incorrectly believe they won’t get replaced. The new generation is almost never better in absolute terms but they are better price/performers so they first are adopted for the less performance critical applications. Once this happens, the die is cast and the outcome is just about assured. The high-volume parts move up market and eventually take over even the most performance critical workloads of the previous generation. We see this same scenario play out roughly once a decade.
· Not CPU bound: Most discussion in our industry centers on the more demanding server workloads like databases but, in reality, many workloads are not pushing CPU limits and are instead storage, networking, or memory bound. There are two major classes of workloads that don’t need or can’t fully utilize more CPU:
1. Some workloads simply do not require the highest performing CPUs to achieve their SLAs. You can pay more and buy a higher performing processor but it will achieve little for these applications.
2. This second class of workloads is characterized by being blocked on networking, storage, or memory. And by memory bound I don’t mean the memory is too small. In this case it isn’t the size of the memory that is the problem, but the bandwidth. The processor looks to be fully utilized from an operating system perspective but the bulk of its cycles are spent waiting for memory. Disk and network bound systems are easy to detect: look for the resource that is running close to 100% utilization while the CPU load is way lower. Memory bound is more challenging to detect but it’s super common so it’s worth talking about. Most server processors are super-scalar, which is to say they can retire multiple instructions each cycle. On many workloads, less than 1 instruction is retired each cycle (you can see this by monitoring instructions per cycle) because the processor is waiting for memory transfers.
If a workload is bound on network, storage, or memory, spending more on a faster CPU will not deliver results. The same is true for non-demanding workloads. They too are not bound on CPU so a faster part won’t help in this case either.
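The instructions-per-cycle check described above can be sketched in a few lines. This is only a rough heuristic under stated assumptions: the counter values are hypothetical (on Linux they could come from a tool such as perf stat), the names `CounterSample` and `looks_memory_bound` are made up for illustration, and the threshold of 1.0 is taken from the "less than 1 instruction retired each cycle" observation. A low value can have other causes, so treat it as a hint rather than a diagnosis.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hardware counter totals for one sampling window (hypothetical values)."""
    instructions_retired: int
    cpu_cycles: int


def instructions_per_cycle(sample: CounterSample) -> float:
    """Return retired instructions divided by elapsed CPU cycles."""
    if sample.cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return sample.instructions_retired / sample.cpu_cycles


def looks_memory_bound(sample: CounterSample, ipc_threshold: float = 1.0) -> bool:
    """Flag a window whose IPC is below the threshold (assumed heuristic)."""
    return instructions_per_cycle(sample) < ipc_threshold


if __name__ == "__main__":
    # Illustrative numbers only: a busy-looking CPU retiring 0.4 instructions/cycle.
    stalled = CounterSample(instructions_retired=4_000_000_000,
                            cpu_cycles=10_000_000_000)
    print(f"IPC = {instructions_per_cycle(stalled):.2f}, "
          f"possibly memory bound: {looks_memory_bound(stalled)}")
```

A workload flagged this way is a candidate for a cheaper, slower processor, since most of the cycles being paid for are spent waiting.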
· Price/performance: Device price/performance is far better than current generation server CPUs. Because there is less competition in server processors, prices are far higher and price/performance is relatively low compared to the device world. Using server parts, performance is excellent but price is not.
Let’s use an example again: A server CPU is hundreds of dollars, sometimes approaching $1,000, whereas the ARM processor in an iPhone comes in at just under $15. My general rule of thumb in comparing ARM processors with server CPUs is they are capable of ¼ the processing rate at roughly 1/10th the cost. And, super important, the massive shipping volume of the ARM ecosystem feeds innovation and competition, and the performance gap shrinks with each processor generation. Each generational improvement captures more possible server workloads while further improving price/performance.
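The rule of thumb above reduces to simple arithmetic. The sketch below only restates the ¼ performance and 1/10th cost figures from the text; the function name is made up for illustration.

```python
def relative_price_performance(perf_ratio: float, cost_ratio: float) -> float:
    """Performance per dollar of part A relative to part B.

    perf_ratio: A's processing rate as a fraction of B's.
    cost_ratio: A's price as a fraction of B's.
    """
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return perf_ratio / cost_ratio


if __name__ == "__main__":
    # Rule of thumb from the text: 1/4 the processing rate at 1/10th the cost.
    advantage = relative_price_performance(perf_ratio=0.25, cost_ratio=0.10)
    print(f"ARM part delivers about {advantage:.1f}x the performance per dollar")
```

In other words, for a workload that can be spread across more, slower processors, the rule of thumb implies roughly 2.5 times as much processing per dollar of CPU spend.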
· Power/performance: Most modern servers run over 200W, and many are well over 500W, while microservers can weigh in at 10 to 20W. Nowhere is power/performance more important than in portable devices, so the pace of power/performance innovation in the ARM world is incredibly strong. In fact, I’ve long used mobile devices as a window into future innovations coming to the server market. The technologies you see in the current generation of cell phones have a very high probability of being used in a future server CPU generation.
This is not the first ARM based server processor that has been announced. And, even more announcements are coming over the next year. In fact, that is one of the strengths of the ARM ecosystem. The R&D investments can be leveraged over huge shipping volume from many producers to bring more competition, lower costs, more choice, and a faster pace of innovation.
This is a good day for customers, a good day for the server ecosystem, and I’m excited to see AMD help drive the next phase in the evolution of the ARM Server market. The pace of innovation continues to accelerate industry-wide and it’s going to be an exciting rest of the decade.
Past notes on Microservers:
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
When I come across interesting innovations or designs notably different from the norm, I love to dig in and learn the details. More often than not I post them here. Earlier this week, Google posted a number of pictures taken from their datacenters (Google Data Center Tech). The pictures are beautiful and of interest to just about anyone, somewhat more interesting to those working in technology, and worthy of detailed study for those working in datacenter design. My general rule with Google has always been that anything they show publicly is always at least one generation old and typically more. Nonetheless, the Google team does good work and the older designs are still worth understanding, so I always have a look.
Some examples of older but interesting Google data center technology:
· Efficient Data Center Summit
· Rough Notes: Data Center Efficiency Summit
· Rough notes: Data Center Efficiency Summit (posting #3)
· 2011 European Data Center Summit
The set of pictures posted last week (Google Data Center Tech) is a bit unusual in that they are showing current pictures of current facilities running their latest work. What was published was only pictures without explanatory detail but, as the old cliché says, a picture is worth a thousand words. I found the mechanical design to be most notable so I’ll dig into that area a bit but let’s start with showing a conventional datacenter mechanical design as a foil against which to compare the Google approach.
The conventional design has numerous issues, the most obvious being that any design that is 40 years old could probably use some innovation. Notable problems with the conventional design: 1) no hot aisle/cold aisle containment so there is air leakage and mixing of hot and cold air, 2) air is moved long distances between the Computer Room Air Handlers (CRAHs) and the servers and air is an expensive fluid to move, and 3) it’s a closed system and hot air is recirculated after cooling rather than released outside with fresh air brought in and cooled if needed.
An example of an excellent design that does a modern job of addressing most of these failings is the Facebook Prineville Oregon facility:
I’m a big fan of the Facebook facility. In this design they eliminate the chilled water system entirely, have no chillers (expensive to buy and power), have full hot aisle isolation, use outside air with evaporative cooling, and treat the entire building as a giant, high-efficiency air duct. More detail on the Facebook design at: Open Compute Mechanical System Design.
Let’s have a look at the Google Council Bluffs Iowa facility:
You can see that they have chosen a very large, single room approach rather than sub-dividing into pods. As with any good, modern facility they have hot aisle containment which just about completely eliminates leakage of air around the servers or over the racks. All chilled air passes through the servers and none of the hot air leaks back prior to passing through the heat exchanger. Air containment is a very important efficiency gain and the single largest gain after air-side economization. Air-side economization is the use of outside air rather than taking hot server exhaust and cooling it to the desired inlet temperature (see the diagram above showing the Facebook use of full building ducting with air-side economization).
From the Council Bluffs picture, we see Google has taken a completely different approach. Rather than completely eliminate the chilled water system and use the entire building as an air duct, they have kept the piped water cooling system and focused on making it as efficient as possible and exploiting some of the advantages of water based systems. This shot from the Google Hamina Finland facility shows the multi-coil heat exchanger at the top of the hot aisle containment system.
From inside the hot aisle, in this picture from the Mayes County data center, we can see the water is brought up from below the floor in the hot aisle using steel braided flexible chilled water hoses. These hoses bring cool water up to the top-of-hot-aisle heat exchangers that cool the server exhaust air before it is released above the racks of servers.
One of the key advantages of water cooling is that water is a cheaper fluid to move than air for a given thermal capacity. In the Google design, they exploit this fact by bringing water all the way to the rack. This isn’t an industry first but it is nicely executed in the Google design. IBM iDataPlex brought water directly to the back of the rack and many high power density HPC systems have done this as well.
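To give a feel for why water is the cheaper fluid to move, the sketch below compares the volume of water and of air needed to carry away the same heat load. The fluid properties are approximate textbook values near room temperature; the 10 kW rack load and the 10 K temperature rise are illustrative assumptions, not figures from Google. It compares volume flow only and says nothing directly about pump or fan energy.

```python
# Approximate fluid properties near room temperature (textbook values).
WATER_DENSITY = 998.0   # kg/m^3
WATER_CP = 4186.0       # J/(kg*K)
AIR_DENSITY = 1.2       # kg/m^3
AIR_CP = 1005.0         # J/(kg*K)


def volume_flow_m3_per_s(heat_watts: float, density: float,
                         specific_heat: float, delta_t_kelvin: float) -> float:
    """Volume flow needed to absorb heat_watts with a delta_t_kelvin rise.

    Uses Q = rho * V * cp * dT, solved for the volume flow V.
    """
    if delta_t_kelvin <= 0:
        raise ValueError("delta_t_kelvin must be positive")
    return heat_watts / (density * specific_heat * delta_t_kelvin)


if __name__ == "__main__":
    rack_heat_watts = 10_000.0  # hypothetical 10 kW rack
    delta_t = 10.0              # hypothetical 10 K coolant temperature rise

    water = volume_flow_m3_per_s(rack_heat_watts, WATER_DENSITY, WATER_CP, delta_t)
    air = volume_flow_m3_per_s(rack_heat_watts, AIR_DENSITY, AIR_CP, delta_t)

    print(f"water: {water * 1000:.2f} litres/s")
    print(f"air:   {air:.2f} m^3/s")
    print(f"air needs about {air / water:,.0f}x the volume of water")
```

Under these assumptions the same heat is carried by a fraction of a litre of water per second or by most of a cubic metre of air per second, which is the gap that bringing water to the rack exploits.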
I don’t see the value of the short stacks above the heat exchangers. I would think that any gain in air acceleration through the smoke stack effect would be dwarfed by the losses of having the passive air stacks as restrictions over the heat exchangers.
Bringing water directly to the rack is efficient but I still somewhat prefer air-side economization systems. Any system that can reject hot air outside and bring in outside air for cooling (if needed) for delivery to the servers is tough to beat (see the diagram at the top for an example approach). However, as server density climbs, we will eventually get to power densities sufficiently high that water is needed, either very near the server as Google has done or as direct water cooling as used by IBM mainframes in the 80s (thermal conduction module). One very nice contemporary direct liquid cooling system is the work by Green Revolution Cooling where they completely immerse otherwise unmodified servers in a bath of chilled oil.
Hats off to Google for publishing a very informative set of data center pictures. The pictures are well done and the engineering is very nice. Good work!
· Here’s a very cool Google Street view based tour of the Google Lenoir NC Datacenter.
· The detailed pictures released last week: Google Data Center Photo Album
The last few weeks have been busy and it has been way too long since I have blogged. I’m currently thinking through the server tax and what’s wrong with the current server hardware ecosystem but don’t have anything ready to go on that just yet. But, there are a few other things on the go. I did a talk at Intel a couple of weeks back and last week at the First Round Capital CTO summit. I’ve summarized what I covered below with pointers to slides.
In addition, I’ll be at the Amazon in Palo Alto event this evening and will do a talk there as well. If you are interested in Amazon in general or in AWS specifically, we have a new office open in Palo Alto and you are welcome to come down this evening to learn more about AWS, have a beer or the refreshment of your choice, and talk about scalable systems. Feel free to attend if you are interested:
Amazon in Palo Alto
October 11, 2012 at 5:00 PM - 9:00 PM
Pampas 529 Alma Street Palo Alto, CA 94301
First Round Capital CTO Summit:
I started this session by arguing that cost or value models are the right way to ensure you are working on the right problem. I come across far too many engineers and even companies that are working on interesting problems but they fail at the “top 10 problem” test. You never want to first have to explain the problem to a prospective customer before you get a chance to explain your solution. It is way more rewarding to be working on top 10 problems where the value of what you are doing is obvious and you only need to convince someone that your solution actually works.
Cost models are a good way to force yourself to really understand all aspects of what the customer is doing and know precisely what savings or advantage you bring. A 25% improvement on an 80% problem is way better than a 50% solution to a 5% problem. Cost or value models are a great way of keeping yourself honest on what the real savings or improvement of your approach actually are. And it’s quantifiable data that you can verify in early tests and prove in alpha or beta deployments.
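The comparison above can be made concrete with a very small cost model. This sketch assumes the simplest possible reading, that "an 80% problem" means a component that accounts for 80% of total cost, so the overall saving is the component’s share multiplied by the improvement on it. The function name is made up for illustration.

```python
def overall_saving(problem_share: float, improvement: float) -> float:
    """Fraction of total cost saved by improving one component.

    problem_share: fraction of total cost the component accounts for (0 to 1).
    improvement:   fractional reduction achieved on that component (0 to 1).
    """
    for name, value in (("problem_share", problem_share), ("improvement", improvement)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return problem_share * improvement


if __name__ == "__main__":
    big_problem = overall_saving(problem_share=0.80, improvement=0.25)
    small_problem = overall_saving(problem_share=0.05, improvement=0.50)
    print(f"25% improvement on an 80% problem: {big_problem:.1%} of total cost")
    print(f"50% improvement on a 5% problem:   {small_problem:.1%} of total cost")
```

Under that assumption the first option saves 20% of total cost against 2.5% for the second, an eightfold difference despite the smaller headline improvement.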
I then covered three areas of infrastructure where I see considerable innovation and showed how the cost model helped drive me there:
· Networking: The networking eco-system is still operating on the closed, vertically integrated, mainframe model but the ingredients are now in place to change this. See Networking, the Last Bastion of Mainframe Computing for more detail. The industry is currently going through great change. Big change is a hard transition for the established high-margin industry players but it’s a huge opportunity for startups.
· Storage: The storage (and database) worlds are going through an unprecedented change where all high-performance random access storage is migrating from hard disk drives to flash storage. The early flash storage players have focused on performance over price so there is still considerable room for innovation. Another change happening in the industry is the explosion of cold storage (low I/O density storage that I jokingly refer to as write-only) due to falling prices, increasing compliance requirements, and an industry realization that data has great value. This explosion in cold storage is opening much innovation and many startup opportunities. The AWS entrant in this market is Glacier where you can store seldom accessed data at one penny per GB per month (for more on Glacier: Glacier: Engineering for Cold Storage in the Cloud).
· Cloud Computing: I used to argue that targeting cloud computing was a terrible idea for startups since the biggest cloud operators like Google and Amazon tend to do all custom hardware and software and purchase very little commercially. I may have been correct initially but, with the cloud market growing so incredibly fast, every telco is entering the market, each colo provider is entering, most hardware providers are entering, … the number of players is going from 10s to 1000s. And, at 1,000s, it’s a great market for a startup to target. Most of these companies are not going to build custom networking, server, and storage hardware but they do have the need to innovate with the rest of the industry.
Slides: First Round Capital CTO Summit
Intel Distinguished Speaker Series:
In this talk I started with how fast the cloud computing market segment is growing using examples from AWS. I then talked about why cloud computing is such an incredible customer value proposition. This isn’t just a short term fad that will pass over time. I mostly focused on how that statement I occasionally hear just can’t possibly be correct: “I can run my on-premise computing infrastructure less expensively than hosting it in the cloud”. I walk through some of the reasons why this statement can only be made with partial knowledge. There are reasons why some computing will be in the cloud and some will be hosted locally and industry transitions absolutely do take time but cost isn’t one of the reasons that some workloads aren’t in the cloud.
I then walked through 5 areas of infrastructure innovation and some of what is happening in each area:
· Power Distribution
· Mechanical Systems
· Data Center Building Design
Slides: Intel Distinguished Speaker Series
I hope to see you tonight at the Amazon Palo Alto event at Pampas (http://goo.gl/maps/dBZxb). The event starts at 5pm and I’ll do a short talk at 6:35.
Earlier today Amazon Web Services announced Glacier, a low-cost, cloud-hosted, cold storage solution. Cold storage is a class of storage that is discussed infrequently and yet it is by far the largest storage class of them all. Ironically, the storage we usually talk about and the storage I’ve worked on for most of my life is the high-IOPS rate storage supporting mission critical databases. These systems today are best hosted on NAND flash and I’ve been talking recently about two AWS solutions to address this storage class:
Cold storage is different. It’s the only product I’ve ever worked upon where the customer requirements are single dimensional. With most products, the solution space is complex and, even when some customers may like a competitive product better for some applications, your product still may win in another. Cold storage is pure and unidimensional. There is only really one metric of interest: cost per capacity. It’s an undifferentiated requirement that the data be secure and very highly durable. These are essentially table stakes in that no solution is worth considering if it’s not rock solid on durability and security. But, the only dimension of differentiation is price/GB.
Cold storage is unusual because the focus needs to be singular. How can we deliver the best price per capacity now and continue to reduce it over time? The focus on price over performance, price over latency, price over bandwidth actually made the problem more interesting. With most products and services, it’s usually possible to be the best on at least some dimensions even if not on all. On cold storage, to be successful, the price per capacity target needs to be hit. On Glacier, the entire project was focused on delivering $0.01/GB/Month with high redundancy and security and to be on a technology base where the price can keep coming down over time. Cold storage is elegant in its simplicity and, although the margins will be slim, the volume of cold storage data in the world is stupendous. It’s a very large market segment. All storage in all tiers backs up to the cold storage tier so it’s provably bigger than all the rest. Audit logs end up in cold storage as do web logs, security logs, seldom accessed compliance data, and all other data I refer jokingly to as Write Only Storage. It turns out that most files in active storage tiers are actually never accessed (Measurement and Analysis of Large Scale Network File System Workloads). In cold storage, this trend is even more extreme where reading a storage object is the exception. But, the objects absolutely have to be there when needed. Backups aren’t needed often and compliance logs are infrequently accessed but, when they are needed, they need to be there, they absolutely have to be readable, and they must have been stored securely.
But when cold objects are called for, they don’t need to be there instantly. The cold storage tier customer requirement for latency ranges from minutes, to hours, and in some cases even days. Customers are willing to give up access speed to get very low cost. Potentially rapidly required database backups don’t get pushed down to cold storage until they are unlikely to get accessed. But, once pushed, it’s very inexpensive to store them indefinitely. Tape has long been the media of choice for very cold workloads and tape remains an excellent choice at scale. What’s unfortunate is that the scale point where tape starts to win has been going up over the years. High-scale tape robots are incredibly large and expensive. The good news is that very high-scale storage customers like the Large Hadron Collider (LHC) are very well served by tape. But, over the years, the volume economics of tape have been moving up scale and fewer and fewer customers are cost effectively served by tape.
In the 80s, I had a tape storage backup system for my Usenet server and other home computers. At the time, I used tape personally and any small company could afford tape. But this scale point where tape makes economic sense has been moving up. Small companies are really better off using disk since they don’t have the scale to hit the volume economics of tape. The same has happened at mid-sized companies. Tape usage continues to grow but more and more of the market ends up on disk.
What’s wrong with the bulk of the market using disk for cold storage? The problem with disk storage systems is they are optimized for performance and they are expensive to purchase, to administer, and even to power. Disk storage systems don’t currently target cold storage workloads with the necessary fanatical focus on cost per capacity. What’s broken is that customers end up not keeping data they need to keep or paying too much to keep it because the conventional solution to cold storage isn’t available at small and even medium scales.
Cold storage is a natural cloud solution in that the cloud can provide the volume economics and allow even small-scale users to have access to low-cost, off-site, multi-datacenter, cold storage at a cost previously only possible at very high scale. Implementing cold storage centrally in the cloud makes excellent economic sense in that all customers can gain from the volume economics of the aggregate usage. Amazon Glacier now offers Cloud storage where each object is stored redundantly in multiple, independent data centers at $0.01/GB/Month. I love the direction and velocity that our industry continues to move.
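Because the only dimension of differentiation is price per capacity, the cost model for a cold archive is a one-line multiplication. The sketch below uses the $0.01/GB/Month storage price quoted above; the 100,000 GB archive size is a made-up example, and any charges other than storage are deliberately left out.

```python
from decimal import Decimal

# Storage price quoted in the text, in US dollars per GB per month.
GLACIER_PRICE_PER_GB_MONTH = Decimal("0.01")


def monthly_storage_cost(stored_gb: int,
                         price_per_gb_month: Decimal = GLACIER_PRICE_PER_GB_MONTH) -> Decimal:
    """Storage-only monthly cost in dollars for stored_gb gigabytes."""
    if stored_gb < 0:
        raise ValueError("stored_gb must not be negative")
    return Decimal(stored_gb) * price_per_gb_month


if __name__ == "__main__":
    archive_gb = 100_000  # hypothetical archive of 100,000 GB
    per_month = monthly_storage_cost(archive_gb)
    print(f"{archive_gb:,} GB: ${per_month:,.2f} per month, "
          f"${per_month * 12:,.2f} per year")
```

Decimal is used rather than float so that currency amounts stay exact at any archive size.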
More on Glacier:
· Detail Page: http://aws.amazon.com/glacier
· Frequently Asked Questions: http://aws.amazon.com/glacier/faqs
· Console access: https://console.aws.amazon.com/glacier
· Developers Guide: http://docs.amazonwebservices.com/amazonglacier/latest/dev/introduction.html
· Getting Started Video: http://www.youtube.com/watch?v=TKz3-PoSL2U&feature=youtu.be
By the way, if Glacier has caught your interest and you are an engineer or engineering leader with an interest in massive scale distributed storage systems, we have big plans for Glacier and are hiring. Send your resume to email@example.com.
Facebook recently released a detailed report on their energy consumption and carbon footprint: Facebook’s Carbon and Energy Impact. Facebook has always been super open with the details behind their infrastructure. For example, they invited me to tour the Prineville datacenter just prior to its opening:
· Open Compute Project
· Open Compute Mechanical System Design
· Open Compute Server Design
· Open Compute UPS & Power Supply
Reading through the Facebook Carbon and Energy Impact page, we see they consumed 532 million kWh of energy in 2011 of which 509m kWh went to their datacenters. High scale data centers have fairly small daily variation in power consumption as server load goes up and down and there are some variations in power consumption due to external temperature conditions since hot days require more cooling than chilly days. But, highly efficient datacenters tend to be affected less by weather, spending only a tiny fraction of their total power on cooling. Assuming a flat consumption model, Facebook is averaging, over the course of the year, 58.07MW of total power delivered to its data centers.
Facebook reports an unbelievably good 1.07 Power Usage Effectiveness (PUE) which means that for every 1 Watt delivered to their servers they lose only 0.07W in power distribution and mechanical systems. I always take publicly released PUE numbers with a grain of salt in that there has been a bit of a PUE race going on between some of the large operators. It’s just about assured that there are different interpretations and different measurement techniques being employed in computing these numbers so comparing them probably doesn’t tell us much. See PUE is Still Broken but I Still use it and PUE and Total Power Usage Efficiency for more on PUE and some of the issues in using it comparatively.
Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook but they have excellent server designs (see Open Compute Server Design) so they likely average at or below 300W per server. Since 300W is an estimate, let’s also look at 250W and 350W per server:
· 250W/server: 217,080 servers
· 300W/server: 180,900 servers
· 350W/server: 155,057 servers
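The estimate above can be reproduced step by step. This is a sketch of the arithmetic only, with two labeled assumptions: a year is taken as 365.25 days, which is what reproduces the 58.07MW average, and intermediate values are rounded to two decimals to match the figures quoted in the text. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, 8,766 hours


def average_power_mw(annual_kwh: float) -> float:
    """Average power in MW for a flat load consuming annual_kwh per year."""
    return round(annual_kwh / HOURS_PER_YEAR / 1000, 2)


def it_load_mw(total_mw: float, pue: float) -> float:
    """Power reaching the IT equipment: total facility power divided by PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return round(total_mw / pue, 2)


def estimated_servers(it_mw: float, watts_per_server: float) -> int:
    """Estimated server count for an assumed average draw per server."""
    return round(it_mw * 1_000_000 / watts_per_server)


if __name__ == "__main__":
    total = average_power_mw(509_000_000)  # 509m kWh to data centers in 2011
    it_load = it_load_mw(total, pue=1.07)
    print(f"total: {total}MW, IT load: {it_load}MW")
    for watts in (250, 300, 350):
        print(f"{watts}W/server: {estimated_servers(it_load, watts):,} servers")
```

The per-server wattage dominates the answer, so the result is best read as a range rather than a count.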
As a comparative data point, Google’s data centers consume 260MW in aggregate (Google Details, and Defends, Its Use of Electricity). Google reports their PUE is 1.14 so we know they are delivering 228MW to their IT infrastructure (servers and storage). Google is perhaps the most focused in the industry on low power consuming servers. They invest deeply in custom designs and are willing to spend considerably more to reduce energy consumption. Estimating their average server power draw at 250W and looking at +/-25W around that average consumption rate:
· 225W/server: 1,013,333 servers
· 250W/server: 912,000 servers
· 275W/server: 829,090 servers
I find the Google and Facebook server counts interesting for two reasons. First, Google was estimated to have 1 million servers more than 5 years ago. The number may have been high at the time but it’s very clear that they have been super focused on workload efficiency and infrastructure utilization. To grow the search and advertising as much as they have without growing the server count at anywhere close to the same rate (if at all) is impressive. Continuing to add computationally expensive search features and new products and yet still being able to hold the server count near flat is even more impressive.
The second notable observation from this data is that the Facebook server count is growing fast. Back in October of 2009, they had 30,000 servers. In June of 2010 the count climbed to 60,000 servers. Today they are over 150k.
In I/O Performance (no longer) Sucks in the Cloud, I said
Many workloads have high I/O rate data stores at the core. The success of the entire application is dependent upon a few servers running MySQL, Oracle, SQL Server, MongoDB, Cassandra, or some other central database.
Last week a new Amazon Elastic Compute Cloud (EC2) instance type based upon SSDs was announced that delivers 120k reads per second and 10k to 85k writes per second. This instance type with direct attached SSDs is an incredible I/O machine ideal for database workloads, but most database workloads run on virtual storage today. The administrative and operational advantages of virtual storage are many. You can allocate more storage with a call of an API. Blocks are redundantly stored on multiple servers. It’s easy to checkpoint to S3. Server failures don’t impact storage availability.
The AWS virtual block storage solution is the Elastic Block Store (EBS). Earlier today two key features were released to support high performance databases and other random I/O intensive workloads on EBS. The key observation is that these random I/O-intensive workloads need to have IOPS available whenever they are needed. When a database runs slowly, the entire application runs poorly. Best effort is not enough and competing for resources with other workloads doesn’t work. When high I/O rates are needed, they are needed immediately and must be there reliably.
Perhaps the best way to understand the two new features is to look at how demanding database workloads are often hosted on-premise. Typically large servers are used so the memory and CPU resources are available when needed. Because a high performance storage system is needed and because it is important to be able to scale the storage capacity and I/O rates during the life of the application, direct attached disk isn’t the common choice. Most enterprise customers put these workloads on Storage Area Network devices which are typically connected to the server by a Fiber Channel network (a private communication channel used only for storage).
The aim of the announcement today is to apply some of what has been learned from 30+ years of on-premise storage evolution. Customers want virtualized storage but, at the same time, they need the ability to reserve resources for demanding workloads. In this announcement, we take some of the best aspects of what has emerged in on-premise storage solutions and give EC2 customers the ability to scale high-performance storage as needed, reserve and scale the available I/Os per Second (IOPS) as needed, and reserve dedicated network bandwidth to the storage device. The latter is perhaps the most important and the combination allows workloads to reserve both the IOPS rates at the storage as well as the network channel to get to the storage and be assured it will be there when they need it.
The storage, IOPS, and network capacity is there even if you haven’t used it recently. It’s there even if your neighbors are also busy using their respective reservations. It’s even there if you are running full networking traffic load to the EC2 instance. Just as when an on-premise customer allocates a SAN volume with a Fiber Channel attach that doesn’t compete with other network traffic, allocated resources stay reserved and they stay available. Let’s look at the two features that deliver a low-jitter, virtual SAN solution in AWS.
Provisioned IOPS is a feature of Amazon Elastic Block Store. EBS has always allowed customers to allocate storage volumes of the size they need and to attach these virtual volumes to their EC2 instances. Provisioned IOPS allows customers to declare the I/O rate they need the volumes to be able to deliver, up to 1,000 I/Os per second (IOPS) per volume. Volumes can be striped together to achieve reliable, low-latency virtual volumes of 20,000 IOPS or more. The ability to reliably configure and reserve over 10,000 IOPS means the vast majority of database workloads can be supported. And, in the near future, this limit will be raised allowing increasingly demanding workloads to be hosted on EC2 using EBS.
EBS-Optimized EC2 instances are a feature of EC2 that is the virtual equivalent of installing a dedicated network channel to storage. Depending upon the instance type, 500 Mbps up to a full 1Gbps are allocated and dedicated for storage use only. This storage communications channel is in addition to the network connection to the instance. Storage and network traffic no longer compete and, on large instance types, you can drive full 1Gbps line rate network traffic while, at the same time, also consuming 1Gbps to storage. Essentially EBS-Optimized instances have a dedicated storage channel that doesn’t compete with instance network traffic.
From the EBS detail page:
EBS standard volumes offer cost effective storage for applications with light or bursty I/O requirements. Standard volumes deliver approximately 100 IOPS on average with a best effort ability to burst to hundreds of IOPS. Standard volumes are also well suited for use as boot volumes, where the burst capability provides fast instance start-up times.
Provisioned IOPS volumes are designed to deliver predictable, high performance for I/O intensive workloads such as databases. With Provisioned IOPS, you specify an IOPS rate when creating a volume, and then Amazon EBS provisions that rate for the lifetime of the volume. Amazon EBS currently supports up to 1,000 IOPS per Provisioned IOPS volume, with higher limits coming soon. You can stripe multiple volumes together to deliver thousands of IOPS per Amazon EC2 instance to your application.
To enable your Amazon EC2 instances to fully utilize the IOPS provisioned on an EBS volume, you can launch selected Amazon EC2 instance types as “EBS-Optimized” instances. EBS-optimized instances deliver dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used. When attached to EBS-Optimized instances, Provisioned IOPS volumes are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time. See Amazon EC2 Instance Types to find out more about instance types that can be launched as EBS-Optimized instances.
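A quick sizing sketch ties the two features together: how many Provisioned IOPS volumes to stripe for a target rate, and whether the dedicated storage channel can carry that rate. The 1,000 IOPS per volume limit and the 500 Mbps and 1000 Mbps channel sizes come from the text above. The I/O sizes are assumptions chosen for illustration, protocol overhead is ignored, and the function names are hypothetical rather than part of any AWS API.

```python
import math

MAX_IOPS_PER_VOLUME = 1_000  # per-volume limit quoted in the text


def volumes_needed(target_iops: int, per_volume_iops: int = MAX_IOPS_PER_VOLUME) -> int:
    """Number of volumes to stripe together to reach target_iops."""
    if target_iops <= 0 or per_volume_iops <= 0:
        raise ValueError("IOPS values must be positive")
    return math.ceil(target_iops / per_volume_iops)


def storage_bandwidth_mbps(iops: int, io_size_kib: int) -> float:
    """Megabits per second needed to move iops operations of io_size_kib each."""
    return iops * io_size_kib * 1024 * 8 / 1_000_000


def fits_channel(iops: int, io_size_kib: int, channel_mbps: float) -> bool:
    """True if the payload alone fits in the dedicated storage channel."""
    return storage_bandwidth_mbps(iops, io_size_kib) <= channel_mbps


if __name__ == "__main__":
    target = 10_000
    print(f"{target:,} IOPS needs {volumes_needed(target)} striped volumes")
    for io_size_kib in (4, 16):  # assumed I/O sizes, not from the text
        need = storage_bandwidth_mbps(target, io_size_kib)
        print(f"{io_size_kib} KiB I/Os: {need:,.2f} Mbps, "
              f"fits 500 Mbps: {fits_channel(target, io_size_kib, 500)}, "
              f"fits 1000 Mbps: {fits_channel(target, io_size_kib, 1000)}")
```

The point of the second check is that reserved IOPS are only useful if the path to storage is reserved as well, since larger I/Os can saturate the channel well before the volumes run out of IOPS.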
Providing scalable block storage at-scale, in 8 regions around the world is one of the most interesting combinations of distributed systems and storage problems we face. The problem has been well solved in high-cost on-premise solutions. We now get to apply what has been learned over the last 30+ years to solve the problem at cloud-scale with low-cost and 100s of thousands of concurrent customers. An incredible number of EC2 customers depend upon EBS for their virtual storage needs, the number is growing daily, and we are really only just getting started. If you want to be part of the engineering effort to make Elastic Block Store the virtual storage solution for the cloud, send us a note at firstname.lastname@example.org.
With the announcement today, EC2 customers now have access to two very high performance storage solutions. The first solution is the EC2 High I/O Instance type announced last week which delivers a direct attached, SSD-powered 100k IOPS for $3.10/hour. In today’s announcement this direct attached storage solution is joined by a high-performance virtual storage solution. This new type of EBS storage allows the creation of striped storage volumes that can reliably deliver 10,000 to 20,000 IOPS across a dedicated virtual storage network.
Amazon EC2 customers now have both high-performance, direct attached storage and high-performance virtual storage with a dedicated virtual storage connection.
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Many workloads have high I/O rate data stores at the core. The success of the entire application is dependent upon a few servers running MySQL, Oracle, SQL Server, MongoDB, Cassandra, or some other central database.
The best design pattern for any highly reliable and scalable application, whether on-premise or cloud hosted, is to shard the database. You can’t be dependent upon a single server being able to scale sufficiently to hold the entire workload. Theoretically, that’s the solution and all workloads should run well on a sufficiently large fleet even if that fleet has a low individual server I/O performance. Unfortunately, few workloads scale as badly as database workloads. Even scalable systems such as MongoDB or Cassandra need to have a per-server I/O rate that meets some minimum bar to host the workload cost effectively with stable I/O performance.
The easy solution is to depend upon a hosted service like DynamoDB that can transparently scale to order 10^6 transactions per second and deliver low jitter performance. For many workloads, that is the final answer. Take the complexity of configuring and administering a scalable database and give it to a team that focuses on nothing else 24x7 and does it well.
Unfortunately, in the database world, One Size Does Not Fit All. DynamoDB is a great solution for some workloads but many workloads are written to different stores or depend upon features not offered in DynamoDB. What if you have an application written to run on sharded Oracle (or MySQL) servers and each database requires 10s of thousands of I/Os per second? For years, this has been the prototypical “difficult to host in the cloud” workload. All servers in the application are perfect for the cloud but the overall application won’t run unless the central database server can support the workload.
Consequently, these workloads have been difficult to host on the major cloud services. They are difficult to scale out to avoid needing very high single node I/O performance and they won’t yield a good customer experience unless the database has the aggregate IOPS needed.
Yesterday an ideal EC2 instance type was announced. It’s the screamer needed by these workloads. The new EC2 High I/O Instance type is a born database machine. Whether you are running Relational or NoSQL, if the workload is I/O intense and difficult to cost effectively scale-out without bound, this instance type is the solution. It will deliver a booming 120,000 4k reads per second and between 10,000 and 85,000 4k random writes per second. The new instance type:
· 60.5 GB of memory
· 35 EC2 Compute Units (8 virtual cores with 4.4 EC2 Compute Units each)
· 2 SSD-based volumes each with 1024 GB of instance storage
· 64-bit platform
· I/O Performance: 10 Gigabit Ethernet
· API name: hi1.4xlarge
If you have a difficult to host I/O intensive workload, EC2 has the answer for you. 120,000 read IOPS and 10,000 to 85,000 write IOPS for $3.10/hour Linux on demand or $3.58/hour Windows on demand. Because these I/O workloads are seldom scaled up and down in real time, the Heavy Utilization Reserved instance is a good choice where the server capacity can be reserved for $10,960 for a three year term and usage is $0.482/hour.
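To see why the reserved option is attractive for a server that runs around the clock, here is the arithmetic as a small sketch. It uses only the Linux prices quoted above and assumes the instance runs every hour of the three year term; the function names are made up for illustration, and leap days are ignored.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap days


def on_demand_cost(hourly_rate, years):
    """Total cost of running continuously at the on-demand rate."""
    return hourly_rate * HOURS_PER_YEAR * years


def reserved_cost(upfront, hourly_rate, years):
    """Upfront reservation fee plus the discounted hourly usage charge."""
    return upfront + hourly_rate * HOURS_PER_YEAR * years


years = 3
od = on_demand_cost(3.10, years)         # Linux on demand
ri = reserved_cost(10960, 0.482, years)  # Heavy Utilization Reserved

if __name__ == "__main__":
    print(f"on demand: ${od:,.2f}")
    print(f"reserved:  ${ri:,.2f}")
    print(f"effective reserved rate: ${ri / (HOURS_PER_YEAR * years):.3f}/hour")
```

Under those assumptions the three year totals come out to roughly $81,500 on demand against roughly $23,600 reserved, or about $0.90 per hour all in.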
· Amazon EC2 detail page: http://aws.amazon.com/ec2/
· Amazon EC2 pricing page: http://aws.amazon.com/ec2/#pricing
· Amazon EC2 Instance Types: http://aws.amazon.com/ec2/instance-types/
· Amazon Web Services Blog: http://aws.typepad.com/aws/2012/07/new-high-io-ec2-instance-type-hi14xlarge.html
· Werner Vogels: http://www.allthingsdistributed.com/2012/07/high-performace-io-instance-amazon-ec2.html
Adrian Cockcroft of Netflix wrote an excellent blog on this instance type where he gave benchmarking results from Netflix: Benchmarking High Performance I/O with SSD for Cassandra on AWS.
You can now have 100k IOPS for $3.10/hour.
Why are there so many data centers in New York, Hong Kong, and Tokyo? These urban centers have some of the most expensive real estate in the world. The cost of labor is high. The tax environment is unfavorable. Power costs are high. Construction is difficult to permit and expensive. Urban datacenters are incredibly expensive facilities and yet a huge percentage of the world’s computing is done in expensive urban centers.
One of my favorite examples is the 111 8th Ave data center in New York. Google bought this datacenter for $1.9B. They already have facilities on the Columbia river where the power and land are cheap. Why go to New York when neither is true? Google is innovating in cooling technologies in their Belgium facility where they are using waste water cooling. Why go to New York where the facility is conventional, the power source predominantly coal-sourced, and the opportunity for energy innovation is restricted by legacy design and the lack of real estate available in the area around the facility? It’s pretty clear that 111 8th Ave isn’t going to be wind farm powered. A solar array could likely be placed on the roof but that wouldn’t have the capacity to run the interior lights in this large facility (See I love Solar but … for more on the space challenges of solar power at data center power densities). There isn’t space to do anything relevant along these dimensions.
Google has some of the most efficient datacenters in the world, running on some of the cleanest power sources in the world, and custom engineered from the ground up to meet their needs. Why would they buy an old facility, in a very expensive metropolitan area, with a legacy design? Are they nuts? Of course not, Google is in New York because many millions of Google customers are in New York or nearby.
Companies site datacenters near the customers of those data centers. Why not serve the planet from Iceland where the power is both cheap and clean? When your latency budget to serve customers is 200 msec, you can’t give up ¾ of that time budget on speed of light delays traveling long distances. Just crossing the continent from California to New York is a 74 msec round trip time (RTT). New York to London is 70 msec RTT. The speed of light is unbending. Actually, it’s even worse than the speed of light in that the speed of light in a fiber is about 2/3 of the speed of light in a vacuum (see Communicating Beyond the Speed of Light).
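The physical floor on round trip time is easy to estimate. The sketch below assumes light in fiber travels at 2/3 of its vacuum speed, as noted above, and that the fiber follows the shortest possible path, which real routes never do. The city-pair distances are my own approximate great-circle figures, not numbers from any measurement, and `min_rtt_ms` is a made-up helper name.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_FRACTION = 2 / 3           # approximate fraction of c achieved in fiber


def min_rtt_ms(path_km, fiber_fraction=FIBER_FRACTION):
    """Lower bound on round trip time in milliseconds over a fiber path of path_km."""
    speed_km_per_s = C_VACUUM_KM_PER_S * fiber_fraction
    return 2 * path_km / speed_km_per_s * 1000


if __name__ == "__main__":
    # Approximate great-circle distances (assumptions for illustration only).
    routes = {"San Francisco - New York": 4130, "New York - London": 5570}
    for name, km in routes.items():
        print(f"{name}: at least {min_rtt_ms(km):.1f} msec RTT")
```

The measured round trips quoted above are higher than these floors because real fiber routes are longer than great-circle paths and every switch and router along the way adds delay.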
Because of the cruel realities of the speed of light, companies must site data centers where their customers are. That’s why companies selling world-wide often need to have datacenters all over the world. That’s why the Akamai content distribution network has over 1,200 points of presence world-wide. To serve customers competitively, you need to be near those customers. The reason datacenters are located in Tokyo, New York, London, Singapore and other expensive metropolitan locations is they need to be near customers or near data that is in those locations. It costs considerably to maintain datacenters all over the world but there is little alternative.
Many articles recently have been quoting the Greenpeace open letter asking Ballmer, Bezos and Cook to “go to Iceland”. See for example Letter to Ballmer, Bezos, and Cook: Go to Iceland. Having come across many of these articles recently, it seemed worth stopping and reflecting on why this hasn’t already happened. It’s not like companies just love paying more or using less environmentally friendly power sources for their data centers.
Google is in New York because it has millions of customers in New York. If it were physically possible to serve these customers from an already built, hyper efficient datacenter like Google Dalles, they certainly would. But that facility is 70 msec round trip away from New York. What about Iceland? Roughly the same distance. It simply doesn’t work competitively. Companies build near their users because physics of the speed of light is unbending and uncaring.
So, what can we do? It turns out that many workloads are not latency sensitive. The right strategy is to house latency sensitive workloads near customers or the data needed at low latency and house latency insensitive workloads optimizing on other dimensions. This is exactly what Google does but, to do that, you need to have many datacenters all over the world so the appropriate facility can be selected on a workload-by-workload basis. This isn’t a practical approach for many smaller companies with only 1 or 2 datacenters to choose from.
This is another area where cloud computing can help. Cloud computing can allow mid-sized and even small companies to have many different datacenters optimized for different goals all over the world. Using Amazon Web Services, a company can house workloads near customers in Singapore, Tokyo, Brazil, and Ireland to be close to their international customers. Being close to these customers makes a big difference in the overall quality of customer experience (see: The Cost of Latency for more detail on how much latency really matters). As well as allowing a company to cost effectively have an international presence, cloud computing also allows companies to make careful decisions on where they locate workloads in North America. Again using AWS as the example, customers can place workloads in Virginia to serve the east coast or use Northern California to serve the population dense California region. If the workloads are not latency sensitive or are serving customers near the Pacific Northwest, they can be housed in the AWS Oregon region where the workload can be hosted coal free and less expensively than in Northern California.
The reality is that physics is uncaring and many workloads do need to be close to users. Cloud computing allows all companies to have access to datacenters all over the world so they can target individual workloads to the facilities that most closely meet their goals and the needs of their customers. Some computing will have to stay in New York even though it is mostly coal powered, expensive, and difficult to expand. But some workload will run very economically in the AWS West (Oregon) region where there is no coal power, expansion is cheap, and power inexpensive.
Workload placement decisions are more complex than “move to Iceland.”
Last night, Tom Klienpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan in that the executive summary runs 86 pages in length. Overall, it’s an interesting document but I only managed to read into the first page before starting to feel disappointed. What I was hoping for is a deep dive into why the reactors failed, the root causes of the failures, and what can be done to rectify it.
Because of the nature of my job, I’ve spent considerable time investigating hardware and software system failures and what I find most difficult and really time consuming is getting to the real details. It’s easy to say there was a tsunami and it damaged the reactor complex and loss of power caused radiation release. But why did loss of power cause radiation release? Why didn’t the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addressed. Good post mortems are detailed, get to the root cause, and it’s rare that a detailed investigation of any complex system doesn’t yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root cause both technical and operational, and making detailed recommendations.
On the second page of this report, the committee members were enumerated. The committee includes 1) a seismologist, 2) two medical doctors, 3) a chemist, 4) a journalist, 5) two lawyers, 6) a social system designer, 7) one politician, and 8) no nuclear scientists, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami were clearly the seed for the event but since we can’t prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why both the reactor and nuclear material storage designs were not stable in the presence of cooling system failure. It's weird that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically we can’t stop earthquakes and tsunamis so we need to ensure that systems remain safe in the presence of them.
Obviously the investigative team is very qualified to deal with the follow-on events: assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, it feels like the core problem is that cooling system flow was lost and both the reactors and nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.
Personally, were it my investigation, the largest part of my interest would be focused on achieving designs that are stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea but still less important than ensuring these reactors and others in the country fail into a safe state.
The report reads like a political document. It's heavy on blame, light on root cause and the technical details of the root cause failure, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn't go after the core issues that led to the nuclear release. From my perspective, the key issues are 1) scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full load operational state without external input of power, cooling water, or administrative input.
The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time. They can't depend upon pumped water cooling and have to be 100% passive and stable for long periods without tending.
My third recommendation is arguably less important than my first two but applies to all systems: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …), detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.
The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:
Monitoring of the nuclear regulatory body by the National Diet
A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:
1. To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.
2. To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.
3. To continue investigations on other relevant issues.
4. To make regular reports on their activities and the implementation of their recommendations.
Reform the crisis management system
A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:
1. A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.
2. National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.
3. The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.
Government responsibility for public health and welfare
Regarding the responsibility to protect public health, the following must be implemented as soon as possible:
1. A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.
2. Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.
3. The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options.
Monitoring the operators
TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.
1. The government must set rules and disclose information regarding its relationship with the operators.
2. Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.
3. TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.
4. All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.
Criteria for the new regulatory body
The new regulatory organization must adhere to the following conditions. It must be:
1. Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.
2. Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.
3. Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.
4. Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.
5. Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.
Reforming laws related to nuclear energy
Laws concerning nuclear issues must be thoroughly reformed.
1. Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.
2. The roles for operators and all government agencies involved in emergency response activities must be clearly defined.
3. Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.
4. New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.
Develop a system of independent investigation commissions
A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.
Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:
1. Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.
2. All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.
3. The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
4. Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.
The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at: http://naiic.go.jp/wp-content/uploads/2012/07/NAIIC_report_lo_res2.pdf.
Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
The NASCAR Sprint Cup Stock Car Series kicks its season off with a bang and, unlike other sports, starts the season off with the biggest event of the year rather than closing with it. Daytona Speed Weeks is a multi-week, many race event, the finale of which is the Daytona 500. The 500 starts with a huge field of 43 cars and is perhaps most famous for some of its massive multi-car wrecks. The 17 car pile-up of 2011 made a 43 car field look like the appropriate amount of redundancy just to get a car over the finish line at the end.
Watching 43 stock cars race for the green flag at the start of the race is an impressive show of power as 146,000 lbs of metal charge towards the start line at nearly 200 miles per hour running so close that they appear to be connected. From the stands, the noise is deafening, the wall of air they are pushing can be felt 20 rows up and the air is hot from all the waste heat spilling off the field as they scream to the line.
Imagine harnessing all the power of all the engines from the 43 cars heading towards the start line at Daytona in a single engine. In fact, let’s make it harder: imagine having all the power of all the cars that take the green flag at both Daytona Sprint Cup races each year. Actually, for safety reasons, NASCAR restricts engine output at the Daytona and Talladega superspeedways to approximately 430 hp but let’s stick with the 750 hp they can produce when unrestricted. If we harnessed that power into a single engine, we would have an unbelievable 64,500 hp. Last week Jennifer and I were invited to tour the Hanjin Oslo container ship which happens to be single engine powered. Believe it or not, that single engine is more powerful than the aggregate horsepower of both Daytona starting fields. It has a single 74,700 hp engine.
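For anyone who wants to check the comparison, the arithmetic is short enough to sketch in Python. All of the figures come from the paragraph above; only the variable names are mine.

```python
CARS_PER_RACE = 43
DAYTONA_CUP_RACES_PER_YEAR = 2
UNRESTRICTED_HP = 750   # what the engines can produce without restriction
RESTRICTED_HP = 430     # approximate superspeedway limit
OSLO_ENGINE_HP = 74_700

total_cars = CARS_PER_RACE * DAYTONA_CUP_RACES_PER_YEAR
combined_hp = total_cars * UNRESTRICTED_HP
combined_restricted_hp = total_cars * RESTRICTED_HP

if __name__ == "__main__":
    print("both fields, unrestricted:", combined_hp, "hp")
    print("both fields, restricted:  ", combined_restricted_hp, "hp")
    print("Oslo engine vs both fields:", round(OSLO_ENGINE_HP / combined_hp, 2), "x")
```

Even against unrestricted engines the ship comes out roughly 16% ahead, and against the restricted engines actually raced at Daytona it has about twice the power.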
Last week Peter Kim who supervises the Hanjin shipping port at Terminal 46 invited us to tour the port facility and the Hanjin Oslo container ship. I love technology, scale, and learning how well run operations work so I jumped on the opportunity.
Shortly after arriving, we watched the Oslo being brought into Terminal 46. The captain and pilot were both looking down from the bridge wing towering more than 100’ above us, giving commands to the tugs as the Oslo was being eased into the dock. Even before the ship was tied off, the port was rapidly coming to life. Dock workers were scrambling to their stations, trucks were starting, container cranes were moving into position, Customs and Border Patrol was getting ready to board, and line handlers were preparing to tie the ship off. There were workers and heavy equipment moving into position throughout the terminal. And, over the next 12 hours, more than a thousand containers would be moved before the ship would be off to its next destination at 6:30am the following morning.
The Oslo is not the newest ship in the Hanjin fleet having been built in 1998. It’s not the biggest ship nor is it the most powerful. But it’s a great example of a well-run, super clean, and expertly maintained container ship. And, starting with the size, here’s the view from the bridge.
The ship truly is huge. What I find even more amazing is that, as large as the Oslo is, there are container ships out there with up to twice the cargo carrying capacity and as much as 45% more horsepower. In fact, the world’s most powerful diesel engine is deployed in a container ship. It’s a 14 cylinder, 3 floor high monster that produces 109,000 hp designed by the Finnish company Wartsila.
The Hanjin Oslo uses a (slightly) smaller inline 10 cylinder version of the same engine design. The key difference between it and the world’s largest diesel shown above is that the engine in the Oslo is 4 cylinders shorter at 10 cylinders inline rather than 14 and it produces proportionally less power. On the Oslo, the engine spans 3 decks so you can only see 1/3 of it at any one time. Here’s the view from the Hanjin Oslo engine room top deck, mid deck, and lower deck:
The engine is clearly notable for its size and power output. But, what I find most surprising is it’s a two stroke engine. Two stroke engines produce power at the beginning of the power stroke where the piston is heading down, dump the exhaust towards the end of that stroke, then bring in fresh air at the beginning of the next stroke as the piston begins heading back up, and then compresses the air for the remainder of that stroke. Towards the end of the compression stroke, fuel is injected into the cylinder where it combusts rapidly building pressure and pushing the piston back down on the power stroke. Four stroke engines separate these functions into four strokes: 1) power going down, 2) exhaust going up, 3) intake going down, and then 4) compression going up.
Two-stroke engines are common in lawn mowers, chainsaws, and some very small outboards because of their high power to weight ratio and simplicity of design that makes very low cost engines possible. Larger diesel engines used in trucks and automobiles are almost exclusively 4 stroke engines. Ironically, the very highest output diesel engines found in large marine applications are also two strokes.
After spending an evening with the Hanjin team, I was super impressed. I love the technology, the scale was immense, everything was very well maintained, and they are clearly excellent operators. If I were moving goods between continents, I would look first to Hanjin.
Cooling is the largest single non-IT (overhead) load in a modern datacenter. There are many innovative solutions to addressing the power losses in cooling systems. Many of these mechanical system innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.
The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speed which increases the losses to air moving, and 3) higher server mortality rates. I’ve measured the first and, although these losses are inarguably present and measurable, they have a very small impact at even quite high server inlet temperatures. The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won’t run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It’s the third issue that remains the scary one: increased server mortality rates.
The net of these factors is that fear of higher server mortality rates is the prime factor slowing an even more rapid increase in datacenter temperatures. An often quoted study reports the failure rate of electronics doubles with every 10C increase of temperature (MIL-HDBK 217F). This data point is incredibly widely used by the military, NASA space flight program, and in commercial electronic equipment design. I’m sure the work is excellent but it is a very old study, wasn’t focused on a large datacenter environment, and the rule of thumb that has emerged from it is a very simple model of failure rate versus heat.
A recent paper does an excellent job of methodically digging through the possible issues of high datacenter temperature and investigating each concern. I like Temperature Management in Data Centers: Why Some (Might) Like it Hot for two reasons: 1) it unemotionally works through the key issues and concerns, and 2) it draws from a sample of 7 production data centers at Google so the results are credible and from a substantial sample.
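To get a feel for what the often quoted rule of thumb predicts, here it is as a one-line function. This is a sketch of the rule as stated above (failure rate doubles for every 10C), taken at face value, not a reproduction of the handbook's actual models. The function name is made up for illustration.

```python
def failure_rate_multiplier(delta_c, doubling_interval_c=10.0):
    """Relative failure rate after raising temperature by delta_c degrees C.

    Applies the 'doubles every 10C' rule of thumb: each interval compounds.
    """
    return 2.0 ** (delta_c / doubling_interval_c)


if __name__ == "__main__":
    for delta in (0, 5, 10, 20, 30):
        print(f"+{delta}C -> {failure_rate_multiplier(delta):.2f}x failure rate")
```

Because each 10C step compounds on the last, the rule predicts a 4x failure rate for a 20C rise and 8x for 30C, which is why it makes operators so cautious about raising inlet temperatures.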
From the introduction:
Interestingly, one key aspect in the thermal management of a data center is still not very well understood: controlling the setpoint temperature at which to run a data center’s cooling system. Data centers typically operate in a temperature range between 20C and 22C, some are as cold as 13C degrees [8, 29]. Due to lack of scientific data, these values are often chosen based on equipment manufacturers’ (conservative) suggestions. Some estimate that increasing the setpoint temperature by just one degree can reduce energy consumption by 2 to 5 percent [8, 9]. Microsoft reports that raising the temperature by two to four degrees in one of its Silicon Valley data centers saved $250,000 in annual energy costs. Google and Facebook have also been considering increasing the temperature in their data centers.
The authors go on to observe that “the details of how increased data center temperatures will affect hardware reliability are not well understood and existing evidence is contradictory.” The remainder of the paper presents the data as measured in the 7 production datacenters under study and concludes each section with an observation. I encourage you to read the paper and I’ll cover just the observations here:
Observation 1: For the temperature range that our data covers with statistical significance (< 50C), the prevalence of latent sector errors increases much more slowly with temperature, than reliability models suggest. Half of our model/data center pairs show no evidence of an increase, while for the others the increase is linear rather than exponential.
Observation 2: The variability in temperature tends to have a more pronounced and consistent effect on Latent Sector Error rates than mere average temperature.
Observation 3: Higher temperatures do not increase the expected number of Latent Sector Errors (LSEs) once a drive develops LSEs, possibly indicating that the mechanisms that cause LSEs are the same under high or low temperatures.
Observation 4: Within a range of 0-36 months, older drives are not more likely to develop Latent Sector Errors under temperature than younger drives.
Observation 5: High utilization does not increase Latent Sector Error rates under temperatures.
Observation 6: For temperatures below 50C, disk failure rates grow more slowly with temperature than common models predict. The increase tends to be linear rather than exponential, and the expected increase in failure rates for each degree increase in temperature is small compared to the magnitude of existing failure rates.
Observation 7: Neither utilization nor the age of a drive significantly affect drive failure rates as a function of temperature.
Observation 8: We do not observe evidence for increasing rates of uncorrectable DRAM errors, DRAM DIMM replacements or node outages caused by DRAM problems as a function of temperature (within the range of temperature our data comprises).
Observation 9: We observe no evidence that hotter nodes have a higher rate of node outages, node downtime or hardware replacements than colder nodes.
Observation 10: We find that high variability in temperature seems to have a stronger effect on node reliability than average temperature.
Observation 11: As ambient temperature increases, the resulting increase in power is significant and can be mostly attributed to fan power. In comparison, leakage power is negligible.
Observation 12: Smart control of server fan speeds is imperative to run data centers hotter. A significant fraction of the observed increase in power dissipation in our experiments could likely be avoided by more sophisticated algorithms controlling the fan speeds.
Observation 13: The degree of temperature variation across the nodes in a data center is surprisingly similar for all data centers in our study. The hottest 5% nodes tend to be more than 5C hotter than the typical node, while the hottest 1% nodes tend to be more than 8–10C hotter.
The paper under discussion: http://www.cs.toronto.edu/~nosayba/temperature_cam.pdf.
Other notes on increased data center temperatures:
· Exploring the Limits of Datacenter Temperature
· Chillerless Data Center at 95F
· Computer Room Evaporative Cooling
· Next Point of Server Differentiation: Efficiency at Very High Temperature
· Open Compute Mechanical System Design
· Example of Efficient Mechanical Design
· Innovative Datacenter Design: Ishikari Datacenter
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Urs Holzle did the keynote talk at the 2012 Open Networking Summit where he focused on Software Defined Networking in Wide Area Networking. Urs leads the Technical Infrastructure group at Google where he is Senior VP and Technical Fellow. Software defined networking (SDN) is the central management of networking routing decisions rather than depending upon distributed routing algorithms running semi-autonomously on each router.
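As a rough illustration of what central management of routing decisions can look like, here is a minimal sketch in which one controller holds the whole topology, runs a shortest-path computation for every switch, and produces the forwarding tables it would push out. This is not how Google's controller works and it is not OpenFlow code. The topology, link costs, and function names are hypothetical and chosen only to show the idea.

```python
import heapq


def next_hops(topology: dict, source: str) -> dict:
    """Dijkstra from `source`; returns {destination: first hop on the best path}.

    `topology` maps each switch to {neighbor: non-negative link cost}.
    """
    best_cost = {source: 0}
    first_hop = {}
    done = set()
    heap = [(0, source, None)]  # (path cost, switch, first hop used to get there)
    while heap:
        cost, node, via = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if via is not None:
            first_hop[node] = via
        for neighbor, link_cost in topology[node].items():
            new_cost = cost + link_cost
            if neighbor not in done and new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                # Leaving the source, the first hop is the neighbor itself.
                heapq.heappush(heap, (new_cost, neighbor, neighbor if via is None else via))
    return first_hop


def forwarding_tables(topology: dict) -> dict:
    """The controller computes every switch's table from one global view."""
    return {switch: next_hops(topology, switch) for switch in topology}


# Hypothetical three-site network: A-B and B-C are cheap, A-C is expensive.
wan = {"A": {"B": 1, "C": 5}, "B": {"A": 1, "C": 1}, "C": {"A": 5, "B": 1}}
print(forwarding_tables(wan)["A"])

# On a B-C link failure the controller recomputes everything in one pass.
wan_after_failure = {"A": {"B": 1, "C": 5}, "B": {"A": 1}, "C": {"A": 5}}
print(forwarding_tables(wan_after_failure)["A"])
```

The point of the sketch is that no switch has to converge with its neighbors. The controller sees the failure, recomputes against the full topology, and programs the result.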
Essentially what is playing out in the networking world is a replay of what we have seen in the server world across many dimensions. The dimension that is central to the SDN discussion is that a datacenter full of 10k to 50k servers is not managed individually by an administrator, and the nodes making up the networking fabric shouldn’t be either.
The key observations behind SDN are 1) if the entire system is under single administrative control, central routing control is possible, 2) at the scale of a single administrative domain, central control of networking routing decisions is practical, and 3) central routing control allows many advantages including faster convergence on failure, priority-based routing decisions when resource constrained, application-aware routing, and it enables the same software system that manages application deployment to manage network
In Holzle’s talk, he motivated SDN by first talking about
· Cost per bit/sec delivered should go down with scale rather than up (consider analogy in compute and storage)
· However, cost/bit doesn’t naturally decrease with size due to:
  · Quadratic complexity in pairwise interactions
  · Manual management and configuration of individual
  · Complexity of automation due to non-standard vendor configuration APIs
· Solution: Manage the WAN as a fabric rather than as a collection of individual boxes
· Current equipment and protocols don’t support
· Internet protocols are box-centric rather than
· Little support for monitoring and operations
· Optimized for “eventual consistency” in networking
· Little baseline support for low-latency routing and fast failover
Advantages of central traffic engineering:
· Better networking utilization with a global view
· Converges faster to target optimum on failure
· Allows more control and to specify application
· Deterministic behavior simplifies planning vs overprovisioning for worst case variability
· Can mirror production event streams for testing to support faster innovation and robust software development
· Controller uses modern server hardware (50x
· Decentralized requires a full scale test bed of production network to test new traffic engineering features
· Centralized can tap real production input to research new ideas and to test new implementations
SDN Testing Strategy:
· Various logical modules enable testing in
· Virtual environment to experiment and test with the complete system end-to-end
· Everything is real except the hardware
· Allows use of tools to validate state across all devices after every update from central server
· Enforce ‘make before break’ semantics
· Able to simulate the entire back-bone with real monitoring and alerts
Google is using custom networking equipment with 100s of ports of 10GigE
· Dataplane runs on merchant silicon routing
· Control plane runs on Linux hosted on custom
· Quagga BGP and ISIS stacks
· Only supports the protocols in use at Google
OpenFlow Deployment History:
· The OpenFlow deployment was done on the Google internal (non-customer facing) network
· Phase I: Spring 2010
  · Install OpenFlow-controlled switches but make them look like regular routers
  · BGP/ISIS/OSPF now interfaces with OpenFlow controller to program switch state
  · Pre-deploy gear at one site, take down 50% of bandwidth, perform upgrade, bring new equipment online and repeat with the
  · Repeat at other sites
· Phase II: Mid 2011
  · Activate simple SDN without traffic engineering
  · Ramp traffic up on test network
  · Test transparent software rollouts
· Phase III: Early 2012
  · All datacenter backbone traffic carried by new
  · Rolled out central traffic engineering
  · Optimized routing based upon 7 application level
  · Globally optimized flow placement
  · External copy scheduler works with the OpenFlow controller to implement deadline scheduling for large data copies
Google SDN Experience:
· Much faster iteration: deployed production quality centralized traffic engineering in 2 months
· Fewer devices to update
· Much better testing prior to roll-out
· Simplified high-fidelity test environment
· No packet loss during upgrade
· No capacity loss during upgrade
· Most features don’t touch the switch
· Higher network utilization
· Unified view of entire network fabric (rather than router-by-router view)
· Able to implement:
  · Traffic engineering with higher quality of service awareness and predictability
  · Latency, loss, bandwidth, and deadline sensitivity in routing decisions
· Improved routing decisions:
  · Based upon a priori knowledge of network topology
  · Based upon L1 and L3 connectivity
· Improved monitoring and alerts
· OpenFlow protocol barebones but good enough
· Master election/control plane partition challenging to handle
· What to leave on router and what to run
· Flow programming can be slow for large networks
· OpenFlow is ready for real world use
· SDN is ready for real world use
· Enables rich feature deployment
· Simplified network management
· Google’s Datacenter WAN runs on OpenFlow
· Largest production network at Google
A video of Urs’ talk is available at: OpenFlow @ Google
Most of the time I write about the challenges posed by scaling infrastructure. Today, though, I wanted to mention some upcoming events that have to do with a different sort of scale.
In Amazon Web Services we are tackling lots of really hairy challenges as we build out one of the world’s largest cloud computing platforms. From data center design, to network architecture, to data persistence, to high-performance computing and beyond, we have a virtually limitless set of problems needing to be solved. Over the coming years AWS will be blazing new trails in virtually every aspect of computing and infrastructure.
In order to tackle these opportunities we are searching for innovative technologists to join the AWS team. In other words, we need to scale our engineering staff. AWS has hundreds of open positions throughout the organization. Every single AWS team is hiring including EC2, S3, EBS, EMR, CloudFront, RDS, DynamoDB and even the AWS-powered Amazon Silk
On May 17th and 18th we will be holding recruiting events in three cities: Houston, Minneapolis, and Nashville. If you live near any of those cities and are passionate about defining and building the future of computing, you will find more information at the following URL: http://aws.amazon.com/careers/local-events/
You can also send your resume to email@example.com and we will follow up with you.
I met Google’s Wolf-Dietrich Weber at the 2009 CIDR conference where he presented what is still one of my favorite datacenter power-related papers. I liked the paper because the gain was large, the authors weren’t confused or distracted by much of what is incorrectly written on datacenter power consumption, and the technique is actually practical. In Power Provisioning for a Warehouse-sized Computer, the authors argue that we should oversell power, the most valuable resource in a data center. Just as airlines oversell seats, their key revenue producing asset, datacenter operators should oversell power.
Most datacenter operators take the critical power, the total power available to the data center less power distribution losses and mechanical system cooling loads, then reduce it by at least 10 to 20% to protect against the risk of overdraw which can draw penalty or power loss. Servers are then provisioned to this reduced critical power level. But, the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter but that’s another discussion) so there is close to no chance that all servers will draw their full load at the same time. And, with some diversity of workloads, even with some services spiking to 100%, we can often exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than we actually have critical power.
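A small worked example may help show how much capacity is at stake. The numbers below are hypothetical: a 10MW critical power budget, 500W nameplate servers, a 20% safety margin, and an observed aggregate peak of 70% of nameplate. None of them come from the paper. The function name is also mine.

```python
def provisionable_servers(critical_power_w: int, per_server_budget_w: int,
                          margin_pct: int = 0) -> int:
    """Servers that fit when each is budgeted per_server_budget_w watts
    and margin_pct percent of critical power is held back as a safety margin."""
    usable_w = critical_power_w * (100 - margin_pct) // 100
    return usable_w // per_server_budget_w


CRITICAL_POWER_W = 10_000_000   # hypothetical 10MW of critical power
NAMEPLATE_W = 500               # hypothetical per-server nameplate draw
OBSERVED_PEAK_W = 350           # assumed: fleet-wide peak is 70% of nameplate

# Conventional approach: budget nameplate power and hold back 20%.
conservative = provisionable_servers(CRITICAL_POWER_W, NAMEPLATE_W, margin_pct=20)

# Oversold approach: budget what the fleet actually draws at aggregate peak.
oversold = provisionable_servers(CRITICAL_POWER_W, OBSERVED_PEAK_W)

print(f"conservative: {conservative} servers")
print(f"oversold:     {oversold} servers")
print(f"gain:         {oversold / conservative - 1:.0%}")
```

The oversold number only holds if there is a safety valve for the rare moment when aggregate draw does approach the limit.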
This is exactly what airlines do when selling seats. And, just as airlines need to be able to offer a free ticket to Hawaii in the unusual event that they find a flight over-subscribed, we need the same safety valve here. Some datacenter equivalents of a free ticket to Hawaii are: 1) delay all non-customer impacting workloads (administrative and operational batch jobs), 2) stop non-critical or best-effort workloads, 3) force servers into lower power states. This last one is a favorite research topic but is almost never done in practice because it is the equivalent of solving the oversold airline seat problem by actually having two people sit in the same seat. It sort of works but isn’t safe and doesn’t make for happy customers. Option #3 reduces the resources available to all workloads by lowering overall quality of service. For most businesses this is not a good economic choice. The best answers are options 1 and 2 above.
One class of application that is particularly difficult to manage efficiently is online data-intensive (OLDI) workloads. Web search, advertising, and machine translation are examples of this workload type. These workloads can be very profitable so option #3 above, that of reducing the quality of service, doesn’t make economic sense. In the note The Cost of Latency, we reviewed the importance of very rapid response in these workload types and in ecommerce systems. Reducing the quality of service for these high value workloads to save power doesn’t make economic sense.
The best answer for these workloads is what Barroso and Hoelzle refer to as Energy Proportional Computing (The Case for Energy Proportional Computing). Essentially the goal of energy proportional computing is that a server at 10% load should consume 10% of the power of a server running at 100% load. Clearly there is overhead and this goal will never be fully achieved but, the closer we get, the lower the cost and environmental impact of hosting OLDI workloads.
The good news is there has been progress. When energy proportional computing was first proposed, many servers at idle would consume 80% of the power that they would consume at full load. Today, a good server can be as low as 45% at idle. We are nowhere close to where we want to be but good progress is being made. In fact, CPUs are quite good by this measure today -- the worst offenders are the other components in the server. Memory has big opportunities and the mobile consumer device world shows us what is possible. I expect we’ll continue to progress by stealing ideas from the cell phone industry and applying them to servers.
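To see how far the 80% and 45% idle figures are from the proportional ideal, here is a quick sketch. It assumes power rises linearly from the idle level to full power as load increases, which is a simplification of how real servers behave, and the function name is mine.

```python
def power_fraction(load: float, idle_fraction: float) -> float:
    """Fraction of peak power drawn at `load` (0.0 to 1.0), assuming a
    straight line from `idle_fraction` at idle to 1.0 at full load."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0.0 and 1.0")
    return idle_fraction + (1.0 - idle_fraction) * load


servers = (
    ("early server (80% at idle)", 0.80),
    ("good server today (45% at idle)", 0.45),
    ("ideal energy proportional server", 0.00),
)
for label, idle in servers:
    print(f"{label}: {power_fraction(0.10, idle):.1%} of peak power at 10% load")
```

Under this linear assumption, a server at 10% load draws about 82% of peak in the first case and about 50% in the second, against an ideal of 10%.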
In Power Management of Online Data-Intensive Services, a research team from Google and the University of Michigan target the OLDI power proportionality problem focusing on Google search, advertising, and translation workloads. These workloads are difficult because the latency goals are achieved using large in-memory caches and, as workload moves from peak to valley, all these machines need to stay available in order to meet the application latency goals. It is not an option to concentrate the workload on fewer servers – the cache size requires all the servers continue to be available so, as workload goes down towards idle, all the servers continue to have some small amount of workload so they can’t be dropped into full system low power states.
The data cache size requires the memory of all the servers so, as the workload volume goes down, each server gets progressively less busy but never actually hits idle. They always need to be online and available so the next request can be served at the required latency. The paper draws the following conclusions:
· CPU active low-power modes provide the best single power-performance mechanism but, by themselves, cannot achieve power proportionality
· There is a pressing need to improve idle low-power modes for shared caches and on-chip memory controllers
· There is a substantial opportunity to save memory system power with low-power modes [mobile systems do this well today so the techniques are available]
· Even with query-batching, full system idle low-power modes cannot provide acceptable latency-power tradeoffs
· Coordinated, full-system active low-power modes hold the greatest promise to achieve energy proportionality with acceptable query latency
Summarizing the OLDI workload type as presented in the paper, the workload latency goals are achieved by spreading very large data caches over the operational servers. As the workload goes from peak to trough, these servers all get less busy but never are actually at idle so can’t be dropped into a full system lower power state.
I like to look at servers supporting these workloads as being in a two dimensional grid. Each row represents one entire copy of the cache spread over 100s of servers. A single row could serve the workload and successfully deliver on the application latency goals but a single row will not scale. To scale to workloads beyond that which can be served on a single row, more rows are added. When a search query comes into the system, it is sent to the 100s of systems in a single row but only to the servers in a single row. Looking at the workload this way, I would argue, we actually do have some ability to make OLDI workloads power proportional at a warehouse-scale. When the workload goes up towards peak, more rows are needed. When the workload reduces towards trough, fewer rows are used and the rows not currently in use can be used to support other workloads.
This row-level scaling technique produces very nearly full proportionality at the overall datacenter level with two problems: 1) the workload can’t scale down below a row for all the reasons outlined in the paper, and 2) if the workload is very dynamic and jumps from trough to peak quickly, more rows need to be kept ready in case they are needed, which further reduces the power proportionality of the technique.
If a workload is substantially higher scale than a single row and predictably swings from trough to peak, this per-row scaling technique produces very good results. It fails where workloads change dramatically or where less-than-single row scaling is needed.
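The row-level approach can be sketched in a few lines. The per-row capacity, the sample loads, and the function name below are hypothetical. The floor of one row reflects the point that the cache has to stay fully resident, and the spare rows model the headroom needed when load can jump quickly.

```python
import math


def rows_needed(load_qps: float, row_capacity_qps: float, spare_rows: int = 0) -> int:
    """Rows to keep serving: enough for the load, never fewer than one full
    copy of the cache, plus any spare rows held ready for sudden spikes."""
    serving_rows = max(1, math.ceil(load_qps / row_capacity_qps))
    return serving_rows + spare_rows


ROW_CAPACITY_QPS = 1_000  # hypothetical capacity of one row (one cache copy)

for load in (250, 1_000, 4_200, 9_500):
    steady = rows_needed(load, ROW_CAPACITY_QPS)
    bursty = rows_needed(load, ROW_CAPACITY_QPS, spare_rows=2)
    print(f"{load:>5} qps: {steady} rows if predictable, {bursty} rows with 2 spares")
```

At 250 qps the workload still holds a full row, which is the first limitation above, and the spare rows show how a bursty workload gives back some of the gain.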
The two referenced papers:
· Power Management of Online Data-Intensive Services
· Power Provisioning for a Warehouse-sized Computer
Thanks to Alex Mallet for sending me Power Management of Online Data-Intensive Services.
I love solar power, but in reflecting carefully on a couple of high profile datacenter deployments of solar power, I’m really developing serious reservations that this is the path to reducing data center environmental impact. I just can’t make the math work and find myself wondering if these large solar farms are really somewhere between a bad idea and pure marketing, where the environmental impact is purely optical.
The first of my two examples is the high profile installation of a large solar array at the Facebook Prineville Oregon Facility. The installation of 100 kilowatts of solar power was the culmination of the unfriend coal campaign run by Greenpeace. Many in the industry believe the campaign worked. In the purest sense, I suppose it did. But let’s look at the data more closely and make sure this really is environmental progress. What was installed in Prineville was a 100 kilowatt solar array at a more than 25 megawatt facility (Facebook Installs Solar Panels at new Data Center). Even though this is actually a fairly large solar array, it’s only providing 0.4% of the overall facility power.
Unfortunately, the actual numbers are further negatively impacted by weather and high latitude. Solar arrays produce far less than their rated capacity due to night duration, cloud cover, and other negative impacts from weather. I really don’t want to screw up my Seattle recruiting pitch too much but let’s just say that occasionally there are clouds in the Pacific Northwest :-). Clearly there are fewer clouds at 2,868’ elevation in the Oregon desert but, even at that altitude, the sun spends the bulk of the time poorly positioned for power generation.
Using this solar panel output estimator, we can see that the panels at this location and altitude yield an effective output of 13.75%. That means that, on average, this array will only put out 13.75 kilowatts. That would have this array contributing 0.055% of the facility power or, worded differently, it might run the lights in the datacenter but it has almost no measurable possible impact on the overall energy consumed. Although this is pointed to as an environmentally conscious decision, it really has close to no influence on the overall environmental impact of this facility. As a point of comparison, this entire solar farm produces approximately as much output as one high density rack of servers consumes. Just one rack of servers is not success, it doesn’t measurably change the coal consumption, and almost certainly isn’t good price/performance.
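The Prineville arithmetic is short enough to check directly. The inputs are the figures quoted above and only the variable names are mine. The facility is described as "more than 25 megawatt", so the shares computed here are, if anything, on the high side.

```python
ARRAY_RATED_KW = 100.0       # rated solar capacity installed
EFFECTIVE_OUTPUT = 0.1375    # effective output from the estimator (13.75%)
FACILITY_KW = 25_000.0       # facility power, taken at the 25MW lower bound

rated_share = ARRAY_RATED_KW / FACILITY_KW
average_output_kw = ARRAY_RATED_KW * EFFECTIVE_OUTPUT
average_share = average_output_kw / FACILITY_KW

print(f"share of facility power at rated output: {rated_share:.2%}")
print(f"average array output: {average_output_kw:.2f} kW")
print(f"share of facility power on average: {average_share:.3%}")
```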
Having said that the Facebook solar array is very close to purely marketing expense, I hasten to add that Facebook is one of the most power-efficient and environmentally-focused large datacenter operators. Ironically, they are in fact very good environmental stewards, but the solar array isn’t really a material contributor to what they are achieving.
Apple iDataCenter, Maiden, North Carolina
The second example I wanted to look at is Apple’s facility at Maiden, North Carolina, often referred to as iDataCenter. In the Facebook example discussed above, the solar array was so small as to have nearly no impact on the composition or amount of power consumed by the facility. However, in this example, the solar farm deployed at the Apple Maiden facility is absolutely massive. In fact, this photo voltaic deployment is reported to be the largest commercial deployment in the US at 20 megawatts. Given the scale of this deployment, it has a far better chance to work economically.
The Apple Maiden facility is reported to cost $1B for the 500,000 sq ft datacenter. Apple wisely chose not to publicly announce their power consumption numbers but estimates have been as high as 100 megawatts. If you conservatively assume that only 60% of the square footage is raised floor and they are averaging a fairly low 200W/sq ft, the critical load would still be 60MW (the same as the 700,000 sq ft Microsoft Chicago datacenter). At a moderate Power Usage Effectiveness (PUE) of 1.3, Apple Maiden would be at 78MW of total power. Even using these fairly conservative numbers for a modern datacenter build, it would be 78MW total power, which is huge. The actual number is likely somewhat higher.
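Here is the same estimate written out step by step. The 60% raised floor share, 200W/sq ft, and PUE of 1.3 are the assumptions stated above rather than numbers published by Apple, and the variable names are mine.

```python
FACILITY_SQ_FT = 500_000
RAISED_FLOOR_SHARE = 0.60    # assumed share of the building that is raised floor
WATTS_PER_SQ_FT = 200        # assumed average power density on the raised floor
PUE = 1.3                    # assumed ratio of total facility power to critical load

raised_floor_sq_ft = FACILITY_SQ_FT * RAISED_FLOOR_SHARE
critical_load_mw = raised_floor_sq_ft * WATTS_PER_SQ_FT / 1_000_000
total_power_mw = critical_load_mw * PUE

print(f"raised floor:  {raised_floor_sq_ft:,.0f} sq ft")
print(f"critical load: {critical_load_mw:.0f} MW")
print(f"total power:   {total_power_mw:.0f} MW")
```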
Apple elected to put in a 20MW solar array at this facility. Again, using the location and elevation data from Wikipedia and the solar array output model referenced above, we see that the Apple location is more solar friendly than Oregon. Using this model, we see that the 20MW photo voltaic deployment has an average output of 15.8% which yields 3.2MW.
The solar array requires 171 acres of land which is 7.4 million sq ft. What if we were to build a solar array large enough to power the entire facility using these solar and land consumption numbers? If the solar farm were to be able to supply all the power of the facility it would need to be 24.4 times larger. It would be a 488 megawatt capacity array requiring 4,172 acres which is 181 million sq ft. That means that a 500,000 sq ft facility would require 181 million sq ft of power generation or, converted to a ratio, each data center sq ft would require 362 sq ft of land.
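The scaling calculation can be reproduced as follows, using the 78MW facility estimate and 3.2MW average array output from above plus the standard 43,560 sq ft per acre. The variable names are mine. Results can differ from the figures in the text by a fraction of a percent depending on where the rounding is applied.

```python
SQ_FT_PER_ACRE = 43_560
ARRAY_RATED_MW = 20.0
ARRAY_AVERAGE_MW = 3.2       # 20MW at roughly 15.8% effective output
ARRAY_ACRES = 171
FACILITY_TOTAL_MW = 78.0     # estimated total facility power
FACILITY_SQ_FT = 500_000

# How many times larger the array must be to cover the whole facility on average.
scale = round(FACILITY_TOTAL_MW / ARRAY_AVERAGE_MW, 1)
required_rated_mw = ARRAY_RATED_MW * scale
required_acres = ARRAY_ACRES * scale
required_sq_ft = required_acres * SQ_FT_PER_ACRE
land_per_dc_sq_ft = required_sq_ft / FACILITY_SQ_FT
current_solar_share = ARRAY_AVERAGE_MW / FACILITY_TOTAL_MW

print(f"scale factor:          {scale}x")
print(f"required rated output: {required_rated_mw:.0f} MW")
print(f"required land:         {required_acres:,.0f} acres ({required_sq_ft / 1e6:.0f}M sq ft)")
print(f"land per datacenter sq ft: {land_per_dc_sq_ft:.0f} sq ft")
print(f"current array covers:  {current_solar_share:.1%} of facility power")
```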
Do we really want to give up that much space at each data center? Most data centers are in highly populated areas, where a ratio of 1 sq ft of datacenter floor space requiring 362 sq ft of power generation space is ridiculous on its own and made close to impossible by the power generation space needing to be un-shadowed. There isn’t enough roof top space across all of NY to take this approach. It is simply not possible in that venue.
Let’s focus instead on large datacenters in rural areas where the space can be found. Apple is reported to have cleared trees off of 171 acres of land in order to provide photo voltaic power for 4% of their overall estimated data center consumption. Is that gain worth clearing and consuming 171 acres? In Apple Planning Solar Array Near iDataCenter, the author Rich Miller of Data Center Knowledge quotes local North Carolina media reporting that “local residents are complaining about smoke in the area from fires to burn off cleared trees and debris on the Apple property.”
I’m personally not crazy about clearing 171 acres in order to supply only 4% of the power at this facility. There are many ways to radically reduce aggregate data center environmental impact without as much land consumption. Personally, I look first to increasing the efficiency of power distribution, cooling, storage, networking, and servers, and to increasing overall utilization, as the best routes to lowering industry environmental impact.
Looking more deeply at the Solar Array at Apple Maiden, the panels are built by SunPower. SunPower is reportedly carrying $820m in debt and has received a $1.2B federal government loan guarantee. The panels are built on taxpayer guarantees and installed using taxpayer funded tax incentives. It might possibly be a win for the overall economy but, as I work through the numbers, it seems less clear. And, after the spectacular failure of solar cell producer Solyndra, which failed in bankruptcy with a $535 million federal loan guarantee, it’s obvious there are large costs being carried by taxpayers in these deployments. Generally, as much as I like data centers, I’m not convinced that taxpayers should be paying to power them.
As I work through the numbers from two of the most widely reported upon datacenter solar array deployments, they just don’t seem to balance out positively without tax incentives. I’m not convinced that having the tax base fund datacenter deployments is a scalable solution. And, even if it could be shown that this will eventually become tax neutral, I’m not convinced we want to see datacenter deployments consuming 100s of acres of land on power generation. And, when trees are taken down to allow the solar deployment, it’s even harder to feel good about it. From what I have seen so far, this is not heading in the right direction. If we had $x dollars to invest in lowering datacenter environmental impact and the marketing department was not involved in the decision, I’m not convinced the right next step will be solar.
Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired.
On one high scale, on-premise server product I worked upon, page checksums were temporarily added to detect issues during a limited beta release. The code fired constantly, and customers were complaining that the new beta version was “so buggy they couldn’t use it”. Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps it was introduced by hardware, perhaps firmware, or possibly software. It could have even been corruption introduced by one of our previous releases when those pages were last written. Some of these pages may not have been written for years.
I was amazed at the amount of corruption we found and started reflecting on how often I had seen “index corruption” or other reported product problems that were probably corruption introduced in the software and hardware stacks below us. The disk has complex hardware and hundreds of thousands of lines of code, while the storage area network has complex data paths and over a million lines of code. The device driver has tens of thousands of lines of code. The operating system has millions of lines of code. And our application had millions of lines of code. Any of us can screw up, each has an opportunity to corrupt, and it’s highly likely that the entire aggregated millions of lines of code have never been tested in precisely the combination and on the hardware that any specific customer is actually currently running.
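For readers who haven't built one, an application-level page checksum can be very small. The sketch below is not the code from that product. It is a minimal illustration using CRC-32 from Python's standard zlib module, with a page layout (payload followed by a 4-byte checksum) and function names that I made up for the example.

```python
import zlib

CHECKSUM_BYTES = 4  # CRC-32 stored big-endian after the payload (assumed layout)


def seal_page(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload before the page is written out."""
    return payload + zlib.crc32(payload).to_bytes(CHECKSUM_BYTES, "big")


def open_page(page: bytes) -> bytes:
    """Verify the checksum on read and return the payload, or raise if
    anything below the application changed the bytes."""
    if len(page) < CHECKSUM_BYTES:
        raise ValueError("page too short to contain a checksum")
    payload, stored = page[:-CHECKSUM_BYTES], page[-CHECKSUM_BYTES:]
    if zlib.crc32(payload).to_bytes(CHECKSUM_BYTES, "big") != stored:
        raise ValueError("page checksum mismatch: data corrupted below the application")
    return payload


page = seal_page(b"row data written years ago")
assert open_page(page) == b"row data written years ago"

# Simulate a single flipped bit introduced somewhere in the stack below.
corrupted = bytes([page[0] ^ 0x01]) + page[1:]
try:
    open_page(corrupted)
except ValueError as err:
    print(f"detected: {err}")
```

A check like this costs a few bytes per page and one pass over the data, and it is the only layer that verifies what the application actually wrote.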
Another example. In this case, a fleet of tens of thousands of servers was instrumented to monitor how frequently the DRAM ECC was correcting. Over the course of several months, the result was somewhere between amazing and frightening. ECC is firing constantly.
The immediate lesson is you absolutely do need ECC in server applications and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
Here’s an interesting example from the space flight world. It caught my attention and I ended up digging ever deeper into the details last week and learning at each step. The Russian space mission Phobos-Grunt (also written Fobos-Grunt, both of which roughly translate to Phobos Ground) was a space mission designed to, amongst other objectives, return soil samples from the Martian moon Phobos. This mission was launched atop the Zenit-2SB launch vehicle taking off from the Baikonur Cosmodrome at 2:16am on November 9th 2011. On November 24th it was officially reported that the mission had failed and the vehicle was stuck in low earth orbit. Orbital decay has subsequently sent the satellite plunging to earth in a fiery end of what was a very expensive mission.
What went wrong aboard Phobos-Grunt? On February 3rd the official accident report was released: The main conclusions of the Interdepartmental Commission for the analysis of the causes of abnormal situations arising in the course of flight testing of the spacecraft "Phobos-Grunt". Of course, this document is released in Russian but Google Translate actually does a very good job with it. IEEE Spectrum Magazine reported on the failure as well. The IEEE article, Did Bad Memory Chips Down Russia’s Mars Probe, is a good summary and the translated Russian article offers more detail if you are interested in digging deeper.
The conclusion of the report is that there was a double memory fault on board Phobos-Grunt. Essentially both computers in a dual-redundant set failed at the same or similar times with a Static Random Access Memory (SRAM) failure. The computers were part of a newly developed flight control system that had focused on dropping the mass of the flight control systems from 30 kg (66 lb) to 1.5 kg (3.3 lb). Less weight in flight control is more weight that can be in payload, so these gains are important. However, this new flight control system was blamed for the delay of the mission by 2 years and for its eventual demise.
The two flight control computers are identical TsM22 computer systems supplied by Techcom, a spin-off of the Argon Design Bureau (Phobos Grunt Design). The official postmortem reports that both computers suffered an SRAM failure in a WS512K32V20G24M SRAM. These SRAMs are manufactured by White Electronic Design and the model number can be decoded as “W” for White Electronic Design, “S” for SRAM, “512K32” for a 512K memory with 32-bit wide access, “V” for the improvement mark, “20” for 20ns memory access time, “G24” for the package type, and “M” to indicate a military grade part.
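The decoding above can be sketched as a small parser. This is only an illustration of the field breakdown given in this post for this one part number. The regular expression, the function name, and the dictionary keys are my own assumptions and are not taken from any manufacturer's ordering guide.

```python
import re

# Hypothetical pattern that matches only the field layout described above.
PART_PATTERN = re.compile(
    r"^(?P<maker>W)(?P<family>S)(?P<organization>\d+K\d+)"
    r"(?P<mark>V)(?P<access_ns>\d+)(?P<package>G\d+)(?P<grade>M)$"
)

def decode_part(part_number):
    """Split a part number such as WS512K32V20G24M into its described fields."""
    match = PART_PATTERN.match(part_number)
    if match is None:
        raise ValueError(f"unrecognized part number: {part_number!r}")
    depth_k, width_bits = match["organization"].split("K")
    return {
        "manufacturer": "White Electronic Design",
        "type": "SRAM",
        "depth_k": int(depth_k),            # 512 -> 512K locations
        "width_bits": int(width_bits),      # 32-bit wide access
        "improvement_mark": match["mark"],
        "access_time_ns": int(match["access_ns"]),
        "package": match["package"],
        "grade": "military",
    }

if __name__ == "__main__":
    for field, value in decode_part("WS512K32V20G24M").items():
        print(f"{field}: {value}")
```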
In the paper "Extreme latchup susceptibility in modern commercial-off-the-shelf (COTS) monolithic 1M and 4M CMOS static random-access memory (SRAM) devices", Joe Benedetto reports that these SRAM packages are very susceptible to “latchup”, a condition that requires power cycling to return to operation and can be permanent in some cases. Steven McClure, leader of the Radiation Effects Group at NASA’s Jet Propulsion Laboratory, reports that these SRAM parts would be very unlikely to be approved for use at JPL (Did Bad Memory Chips Down Russia’s Mars Probe).
It is rare that even two failures will lead to disaster and this case is no exception. Upon double failure of the flight control systems, the spacecraft autonomously goes into “safe mode” where the vehicle attempts to stay stable in low-earth orbit and orients its solar cells towards the sun so that it continues to have sufficient power. This is a common design pattern where the system is able to stabilize itself in an extreme condition to allow flight control personnel back on earth to figure out what steps to take to mitigate the problem. In this case, the mitigation was likely fairly simple: restarting both computers (which probably happened automatically) and then restarting the mission would likely have been sufficient.
Unfortunately there was still one more failure, this one a design fault. When the spacecraft goes into safe mode, it is incapable of communicating with earth stations, probably due to spacecraft orientation. Essentially if the system needs to go into safe mode while it is still in earth orbit, the mission is lost because ground control will never be able to command it out of safe mode.
I find this last fault fascinating. It seems that smart people could never make such an obvious mistake, and yet this sort of design flaw shows up all the time on large systems. Experts in each vertical area or component do good work. But the interactions across vertical areas are complex and, if there is not sufficiently deep, cross-vertical-area technical expertise, these design flaws may not get seen. Good people design good components and yet there often exist obvious fault modes across components that get missed.
Systems complex enough to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) very experienced, highly-skilled engineer(s) on the team focusing on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.
The faults that follow from complexity blindness are often serious and depressingly easy to see in retrospect, as was the case in this example.
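One way to make a cross-component property like this harder to miss is to write it down as an automated check that runs against the system design. The sketch below is entirely hypothetical: the mode names, the two attributes, and the design table are my own simplifications for illustration and say nothing about how Phobos-Grunt was actually modeled or tested.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    name: str
    ground_link: bool       # can ground stations command the vehicle in this mode?
    autonomous_exit: bool   # can the vehicle leave this mode without ground help?

def stranded_modes(modes):
    """Return names of modes that need a ground command to exit but cannot receive one."""
    return [m.name for m in modes if not m.ground_link and not m.autonomous_exit]

# Hypothetical design table, loosely shaped like the fault described above.
HYPOTHETICAL_DESIGN = [
    Mode("nominal_cruise", ground_link=True, autonomous_exit=True),
    Mode("safe_mode_in_earth_orbit", ground_link=False, autonomous_exit=False),
]

if __name__ == "__main__":
    for name in stranded_modes(HYPOTHETICAL_DESIGN):
        print(f"design fault: no way out of {name}")
```

The check itself is trivial. The value is in forcing the communications and fault-handling teams to fill in the same table.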
Summarizing some of the lessons from this loss: The SRAM chip probably was a poor choice. The computer systems should restart, scrub memory for faults, and be able to detect corrupt code and reload it from secondary locations before going into safe-mode. Safe-mode has to actually allow mitigating actions to be taken from a ground station or it is useless. Software systems should be constantly scrubbing memory for faults and check-summing the running software for corruption. A tiny amount of processor power spent on continuous, redundant checking and a few more lines of code to implement simple recovery paths when a fault is encountered may have saved the mission. Finally, we all have to remember the old adage “nothing works if it is not tested.” Every major fault has to be tested. Error paths are the ones most commonly left untested, so it is particularly important to focus on them. The general rule is to keep error paths simple, use the fewest possible, and test frequently.
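The scrub-and-reload idea above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not a description of any flight software: the class name, the use of CRC32 as the checksum, and the number of secondary copies are all hypothetical choices. Note that the expected checksum here is itself held in ordinary memory, which a real design would also need to protect.

```python
import zlib

class ProtectedImage:
    """A code image kept as a primary copy plus redundant secondary copies."""

    def __init__(self, image: bytes, secondaries: int = 2):
        self.expected_crc = zlib.crc32(image)
        self.primary = bytearray(image)
        self.secondary = [bytearray(image) for _ in range(secondaries)]

    def _is_good(self, copy) -> bool:
        return zlib.crc32(copy) == self.expected_crc

    def scrub(self) -> str:
        """Verify the primary copy; repair it from a good secondary if needed."""
        if self._is_good(self.primary):
            return "ok"
        for copy in self.secondary:
            if self._is_good(copy):
                self.primary[:] = copy      # simple, brute-force recovery path
                return "repaired"
        return "unrecoverable"              # only now fall back to safe mode

if __name__ == "__main__":
    image = ProtectedImage(b"flight control code" * 4)
    image.primary[3] ^= 0x10                # simulate a memory fault
    print(image.scrub())
```

Running `scrub()` on a timer costs one checksum pass over the image and gives a single recovery path that is easy to test.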
Back in 2007, I wrote up a set of best practices on software design, testing, and operations of high scale systems: On Designing and Deploying Internet-Scale Services. This paper targets large-scale services but it’s surprising to me that some, and perhaps many, of the suggestions could be applied successfully to a complex space flight system. The common theme across these two only partly-related domains is that the biggest enemy is complexity, and the exploding number of failure modes that follow from that complexity.
This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near-infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that inject bit flips and corruption, and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
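As a small illustration of a test system that injects bit flips, the sketch below seals each data block with a checksum and then flips every bit in turn to confirm the corruption is caught. The function names and the choice of CRC32 are my own assumptions for the example. A production system would pick its checksum and injection strategy to match its own failure modes.

```python
import zlib

def seal(block: bytes) -> bytes:
    """Append a 4-byte CRC32 to a data block."""
    return block + zlib.crc32(block).to_bytes(4, "big")

def open_sealed(sealed: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    payload, stored = sealed[:-4], sealed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise ValueError("checksum mismatch: block is corrupt")
    return payload

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def undetected_single_bit_faults(block: bytes) -> int:
    """Flip each bit of the sealed block in turn and count flips that go unnoticed."""
    sealed = seal(block)
    missed = 0
    for bit in range(len(sealed) * 8):
        try:
            open_sealed(flip_bit(sealed, bit))
            missed += 1                 # corruption slipped through
        except ValueError:
            pass                        # corruption was detected
    return missed

if __name__ == "__main__":
    print(undetected_single_bit_faults(b"telemetry frame 0001"))
```

The same harness can be pointed at any code path that reads sealed blocks, which is a cheap way to exercise error paths that otherwise never run in testing.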
To dig deeper in the Phobos-Grunt loss:
b: http://blog.mvdirona.com/