Saturday, February 07, 2009

Last July, Facebook released Cassandra to open source under the Apache license: Facebook Releases Cassandra as Open Source.  Facebook uses Cassandra as its email search system where, as of last summer, they had 25TB and over 100m mailboxes. This video gets into more detail on the architecture and design. My notes are below if you don’t feel like watching the video.

·         Authors:

o   Prashant Malik

o   Karthik Ranganathan

o   Avinash Lakshman

·         Structured storage system over P2P (keys are consistent-hashed over servers)

·         Initially aimed at email inbox search problem

·         Design goals:

o   Cost Effective

o   Highly Available

o   Incrementally Scalable

o   Efficient data layout

o   Minimal administration

·         Why Cassandra

o   MySQL drives too many random I/Os

o   File-based solutions require far too many locks

·         What is Cassandra

o   Structured storage over a distributed cluster

o   Redundancy via replication

o   Supports append/insert without reads

o   Supports a caching layer

o   Supports Hadoop operations

·         Cassandra Architecture

o   Core Cassandra Services:

§  Messaging (async, non-blocking)

§  Failure detector

§  Cluster membership

§  Partitioning scheme

§  Replication strategy

o   Cassandra Middle Layer

§  Commit log

§  Mem-table

§  Compactions

§  Hinted handoff

§  Read repair

§  Bootstrap

o   Cassandra Top Layer

§  Key, block, & column indexes

§  Read consistency

§  Touch cache

§  Cassandra API

§  Admin API


o   Above the top layer:

§  Tools

§  Hadoop integration

§  Search API and Routing

·         Cassandra Data Model

o   Key (uniquely specifies a “row”)

§  Any arbitrary string

o   Column families are declared or deleted in advance by administrative action

§  Columns can be added or deleted dynamically

§  Column families have attributes:

·         Name: arbitrary string

·         Type: simple,

o   Key can “contain” multiple column families

§  No requirement that two keys have any overlap in columns

o   Columns can be added or removed arbitrarily from column families

o   Columns:

§  Name: arbitrary string

§  Value: non-indexed blob

§  Timestamp (client provided)

o   Column families have sort orders

§  Time-based sort or name-based sort

o   Super-column families:

§  Bigtable calls them locality groups

§  Super-column families have a sort order

§  Essentially a multi-column index

o   System column families

§  For internal use by Cassandra

o   Example from email application

§  Mail-list (sorted by name)

·         All mail that includes a given word

§  Thread-list (sorted by time)

·         All threads that include a given word

§  User-list (sorted by time)

·         All mail that includes a given word, for a given user
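
The data model above can be sketched as a nested map: row key to column family to column name to (value, timestamp). A minimal illustration; class and method names like `ColumnFamily.insert` are mine, not the Cassandra API:

```python
import time

class ColumnFamily:
    """A map of column name -> (value, timestamp) with a declared sort order."""
    def __init__(self, sort_by="name"):
        self.sort_by = sort_by          # "name" or "time", per the notes
        self.columns = {}               # column name -> (value, timestamp)

    def insert(self, name, value, timestamp=None):
        # Timestamps are client-provided; default to wall clock for the sketch.
        self.columns[name] = (value, timestamp or time.time())

    def sorted_columns(self):
        if self.sort_by == "time":
            return sorted(self.columns.items(), key=lambda kv: kv[1][1])
        return sorted(self.columns.items())

# A "row": key -> multiple column families; no schema overlap required
# between keys, and columns can be added to a family at any time.
row = {"user123": {"Mail-list": ColumnFamily(sort_by="name"),
                   "Thread-list": ColumnFamily(sort_by="time")}}
row["user123"]["Mail-list"].insert("cassandra", "msg-id-77", timestamp=1)
row["user123"]["Mail-list"].insert("apache", "msg-id-12", timestamp=2)
print([name for name, _ in row["user123"]["Mail-list"].sorted_columns()])
# name-sorted: ['apache', 'cassandra']
```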

·         Cassandra API

o   Simple get/put model

·         Write model:

o   Quorum write or async mode (used by email application)

o   Async: send request to any node

§  That node will push the data to appropriate nodes but return to client immediately

o   Quorum write:

§  Blocks until quorum is reached

o   If a node is down, then write to another node with a hint saying where it should be written to

§  A harvester goes through every 15 minutes, finds hints, and moves the data to the appropriate node

o   At write time, you first write to a commit log (sequential)

§  After write to log it is sent to the appropriate nodes

§  Each node receiving write first records it in a local log

·         Then makes update to appropriate memtables (1 for each column family)

§  Memtables are flushed to disk when:

·         Out of space

·         Too many keys (128 is default)

·         Time duration (client provided – no cluster clock)

§  When memtables are written out, two files go out:

·         Data File

·         Index File

o   Key, offset pairs (points into data file)

o   Bloom filter (all keys in data file)

§  When a commit log has had all its column families pushed to disk, it is deleted

·         Data files accumulate over time.  Periodically, data files are merge-sorted into a new file (and a new index is created)
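
The write path above (sequential commit log, memtable, threshold-triggered flush to a sorted data file with a key index and Bloom filter) can be sketched as follows. This is a toy model with illustrative names and thresholds, not Cassandra’s actual code:

```python
import hashlib

class SimpleBloom:
    """Tiny Bloom filter standing in for the per-data-file key filter."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap |= 1 << p

    def might_contain(self, key):
        return all(self.bitmap >> p & 1 for p in self._positions(key))

class Node:
    """One storage node: sequential commit log -> memtable -> flush."""
    def __init__(self, flush_threshold=3):
        self.commit_log = []          # append-only, sequential I/O only
        self.memtable = {}            # in reality, one per column family
        self.flush_threshold = flush_threshold
        self.data_files = []          # each entry: (sorted rows, index, bloom)

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. record in the local log
        self.memtable[key] = value              # 2. update the memtable
        if len(self.memtable) >= self.flush_threshold:   # "too many keys"
            self._flush()

    def _flush(self):
        rows = sorted(self.memtable.items())
        index = {k: off for off, (k, _) in enumerate(rows)}  # key -> offset
        bloom = SimpleBloom()
        for k, _ in rows:
            bloom.add(k)
        self.data_files.append((rows, index, bloom))
        self.memtable.clear()
        self.commit_log.clear()   # log segment deletable once flushed
```

Note there are no locks and no random I/O anywhere in this path, which is exactly the write-properties claim in the notes.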

·         Write properties:

o   No locks in critical path

o   Sequential disk access only

o   Behaves like a write-through cache

§  If you read from the same node, you see your own writes.  There doesn’t appear to be any guarantee that a read sees the latest change in the failure case

o   Atomicity guaranteed for a key

o   Always writable

·         Read Path:

o   Connect to any node

o   That node will route to the closest data copy, which services the read immediately

o   If high consistency required, don’t return from local immediately

§  First send digest request to all replicas

§  If delta is found, the updates are sent to the nodes that don’t have current data (read repair)
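
The high-consistency read with read repair can be sketched like this. It is a simplification: real replicas exchange digests over the network, while here they are plain dicts of key to (value, timestamp):

```python
import hashlib

def digest(value):
    """Stand-in for the digest a replica returns instead of the full data."""
    return hashlib.md5(repr(value).encode()).hexdigest()

def read(key, replicas, high_consistency=False):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    local = replicas[0]                  # the closest copy services the read
    result = local.get(key)
    if high_consistency:
        # Ask every replica for a digest; on any mismatch, take the version
        # with the newest timestamp and push it to stale replicas: read repair.
        versions = [r[key] for r in replicas if key in r]
        newest = max(versions, key=lambda v: v[1])
        for r in replicas:
            if digest(r.get(key)) != digest(newest):
                r[key] = newest          # repair the stale copy
        result = newest
    return result
```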

·         Replication supported via multiple consistent hash rings:

o   Servers are hashed over ring

o   Keys are hashed over ring

o   Redundancy via walking around the ring and placing on the next node (rack position unaware) or on the next node on a different rack (rack aware) or on a next system in a different data center (implication being that the ring can span data centers)
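
A minimal sketch of the replica-placement walk described above, in its rack-unaware form. `preference_list` is my name for the operation; rack and data-center awareness would simply filter candidate nodes during the walk:

```python
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring: servers and keys hash into the same space,
    and replicas come from walking clockwise around the ring."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key):
        hashes = [h for h, _ in self.points]
        # First ring position clockwise of the key's hash (wrapping).
        i = bisect.bisect(hashes, ring_hash(key)) % len(self.points)
        chosen = []
        while len(chosen) < min(self.replicas, len(self.points)):
            node = self.points[i][1]
            if node not in chosen:
                chosen.append(node)
            i = (i + 1) % len(self.points)
        return chosen
```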

·         Cluster membership

o   Cluster membership and failure detection via gossip protocol

·         Accrual failure detector

o   Default sets PHI to 5 in Cassandra

o   Detection is 10 to 15 seconds with PHI=5
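
An accrual failure detector reports a continuous suspicion level (PHI) that grows the longer a heartbeat is overdue, rather than a boolean alive/dead verdict. A rough sketch, assuming exponentially distributed heartbeat inter-arrival times (the real detector fits the observed arrival distribution):

```python
import math

class PhiAccrualDetector:
    def __init__(self, threshold=5.0):      # PHI=5 is the Cassandra default
        self.threshold = threshold
        self.intervals = []                 # observed heartbeat gaps
        self.last = None                    # time of last heartbeat

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # P(next heartbeat arrives later than this | node alive),
        # under the exponential inter-arrival assumption:
        p_later = math.exp(-(now - self.last) / mean)
        return -math.log10(max(p_later, 1e-300))

    def suspect(self, now):
        return self.phi(now) > self.threshold
```

With one heartbeat per second, phi stays well below 5 shortly after the last heartbeat and climbs past it as tens of seconds pass silently, matching the 10-15 second detection figure in spirit.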

·         UDP control messages and TCP for data messages

·         Complies with Staged Event Driven Architecture (SEDA)

·         Email system:

o   100m users

o   4B threads

o   25TB with 3x replication

o   Uses joins across 4 tables:

§  Mailbox (user_id to thread_id mapping)

§  Msg_threads (thread to subject mapping)

§  Msg_store (thread to message mapping)

§  Info (user_id to user name mapping)

·         Able to load using Hadoop at 1.5TB/hour

o   Can load 25TB at network bandwidth over Cassandra Cluster


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Saturday, February 07, 2009 11:16:20 AM (Pacific Standard Time, UTC-08:00)

Thursday, January 29, 2009

I did the final day keynote at the Conference on Innovative Data Systems Research earlier this month.  The slide deck is based upon the CEMS paper, The Case for Low-Cost, Low-Power Servers, but it also includes a couple of techniques I’ve talked about before that I think are super useful:

·         Power Load Management: The basic idea is to oversell power, the most valuable resource in a data center, just as airlines oversell seats, their revenue-producing asset. The conventional approach is to take the data center critical power (total power less power distribution losses and mechanical loads), de-rate it by 10 to 20% to play it safe since utility over-draw brings high cost, and then provision servers to this de-rated critical power level. But the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter, but that’s another discussion), so there is close to zero chance that all servers will draw their full load.  And, with some diversity of workloads, even with some services spiking to 100%, we can often exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than we have critical power. This idea was originally proposed by Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Barroso (all of Google) in Power Provisioning in a Warehouse-Sized Computer. It’s a great paper.

·         Resource Consumption Shaping: This extends the idea above, applying yield-management techniques to all resources in the data center rather than just power. The key observation here is that nearly all resources in a data center are billed at peak.  Power, networking, server counts, etc. all bill at peak. So we can play two fairly powerful tricks: 1) exploit workload heterogeneity and over-subscribe all resources just as we did with power in Power Load Management above, and 2) move peaks to valleys to further reduce costs, exploiting the fact that the resource valleys are effectively free. This is an idea that Dave Treadwell and I came up with a couple of years back and it’s written up in more detail in Resource Consumption Shaping.
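
The over-subscription arithmetic behind both ideas can be sketched in a few lines. All numbers here are hypothetical, chosen only to show the computation:

```python
# Hypothetical facility: 10 MW critical power, de-rated 15% for safety,
# 300 W nameplate servers.
critical_power_w = 10_000_000
derate = 0.15
server_nameplate_w = 300

# Conventional provisioning: every server assumed to draw full nameplate.
conventional = int(critical_power_w * (1 - derate) / server_nameplate_w)

# Over-subscribed provisioning: if measured worst-case simultaneous draw
# is only 65% of aggregate nameplate, proportionally more servers fit.
observed_peak_fraction = 0.65
oversubscribed = int(critical_power_w * (1 - derate) /
                     (server_nameplate_w * observed_peak_fraction))
print(conventional, oversubscribed)   # 28333 43589
```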


The slide deck I presented at the CIDR conference is at:






Thursday, January 29, 2009 5:58:33 AM (Pacific Standard Time, UTC-08:00)

Tuesday, January 27, 2009

Wow, 2TB for $250 from Western Digital: Once it’s shipping in North America, I’ll have to update The Cost of Bulk Cold Storage.

Update: Released in the US at $299: Western Digital's 2TB Caviar Green hard drive launches, gets previewed.


Sent my way by Savas Parastatidis.






Tuesday, January 27, 2009 5:28:33 AM (Pacific Standard Time, UTC-08:00)

Sunday, January 25, 2009

In Microslice Servers and the Case for Low-Cost, Low-Power Servers, I observed that CPU bandwidth is outstripping memory bandwidth. Server designers can address this by: 1) designing better memory subsystems or 2) reducing the CPU per-server.  Optimizing for work done per dollar and work done per joule argues strongly for the second approach for many workloads.


In Low Power Amdahl Blades for Data Intensive Computing (Amdahl Blades-V3.pdf (84.25 KB)), Alex Szalay makes a related observation and arrives at a similar point.  He argues that server I/O requirements for data-intensive computing clusters grow in proportion to CPU performance. As per-server CPU performance continues to increase, we need to add additional I/O capability to each server.  We can add more disks, but this drives up both power and cost as more disks require more I/O channels. Another approach is to use generation-2 flash SSDs such as the Intel X25-E or the OCZ Vertex (I’m told the Samsung 2.06Gb/s (SLC) is also excellent but I’ve not yet seen their random-write IOPS rates). Both the OCZ and the Intel components are excellent performers, nearing FusionIO but at a far better price point, making them considerably superior in work done per dollar.


The Szalay paper looks first at the conventional approach of adding flash SSDs to a high-end server. To get the required I/O rates, three high-performance SSDs would be needed.  But, to get full I/O rates from the three devices, three I/O channels would be needed, which drives up power and cost. What if we head the other way and, rather than scaling up the I/O sub-system, we scale down the CPU per server? Alex shows that a low-power, low-cost commodity board coupled with a single high-performance flash SSD would form an excellent building block for a data-intensive cluster. It’s a very similar direction to CEMS servers but applied to data-intensive workloads.


One of the challenges of low-power, high-density servers along the lines proposed by Alex and me is network cabling.  With CEMS there are 240 servers/rack and a single top-of-rack switch is inadequate, so we go with a mini-switch per six-server tray and each of 40 trays connected to a top-of-rack switch.  The Low Power Amdahl Blades are denser still. Alex makes a more radical proposal: interconnect the rack using very short-range radio. From the paper,


Considering their compact size and low heat dissipation, one can imagine building clusters of thousands of low-power Amdahl blades. In turn, this high density will create challenges related to interconnecting these blades using existing communication technologies (i.e., Ethernet, complex wiring if we have 10,000 nodes). On the other hand, current and upcoming high-speed wireless communications offer an intriguing alternative to wired networks. Specifically, current wireless USB radios (and their WLP IP-based variants) offer point-to-point speeds of up to 480 Mbps over small distances (~3-10 meters). Further into the future, 60 GHz-based radios promise to offer Gbps of wireless bandwidth.


I’m still a bit skeptical that we can get rack-level radio networking to be a win in work done per dollar and work done per joule, but it is intriguing and I’m looking forward to getting into more detail on this approach with Alex.



Remember, it’s work done per dollar and work done per joule that we should be chasing.  And, in optimizing for these metrics, we increasingly face challenges of insufficient I/O and memory bandwidth per core. Both CEMS and Low-Power Amdahl Blades address the system-balance issue by applying more low-power servers rather than adding more I/O and memory bandwidth to each server.


It’s the performance of the aggregate cluster we care about, and work done per dollar and work done per joule are the final arbiters.






Sunday, January 25, 2009 7:28:18 AM (Pacific Standard Time, UTC-08:00)

Friday, January 23, 2009

In The Case For Low-Power Servers I reviewed the Cooperative, Expendable, Micro-slice Servers project.  CEMS is a project I had been doing in my spare time, investigating the use of low-power, low-cost servers for running internet-scale workloads. The core premises of the CEMS project: 1) servers are out of balance, 2) client and embedded volumes, and 3) performance is the wrong metric.

Out-of-Balance Servers:  The key point is that CPU bandwidth is increasing far faster than memory bandwidth (see page 7 of Internet-Scale Service Efficiency).  CPU performance continues to improve at roughly historic rates.  Core-count increases have replaced the previous reliance on frequency increases, but performance improvements continue unabated.  As a consequence, CPU performance is outstripping memory bandwidth, with the result that more and more cycles are spent in pipeline stalls. There are two broad approaches to this problem: 1) improve the memory subsystem, and 2) reduce CPU performance. The former drives up design cost and consumes more power. The latter is a counter-intuitive approach: just run the CPU slower.


The CEMS project investigates using low-cost, low-power client and embedded CPUs to produce better price-performing servers.  The core observation is that internet-scale workloads are partitioned over 10s to 1000s of servers.  Running more, slightly slower servers is an option if it produces better price-performance. Raw, single-server performance is neither needed nor the most cost-effective goal.


Client and Embedded Volumes: It’s always been a reality of the server world that volumes are relatively low.  Clients and embedded devices are sold at an over 10^9 annual clip.  Volume drives down costs.  Servers leveraging client and embedded volumes can be MUCH less expensive and still support the workload.


Performance is the wrong metric: Most servers are sold on the basis of performance, but I’ve long argued that single-dimensional metrics like raw performance are the wrong measure. What we need to optimize for is work done per dollar and work done per joule (a watt-second). In a partitioned workload running over many servers, we shouldn’t care about or optimize for single-server performance. What’s relevant is work done/$ and work done/joule. The CEMS project investigates optimizing for these metrics rather than raw performance.


Using work done/$ and work done/joule as the optimization point, we tested a $500/slice server design on a high-scale production workload and found nearly 4x improvement over the current production hardware.
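
The metrics themselves are simple ratios. Here is an illustrative comparison; the numbers are invented to show the computation, not the measured CEMS results:

```python
# Illustrative work-done-per-dollar and work-done-per-joule comparison.
def work_per_dollar(rps, server_cost_dollars):
    return rps / server_cost_dollars

def work_per_joule(rps, watts):
    # RPS per watt is requests per joule: (req/s) / (J/s) = req/J
    return rps / watts

big_server = {"rps": 1000, "cost": 4000, "watts": 400}   # hypothetical SKU
cems_slice = {"rps": 350, "cost": 500, "watts": 50}      # hypothetical slice

dollar_ratio = (work_per_dollar(cems_slice["rps"], cems_slice["cost"]) /
                work_per_dollar(big_server["rps"], big_server["cost"]))
joule_ratio = (work_per_joule(cems_slice["rps"], cems_slice["watts"]) /
               work_per_joule(big_server["rps"], big_server["watts"]))
print(round(dollar_ratio, 2), round(joule_ratio, 2))   # 2.8 2.8
```

The point: even though the single slice is slower in raw RPS, it wins on both of the metrics that matter for a partitioned workload.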

Earlier this week Rackable Systems announced Microslice Architecture and Products.  These servers come in at $500/slice and optimize for work done/$ and work done/joule. I particularly like this design in that it’s using client/embedded CPUs but includes full ECC memory, and the price/performance is excellent.  These servers will run partitionable workloads like web serving extremely cost effectively.





Friday, January 23, 2009 6:23:00 AM (Pacific Standard Time, UTC-08:00)

Monday, January 19, 2009

I recently stumbled across: Snippets on Software.  It’s a collection of mini-notes on software with links to more if you are interested in more detail. Some snippets are wonderful, some clearly aren’t exclusive to software and some I would argue are just plain wrong. Nonetheless, it’s a great list.


It’s too long to read from end-to-end in one sitting but it’s well worth skimming. Below are a few snippets that I enjoyed, to whet your appetite:


"there is only one consensus protocol, and that's Paxos" - all other approaches are just broken versions of Paxos. – Mike Burrows


Conway’s Law: Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.


"Reading, after a certain age, diverts the mind too much from its creative pursuits. Any man who reads too much and uses his own brain too little falls into lazy habits of thinking." -- Albert Einstein


A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects. -Robert A. Heinlein

"You can try to control people, or you can try to have a system that represents reality. I find that knowing what's really happening is more important than trying to control people." -- Larry Page


Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?  --Brian Kernighan






Monday, January 19, 2009 9:55:01 AM (Pacific Standard Time, UTC-08:00)

Wednesday, January 14, 2009

The Conference on Innovative Data Systems Research was held last week at Asilomar California. It’s a biennial systems conference.  At the last CIDR, two years ago, I wrote up Architecture for Modular Data Centers where I argued that containerized data centers are an excellent way to increase the pace of innovation in data center power and mechanical systems and are also a good way to grow data centers more cost effectively with a smaller increment of growth.


Containers have both supporters and detractors, and it’s probably fair to say that the jury is still out.  I’m not stuck on containers as the only solution, but any approach that supports smooth, incremental data center expansion is interesting to me. There are some high-scale modular deployments in the works (First Containerized Data Center Announcement) so, as an industry, we’re starting to get some operational experience with the containerized approach.


One of the arguments that I made in the Architecture for Modular Systems paper was that fail-in-place might be the right approach to server deployment. In this approach, a module of servers (multiple server racks) is deployed and, rather than servicing servers as they fail, the overall system capacity just slowly degrades. As each server fails, it is shut off but not serviced. Most data centers are power-limited rather than floor-space limited.  Allowing servers to fail in place trades off space, which we have in abundance, for higher-efficiency service. Rather than servicing systems as they fail, just let them fail in place and, when the module’s healthy-server density gets too low, send it back for remanufacturing at the OEM, who can do it faster, cheaper, and recycle all that is possible.


Fail-in-place (Service Free Systems) was by far the most debated part of the modular datacenter work. But it did get me thinking about how cheaply a server could be delivered. And, over time, I’ve become convinced that optimizing for server performance is silly. What we should be optimizing for is work done/$ and work done/joule (a watt-second). Taking those two optimization points with a goal of a sub-$500 server led to the Cooperative, Expendable, Micro-Slice Server project that I wrote up for this year’s CIDR.


In this work, we took an existing very high-scale web property (many thousands of servers) and ran their production workload on the servers currently in use. We compared the server SKU currently being purchased with a low-cost, low-power design, using work done/$ and work done/joule as the comparison metrics. Using this $500 server design, we were able to achieve:


·         RPS/dollar: 3.7x

·         RPS/Joule: 3.9x

·         RPS/Rack: 9.4x


Note that I’m not a huge fan of gratuitous density (density without customer value).  See Why Blade Servers aren’t the Answer to all Questions for the longer form of this argument. I show density here only because many find it interesting, it happens to be quite high and, in this case, did not bring a cost penalty.


The paper is at:


Abstract:  CEMS evaluates low-cost, low-power servers for high-scale internet services using commodity, client-side components. It is a follow-on project to the 2007 CIDR paper Architecture for Modular Data Centers. The goals of the CEMS project are to establish that low-cost, low-power servers produce better price/performance and better power/performance than current purpose-built servers. In addition, we aim to establish the viability and efficiency of a fail-in-place model. We use work done per dollar and work done per joule as measures of server efficiency, show that more, lower-power servers produce the same aggregate throughput much more cost effectively, and use measured performance results from a large consumer internet service to argue this point.


Thanks to Giovanni Coglitore and the rest of the Rackable Systems team for all their engineering help with this work.



Wednesday, January 14, 2009 4:39:39 PM (Pacific Standard Time, UTC-08:00)

Sunday, January 11, 2009

Last night, TechCrunch hosted The Crunchies and two of my favorite services got awards. Ray Ozzie and David Treadwell accepted Best Technology Innovation/Achievement for Windows Live Mesh.  Amazon CTO Werner Vogels accepted Best Enterprise Startup for Amazon Web Services.


Also awarded (from

Best Application Or Service

Get Satisfaction
Google Reader (winner)
MySpace Music (runner-up)

Best Technology Innovation/Achievement

Facebook Connect (runner-up)
Google Friend Connect
Google Chrome
Windows Live Mesh (winner)
Yahoo BOSS

Best Design

Animoto (runner-up)
Cooliris (winner)

Best Bootstrapped Startup

GitHub (winner)
StatSheet (runner-up)

Most Likely To Make The World A Better Place

GoodGuide (winner)
Kiva (runner-up)
Better Place

Best Enterprise Startup
Amazon Web Services
Google App Engine (runner-up)

Best International Startup

eBuddy (winner)
Wuala (runner-up)

Best Clean Tech Startup

Better Place (runner-up)
Boston Power
Laurus Energy
Project Frog (winner)

Best New Gadget/Device

Android G1 (runner-up)
Asus Eee 1000 Series
Flip MinoHD
iPhone 3G (winner)

Best Time Sink Site/Application

Mob Wars
Tap Tap Range (winner)
Texas Hold Em (runner-up)

Best Mobile Startup

ChaCha (runner-up)
Evernote (winner)
Qik
Skyfire

Best Mobile Application

Google Mobile Application (runner-up)
imeem mobile (winner)
Pandora Radio

Best Startup Founder

Linda Avey and Anne Wojcicki (23andMe)
Michael Birch and Xochi Birch (Bebo)
Robert Kalin (Etsy)
Evan Williams, Jack Dorsey, Biz Stone (Twitter) (winner)
Paul Buchheit, Jim Norris, Sanjeev Singh, Bret Taylor (FriendFeed) (runner-up)

Best Startup CEO

Tony Hsieh (Zappos)
Jason Kilar (Hulu) (runner-up)
Elon Musk (SpaceX)
Andy Rubin (Android)
Mark Zuckerberg (Facebook) (winner)

Best New Startup Of 2008
FriendFeed (winner)
Topspin Media

Best Overall Startup In 2008

Amazon Web Services
Facebook (winner)
Twitter (runner-up)



Sunday, January 11, 2009 8:52:35 AM (Pacific Standard Time, UTC-08:00)

Saturday, January 10, 2009

Back in 2000, Joel Spolsky published a set of 12 best practices for a software development team. It’s been around for a long while now and there are only 12 points but it’s very good. Simple, elegant, and worth reading: The Joel Test: 12 Steps to Better Code.


Thanks to Patrick Niemeyer for sending this one my way.





Saturday, January 10, 2009 8:12:40 AM (Pacific Standard Time, UTC-08:00)

Thursday, January 01, 2009

Earlier in the week, there was an EE Times posting, Server Makers get Googled, and a follow-up post from Gigaom, How Google Is Influencing Server Design.  I’ve long been an advocate of making industry-leading server designs more available to smaller data center operators since, in aggregate, they are bigger power consumers and have more leverage as a group.  The key design aspects brought up in these two articles:

·         Higher data center temperatures

·         12V-only power supplies

·         Two servers on a board


An earlier article from The Register back in October, Google Demanding Intel’s Hottest Chips, sourced an ex-Google employee who clearly wasn’t involved with Google’s data center or server design teams.  The details are often incorrect, but the article brought up two more issues of interest:

·         High-temperature processors

·         Containerized data center design.


Let’s look at each of these five issues in more detail.


Higher Data Center Temperatures: A 1.7 PUE data center is a good, solid design – not even close to industry-leading but better than most small-scale data centers.  By definition, a 1.7 PUE facility delivers 59% of total data center power draw to the IT load: the servers, networking gear, storage, etc.  From Where Does the Power Go and What to do About It we know that the losses in power distribution are around 8%. By subtraction, 33% of all power delivered to a data center is consumed by cooling. Broadly speaking, there are two big ways to address giving up 1/3 of all the power consumed by a data center on cooling. The first is to invest in more efficient mechanical systems and the second is to simply do less cooling: essentially, run the data center hotter.
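
The arithmetic in that paragraph is easy to check; the 8% distribution-loss figure comes from the text above:

```python
# Checking the PUE arithmetic: PUE = total facility power / IT power.
pue = 1.7
it_fraction = 1 / pue              # fraction of total power reaching IT load
distribution_loss = 0.08           # ~8% lost in power distribution (from text)
cooling_fraction = 1 - it_fraction - distribution_loss
print(round(it_fraction, 2), round(cooling_fraction, 2))   # 0.59 0.33
```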


Running the data center hotter is a technique with huge upside potential and it’s good to see the industry starting to rally around this approach. In a recent story by Data Center Knowledge, Google recommends operating data centers at higher temperatures than the norm. "The guidance we give to data center operators is to raise the thermostat," Google energy program manager Erik Teetzel told Data Center Knowledge. "Many data centers operate at 70 [Fahrenheit] degrees or below. We'd recommend looking at going to 80 [Fahrenheit] degrees."


Generally, there are two limiting factors to raising DC temperatures: 1) server component failure points, and 2) the precision of temperature control. We’ll discuss the component failure point more below in “high temperature processors”.  Precision of temperature control is potentially more important in that it limits how close we can safely get to the component failure point.  If the data center has very accurate control, say +/-2C, then we can run within 5C and certainly within 10C of the component failure point. If there is wide variance throughout the center, say +/-20C, then much more headroom must be maintained. 


Temperature management accuracy reduces risk and risk reduction allows higher data center temperatures.

12V-only Power Supplies: Most server power supplies are a disaster in two dimensions: 1) incredibly inefficient at rated load, and 2) much worse at less than rated load.  Server power supplies are starting to get the attention they deserve, but it’s still easy to find a supply that is only 80% efficient. Good supplies run in the 90 to 95% range, but customers weren’t insisting on high-efficiency supplies so they weren’t being used. This is beginning to change, and server vendors typically offer high-efficiency supplies either by default or as an extra-cost option.


As important as it is to have an efficient power supply at the server rated load, it’s VERY rare to have a server operate at anything approaching maximum rated load. Server utilizations are usually below 30% and often as poor as 10% to 15%. At these lower loads, power supply efficiency is often much lower than the quoted efficiency at full load.  There are two cures to this problem: 1) flatten the power supply efficiency curve so that efficiency at low load is much nearer to the efficiency at high load, and 2) move the peak efficiency down to the likely server operating load. The former is happening broadly.  I’ve not seen anyone doing the latter, but it’s a simple, easy-to-implement concept.
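
To see why the low-load part of the curve dominates, here is an illustrative calculation with a made-up efficiency curve; the specific numbers are hypothetical:

```python
# Hypothetical supply: quoted 90% efficient at full load, far worse at the
# 10-30% loads servers actually see. Curve points are invented.
efficiency_curve = {0.10: 0.65, 0.20: 0.78, 0.50: 0.88, 1.00: 0.90}

def wall_power(it_load_w, rated_w, curve):
    """Wall draw for a given delivered load, using the nearest curve
    point at or below the load fraction (sketch only)."""
    load_fraction = it_load_w / rated_w
    eff = curve[max(p for p in curve if p <= load_fraction)]
    return it_load_w / eff

# A 500 W supply delivering 100 W (20% load) draws ~128 W from the wall,
# not the ~111 W the quoted full-load 90% figure would suggest.
print(round(wall_power(100, 500, efficiency_curve)))
```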


Server fans, CPUs, and memory all run off the 12V power supply rails in most server designs.  Direct-attached storage uses both the 12V and 3.3V rails.  Standardizing the supply to produce only 12V and using high-efficiency voltage regulators close to the component loads is a good design for two reasons: 1) 12V-only supplies are slightly simpler, and simplicity allows more effort to be invested in efficiency, and 2) bringing 12V close to the components minimizes the within-the-server power distribution losses. IBM has done exactly this with their data-center-optimized iDataPlex servers.


Two Servers on a Board: Increasing server density by a factor of 2 is good but, generally, density is not the biggest problem in a data center (see Why Blade Servers aren’t the Answer to All Questions).  I am more excited by designs that lower costs by sharing components, so this is arguably a good thing even if you don’t care all that much about server density.


I just finished some joint work with Rackable Systems focusing on maximizing work done per dollar and work done per joule on server workloads.  This work shows improvements of over 3x on both metrics relative to existing server designs. And, as a side effect of working hard on minimizing costs, the design also happens to be very dense, with 6 servers per rack unit all sharing a single power supply. This work will be published at the Conference on Innovative Data Systems Research this month and I’ll post it here as well.


GigaOM had an interesting post reporting that Microsoft is getting server vendors to standardize on its components. They also report that Google’s custom server designs are beginning to influence server suppliers.  It’s good to see data-center-optimized designs becoming available to all customers rather than just high-scale purchasers.


High Temperature Processors:

The Register’s Google Demanding Intel’s Hottest Chips? reports

When purchasing  server processors directly from Intel, Google has insisted on a guarantee that the chips can operate at temperatures five degrees centigrade higher than their standard qualification, according to a former Google employee. This allowed the search giant to maintain higher temperatures within its data centers, the ex-employee says, and save millions of dollars each year in cooling costs.


Predictably, Intel denies this. And logic suggests that it’s probably not 100% accurate exactly as reported. Processors are not even close to the most sensitive component in a server. Memory is less heat tolerant than processors. Disk drives are less heat tolerant than memory. Batteries are less heat tolerant than disks. In short, processors aren’t the primary limiting factor in any server design I’ve looked at. However, as argued above, raising data center temperature will yield huge gains, and part of achieving these gains is better cooling designs and more heat-tolerant parts.


In this case, I strongly suspect that Google has asked all its component suppliers to step up to supporting higher data center ambient temperatures, but I doubt that Intel is sorting for temperature resistance and giving Google special parts. As a supplier, I suspect they have signed up to “share the risk” of higher DC temps with Google, but I doubt they are supplying special parts.


Raising DC temperatures is 100% the right approach and I would love to see the industry cooperate to achieve 40C data center temperatures. It’ll be good for the environment and good for the pocketbook.


Containerized Designs:

Also in Google Demanding Intel’s Hottest Chips?, the Register talks about Google’s work in containerized data centers, mentioning the Google Will-Power project. Years ago there was super-secret work at Google to build containerized data centers and a patent was filed. Will Whitted is the patent holder and hence the name Will-Power. However, Will reported in a San Francisco Chronicle article, O Googlers, Where Art Thou?, that the Google project was canceled years ago. It’s conceivable that Google has quietly continued the work, but our industry is small, secrets are not held particularly well given the number of suppliers involved, and this one has been quiet. I suspect Google didn’t continue with the modular designs. However, Microsoft has invested in modular data centers based upon containers in Chicago (First Containerized Data Center Announcement) and in the new fourth-generation design covered by GigaOm (Microsoft Reveals Fourth-Gen Data Center Design), my posting Microsoft Generation 4 Modular Data Centers, and the detailed Microsoft posting by Manos, Belady, and Costello: Our Vision for Generation 4 Modular Data Centers – One Way of Getting it Just Right.


I’ve been a strong proponent of containerized data centers (Architecture for a Modular Data Center), so it’s good to see this progress at putting modular designs into production. 


Thanks to Greg Linden for pointing these articles out to me. Greg’s blog, Geeking with Greg is one of my favorites.




James Hamilton
Amazon Web Services


Thursday, January 01, 2009 11:17:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Wednesday, December 31, 2008

In a previous posting, Pat Selinger IBM Ph.D. Fellowships, I mentioned Pat Selinger as one of the greats of the relational database world.  Working with Pat was one of the reasons why leaving IBM back in the mid-90’s was a tough decision for me.  In the December 2008 edition of the Communications of the ACM, an interview I did with Pat back in 2005 is published: Database Dialogue with Pat Selinger. It originally went out as an ACM Queue article.


If you haven’t checked out the CACM recently, you should. The new format is excellent and the articles are now worth reading. The magazine is regaining its position of decades ago as a must-read publication.


Thanks to Andrew Cencini for pointing me towards this one. I hadn’t yet read my December issues.


James Hamilton
Amazon Web Services

Wednesday, December 31, 2008 5:35:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Sunday, December 28, 2008

I’ve long argued that tough constraints often make for a better service and few services are more constrained than Wikipedia where the only source of revenue is user donations. I came across this talk by Domas Mituzas of Wikipedia while reading old posts on Data Center Knowledge.   The posting A Look Inside Wikipedia’s Infrastructure includes a summary of the talk Domas gave at Velocity last summer.  


Interesting points from the Data Center Knowledge posting and the longer document referenced below from the 2007 MySQL Conference:

·  Wikipedia serves the world from roughly 300 servers

o  200 application servers

o  70 Squid servers

o  30 Memcached servers (2GB each)

o  20 MySQL servers using InnoDB, each with 16GB of memory (200 to 300GB each)

o  They also use Squid, Nagios, dsh, nfs, Ganglia, Linux Virtual Service, Lucene over .net on Mono, PowerDNS, lighttpd, Apache, PHP, MediaWiki (originated at Wikipedia)

·  50,000 http requests per second

·  80,000 MySQL requests per second

·  7 million registered users

·  18 million objects in the English version


For the 2007 MySQL Users Conference, Domas posted great details on the Wikipedia architecture: Wikipedia: Site internals, configuration, code examples and management issues (30 pages). I’ve posted other big service scaling and architecture talks at:


James Hamilton
Amazon Web Services


Updated: Corrected formatting issue.

Sunday, December 28, 2008 7:04:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Saturday, December 27, 2008

Viraj Mody of the Microsoft Live Mesh team sent this my way: Dan Farino About MySpace Architecture.


MySpace, like Facebook, uses relational DBs extensively, front-ended by a layer of memcached servers. Less open source at MySpace but otherwise unsurprising – a nice scalable design with 3,000 front-end servers and well over 100 database servers (1M users per DB server).


Notes from Viraj:

·         ~3000 FEs running IIS 6

·         .NET 2.0 and 3.5 on FE and BE machines

·         DB is SQL Server 2005 but they hit scaling limits, so they built their own unmanaged memcache implementation on 64-bit machines; uses .NET for exposing communications with the layer

·         DB partitioned to assign ~1million users per DB and Replicated

·         Media content (audio/video) hosted on DFS built using Linux served over http

·         Extensive use of PowerShell for server management

·         Started using ColdFusion, moved when scale became an issue

·         Profiling tools build using CLR profiler and technology from Microsoft Research

·         Looking to upgrade code to use LINQ

·         Spent a lot of time building diagnostic utilities

·         Pretty comfortable with the 3-tier FE + memcache + DB architecture

·         Dealing with caching issues – not a pure write-through/read-through cache. Currently reads populate the cache while writes flush the cache entry and just write to the DB. Looking to update this, but it has worked well since it was ‘injected’ into the architecture.


I collect high scale service architecture and scaling war stories.  These were previously posted here:

·         Scaling Amazon:

·         Scaling Second Life:

·         Scaling Technorati:

·         Scaling Flickr:

·         Scaling Craigslist:

·         Scaling Findory:

·         MySpace 2006:

·         MySpace 2007:

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, SlideShare, and eBay:

·        Scaling LinkedIn:


James Hamilton
Amazon Web Services


Updated: Corrected formatting issue.

Saturday, December 27, 2008 3:12:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, December 24, 2008

Five or six years ago Bill Gates did a presentation to a small group at Microsoft on his philanthropic work at the Bill and Melinda Gates Foundation.  It was by far and away the most compelling talk I had seen in that it was Bill applying his talent to solving world health problems with the same relentless drive, depth of understanding, constant learning, excitement and focus with which he applied himself daily (at the time) at Microsoft.


Thanks to O’Reilly, I just watched an interview with Bill by Charlie Rose which has a lot of the same character, and Bill makes some of the same points as in that talk I saw some years ago. Find it in Dale Dougherty’s post Admiring Bill Gates. It’s well worth watching.



Wednesday, December 24, 2008 2:07:17 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Tuesday, December 23, 2008

Related to The Cost of Bulk Storage posting, Mike Neil dropped me a note. He's built an array based upon this Western Digital part. It’s unusually power-efficient:

Power Dissipation

·         Read/Write: 5.4 Watts

·         Idle: 2.8 Watts

·         Standby: 0.40 Watts

·         Sleep: 0.40 Watts


And it’s currently only $105: 


It’s always been the case that home storage is wildly cheaper than data center hosted storage. What excites me even more than the continued plunging cost of raw storage is that data center hosted storage is asymptotically approaching the home storage cost. Data center storage includes someone else doing capacity planning and buying new equipment when needed, someone else replacing failed disks and servers and, in the case of S3, it’s geo-redundant (data is stored in multiple data centers).


I’ve not yet discarded my multi-TB home storage system, but the time is near.



Tuesday, December 23, 2008 8:46:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Monday, December 22, 2008

I wrote this blog entry a few weeks ago before my recent job change.  It’s a look at the cost of high-scale storage and how it has fallen over the last two years based upon the annual fully burdened cost of power in a data center and industry disk costs trends. The observations made in this post are based upon understanding these driving costs and should model any efficient, high-scale bulk storage farm. But, and I need to be clear on this, it was written prior to my joining AWS and there is no information below relating to any discussions I’ve had with AWS or how the AWS team specifically designs, deploys, or manages their storage farm.


When Amazon released Amazon S3, I argued that it was priced below cost at $1.80/GB/year.  At that time, my estimate of their cost was $2.50/GB/year.  The Amazon charge of $1.80/GB/year for data to be stored twice in each of two data centers is impressive. It was amazing when it was released and it remains an impressive value today. 


Even though the storage price was originally below cost by my measure, Amazon could still make money if they were running a super-efficient operation (likely the case). How could they make money charging less than cost for storage? Customers are charged for ingress/egress on all data entering or leaving the AWS cloud. The network ingress/egress charges from AWS are reasonable, but telecom pricing strongly rewards volume purchases, so what Amazon pays is likely much less than the AWS ingress/egress charges. This potentially allows the storage business to be profitable even when operating at a storage cost loss.


One concern I’ve often heard is the need to model the networking costs between the data centers, since there are actually two redundant copies stored in two independent data centers. Networking, like power, is usually billed at the 95th percentile over a given period. The period is usually a month but more complex billing systems exist. The constant across most of these high-scale billing systems is that the charge is based upon peaks. What that means is that adding ingress or egress at an off-peak time is essentially free. Assuming peaks are short-lived, the sync to the other data center can be delayed until the peak has passed. If the SLA doesn’t have a hard deadline on when the sync will complete (it doesn’t), then the inter-DC bandwidth is effectively without cost. I call this technique Resource Consumption Shaping and it’s one of my favorite high-scale service cost savers.
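The peak-based billing math above can be sketched in a few lines of Python. The sample values and the traffic shape are made up for illustration; the point is that traffic deferred into off-peak intervals, filled only up to the already-billed rate, doesn’t change the bill at all:

```python
# Sketch of 95th-percentile billing, the mechanism behind Resource
# Consumption Shaping. Sample values and shapes are illustrative.

def billable_rate(samples):
    """The 95th-percentile sample: the top 5% of intervals are free."""
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * 0.95) - 1)]

# One day of 5-minute egress samples (Mbps): 288 samples, with a short
# peak (under 5% of intervals) and a long off-peak valley.
day = [200] * 130 + [900] * 14 + [300] * 144

before = billable_rate(day)  # the short 900 Mbps peak is free

# Defer the inter-data-center sync into the valley, filling it up to
# the already-billed rate: extra bytes move, the bill doesn't change.
shaped = [s if s != 200 else 300 for s in day]
after = billable_rate(shaped)

extra_traffic = sum(shaped) - sum(day)  # 130 intervals x 100 Mbps, free
```

Both `before` and `after` come out at 300 Mbps here, even though the shaped day moves substantially more data.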


What is the cost of storage today in an efficient, commodity bulk-storage service? Building upon the models in the cost of power in large-scale data centers and the annual fully burdened cost of power, here’s the model I use for cold storage with current data points:

Note that this is for cold storage and I ignore the cost of getting the data to or from the storage farm. You need to pay for the networking you use. Again, since it’s cold storage, the model assumes you can use 80% of the disk, which wouldn’t be possible for data with high I/O rates per GB. And we’re using commodity SATA disks at 1TB that only consume 10W of power. This is a cold storage model. If you are running higher I/O rates, figure out what percentage of the disk you can successfully use and update the model in the spreadsheet (ColdStorageCost.xlsx (13.86 KB)). If you are using higher-power, enterprise disks, you can update the model to use roughly 15W for each.

Update: Bryan Apple found two problems with the spreadsheet that have been corrected in the linked spreadsheet above. Ironically, the resulting fully burdened cost/GB/year is unchanged. Thanks Bryan.

For administration costs, I’ve used a fixed, fairly conservative factor of a 10% uplift on all other operations and administration costs. Most large-scale services are better than this and some are more than twice as good, but I included the conservative 10% number.
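Here’s a minimal sketch of the kind of model in the spreadsheet. The server/disk prices, wattages, and the fully burdened power cost of $2.12/W/year come from these posts; the disks-per-server count and the amortization periods are my assumptions, so treat the linked spreadsheet as the authoritative model:

```python
# A minimal sketch of the cold-storage cost model. Assumptions
# (disks per server, amortization periods) are marked; the linked
# ColdStorageCost.xlsx is the authoritative version.

def cold_storage_cost_per_gb_year(
    server_price=1200.0,      # $, commodity storage server
    server_watts=160.0,       # W, server excluding disks (assumed split)
    disks_per_server=12,      # assumption
    disk_price=160.0,         # $, 1TB commodity SATA
    disk_tb=1.0,
    disk_watts=10.0,
    server_amort_years=3.0,   # assumption
    disk_amort_years=3.0,     # assumption
    burdened_power=2.12,      # $/W/year, fully burdened cost of power
    usable_fraction=0.8,      # cold storage: 80% of each disk usable
    copies=4,                 # 2 copies in each of 2 data centers
    admin_uplift=0.10,        # 10% on all other costs
):
    # Annual cost of one storage server: amortized hardware plus power.
    yearly = (server_price / server_amort_years
              + server_watts * burdened_power
              + disks_per_server * (disk_price / disk_amort_years
                                    + disk_watts * burdened_power))
    # User-visible GB per server after replication and usable fraction.
    user_gb = disks_per_server * disk_tb * 1000 * usable_fraction / copies
    return yearly * (1 + admin_uplift) / user_gb

cost = cold_storage_cost_per_gb_year()
```

With these assumed parameters the sketch lands in the same neighborhood as the $0.80/GB/year figure below, and plugging in the two-year-old numbers (e.g. $2,000 servers at 250W, ½TB disks at $250) pushes it up by roughly 3x, consistent with the trend the post describes.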


Cold storage with 4x copies at high scale can now be delivered at $0.80/GB/year. It’s amazing what falling server prices and rapidly increasing disk sizes have done. But it’s actually pretty hard to do and I’ve led storage-related services that didn’t get close to this efficient -- I still think that Amazon S3 is a bargain.


Looking at the same model but plugging in numbers from about two years ago shows how fast we’re seeing storage costs plunge. Using $2,000 servers rather than $1,200, server power consumption at 250W rather than 160W, disk size at ½TB, and disk cost at $250 rather than $160 yields an amazingly different $2.40/GB/year.


Cold storage with redundancy at: $0.80 GB/year and still falling. Amazing.




James Hamilton


Monday, December 22, 2008 7:24:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [19] - Trackback

 Wednesday, December 17, 2008

Resource Consumption Shaping is an idea that Dave Treadwell and I came up with last year. The core observation is that service resource consumption is cyclical. We typically pay for near-peak consumption and yet frequently are consuming far below this peak. For example, network egress is typically charged at the 95th percentile of peak consumption over a month, and yet the real consumption is highly sinusoidal and frequently far below this charged-for rate. Substantial savings can be realized by smoothing the resource consumption.


Looking at the network egress traffic report below, we can see this prototypical resource consumption pattern:


You can see from the chart above that resource consumption over the course of a day varies by more than a factor of two. This variance is driven by a variety of factors, but an interesting one is the size of the Pacific Ocean, where the population density is near zero. As the service peak load time-of-day sweeps around the world, network load falls to base-load levels as the peak time range crosses the Pacific Ocean. Another contributing factor is wide variance in the success of this example service in different geographic markets.


We see the same opportunities with power. Power is usually charged at the 95th percentile over the course of the month. It turns out that some negotiated rates are more complex than this, but the same principle can be applied to any peak-load-sensitive billing system. For simplicity’s sake, we’ll look at the common case of systems that charge at the 95th percentile over the month.


Server power consumption varies greatly depending upon load. Data from an example server SKU shows idle power consumption of 158W and full-load consumption of about 230W. If we defer batch and non-user-synchronous workload as we approach the current data center power peak, we can reduce overall peaks. As the server power consumption moves away from a peak, we can reschedule this non-critical workload. Using this technique we throttle back the power consumption and knock off the peaks by filling the valleys. Another often-discussed technique is to shut off non-needed servers and use workload peak clipping and trough filling to allow the workload to be run with fewer servers turned on. Using this technique it may actually be possible to run the service with fewer servers overall. In Should we Shut Off Servers, I argue that shutting off servers should NOT be the first choice.
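Peak clipping and trough filling with deferrable batch work can be illustrated with a toy simulation. The hourly load curve, the deferrable fractions, and the power cap are all invented for illustration:

```python
# Toy illustration of peak clipping and valley filling: defer
# deferrable (batch) work from peak hours into valleys. The load
# curve, deferrable fractions, and cap are invented.

def shape(load_kw, deferrable_kw, cap_kw):
    """Clip each hour to cap_kw by deferring its deferrable portion,
    then replay the deferred backlog in hours with headroom."""
    backlog = 0.0
    shaped = []
    for total, batch in zip(load_kw, deferrable_kw):
        must_run = total - batch   # user-synchronous work, not movable
        backlog += batch           # batch work joins the deferred pool
        room = max(0.0, cap_kw - must_run)
        run_now = min(backlog, room)
        backlog -= run_now
        shaped.append(must_run + run_now)
    return shaped, backlog

hourly_load = [600, 650, 900, 950, 900, 700, 550, 500]   # kW, one peak
batch_part  = [100, 100, 200, 250, 200, 100, 100, 100]   # deferrable kW
shaped, leftover = shape(hourly_load, batch_part, cap_kw=750)
```

In this run the 950kW peak is clipped to 750kW, the same total energy is consumed, and all deferred work completes within the day -- the valleys absorb the peaks.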


Applying this technique to power has a huge potential upside because power provisioning and cooling dominates the cost of a data center.  Filling valleys allows better data center utilization in addition to lowering power consumption charges. 


The resource-shaping techniques we’re discussing here, that of smoothing spikes by knocking off peaks and filling valleys, apply to all data center resources. We have to buy servers to meet the highest load requirements. If we knock off peaks and fill valleys, fewer servers are needed. This also applies to internal networking. In fact, resource shaping as a technique applies to all resources across the data center. The only difference is the varying complexity of scheduling the consumption of these different resources.


One more observation along this theme, this time returning to egress charges. We mentioned earlier that egress was charged at the 95th percentile. What we didn’t mention is that ingress/egress are usually purchased symmetrically. If you need to buy N units of egress, then you just bought N units of ingress whether you need it or not. Many services are egress dominated. If we can find a way to trade ingress to reduce egress, we save. In effect, it’s cross-dimensional resource shaping, where we are trading off consumption of a cheap or free resource to save an expensive one. On an egress-dominated service, even inefficient techniques that trade off, say, 10 units of ingress to save only 1 unit of egress may still work economically. Remote Differential Compression is one approach to reducing egress at the expense of a small amount of ingress.
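A toy example of the ingress-for-egress trade. The prices and traffic levels are invented, and it assumes the bill is set by the larger of the symmetric ingress/egress purchase:

```python
# Sketch of cross-dimensional resource shaping: spending free ingress
# headroom to reduce billed egress. Prices and traffic are made up.

# Ingress/egress are purchased symmetrically; this service is egress
# dominated, so its ingress headroom is effectively free.
egress_gbps, ingress_gbps = 10.0, 2.0
price_per_gbps_month = 5000.0   # billed on max(ingress, egress)

bill = price_per_gbps_month * max(egress_gbps, ingress_gbps)

# A differential-transfer scheme: spend 1 Gbps of extra ingress
# (signatures/requests) to shave 2 Gbps of egress.
new_egress, new_ingress = egress_gbps - 2.0, ingress_gbps + 1.0
new_bill = price_per_gbps_month * max(new_egress, new_ingress)
monthly_savings = bill - new_bill
```

Because ingress stays well below the egress-driven purchase level, the extra ingress costs nothing and the egress reduction is pure savings.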


The cross-dimensional resource-shaping technique described above where we traded off ingress to reduce egress can be applied across other dimensions as well.  For example, adding memory to a system can reduce disk and/or network I/O.  When does it make sense to use more memory resources to save disk and/or networking resources?  This one is harder to dynamically tune in that it’s a static configuration option but the same principles can be applied.


We find another multi-resource trade-off possibility with disk drives.  When a disk is purchased, we are buying both a fixed I/O capability and a fixed disk capacity in a single package.  For example, when we buy a commodity 750GB disk, we get a bit less than 750GB of capacity and the capability of somewhat more than 70 random I/Os per second (IOPS).  If the workload needs more than 70 I/Os per second, capacity is wasted. If the workload consumes the disk capacity but not the full IOPS capability, then the capacity will be used up but the I/O capability will be wasted. 


Even more interesting, we can mix workloads from different services to “absorb” the available resources. Some workloads are I/O bound while others are storage bound. If we mix these two storage workloads types, we may be able to fully utilize the underlying resource.  In the mathematical limit, we could run a mixed set of workloads with ½ the disk requirements of a workload partitioned configuration.  Clearly most workloads aren’t close to this extreme limit but savings of 20 to 30% appear attainable.   An even more powerful saving is available from mixing workloads using storage by sharing excess capacity. If we pool the excess capacity and dynamically move it around, we can safely increase the utilization levels on the assumption that not all workloads will peak at the same time. As it happens, the workloads are not highly correlated in their resource consumption so this technique appears to offer even larger savings than what we would get through mixing I/O and capacity-bound workloads.  Both gains are interesting and both are worth pursuing.
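The mixing argument can be made concrete with a quick calculation. The disk specs are roughly the commodity numbers above (70 IOPS, 750GB); the two workloads are invented and idealized, so real savings will be lower than this math-limit example:

```python
# Sketch of the disk IOPS-vs-capacity mixing argument. A commodity
# disk bundles ~70 random IOPS with ~750GB; the workload numbers
# here are invented to illustrate the effect.
import math

DISK_IOPS, DISK_GB = 70, 750

def disks_needed(iops, gb):
    # Must satisfy both dimensions; the slack in the other is wasted.
    return max(math.ceil(iops / DISK_IOPS), math.ceil(gb / DISK_GB))

io_bound  = dict(iops=3500, gb=7_500)    # IOPS-hungry, little data
cap_bound = dict(iops=350,  gb=33_750)   # cold data, few IOPS

# Partitioned: each workload gets its own disks; one dimension idles.
separate = disks_needed(**io_bound) + disks_needed(**cap_bound)

# Mixed: one pool absorbs both, consuming IOPS and capacity together.
mixed = disks_needed(io_bound["iops"] + cap_bound["iops"],
                     io_bound["gb"] + cap_bound["gb"])
savings = 1 - mixed / separate
```

Here the partitioned configuration needs 95 disks while the mixed pool needs 55, because the IO-bound workload’s wasted capacity absorbs the capacity-bound workload and vice versa.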


Note that the techniques that I’ve broadly called resource shaping are an extension of an existing principle called network-traffic shaping. I see great potential in fundamentally changing the cost of services by making services aware of the real second-to-second value of a resource and allowing them to break their resource consumption into classes of urgent (expensive), less urgent (somewhat cheaper), and bulk (near free).



James Hamilton,

Wednesday, December 17, 2008 9:16:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback

 Saturday, December 13, 2008

I’ve resigned from Microsoft and will join the Amazon Web Services team at the start of next year.  As an AWS user, I’ve written thousands of lines of app code against S3, and now I’ll have an opportunity to help improve and expand the AWS suite.

In this case, I’m probably guilty of what many complain about in bloggers: posting rehashed news reported broadly elsewhere without adding anything new.

Job changes generally bring some stress, and that’s probably why I’ve only moved between companies three times in 28 years. I worked 6 years as an auto-mechanic, 10 years at IBM, and 12 years at Microsoft. Looking back over my 12 years at Microsoft, I couldn’t have asked for more excitement, more learning, more challenges, or more trust.

I’ve had a super interesting time at Microsoft and leaving is tough, but I also remember feeling the same way when I left IBM after 10 years to join Microsoft. Change is good; change challenges; change forces humility; change teaches. I’m looking forward to it even though all new jobs are hard. Onward!



Saturday, December 13, 2008 5:26:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
 Saturday, December 06, 2008

In the Cost of Power in Large-Scale Data Centers, we looked at where the money goes in a large scale data center.  Here I’m taking similar assumptions and computing the Annual Cost of Power including all the infrastructure as well as the utility charge. I define the fully burdened cost of power to be the sum of 1) the cost of the power from the utility, 2) the cost of the infrastructure that delivers that power, and 3) the cost of the infrastructure that gets the heat from dissipating the power back out of the building.


We take the monthly cost of the power and cooling infrastructure, assuming a 15-year amortization cycle and a 5% annual cost of money, divided by the overall data center critical load to get the annual infrastructure cost per watt. The fully burdened cost of power is the cost of consuming 1W for an entire year and includes the power and cooling infrastructure and the power consumed. Essentially it’s the cost of all the infrastructure except the cost of the data center shell (the building). From Intense Computing or In Tents Computing, we know that 82% of the cost of the entire data center is power delivery and cooling. So taking the entire monthly facility cost divided by the facility critical load * 82% is a good estimator of the infrastructure cost of power.


The fully burdened cost of power is useful for a variety of reasons, but here are two: 1) current-generation servers get more work done per joule than older servers -- when is it cost effective to replace them? And 2) SSDs consume much less power than HDDs -- how much can I save in power over three years by moving to SSDs and is it worth doing?


We’ll come back to those two examples after we work through what power costs annually. In this model, like the last one, we’ll assume a 15MW data center that was built at a cost of $200M and runs at a PUE of 1.7. This is better than most, but not particularly innovative.
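As a rough sketch, the model works out like this. The $0.07/kWh utility rate is my assumption for illustration; the other inputs ($200M facility, 15MW critical load, PUE 1.7, 82% infrastructure share, 15-year amortization at 5%) come from this and the referenced posts:

```python
# Sketch of the fully burdened cost of power: amortized power and
# cooling infrastructure plus the utility bill for 1W of critical
# load held for a year. The $0.07/kWh utility rate is an assumption.

def annuity(principal, rate, years):
    """Annual payment amortizing principal at the given rate."""
    return principal * rate / (1 - (1 + rate) ** -years)

def burdened_cost_per_watt_year(
    facility_cost=200e6, infra_share=0.82, critical_load_w=15e6,
    amort_years=15, cost_of_money=0.05, pue=1.7, utility_per_kwh=0.07,
):
    # Amortized power/cooling infrastructure, per watt of critical load.
    infra = annuity(facility_cost * infra_share, cost_of_money,
                    amort_years) / critical_load_w
    # 1W of critical load draws PUE watts at the meter, all year long.
    utility = pue * 8760 / 1000 * utility_per_kwh
    return infra + utility

cost = burdened_cost_per_watt_year()  # ~$2.10/W/year with these inputs
```

The infrastructure amortization alone contributes about $1.05/W/year here, roughly equal to the utility bill itself, which is why power and cooling infrastructure dominates these models.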


Should I Replace Old Servers?

Let’s say we have 500 servers, each of which can process 200 application operations/second. These servers are about 4 years old and consume 350W each. A new server has been benchmarked to process 250 operations/second, and each of these servers costs $1,300 and consumes 160W at full load. Should we replace the farm?


Using the new servers, we only need 400 servers to do the work of the previous 500 (500*200/250). The new server farm consumes less power; the savings are 111kW ((500*350W)-(400*160W)). Let’s assume a plan to keep the new servers for three years. We save 111kW each year for three years and we know from the above model that we are paying $2.12/W/year, so over three years we’ll save $705,960. The new servers will cost $520,000 so, by recycling the old servers and buying new ones, we can save $185,960. To be fair, we should accept a charge to recycle the old ones and we need to model the cost of money on the $520k in capital. We ignore the recycling costs and use a 5% cost of money to model the capital cost of the servers. Using a 5% cost of money over a three-year amortization period, we’ll have another $52,845 in interest, whether we borrow to buy these servers or simply recognize that tying up capital has a cost.


Accepting this $52k charge for tying up capital, it’s still a gain of roughly $133k to recycle the old servers and buy new ones. In this case, we should replace the servers.
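The replacement arithmetic, as a quick script. The numbers are from the example above; the capital charge is my reading of “5% cost of money over a three-year amortization period” as a standard annuity:

```python
# The server-replacement arithmetic as code. All inputs are from the
# example; the annuity formula for the capital charge is an
# interpretation of "5% cost of money over three years."

OLD_N, OLD_W = 500, 350          # old farm: count and watts each
NEW_W, NEW_PRICE = 160, 1300     # new server: watts and price
POWER_COST = 2.12                # $/W/year, fully burdened
YEARS, RATE = 3, 0.05

new_n = OLD_N * 200 // 250       # 400 new servers match 500 old ones
watts_saved = OLD_N * OLD_W - new_n * NEW_W        # 111,000W
power_savings = watts_saved * POWER_COST * YEARS   # $705,960
capex = new_n * NEW_PRICE                          # $520,000

# Capital charge: total interest on a 3-year annuity at 5%.
payment = capex * RATE / (1 - (1 + RATE) ** -YEARS)
interest = payment * YEARS - capex                 # ~$52.8k
net_gain = power_savings - capex - interest
```

Running the numbers this way reproduces the post's figures: $705,960 of burdened power savings against $520,000 of capex plus roughly $52.8k in interest, leaving a six-figure gain from replacing the farm.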


What is an SSD Worth?

Let’s look at the second of the two examples I brought up above. Let’s say I can replace 10 disk drives with a single SSD. If the workload is not capacity bound and is I/O intensive, this can be the case (see When SSDs Make Sense in Server Applications). Each HDD consumes roughly 10W whereas the SSD only consumes 2.5W. Replacing these 10 HDDs with a single SSD saves 97.5W which, over a three-year life, is 292.5 watt-years. Using the fully burdened cost of power from the above model, we could save $620 (292.5*$2.12) on power alone. Let’s say the disk drives are $160 each and will last three years; what’s the break-even point where the SSD is a win, assuming the performance is adequate and ignoring other factors such as lifetime and service? We take the cost of the 10 disks and add in the cost of power saved to see what we could afford to pay for an SSD – the break-even point (10*$160+$620 => $2,220). If the SSD is under $2,220, then it is a win. The Intel X25-E had a street price of around $700 the last time I checked and, in many application workloads, it will easily replace 10 disks. Our conclusion is that, in this case with these assumptions, the SSD looks like a better investment than 10 disks.
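And the SSD break-even in the same style. The numbers are from the example; lifetime, service cost, and performance differences are ignored, as in the text:

```python
# The SSD break-even calculation as code. Inputs are from the
# example; lifetime, service, and performance effects are ignored.

HDD_N, HDD_W, HDD_PRICE = 10, 10.0, 160.0   # disks replaced by 1 SSD
SSD_W = 2.5
POWER_COST, YEARS = 2.12, 3                 # $/W/year burdened; 3-year life

watts_saved = HDD_N * HDD_W - SSD_W         # 97.5W continuous savings
watt_years = watts_saved * YEARS            # 292.5 watt-years
power_savings = watt_years * POWER_COST     # ~$620

# Break-even SSD price: the disks it replaces plus the power saved.
breakeven = HDD_N * HDD_PRICE + power_savings   # ~$2,220
street_price = 700.0                            # roughly, at the time
win = street_price < breakeven
```

With a break-even near $2,220 and a street price around $700, the SSD wins by a wide margin whenever it can actually absorb the I/O load of the disks it replaces.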


When you factor in the fully burdened price of power, savings can add up quickly. Compute your fully burdened cost of power using the linked spreadsheet and figure out when you should be recycling old servers or considering lower-power components.


If you are interested in tuning the assumptions to more closely match your current costs, here it is: PowerCost.xlsx (11.27 KB).




James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Saturday, December 06, 2008 4:23:54 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback

 Wednesday, December 03, 2008

Michael Manos yesterday published Our Vision for Generation 4 Modular Data Centers – One Way of Getting it Just Right. In this posting, Mike goes through the next-generation modular data center designs for Microsoft. Things are moving quickly. I first argued for modular designs in a Conference on Innovative Data Systems Research paper submitted in 2006. Last spring I blogged First Containerized Data Center Announcement, which looks at the containerized portion of the Chicago data center.


In this more recent post, the next generation design is being presented in surprising detail. The Gen4 design has 4 classes of service:

·         A: No UPS and no generator

·         B: UPS with optional generator

·         C: UPS, generator with +1 maintenance support

·         D: UPS and generator with +2 support


I’ve argued for years that high-minute UPS and generators are a poor investment.  We design services to be able to maintain SLA through server hardware or software error.  If a service is hosted over a large number of data centers, the loss of an entire data center should not impact the ability of the service to meet the SLA. There is no doubt that this is true and there are services that exploit this fact and reduce their infrastructure costs by not deploying generators. The problem is the vast majority of services don’t run over a sufficiently large number of data centers and some have single points of failure not distributed across data centers. Essentially some services can be hosted without high-minute UPSs and generators but many can’t be. Gen4 gets around that by offering a modular design where A class has no backup and D class is a conventional facility with good power redundancy (roughly a tier-3 design).


The Gen4 design is nearly 100% composed of prefabricated parts. Rather than just the server modules, all power distribution, mechanical, and even administration facilities are modular and prefabricated. This allows for rapid and incremental deployment.  With a large data center costing upwards of $200m (Cost of Power in High Scale Data Centers), an incremental approach to growth is a huge advantage.


Gen4 aims to achieve a PUE of 1.125 and to eliminate the use of water in the mechanical systems relying instead 100% on air-side economization.


Great data, great detail, and hats off to Mike and the entire Microsoft Global Foundation Services team for sharing this information with the industry. It’s great to see.




Thanks to Mike Neil for pointing this posting out to me.


James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Wednesday, December 03, 2008 7:05:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
