Sunday, June 20, 2010

This morning I came across Exploring the software behind Facebook, the World’s Largest Site. The article doesn’t report anything not previously known, but it’s a good summary of the software used by Facebook and the current scale of the social networking site:

·         570 billion page views monthly

·         3 billion photo uploads monthly

·         1.2 million photos served per second

·         30k servers


The latter metric, the 30k server count, is pretty old (Facebook has 30,000 servers). I would expect the number to be closer to 50k now, based on external usage growth alone.


The article was vague on memcached usage, saying only “terabytes”. I’m pretty interested in memcached and Facebook is, by far, the largest user, so I periodically check their growth rate. They now have 28 terabytes of memcached data behind 800 servers. See Scaling memcached at Facebook for more detail.
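The per-server arithmetic behind those figures is worth a quick sketch (the 28 TB and 800 servers are from the post; the rest is just division):

```python
# Back-of-envelope: average memcached capacity per server at Facebook,
# using the 28 TB / 800 server figures cited above.
total_cache_tb = 28.0
servers = 800

gb_per_server = total_cache_tb * 1024 / servers  # TB -> GB
print("%.1f GB of cache per server" % gb_per_server)  # ~35.8 GB
```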


The mammoth memcached fleet at Facebook has had me wondering for years: how close is the cache to the entire data store? If you factor out photos and other large objects, how big is the entire remaining user database? Today the design is memcached insulating the fleet of database servers. What is the aggregate memory size of the memcached and database fleets? Would it be cheaper to store the entire database 2-way redundant in memory, with changes logged to support recovery in the event of a two-server loss?


Facebook is very close, if not already able, to store the entire data store minus large objects in memory, and within a factor of two of being able to store it in memory twice, making memcached the primary copy and omitting the database tier entirely. It would be a fun project.




James Hamilton



b: /


Sunday, June 20, 2010 6:56:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [15] - Trackback
 Sunday, June 13, 2010

I’ve been talking about the application of low-cost, low-power processors to server workloads for years, starting with The Case for Low-Cost, Low-Power Servers. Subsequent articles get into more detail: Microslice Servers, Low-Power Amdahl Blades for Data Intensive Computing, and Successfully Challenging the Server Tax.


Single-dimensional measures of servers like “performance” without regard to server cost or power dissipation are seriously flawed. The right way to measure server performance is work done per dollar and work done per joule. If you adopt these measures of workload performance, we find that cold storage workloads and highly partitionable workloads run very well on low-cost, low-power servers. And we find the converse as well: database workloads run poorly on these servers (see When Very Low-Power, Low-Cost Servers Don’t Make Sense).


The reasons why scale-up workloads in general, and database workloads specifically, run poorly on low-cost, low-power servers are fairly obvious. Workloads that don’t scale out need bigger single servers to scale (duh). And workloads that are CPU bound tend to run more cost effectively on higher-powered nodes. The latter isn’t strictly true: even with scale-out losses, many CPU-bound workloads still run efficiently on low-cost, low-power servers because what is lost in scaling is sometimes more than made up for by lower cost and lower power consumption.
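To make the trade-off concrete, here is a minimal sketch of the work-per-dollar and work-per-joule comparison. Every number below is hypothetical; only the method, not the figures, reflects the argument above:

```python
# Hypothetical comparison of two server types on work/$ and work/joule.
# All numbers are invented for illustration; only the method matters.

def work_per_dollar(work_units, server_cost):
    return work_units / float(server_cost)

def work_per_joule(work_units, watts, seconds=1.0):
    return work_units / float(watts * seconds)

# big: one high-power node; small: a low-power node with a scale-out
# efficiency loss (it achieves only 70% of its raw throughput once the
# workload is partitioned across many nodes).
big_work, big_cost, big_watts = 1000.0, 2000.0, 300.0
small_work, small_cost, small_watts = 200.0, 300.0, 30.0
scale_out_efficiency = 0.7

small_effective = small_work * scale_out_efficiency

print("big   work/$: %.2f  work/J: %.2f" % (
    work_per_dollar(big_work, big_cost), work_per_joule(big_work, big_watts)))
print("small work/$: %.2f  work/J: %.2f" % (
    work_per_dollar(small_effective, small_cost),
    work_per_joule(small_effective, small_watts)))
```

With these made-up inputs the low-power node wins on work per joule even after the scale-out loss, while losing slightly on work per dollar — exactly the kind of result the single-dimensional "performance" measure hides.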


I find the bounds where a technology ceases to work efficiently to be the most interesting area to study, for two reasons: 1) these boundaries teach us why current solutions don’t cross the boundary and often give us clues on how to make the technology apply more broadly, and, most important, 2) you really need to know where not to apply a new technology. It is rare that a new technology is a uniform, across-the-board win. For example, many of the current applications of flash memory make very little economic sense. It’s a wonderful solution for hot, I/O-bound workloads, where it is far superior to spinning media. But flash is a poor fit for many of the applications where it ends up being applied. You need to know where not to use a technology.


Focusing on the bounds of why low-cost, low-power servers don’t run a far broader class of workloads also teaches us what needs to change to achieve broader applicability. For example, if we ask what would happen if the processor cost and power dissipation were zero, we quickly see that, when scaling down processor cost and power, it is what surrounds the processor that begins to dominate. We need to get enough work done on each node to pay for the cost and power of all the surrounding components: northbridge, memory, networking, power supply, etc. Each node needs to get enough done to pay for the overhead components.
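A tiny sketch of that amortization argument, with invented wattages: as the CPU shrinks, the fixed overhead surrounding it comes to dominate the node's power budget.

```python
# Sketch: as CPU power scales down, fixed per-node overhead (memory,
# northbridge, NIC, power supply, fans) dominates the node power budget.
# The 20 W overhead figure is invented for illustration.
overhead_watts = 20.0  # everything surrounding the CPU

for cpu_watts in (100.0, 30.0, 5.0, 1.0):
    node_watts = cpu_watts + overhead_watts
    overhead_share = overhead_watts / node_watts
    print("CPU %5.1f W -> overhead is %4.1f%% of node power"
          % (cpu_watts, 100.0 * overhead_share))
```

At a hypothetical 1 W CPU, over 95% of the node's power goes to the non-CPU components — which is why sharing that infrastructure across many servers matters so much.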


This shows us an interesting future direction: what if servers shared the infrastructure and the “all except the processor” tax were spread over more servers? It turns out this really is a great approach, and applying this principle expands the set of workloads that can be hosted on low-cost, low-power servers. Two examples of this direction are the Dell Fortuna and the Rackable CloudRack C2; both of these shared-infrastructure servers take a big step in this direction.


SeaMicro Releases Innovative Intel Atom Server

One of the two server startups I’m currently most excited about is SeaMicro. Up until today, they have been in stealth mode and I haven’t been able to discuss what they are building. It’s been killing me. They are targeting the low-cost, low-power server market and they have carefully studied the lessons above and applied them deeply. SeaMicro has built a deeply integrated, shared-infrastructure, low-cost, low-power server solution with a broader potential market than any I’ve seen so far. They are able to run the Intel x86 instruction set, avoiding the adoption friction of different ISAs, and they have integrated a massive number of servers very deeply into an incredibly dense package. I continue to point out that rack density for density’s sake is a bug, not a feature (see Why Blades aren’t the Answer to All Questions), but the SeaMicro server module density is “good density” that reduces cost and increases efficiency. At under 2 kW for a 10RU module, it is neither inefficient nor challenging from a cooling perspective.


A potential downside of the SeaMicro approach is that the Intel Atom CPU is not quite as power efficient as some of the ARM-based solutions and it doesn’t currently support ECC memory. However, the SeaMicro design is available now and it is a considerable advancement over what is currently in the market. See You Really do Need ECC Memory in Servers for more detail on why ECC can be important. What SeaMicro has built is actually CPU independent and can integrate other CPUs as other choices become available, and the current Intel Atom-based solution will work well for many server workloads. I really like what they have done.


SeaMicro has taken shared infrastructure to an entirely new level, building a 512-server module that takes just 10 RU and dissipates just under 2 kW. Four of these modules will fit in an industry-standard rack, consume a reasonable 8 kW, and deliver more work done per joule, more work done per dollar, and more work done per rack than the more standard approaches currently on the market.

The SeaMicro server module comprises:

·         512 1.6GHz Intel Atoms (2048 CPUs/rack)

·         1 TB DRAM

·         1.28 Tbps networking fabric

·         Up to 16x 10Gbps ingress/egress network or up to 64x 1Gbps if running 1GigE

·         0 to 64 SATA SSD or HDDs

·         Standard x86 instruction set architecture (no recompilation)

·         Integrated software load-balancer

·         Integrated layer 2 networking switch

·         Under 2kW power
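The per-server numbers fall out of the spec list above with simple arithmetic (all inputs are from the list; the 2 kW figure is an upper bound):

```python
# Per-server arithmetic from the SeaMicro module specs listed above.
servers_per_module = 512
modules_per_rack = 4
dram_tb = 1.0
module_watts = 2000.0  # "under 2 kW", so these are upper bounds

print("servers/rack: %d" % (servers_per_module * modules_per_rack))         # 2048
print("DRAM/server:  %.1f GB" % (dram_tb * 1024 / servers_per_module))      # 2.0 GB
print("power/server: %.2f W (upper bound)" % (module_watts / servers_per_module))
print("rack power:   %.0f W" % (module_watts * modules_per_rack))           # 8000 W
```

Under 4 W per server, amortized across the shared fabric, load balancer, and power infrastructure, is what makes the density "good density".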


The server nodes are built 8 to a board in one of the nicest designs I’ve seen in years:



The SeaMicro 10 RU chassis can be hosted 4 to an industry standard rack:


This is an important hardware advancement and it is great to see the faster pace of innovation sweeping the server world driven by innovative startups like SeaMicro.




James Hamilton



b: /


Sunday, June 13, 2010 8:01:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
 Thursday, June 10, 2010

A couple of days ago I came across an interesting article by Microsoft Fellow Mark Russinovich. In this article, Mark hunts a random Internet Explorer crash with his usual tools: The Case of the Random IE Crash. He chases the IE issue down to a Yahoo! Toolbar. This caught my interest for two reasons: 1) the debug technique used to chase it down was interesting, and 2) it’s a two-week-old computer with no toolbars ever installed. From Mark’s blog:


This came as a surprise because the system on which the crash occurred was my home gaming system, a computer that I’d only had for a few weeks. The only software I generally install on my gaming systems are Microsoft Office and games. I don’t use browser toolbars and if I did, would obviously use the one from Bing, not Yahoo’s. Further, the date on the DLL showed that it was almost two years old. I’m pretty diligent about looking for opt-out checkboxes on software installers, so the likely explanation was that the toolbar had come onto my system piggybacking on the installation of one of the several video-card stress testing and temperature profiling tools I used while overclocking the system. I find the practice of forcing users to opt-out annoying and not giving them a choice even more so, so was pretty annoyed at this point. A quick trip to the Control Panel and a few minutes later and my system was free from the undesired and out-of-date toolbar.


It’s a messy world out there and it’s very tough to control what software gets installed on a computer. This broad class of problems is generally referred to as Drive-by Downloads:

The expression drive-by download is used in four increasingly strict meanings:

1.       Downloads which the user indirectly authorized but without understanding the consequences (eg. by installing an unknown ActiveX component or Java applet).

2.       Any download that happens without knowledge of the user.

3.       Download of spyware, a computer virus or any kind of malware that happens without knowledge of the user. Drive-by downloads may happen by visiting a website, viewing an e-mail message or by clicking on a deceptive popup window: the user clicks on the window in the mistaken belief that, for instance, an error report from the PC itself is being acknowledged, or that an innocuous advertisement popup is being dismissed; in such cases, the "supplier" may claim that the user "consented" to the download although actually unaware of having initiated an unwanted or malicious software download.

4.       Download of malware through exploitation of a web browser, e-mail client or operating system vulnerability, without any user intervention whatsoever. Websites that exploit the Windows Metafile vulnerability (eliminated by a Windows update of 5 January 2006) may provide examples of "drive-by downloads" of this sort.


This morning I came across what looks like a serious case of a drive-by download where the weapon of choice was the widely trusted Windows Update: Microsoft Secretly Installs Firefox Extension Through WU.


I’m a huge fan of Windows Update – I think it’s dramatically improved client-side security and reliability. The combination of Windows Error Reporting and Windows Update allows system failures to be statistically tracked, focuses resources on those causing the most problems, and then delivers the fixes broadly and automatically. These two tools are incredibly important to the health of the Windows ecosystem, so I hope this report is inaccurate.




James Hamilton



b: /


Thursday, June 10, 2010 5:57:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

One of the most important attributes needed in a cloud solution is what I call cloud data freedom. Having the ability to move data out of the cloud quickly, efficiently, cheaply, and without restriction is, in my opinion, a mandatory prerequisite to trusting a cloud. In fact, you need the ability to move the data both ways. Moving in cheaply, efficiently, and quickly is often required just to get the work done. And the ability to move out cheaply, efficiently, quickly, and without restriction is the only way to avoid lock-in. Data movement freedom is the most important attribute of an open cloud and a required prerequisite to avoiding provider lock-in.


The issue came up in the comments on this post: Netflix on AWS where Jan Miczaika asked:


James, as long as Amazon is growing constantly and has a great culture of smart people it will work out fine. Should the going ever get tough (what I of course don't hope) these principles may be thrown overboard. It would not be the first time companies sacrifice long-term values for short-term profits.

This is all very hypothetical. Still, for strategic long-term planning, I believe it should be taken into account.


And I responded:

Jan, it is inarguably true that there have been instances of previously good companies making incredibly short-sighted decisions. It has happened before and it could happen again.

The point I'm making is not that there exists any company that is incapable of damn dumb decisions. My point is that the cloud computing model is a huge win economically. I agree with you that no company is assured to be great, customer focused, and thinking clearly forever. Disasters can happen. That's why I would never do business with a cloud provider that didn't have great support for exporting LARGE amounts of data cost effectively. It's super important that the data not be locked in. I don't care so much about the low-level control plane programming model -- I can change how I call those APIs. But it's super important that the data can be moved to another service easily. And this export service has to be cheap, and there is no way I would use the network for very high scale data movements. You have to assume that the data is going to keep growing, so it's physical media export that you want. Recall Andrew Tanenbaum's "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway".

I'm saying you need to use cloud computing, but I'm not saying you should trust one company to be the right answer forever. Don't step in without a good-quality export service based upon physical media at a known and reasonable price.
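Tanenbaum's station-wagon point is easy to check with arithmetic. A sketch with assumed values — the link speed, dataset size, and courier time below are all hypothetical:

```python
# Network transfer time vs. shipping physical media, for a hypothetical
# 50 TB export. All inputs are assumptions chosen for illustration.
dataset_bits = 50e12 * 8          # 50 TB expressed in bits
link_bps = 100e6                  # an assumed sustained 100 Mbps WAN link
shipping_hours = 48.0             # an assumed two-day courier

network_hours = dataset_bits / link_bps / 3600.0
print("network:  %.0f hours (~%.0f days)" % (network_hours, network_hours / 24))
print("shipping: %.0f hours" % shipping_hours)
print("effective shipped bandwidth: %.0f Mbps"
      % (dataset_bits / (shipping_hours * 3600.0) / 1e6))
```

Under these assumptions the network takes over a month while the courier delivers an effective bandwidth of more than 2 Gbps — and the gap only widens as the data grows.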

This morning AWS Import/Export announced that the service is now out of beta and now supports a programmatic web services interface. From the announcement earlier today:

AWS Import/Export accelerates moving large amounts of data into and out of AWS using portable storage devices for transport. The service is exiting beta and is now generally available. Also, a new web service interface augments the email-based interface that was available during the service's beta. Once a storage device is loaded with data for import or formatted for an export, the new web service interface makes it easy to initiate shipment to AWS in minutes, or to check import or export status in real-time.

You can use AWS Import/Export for:

·         Data Migration - If you have data you need to upload into the AWS cloud for the first time, AWS Import/Export is often much faster than transferring that data via the Internet.

·         Content Distribution - Send data you are computing or storing on AWS to your customers on portable storage devices.

·         Direct Data Interchange - If you regularly receive content on portable storage devices from your business associates, you can have the data sent directly to AWS for import into your Amazon S3 buckets.

·         Offsite Backup - Send full or incremental backups to Amazon S3 for reliable and redundant offsite storage.

·         Disaster Recovery - In the event that you need to quickly retrieve a large backup stored in Amazon S3, use AWS Import/Export to transfer the data to a portable storage device and deliver it to your site.

To use AWS Import/Export, you just prepare a portable storage device, and submit a Create Job request with the open source AWS Import/Export command line application, a third party tool, or by programming directly against the web service interface. AWS Import/Export will return a unique identifier for the job, a digital signature for authenticating your device, and an AWS address to which you ship your storage device. After copying the digital signature to your device, ship it along with its interface connectors and power supply to AWS.

You can learn more about AWS Import/Export and get started using the web service at

Also announced this morning: AWS Management Console for S3, and 3 days ago: Amazon CloudFront adds HTTPS Support, Lower Prices, and Opens an NYC Edge Location. Things are moving pretty quickly right now.



James Hamilton



b: /


Thursday, June 10, 2010 5:18:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, June 07, 2010

Last month I wrote about Solving World Problems with Economic Incentives. In that post I talked about the power of economic incentives when compared to regulatory-body intervention. I’m not really against laws and regulations – the EPA, for example, has done some good work and much of what they do has improved the situation. But 9 times out of 10, good regulation is first blocked and/or watered down by lobby groups; what finally gets enacted is often not fully thought through and brings unintended consequences; it is often overly prescriptive (see Right Problem but Wrong Approach); and regulations are enacted at the speed of government (think continental drift – there is movement but it’s often hard to detect).


If an economic incentive can be carefully crafted such that it squarely targets the desired outcome rather than how to get there, wonderful things can happen. This morning I came across a great example of applying economic incentives to drive a positive outcome rapidly.


First, some background on the base issue. I believe that web site latency has a fundamental impact on customer satisfaction, and there is considerable evidence that it drives better economic returns. See The Cost of Latency for more detail on this issue. Essentially I’m arguing that there really is an economic argument for reducing web page latency and astute companies are acting on it today. If I’m right that economic incentives are enough, why isn’t the latency problem behind us?


The problem is that economic incentives only drive desired outcomes when there is a general, widely held belief that there is a direct correlation between the outcome and the improved economic condition. In the case of web page latency, I’ll claim the evidence from Google’s Steve Souders, Jake Brutlag, and Marissa Mayer, Bing’s Eric Schurman (now at Amazon), Dave Artz from AOL, Phil Dixon from Shopzilla, and many others is very compelling. But compelling isn’t enough. The reason I still write about it is that it’s not widely believed. Many argue that web page latency isn’t highly correlated with better economic outcomes.


Regardless of your take on this important topic, I urge you to read Steve Souders’ post Velocity and the Bottom Line. By the way, Velocity 2010 is coming up and you should consider making the trip. It’s a good conference; the 2009 event produced some wonderful data and I expect 2010 to be at least as good. I plan to be down for a day to give a talk.


Returning to web latency: I really believe there is an economic incentive to improve web site latency. But if this belief is not widely held, it has no impact. I think that is about to change. Google recently announced Google Now Counts Site Speed as a Ranking Factor. Silly amounts of money are spent on search engine optimization. Getting to the top of the ranking is worth big bucks. This economic value of improved ranking is widely believed and drives considerable behavior and investment today. It’s a very powerful tool – so valuable, in fact, that an entire industry has grown up around helping achieve better ranking. Ranking is a very powerful incentive.


What Google has done here is a tiny first step, but it’s a very cool first step with lots of potential upside. If ranking is believed to be materially impacted by site performance, we are going to see the entire web speed up. This could be huge if Google keeps taking steps down this path. Steve Souders’ books High Performance Web Sites and Even Faster Web Sites will continue to have a bright future. Content distribution networks like Limelight, Akamai, and CloudFront will grow even faster. The very large cloud services providers like Amazon Web Services, with data centers all over the world, will continue to grow quickly. We are going to see accelerated datacenter building in Asia. Lots will change.


If Google continues to move down the path of making web site latency a key factor in site ranking, we are going to see a faster web. More! Faster!


Thanks to Todd Hoff of High Scalability for pointing me toward this one in Web Speed Can Push You Off Google’s Search Rankings.




James Hamilton



b: /


Monday, June 07, 2010 5:44:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
 Saturday, June 05, 2010

Industry trends come and go. The ones that stay with us and have lasting impact are those that fundamentally change the cost equation. Public clouds clearly pass this test. The potential savings approach 10x and, in cost sensitive industries, those that move to the cloud fastest will have a substantial cost advantage over those that don’t.


And, as much as I like saving money, the much more important game changer is speed of execution. Those companies depending upon public clouds will be noticeably more nimble. Project approval-to-delivery times fall dramatically when there is no capital expense to be approved. When the financial risk of new projects is small, riskier projects can be tried. The pace of innovation increases. Companies where innovation is tied to the financial approval cycle and the hardware ordering-to-install lag are at a fundamental disadvantage.


Clouds change companies for the better, clouds drive down costs, and clouds change the competitive landscape in industries. We have started what will be an exciting decade.


Earlier today I ran across a good article by Rodrigo Flores, CTO of newScale. In this article, Rodrigo says:


First, give up the fight: Enable the safe, controlled use of public clouds. There’s plenty of anecdotal and survey data indicating the use of public clouds by developers is large. A newScale informal poll in April found that about 40% of enterprises are using clouds – rogue, uncontrolled, under the covers, maybe. But they are using public clouds.


The move to the cloud is happening now. He also predicts:


IT operations groups are going to be increasingly evaluated against the service and customer satisfaction levels provided by public clouds. One day soon, the CFO may walk into the data center and ask, “What is the cost per hour for internal infrastructure, how do IT operations costs compare to public clouds, and which service levels do IT operations provide?” That day will happen this year.


This is a super important point. It was previously nearly impossible to know what it would cost to bring an application up and host it for its operational life; there was no credible alternative to hosting the application internally. Now, with care and some work, a comparison is possible, and I expect that comparison to be made many times this year. It won’t always be made accurately, but the question will be asked, and every company now has access to the data to credibly make the comparison.


I particularly like his point that self-service is much better than “good service”. Folks really don’t want to waste time calling service personnel, no matter how well trained those folks are. Customers just want to get their jobs done with as little friction as possible. Fewer phone calls are good.


Think like an ATM: Embrace self-service immediately. Bank tellers may be lovely people, but most consumers prefer ATMs for standard transactions. The same applies to clouds. The ability by the customer to get his or her own resources without an onerous process is critical.


Self-service is cheaper, faster, and less frustrating for all involved. I’ve seen considerable confusion on this point. Many people tell me that customers want to be called on by sales representatives and want the human interaction from the customer service team. To me, it just sounds like living in the past. These are old, slow, and inefficient models.


Public clouds are the new world order. Read the full article at: The Competitive Threat of Public Clouds.


James Hamilton



b: /


Saturday, June 05, 2010 5:34:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, May 31, 2010

I did a talk at the Usenix Tech conference last year, Where Does the Power Go in High Scale Data Centers. After the talk I got into a more detailed discussion with many folks from Netflix and Canada’s Research in Motion, the maker of the BlackBerry. The discussion ended up in a long lunch over a big table with folks from both teams. The common theme of the discussion was, predictably given the companies and folks involved, innovation in high-scale services and how to deal with incredible growth rates. Both RIM and Netflix are very successful and, until you have experienced and attempted to manage internet growth rates, you really just don’t know. I'm impressed with what they are doing. Growth brings super interesting problems; I learned from both teams and really enjoyed spending time with them.


I recently came across an interesting talk by Santosh Rau, the Netflix Cloud Infrastructure Engineering Manager. The fact that Netflix actually has a cloud infrastructure engineering manager is what caught my attention. Netflix continues to innovate quickly and is moving fast with cloud computing.


My notes from Rau’s talk:

·         Details on Netflix

o   More than 10m subscribers

o   Over 100,000 DVD titles

o   50 distribution centers

o   Over 12,000 instant watch titles

·         Why is Netflix going to the cloud

o   Elastic infrastructure

o   Pay for what you use

o   Simple to deploy and maintain

o   Leverage datacenter geo-diversity

o   Leverage application services (queuing, persistence, security, etc.)

·         Why did Netflix choose Amazon Web Services

o   Massive scale

o   More mature services

o   Thriving, active community of over 400,000 developers with excellent support

·         Netflix goals for move to the cloud:

o   Improved availability

o   Operational simplicity

o   Architect to exploit the characteristic of the cloud

·         Services in cloud:

o   Streaming control service: stream movie content to customers

§  Architecture: Three Netflix services running in EC2 (replication, queueing, and streaming) with inter-service communication via SQS and persistent state in SimpleDB.

§  Good cloud workload in that usage can vary greatly, there is value in having regional data centers, and a better customer experience is possible by streaming content from locations near users

o   Encoding Service: Encodes movies in format required by diverse set of supported devices.

§  Good cloud workload in that it’s very computationally intense and, as new formats are introduced, massive encoding work needs to be done and there is value in doing it quickly (more servers for less time).

o   AWS Services used by Netflix

§  Elastic compute Cloud

§  Elastic Block Storage

§  Simple Queuing Service

§  SimpleDB

§  Simple Storage Service

§  Elastic Load Balancing

§  Elastic MapReduce

o   Developer Challenges:

§  Reliability and capacity

§  Persistence strategy

·         Oracle on EC2 over EBS vs MySQL vs SimpleDB

·         SimpleDB: Highly available replicating across zones

·         Eventually consistent (now supports full consistency; I love eventual consistency but…)

§  Data encryption and key management

§  Data replication and consistency
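The streaming-control design in the notes above — services exchanging messages through a queue, with durable state in a store — can be sketched in miniature, in-process. Here `queue.Queue` stands in for SQS and a plain dict for SimpleDB, purely as an analogy; the service and event names are invented:

```python
import queue

# Miniature, in-process analogy of the pattern described in the notes:
# services communicate via a queue (SQS stand-in) and record durable
# state in a key-value store (SimpleDB stand-in). Names are invented.
messages = queue.Queue()   # stands in for SQS
state = {}                 # stands in for SimpleDB

def replication_service(title_id):
    # Announce that a title's content has been replicated to a region.
    messages.put({"event": "replicated", "title": title_id})

def streaming_service():
    # Consume one event and record streaming readiness as state.
    msg = messages.get()
    state[msg["title"]] = "ready-to-stream"

replication_service("movie-42")
streaming_service()
print(state["movie-42"])   # ready-to-stream
```

The appeal of the real version is that the queue and the store are managed, replicated services, so the Netflix code keeps no fragile state of its own.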


Predictably, the talk ended with “Netflix is hiring” but, in this case, it is actually worth mentioning. They are doing very interesting work and moving lightning fast. RIM is hiring as well.


The slides for the talk are at: slideshare.




James Hamilton



b: /


Monday, May 31, 2010 5:59:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Tuesday, May 25, 2010

PUE is still broken and I still use it. For more on why PUE has definite flaws, see PUE and Total Power Usage Efficiency. I keep using it because it’s an easy-to-compute summary of data center efficiency. It can be gamed endlessly, but it’s easy to compute and it does provide some value.


Improvements are underway in locking down the most egregious abuses of PUE. Three were recently summarized in Technical Scribblings RE Harmonizing Global Metrics for Data Center Energy Efficiency. In this report from John Stanley, the following were presented:

·         Total energy is to include all forms of energy, whether electric or otherwise (e.g., a gas-fired chiller must include the chemical energy employed). I like it, but it’ll be a challenge to implement.

·         Total energy should include lighting, cooling, and all support infrastructure. We already knew this, but it’s worth clarifying since it’s a common “fudge” employed by smaller operators.

·         PUE energy should be calculated using source energy. This is energy at the source, prior to high-voltage distribution losses and including all losses in energy production. For example, for gas plants, it’s the fuel energy used, including heat losses and other inefficiencies. This one seems hard to compute with precision: I’m not sure how I could possibly figure out source energy when some power is base-load power, some is from peaking plants, and some is from out-of-state purchases. This recommendation seems a bit weird.
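The metric itself is just a ratio; the proposals above are all about what counts in the numerator. A sketch with invented facility numbers:

```python
# PUE = total facility energy / IT equipment energy.
# The proposals above change what counts as "total": all energy forms,
# all support infrastructure, and (most controversially) source energy.
# All facility numbers below are invented for illustration.
it_kwh = 1000.0
cooling_kwh = 350.0
lighting_and_other_kwh = 50.0

pue = (it_kwh + cooling_kwh + lighting_and_other_kwh) / it_kwh
print("PUE: %.2f" % pue)   # 1.40

# Source-energy variant: weight consumption by an assumed primary-energy
# multiplier covering generation and transmission losses.
source_factor = 3.0        # assumed, and hard to know in practice
source_pue = pue * source_factor
print("source-energy PUE (assumed factor): %.2f" % source_pue)
```

The sketch also shows why the source-energy proposal is hard: everything hinges on a multiplier that depends on the generation mix, which is exactly the number that is difficult to pin down.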


As with my recommendations in PUE and Total Power Usage Efficiency, these proposed changes add complexity while increasing precision. Mostly I think the increased complexity is warranted although the last, computing source energy, looks hard to do and I don’t fully buy that the complexity is justified.


It’s a good short read: Technical Scribblings RE Harmonizing Global Metrics for Data Center Energy Efficiency. Thanks to Vijay Rao of AMD for sending this one my way.




James Hamilton



b: /


Tuesday, May 25, 2010 9:07:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Federal and state governments are prodigious information technology users. Federal Chief Information Officer Vivek Kundra reports that the United States government is spending $76B annually on 10,000 different systems. In a recently released report, State of Public Sector Cloud Computing, Kundra summarizes the benefits of cloud computing:


There was a time when every household, town, farm or village had its own water well.  Today, shared public utilities give us access to clean water by simply turning on the tap; cloud computing works in a similar fashion.  Just like the water from the tap in your kitchen, cloud computing services can be turned on or off quickly as needed.  Like at the water company, there is a team of dedicated professionals making sure the service provided is safe and available on a 24/7 basis.  Best of all, when the tap isn’t on, not only are you saving water, but you aren’t paying for resources you don’t currently need.

§  Economical.  Cloud computing is a pay-as-you-go approach to IT, in which a low initial investment is required to get going.  Additional investment is incurred as system use increases and costs can decrease if usage decreases.  In this way, cash flows better match total system cost.

§  Flexible.  IT departments that anticipate fluctuations in user load do not have to scramble to secure additional hardware and software.  With cloud computing, they can add and subtract capacity as its network load dictates, and pay only for what they use.

§  Rapid Implementation.  Without the need to go through the procurement and certification processes, and with a near-limitless selection of services, tools, and features, cloud computing helps projects get off the ground in record time. 

§  Consistent Service.  Network outages can send an IT department scrambling for answers.  Cloud computing can offer a higher level of service and reliability, and an immediate response to emergency situations. 

§  Increased Effectiveness.  Cloud computing frees the user from the finer details of IT system configuration and maintenance, enabling them to spend more time on mission-critical tasks and less time on IT operations and maintenance. 

§  Energy Efficient.  Because resources are pooled, each user community does not need to have its own dedicated IT infrastructure.  Several groups can share computing resources, leading to higher utilization rates, fewer servers, and less energy consumption.  

This document defines cloud computing, describes the federal government’s approach, and then covers 30 case studies. The case studies are the most interesting part of the report in that they provide a sampling of the public sector move to cloud computing, showing it’s real, projects are underway, and substantial progress is being made.


It’s good to see the federal government showing leadership at a time when the need for federal services is undiminished but the burgeoning federal deficit needs to be brought under control. The savings possible through cloud computing are substantial and the federal IT spending base is enormous, so it’s particularly good to see this new technology delivery platform being adopted at scale.


·         Document: State of Public Sector Cloud Computing

·         Executive Summary: State of Public Sector Cloud Computing


Thanks to Werner Vogels for sending this article my way.




James Hamilton



b: /


Tuesday, May 25, 2010 5:17:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, May 24, 2010

Economic forces are more powerful than politics.  Political change is slow.  Changing laws takes time.  Lobbyists water down the intended legislation.  Companies find loopholes.  The population as a whole lacks the strength of conviction to make the tough decisions and stick with them.


Economic forces are far more powerful and certainly more responsive than political forces. Effectively, what I’m observing is that great good can be done if there is a business model and profit encouraging it. Here are my two favorite examples, partly because they are both doing great things and partly because they are so different in their approach, yet share the common thread of using the free market to improve the world.


Google RE<C

As a society, we can attempt to limit greenhouse gas emissions by trading carbon credits or passing laws attempting to force change but, in the end, it seems we just keep burning coal.  In my view, the Google approach to tackling this problem is wonderful: invest in renewable energy technologies that can be cheaper than coal.  More on the program: Google's Goal: Renewable Energy Cheaper than Coal and Plug into a Greener Grid: RE<C and RechargeIT Initiatives. They are working on solar thermal, high-altitude wind, and geothermal.  The core idea is that, if renewable sources are cheaper than coal, economic forces would quickly make the right thing happen and we actually would stop burning coal. I love the approach but it’s fiendishly difficult.


Bill & Melinda Gates Foundation

Here’s a related approach. The problem set is totally different but there are some parallels with the previous example in that they are attempting to set up an economic system where it can be profitable to do good for society.


I attended a small presentation by Bill Gates about 5 years ago.  By my measure, it was by far the best talk I’ve ever seen Gates give.  I suspect Bill wouldn’t agree that it was his best, but it had a huge impact on me. No press was there and I saw nothing written about it afterwards, but two things caught my interest: 1) Gates’ understanding of world health problems is astoundingly deep, and 2) I loved his technique of applying free-market principles to battle world problems ranging from disease through population growth.


In this talk Bill noted that North American disease has a very profitable business model and consequently attracts heavy investment.  Third-world disease lacks a business model and, as a result, there is very little investment. It’s clear that many diseases are easy to control or even eradicate, but there is no economic incentive and so there is no sustained progress. There are charity donations but no deep and sustained R&D investment since there are no obvious profits to be made. Bill proposed that we encourage business models that allow drug companies to invest R&D in third-world health problems. They should be able to invest knowing they will be able to make money on the outcome.


Current drug costs are driven almost exclusively by R&D costs; the manufacturing costs are quite low by comparison. Does this remind you of anything? It’s the software world all over again. So the question this brings up is: can we create a model where drugs are sold in huge volume at very low cost?  I recall buying a copy of Unix for an IBM XT back in 1985 and it was $1,000 (Mark Williams Coherent).  Today 1/10 of that will buy an O/S, and many are free with the business model built on services.  Can we do the same thing in the drug world?  Where else could this technique play out?


Using the free market to drive change is the most leveraged approach I’ve ever seen to drive change. Where else can we cost effectively change the economic model and drive a better outcome for society as a whole?


James Hamilton



b: /


Monday, May 24, 2010 4:40:26 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Tuesday, May 18, 2010

I am excited by very low-power, very low-cost servers and the impact they will have on our industry. There are many workloads where CPU is in excess, and lower-power, lower-cost servers are the right answer. These are workloads that don’t fully exploit the capabilities of the underlying server; for them, the server is out of balance with excess CPU capability (both power and cost). There are workloads where less is more. But, with technology shifts, it’s easy to get excited and try to apply the new solution too broadly.


We can see parallels in the Flash memory world. At first there was skepticism that Flash had a role to play in supporting server workloads. More recently, there is huge excitement around flash and I keep coming across applications of the technology that really don’t make economic sense. Not all good ideas apply to all problems. In going after this issue I wrote When SSDs make sense in Server applications and then later When SSDs Don’t Make Sense in Server Applications. Sometimes knowing where not to apply a technology is more important than knowing where to apply it. Looking at the negative technology applications is useful.


Returning to very low-cost, low-power servers, I’ve written a bit about where they make sense and why:

·         Very Low-Power Server Progress

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the Microslice Computer

·         Linux/Apache on ARM Servers

·         ARM Cortex-A9 SMP Design Announced


But I haven’t looked much at where very low-power, low-cost servers do not make sense. When aren’t they a win, measured in work done per dollar and work done per joule? Last week Dave DeWitt sent me a paper that looks at the application of wimpy servers (from the excellent FAWN, Fast Array of Wimpy Nodes, project at CMU) to database workloads. In Wimpy Node Clusters: What About Non-Wimpy Workloads, Willis Lang, Jignesh Patel, and Srinath Shankar find that the Intel Xeon E5410 is slightly better than the Intel Atom when running parallel clustered database workloads including TPC-E and TPC-H. The database engine in this experiment is IBM DB2 DB-X (yet another new name for the product originally called DB2 Parallel Edition – see IBM DB2 for information on DB2, though the Wikipedia page has not yet caught up with the latest IBM name change).


These results show us that on complex, clustered database workloads, server processors can win over low-power parts. For those interested in probing the very low-cost, low-power processor space, the paper is worth a read: Wimpy Node Clusters: What About Non-Wimpy Workloads.


The generalization of their findings that I’ve been using: CPU-intensive workloads and workloads with poor scaling characteristics are poor choices to host on very low-power, low-cost servers. CPU-intensive workloads lose because they run best where there is maximum CPU per server in the cluster. Or, worded differently, the multi-server cluster overhead is minimized by having fewer, more powerful nodes.  Workloads with poor scaling characteristics are another category not well supported by wimpy nodes, and the explanation is similar: although these workloads may not be CPU-bound, they don’t run well over clusters with large server counts. Generally, more resources per node is the best answer if the workload can’t be scaled over large server counts.
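To make the intuition concrete, here is a toy model (the numbers and the overhead function are mine for illustration, not from the paper): total runtime is the perfectly parallel work plus a coordination overhead that grows with node count, so the same aggregate CPU spread over many wimpy nodes loses.

```python
# Illustrative model: runtime = parallel work + per-node coordination overhead.
# All constants are made up to show the shape of the tradeoff, nothing more.
def runtime(total_work, node_perf, nodes, coord_cost_per_node=0.05):
    """Time to finish: work divided across nodes, plus overhead per node."""
    return total_work / (node_perf * nodes) + coord_cost_per_node * nodes

fast = runtime(1000.0, node_perf=10.0, nodes=8)    # few brawny nodes
wimpy = runtime(1000.0, node_perf=1.0, nodes=80)   # many wimpy nodes, same total CPU
print(fast, wimpy)  # fast is about 12.9, wimpy is 16.5 despite equal aggregate CPU
```

The parallel term is identical in both cases; only the coordination term, which scales with server count, separates them.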


Where very low-power, low-cost servers win is:

1.      Very cold storage workloads. I last posted on these workloads last year in Successfully Challenging the Server Tax. The core challenge with cold storage apps is that overall system cost is dominated by disk, but the disk needs to be attached to a server, so we have to amortize the cost of the server over the attached disk storage. The more disk we attach to a single server, the lower the cost. But the more disk we attach to a single server, the larger the failure zone. Nobody wants to have to move 64 to 128 TB every time a server fails. The tension: a higher disk-to-server ratio drives down costs but magnifies the negative impact of server failures. So, if we have a choice between attaching more disks to a given server or, instead, using a smaller, cheaper server, the conclusion is clear. Smaller wins. This is a wonderful example of where low-power servers are a win.

2.      Workloads with good scaling characteristics and insignificant local resource requirements.  Web workloads that just accept connections and dispatch them can run well on these processors. However, we still need to consider the “insignificant local resource” clause. If the workload scales perfectly but each interaction needs access to a very large memory, for example, it may be a poor choice for wimpy nodes.  If the workload scales with CPU and local resource needs are small, wimpy nodes are a win.

The first example above is a clear win. The second is more complex. Some examples will be a win but others will not. The better the workload scales and the less fixed resources (disk or memory) required, the bigger the win.
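The cold storage amortization argument is easy to put in numbers. The prices below are hypothetical round figures of my own, purely to show the shape of the calculation:

```python
# Amortize server cost over attached disk and compare cost per TB.
# All prices are invented for illustration; plug in real quotes to use this.
def cost_per_tb(server_cost, disks, disk_cost=150.0, disk_tb=2.0):
    """Dollars per TB when one server fronts `disks` drives."""
    return (server_cost + disks * disk_cost) / (disks * disk_tb)

beefy = cost_per_tb(server_cost=3000.0, disks=24)  # big server, 48 TB failure zone
wimpy = cost_per_tb(server_cost=300.0, disks=12)   # cheap server, 24 TB failure zone
print(beefy, wimpy)  # 137.5 vs 87.5 dollars/TB
```

With these assumed prices the cheap server wins on cost per TB while also halving the amount of data that must move when a server fails, which is exactly the point of the post.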


Good job by Willis Lang, Jignesh Patel, and Srinath Shankar in showing us, with detailed analysis, where wimpy nodes lose.


James Hamilton



b: /


Tuesday, May 18, 2010 4:09:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
 Friday, May 14, 2010

I recently came across a nice data center cooling design by Alan Beresford of EcoCooling Ltd. In this approach, EcoCooling replaces the CRAC units with a combined air mover, damper assembly, and evaporative cooler. I’ve been interested in evaporative coolers and their application to data center cooling for years, and they are becoming more common in modern data center deployments (e.g. Data Center Efficiency Summit).


An evaporative cooler is a simple device that cools air by evaporating water, taking it through a state change from liquid to vapor. Evaporative coolers are incredibly cheap to run and particularly efficient in locales with low humidity. They can allow the power-intensive process-based cooling to be shut off for large parts of the year and, when combined with favorable climates or increased data center temperatures, can entirely replace air conditioning systems. See Chillerless Datacenter at 95F; for a deeper discussion, see Costs of Higher Temperature Data Centers; and for a discussion of server design impacts: Next Point of Server Differentiation: Efficiency at Very High Temperature.


In the EcoCooling solution, air from the hot aisle is released outside the building, while outside air is passed through an evaporative cooler and then delivered to the cold aisle. On days too cold for direct delivery to the datacenter, outside air is mixed with exhaust air to achieve the desired inlet temperature.
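The two operating modes are easy to sketch with the standard evaporative cooler effectiveness model. The 0.85 effectiveness below is an assumed typical value, not an EcoCooling spec:

```python
def supply_temp_c(dry_bulb_c, wet_bulb_c, effectiveness=0.85):
    """Outlet temperature of a direct evaporative cooler (effectiveness model):
    the cooler closes most of the gap between dry-bulb and wet-bulb temperature."""
    return dry_bulb_c - effectiveness * (dry_bulb_c - wet_bulb_c)

def outside_air_fraction(t_outside_c, t_exhaust_c, t_target_c):
    """Fraction of outside air to mix with hot-aisle exhaust on cold days
    so the blended inlet air hits the target temperature."""
    return (t_exhaust_c - t_target_c) / (t_exhaust_c - t_outside_c)

# A hot, dry 35C day with a 20C wet-bulb still yields cool supply air:
print(supply_temp_c(35.0, 20.0))                 # 22.25
# On a -10C day, mix roughly 38% outside air with 35C exhaust for an 18C inlet:
print(outside_air_fraction(-10.0, 35.0, 18.0))
```

The first function explains why dry locales do so well; the second is the damper-mixing mode described above.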



This is a nice clean approach to substantially reducing air conditioning hours. For more information see: Energy Efficient Data Center Cooling or the EcoCooling web site:




James Hamilton



b: /



Friday, May 14, 2010 7:06:00 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
 Monday, May 10, 2010

Wide area network costs and bandwidth shortages are the most common reasons why many enterprise applications run in a single data center. Single data center failure modes are common. There are many external threats to single data center deployments, including utility power loss, tornado strikes, facility fires, network connectivity loss, earthquakes, break-ins, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot.  All it takes is one well-intentioned networking engineer to black-hole the entire facility’s network traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or by cascading failure modes.  And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work.  Even with incredible redundancy, just having the redundant components in the same room means that a catastrophic failure of one system can eliminate the second. It’s very hard to engineer redundancy with high independence and physical separation of all components in a single datacenter.


With incredible redundancy comes incredible cost. And even at incredible cost, failure modes remain that can eliminate the facility entirely. The only cost-effective solution is to run redundantly across multiple data centers.  Redundancy without physical separation is not sufficient, and making a single facility bulletproof has expenses asymptotically heading towards infinity with only tiny increases in availability as the expense grows. The only way to get the next nine is to have redundancy between two data centers. This approach is both more available and considerably more cost effective.


Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are three main blockers for the customers I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and cost.


The scale problem, stated simply, is that most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a colocation facility.  Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability.  Cloud service providers like Amazon Web Services run tens of data centers. You can leverage AWS scale economics to allow even very low-scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each datacenter is what AWS calls an Availability Zone. Assuming the scale economics allow it, the next blocker to cross-datacenter replication is latency.


There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can be run. This limitation is real but won’t prevent redundant data centers from getting far “enough” apart to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and cost is the prime reason why most customers don’t run geo-diverse.
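The speed-of-light bound is worth a quick back-of-envelope check. Treating the 4 to 5 msec figure as a round-trip budget (the post doesn't say which, so this is my assumption) and ignoring routing, switching, and queuing delays:

```python
# Light in fiber travels at roughly 2/3 the vacuum speed of light,
# which works out to about 200 km per millisecond.
C_FIBER_KM_PER_MS = 200.0

def max_separation_km(rtt_budget_ms):
    """Upper bound on datacenter separation for a round-trip latency budget;
    real paths add equipment delay and are never straight lines."""
    return (rtt_budget_ms / 2.0) * C_FIBER_KM_PER_MS

print(max_separation_km(5.0))  # 500.0
```

Even after discounting for real-world routing, a few hundred kilometers is easily far enough apart to escape shared utility, weather, and seismic fault zones, which is the point being made above.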


I’ve argued that latency need not be a blocker. So, if the application has the scale to be able to run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithms and appliances. These are systems that compress traffic between branch offices and central enterprise IT centers. Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing, but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data-type-specific optimizations, 2) dedupe, and 3) block-based compression.


Data type specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocol, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression since they are the easiest to apply to cross data center traffic replication traffic.


Broadly, dedupe breaks the data to be transferred between datacenters into either fixed- or variable-sized blocks. Variable blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote datacenter, we just send the hash signature. If that block is already in the remote system’s block index, then it or its clone has already been sent sometime in the past and nothing needs to be sent now. In employing this technique we are exploiting data redundancy at a coarse scale: we are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and the inherent redundancy in the data. But, done right, the compression ratios can be phenomenal, with 30 to 50:1 not being uncommon. This, by the way, is the same basic technology applied to storage deduplication by companies like Data Domain.
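The fixed-block variant of the scheme above fits in a few lines. This is a minimal sketch of the idea, not any vendor's implementation; the function names and the in-memory `set` standing in for the remote index are mine:

```python
import hashlib

def dedupe_send(data: bytes, remote_index: set, block_size: int = 4096) -> int:
    """Fixed-size-block dedupe: ship a 32-byte SHA-256 signature for blocks the
    remote side already holds, and the full block otherwise.
    Returns the number of bytes that would go on the wire."""
    sent = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        sig = hashlib.sha256(block).digest()
        if sig in remote_index:
            sent += len(sig)          # remote already has it: send only the hash
        else:
            remote_index.add(sig)     # remember it is now on both sides
            sent += len(block)        # first sighting: ship the whole block
    return sent

index = set()
redundant = b"x" * 4096 * 10          # ten identical blocks
print(dedupe_send(redundant, index))  # 4096 + 9 * 32 = 4384 bytes instead of 40960
```

On this pathological all-duplicates input the ratio is about 9:1; real traffic lands wherever its inherent redundancy allows, which is why the post quotes a wide 30-50:1 range for favorable workloads.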


If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second-level compression technique: usually a block-oriented compression algorithm, frequently some variant of LZ. The combination of dedupe and block compression is very effective. But the system I’ve described above introduces latency and, for highly latency-sensitive workloads like EMC SRDF, this can be a problem. Many latency-sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression.


Last week I ran across a company targeting latency-sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to get the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up four ways: 1) a hardware implementation based upon FPGAs, 2) a fixed-size, full-packet block size, 3) a bounded index exploiting locality, and 4) heuristic signatures.


The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost- and power-effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. See Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration and, for an application view, High Scale Network Research.


The second technique is another good one. Rather than spend time computing block boundaries, just use the network packet as the block: essentially, the networking system finds the block boundaries. This has the downside of not being as effective as variable-sized block systems, and it doesn’t exploit type-specific knowledge, but it can run very fast at low overhead and come close to the higher compression rates yielded by the more computationally intensive techniques. They are exploiting the fact that 20% of the work produces 80% of the gain.


The third technique reduces the index size. Rather than keeping a full index of all blocks that have ever been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. The smaller index is much less resource intensive, requiring much less memory and no disk accesses, and avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality: redundant data packets often show up near each other. Clearly this is not always the case, and they won’t get the maximum possible compression, but they claim to get most of the compression possible in full-block-index systems without the latency penalty of a disk access and with less memory overhead.
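A last-N signature index is a simple recency-bounded structure. Here is a sketch of the idea (my own toy, not Infineta's data structure) built on an ordered dictionary:

```python
from collections import OrderedDict

class BoundedIndex:
    """Keep only the most recent N signatures, fully memory resident, exploiting
    the observation that redundant packets tend to arrive near each other."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()          # signature -> None, in recency order

    def seen(self, sig: bytes) -> bool:
        """True if sig is in the index; otherwise record it, evicting the
        least recently seen signature when the index is full."""
        if sig in self.entries:
            self.entries.move_to_end(sig)     # refresh recency on a hit
            return True
        self.entries[sig] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest signature
        return False
```

A signature that has been evicted is simply treated as never seen and its block is re-sent, so the bounded index gives up a little compression in exchange for disk-free, constant-memory lookups.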


The final technique wasn’t described in enough detail for me to fully understand it.  What Infineta appears to be doing is avoiding the cost of fully hashing each packet by taking an approximate signature over carefully chosen packet offsets. Clearly a fast signature over less than the full packet can prove that a packet is not in the index on the other side. But if the fast hash is present, it doesn’t prove the packet has already been sent; two different packets can have the same fast hash. Infineta was a bit cagey on this point, but what they might be doing is using the very fast approximate hash to find packets that unambiguously have not been sent. Those packets absolutely need to be sent, so compression and transmission can start immediately with no further hashing cost. For packets that may not need to be sent, take a full signature and check whether it is on the remote site. If my guess is correct, the fast hash is being used to quickly avoid spending resources on packets that are not in the index on the other side.


Infineta looks like an interesting solution.  More data on them at:

·         Press release: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers

·         Web site:

·         Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding


James Hamilton



b: /


Monday, May 10, 2010 12:07:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
 Thursday, May 06, 2010

Earlier this week Clustrix announced a MySQL compatible, scalable database appliance that caught my interest. Key features supported by Clustrix:

·         MySQL protocol emulation (MySQL protocol supported so MySQL apps written to the MySQL client libraries just work)

·         Hardware appliance delivery package in a 1U package including both NVRAM and disk

·         Infiniband interconnect

·         Shared nothing, distributed database

·         Online operations including alter table add column


I like the idea of adopting a MySQL programming model. But it’s incredibly hard to be really MySQL compatible unless each node is actually based upon the MySQL execution engine. And it’s usually the case that a shared-nothing, clustered DB will bring some programming model constraints. For example, if global secondary indexes aren’t implemented, it’s hard to support uniqueness constraints on non-partition-key columns and it’s hard to enforce referential integrity. Global secondary index maintenance implies that a single insert, update, or delete that would normally require only a single-node change instead requires atomic updates across many nodes in the cluster, making updates more expensive and susceptible to more failure modes. Essentially, making a cluster look exactly the same as a single very large machine with all the same characteristics isn’t possible. But many jobs that can’t be done perfectly are still well worth doing. If Clustrix delivers all they are describing, it should be successful.


I also like the idea of delivering the product as a hardware appliance. It keeps the support model simple, reduces install and initial setup complexity, and enables application-specific hardware optimizations.


Using Infiniband as a cluster interconnect is a nice choice as well. I believe that 10GigE with RDMA support will provide better price performance than Infiniband, but commodity 10GigE volumes and quality RDMA support are still 18 to 24 months away, so Infiniband is a good choice for today.


Going with a shared nothing architecture avoids dependence on expensive shared storage area networks and the scaling bottleneck of distributed lock managers.  Each node in the cluster is an independent database engine with its own physical (local) metadata, storage engine, lock manager, buffer manager, etc. Each node has full control of the table partitions that reside on that node. Any access to those partitions must go through that node. Essentially, bringing the query to the data rather than the data to the query. This is almost always the right answer and it scales beautifully.  

In operation, a client connects to one of the nodes in the cluster and submits a SQL statement. The statement is parsed and compiled. During compilation, the cluster-wide (logical) metadata is accessed as needed and an execution plan is produced. The cluster-wide (logical) metadata is either replicated to all nodes or stored centrally with local caching. The execution plan will run on as many nodes as needed, with the constraint that table or index access happens on the nodes that house those table or index partitions. Operators higher in the execution plan can run on any node in the cluster.  Rows flow between operators that span node boundaries over the Infiniband network.  The root of the query plan runs on the node where the query was started, and the results are returned to the client program using the MySQL client protocol.
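The core routing rule of a shared-nothing design (bring the query to the data) can be sketched in a few lines. The node names and hash scheme below are mine for illustration, not Clustrix's actual partitioning:

```python
import zlib

# Each table partition is owned by exactly one node; any access to a row must
# run on the owning node, so that node's lock manager and buffer pool have
# full, uncontended control of the partition.
NODES = ["node0", "node1", "node2", "node3"]

def owning_node(partition_key: str) -> str:
    """Route a partition key to its owning node with a stable hash, so the
    same key always lands on the same node across all clients."""
    return NODES[zlib.crc32(partition_key.encode()) % len(NODES)]

# A table-scan operator for a given key range runs on the owner; operators
# higher in the plan (joins, aggregates) can run anywhere in the cluster.
print(owning_node("customer:42"))
```

Because ownership is deterministic, no distributed lock manager is needed for routine access, which is exactly the scaling bottleneck the shared-nothing design avoids.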


As described, this is a very big engineering project. I’ve worked on teams that have taken exactly this approach; they took several years to get to the first release, and even subsequent releases had programming model constraints. I don’t know how far along Clustrix is at this point, but I like the approach and I’m looking forward to learning more about their offering.


White paper: Clustrix: A New Approach

Press Release: Clustrix Emerges from Stealth Mode with Industry’s First Clustered DB



James Hamilton



b: /


Thursday, May 06, 2010 9:49:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
 Tuesday, May 04, 2010

Dave Patterson did a keynote at Cloud Futures 2010.  I wasn’t able to attend, but I’ve heard it was a great talk, so I asked Dave to send the slides my way. He presented Cloud Computing and the Reliable Adaptive Systems Lab.


The Berkeley RAD Lab principal investigators include Armando Fox, Randy Katz & Dave Patterson (systems/networks), Michael Jordan (machine learning), Ion Stoica (networks & P2P), Anthony Joseph (systems/security), Michael Franklin (databases), and Scott Shenker (networks), in addition to 30 PhD students, 10 undergrads, and 2 postdocs.


The talk starts by arguing that cloud computing actually is a new approach, drawing material from the Above the Clouds paper that I mentioned early last year: Berkeley Above the Clouds. He then walks through why pay-as-you-go computing with small time increments allows SLAs to be hit without stranding valuable resources.




The slides are up at Cloud Computing and the RAD Lab. If you haven’t already read it, Above the Clouds: A Berkeley View is worth reading.




James Hamilton



b: /


Tuesday, May 04, 2010 5:48:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, April 30, 2010

Rich Miller of Datacenter Knowledge covered this last week and it caught my interest. I’m super interested in modular data centers (Architecture for Modular Datacenters) and highly efficient infrastructure (Data Center Efficiency Best Practices) so the Yahoo! Computing Coop caught my interest.


As much as I like the cost, strength, and availability of ISO standard shipping containers, 8’ is an inconvenient width. It’s not quite wide enough for two rows of standard racks, and there are cost and design advantages in having at least two rows in a container. With two rows, air can be pulled in each side with a single hot aisle in the middle and large central exhaust fans. It’s an attractive design point, and there is nothing magical about shipping containers. What we want is commodity, prefab, and a moderate increment of growth.


The Yahoo design is a nice one. They are using a shell borrowed from a Tyson Foods design; Tyson grows a large part of the North American chicken supply.  These prefab facilities are essentially giant air handlers, with the shell making up a good part of the mechanical plant. They pull air in either side of the building and it passes through two rows of servers into the center of the building. The roof slopes to the center from both sides with central exhaust fans. Each unit is 120’ x 60’ and houses 3.6 MW of critical load.


Because of the module width, they have 4 rows of servers. It’s not clear whether the air from outside has to pass through both rows to get to the central hot aisle, but it sounds like that is the approach. Generally, serial cooling, where the hot air from one set of servers is routed through another, is worth avoiding. It certainly can work, but it requires more airflow than single-pass cooling at the same approach temperature.


Yahoo! believes they will be able to bring a new building online in 6 months at a cost of $5M per megawatt. In the Buffalo New York location, they expect to only use process-based cooling 212 hours/year and have close to zero water consumption when the air conditioning is not in use. See the Data Center Knowledge article for more detail: Yahoo Computing Coop: Shape of Things to Come?

More pictures at: A Closer Look at Yahoo’s New Data Center. Nice design Yahoo.




James Hamilton



b: /


Friday, April 30, 2010 9:44:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, April 29, 2010

Facebook released Flashcache yesterday: Releasing Flashcache. The authors of Flashcache, Paul Saab and Mohan Srinivasan, describe it as “a simple write back persistent block cache designed to accelerate reads and writes from slower rotational media by caching data in SSD's.”


There are commercial variants of flash-based write caches available as well. For example, LSI has a caching controller that operates at the logical volume layer. See LSI and Seagate take on Fusion-IO with Flash. The way these systems work is, for a given logical volume, page access rates are tracked.  Hot pages are stored on SSD while cold pages reside back on spinning media. The cache is write-back and pages are written back to their disk resident locations in the background.


For benchmark workloads with evenly distributed, 100% random access patterns, these solutions don’t contribute all that much. Fortunately, the world is full of data access pattern skew and some portions of the data are typically very cold while others are red hot. 100% even distributions really only show up in benchmarks – most workloads have some access pattern skew. And, for those with skew, a flash cache can substantially reduce disk I/O rates at lower cost than adding more memory.


What’s interesting about the Facebook contribution is that it’s open source and supports Linux. From the release:


Flashcache is a write back block cache Linux kernel module. [..]Flashcache is built using the Linux Device Mapper (DM), part of the Linux Storage Stack infrastructure that facilitates building SW-RAID and other components. LVM, for example, is built using the DM.


The cache is structured as a set associative hash, where the cache is divided up into a number of fixed size sets (buckets) with linear probing within a set to find blocks. The set associative hash has a number of advantages (called out in sections below) and works very well in practice.


The block size, set size and cache size are configurable parameters, specified at cache creation. The default set size is 512 (blocks) and there is little reason to change this.
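To make the quoted structure concrete, here is a toy model of a set-associative lookup with linear probing, sketched in Python rather than the kernel C of the actual module (the class, sizes, and eviction handling here are illustrative, not Flashcache’s real implementation):

```python
class SetAssociativeCache:
    """Toy model of a set-associative cache: the cache is divided into
    fixed-size sets, a block's number hashes to one set, and linear
    probing within that set finds the block (or a free slot)."""

    def __init__(self, num_sets=64, set_size=512):
        # Each set is a fixed-size array of (block_number, data) slots.
        self.sets = [[None] * set_size for _ in range(num_sets)]

    def _set_for(self, block):
        # A block can live only in the one set it hashes to, so lookup
        # scans at most set_size slots rather than the whole cache.
        return self.sets[hash(block) % len(self.sets)]

    def lookup(self, block):
        for slot in self._set_for(block):
            if slot is not None and slot[0] == block:
                return slot[1]      # cache hit
        return None                 # cache miss

    def insert(self, block, data):
        candidate_set = self._set_for(block)
        for i, slot in enumerate(candidate_set):
            if slot is None or slot[0] == block:
                candidate_set[i] = (block, data)
                return True
        return False                # set full; a real cache would evict
```

Confining each block to a single small set keeps the probe cost bounded and the metadata per set compact, which is one of the advantages the Flashcache documentation calls out for this layout.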


More information on usage:  Thanks to Grant McAlister for pointing me to the Facebook release of Flashcache. Nice work Paul and Mohan.




James Hamilton





Thursday, April 29, 2010 6:36:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Wednesday, April 21, 2010

There have been times in past years when it really looked like our industry was on track to support only a single relevant web browser. Clearly that’s not the case today. In a discussion with a co-worker today on the importance of “other” browsers, I wanted to put some data on the table, so I looked up the browser stats for this web site. I hadn’t looked for a while and found the distribution truly interesting:


Admittedly, those who visit this site clearly don’t represent the broader population well. Nonetheless, the numbers are super interesting. Firefox eclipsing Internet Explorer, and by such a wide margin, was surprising to me. You can’t see it in the data above, but the IE share continues to decline. Chrome is already up to 17%.


Looking at the share data posted on Wikipedia (using the Net Market Share data), we see that IE has declined from over 91.4% to 61.4% in just 5 years. Again, a surprisingly rapid change.


Focusing on client operating systems, from the skewed sample that accesses this site, we see several interesting trends: 1) Mac share continues to climb sharply, at 16.6%, 2) Linux is at 9%, 3) iPhone, iPod, and iPad are in aggregate over 5¼%, and 4) Android is already over ¼%.


Overall, we are seeing more browser diversity, more O/S diversity, and, unsurprisingly, more mobile devices.




James Hamilton





Wednesday, April 21, 2010 12:25:25 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Saturday, April 17, 2010

We live on a boat, which has lots of upside, but broadband connectivity isn’t one of them. As it turns out, our marina has WiFi, but it is sufficiently unreliable that we needed another solution. I wish there were a Starbucks hotspot across the street – actually, there is one within a block, but we can’t quite pick up the signal even with an external antenna (Syrens).


WiFi would have been a nice solution but didn’t work, so we decided to go with WiMAX. We have used ClearWire for over a year on the boat and, generally, it has worked acceptably well. Not nearly as fast as WiFi, but better than 3G cellular. Recently ClearWire changed its name to Clear and “upgraded” the connectivity technology to full WiMAX. Unfortunately, the upgrade substantially reduced the coverage area and has been fairly unstable, and the customer support, although courteous and friendly, is so far removed from the engineering team that they basically just can’t make a difference no matter how hard they try.


We decided we had to find a different solution. I use AT&T 3G cellular with tethering and would have been fine with that as a solution. It’s a bit slower than Clear, but it’s stable and coverage is very broad. Unfortunately, the “unlimited” plan we got some years ago is actually limited to 5 GB/month, and we move far more data than that. I can’t talk AT&T into offering a solution so, again, we needed something else.


Sprint now has a WiMAX service that offers good performance (although they can be a bit aggressive with throttling), they have fairly broad coverage in our area, and they are expanding quickly (Sprint announces seven new WiMAX markets). Some Sprint modems have the additional nice feature that, if WiMAX is unavailable, they transparently fall back to 3G. The 3G service is still limited to 5 GB but, as long as we are on WiMAX for a substantial portion of the month, we’re fine.


The remaining challenge was that Virtual Private Network (VPN) connections over WiMAX can be unstable. I really wish my workplace supported Exchange RPC over HTTP (one of the coolest Outlook/Exchange features of all time). However, many companies believe that Exchange RPC over HTTP is insecure in that it doesn’t require two-factor authentication. Ironically, many of these companies allow BlackBerrys and iPhones to access email without two-factor auth. I won’t try to explain why one is unsafe and the other is fine, but I think it might have something to do with the popularity of iPhones and BlackBerrys with execs and senior technical folks :-).


In the absence of RPC over HTTP, logging into the work network via VPN is the only answer. My workplace uses Aventail, but there are a million solutions out there. I’ve used many and love none. There are many reasons why these systems can be unstable, cause blue screens, and otherwise negatively impact the customer experience. But one that has been driving me especially nuts is frequent dropped connections and hangs when using the VPN over WiMAX. It appears to happen more frequently when there is more data in flight, but losing a connection every few minutes is quite common.


It turns out the problem is that the default MTU on most client systems is 1500, but the WiMAX default is often smaller. It should still work, just less efficiently, but it doesn’t. For more details see


To check Vista MTUs:


netsh interface ipv4 show subinterfaces


To change the MTU to 1400:


netsh interface ipv4 set subinterface "your vpn interface here" mtu=1400 store=persistent


I’m using an MTU of 1400 with Sprint and it’s working well. Thanks to for the easy MTU update. If you are having flaky VPN support, especially if running over WiMAX, check your MTU.
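As a back-of-the-envelope sketch of the trade-off (assuming plain 20-byte IPv4 and TCP headers with no options; VPN encapsulation adds further overhead on top of this):

```python
def tcp_payload_per_packet(mtu, ip_header=20, tcp_header=20):
    """Usable TCP payload (the MSS) for a given link MTU, assuming
    standard 20-byte IPv4 and TCP headers with no options."""
    return mtu - ip_header - tcp_header
```

A 1500-byte MTU leaves 1460 bytes of payload per packet while 1400 leaves 1360, so dropping the MTU sacrifices only about 7% of per-packet efficiency but keeps packets under the smaller WiMAX path MTU, avoiding fragmentation or, with VPNs that set the don’t-fragment bit, silently dropped packets.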




James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859


Saturday, April 17, 2010 5:43:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, April 12, 2010

Standards and benchmarks have driven considerable innovation. The most effective metrics are performance-based.  Rather than state how to solve the problem, they say what needs to be achieved and leave the innovation open.


I’m an ex-auto mechanic and was working as a wrench in a Chevrolet dealership in the early ’80s. I hated the emission controls coming into force at that time because they made the cars run so badly. A 1980 Chevrolet 305 CID with a 4-barrel carburetor would barely idle even in perfect tune. It was a mess. But the emission standards didn’t say the cars had to run badly, only what needed to be achieved. Initially drivability, power, and fuel economy all suffered, but competition to meet those goals brought many innovations to market, and as emission standards forced more precise engine management, we ended up with compliance to increasingly strict standards at the same time that both power density and fuel economy improved.


What’s key is the combination of competition and performance-based standards.  If we set high goals and allow companies to innovate in how they achieve those goals, great things happen. We need to take that same lesson and apply it to data centers.


Recently, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) added data centers to their building efficiency standard, ASHRAE Standard 90.1. This standard defines energy efficiency requirements for most types of buildings in America and is often incorporated into building codes across the country. Unfortunately, as currently worded, the document takes a prescriptive approach. To comply, you must use economizers and other techniques currently in common practice. But are economizers the best way to achieve the stated goal? What about a system that harvested waste heat and applied it to growing cash crops like tomatoes? What about systems using heat pumps to scavenge low-grade heat (see Data Center Waste Heat Reclaimation)? Both these innovations would be precluded by the proposed spec because they don’t use economizers.


Urs Hoelzle, Google’s Infrastructure SVP, recently posted Setting Efficiency Goals for Data Centers, where he argues we need goal-based environmental targets that drive innovation rather than prescriptive standards that prevent it. Co-signers with Urs include:

·         Chris Crosby, Senior Vice President, Digital Realty Trust

·         Hossein Fateh, President and Chief Executive Officer, Dupont Fabros Technology

·         James Hamilton, Vice President and Distinguished Engineer, Amazon

·         Urs Hoelzle, Senior Vice President, Operations and Google Fellow, Google

·         Mike Manos, Vice President, Service Operations, Nokia

·         Kevin Timmons, General Manager, Datacenter Services, Microsoft


I think we’re all excited by the rapid pace of innovation in high-scale data centers. We know it’s good for the environment and for customers. And I think we’re all uniformly in agreement with ASHRAE on the intent of 90.1. What’s needed to make it a truly influential and high-quality standard is for it to be performance-based rather than prescriptive. But otherwise, I think we’re all heading in the same direction.




James Hamilton





Monday, April 12, 2010 9:39:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton