I’m always interested in research on cloud service efficiency, and last week, at the Uptime Institute IT Symposium in New York City, management consultancy McKinsey published a report entitled Clearing the air on Cloud Computing. McKinsey is a well respected professional services company that describes itself as “a management consulting firm advising leading companies on organization, technology, and operations”. Over the first 22 years of my career in server-side computing at Microsoft and IBM, I’ve met McKinsey consultants frequently, although they were typically working on management issues and organizational design rather than technology. This particular report focuses more on technology, where the authors investigate the economics of very high scale data centers and cloud computing. This has been my prime area of interest for the last 5 years, and my first observation is the authors are taking on an incredibly tough challenge.

Gaining a complete inventory of the costs of internal IT is very difficult. The costs hide everywhere. Some are in central IT teams, some are in central procurement groups, some with the legal and contract teams, and some in departmental teams doing IT work although not part of corporate IT. It’s incredibly difficult to get a full, accurate, and unassailable inventory of the costs of internal IT. Further complicating the equation, internal IT is often also responsible for mission-critical tasks that have nothing to do with comparing internal IT with cloud services offerings. Internal IT is often responsible for internal telco and for writing many of the applications that actually run the business. Basically, it’s very hard to first find all the comparable internal IT costs and, even with a complete inventory of IT costs, it’s then even harder to separate out mission-critical tasks that internal IT teams own that have nothing to do with whether the applications are cloud or internally hosted. I’m arguing that this report’s intent, of comparing costs in a generally applicable way across all industries, is probably not possible to do accurately and may not be a good idea.

In the report, the authors conclude that current cloud computing offerings “are not cost-effective compared to large enterprise data centers.” They argue that cloud offerings are most attractive for small and medium sized enterprises. The former is a pretty strong statement, and contradicts most of what I’ve learned about high scale service, so it’s definitely worth digging deeper.

It’s not clear that a credible detailed accounting of all comparable IT costs that generalizes across all industries can be produced. Each company is different and these costs are both incredibly hard to find and entangled with many other mission-critical tasks the internal IT team owns that have nothing to do with whether they are internally hosted or utilizing the cloud. From all the work I’ve done around high scale services, it’s inarguably true that some internal IT tasks are very leveraged. These tasks form the core competency of the business and are usually at least developed internally if not hosted internally. In what follows, I’ll argue that non-differentiated services, those that need to be good but aren’t the company’s competitive advantage, are much more economically hosted in very high-scale cloud computing environments. The hosting decision should be driven by company strategy and a decision to concentrate investment capital where it has the most impact. The savings available using a shared cloud for non-differentiated services are dramatic, and are available for all companies, from the smallest startup to the largest enterprise. I’ll look at some of these advantages below.

In this report the authors conclude that cloud computing makes sense for small and medium enterprises but that cloud offerings “are not cost-effective compared to large enterprise data centers.” The authors argue there are economies of scale that make sense for small and medium sized businesses, but that the cost advantages break down at the very large. Essentially they are arguing that big companies already have all the economies of scale available to internet-scale services. On the face of it, this appears unlikely. And, upon further digging, we’ll see it’s simply incorrect across many dimensions.

Let’s think about economies of scale. Large power plants produce lower cost power than small regional plants. Very large retail store chains spend huge amounts on optimizing all aspects of their businesses from supply chain optimization through customer understanding and, as a consequence, can offer lower prices. There are exceptions to be sure but, generally, we see a pretty sharp trend towards economies of scale across a wide range of businesses. There will always be big, dumb, poorly run players and there will always be nimble but small innovators. The one constant is those that understand how to grow large and get the economies of scale and yet still stay nimble, often deliver very high quality products at much lower cost to the customer.

Perhaps the economies of scale don’t apply to the services world? Looking at services such as payroll and internal security, we see that almost no companies choose to do their own internally. These services clearly need to be done well, but they are not differentiated. It’s hard to be so good at payroll that it yields a competitive advantage, unless your company is actually specializing in payroll. Internal operations such as payroll and security are often sublet to very large services companies that focus on them. ADP, for example, has been successful at providing a very high scale service that makes sense for even the biggest companies. I actually think it’s a good thing that the companies I’ve worked for over the last twenty years didn’t do their own payroll and instead focused their investment capital on technology opportunities that grow the business and help customers. It’s the right answer.

We find another example in enterprise software. When I started my career, nearly all large companies developed their own internal IT applications. At the time, most industry experts speculated that none of the big companies would ever move to packaged ERP systems. But, the economies of scale of the large ERP development shops are substantial and, today, very few companies develop their own ERP or CRM systems. The big companies like SAP can afford to invest in the software base at rates even the largest enterprise couldn’t afford. Fifteen years ago SAP had 4,200 engineers working on their ERP system. Even the largest enterprise could never economically justify spending a fraction of that. Large central investments at scale typically make better economic sense unless the system in question is one of a company’s core strategic assets.

I’ve argued that smart, big players willing to invest deeply in innovating at scale can produce huge cost advantages and we’ve gone through examples from power generation, through retail sales, payroll, security, and even internal IT software. The authors of the McKinsey study are essentially arguing that, although all major companies have chosen to enjoy the large economies of scale offered by packaged software products over internal development, this same trend won’t extend to cloud hosted solutions. Let’s look closely at the economics to see if this conclusion is credible.

In the enterprise, most studies report that the cost of people dominates the cost of servers and data center infrastructure. In the cloud services world, we see a very different trend. Here we find that the costs of servers dominate, followed by mechanical systems, and then power distribution (see the Cost of Power in Large Data Centers). As an example, looking at all aspects of operational costs in a mid-sized service I led years ago, the human administrative costs were under 10% of the overall operational costs. I’ve seen very large, extremely well run services where the people costs have been driven below 4%. Given that people costs dominate many enterprise deployments, how do high-scale cloud services get these costs so low? There are many factors contributing, but the two most important are 1) cloud services run at very high scale and can afford to invest more in automation, amortizing that investment across a much larger server population, and 2) services teams can specialize, focusing on doing one thing and doing it very well. This kind of specialization yields efficiency gains, but it is only affordable at multi-tenant scale. The core argument here is that the number 1 cost in the enterprise is people whereas, in high scale services, these costs have been amortized down to sub-10%. Arguing there are no economies at cloud scale is the complete opposite of my experience and observations.

Page 25 of the study shows a “disguised client example” where the example company had 1,704 people working in IT before the move to cloud services and still required 1,448 after the move. I’m very skeptical that any company with 1,704 people working in IT, clearly a large company, would move to cloud computing in one, single discrete step. It’s close to impossible and would be foolhardy. Consequently, I suspect the data either represents a partial move to the cloud or is only a paper exercise. If the former, the data is incomplete and, if the latter, the data is speculative. The story is clouded further by including in the headcount inventory desktop support, real estate, telecommunications and many other responsibilities that wouldn’t be impacted by the move to cloud services. Adding extraneous costs in large numbers dilutes the savings realized by this disguised customer. Overall, this slide doesn’t appear informative.

We’ve shown that at very high scale the dominant costs are server hardware and data center infrastructure. Very high scale services hire server designers and have an entire team focused on the acquisition of some of the most efficient server designs in the world. Google goes so far as to design custom servers (see Jeff Dean on Google Infrastructure), something very hard to do economically at less than internet-scale. I’ve personally done joint design work with Rackable Systems in producing servers optimized for cloud services workloads (Microslice Servers). When servers are the dominant cost and you are running at a scale of 10^5 to 10^6 servers, considerable effort can and should be spent on obtaining the most cost effective servers possible for the workload. This is hard to do economically at lower scale.

We’ve shown that people costs are largely automated out of very high scale services and that the server hardware is either custom, jointly developed, or specifically targeted to the workload. What about data center infrastructure? The Uptime Institute reports that the average data center Power Usage Effectiveness is 2.0 (smaller is better). What this number means is that for every 1W of power that goes to a server in an enterprise data center, a matching watt is lost to power distribution and cooling overhead. Microsoft reports that its newer designs are achieving a PUE of 1.22 (Out of the box paradox…). All high scale services are well under 1.7 and most, including Amazon, are under 1.5. High scale services can invest much more in infrastructure innovations by spreading this large investment out over a large number of data centers. As a consequence, these internet-scale services are a factor of 2 more efficient than the average enterprise. This is good for the environment and, with power being such a substantial part of the cost of high-scale computing, it substantially reduces costs as well.
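The arithmetic behind these PUE figures can be sketched in a few lines. This is only an illustration using the numbers quoted above. The function names are my own, and the 1.7 and 1.5 entries are upper bounds ("under 1.7", "under 1.5") rather than measured values.

```python
def facility_watts(it_watts: float, pue: float) -> float:
    """Total facility power, from PUE = total facility power / IT power."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_watts * pue


def overhead_watts(it_watts: float, pue: float) -> float:
    """Power lost to distribution and cooling for a given IT load."""
    return facility_watts(it_watts, pue) - it_watts


# PUE values quoted in the text; labels are illustrative.
PUE_FIGURES = {
    "average enterprise (Uptime Institute)": 2.0,
    "high scale services (upper bound)": 1.7,
    "most high scale services (upper bound)": 1.5,
    "newer Microsoft designs": 1.22,
}

if __name__ == "__main__":
    for label, pue in PUE_FIGURES.items():
        total = facility_watts(1.0, pue)
        lost = overhead_watts(1.0, pue)
        print(f"{label}: {total:.2f} W drawn per 1 W of IT load, "
              f"{lost:.2f} W overhead")
```

At a PUE of 2.0 the overhead is a full watt per watt delivered to servers; at 1.22 it is about 0.22 W.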

Utilization is the factor that many in the industry hate talking about because the industry-wide story is so poor. The McKinsey report says that enterprise server utilization is actually down around 10%, which is approximately consistent with what I’ve seen working with enterprise customers over the years. The implication is the servers and the facilities that house them are only 10% used. This sounds like the beginning of an incredibly strong argument for cloud services but the authors take a different path and argue it would be easy to increase enterprise utilization far higher than 10%. With an aggressive application of virtualization and related technologies, they feel utilizations as high as 35% are possible. That conclusion is possibly correct, but it’s worth spending a minute on this point. At 35% utilization, roughly 2/3 is still wasted, which seems unfortunate, unnecessary, and hard on the environment. Improving from 10% to 35% will require time, new software, new training, etc. but it may be possible. What’s missing in this observation is that 1) cloud services can invest more in these efficiency innovations and they are already substantially down that path, 2) large user populations allow a greater investment in infrastructure efficiency at a higher rate, and 3) not all workloads have correlated peaks, so larger, heterogeneous populations offer substantially larger optimization possibilities than most enterprises can achieve alone (see: resource consumption shaping).
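Two of these points are easy to make concrete. The sketch below is a toy model, not data from the report: the 100 units of useful work and the two four-period demand curves are invented purely for illustration, and the function names are my own. It shows how many servers a fixed amount of work needs at 10% versus 35% utilization, and how pooling workloads whose peaks don't coincide lowers the capacity that must be provisioned.

```python
def servers_needed(useful_work: float, utilization: float) -> float:
    """Servers required to deliver a fixed amount of useful work."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return useful_work / utilization


def peak_capacity_separate(workloads: list[list[float]]) -> float:
    """Capacity needed when each workload is provisioned for its own peak."""
    return sum(max(w) for w in workloads)


def peak_capacity_pooled(workloads: list[list[float]]) -> float:
    """Capacity needed when workloads share one pool sized for the combined peak."""
    return max(sum(period) for period in zip(*workloads))


if __name__ == "__main__":
    # Hypothetical: 100 servers' worth of useful work.
    for u in (0.10, 0.35):
        print(f"utilization {u:.0%}: {servers_needed(100, u):.0f} servers")

    # Hypothetical demand curves with non-overlapping peaks.
    daytime = [1.0, 1.0, 4.0, 4.0]
    nighttime = [4.0, 4.0, 1.0, 1.0]
    loads = [daytime, nighttime]
    print("separate:", peak_capacity_separate(loads))
    print("pooled:  ", peak_capacity_pooled(loads))
```

In this toy case the two workloads need 8 units of capacity when provisioned separately but only 5 when pooled. Real workloads are never this perfectly complementary, but the direction of the effect is the same.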

In the discussion above, we focused on the costs “below” the software (data center infrastructure and servers) and found a substantial and sustainable competitive advantage in high scale deployments. Looking at people costs, we see the same advantage again. On the software side, the cost picture ranges from less in the cloud to equal, but it isn’t higher. There doesn’t seem to be a dimension that supports the claim of this report. I just can’t find the data to support the claim that enterprises shouldn’t consider cloud service deployments. Looking at the slides in the McKinsey presentation that make the cost argument in detail, the graphs on slides 22, 23, and 24 just don’t make sense to me. I’ve spent considerable time on the data but just can’t get it to line up with the AWS price sheet or any other measure of reality. The limitation might be mine but it seems others are having trouble matching this data to reality as well.

My conclusion: any company not fully understanding cloud computing economics and not having cloud computing as a tool to deploy where it makes sense is giving up a very valuable competitive edge. No matter how large the IT group, if I led the team, I would be experimenting with cloud computing and deploying where it makes sense. I would want my team to know it well and to be deploying to the cloud when the work done is not differentiated or when the capital is better leveraged elsewhere.

IT is complex and a single glib answer is almost always wrong. My recommendation is to start testing and learning about cloud services, to take a closer look at your current IT costs, and to compare the advantages of using a cloud service offering with both internal hosting and mixed hosting models.


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |

H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com