Zynga is often in the news because gaming is hot and Zynga has been, and continues to be, a successful gaming company. What’s different here is that the story isn’t about gaming, nor is it really about Zynga itself. The San Francisco gaming house with a public valuation of $2.5B was an early adopter of cloud computing, and they built a massive business that ended up going public on AWS. That’s hardly noteworthy these days. Some absolutely massive companies like Netflix have made that same decision.
What’s different here is that back in 2012 press reports covered the Zynga decision to move from an all-cloud deployment model to using their own data centers. The move was broadly covered in the industry press with headlines like “Zynga Gives Amazon Cloud the Slip” and “Zynga Cloud Case Study: the Journey to a Real Private Cloud.” This decision has often been referenced as evidence that returning to on premise deployments when a company achieves very high IT investment scale is the right decision. This never seemed very credible to me, but I understand the reporters’ perspective. A slightly more realistic but still incorrect interpretation is to reference this move as evidence that stable workloads can be more economically hosted on premise. These days, there is more data available and it’s fairly easy to show that “being at scale” requires on the order of 10^6 servers, deep investments in server and networking equipment and the software stacks above, and massive investments in the software stack that manages these resources. I’ve taken a run at most of those questions and offered my perspective in the last two talks I have done at re:Invent: AWS Innovation at Scale and Why Scale Matters and How the Cloud is Different.
The news of the Zynga move has largely faded as more and more companies commit to the cloud. Yet I kept watching this case closely because it is an unusual one: a nimble, start-up-culture company with deep cloud experience decided to build its own private deployment with as many of the characteristics of the cloud as possible, with the goal of being more economical.
This week more news became available when Robert McMillan of the Wall Street Journal wrote For Zynga, A Journey from Cloud to Home — and Back Again. In this article, Zynga is reported to have spent over $100M on private cloud infrastructure and yet decided to return to the cloud all in. This decision is partly interesting because the previous move was so well publicized, but mostly because Zynga has incredible visibility into both the on premise and public cloud worlds, having run both at very high scale for several years.
The article quotes Zynga CEO Mark Pincus from the last investor call: “There’s a lot of places that are not strategic for us to have scale and we think not appropriate, like running our own data centers. We’re going to let Amazon do that.” The article continues: “The company Wednesday said it would shut its data centers and shift its computing workload back to Amazon, as part of $100 million in spending reductions.”
This announcement is interesting because Zynga has a very large infrastructure investment, just as most large enterprises have. And yet, even with that large infrastructure investment, they still elected to move fully to the cloud. What was an example that challenged cloud economics at the very high end of the scale has now become an example of a company with a deep understanding of both cloud and on premise deployments at scale deciding to fully commit to the cloud.
Of course, there is room to argue that the Zynga business has been through change and, of course, it has. What business hasn’t? In fact, that’s part of the point. Cloud deployments make rapid change, scale up, scale down, and redeployments easy. In the Zynga case, they are now more deeply invested in mobile gaming, with different compute requirements. A related potential pushback might be to point out that Zynga is under financial pressure. But, again, what business isn’t under financial pressure? In fact, I view the decisions made under business and financial pressure as perhaps the most informative. When a business with massive on premise and cloud deployments and deep skills in both needs to be even more nimble and frugal, the direction it takes is particularly interesting.
I admit to a cloud bias but it’s really good to see another substantial company fully commit to cloud computing.
While we are on the subject of on-premise vs cloud, how does the cost of people compare? Looking at the specific case of Zynga spending $100M on private cloud infrastructure (how much of that was for building/renting data centers?), any guess on how much they spent on people building and operating all that stuff? I would imagine it adds up to a pretty substantial figure.
The question on people costs is a good one. I’ve seen enterprise estimates that upwards of 50% of the overall costs are people. Perhaps some of those costs are exaggerated and, for sure, some of the costs are application specialists whose roles are largely unchanged by moving to the cloud. The same work needs to be done either way.
There are some substantial gains to be had on the infrastructure resources side. What happens there is each enterprise ends up having a few Oracle servers and a few SQL Server systems, more web servers but still not that many, etc. Because there are very few of any server type, most admin operations are manual. In the cloud there are massive numbers of any given server type so everything is automated. We couldn’t hire enough administrators to do it by hand, doing it manually is more error prone, and it’s much more expensive. So, with scale, comes more automation and, with automation, comes lower cost and higher reliability.
Administrative people costs at scale are so low I seldom include them when looking at the cost of operating cloud services. These costs are wildly dominated by hardware costs and related non-people operational costs like power. In the enterprise, the equation is turned over and people costs are reported to be far higher.
David White asked above, won’t this yield a ton of lost jobs? The naive answer is that automation can reduce jobs, and we have seen that in manufacturing sectors. But what I notice is that the IT industry is “hard” and requires lots of judgment and, as a consequence, smart, flexible people stay super valuable to successful, growing companies. Consequently, I’ve never seen the move to the cloud yield lower people costs, and I don’t recommend that companies look to people cost reductions as the motivation to move to a cloud deployment model.
It seems to me that there are some important insights to be gained here. I have a strong hunch that one thing Zynga discovered when they built their data centers was that they needed a lot more people than they thought to run them.
Some quick back-of-the-envelope numbers: if you’re going to staff your data center 24×7, you need 5x the people you need for 8×5. So for a small data center staffed by 5 people at all times, you need about 25 people. And that’s just for the bare minimum staffing: racking and cabling gear and doing basic monitoring, oncall, and troubleshooting. With 4 data centers, that’s 100 people and an annual cost in the $10M range.
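For concreteness, here is that back-of-the-envelope math as a small script; the fully loaded cost per person is an assumption added for illustration, the rest are the numbers above:

```python
# Back-of-the-envelope DCO staffing cost. The cost-per-person figure is an
# illustrative assumption, not a known Zynga or industry number.
SHIFT_MULTIPLIER = 5          # people to cover one 24x7 seat vs. a single 8x5 shift
SEATS_PER_DC = 5              # minimum concurrent on-site staff per data center
DATA_CENTERS = 4
COST_PER_PERSON = 100_000     # assumed fully loaded annual cost per person (USD)

people = SHIFT_MULTIPLIER * SEATS_PER_DC * DATA_CENTERS   # 100 people
annual_cost = people * COST_PER_PERSON                    # $10,000,000/year

print(f"{people} people, ${annual_cost:,}/year")
```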
I posit that the number of people you need for DCO (data center operations) doesn’t depend much on how many servers you have in each data center; that it will be roughly the same whether it’s 100, 1000 or 10,000 servers. What DOES change with the number of servers is the workflow inside the data center; how much you automate and streamline. With 100 servers, you’ll fuss over and replace individual servers; with 50,000 servers, you don’t; instead you add/delete whole racks at a time.
Let’s suppose Zynga (or most any comparable-sized enterprise) has 1,000 servers and 25 people to a data center, what are the corresponding numbers for Amazon/Google/Microsoft? For Amazon, the number of “50,000 to 80,000 servers” (50x-80x) has been disclosed. According to this article, 50 people is typical for a Microsoft data center, which seems plausible — I would imagine they have about 25 people working 24×7 doing core DCO, and another 25 people working 8×5. And so we end up with the situation you describe: at Amazon/Microsoft scale, the human cost of DCO is a rounding error, but at Zynga scale, it’s a significant number — plausibly by two orders of magnitude?
Does this sound about right?
Those numbers might be slightly higher than absolutely required, but your general point is spot on. There is a core facility staffing requirement for security, and it’s hard to scale that back when the facility gets smaller. And there is a core staff of operations folks that need to be available and, depending upon the SLA for break/fix, it’s hard to scale that number down even as the facility shrinks.
The actual people costs are pretty small when held up against equipment costs, networking, and power but your math looks pretty close to me.
My understanding of the original Zynga move out of AWS was that it was to do with quite specific circumstances beyond just scale: they had stable and predictable demand, and the compute load generated by their software was not a good fit for AWS pricing, giving quite a large business benefit to capitalising some of the cost of operations.
Very interesting that even such an extreme example has found it a better business model to reduce its capital investment.
Tim raises the common theme that the cloud is a big win for fast-changing workloads but that stable workloads are better hosted on premise. Clearly, working exclusively on cloud computing solutions for years makes me somewhat biased but, since this comes up frequently in many different contexts, let’s dig deeper. This belief, through frequent repetition, has become a widely held “truth.”
The high-scale clouds use custom-designed servers, custom-designed storage, and custom-designed networking equipment sourced directly from ODMs. These ODMs don’t have a sales channel and generally are not well set up to sell to the tens of thousands of customers that OEMs selling to enterprise customers must serve. On premise equipment is normally off-the-shelf equipment attempting to cater to a large market, purchased from OEMs, some of which have incredibly high margins and all of which have expensive distribution channels. The margins are public record and the cost of the distribution channels is often far in excess of 30%. The cloud sources components like memory, CPUs, routing ASICs, disks, and SSDs directly from manufacturers in very large volumes. The cloud does custom data center designs and builds many facilities every year, constantly improving new build designs and modernizing existing builds with the latest learning. All the facilities are built at scale with 50,000 to 80,000 servers in each. Because the server counts are on the order of 10^6, there are incredible economies of scale.

The on premise argument is that you can do all this cheaper by buying from the existing high-margin storage, networking, and server suppliers. That seems kind of unlikely whether the workload is steady state or not. The most important advantage of the cloud is that it allows companies to be more agile, to try new ideas faster, to be more productive, and to move faster. These are the big gains. But it’s also considerably less expensive, and the argument that steady-state workloads belong on premise is usually based upon an incomplete understanding of the costs of on premise workloads.
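To make the margin arithmetic concrete, here is a minimal sketch of how OEM margin and channel cost compound on per-server price. Every number in it is an illustrative assumption, not an actual vendor, ODM, or AWS figure:

```python
# Illustrative only: none of these margins or costs are real vendor figures.
odm_build_cost = 5_000    # assumed manufacturing cost of one server (USD)
oem_margin = 0.25         # assumed OEM gross margin
channel_cost = 0.30       # distribution channel cost "often far in excess of 30%"
direct_margin = 0.05      # assumed small ODM margin on a direct, high-volume deal

enterprise_price = odm_build_cost * (1 + oem_margin) * (1 + channel_cost)
direct_price = odm_build_cost * (1 + direct_margin)

print(f"OEM + channel path:  ${enterprise_price:,.0f} per server")   # ~$8,125
print(f"direct ODM sourcing: ${direct_price:,.0f} per server")       # ~$5,250
```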
I cover the topic in more detail in the last couple of talks I have done at re:Invent:
http://www.youtube.com/watch?v=JIQETrFC_SQ
http://m.youtube.com/watch?feature=plpp&v=WBrNrI2ZsCo&p=PLhr1KZpdzukelu-AQD8r8pYotiraThcLV
Slides to both are posted at: http://mvdirona.com/jrh/work
Sounds like the prodigal son has returned. This seems like an even more telling moment than when they left so I hope this announcement gets the same amount of press as when they left the cloud.
One minor correction to your article:
This decision has often been referenced as evidence that returning to on premise when a company achieves very high IT investment scale in the right decision.
should be
This decision has often been referenced as evidence that returning to on premise when a company achieves very high IT investment scale is the right decision
I agree with your assessment, Dave, and I updated the article with the correction you pointed out. Thanks for the comment.
I’m a big cloud advocate too, but I think you’re overlooking the biggest obstacle of all, Tony. There are many good arguments in favor of the cloud – economics, agility, etc. But the reality is, very few people like change, and the cloud is change for older, established companies. That change typically comes in the form of layoffs, unless the CIO is capable of re-deploying IT infrastructure staff to higher-value-add work. At smaller companies, operations are sometimes run from spreadsheets. And, as you can imagine, the guy who built and maintains those spreadsheets is very happy with the status quo….
The cloud is, I believe, inevitable. But it’s going to take a generation or two before we see the majority of our data in the cloud, simply because of human nature.
David’s point is a good one: inertia is currently the biggest blocker of companies moving to the cloud. The economic and agility arguments have become fairly well accepted. The only point I don’t agree with is that moving to the cloud, or corporate change in general, will cost jobs. My read of the industry is that companies failing to change is the biggest consumer of jobs (and of failed companies). If companies stay nimble, stay successful, and continue to grow, they continue to need their existing staff while bringing on more skills.
I argue it’s seldom that a good company that’s growing stops needing good employees, even when going through big technology changes.
Today it seems there really are only three reasons I can see for any given system not to be in the cloud:
* Regulatory compliance, such as policy or law that mandates data be stored on premises, or be kept within a specific jurisdiction where cloud services are unavailable
* Software dependencies on specialist, proprietary hardware – anything from audio/video capture over SDI/AES to PLC controller management, etc.
* A significant dependency on COTS software that simply cannot function in the cloud due to software design – for example, I can think of one or two database products that are overwhelmingly expensive to run in the cloud simply due to their design.
Realistically, if none of these problems exist, the only problem left is system architects who continue to design systems as if they were running in colocation or on premises. These are the ones that give the cloud a bad image in companies, naively believing systems can just lift-and-shift, and assuming everything will get cheaper and/or easier without putting enough thought into dealing with the different environment.
Tony, your assessment is pretty close to where I would come out as well. Each year, these niche areas get smaller.
Looking at your database example, while I don’t know the specifics of why the examples you are thinking of are expensive to host in the cloud, there are solutions to that one as well. Perhaps the simplest solution is to ignore the problem and proceed anyway, on the argument that the database, although expensive, is a tiny portion of the overall application stack and, since the stack has to all run in the same location, the advantage of running it in the cloud is sufficient to make the database licensing problems worth living with. Another solution I’ve seen used when unusual equipment or non-cloud-friendly licenses are required is to host the offending product in a colo that maintains a direct connection to AWS and is physically nearby. This gets you nearly in-the-same-datacenter latency, allows the rest of the stack to be hosted in the cloud, and still provides good access to the non-cloud-hosted component.
I discovered your article a bit late but I think my comment is still relevant.
The reason we chose to go with traditional servers in a rack versus AWS is the cost of AWS usage for development. A VM with 256GB RAM costs no less than $3.83/hour = $2,757/month. In comparison, hosting such a server in a traditional data centre costs less than $200/month (with the hardware included). These are development servers so they don’t need to be that reliable (and even then we rarely have hardware problems).
This is one niche which AWS doesn’t seem to cover…
If you could actually get a 1/4 TB memory server for $200/month all in, that would be truly exceptional. That’s 27 cents per hour!
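Just to show the conversion behind that figure (assuming roughly 730 hours in an average month):

```python
# Monthly-to-hourly conversion behind the "27 cents per hour" remark above.
monthly_cost = 200.0      # USD/month for the colo-hosted server
hours_per_month = 730     # ~24 * 365 / 12
print(f"${monthly_cost / hours_per_month:.2f}/hour")   # ~$0.27/hour
```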
I believe AWS has the niche you describe very well covered, both in on-demand pricing and also in the much lower, but not guaranteed, Spot pricing. However, feel free to post here or send me personally examples that you feel indicate otherwise.
Here it is:
https://www.hetzner.de/us/hosting/produkte_rootserver/px121ssd
116 euro/month, but we have some extras (disks + private net linking a few of these together).
AWS Spot instance prices come closer to this but we do need our servers to be up most of the time.
Btw, I was expecting I would get an email if someone posted an answer to my post.
Thanks for the update to your question with more information and sorry that my wordpress site didn’t email you when your question was answered. At under US$130/month, I agree it seems like respectable value. Digging deeper, it’s a 2012 desktop processor that Intel discontinued a year and a half back. It’s pretty old at this point but I have almost exactly this processor on one of my boat systems so I know it reasonably well. Generally no big surprises there. What catches my attention is the claim of 256GB of memory. Could you do me a favor and cat /proc/meminfo and send me the result (jrh@mvdirona.com) or post it here? Thanks,
Nicolae, I was mistaken in my previous post — it’s a single socket system but it is a current generation processor.
It has a very low processor-to-memory ratio, so the EC2 comparisons tend to have more processor or less memory or both. EC2 pricing is available on-demand (hourly), as Spot pricing (as available), or using Reserved Instances. The EC2 Spot pricing options compare very favorably, but you said that wasn’t ideal for your workload. The closest EC2 pricing model is Reserved Instances. Using upfront RIs on 3-year terms, here are some comparables using monthly effective pricing:
* M4.2xl: 8 vCPU / 32 GB: $130
* M3.2xl: 8 vCPU / 30 GB: $152
* C4.2xl: 8 vCPU / 15 GB: $119
* C4.4xl: 16 vCPU / 30 GB: $238
* C3.2xl: 8 vCPU / 15 GB: $114
* C3.4xl: 16 vCPU / 30 GB: $229
Some of these options have much less memory and some have much more processor, but these are the options that line up best at comparable pricing.
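Roughly speaking, a “monthly effective” figure for an upfront RI is just the purchase price amortized over the term. Here is a minimal sketch of that arithmetic, with a made-up upfront price chosen only to illustrate the calculation (check current EC2 RI pricing for real numbers):

```python
# Amortize an upfront 3-year Reserved Instance purchase into an effective
# monthly cost. The $4,680 upfront figure is illustrative (36 * $130),
# not an actual EC2 quote.
def effective_monthly(upfront_price: float, term_months: int = 36) -> float:
    """Spread an upfront RI price evenly across the term."""
    return upfront_price / term_months

print(f"${effective_monthly(4_680.0):,.0f}/month")   # -> $130/month
```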
I’m still interested in seeing the output from cat /proc/meminfo if you could send it my way. Thanks!