Perspectives focuses on high-scale services, data center design and operations, server hardware design and optimization, high-scale storage software and hardware systems, flash memory, service design principles, power efficiency and power management.
Most recent posting: http://perspectives.mvdirona.com.
Talks and presentations: http://mvdirona.com/jrh/Work.
About the Author: James is an SVP and Distinguished Engineer working on technology across Amazon. He joined the company in January 2009 on the Amazon Web Services team, where he focused on infrastructure efficiency, reliability, and service scaling. Prior to AWS, James held leadership roles on several high-scale products and services including Windows Live, Exchange Hosted Services, Microsoft SQL Server, and IBM DB2. He loves all aspects of high-scale services and is interested in optimizing all components from data center power and cooling, through server design, networking systems, and the distributed software systems they host.
James, I thought you once wrote about the declining ROI of increased redundancy, specifically N+2 vs. N+1. As human failure is a leading cause of outages, there has to be a formula that shows that the complexity introduced by N+X eventually becomes the cause of, rather than the solution to, outages. Thoughts?
Yeah, I have written about it and it is unquestionably true. I’ve seen a few facilities without sufficient redundancy but I’ve also seen a lot of failures caused by the creeping complexity of “just a bit more redundancy.” There is no clean mathematical formulation, and two good engineers could easily end up in different places.
My favorite example is multiple redundant UPSs installed N+2 (redundancy with concurrent maintainability), which requires that the UPSs all be phase synced. On the surface this is a wonderful design and it’s a hard debate to go with a simpler model. It really does look more redundant from a design perspective. But it introduces a bunch of new failure modes. For example, if there is any loss of communication between the UPSs for any reason, they go into bypass mode and, if there is a power fault while in bypass, they drop the load. In one role I’ve had, we had both designs deployed at good scale and, over time, we got to watch the designs go through different fault modes. The simpler designs that give redundancy with concurrent maintainability but don’t require synced UPSs ended up looking better from my perspective.
I wish there were a simple equation showing what is right but it’s a judgement call. It’s easy to see either extreme but it’s super hard to find the “right” place and actually get agreement from all involved. These are the hard decisions where data helps but, for the most part, even less well thought out designs only fail rarely. It takes a long time (or a lot of scale) to see the difference between OK and excellent designs.
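To make the trade-off concrete, here is a rough back-of-envelope sketch in Python, using entirely assumed failure probabilities, of how a common-mode failure introduced by extra complexity (such as the synced-UPS bypass behavior described above) can more than offset the theoretical gain of N+2 over N+1:

```python
# Illustrative only: hypothetical failure probabilities, not measured data.
# Shows how a common-mode failure introduced by added redundancy complexity
# (e.g., UPS phase-sync loss forcing bypass) can erase the theoretical gain
# of N+2 over N+1.

p_unit = 0.01          # assumed annual probability a single UPS fails (hypothetical)
p_common_mode = 0.002  # assumed probability the shared sync/control path drops the load

# N+1: load is lost only if two independent units fail together (ignoring other modes).
p_fail_n1 = p_unit ** 2

# N+2 with required phase sync: three units must fail, OR the shared sync system fails.
p_fail_n2_synced = p_unit ** 3 + p_common_mode

print(f"N+1 simple design:        {p_fail_n1:.6f}")
print(f"N+2 with sync dependency: {p_fail_n2_synced:.6f}")
# With these assumptions the "more redundant" design drops the load roughly 20x more
# often, because the common-mode term dominates the vanishing independent-failure term.
```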
Hi James – We haven’t met yet, but I have come across your perspectives several times in the past when researching data center economics. With the current NAND shortage partially from the 3D transition, NAND utilization within a given system (raw vs overprovision+FTL) is becoming material to the deployment economics. Even though this is a short-term problem, do you see a trend in hyper-scale data centers wanting to have more control of the NAND (like Open-Channel SSD or other technologies) as well as managing the supply chain at the NAND level versus the SSD level? More specifically, with advancements in FPGAs, do you see a convergence of the value of a programmable/tunable flash controller with that of next-gen NAND? Would like to get your thoughts on this.
The biggest driver of wanting more control over the controller is that the makers of integrated finished goods are taking excess profit. When margins are kept under control by market dynamics, as they still (just barely) are in the disk drive business, there is no real pressure to take over the controller. In the case of flash systems, margins have been excessive and there is considerable pressure to de-verticalize. This may reduce and stabilize with time as the market matures, but I suspect there will be continued interest by the big players in sourcing directly.
You asked if the current cost of NAND driven by the shortage will drive more interest in reducing costs by reducing over-provisioning. The current raw NAND pricing is, as you said, a point-in-time issue rather than a systemic problem. I don’t see this factor staying important, and reducing over-provisioning doesn’t look like a great cost lever: very limited total improvement and hard engineering to get it down without impacting performance stability (jitter). Overall this approach is expensive, risky, and of bounded value.
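As a rough illustration of why the improvement is bounded (using assumed over-provisioning ratios, not any particular product’s specs):

```python
# Illustrative arithmetic only; over-provisioning levels are assumed, not vendor specs.
# Shows why cutting SSD over-provisioning is a bounded, one-time cost lever.

raw_capacity = 1.0     # raw NAND in the device (normalized)
op_high = 0.28         # assumed generous (enterprise-style) over-provisioning ratio
op_low = 0.07          # assumed aggressive (consumer-style) over-provisioning ratio

usable_high_op = raw_capacity / (1 + op_high)   # usable capacity with generous OP
usable_low_op = raw_capacity / (1 + op_low)     # usable capacity with minimal OP

gain = usable_low_op / usable_high_op - 1
print(f"Usable capacity gain from cutting OP: {gain:.1%}")   # ~19.6%
# Even in this extreme case the gain is capped at roughly 20%, and pushing OP lower
# increases write amplification and jitter, so the lever is bounded and risky.
```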
You also asked about FPGAs leading to more programmable/tunable flash. There is value in tuning I/O scheduling better in general and, for mega-workloads, tuning to the workload. I don’t really see the advancement of FPGAs as the driver behind this. The work can be done with care in software on general purpose processors and, in volume, can be done in hardware with ASICs. FPGAs just make it easier to get to hardware, and lower volumes will support it. Much of what is being done can profit from hardware implementation but, for most of it, there is no hard limit preventing a software implementation on a good general purpose processor included as part of the controller. With the big players, I do see value in increased control over controller scheduling and features. But, beyond the big players, most customers will not be interested in engineering at this level.
Summary: there are two big forces behind gaining control of flash controllers: 1) cost, and 2) increased control of controller features and scheduling. Reducing over-provisioning doesn’t look like much of a factor to me. Cost will be the dominant factor in my opinion.
James-This is Dana Foruria reaching out. We met when I was with Western Digital some time back. I am a Tech Account Manager with Intel in their Programmable Solutions Group. I wanted to see if I could arrange for a casual tech conversation between yourself and a forward thinking person in our group by the name of Johan Van De Groenendaal…I think the two of you would have a spirited conversation…possibly meet at Paddy Coyne’s in Seattle for a pint and a good discussion?
That sounds like fun, Dana, but it’s more complex to arrange than usual. I just got back from a bit more than two weeks in Seattle. I’m in Amsterdam November through February, will be at re:Invent in Las Vegas, and will likely be back in Seattle for a week in January. And, of course, I’m always on email.
Hi James,
Firstly, a big THANK YOU for this blog. I’ve been following your blog for the past 3 years and it is refreshing for someone to come out into the open. Question for you: we’ve been talking a lot about datacenter economics and I can’t help but wonder when optics will *finally* become the de facto high speed interface technology to the processor. Optics have pretty much taken over the front panel and we are now constrained by front port module density. There have been many attempts at bringing optics closer to the processor, but none have gone mainstream due to economics.
However, there are some trends that seem to require new technologies or new ways of thinking:
1) Copper interfaces are requiring more signal conditioning (i.e. power)
2) Analog elements are hard to make in deep-sub micron technologies (see #1)
3) Voltage levels have pretty much plateaued (i.e. no more V^2 power savings).
4) Pin counts on BGA packages have exploded and it is increasingly more difficult to route copper traces
If you look at the above, the parallel to copper-based communications in the 1940’s and the transition to optics in the 60’s is eerily similar.
Thoughts?
– Stuck with figuring how to get stuff working
Albert asks when we will use optics for off-die communications rather than electrical signals. Optics absolutely is the right answer and it absolutely will happen. I jokingly say it’s always “10 years out” from production applications. But it will happen and, when it does, it will allow many advancements, including full memory virtualization where a rack of servers can be allocated and then re-allocated different amounts of memory as customer needs change. With memory being the largest component expense in a server, better utilizing memory without introducing latency is a very big deal.
Early technology steps towards this end game are underway. In the communications world, there are at least 4 companies at various levels of advancement working on integrated silicon that combines electrical signaling with optical lasers and sensors on the same die. The world of silicon photonics is moving fast right now. It’s still hard and yields aren’t yet great, but the technologies to build these parts, get the temperature stability they need, and do wafer-level optical and electrical testing are all advancing quickly. These parts are getting very close to reality.
Today there are also companies in production building multi-chip modules where the digital signal processor and the lasers are on separate dies but they are mounted in the same package. This is in volume production today and working well. It’s not quite as good as single die silicon photonics but it is the most likely technology to bring optics directly to a processor. Further out into the future, I suspect we’ll see it done on CPUs with single die solutions but we’ll get great results way before then with multi-die solutions.
For the first time, I think this is much closer than the usual “10 years out”.
Richard, you folks take an unusually detailed and precise approach to the question of when to replace a server. If the data is complete, and it sounds like it is, you will be making some of the best decisions in the industry. Although I’m not in love with the frequently used idea of replacing servers early to delay data center builds, I would never debate this decision if the detailed cost models show it to be a win.
Nice work on a problem that is usually solved with opinion and belief rather than a detailed cost model.
James – we’ve crossed paths a few times in the industry and you came to mind – have been working on a definitive tech refresh model based on Moore’s Law as applied to Intel’s Tick-Tock, and then applied to web scale infrastructure refresh rates. For some reason, I am unable to find anything in the industry that states, what we all take for granted, that xx months is the optimal tech refresh rate. We all know it is between 36 and 48 (or at least we have all come to “believe” that), but when you begin to factor in building datacenter space and time value of money, then that 12mos delta actually is WAY too broad – should it be 36 or 38 or 42 (btw, answer to the ultimate question!)? So, have you ever worked out exactly what you think is the optimal month for refresh? And then, follow on, have you taken a look at the next 2 decades and ironed out when that time might shift based on Moore’s Law’s inevitable end?
Richard, you just hit one of my favorite questions. How frequently should servers be upgraded?
When running efficiently at scale and really managing all costs carefully, it’s servers that dominate. So there are few questions more important to absolutely nail than the one you ask above: when should servers REALLY be replaced if you just look at getting the most done for the least cost and avoid emotion, server amortization rates, and vendor marketing claims? I’ve seen operators that run for 5+ years and insist that it’s a win. I raised an eyebrow at that one but, when we dug deeper, there were rare circumstances where it might have been correct. I’ve also seen operators that replace every 1 to 2 years believing this is optimum for their cost models. I suspect they need to overhaul their cost models and, personally, I’ve never modeled a workload where 1 to 2 year replacement cycles make sense. But, having been around this important question for a number of years, there are some constants. The first is that most operators do not have a single uniform way of looking at costs. Some prefer operational costs and will happily spend more on power to avoid more capital cost. Some prefer capital costs to operational and will actually spend asymmetrically more on capital to reduce operational costs, even though the equation doesn’t pencil out overall as positive.
Financial motivations can be complex. I prefer to model all costs as costs per month (or per day or per hour). This is a great way to look at operational costs. For capital costs, just compute the lifetime of the component and the cost of capital for that particular business, and look at these expenses the same way. It’s super important to convert all costs to costs per unit time to ensure that capital costs are neither favored over operational costs nor avoided. All costs should be looked at with respect to their impact on the business, so you can optimize for the cost of capital for that particular business.
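A minimal sketch of that conversion, with hypothetical prices, lifetime, cost of capital, power draw, and PUE, might look like this:

```python
# A minimal sketch of converting server capital cost into a monthly cost so it can be
# compared directly with operational costs. All numbers here are assumptions for
# illustration, not actual pricing.

def monthly_capital_cost(price, months, annual_cost_of_capital):
    """Amortize a capital purchase into a constant monthly payment (annuity formula)."""
    r = annual_cost_of_capital / 12.0
    if r == 0:
        return price / months
    return price * r / (1 - (1 + r) ** (-months))

def monthly_power_cost(avg_watts, usd_per_kwh, pue):
    """Monthly electricity cost including facility overhead (PUE)."""
    kwh_per_month = avg_watts * pue * 24 * 30 / 1000.0
    return kwh_per_month * usd_per_kwh

server_capex = monthly_capital_cost(price=8000, months=48, annual_cost_of_capital=0.08)
server_power = monthly_power_cost(avg_watts=350, usd_per_kwh=0.07, pue=1.2)
print(f"Capital: ${server_capex:.2f}/month, Power: ${server_power:.2f}/month")
```

Once everything is expressed per unit time, capital and operational spending can be added and compared without one being structurally favored over the other.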
Making it somewhat more complex, you actually want to look at work done per unit time rather than servers per unit time. What this means is that different workloads have different optimal replacement rates. It really depends upon which resources the workload is bound on, and the ratio of those resources to the rest in both the old and the new server design, when making an upgrade decision. The decision is, unfortunately, workload specific and, to a lesser extent, influenced by the technology rate of change.
As the best example of technology rate of change having a big impact, there have been times when the industry has gone through a large step function. For database workloads that were often bound on IOPS in the days of HDD-based transaction servers, the availability of flash-based storage devices completely changed the economics. For many server and workload configurations, immediate update was the right answer. The same thing happened just before Intel retired the NetBurst architecture: the AMD generation current at that time, and the subsequent Intel generation, were so much more efficient that they drove a faster-than-normal upgrade cycle for companies that actually replace servers strictly according to cost model. There have also been times when the rate of change was on the slow side which, strictly speaking, should have led to slower upgrade cycles.
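A quick back-of-envelope with assumed (not actual) device prices and IOPS numbers shows why that flash step function justified immediate replacement for IOPS-bound workloads:

```python
# Rough, assumed numbers only (not vendor pricing or any specific product): illustrates
# why an IOPS-bound database tier could justify an immediate refresh when flash arrived,
# even though the per-device price went up.

hdd = {"price_usd": 300, "random_iops": 180}      # assumed 15K-RPM enterprise drive
ssd = {"price_usd": 900, "random_iops": 50_000}   # assumed early enterprise SSD

hdd_cost_per_iops = hdd["price_usd"] / hdd["random_iops"]
ssd_cost_per_iops = ssd["price_usd"] / ssd["random_iops"]

print(f"HDD: ${hdd_cost_per_iops:.3f} per IOPS")
print(f"SSD: ${ssd_cost_per_iops:.3f} per IOPS")
print(f"Improvement for an IOPS-bound workload: ~{hdd_cost_per_iops / ssd_cost_per_iops:.0f}x")
```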
Sadly, almost nobody replaces “strictly according to cost model.” Most decisions are driven by financial amortization schedules, a decision whether to reinvest in the workload and update it, emotional factors, or vendor marketing. Few operators really know their costs per unit in sufficient detail. Many try to make the server upgrade decision on that basis but it’s very hard to get really complete models, and these decisions often don’t look fully informed.
The answer to your question, unfortunately, is “it depends.” It depends upon the workload, the cost of capital at the specific company, the current velocity of technology change, the resources that limit the workload on the current servers, and the resources that would limit the workload on the future servers. Because it’s so complex, I recommend selecting the best current server design once every year, or perhaps as frequently as every quarter. Then model the cost using the old servers and the new, and replace when the new is less expensive. I don’t know of a way to come up with a generic “replace in N months” formula that ignores workload and the other factors. Perhaps there is something in your 42-month number :-).
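A minimal sketch of that comparison, treating the old servers’ capital as sunk and using assumed throughput, power, and amortization numbers:

```python
# A minimal sketch of the "replace when new is cheaper per unit of work" comparison
# described above. All numbers (throughput ratio, opex, amortized capital) are
# assumptions; a real model would also include networking, space, migration cost, etc.

def cost_per_unit_work(monthly_capital, monthly_opex, relative_throughput):
    """Total monthly cost divided by the work the server gets done per month."""
    return (monthly_capital + monthly_opex) / relative_throughput

# Old server: capital is already sunk, so only ongoing opex counts against keeping it.
old = cost_per_unit_work(monthly_capital=0.0, monthly_opex=60.0, relative_throughput=1.0)

# Candidate replacement: carries amortized capital, but does more work per dollar of opex.
new = cost_per_unit_work(monthly_capital=195.0, monthly_opex=45.0, relative_throughput=2.5)

print(f"Old: {old:.1f} $/unit-work, New: {new:.1f} $/unit-work")
print("Replace now" if new < old else "Keep the old servers for another cycle")
# With these made-up numbers keeping the old servers wins; a bigger throughput jump or
# more expensive power flips the decision, which is why the answer is workload specific.
```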
James – as always, you have done some “deep thought” (same reference) on this topic…would not have suspected less.
Yes – totally get what you are saying – we have transitioned to two principal SKUs in our inventory – Performance & Big Data (this excludes specialized HW we still use for EDW & db today). We are entering our 4th generation and have pretty robust models around costing – we, as I imagine you all do, monitor down to the electron (well maybe not quite that granular) in our infrastructure – and we have also isolated 2-3 principal workloads with little variation in our search, small variation in our data, and lots of variation in our cloud/virtual layer (which is just HW abstracted anyway). All that said, we have been looking at all components of our HW stack: datacenters, servers, network, storage, cooling, power, etc., and trying to get down to a mathematical model to inform us of when to refresh vs leaving the door open for visceral responses.
At the moment, we have determined that, based on Moore’s Law and using 36 months as the ideal refresh time, we must build 25-30% more datacenter space if we employed a 2-year refresh and 15-20% more space if we did 4 years. We used a very simplistic model to derive these numbers and did not factor in the actual extra space needed to handle the phased refresh requirement (land gear first, take old gear out last), so those are very rough percentages.
That said, when we looked at that and asked what the least number of nodes is that we would need during a 10-year lifespan (our containers) if we used 2, 3, 4, or 5 year refresh rates, we again came up with 36 months. Very rough, only considering Moore’s Law again (actually, the spec’d gain we see YoY on Intel’s tick-tock), but we seemed to be getting closer to some sort of “truth.”
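For illustration only, a toy model in this spirit could be set up as below. The demand growth, per-node performance gain, and sizing rule are all made-up assumptions, so it will not reproduce Richard’s percentages; it just shows how the refresh interval pulls the space footprint and the number of nodes purchased in opposite directions.

```python
# A rough, illustrative model of the space-vs-refresh tension: total nodes bought over a
# 10-year container life and the peak deployed footprint, for different refresh intervals.
# Assumptions (invented, not Richard's numbers): demand grows 15%/yr, per-node performance
# improves 20%/yr, node price is roughly flat across generations, and the whole fleet is
# replaced at each refresh with enough capacity to last until the next refresh.

def refresh_metrics(interval_years, horizon_years=10,
                    perf_gain=0.20, demand_growth=0.15, base_demand=1000.0):
    total_purchased = 0.0   # capex proxy: nodes bought over the horizon
    peak_deployed = 0.0     # space proxy: largest fleet ever racked at once
    for year in range(0, horizon_years, interval_years):
        demand_at_end = base_demand * (1 + demand_growth) ** min(year + interval_years,
                                                                 horizon_years)
        per_node_perf = (1 + perf_gain) ** year   # newer generations do more work per node
        fleet = demand_at_end / per_node_perf
        total_purchased += fleet
        peak_deployed = max(peak_deployed, fleet)
    return total_purchased, peak_deployed

for interval in (2, 3, 4, 5):
    bought, peak = refresh_metrics(interval)
    print(f"{interval}-yr refresh: ~{bought:.0f} nodes bought, peak footprint ~{peak:.0f} nodes")
```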
Next part, as you lay out, is to then look at specific workloads as we can do that given how we have isolated our loads fairly well.
Lastly, game changers (SSD, etc.) are something that alters everything. One we are starting to think about is the end of Moore’s Law – we see that in (my guess) the next 10-20 yrs. We are not exactly certain how that will manifest, but have started to think about it as it applies to our 20-year datacenter forecasts.
Great stuff!!
That makes sense. The one quibble I might have is your thinking around avoiding data center build-out by doing an early server replacement round. If you are going to need the capacity anyway, you might as well just let the economics of the combined opex and capex equation drive the decision. If you are just going to need some extra capacity for a short term and then shrink down again, leaving the new data center build vacant, then you need to think through sub-optimized early server replacements. But, when capacity is monotonically growing, I wouldn’t do an early server replacement just to avoid a data center build-out that you are going to do anyway.
Have you considered cloud hosting some of your capacity as a way of avoiding some of these optimizations that have a long-term commitment? That would avoid some of these more difficult optimizations and, with some of your capacity in the cloud, it would give you an opportunity to objectively analyze which you prefer, which gives you the ability to adapt more quickly, and which offers the better cost equation.
Absolutely, we have thought about it – we are wedded to our TCO models and, in the end, whatever gives us 1) Availability, 2) Security, 3) Performance, and 4) the best TCO, we will choose. So far, that has been building our own (not quite your scale, but not small either) – we are more than open to shifting workloads to wherever there are better economies of scale. However, we have found in our SKUs that we get the best performance/cost ratios at higher and higher density of server/rack builds. Today we build a 48U rack with 96 nodes that runs up to 40kW – this gives us the best unit economics on a per-compute basis – the trick is to maximize utilization as well as smooth out for peak (we have some tricks to address this with our over-clocking capabilities – very fun stuff). All that said, we are wide open to the idea of shifting workloads to where we get the best bang for the buck – my earlier comments on pushing out datacenter builds come from our “just in time” mentality – when we look at the time value of money, future dollars are cheaper and thus, if we can prolong our need to build out datacenter capacity over a full 20-year model, we can eke out a few points of savings based on NPV calcs. Yes, penny pinching perhaps, but at scale, that can be millions!! :-)
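A minimal sketch of that NPV effect, with a hypothetical build cost and discount rate (neither is an actual figure from this discussion):

```python
# A minimal sketch of the NPV argument above: deferring a datacenter build makes the same
# nominal spend cheaper in present-value terms. Build cost, deferral period, and discount
# rate are assumed values for illustration only.

def present_value(amount, years_out, annual_discount_rate):
    """Discount a future cash outlay back to today's dollars."""
    return amount / (1 + annual_discount_rate) ** years_out

build_cost = 100_000_000     # hypothetical datacenter build cost
discount_rate = 0.08         # hypothetical cost of capital

pv_now = present_value(build_cost, 0, discount_rate)
pv_deferred = present_value(build_cost, 2, discount_rate)   # build pushed out two years

print(f"Build now:             ${pv_now:,.0f} (present value)")
print(f"Build two years later: ${pv_deferred:,.0f} (present value)")
print(f"NPV savings from deferral: ${pv_now - pv_deferred:,.0f}")
```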
Great conversation and thread – thanks so much for entertaining my ramblings!!!
Hello, James:
Can you email me, please? I have a new article in which I quote your blog post I Love Solar Power But… I would like you to make sure I did not make any mistakes. The article is for DataCenter Dynamics and is due Monday the 4th of May.
Michael
Sure Michael. I’m at jrh@mvdirona.com.
Hey Scott. We have spent time in Tofino. It’s beautiful on a bright sunny day and it’s exciting in a storm with the Pacific pounding in. Nice place for some time off.
You asked about holographic memory. It’s an exciting area that has been around for way more than a decade. As I’m sure you know, the challenge with radical technology changes is timing. It’s easy to be early, jump in when a research prototype shows promise, and then watch 2 or 3 generations of companies die trying to productize it.
Even some not particularly radical storage technologies like Heat-Assisted Magnetic Recording (HAMR, also called EAMR by some) have looked super promising for way more than a decade but it’s still “just around the bend.” The HDD industry keeps finding ways to get more out of the current technology without having to make the expensive leap to HAMR. It’ll happen, but I bet it’ll be long after those working on it 10 years ago would have guessed.
I’m absolutely interested in learning more on the storage startup. Feel free to contact me at james@amazon.com.
James,
We share a heritage and love of the PNW. I was born in Seattle and still have family there. My wife is from British Columbia and we vacation most summers at Tofino where we rent a house and invite her cousins and childhood friends to visit.
I am a long-time technology exec, having worked for IBM, then AT&T, and served in executive, board, and advisory assignments with numerous smaller SW and services firms. I am now working as an advisor to a private equity firm looking at technology and broadband opportunities. Among other projects, I am looking at a company trying to complete the perfection of holographic storage solutions.
I would be interested in your viewpoints on this opportunity from the perspective of a large data center operator like AWS.
Thanks for any viewpoints that you may have.
Cheers,
Scott