AWS Graviton2


In November of last year, AWS announced the first ARM-based AWS instance type (AWS Designed Processor: Graviton). For me this was a very big deal because I’ve been talking about ARM-based servers for more than a decade, believing that massive client volumes fund the R&D stream that feeds most server-side innovation. In our industry, the deep innovations are expensive, and “expensive” only makes sense when the volumes are truly massive.

For someone like myself who focuses on server-side computing, this is a sad fact. But the old days of server-only innovation started to die with the mainframe, and the process completed during the glory years of the Unix super-servers. Today, when I’m placing a bet on a server-side technology, the first thing I look for is which technology is fueled by the largest volumes and, most of the time, it’s the massive client and especially the consumer market that drives these volumes. For more than a decade, I’ve been watching client computing and especially the mobile device market for new technologies that can be effectively applied server-side. The most obvious example is the Intel x86 processor family, which started its life as a client processor but ended up taking over the server market. Other examples include most power management innovations and new technologies such as die stacking that showed up first in client devices.

Understanding this dynamic, my prediction back in 2008 that ARM processors would end up powering important parts of the server market was an obvious one. If you agree that volume drives innovation in our business, it’s hard to argue with far more than 90B ARM parts shipped.

But, server-side success for ARM processors has been far from instant. Some very well-funded startups like Calxeda ended up running out of money. Some very large, competent and well-known companies have looked hard at the market, made significant investments, but ended up backing away for a variety of reasons, often completely unrelated to technical problems with what they were doing. AMD and Qualcomm are amongst the companies that have invested and then backed away, but the list is far longer. I saw the details behind some of this work and much of it was excellent. But new technology is hard. All companies, even very successful ones, need to focus their resources where they see the most value and often where they see short-term value.

I understand this, but it’s been difficult to watch so many projects fail. Some of these projects were massive investments and some of the work was very good. Nonetheless, as fast as projects were shut down, the opportunity remained obvious and, as a consequence, new investments were always being started. After nearly a decade, that’s still true. Many projects have started, almost the same number have been shut down, but the common element is that there are always many ARM server investments underway.

In some ways it’s good that there continues to be deep investments in ARM server processors, but producing a winning part requires deep investment and patience. Much of the modern corporate world is only just “ok” at deep investments, and most are absolutely horrible at patience. Server processor development takes time, the ecosystem needs time to develop, and customers need time to adopt new technologies. Big changes never happen overnight and, without patience, they simply don’t happen at all.

Back in 2014 I was quoted as saying “the development of ARM-based chips for data center servers wasn’t progressing fast enough … to consider using them over Intel processors.” Like many quotes, it’s not exactly what I said but the gist was generally correct. In my opinion, at that time there were no ARM server parts under development that looked like they could win meaningful market segment share. All these investments were just slightly too incremental, and a part that was only “about as good as what was currently in market” isn’t going to attract much attention, isn’t going to cause the ecosystem to spring into action, and customers won’t go to the effort to port to it. Unless the new part is notable or remarkable in some dimension, it’s going to fail.

This was the backdrop to why I was almost giddy with excitement in the front row when Peter DeSantis announced the AWS Graviton processor during his keynote at the AWS re:Invent conference. Here’s what I posted at the time: AWS Designed Processor: Graviton. I was excited because what Peter announced was a good part with good specs that raised the price/performance bar for many workloads. But I was even more excited knowing that AWS has a roadmap for ARM processors, is patient, and specializes in moving quickly. The first Graviton part was good but, as I enjoyed the first Graviton announcement back in 2018, I knew what many speculated at that time: another part was underway.

The new part is Graviton2 and this is an exceptional server processor that will be a key part of the EC2 compute offering, powering the M6g (general purpose), M6gd (general purpose with SSD block storage), C6g (compute optimized), R6g (memory optimized), and R6gd (memory optimized with SSD block storage) instance families. This 7nm part is based upon customized 64-bit ARM Neoverse N1 cores and it is smoking fast. Rather than being offered as an alternative instance type that will run some workloads with better price/performance, it’s being offered as a better version of an existing, very heavily-used EC2 instance type, the M5.

Here’s comparative data between M6g and M5, the previous generation instance type, from an excellent Forbes article by Patrick Moorhead:

  • >40% better integer performance on SPECint2017 rate (estimate)
  • >20% better floating-point performance on SPECfp2017 rate (estimate)
  • >20% better web serving performance on NGINX
  • >40% better performance on Memcached with lower latency and higher throughput
  • >20% better media encoding performance for uncompressed 1080p to H.264 video
  • 25% better BERT ML inference
  • >50% better EDA performance on Cadence Xcelium EDA tool

This is a fast part and I believe there is a high probability we are now looking at what will become the first high volume ARM Server. More speeds and feeds:

  • >30B transistors in 7nm process
  • 64KB icache, 64KB dcache, and 1MB L2 cache
  • 2TB/s internal, full-mesh fabric
  • Each vCPU is a full non-shared core (not SMT)
  • Dual SIMD pipelines/core including ML optimized int8 and fp16
  • Fully cache coherent L1 cache
  • 100% encrypted DRAM
  • 8 DRAM channels at 3200 MHz
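Because each vCPU is a full physical core rather than an SMT thread, code tuned per vCPU behaves differently than on x86 instances, so it’s worth confirming at runtime which architecture you’re on when porting. A small sketch, using only the Python standard library:

```python
import platform


def is_arm64() -> bool:
    """Report whether this host is a 64-bit ARM machine.

    Linux on Graviton reports "aarch64"; some platforms report "arm64".
    """
    return platform.machine().lower() in ("aarch64", "arm64")


# Example: choose an architecture-specific artifact or build tag.
arch_tag = "arm64" if is_arm64() else "amd64"
```

The same check is handy in build scripts that need to pick the right binary wheel or container image for an M6g/C6g/R6g host.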

The Annapurna team at AWS is doing amazing work. I wish I could show you all the work they currently have underway but only some of it is public. Even with multiple difficult competing projects concurrently underway, they delivered Graviton2 on an unusually short schedule, one seldom seen in the semiconductor world. It’s a great team to work with and Graviton2 is impressive work.

ARM servers have been inevitable for a long time, but it’s great to finally see them here and in customers’ hands in large numbers.

25 comments on “AWS Graviton2”
  1. Colin Breck says:

    How much are the performance improvements related to avoiding the speculative execution patches for Spectre and Meltdown?

  2. Murtaza says:

    Hi James,

    You wrote:

    > Some very large … companies have looked hard at the market, made significant investments, but ended up backing away for a variety of reasons, often completely unrelated to technical problems with what they were doing. AMD and Qualcomm are amongst the companies that have invested and then backed away, but the list is far longer.

    Leaving aside the business aspects of it, what do you think were other key technical reasons what made Graviton a reality where other companies simply didn’t push through with their plans? I’m sure, if what AMD/Qualcomm saw what AWS saw in terms of performance improvements with ARM, they’d have been pressed to keep pursuing? I’m just curious, is all, especially, since you also pointed out that you had a chance to review some of their work in detail and you walked away impressed.

    Thanks.

    • There are a vast number of different reasons why there weren’t any big successes in the ARM Server market segment and no single thread that explains the unfavorable results thus far. As three examples of how different the causes have been: AMD decided to really focus and get much better operationally, so they refocused on their core business, x86 processors. Calxeda delivered a first “OK but not great” part to get the ecosystem rolling but ran out of money before they delivered a performance-leading server component. Qualcomm produced a very good processor, but Avago tried to buy the entire company (for other reasons) and, in order to fight off Avago, their board needed to sign up for higher short-term profitability, and some of the longer-term businesses had to go. The most natural outcome would have been to sell the server business, but they had earlier been forced to license their current and future server designs to China so, with that commitment in place, it was not possible to sell the server chip business and they discontinued the investment in the server market segment. As with all engineering projects, but especially here, there are a vast number of ways to fail and only a small number of ways to succeed, and it’s almost always the case that before a company finds great success in a new technology area several others have failed, and some of those failures were technically quite good.

      Of course, timing is important as well. 5 to 7 years ago, any ARM Server part entering the market was a generation and a half behind Intel from a process perspective. It takes a truly great part to do well while being delivered on a previous generation process node. Today the ARM servers being delivered to market are on TSMC 7nm and this has been in market for years whereas Intel is just ramping up on 10nm. It’s easier to do a competitive ARM server part today, the software ecosystem for ARM is better prepared today, and cloud computing allows a new entrant to see high volume more quickly. None of these factors assures volume nor any form of success at all but each makes a good result a bit more probable.

  3. Charbax says:

    I’ve been filming videos about Arm servers for over a decade. Looking forward to interview someone about Graviton2, and to see Arm servers finally take over the world.

    • I’m looking forward to ARM servers supporting a lot of really important workloads but I think we are getting to the point in our industry where there isn’t one single processor type, one single database, one single machine learning framework, one single networking ASIC, or one single anything. I think that’s a good thing. But, like you, I’m looking forward to seeing ARM servers heavily used in server-side computing.

  4. Colin Breck says:

    Are you able to mention what the overall impact on power consumption is?

    • Because there are so many power sensitive applications where ARMs are used, much has been invested in power minimization and management and they do very well. It’s easy to get remarkably better power consumption with an ARM part. But, in this particular case, our focus was more on server-side price/performance and, with that focus, our power consumption isn’t really materially better than the alternatives.

  5. Aray Alapu says:

    Congratulations. It is good that you are enabling new and better price/performance alternatives. But choices raise questions.
    How do you see Amazon picking between ARM, Intel, and AMD options?
    How should the customer view the same?

    • Amazon’s behavior is usually pretty easy to predict. They are always interested in new ideas, always willing to give them a try, and they will choose the best price/performing host for their workloads. They often choose different platforms for different workloads. I expect Graviton will be quite interesting for them. For customers, it’ll be much the same. The reason AWS hosts a far wider variety of instance types than competitors is that customers want instances crafted to their specific workloads. Nobody wants to pay for CPU that is not being used just to get a large memory system. Nobody wants to buy 2x the servers just to get the network bandwidth they need. We love giving customers exactly what they want, and we don’t think it’s our place to say you have to use our favorite. As examples: in the relational database world AWS offers PostgreSQL, MySQL, Oracle, SQL Server, MariaDB, and Aurora and, outside of relational, we offer key-value, document, graph, in-memory, and text search. In machine learning we offer TensorFlow, MXNet, and PyTorch. While many competitors have invested deeply in their favorite, we offer choice, and the same is true in the processor world. We have a lot of customers on each of AMD, Intel, and Graviton and that’s exactly the way we like it.

      We do focus on making it easy for customers to find the right solution for their workloads. But it’s high on our priority list that all customers know that, if they are running on AWS, they will have access to the best instance match for their workload, the processor that best supports their workload, the database offerings they favor, and the machine learning framework most appropriate for their specific workload.

      In the old world where you had to install and manage all the products you used, there was a tendency to choose a very small number of servers, one processor company probably got all your business, it was super common to host only one relational database and, for many customers, they didn’t even use other, more-specialized databases. The era of cloud computing takes away the hassle of having workload-optimized instances, processors, databases, and machine learning frameworks, and what’s happening is the “one ring to rule them all” approach is going away and developers are choosing the solution that most closely matches their workload. Early indications are Graviton will be very heavily used and we’re optimistic that’ll continue to be the case, but I don’t expect we’ll ever pressure customers to use Graviton unless it’s the best fit for the workload. Competition drives innovation.

      • Jesse Bearce says:

        Hyperscale cloud and hyperscale choice is the future of computing. For 50 years a common architecture was bad at everything, but Moore’s Law made it the best approach. Hyperscale economies-of-scale and heterogeneous architectures are critical to advancing the essence of Moore’s Law beyond the limits of Moore’s Law; more choice is complementary to overall computing demand and a positive for AWS and the ecosystem; thank you for driving us forward!

        James, the real problem is better efficiency leads to increasing demand for computing, which produces more CO2. The only solution is to eliminate the CO2 from the source of energy. Unfortunately, adding renewables or purchasing carbon credits does not reduce the rise in global temperature and the supply-demand matching problem limits renewable absorption. This problem is beyond the influence of our individual companies; how can we come together with the Hyperscale ecosystem to solve this problem for humanity and unleash the full potential of Hyperscale computing?

        • Good question Jesse. We actually do think that powering server-side computing with renewable energy is a key step forward and we’re committed to 100% renewable by 2030 and we’re over 50% right now. You argue that even 100% renewable doesn’t change the fact that servers emit heat. It’s true, very close to 100% of the energy fed into a server will end up converted to heat energy. The obvious answer there is to produce a more energy efficient system. Less power in is less heat out.

          Generally, since we pay for the power, all the power produces heat, and (ironically) we also have to pay to remove the heat, we’re pretty motivated to reduce power consumption. But I think the reason you raise the point is that you think that radiated energy from computing is likely a major contributor to global warming. It’s a factor to be sure, but not a major one. Our biggest problem is not the radiated energy but the fact that the radiated energy is not escaping as efficiently as it was: the greenhouse effect. The byproducts of combustion, agriculture, and many other activities are emitting greenhouse gases, and the planet is no longer able to radiate heat as efficiently as before.

          Clearly if you can’t get rid of heat as efficiently as before, there are two possible solutions: 1) correct the problem to allow better heat transfer, or 2) stop emitting heat. Most of the discussion around global warming focuses on #1, the buildup of greenhouse gases in our atmosphere. On that one, moving away from fossil fuels can really help. In the data center world, moving to renewable power makes a big difference. On the second point, I don’t know of any way to compute without power and, as a consequence, heat. And, more generally, looking past computing, 7.5B people will produce a lot of heat. Our 7.5B people are all looking for and, in many cases, deserve a better standard of living. A better standard of living will also produce far more heat (and, more importantly, more greenhouse gases).

          If you look at the key drivers of greenhouse gases, many if not most are tied to consumer applications, and consumer applications go up as the number of people increases and/or the general standard of living increases. The sheer number of people on earth is a key driver of consumer energy use and, in the limit, just about everything we do in all industries that produces greenhouse gases is done “for people,” so more people generally means more emissions. Higher standards of living generally mean more emissions.

          The good news is that direct consumer consumption can be made MUCH more efficient. Nothing but good news there, but the bad news is we need to make big changes and we don’t look anywhere close to an acceptable rate of improvement. My take is that we need to get earth’s population to a negative growth rate quickly. The good news is that countries with a high standard of living seem to get to a negative population growth rate, but earth’s population is still growing.

          Clearly we should stop activities that produce greenhouse gases, but the best effort at doing that, the Paris Climate Agreement, doesn’t step up to aggressive goals, doesn’t really have teeth to force action, and the US isn’t even a participant at this point. We’re not in a good place on population growth. We’re not yet in a good place on global agreements to reduce emissions. I think we need to work on both of those swiftly.

          On computing, I think we should use renewable energy and we should make the system as efficient as possible. Generally, I feel pretty good about what can be done to improve the impact of server-side computing on the environment. 100% renewable energy for these workloads can be done and AWS is already more than 50% of the way there. But I also know that this won’t solve the overall problem faced by the planet. Sure, every step helps but server power emissions aren’t a significant driver of the problem. This source of heat is tiny when compared to client side device heat emissions and even this isn’t the dominant problem faced by our planet. We’ll keep making progress with a 100% renewable near term goal but that’s only working on the server-side computing problem which isn’t big enough to move the needle on the overall problem.

          • Jesse Bearce says:

            Thank you for your response. I agree heat energy is not a macro concern, but I am in a unique position to see the overall growth of cloud computing and the electrical energy consumption to support it. The wonderful thing about AWS and Hyperscale computing is the improvement you’ve driven in PUE. Computing performance has more than doubled over the past five years while the total energy consumption has remained nearly flat, that is amazing! As Hyperscale continues to increase as a percentage of enterprise computing, we’ll see continued progress. My concern is the economic efficiency leads to higher demand on the electrical grid, which pulls energy from fossil fuel sources, a modern-day example of the Jevons Paradox.

            Thank you for your commitment to 100% renewables by 2030. I am encouraged by the Hyperscale commitment to renewables. You have demonstrated leadership and increased the capacity of renewables in the US and worldwide when other forms of leadership have failed to act or delayed the progress to zero carbon.

            Is AWS planning battery storage and have you considered partnering with the other Hyperscalers to solve the storage problem collectively? Is there an ecosystem consortium, like the Open Compute Project, that makes collective sense for the industry? Count me in if I can help. Thank you.

            • Your core premise is that cloud computing is more efficient and, because it’s more efficient, a form of the Jevons Paradox kicks in and consumption goes up. It’s certainly the case that lower cost does make more problems amenable to computing solutions. So, no debate there. But if the cloud operators closed their doors tomorrow, would the world’s compute loads reduce? My thinking is probably not, on the argument that IT expenses in most industries are a tiny part of their overall expense. Basically, I’m arguing the world is under-using IT-based solutions now and they are held back by the complexity of solving problems much more than by the cost. Skills rather than costs are the most common blockers of new IT workloads. Which is to say, if machine learning, as just one workload example, keeps getting easier (it will), then computing usage will continue to go up rapidly.

              Increased computing is, in many if not most cases, better for the environment. Crashing cars is worse for the environment than crash simulation. Aircraft engine efficiency is driven by CFD simulations and recent progress is very dependent upon computing (as well as material technology of course). It’s hard for me to say that more social media is better for the environment but I wouldn’t necessarily assume that more computing is bad for the environment. In many cases, it’s better than the alternative like less efficient aircraft engines or more crashed cars. I won’t claim to have a read on the percentage of workloads that improve the environment vs how many don’t but, it’s super clear that many workloads do help the environment.

              My overall take is that most server-side computing will move to large operators. And large operators will move to renewable energy in the near term, so most server-side computing will be done using renewable energy. But most computing power consumption is client-side, so most of the problem remains. And most of the world’s climate problem is not driven by computing power consumption, so it too will only be marginally positively impacted by the expected move of all big operators to renewable energy sources. Clearly the move to renewable energy is an important one but, to move the needle on climate change, we’re going to have to look far beyond server-side computing.

              Finally, you asked about consortia focused on storage where all hyperscalers chip in ideas? I don’t know of one, and I’m personally not super supportive of the industry consortium model to drive innovation. However, my biases aside, there are MASSIVE investments being made in energy storage tech, both outside IT and within. A deep solution there could change the world (and make the inventors and funders very rich), so it’s an area where there is no shortage of investment, smart people, and innovation. If I saw a way to advance that technology faster, I would drop everything to do it.

              • Jesse Bearce says:

                Thanks again for the exchange. I see the server-side energy consumption from a unique perspective and believe the footprint is bigger and growing faster than previous estimates. The IT industry alone cannot solve the problem, but we can play a significant part. I found Andrew Chien’s Zero-Carbon Cloud initiative interesting for its potential to lower TCO, improve renewable absorption, and help balance the power grid. Removing the carbon from the grid is a complex problem; please take a look; it could accelerate the timeline and adoption of renewables.
                http://people.cs.uchicago.edu/~aachien/lssg/research/zccloud/
                Andrew references your TCO models, so you’ve already contributed, but hopefully, you find this interesting and can shape and drive progress.

  6. Dileep Bhandarkar says:

    Congratulations. Looks very interesting but not quite a home run. The net will be better. As you say, patience is needed. At least this design has a captive user!

    • From my perspective, a home run is a good part with many happy customers and profitable volumes. It’s too early to tell but based upon customer feedback so far, I’m optimistic. Adoption has been excellent so far.

  7. Fazal Majid says:

    Congratulations on being the first to make server ARM mainstream!

    I find the announcement AWS will start migrating its own services like ELB over to ARM even more notable, and since you control the chip, we no longer have to worry about potential backdoors in security-by-obscurity server management misfeatures like AMT/IME/PSP leading to compromise of TLS keys or the like.

    Would Amazon consider making developer workstations using the Graviton2? There is a serious dearth of developer workstation class machines running arm64.
