In The Case for Low-Cost, Low-Power Servers, I made the argument that the right measures of server efficiency were work done per dollar and work done per joule. Purchasing servers on single-dimensional metrics like performance, power, or even cost alone makes no sense at all. Single-dimensional purchasing leads to micro-optimizations that push one dimension to the detriment of others. Blade servers have been one of my favorite examples of optimizing the wrong metric (Why Blade Servers aren’t the Answer to All Questions). Blades often trade increased cost to achieve server density. But density doesn’t improve work done per dollar, nor does it produce better work done per joule. In fact, density often takes work done per joule in the wrong direction by driving higher power consumption due to the challenge of cooling higher power densities.
There is no question that selling in high volume drives price reductions, so client and embedded parts have the potential to be the best price/performing components. And, as focused as the server industry has been on power of late, the best work is still in the embedded systems world, where a cell phone designer would sell their soul for a few more amp-hours if they could get them without extra size or weight. Nobody focuses on power as much as embedded systems designers, and many of the tricks arriving in the server world showed up years ago in embedded devices.
A very common processor used in cell phone applications is the ARM. The ARM business model is somewhat unusual in that ARM sells a processor design, and the design is then taken and customized by many teams including Texas Instruments, Samsung, and Marvell. These processors find their way into cell phones, printers, networking gear, low-end Storage Area Networks, Network Attached Storage devices, and other embedded applications. The processors deliver respectable performance, great price/performance, and absolutely amazing power/performance.
Could this processor architecture be used in server applications? The first and most obvious push back is that it’s a different instruction set architecture, but server software stacks really are not that complex. If you can run Linux and Apache, some web workloads can be hosted. There are many Linux ports to ARM; the software will run. The next challenge, and this one is the hard one, is whether the workload partitions into sufficiently fine slices to be hosted on servers built using low-end processors. Memory size limitations are particularly hard to work around in that ARM designs put the entire system on the chip, including the memory controller, and none I’ve seen addresses more than 2GB. But, for those workloads that do scale sufficiently finely, ARM can work.
I’ve been interested in seeing this done for a couple of years and have been watching ARM processors scale up for quite some time. Well, we now have an example. Check out http://www.linux-arm.org/Main/LinuxArmOrg. That web site is hosted on 7 servers, each running the following (a sketch of the Nginx front end follows the list):
· Single 1.2GHz ARM processor, Marvell MV78100
· 1 disk
· 1.5 GB DDR2 with ECC!
· Nginx web proxy/load balancer
· Apache web server
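To make the front end concrete, here is a rough sketch of how an Nginx reverse proxy balancing across Apache backends might be configured. This is purely illustrative; the addresses and ports are placeholders rather than the actual linux-arm.org configuration:

    # Hypothetical sketch only; all addresses and ports are placeholders.
    # Nginx accepts client connections on port 80 and load balances across
    # Apache instances listening on port 8080.

    events {}

    http {
        upstream apache_backends {
            server 127.0.0.1:8080;   # Apache on the same ARM node
            server 10.0.0.2:8080;    # peer nodes (placeholder addresses)
            server 10.0.0.3:8080;
        }

        server {
            listen 80;

            location / {
                proxy_pass http://apache_backends;
                proxy_set_header Host $host;
            }
        }
    }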
Note that, unlike Intel Atom-based servers, this ARM-based solution has the full ECC memory support we want in server applications (actually, you really want ECC in all applications, from embedded through client to server).
Clearly this solution won’t run many server workloads but it’s a step in the right direction. The problems I have had when scaling systems down to embedded processors have been dominated by two issues: 1) some workloads don’t scale down to sufficiently small slices (what I like to call bad software but, as someone who spent much of his career working on database engines, I probably should know better), and 2) surrounding component and packaging overhead. Basically, as you scale down the processor expense, other server costs begin to dominate. For example, if you halve the processor cost and also halve the throughput, it’s potentially a step backwards since all the other components in the server didn’t also halve in cost. So, in this example, you would get half the throughput at something more than half the cost. Generally not good. But what’s interesting are those cases where it’s non-linear in the other direction: cut the cost to N% with throughput at M%, where M is much more than N. As these system on a chip (SOC) server solutions improve, this is going to be more common.
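To make the arithmetic concrete, here is a minimal sketch of both directions. The numbers are purely illustrative and are not measurements of any particular server:

    # Illustrative numbers only, not measurements: work done per dollar when the
    # processor scales down but the surrounding components do not.

    def work_per_dollar(throughput, cpu_cost, other_cost):
        """Work done per dollar for a single server."""
        return throughput / (cpu_cost + other_cost)

    # Baseline server: $500 processor, $500 of memory, disk, power, and packaging.
    baseline = work_per_dollar(throughput=100, cpu_cost=500, other_cost=500)   # 0.100

    # Halve the processor cost and the throughput; everything else stays the same.
    half_cpu = work_per_dollar(throughput=50, cpu_cost=250, other_cost=500)    # ~0.067, a step backwards

    # The interesting case: throughput drops to 50% but total cost drops to 30%.
    cheap_soc = work_per_dollar(throughput=50, cpu_cost=100, other_cost=200)   # ~0.167, a clear win

    print(baseline, half_cpu, cheap_soc)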
It’s not always a win, based on the discussion above, but it is a win for some workloads today. And, if we can get multi-core versions of ARM, it’ll be a clear win for many more workloads. The Marvell MV78200 actually is a two-core SOC, but it’s not cache coherent, which isn’t a useful configuration in most server applications.
The ARM is a clear win on work done per dollar and work done per joule for some workloads. If a 4-core, cache-coherent version were available with a reasonable memory controller, we would have a very nice server processor with record-breaking power consumption numbers. Thanks for the great work, ARM and Marvell. I’m looking forward to tracking this work closely and I love the direction it’s taking. Keep pushing.
–jrh
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Please note that the above link to LinuxArmOrg is no longer viable.
Old articles have old links and, invariably, they don’t stand the test of time.
Hi James, I’m a time traveler arriving late to the game. I’m communicating with you on my Pi 4B 8GB with a 240GB SSD. Noticing it’s been a while since this discussion, I have to ask, have you continued your work with this? I loaded Raspberry Pi OS 64-bit the other night; when booting from the SSD, the performance is grand (well, for the basic stuff I’ve done so far). However, I’m wanting to maximize my performance and my adventure coding LAMP, and would like to run 64-bit apps, say Apache, MySQL, etc… My question in the end is, will this really make a diff, perf-wise, in the end? THANKS!
Yes, Arms are surprisingly capable devices and just amazing price/performers. It’s remarkable that you can use them as a full client system. I have 5 of them spread throughout the boat where I live controlling generator start/stop, power load shedding, supporting remote control of most devices, supporting remote monitoring and alerting, controlling power systems, sending email on anomalies, etc. More on this system: https://mvdirona.com/2018/04/control-systems-on-dirona/.
At work, we have released the Graviton2 processor and many of our customers are running their entire companies on the Arm based Graviton2: https://www.youtube.com/watch?v=LNqRvP6Xvrw&t=3s. We use Arm processors in the AWS Nitro fleet as well and there are well over a million of these deployed across the fleet: https://perspectives.mvdirona.com/2019/02/aws-nitro-system/.
At home, we’re all surrounded by Arms: my watch, my phone, several in the satellite system, the TV, many remote monitoring systems, remote controls, … They are literally everywhere.
You asked whether running a 64-bit operating system on your Pi will make a performance difference. Running a 64-bit O/S brings some increase in instruction space overhead and, in return, opens up the full 8GB of address space you have in your Pi 4. There are some apps that can exploit the larger address space and, in general, I run 64-bit everywhere it’s well supported. But my Pis are all 32-bit H/W systems with only 1GB of memory, so I still run the 32-bit version of Raspbian. Only the very latest Pi hardware supports 64-bit and it’s still pretty early days for 64-bit Raspbian. But, beyond the Raspberry Pi community, nearly all server Arms out there are running 64-bit O/Ss.
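As a quick way to check what a Pi is actually running, a minimal Python sketch like this one reports the kernel architecture and whether the Python build itself is 64-bit:

    # Minimal check of kernel and userland word size on a Pi.
    import platform
    import sys

    # Kernel architecture as reported by uname: 'aarch64' for a 64-bit kernel,
    # 'armv7l' or 'armv6l' for a 32-bit kernel.
    print("kernel machine:", platform.machine())

    # Whether this Python interpreter (and the userland it was built for) is 64-bit.
    print("64-bit userland:", sys.maxsize > 2**32)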
Had an example… “He’s dead Jim”
You’re right though James, as SOCs get cheaper we can bring microservers into the home. I’m already doing that now (and failing to a large degree), but my limited successes with pre-configured solutions like FreedomBox and YuNoHost have spurred me on.
Some of these little beasties really burn rubber.
I agree ARM server processors are respectable today and better is near. Qualcomm, for example, has a very nice part in the market with an exciting roadmap. As the TSMC 7 nanometer technology node heads into widespread use, the TSMC process used by most ARM competitors will roughly parallel what Intel will be using at the same time. This will be the first time the ARM competitors have had roughly the same semiconductor technology as Intel, which, in the past, has been a full generation and sometimes a bit more ahead.