This originally came up in an earlier blog comment but it’s an interesting question and one not necessarily one restricted to the changes driven by deep learning training and other often GPU-hosted workloads. This trend has been underway for a long time and is more obvious when looking at networking which was your example as well. When configuring systems, it’s very important that the most expensive components are the limiting resource. Servers are, by far, the largest cost in a data center. Networking costs tend to run down around 15%. It would be nuts to allow an expensive server to be underutilized because the network was the bottleneck. You can’t allow a ~15% cost to block utilization on a ~60% cost.
This make perfect sense and actually the basic rule goes back to manufacturing assembly line design dating way before data centers. If your manufacturing process uses one particularly expensive machine during the process, then you want that machine to be fully utilized. If you can’t fully utilize that machine due to some less expensive resource being the bottleneck, it’s a bad design.
The same is true in data centers design. You want the most expensive resource to be fully utilized. Servers cost far more than networking so any design that allows the workloads to bottleneck on networking is a poor design. As easy as this is to understand, for the decades prior to cloud computing, networking almost always was the bottleneck. This is partly because Cisco, and to a lesser extent Juniper, were very expensive, vertically integrated suppliers. Consequently, networking margins were far higher than servers and so it was closer to reasonable to bottleneck on networking resources and just about every data center during the pre-cloud era did exactly that.
However, if you look closely at that historical example, even with crazy expensive networking equipment, bottlenecking on it was still didn’t make economic sense. It actually wasn’t the right decision at the time and three big changes have happened since:
1) Cloud providers operate at scale and understand the economics so quickly add networking resources to avoid bottlenecking on networking and not fully utilizing the most expensive component (servers),
2) Cloud providers have the scale and ability to do custom networking hardware designs that drop networking costs dramatically. As networking costs drop, the argument against being bottlenecked on these resources continues to be more obvious. The need to ensure that the network is not the bottleneck becomes increasingly clear as the relative cost of networking is reduce through internal development,
3) Modern workloads are, in many cases, more networking intensive. For the most part, this is just an fact unrelated to the economic argument above. It just means that the ratio of networking resource to servers need to increase to meet the needs of these workloads without bottlenecking on networking resources which we argued above isn’t a good economic decision.
Machine learning is the poster child workload for being network intensive and, since all the rules above continue to apply, the networking resource to server ratio will continue to escalate. This won’t change but I should make a quick note on ML training. The reason it is so network intensive is a single server is way too slow for many training jobs. If a single server can’t train fast enough, then multiple servers have to be used. Because training is a tightly connected workload, there is a lot of networking traffic in this model.
In perhaps an obvious prediction since the process is already well underway: custom hardware or GPUs will be added to servers in large numbers with specialized, inside-the-server interconnects. For these workloads, the general purpose CPUs will become just schedulers and coordinators for large numbers of specialized ASICS and the overall training workload that a single server can handle will go up. Clearly these training workloads are growing very fast but I suspect that specialized hardware with libraries and frameworks that exploit them will allow a greater percentage of these workloads to run single server.
This will slightly relieve the pressure to accelerate the growth of networking resources so I mention it here. However, the rule above that networking shouldn’t be the bottleneck, stays true and the network resources will continue to grow fast.
In summary, the need to recover from old designs from the Cisco-era means that networking resource need to “catch up” means more networking growth. The massive growth of machine learning workloads will partly be satisfied by custom ASICs investments but not fully and, consequently, networking requirements continue to grow fast. Networking resources cost less than server resources so networking should never be the bottleneck resource.
When all these factors are considered together, the need to increase the ratio of networking resources to server resources will continue for years to come. The move from 10G to 25G/40G happened in a fraction of the time needed for the industry to move from 1G to 10G. Network resources need to continue to accelerate partly because of new workloads like machine learning training but more conventional workloads are drivers of this change as well. The net? It’s a good time to be designing, building or selling machine learning ASICs or networking components.