This originally came up in an earlier blog comment but it's an interesting question and not necessarily one restricted to the changes driven by deep learning training and other often GPU-hosted workloads. This trend has been underway for a long time and is more obvious when looking at networking, which was your example as well. When configuring systems, it's very important that the most expensive components are the limiting resource. Servers are, by far, the largest cost in a data center. Networking costs tend to run at around 15%. It would be nuts to allow an expensive server to be underutilized because the network was the bottleneck. You can't allow a ~15% cost to block utilization of a ~60% cost.
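To make the arithmetic concrete, here's a back-of-envelope sketch using the rough ~60%/~15% cost split above. The total spend and utilization figures are purely illustrative assumptions.

```python
# Back-of-envelope: what a network bottleneck costs in wasted server spend.
# The ~60%/~15% split comes from the discussion above; the total spend and
# utilization figures are illustrative assumptions only.
facility_spend = 100_000_000           # total data center spend ($), assumed
server_spend = 0.60 * facility_spend   # servers ~60% of total cost
network_spend = 0.15 * facility_spend  # networking ~15% of total cost

server_utilization = 0.60  # assumed server utilization when the network is the bottleneck

wasted_server_spend = server_spend * (1 - server_utilization)
print(f"Server spend idled by the network bottleneck: ${wasted_server_spend:,.0f}")
print(f"Entire networking budget:                     ${network_spend:,.0f}")
# With these assumptions, the idle server capacity (~$24M) costs more than the
# whole network (~$15M), so spending more on networking to keep servers busy
# pays for itself.
```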
The rule makes perfect sense and actually goes back to manufacturing assembly line design, dating from well before data centers. If your manufacturing process uses one particularly expensive machine, then you want that machine to be fully utilized. If you can't fully utilize that machine because some less expensive resource is the bottleneck, it's a bad design.
The same is true in data center design. You want the most expensive resource to be fully utilized. Servers cost far more than networking, so any design that allows workloads to bottleneck on networking is a poor design. As easy as this is to understand, for the decades prior to cloud computing, networking almost always was the bottleneck. This is partly because Cisco, and to a lesser extent Juniper, were very expensive, vertically integrated suppliers. Consequently, networking margins were far higher than server margins, so it was closer to reasonable to bottleneck on networking resources, and just about every data center during the pre-cloud era did exactly that.
However, if you look closely at that historical example, even with crazy expensive networking equipment, bottlenecking on it still didn't make economic sense. It actually wasn't the right decision at the time, and three big changes have happened since:
1) Cloud providers operate at scale and understand the economics, so they quickly add networking resources to avoid bottlenecking on networking and underutilizing the most expensive component (servers),
2) Cloud providers have the scale and ability to do custom networking hardware designs that drop networking costs dramatically. As the relative cost of networking is reduced through internal development, the argument against being bottlenecked on it only becomes more obvious,
3) Modern workloads are, in many cases, more networking intensive. For the most part, this is just a fact unrelated to the economic argument above. It just means that the ratio of networking resources to servers needs to increase to meet the needs of these workloads without bottlenecking on networking, which we argued above isn't a good economic decision.
Machine learning is the poster child workload for being network intensive and, since all the rules above continue to apply, the networking resource to server ratio will continue to escalate. This won't change but I should make a quick note on ML training. The reason it is so network intensive is that a single server is far too slow for many training jobs. If a single server can't train fast enough, then multiple servers have to be used. Because training is a tightly coupled workload, there is a lot of networking traffic in this model.
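To give a feel for the scale, here's a rough sketch of the gradient traffic for data-parallel training with a ring all-reduce. The model size, worker count, and step time are illustrative assumptions rather than measurements of any real job.

```python
# Rough estimate of gradient traffic for data-parallel training using a ring
# all-reduce. Model size, worker count, and step time are assumptions chosen
# only to illustrate the magnitude involved.
params = 1_000_000_000        # 1B-parameter model (assumed)
bytes_per_param = 4           # fp32 gradients
workers = 16                  # servers participating in the job (assumed)
step_time_s = 0.5             # time per training step (assumed)

grad_bytes = params * bytes_per_param
# A ring all-reduce moves roughly 2 * (N-1)/N of the gradient volume
# per worker on every step.
per_worker_bytes = 2 * (workers - 1) / workers * grad_bytes
per_worker_gbps = per_worker_bytes * 8 / step_time_s / 1e9

print(f"Gradient exchange per worker per step: {per_worker_bytes / 1e9:.1f} GB")
print(f"Sustained per-server network demand:   {per_worker_gbps:.0f} Gb/s")
# Roughly 7.5 GB per step, or about 120 Gb/s per server at two steps per
# second -- and every step stalls until the exchange completes.
```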
In what is perhaps an obvious prediction since the process is already well underway: custom hardware or GPUs will be added to servers in large numbers with specialized, inside-the-server interconnects. For these workloads, the general-purpose CPUs will become just schedulers and coordinators for large numbers of specialized ASICs, and the overall training workload that a single server can handle will go up. Clearly these training workloads are growing very fast, but I suspect that specialized hardware, with libraries and frameworks that exploit it, will allow a greater percentage of these workloads to run single server.
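A quick illustration of why inside-the-server interconnects relieve network pressure: the same gradient exchange from the sketch above, moved over an in-box accelerator fabric instead of the data center network. The bandwidth figures are illustrative assumptions, not any vendor's specifications.

```python
# Exchanging the same ~4 GB of gradients over an in-server accelerator fabric
# vs the data center network. Bandwidth figures are illustrative assumptions.
grad_bytes = 4e9                  # gradient volume from the sketch above
in_server_fabric_Bps = 300e9      # ~300 GB/s accelerator-to-accelerator fabric (assumed)
network_Bps = 25e9 / 8            # 25 Gb/s NIC

print(f"Exchange over the in-server fabric: {grad_bytes / in_server_fabric_Bps * 1e3:7.1f} ms")
print(f"Exchange over the network:          {grad_bytes / network_Bps * 1e3:7.1f} ms")
# When the whole job fits in one box, the exchange runs roughly 100x faster
# and never touches the data center network at all.
```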
This will slightly relieve the pressure to accelerate the growth of networking resources, so I mention it here. However, the rule above, that networking shouldn't be the bottleneck, stays true and network resources will continue to grow fast.
In summary, the need to recover from old Cisco-era designs means networking resources need to "catch up", which means more networking growth. The massive growth of machine learning workloads will be partly satisfied by custom ASIC investments but not fully and, consequently, networking requirements will continue to grow fast. Networking resources cost less than server resources, so networking should never be the bottleneck resource.
When all these factors are considered together, the need to increase the ratio of networking resources to server resources will continue for years to come. The move from 10G to 25G/40G happened in a fraction of the time the industry needed to move from 1G to 10G. Networking growth needs to continue to accelerate partly because of new workloads like machine learning training, but more conventional workloads are drivers of this change as well. The net? It's a good time to be designing, building or selling machine learning ASICs or networking components.
Hi James,
As always I enjoyed your post!
On a slightly different but related topic, I once tried to devise a model for comparing CPU to networking in terms of price. Would be very happy for some feedback. http://itsonlyme.name/blog/cpu-network-tradeoff
Nice write-up. Jim Gray used a nice bit of terminology for thinking through the options you are discussing in your note: function shipping vs data shipping. He argued that you just about always want to function ship when processing data at scale but there are some exceptions. Some decisions can only be made by looking at all the data in aggregate and some data is very hard to summarize — there are times when it's hard to avoid shipping a large amount of data to the function.
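A toy sketch of the distinction, with the "remote" storage represented by an in-process list purely for illustration:

```python
# Toy illustration of data shipping vs function shipping. The "storage node"
# is just an in-process list and all names are made up for illustration.
rows = [{"user": i, "bytes": i * 1000} for i in range(1_000_000)]  # data living "remotely"

def data_shipping():
    # Ship the data: pull every row across the network, then aggregate locally.
    shipped = list(rows)                    # simulate moving all the rows
    return sum(r["bytes"] for r in shipped), len(shipped)

def function_shipping():
    # Ship the function: aggregate where the data lives and move only the
    # small result back across the network.
    result = sum(r["bytes"] for r in rows)  # computed "at the storage node"
    return result, 1                        # a single value crosses the network

for name, fn in [("data shipping", data_shipping), ("function shipping", function_shipping)]:
    total, items_moved = fn()
    print(f"{name:17s} -> result {total}, items crossing the network: {items_moved:,}")
```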
One factor you didn't fully cover in your note is latency. Networks can always add bandwidth by running more channels in parallel but you can't reduce latency. David Patterson does an excellent job of arguing that latency is THE problem in this keynote: https://www.ll.mit.edu/HPEC/agendas/proc04/invited/patterson_keynote.pdf (covered in more detail in various Patterson papers).
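A quick worked number on why latency is so stubborn: propagation delay is bounded by the speed of light in the medium, so only the software overhead sitting on top of it is engineerable. The distances and fiber refractive index below are illustrative assumptions.

```python
# Why bandwidth scales and latency doesn't: propagation delay is bounded by
# the speed of light in the medium. Distances below are illustrative.
C_VACUUM = 299_792_458        # m/s
C_FIBER = C_VACUUM / 1.5      # light in glass travels at roughly 2/3 c

for label, meters in [("across a rack", 3),
                      ("across a large data center", 500),
                      ("cross-country (~4,000 km)", 4_000_000)]:
    one_way_us = meters / C_FIBER * 1e6
    print(f"{label:28s} minimum one-way delay: {one_way_us:10.3f} us")
# Adding parallel lanes multiplies bandwidth but leaves these floors untouched;
# software overhead sits on top of them and is the only part engineering can shrink.
```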
Another important factor is the workload itself. Some workloads are composed of many computation steps, each of which has few state dependencies and can be computed in parallel. These workloads run great across a cluster. Other workloads are tightly coupled, where the independent computations are small and there is a relatively large amount of state sharing between these computation stages. These workloads are challenging to run over a cluster due to massive network bandwidth requirements and the latency of transferring the needed state prior to starting the next computation phase. You point out that you can just add more network bandwidth and I agree. What you can't do is remove the network latency. These tightly coupled jobs run poorly over a cluster because each stage is small and has to wait for state from the previous stages to be first computed and then delivered.
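A small model of that effect, with all parameters chosen purely for illustration: compute parallelizes across workers, but the per-stage state hand-off (latency plus transfer time) does not.

```python
# Sketch of why tightly coupled jobs scale out poorly: compute parallelizes
# across workers but the per-stage state hand-off does not. All parameters
# are illustrative assumptions.
def stage_time(compute_s, workers, rtt_s, state_bytes, link_bps):
    """Wall-clock time for one stage: parallel compute plus the serial
    hand-off of shared state to the next stage."""
    parallel_compute = compute_s / workers
    handoff = (rtt_s + state_bytes * 8 / link_bps) if workers > 1 else 0.0
    return parallel_compute + handoff

compute_s = 0.020          # 20 ms of compute per stage on one machine (assumed)
rtt_s = 20e-6              # 20 us network round trip (assumed)
state_bytes = 10_000_000   # 10 MB of shared state per stage (assumed)
link_bps = 25e9            # 25 Gb/s link (assumed)

for workers in (1, 4, 16, 64):
    t = stage_time(compute_s, workers, rtt_s, state_bytes, link_bps)
    print(f"{workers:3d} workers: {t * 1e3:6.2f} ms per stage")
# Speedup stalls quickly: the fixed hand-off cost (latency plus state
# transfer) dominates the shrinking compute slice, so adding servers stops
# helping long before the cluster is large.
```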
Workload characteristics influence the choice of whether to scale up or out. Network bandwidth is a linear cost — just add a parallel network lane, but network latency is the sum of propagation delay (distance divided by the speed of light in whatever medium you are using) and software overhead. The latter is solvable with thoughtful engineering whereas physical distance puts a lower bound on latency improvements. You can't just add NICs or hire better engineers :-).
Clearly, function shipping will not fit every workload, and at the end of the day many workloads require both, e.g. ship the feature extraction code for unstructured data or the pre-filtering code for semi-structured data, and ship back the results for further processing.
Regarding latency, I believe that in the cases where function shipping is the right thing to do, the major latency consideration would be whether the shipped code is reused a lot or not (as with Lambda, where subsequent invocations are much faster).
On another note, SSDs seem to somewhat change the rules described by Patterson w.r.t. latency vs. bandwidth. That is for storage. I am not sure what RDMA did in this respect for the local network.
Also, regarding the original post, when considering both the custom hardware or GPUs being added to servers and the storage density increase coming with 60TB-100TB SSDs (not yet at the right price point), it seems that the bottleneck moves into the server itself (RAM? memory bus?).
On your last question, what will the bottleneck be for ML workloads running on specialized hardware: Memory bandwidth.
Patterson's work on latency vs bandwidth is based upon fundamental physics, so it is unchanged by SSDs. When moving data, you can make the wire "fatter" by adding more lanes but you can't make it faster than the speed of light in the media being used. That's true of memory, networks, and all forms of storage.
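To put a rough roofline-style number behind the memory-bandwidth answer above. The chip figures are illustrative assumptions, not the specs of any particular accelerator.

```python
# Roofline-style sketch of why memory bandwidth becomes the limit for ML
# accelerators. The chip numbers are illustrative assumptions, not the specs
# of any particular part.
peak_flops = 100e12        # 100 TFLOP/s of compute (assumed)
mem_bandwidth = 1e12       # 1 TB/s of memory bandwidth (assumed)

# Arithmetic intensity needed to keep the compute units busy:
required_intensity = peak_flops / mem_bandwidth   # FLOPs per byte moved
print(f"FLOPs needed per byte of memory traffic to stay compute-bound: {required_intensity:.0f}")

# Elementwise layers do on the order of 1 FLOP per ~12 bytes moved, far below
# that threshold, so they run at memory speed rather than compute speed:
achieved = mem_bandwidth / 12
print(f"Elementwise op throughput: {achieved / 1e12:.2f} TFLOP/s on a {peak_flops / 1e12:.0f} TFLOP/s chip")
```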