Data center networks are nowhere close to the biggest cost, or even the most significant power consumers, in a data center (Cost of Power in Large Scale Data Centers), and yet substantial networking constraints loom large just below the surface. There are many reasons why we need innovation in data center networks, but let’s look at a couple I find particularly interesting and then at the solution we offered in a recent SIGCOMM paper, VL2: A Scalable and Flexible Data Center Network.
By far the biggest infrastructure cost in a high-scale service is the servers themselves. The first and most important optimization of server resources is to increase server utilization. The best way to achieve higher server utilization is to run the servers as a large homogeneous resource pool where workloads can be run on any available server without constraint. There are (at least) two challenges with this approach: 1) most virtual machine live migration techniques only work within a subnet (a layer 2 network) and 2) compute resources that communicate frequently and in high volume need to be “near” each other.
Layer 2 networks are difficult to scale to entire data centers, so all but the smallest facilities are made up of many layer 2 subnets, each of which might be as small as 20 servers or as large as 500. Scaling layer 2 networks much beyond O(10^3) servers is difficult and seldom done with good results, and most are in the O(10^2) range. The restriction of not being able to live migrate workloads across layer 2 boundaries is a substantial limitation on hardware resource balancing and can lead to lower server utilization. Ironically, even though networking is typically only a small portion of the overall infrastructure cost, constraints brought by networking can waste the most valuable components, the servers themselves, through poor utilization.
The second impediment to transparent workload placement (the ability to run any workload on any server) is the inherent asymmetry typical of data center networks. Most data center networks are seriously over-subscribed. This means there is considerably more bandwidth between servers in the same rack than between racks. And, again, there is considerably more bandwidth between racks on the same aggregation switch than between racks on different aggregation switches that must communicate through the core routers. Oversubscription levels of 80 to 1 are common and as much as 240 to 1 can easily be found. If two servers need to communicate extensively and in volume with each other, then they need to be placed near each other with respect to the network. These networking limitations make workload scheduling and placement considerably more difficult and drive reduced levels of server utilization. Networking is, in effect, “in the way” and blocking the efficient optimization of the most valuable resources in the data center. Server under-utilization wastes much of the capital spent on servers and leaves expensive power distribution and cooling resources underutilized.
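To make the 80 to 1 figure concrete, here is a minimal sketch of how per-tier oversubscription compounds. The tier sizes and link speeds are hypothetical numbers chosen only so the arithmetic lands on 80:1; they do not describe any particular data center.

```python
from math import prod


def tier_ratio(downstream_gbps: float, upstream_gbps: float) -> float:
    """Oversubscription of one tier: bandwidth offered from below
    divided by bandwidth available toward the next tier up."""
    return downstream_gbps / upstream_gbps


# Hypothetical three-tier network; every figure is made up for illustration.
tiers = {
    # 20 servers at 1 Gbps sharing 2 x 10 Gbps uplinks: not oversubscribed.
    "rack switch": tier_ratio(20 * 1, 2 * 10),
    # 160 Gbps arriving from racks, 20 Gbps toward the core.
    "aggregation": tier_ratio(160, 20),
    # 400 Gbps arriving from aggregation, 40 Gbps across the core.
    "core": tier_ratio(400, 40),
}

# Ratios multiply along the path between two servers in distant racks.
end_to_end = prod(tiers.values())

# Worst-case share of a 1 Gbps (1000 Mbps) port when every server
# sends across the core at the same time.
worst_case_mbps = 1000 / end_to_end

for name, ratio in tiers.items():
    print(f"{name}: {ratio:.0f}:1")
print(f"end to end: {end_to_end:.0f}:1, about {worst_case_mbps} Mbps per server")
```

Under these assumed numbers a server with a 1 Gbps port is left with 12.5 Mbps when everything talks across the core at once, which is why placement "near" a communication partner matters so much.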
Data Intensive Computing
In the section above, we talked about networking over-subscription levels of 80:1 and higher being common. In the request/response workloads found in many internet services, these over-subscription levels can be tolerable and work adequately well. They are never ideal but they can be sufficient to support the workload. But for workloads that move massive amounts of data between nodes, rather than small amounts of data between the server and the user, oversubscription can be a disaster. Examples of these data intensive workloads are data analysis clusters, many high performance computing workloads, and the new poster child of this workload type, MapReduce. MapReduce clusters of hundreds of servers are common, and there are many clusters of thousands of servers operating upon petabytes of data. It is quite common for a MapReduce job to transfer the entire multi-petabyte data set over the network during a single job run. This can tax the typically shared networking infrastructure heavily, and the network is often the limiting factor in job performance. Or, worded differently, all the servers and all the other resources in the cluster are being underutilized because of insufficient network capacity.
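A rough back-of-the-envelope sketch shows why this hurts. It assumes the data is spread evenly over the servers, that all of it must cross the oversubscribed part of the network, and that there is no protocol overhead or data locality; the cluster size, data set size, and port speed are illustrative assumptions rather than measurements.

```python
def shuffle_seconds(dataset_bytes: int, servers: int, nic_bps: int,
                    oversubscription: float = 1.0) -> float:
    """Lower bound on the time to move a data set once across the network.

    Each server sends an equal share, and its usable bandwidth is the
    port speed divided by the oversubscription ratio.
    """
    bits_per_server = dataset_bytes * 8 / servers
    effective_bps = nic_bps / oversubscription
    return bits_per_server / effective_bps


PETABYTE = 10**15   # decimal petabyte, in bytes
GBPS = 10**9        # bits per second

# Hypothetical cluster: 1,000 servers with 1 Gbps ports moving 1 PB.
full_bisection = shuffle_seconds(PETABYTE, 1000, GBPS)
oversubscribed = shuffle_seconds(PETABYTE, 1000, GBPS, oversubscription=80)

print(f"no oversubscription: {full_bisection / 3600:.1f} hours")
print(f"80:1 oversubscription: {oversubscribed / 86400:.1f} days")
```

With these assumptions the same transfer goes from roughly two hours to roughly a week. Real jobs do better than this worst case because schedulers exploit locality, but the servers still spend much of the job waiting on the network.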
What Needs to Change?
Server utilization can continue to be improved without lifting the networking constraints but, when facing an over-constrained problem, it makes no sense to allow a lower cost component to impose constraints on the optimization of a higher cost component. Essentially, the network is in the way. The same applies to data intensive computing. These workloads can be run on over-subscribed networks but they don’t run well. Any workload that is network constrained is saving money on the network at the expense of underutilizing more valuable components such as the servers and storage.
The biggest part of the needed solution is lower cost networking gear. The reason why most data centers run highly over-subscribed networks is the expense of high-scale networking gear. Rack switches are relatively inexpensive and, as a consequence, they are seldom over-subscribed. Within the rack, bandwidth is usually limited only by the server port speed. Aggregation routers connect rack switches. These implement layer 3 protocols but that’s not the most important differentiator, since many cheap top of rack switches also implement layer 3 protocols. Aggregation routers are more expensive because they have larger memory, larger routing tables, and much higher port counts. Essentially they are the networking equivalent of scale-up servers. And, just as with servers, scaling up networking gear drives costs up exponentially. These expensive aggregation and core routers force, or strongly encourage, some degree of oversubscription in an effort to get costs scaling closer to linearly as the network grows.
Low cost networking gear is a big part of the solution but it doesn’t address the need to scale the layer 2 network discussed above. The two approaches being looked at to solve this problem are to 1) implement a very large layer 2 network or 2) implement a layer 2 overlay network. Cisco and much of the industry are taking the approach of implementing very large layer 2 networks, essentially changing and extending layer 2 with layer 3 functionality (see The Blurring of layer 2 and layer 3). You’ll variously see the efforts to scale layer 2 referred to as Data Center Ethernet (DCE) or IEEE Data Center Bridging (DCB).
The second approach is to leverage the industry investment in layer 3 networking protocols and implement an overlay network. This was the technique employed by Albert Greenberg and a team of researchers, including myself, in VL2: A Scalable and Flexible Data Center Network, which was published at SIGCOMM 2009 earlier this year. The VL2 project is built using commodity 24-port, 1Gbps switch gear. Rather than using scale-up aggregation and core routers, these low cost, high-radix, commodity routers are cabled to form a Clos network that can reasonably scale to O(10^6) ports. This network topology brings many advantages: 1) it has no oversubscription, 2) it is incredibly robust, with many paths between any two ports, 3) it is inexpensive, depending only upon high-volume, commodity components, and 4) it is able to support large data centers in a single, non-blocking network fabric.
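To get a feel for how a Clos design scales with switch radix, here is a small sketch using the well-known three-tier k-ary fat tree, one common folded-Clos layout built from identical k-port switches. It is a generic illustration and not necessarily the exact topology or link speeds used in the VL2 paper.

```python
def fat_tree_size(k: int) -> dict:
    """Size of a three-tier k-ary fat tree built from identical k-port switches.

    There are k pods. Each pod has k/2 edge switches and k/2 aggregation
    switches, and each edge switch connects k/2 hosts, so the fabric
    supports k^3 / 4 hosts at full bisection bandwidth.
    """
    if k < 2 or k % 2:
        raise ValueError("k must be an even port count of at least 2")
    half = k // 2
    return {
        "hosts": k * half * half,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
    }


# Host capacity grows with the cube of the switch radix.
for ports in (24, 48, 96):
    size = fat_tree_size(ports)
    switches = (size["edge_switches"] + size["aggregation_switches"]
                + size["core_switches"])
    print(f"{ports}-port switches: {size['hosts']} hosts, {switches} switches")
```

Because capacity grows cubically in the port count while every element remains a commodity part, higher-radix switches (or additional tiers) extend the same design to much larger fabrics without introducing oversubscription.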
The VL2 approach combines the following:
· Overlay: VL2 is an overlay where all traffic is encapsulated at the source end point and decapsulated at the destination end point. VL2 separates Location Addresses (LA) from Application Addresses (AA). LAs are the standard hierarchically assigned IP addresses used in the underlying physical network fabric. AAs are the addresses used by the application, and the AAs form a single, flat layer 2 address space. Virtual machines can be moved anywhere in the network and still have the same AA. To the application it looks like a single, very large subnet, but the physical transport network is a conventional layer 3 network with hierarchically assigned IP addresses and subnets. VL2 implements a single flat address space without requiring layer 2 extensions not present in commodity routers and without requiring protocol changes in the application.
· Central Directory: The directory implements Application Address to Location Address lookup (and the reverse). Keeping this mapping in a central directory keeps the implementation simple, avoids broadcast domain scaling issues, and supports O(10^6) port scaling.
· Valiant Load Balancing: VLB is used to randomly spread flows over the multipath fabric. Entire flows, rather than the individual packets in a flow, are spread randomly to ensure in-order delivery (all packets on a flow take the same path in the absence of link failure). The paper acknowledges that spreading packets rather than flows would yield more stable results in the presence of dissimilar flow sizes, but experimental results suggest flow spreading may be an acceptable approximation.
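The three pieces above fit together roughly as in the toy sketch below. It is an illustration of the ideas only, not the VL2 implementation: the class and function names, the dictionary "packet" format, the sample addresses, and the use of a SHA-256 hash to pin a flow to an intermediate switch are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Identifies a flow by application addresses and ports (hypothetical)."""
    src_aa: str
    dst_aa: str
    src_port: int
    dst_port: int
    proto: str = "tcp"


class Directory:
    """Toy central directory mapping application address -> location address."""

    def __init__(self) -> None:
        self._aa_to_la: dict[str, str] = {}

    def register(self, aa: str, la: str) -> None:
        # Called when a workload starts or migrates; the AA never changes.
        self._aa_to_la[aa] = la

    def lookup(self, aa: str) -> str:
        return self._aa_to_la[aa]


def pick_intermediate(flow: Flow, intermediates: list[str]) -> str:
    """Spread flows pseudo-randomly over intermediate switches.

    Hashing the flow identity keeps every packet of one flow on the same
    path (preserving order) while different flows land on different paths.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return intermediates[int.from_bytes(digest[:8], "big") % len(intermediates)]


def encapsulate(flow: Flow, payload: bytes, directory: Directory,
                intermediates: list[str]) -> dict:
    """Wrap an application packet with the location addresses to traverse."""
    return {
        "outer_path": [pick_intermediate(flow, intermediates),
                       directory.lookup(flow.dst_aa)],
        "inner": {"src": flow.src_aa, "dst": flow.dst_aa, "payload": payload},
    }


directory = Directory()
directory.register("10.0.0.5", "20.1.1.1")            # made-up AA and LA
intermediates = ["20.9.0.1", "20.9.0.2", "20.9.0.3"]  # made-up LAs

flow = Flow("10.0.0.9", "10.0.0.5", 40000, 80)
before = encapsulate(flow, b"hello", directory, intermediates)

# Migrate the destination: only the directory entry changes.
directory.register("10.0.0.5", "20.2.2.2")
after = encapsulate(flow, b"hello", directory, intermediates)

print(before["outer_path"], "->", after["outer_path"])
```

The point of the design is visible in the last few lines: moving a workload changes only its directory entry, so the application keeps talking to the same application address while the fabric delivers to a new location address, and the flow stays pinned to one intermediate switch either way.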
If you are interested in digging deeper into the VL2 approach:
· The VL2 Paper: VL2: A Scalable and Flexible Data Center Network
· An excellent presentation both motivating and discussing VL2: Networking the Cloud
In my view, we are on the cusp of big changes in the networking world driven by the availability of high-radix, low-cost, commodity routers coupled with protocol innovations.