Data center networks are nowhere close to the biggest cost or even the most significant power consumer in a data center (Cost of Power in Large Scale Data Centers) and yet substantial networking constraints loom large just below the surface. There are many reasons why we need innovation in data center networks but let’s look at a couple I find particularly interesting and at the solution we offered in a recent SIGCOMM paper, VL2: A Scalable and Flexible Data Center Network.
Server Utilization
By far the biggest infrastructure cost in a high-scale service is the servers themselves. The first and most important optimization of server resources is to increase server utilization. The best way to achieve higher server utilization is to run the servers as a large, homogeneous resource pool where workloads can be run on any available server without constraint. There are (at least) two challenges with this approach: 1) most virtual machine live migration techniques only work within a subnet (a layer 2 network) and 2) compute resources that communicate frequently and in high volume need to be “near” each other.
Layer 2 networks are difficult to scale to entire data centers, so all but the smallest facilities are made up of many layer 2 subnets, each of which might be as small as 20 servers or as large as 500. Scaling layer 2 networks much beyond O(10^3) servers is difficult and seldom done with good results, so most remain in the O(10^2) range. Not being able to live migrate workloads across layer 2 boundaries is a substantial limitation on hardware resource balancing and can lead to lower server utilization. Ironically, even though networking is typically only a small portion of the overall infrastructure cost, constraints imposed by networking can waste the most valuable components, the servers themselves, through poor utilization.
The second impediment to transparent workload placement – the ability to run any workload on any server – is the inherent asymmetry typical of data center networks. Most data center networks are seriously over-subscribed: there is considerably more bandwidth between servers in the same rack than between racks and, again, considerably more bandwidth between racks on the same aggregation switch than between racks on different aggregation switches communicating through the core routers. Oversubscription levels of 80 to 1 are common and levels as high as 240 to 1 can easily be found. If two servers need to communicate extensively and in volume, they need to be placed near each other with respect to the network. These networking limitations make workload scheduling and placement considerably more difficult and drive down server utilization. Networking is, in effect, “in the way,” blocking the efficient use of the most valuable resources in the data center. Server under-utilization wastes much of the capital spent on servers and leaves expensive power distribution and cooling resources underutilized.
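To make the arithmetic concrete, here is a minimal sketch of how ratios like these arise. The port counts and link speeds are made-up illustrations, not measurements from any particular network.

```python
# Rough oversubscription arithmetic (illustrative numbers only).
# Oversubscription at a tier = bandwidth offered from below / uplink capacity.

def oversubscription(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of bandwidth arriving from below to the bandwidth going up."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)

# A hypothetical rack: 40 servers at 1 Gbps sharing 2 x 10 Gbps uplinks -> 2:1
print(oversubscription(40, 1, 2, 10))      # 2.0

# A hypothetical aggregation layer: 80 racks each offering 20 Gbps upward,
# sharing 2 x 10 Gbps core links -> 80:1, in the range cited above.
print(oversubscription(80, 20, 2, 10))     # 80.0
```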
Data Intensive Computing
In the section above, we talked about networking over-subscription levels of 80:1 and higher being common. In the request/response workloads found in many internet services, these over-subscription levels can be tolerable and work adequately well. They are never ideal but they can be sufficient to support the workload. But, for workloads that move massive amounts of data between nodes rather than small amounts of data between the server and the user, oversubscription can be a disaster. Examples of these data intensive workloads are data analysis clusters, many high performance computing workloads, and the new poster child of this workload type, MapReduce. MapReduce clusters of hundreds of servers are common and there are many clusters of thousands of servers operating on petabytes of data. It is quite common for a MapReduce job to transfer the entire multi-petabyte data set over the network during a single job run. This can tax the typically shared networking infrastructure incredibly and the network is often the limiting factor in job performance. Or, worded differently, all the servers and all the other resources in the cluster are being underutilized because of insufficient network capacity.
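As a rough illustration of what that means for a big data shuffle, the back-of-envelope calculation below assumes 1,000 servers with 1Gbps NICs moving a petabyte across racks. All of the numbers are assumptions chosen only to show the shape of the problem, not measurements of any real cluster.

```python
# Back-of-envelope estimate of why oversubscription hurts a data shuffle.
# All numbers are assumptions for illustration only.

def shuffle_hours(data_tb, servers, nic_gbps, oversubscription):
    """Hours to move data_tb across racks when each server's cross-rack
    share of bandwidth is nic_gbps / oversubscription."""
    aggregate_gbps = servers * nic_gbps / oversubscription
    seconds = (data_tb * 8e12) / (aggregate_gbps * 1e9)
    return seconds / 3600

# 1 PB (1,000 TB) shuffled by 1,000 servers with 1 Gbps NICs:
print(shuffle_hours(1000, 1000, 1, 1))    # ~2.2 hours on a non-blocking fabric
print(shuffle_hours(1000, 1000, 1, 80))   # ~178 hours when oversubscribed 80:1
```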
What Needs to Change?
Server utilization can continue to be improved without lifting the networking constraints but, when facing an over-constrained problem, it makes no sense to allow a lower cost component to impose constraints on the optimization of a higher cost component. Essentially, the network is in the way. The same applies to data intensive computing. These workloads can be run on over-subscribed networks but they don’t run well. Any workload that is network constrained is saving money on the network at the expense of underutilizing more valuable components such as the servers and storage.
The biggest part of the needed solution is lower cost networking gear. The reason most data centers run highly over-subscribed networks is the expense of high-scale networking gear. Rack switches are relatively inexpensive and, as a consequence, they are seldom over-subscribed: within the rack, bandwidth is usually limited only by the server port speed. Aggregation routers connect rack switches. These implement layer 3 protocols but that’s not the most important differentiator; many cheap top of rack switches also implement layer 3 protocols. Aggregation switches are more expensive because they have larger memory, larger routing tables, and much higher port counts. Essentially, they are the networking equivalent of scale-up servers. And, just as with servers, scaling up networking gear drives costs up exponentially. These expensive aggregation and core routers force, or strongly encourage, some degree of oversubscription in an effort to keep costs scaling closer to linearly as the network grows.
Low cost networking gear is a big part of the solution but it doesn’t address the need to scale the layer 2 network discussed above. The two approaches being pursued to solve this problem are 1) implement a very large layer 2 network or 2) implement a layer 2 overlay over a layer 3 network. Cisco and much of the industry are taking the approach of implementing very large layer 2 networks, essentially changing and extending layer 2 with layer 3 functionality (see The Blurring of layer 2 and layer 3). You’ll variously see the efforts to scale layer 2 referred to as Data Center Ethernet (DCE) or IEEE Data Center Bridging (DCB).
The second approach is to leverage the industry investment in layer 3 networking protocols and implement an overlay network. This was the technique employed by Albert Greenberg and a team of researchers including myself in VL2: A Scalable and Flexible Data Center Network, which was published at SIGCOMM 2009 earlier this year. The VL2 project is built using commodity 24-port, 1Gbps switch gear. Rather than using scale-up aggregation and core routers, these low cost, high-radix, commodity routers are cabled to form a Clos network that can reasonably scale to O(10^6) ports. This network topology brings many advantages: 1) no oversubscription, 2) incredible robustness with many paths between any two ports, 3) low cost, depending only upon high-volume, commodity components, and 4) the ability to support large data centers on a single, non-blocking network fabric.
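To get a feel for how commodity switch radix translates into fabric size, the sketch below applies the standard three-tier folded-Clos (fat-tree) counting argument to identical p-port switches. This is generic fat-tree arithmetic rather than the exact VL2 topology from the paper, and the port counts are only examples.

```python
# Generic capacity count for a three-tier folded-Clos (fat-tree) built
# entirely from identical p-port switches. Illustrative only; not the
# exact topology used in the VL2 prototype.

def fat_tree_capacity(p):
    """Non-blocking host ports and switch count for a fat tree of p-port switches."""
    hosts = p**3 // 4          # p pods x (p/2 edge switches) x (p/2 hosts each)
    switches = 5 * p**2 // 4   # p^2 edge+aggregation switches plus (p/2)^2 core
    return hosts, switches

for ports in (24, 48, 144):
    hosts, switches = fat_tree_capacity(ports)
    print(f"{ports}-port switches: {hosts:,} host ports using {switches:,} switches")
# 24-port switches: 3,456 host ports using 720 switches
# 48-port switches: 27,648 host ports using 2,880 switches
# 144-port switches: 746,496 host ports using 25,920 switches
```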
The VL2 approach combines the following:
· Overlay: VL2 is an overlay where all traffic is encapsulated at the source end point and decapsulated at the destination end point. VL2 separates Location Addresses (LAs) from Application Addresses (AAs). LAs are the standard, hierarchically assigned IP addresses used in the underlying physical network fabric. AAs are the addresses used by the application and they form a single, flat layer 2 address space. Virtual machines can be moved anywhere in the network and keep the same AA. To the application it looks like a single, very large subnet but the physical transport network is a conventional layer 3 network with hierarchically assigned IP addresses and subnets. VL2 implements this single flat address space without requiring layer 2 extensions not present in commodity routers and without requiring protocol changes in the application.
· Central Directory: The directory implements Application Address to Location Address lookup (and the reverse) in a central service, which keeps the implementation simple, avoids broadcast domain scaling issues, and supports O(10^6) port scaling.
· Valiant Load Balancing: VLB is used to randomly spread flows over the multipath fabric. Entire flows are spread randomly rather than individual packets in order to ensure in-order delivery (all packets in a flow take the same path in the absence of link failure). The paper acknowledges that spreading packets rather than flows would yield more stable results in the presence of dissimilar flow sizes, but experimental results suggest flow spreading is an acceptable approximation. A toy sketch of these mechanisms working together follows this list.
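Here is that toy sketch: a central directory mapping AAs to LAs, encapsulation at the source end point, and per-flow hashing across intermediate switches. The addresses, data structures, and function names are all illustrative, not the actual VL2 implementation.

```python
# Toy sketch of the three VL2 mechanisms above: a central AA->LA directory,
# encapsulation at the source, and per-flow (not per-packet) spreading over
# multiple equal paths. Everything here is illustrative, not the real system.
import hashlib

# Central directory: application address (AA) -> locator address (LA) of the
# switch currently hosting that AA. Updated when a VM live-migrates.
directory = {
    "10.0.0.5": "192.168.1.10",   # AA -> LA (made-up addresses)
    "10.0.0.9": "192.168.7.22",
}

# Intermediate switches a flow can be bounced through (Valiant load balancing).
intermediates = ["192.168.100.1", "192.168.100.2", "192.168.100.3", "192.168.100.4"]

def pick_intermediate(src_aa, dst_aa, src_port, dst_port):
    """Hash the flow identifiers so every packet of a flow takes the same path."""
    key = f"{src_aa}:{dst_aa}:{src_port}:{dst_port}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return intermediates[digest % len(intermediates)]

def encapsulate(src_aa, dst_aa, src_port, dst_port, payload):
    """Source-side shim: look up the destination LA and wrap the AA packet."""
    dst_la = directory[dst_aa]                       # central directory lookup
    via = pick_intermediate(src_aa, dst_aa, src_port, dst_port)
    return {"outer_dst": dst_la, "bounce_via": via,
            "inner": {"src": src_aa, "dst": dst_aa, "payload": payload}}

def decapsulate(packet):
    """Destination-side shim: strip the outer header and deliver the AA packet."""
    return packet["inner"]

pkt = encapsulate("10.0.0.5", "10.0.0.9", 49152, 80, b"GET /")
print(pkt["outer_dst"], pkt["bounce_via"])
print(decapsulate(pkt))
```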
If you are interested in digging deeper into the VL2 approach:
· The VL2 Paper: VL2: A Scalable and Flexible Data Center Network
· An excellent presentation both motivating and discussing VL2: Networking the Cloud
In my view, we are on the cusp of big changes in the networking world driven by the availability of high-radix, low-cost, commodity routers coupled with protocol innovations.
–jrh
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Good point Sudipta. Given that VL2 can support a wide variety of multipathed networks, what’s your take on the right approach? What network topologies look best to you, especially when trying to minimize cabling cost, scale down to a small initial deployment, and support incremental growth without recabling?
James Hamilton
jrh@mvdirona.com
VL2 can work with different topologies, not just Clos-like. Topology design has to take into account the constraints of switch degree and the availability of bandwidth in 10G increments on core links. We have a couple of other topology designs in our notebooks.
Further to James’ excellent description, let me add some historical perspective into VL2 development. We actually started with the idea of VLB based oblivious routing to accommodate arbitrary traffic patterns between servers subject to line card constraints. Then, we arrived at a simple topology that imposes full bipartite connectivity between two layers of core switches. This arrangement looks like a folded-Clos network.
We are not using the notion of Clos routing that is associated with Clos topologies. Clos uses single path routing in a circuit switched fabric for a set of permutation demands between input and output ports. VL2 looks like a huge packet switch and uses multi-path routing.
I generally agree with you Denis and, although I love 10G to the server, 1G can support a heck of a lot more than a single disk unless you have a purely sequential workload. 1G will support about 12,000 IOPS so, for a random workload on commodity disks, that’s around 120 disks. If it really is a purely sequential workload, then I agree that a disk or two can saturate a 1Gbps link.
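For anyone checking the arithmetic, here it is with the I/O size spelled out as an assumption (roughly 10KB per random I/O including overhead), which lands close to the ~12,000 IOPS and ~120 disk figures above.

```python
# The arithmetic behind the numbers above; the I/O size is an assumption.
link_gbps = 1
io_size_kb = 10          # assumed average random I/O size incl. protocol overhead
disk_random_iops = 100   # typical commodity disk doing random I/O

link_iops = (link_gbps * 1e9) / (io_size_kb * 1000 * 8)
print(round(link_iops))                      # ~12,500 IOPS over a 1Gbps link
print(round(link_iops / disk_random_iops))   # ~125 disks' worth of random I/O
```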
–jrh
jrh@mvdirona.com
I buy that – with 1Gb a single spindle could saturate the entire network link. 1 spindle per server is not that sexy after all. With 10Gb you can have 5-10 spindles per network socket, so it gets a lot more exciting.
So the notion of "direct-attached storage" is obsolete; only "same DC storage" and "other DC storage" are meaningful. And now that I think of it, that’s what AWS EBS has demonstrated to us already.
But I also feel that Clos can take us much further than that – it gives us unlimited (DC-sized) cumulative storage and bandwidth with linear scale. And so, "infinite bandwidth" + "infinite storage" == "infinite data processing capacity". Oh great, now I’m worried about my job security. :-)
Thanks for the feedback Denis.
You asked "why doesn’t everyone build out a Clos topology"? Some have. It looks like Facebook has, and the topology is common in HPC. I think the reason they aren’t yet common in commercial applications has been the expense of current networking gear. Up until the recent arrival of merchant silicon in switches, 48 ports of 10Gbps have been pretty expensive. Clos designs require lots of net gear so the approach is most practical when the net gear is commodity priced.
The layer 2 requirement for VM migration is driven by the need to move the IP address. You don’t want a VM to come to life on a subnet with an IP address from a different subnet. Here’s a note on Xen live migration: http://www.novell.com/communities/node/5050/xen-virtual-machine-migration and another on VMWare.
I completely agree that making all points equidistant in a data center opens up huge possibilities.
James Hamilton
jrh@mvdirona.com
Excellent piece, James – enough background to explain the problem, the importance of the problem, and the solution with its precursors, but not so long as to become exhausting. I really liked the format.
The Clos-like thing occurred to me about two or three years back, with the conclusion that the "high level switch contention" problem is mostly bogus, although I didn’t realize I was 50 years late with the solution already known as Clos. Is there anything new in the world? :-)
I also had a question. Do you know why everyone doesn’t build out a Clos topology and call the issue of network bottlenecks solved for good?
As I understood it, the layer 2 requirement comes from the need to sustain live connections during migration. Many services such as relational databases and HTTP servers already support the notion of "retry," so they could already benefit from Clos. Most services written from scratch could (and should) be made to have retry logic as well. So I think the lack of a VL2-like solution could not have been a major hindrance.
It seems to me that everyone simply treats "related data must be close to each other" as a golden truth, whereas if we challenge that "truth," we have an enormous opportunity opening up in front of us.