Wide area network costs and bandwidth shortage are the most common reasons why many enterprise applications run in a single data center. Single data center failure modes are common. There are many external threats to single data center deployments including utility power loss, tornado strikes, facility fire, network connectivity loss, earthquake, break in, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot. All it takes is one well intentioned networking engineer to black hole the entire facilities networking traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or cascading failure modes. And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work. Even with incredibly redundancy, just having the redundant components in the same room, means that a catastrophic failure of one system, could possibly eliminate the second. It’s very hard to engineer redundancy with high independence and physical separate of all components in a single datacenter.
With incredible redundancy, comes incredible cost. Even with incredible costs, failure modes remain that can eliminate the facility entirely. The only cost effective solution is to run redundantly across multiple data centers. Redundancy without physical separation is not sufficient and making a single facility bullet proof has expenses asymptotically heading towards infinity with only tiny increases in availability as the expense goes up. The only way to get the next nine is have redundancy between two data centers. This approach is both more available and considerably more cost effective.
Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are 3 main blocker for the customer I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and costs.
The scale problem is, stated simply, most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a collocation facility. Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability. Cloud service providers like Amazon Web Services, run 10s of data centers. You can leverage the AWS scale economics to allow even very low scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each datacenter is what AWS calls an Availability Zone. Assuming the scale economics allow it, the second blocker to cross data center replication is the availability of WAN bandwidth and its cost.
There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can be run. This limitation is real but won’t prevent redundancy data centers from getting far “enough” away to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and costs is the prime reason why most customers don’t run geo-diverse.
I’ve argued that latency need not be a blocker. So, if the application has the scale to be able to be run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithsm and appliances. These are systems that do compression between branch offices and central enterprise IT centers. Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data type specific optimizations, 2) dedupe, and 3) block-based compression.
Data type specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocol, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression since they are the easiest to apply to cross data center traffic replication traffic.
Broadly, dedupe breaks the data to be transferred between datacenters into either fixed or variable sized blocks. Variable blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote datacenter, just send the hash signature. If that block is already in the remote system block index, then it or its clone has already been sent sometime in the past and nothing need to be sent now. In employing this technique we are exploiting data redundancy at a course scale. We are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and inherent redundancy in the data. But, done right, the compression ratios can be phenomenal with 30 to 50:1 not being uncommon. This, by the way, is the same basic technology being applied in storage deduplications by companies like Data Domain.
If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second level compression technique. Usually a block-oriented compression algorithm and frequently some variant of LZ. The combination of dedupe and block compression is very effective. But, the system I’ve described above introduces latency. And, for highly latency sensitive workloads like EMC SRDF, this can be a problem. Many latency sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression.
Last week I ran across a company targeting latency sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to get the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up three ways: 1) hardware implementation based upon FPGA, 2) fixed-sized, full packet block size, 3) bounded index exploiting locality, and 4) heuristic signatures.
The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost and power effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration and, for an application view, High Scale Network Research.
The second technique is another good one. Rather than spend time computing block boundaries, just use the network packet as the block boundary. Essentially they are using the networking system to find the block boundaries. This has the downside of not being as effective as variable sized block systems and they don’t exploit type specific knowledge but they can run very fast at low overhead and close to the higher compression rates yielded by these more computationally intensive techniques. They are exploiting the fact that 20% of the work produces 80% of the gain.
The third technique helps reduce the index size. Rather than having a full index of all blocks that have even been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. This smaller index is much less resource intensive requiring much less memory and no disk accesses. Avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality. Redundant data packets often show up near each other. Clearly this is not always the case and they won’t get the maximum possible compression but they claim to get most of the compression possible in full block index systems without the latency penalty of a disk access and without less memory overhead.
The final technique wasn’t described in enough detail for me to fully understand it. What Infineta is doing is avoiding the cost of fully hashing each packet but taking an approximate signature of carefully chosen packet offsets. Clearly you can take a fast signature on less than the full packet and this signature can be used to know that the packet is not in the index on the other side. But, if the fast hash is present, it doesn’t prove the packet has already been sent. Two different packets can have the same fast hash. Infineta were a bit cagey on this point but what they might be doing is using the very fast approx has to find those that have not yet been sent unambiguously. Using this technique, a fast hash can be used to find those packets that absolutely need to be sent so we can start to compress and send those and waste no more resources on hashing. For those that may not need to be sent, take a full signature and check to see if it is on the remote site. If my guess is correct, the fast hash is being used to avoid spending resources quickly on packets that are not in the index on the other side.
Infineta looks like an interesting solution. More data on them at:
· Press release: http://www.infineta.com/news/news_releases/press_release:5585,15851,445
· Web site: http://www.infineta.com
· Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding http://tinyurl.com/2g4zhbc
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.