Monday, May 10, 2010

Wide area network costs and bandwidth shortages are the most common reasons why many enterprise applications run in a single data center. But single data center failure modes are common. There are many external threats to single data center deployments, including utility power loss, tornado strikes, facility fire, network connectivity loss, earthquake, break-in, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot. All it takes is one well-intentioned networking engineer to black-hole the entire facility’s network traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or by cascading failure modes. And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work. Even with incredible redundancy, just having the redundant components in the same room means that a catastrophic failure of one system could eliminate the second. It’s very hard to engineer redundancy with high independence and physical separation of all components in a single datacenter.

 

With incredible redundancy comes incredible cost. And even at incredible cost, failure modes remain that can eliminate the facility entirely. The only cost-effective solution is to run redundantly across multiple data centers. Redundancy without physical separation is not sufficient, and making a single facility bulletproof has expenses heading asymptotically towards infinity while yielding only tiny increases in availability. The only way to get the next nine is to have redundancy between two data centers. This approach is both more available and considerably more cost effective.

 

Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are three main blockers for the customers I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and cost.

 

The scale problem, stated simply, is that most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a colocation facility. Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability. Cloud service providers like Amazon Web Services run tens of data centers. You can leverage the AWS scale economics to allow even very low scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each datacenter is what AWS calls an Availability Zone. Assuming the scale economics allow it, the remaining blockers to cross-datacenter replication are latency and the availability and cost of WAN bandwidth.

 

There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can run. This limitation is real but won’t prevent redundant data centers from getting far enough apart to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and cost are the prime reasons why most customers don’t run geo-diverse.
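
The speed-of-light limit is easy to sanity check with a back-of-envelope calculation. Light in fiber travels at roughly two thirds of c, about 200 km per millisecond one way, so a sketch like the following (the 200 km/ms constant is an approximation, and real fiber routes are longer than straight-line distance):

```python
# Rough check: how far apart can datacenters sit within a given
# round-trip replication budget? Light in fiber covers roughly
# 200 km per millisecond one way (about 2/3 of c in vacuum).

FIBER_KM_PER_MS_ONE_WAY = 200.0  # approximate propagation speed in fiber

def max_separation_km(rtt_budget_ms: float) -> float:
    """Maximum straight-line fiber distance that fits the RTT budget."""
    one_way_ms = rtt_budget_ms / 2.0
    return one_way_ms * FIBER_KM_PER_MS_ONE_WAY

for budget in (1, 4, 5):
    print(f"{budget} ms RTT budget -> up to ~{max_separation_km(budget):.0f} km apart")
```

A 4 to 5 msec round-trip budget still allows hundreds of kilometers of separation, comfortably more than needed to escape most correlated failure modes.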

 

I’ve argued that latency need not be a blocker. So, if the application has the scale to be run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithms and appliances. These are systems that do compression between branch offices and central enterprise IT centers; Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing, but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data-type-specific optimizations, 2) dedupe, and 3) block-based compression.

 

Data-type-specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocols, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression, since they are the easiest to apply to cross-datacenter replication traffic.

 

Broadly, dedupe breaks the data to be transferred between datacenters into either fixed- or variable-sized blocks. Variable-sized blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote datacenter, only the hash signature is sent. If that block is already in the remote system’s block index, then it or its clone has already been sent sometime in the past and nothing needs to be sent now. In employing this technique we are exploiting data redundancy at a coarse granularity: we are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and the inherent redundancy in the data. But, done right, the compression ratios can be phenomenal, with 30:1 to 50:1 not being uncommon. This, by the way, is the same basic technology applied to storage deduplication by companies like Data Domain.
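
The core send-side loop can be sketched in a few lines. This is a toy illustration of the general technique, not any vendor’s implementation; the class name and 4 KB default block size are my own invented choices:

```python
import hashlib

class DedupeSender:
    """Toy sketch of hash-based dedupe: remember which block hashes the
    remote side already holds and only ship blocks it has never seen."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.remote_index = set()  # hashes known to be on the remote side

    def transmit(self, data: bytes):
        """Yield ('hash', h) for blocks already at the remote site,
        and ('block', h, block) for blocks that must actually be sent."""
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            h = hashlib.sha256(block).hexdigest()
            if h in self.remote_index:
                yield ("hash", h)          # remote already has it: signature only
            else:
                self.remote_index.add(h)   # it will now exist on both sides
                yield ("block", h, block)

sender = DedupeSender(block_size=4)
msgs = list(sender.transmit(b"AAAABBBBAAAA"))
# Three 4-byte blocks; the third repeats the first, so only its hash travels.
```

For a repeated block, a 32-byte signature replaces the full block on the wire, which is where the 30:1 to 50:1 ratios come from on highly redundant data.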

 

If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second-level compression technique: usually a block-oriented compression algorithm, frequently some variant of LZ. The combination of dedupe and block compression is very effective. But the system I’ve described above introduces latency and, for highly latency-sensitive workloads like EMC SRDF, this can be a problem. Many latency-sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression.
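
Combining the two levels is straightforward. In this sketch I use zlib as a stand-in for whatever LZ-family codec a real appliance would use; the function and variable names are mine:

```python
import hashlib
import zlib

def encode_block(block: bytes, remote_index: set):
    """Dedupe first; only if the block is unseen do we pay for
    block compression (zlib standing in for an LZ-family codec)."""
    h = hashlib.sha256(block).digest()
    if h in remote_index:
        return ("ref", h)                     # 32-byte reference replaces the block
    remote_index.add(h)
    return ("data", h, zlib.compress(block))  # compress only what must travel

index = set()
block = b"the quick brown fox " * 100
kind1 = encode_block(block, index)[0]   # first sighting: compressed payload
kind2 = encode_block(block, index)[0]   # repeat: hash reference only
```

Note the ordering: dedupe runs first because a hash hit makes compression unnecessary, while a miss still benefits from the LZ pass.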

 

Last week I ran across a company targeting latency-sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to deliver the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up four ways: 1) a hardware implementation based upon FPGAs, 2) fixed-size, full-packet blocks, 3) a bounded index exploiting locality, and 4) heuristic signatures.

 

The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost- and power-effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. See Heterogeneous Computing using GPGPUs and FPGAs for related discussion of hardware acceleration and, for an application view, High Scale Network Research.

 

The second technique is another good one. Rather than spending time computing block boundaries, they just use the network packet as the block. Essentially they let the networking system choose the block boundaries for them. The downside is that this is not as effective as variable-sized block systems and doesn’t exploit type-specific knowledge, but it runs very fast at low overhead and comes close to the higher compression rates yielded by the more computationally intensive techniques. They are exploiting the fact that 20% of the work produces 80% of the gain.
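
The trade-off between fixed boundaries and content-defined boundaries is easy to demonstrate. The sketch below is illustrative only: real content-defined chunking uses a rolling hash such as a Rabin fingerprint rather than re-hashing a window with SHA-256, and the window size and mask here are arbitrary choices of mine:

```python
import hashlib

def fixed_chunks(data: bytes, size=8):
    """Packet-style chunking: boundaries fall at fixed offsets,
    analogous to using each network packet as the block."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes, window=4, mask=0x0F):
    """Content-defined chunking: cut wherever a hash of the trailing
    window matches a pattern, so boundaries follow the data itself."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:2], "big")
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

a = b"0123456789abcdefghij"
b_shifted = b"X" + a  # one inserted byte shifts every fixed boundary
shared_fixed = set(fixed_chunks(a)) & set(fixed_chunks(b_shifted))
# Content-defined boundaries resynchronize after the insertion point,
# so duplicate chunks can still be found; fixed boundaries find none.
```

A single inserted byte defeats fixed chunking entirely, while content-defined boundaries recover. Infineta’s bet is that inter-datacenter replication streams mostly repeat at aligned packet granularity, so the cheap fixed scheme captures most of the win.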

 

The third technique reduces the index size. Rather than keeping a full index of all blocks that have ever been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. The smaller index is much less resource intensive, requiring much less memory and no disk accesses, and avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality: redundant data packets often show up near each other in time. Clearly this is not always the case and they won’t get the maximum possible compression, but they claim to get most of the compression possible in full-block-index systems without the latency penalty of a disk access and with much less memory overhead.
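
A bounded, recency-ordered index is essentially an LRU set. A minimal sketch of the idea (the class name and API are my own; this is not Infineta’s actual data structure):

```python
from collections import OrderedDict

class BoundedDedupeIndex:
    """Keep only the N most recently seen block hashes in memory,
    trading a little compression for a fully RAM-resident index."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # hash -> None, ordered by recency

    def seen(self, h: bytes) -> bool:
        """Return True if h is within the recent window; either way,
        record it as the most recently seen hash."""
        hit = h in self.entries
        if hit:
            self.entries.move_to_end(h)
        else:
            self.entries[h] = None
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently seen
        return hit

idx = BoundedDedupeIndex(capacity=2)
idx.seen(b"a"); idx.seen(b"b"); idx.seen(b"c")  # b"a" ages out of the window
```

Duplicates arriving close together still hit the index; a block repeated only after millions of intervening packets is simply resent, which costs bandwidth but never a disk access.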

 

The final technique wasn’t described in enough detail for me to fully understand it. What Infineta appears to be doing is avoiding the cost of fully hashing each packet by taking an approximate signature over carefully chosen packet offsets. A fast signature computed over less than the full packet can prove that a packet is not in the index on the other side. But a fast-hash hit doesn’t prove the packet has already been sent: two different packets can have the same fast hash. Infineta were a bit cagey on this point, but my guess is that they use the very fast approximate hash to unambiguously identify packets that have not yet been sent, immediately start compressing and sending those without wasting further hashing resources on them, and take a full signature only for packets that may already be at the remote site. If my guess is correct, the fast hash is being used to avoid spending resources on packets that are certain to need transmission anyway.
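
My guess at the scheme can be sketched as a negative filter in front of the full hash. Everything here is speculative: the sampled offsets, signature widths, and function names are invented for illustration, not Infineta’s design:

```python
import hashlib

def fast_signature(packet: bytes) -> bytes:
    """Cheap approximate signature: hash only a few bytes sampled at
    fixed offsets (the offsets here are invented for illustration)."""
    sample = bytes(packet[i % len(packet)] for i in (0, 7, 31, 63, 127))
    return hashlib.md5(sample).digest()[:4]

def classify(packet: bytes, fast_index: set, full_index: set) -> str:
    """Use the fast signature as a negative filter: a miss proves the
    packet was never sent; only a hit warrants the full hash."""
    fs = fast_signature(packet)
    if fs not in fast_index:
        fast_index.add(fs)
        full_index.add(hashlib.sha256(packet).digest())
        return "send"                       # definitely new: ship it immediately
    # Fast hashes collide, so a hit must be confirmed with the full hash.
    full = hashlib.sha256(packet).digest()
    if full in full_index:
        return "reference"                  # true duplicate: send hash only
    full_index.add(full)
    return "send"                           # fast-hash collision: still new

fast, full = set(), set()
p = b"replication payload" * 10
first = classify(p, fast, full)    # misses the fast filter
second = classify(p, fast, full)   # hits, then confirmed by full hash
```

The payoff is that the common case for never-before-seen traffic resolves with only the cheap sampled hash, keeping per-packet latency in the microsecond range.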

 

Infineta looks like an interesting solution.  More data on them at:

·         Press release: http://www.infineta.com/news/news_releases/press_release:5585,15851,445

·         Web site: http://www.infineta.com

·         Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding http://tinyurl.com/2g4zhbc

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 10, 2010 12:07:33 PM (Pacific Standard Time, UTC-08:00)
Hardware
Tuesday, May 11, 2010 5:28:29 AM (Pacific Standard Time, UTC-08:00)
you can do some redundancy at the city level where cost is less of an issue, but you need to worry more about fault lines, whether you are downwind of any volcano, and other large events that you can predict in the PNW or Bay Area and so should design for. But the bandwidth is always the enemy.

Of course, rsync is a form of dedupe, but running it in hardware at 10Gbps is a new feature. I can see that being handy even between buildings.
Tuesday, May 11, 2010 10:47:50 AM (Pacific Standard Time, UTC-08:00)
James, This one is a super-interesting post and you have nailed all the pain points (as always).

"The effectiveness of this broad technique is very dependent upon the size and efficiency of the *indexing structures*, ...".

I think that flash memory could provide a nice way to scale the index lookup part of the system. We have built a flash-assisted storage deduplication system in MSR. We also include low RAM usage as a design goal.

** To reduce RAM usage, we store 2-byte compact key signatures in RAM instead of full 20-byte SHA-1 hashes.

** We also provide techniques to reduce RAM usage by indexing only a sample of the chunks and not all of them.

The paper will appear in USENIX Annual Technical Conference 2010 next month in Boston.

ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory,
B. Debnath, Sudipta Sengupta, and Jin Li,
To Appear in 2010 USENIX Annual Technical Conference, Boston, USA, June 2010.

Snippet from the paper abstract:
We design a flash-assisted inline deduplication system using ChunkStash, a chunk metadata store on flash. ChunkStash uses one flash read per chunk lookup and works in concert with RAM prefetching strategies. It organizes chunk metadata in a log-structure on flash to exploit fast sequential writes. It uses an in-memory hash table to index them, with hash collisions resolved by a variant of cuckoo hashing. The in-memory hash table stores compact key signatures instead of full chunk hashes so as to strike tradeoffs between RAM usage and false flash reads. Further, by indexing a small fraction of chunks per container, ChunkStash can reduce RAM usage significantly with negligible loss in deduplication quality.

--Sudipta
Wednesday, May 12, 2010 5:13:05 AM (Pacific Standard Time, UTC-08:00)
I 100% agree Steve. Metro-level is often a good compromise between fault containment and low latency. But, as you said, once you leave the building the WAN drives cost and gets in the way.

--jrh
jrh@mvdirona.com
Wednesday, May 12, 2010 5:15:29 AM (Pacific Standard Time, UTC-08:00)
Super interesting two level index approach Sudipta. It sounds like you are using an in-memory approximate index and an on-flash precise chunk index. Nice approach. Please send me the paper.

--jrh
jrh@mvdirona.com
Thursday, May 13, 2010 6:10:42 PM (Pacific Standard Time, UTC-08:00)
Hi James, I am Ike, a cloud computing guy from China.
This is a great site, I have learned a lot from you. Thx!
Friday, May 14, 2010 6:38:38 PM (Pacific Standard Time, UTC-08:00)
Good to hear Ike. I would love to hear more about cloud computing in China. Drop me a note sometime.

jrh@mvdirona.com
Friday, May 14, 2010 7:15:15 PM (Pacific Standard Time, UTC-08:00)
Hi James, great post as always. I am trying to understand how Infineta is different from Riverbed. Is it about operating at 10Gbps speeds and a hardware implementation? First of all, I don't think of the inter-datacenter replication traffic to be very latency sensitive. In addition, given the latency of 10ms or more between data centers because of geographic distance, an additional latency of up to 500 microseconds shouldn't make the situation any worse. What am I missing with this thinking? I am also wondering if network is the right place to perform dedupe - it seems that it would be better to do in the replication software. What do you think?


Thanks,

Ram
Monday, May 17, 2010 4:17:33 AM (Pacific Standard Time, UTC-08:00)
You are right Ram, the core technology is similar to that which is used by Riverbed. The key difference is latency. Infineta keeps the block index 100% in memory and gives up a small amount of compression efficiency by using a smaller index. This gives them speed and reduces appliance cost but gives up some compression effectiveness.

The target for this work is different from the one Riverbed grew up in. Riverbed got its start in branch office acceleration. Of course, they too are interested in growing into other markets that have similar requirements. Two of the potential adjacent markets of interest are 1) enterprise-to-cloud in support of hybrid clouds, and 2) inter-datacenter replication flows.

There are some inter-datacenter workloads that are very latency sensitive. For example, EMC SRDF replication requires very low latency (<5 msec) when run in synchronous mode. Sync SRDF is typically used between centers very near each other. The prototypical example is in the financial district in New York where replication to New Jersey datacenters is the norm.

Any synchronous replication scheme requires very low latency and so they are usually run between datacenters that are relatively close together (metro area rather than cross country).

--jrh
jrh@mvdirona.com

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
