Sunday, August 01, 2010

A couple of weeks back Greg Linden sent me an interesting paper called Energy Proportional Datacenter Networks. The principal of energy proportionality was first coined by Luiz Barroso and Urs Hölzle in an excellent paper titled The Case for Energy-Proportional Computing. The core principal behind energy proportionality is that computing equipment should consume power in proportion to their utilization level. For example, a computing component that consumes N watts at full load, should consume X/100*N Watts when running at X% load. This may seem like a obviously important concept but, when the idea was first proposed back in 2007, it was not uncommon for a server running at 0% load to be consuming 80% of full load power. Even today, you can occasionally find servers that poor. The incredibly difficulty of maintaining near 100% server utilization makes energy proportionality a particularly important concept.

                                                                                                                         

One of the wonderful aspects of our industry is, when an attribute becomes important, we focus on it. Power consumption as a whole has become important and, as a consequence, average full load server power has been declining for the last couple of years. It is now easy to buy a commodity server under 150W. And, with energy proportionality now getting industry attention, this metric is also improving fast. Today a small percentage of the server market has just barely cross the 50% power proportionality line which is to say, if they are idle, they are consuming less than 50% of the full load power.

 

This is great progress that we’re all happy to see. However, it’s very clear we are never going to achieve 100% power proportionality so raising server utilization will remain  the most important lever we have in reducing wasted power. This is one of the key advantages of cloud computing. When a large number of workloads are brought together with non-correlated utilization peaks and valleys, the overall peak to trough load ratio is dramatically flattened. Looked at it simplistically, aggregating workloads with dissimilar peak resource consumption levels tends to reduce the peak to trough ratios. There need to be resources available for peak resource consumption but the utilization level goes up and down with workload so reducing the difference between peak and trough aggregate load levels, cuts costs, save energy, and is better for the environment.

 

Understanding the importance of power proportionality, it’s natural to be excited by the Energy Proportional Datacenter Networks paper. In this paper, the authors observer “if the system is 15% utilized (servers and network) and the servers are fully power-proportional, the network will consume nearly 50% of the overall power.” Without careful reading, this could lead one to believe that network power consumption was a significant industry problem and immediate action at top priority was needed. But the statement has two problems. The first is the assumption that “full  energy proportionality” is even possible. There will always be some overhead in having a server running. And we are currently so distant from this 100% proportional goal, that any conclusion that follows from this assumption is unrealistic and potentially mis-leading.

 

The second issue is perhaps more important: the entire data center might be running at 15% utilization. 15% utilization means that all the energy (and capital) that went into the datacenter power distribution system, all the mechanical systems, the servers themselves, is only being 15% utilized. The power consumption is actually a tiny part of the problem.  The real problem is the utilization level means that most resources in a nearly $100M investment are being wasted by low utilization levels.There are many poorly utilized data centers running at 15% or worse utilization but I argue the solution to this problem is to increase utilization. Power proportionality is very important and many of us are working hard to find ways to improve power proportionality. But, power proportionality won’t reduce the importance of ensuring that datacenter resources are near fully utilized.

 

Just as power proportionality will never be fully realized, 100% utilization will never be realized either so clearly we need to do both. However, it’s important to remember that the gain from increasing utilization by 10% is far more than the possible gain to be realized by improving power proportionality by 10%. Both are important but utilization is by far the strongest lever in reducing cost and the impact of computing on the environment.

 

Returning to the negative impact of networking gear on overall datacenter power consumption, let’s look more closely. It’s easy to get upset when you look at net gear power consumption. It is prodigious. For example a Juniper EX8200 consumes nearly 10Kw. That’s roughly as much power consumption as an entire rack of servers (server rack powers range greatly but 8 to 10kW is pretty common these days). A fully configured Cisco Nexus 7000 requires 8 full 5kW circuits to be provisioned. That’s 40kW or roughly as much power provisioned to a single network device as would be required by 4 average racks of servers. These numbers are incredibly large individually but it’s important to remember that there really isn’t all that much networking gear in a datacenter relative to the number of servers.

 

To get precise, let’s build a model of the power required in a large datacenter to understand the details of networking gear power consumption relative to the rest of the facility. In this model, we’re going to build a medium to large sized datacenter with 45,000 servers. Let’s assume these servers are operating at an industry leading 50% power proportionality and consume only 150W each at full load. This facility has a PUE of 1.5 which is to say that for every watt of power delivered to IT equipment (servers, storage and networking equipment), there is ½ watt lost in power distribution and powering mechanical systems. PUEs as high as 1.2 are possible but rare and PUEs as poor as 3.0 are possible and unfortunately quite common. The industry average is currently 2.0 which says that in an average datacenter, for every watt delivered to the IT load, 1 watt is spent on overhead (power distribution, cooling, lighting, etc.).

 

For networking, we’ll first build a conventional, over-subscribed, hierarchical network.  In this design, we’ll use Cisco 3560 as a top of rack (TOR) switch. We’ll connect these TORs 2-way redundant to Juniper Ex8216 at the aggregation layer. We’ll connect the aggregation layer to Cisco 6509E in the core were we need only 1 device for a network of this size but we use 2 for redundancy. We also have two border devices in the model:

 

Using these numbers as input we get the following:

 

·         Total DC Power:                                10.74MW (1.5 * IT load)

·         IT Load:                                                7.16MW (netgear + servers and storage)

o   Servers & Storage:          6.75MW (45,100 * 150W)

o   Networking gear:             0.41 MW (from table above)

 

This would have the networking gear as consuming only 3.8% of the overall power consumed by the facility (0.41/10.74). If we were running at 0% utilization which I truly hope is far worse that anyone’s worst case today, what networking consumption would we see? Using the numbers above with the servers at 50% power proportionally (unusually good), we would have the networking gear at 7.2% of overall facility power (0.41/((6.75*0.5+0.41)*1.5)). 

 

This data argues strongly that networking is not the biggest problem or anywhere close. We like improvements wherever we can get them and so I’ll never walk away from a cost effective networking power solution. But, networking power consumption is not even close to our biggest problem so we should not get distracted.

 

What if we built an modern fat tree network using commodity high-radix networking gear along the lines alluded to in Data Center Networks are in my Way and covered in detail in the VL2 and PortLand papers? Using 48 port network devices we would need 1875 switches in the first tier, 79 in the next, then 4, and 1 at the root. Let’s use 4 at the root to get some additional redundancy which would put the switch count at 1,962 devices. Each network device dissipates roughly 150W and driving each of the 48 transceivers requires roughly 5w each  (a rapidly declining number). This totals to 390W per 48 port device. Any I’ve seen are better but let’s use these numbers to ensure we are seeing the network in its worst likely form. Using these data we get:

 

·         Total DC Power:                                11.28MW

·         IT Load:                                                7.52MW

o   Servers & Storage:          6.75MW

o   Networking gear:             0.77MW

 

Even assuming very high network power dissipation rates with 10GigE to every host in the entire datacenter with a constant bisection bandwidth network topology that requires a very large number of devices, we still only have the network at 6.8% of the overall data center. If we assume the very worst case available today where we have 0% utilization with 50% power proportional servers, we get 12.4% power consumed in the network (0.77/((6.75*0.5+0.77)*1.5)).

 

12% is clearly enough to worry about but, in order for the network to be that high, we had to be running at 0% utilization which is to say, that all the resources in the entire data center are being wasted. 0% utilization means we are wasting 100% of the servers, all the power distribution, all the mechanical systems, and the entire networking system, etc. At 0% utilization, it’s not the network that is the problem. It’s the server utilization that needs attention.

 

Note that in the above model more than 60% of the power consumed by the networking devices were the per-port transceivers. We used 5W/port for these but overall transceiver power is expected to drop down to 3W or less over the next couple of years so we should expect a drop in network power consumption of 30 to 40% in the very near future.

 

Summarizing the findings: My take from this data is it’s a rare datacenter where more than 10% of power is going to networking devices and most datacenters will be under 5%. Power proportionality in the network is of some value but improving server utilization is a far more powerful lever. In fact a good argument can be made to spend more on networking gear and networking power if you can use that investment to drive up server utilization by reducing constraints on workload placement.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, August 01, 2010 9:44:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Friday, July 09, 2010

Hierarchical storage management (HSM) also called tiered storage management is back but in a different form. HSM exploits the access pattern skew across data sets by placing cold, seldom accessed data on slow cheap media and frequently accessed data on fast near media.  In old days, HSM typically referred to system mixing robotically managed tape libraries with hard disk drive staging areas. HSM was actually never gone – its just a very old technique to exploit data access pattern skew to reduce storage costs.  Here’s an old unit from FermiLab.

 

Hot data or data currently being accessed is stored on disk and old data that has not been recently accessed is stored on tape. It’s a great way to drive costs down below disk but avoid the people costs of tape library management and to (slightly) reduce the latency of accessing data on tape.

 

The basic concept shows up anywhere where there is data access locality or skew in the access patterns where some data is rarely accessed and some data is frequently accessed. Since evenly distributed, non-skewed access pattern only show up in benchmarks, this concept works on a wide variety of workloads. Processors have cache hierarchies where the top of the hierarchy is very expensive register files and there are multiple layers of increasingly large caches between the register file and memory. Database management systems have large in memory buffer pools insulating access to slow disk. Many very high scale services like Facebook have mammoth in-memory caches insulating access to slow database systems. In the Facebook example, they have 2TB of Memcached in front of their vast MySQL fleet: Velocity 2010.

 

Flash memory again opens up the opportunity to apply HSM concepts to storage.  Rather than using slow tape and fast disk, we use (relatively) slow disk and fast NAND flash. There are many approaches to implementing HSM over flash memory and hard disk drives. Facebook implemented Flashcache which tracks access patterns at the logical volume layer (below the filesystem) in Linux with hot pages written to flash and cold pages to disk.  LSI is a good example implementation done at the disk controller level with their CacheCade product. Others have done it in application specific logic where hot indexing structures are put on flash and cold data pages are written to disk. Yet another approach that showed up around 3 years ago is a Hybrid Disk Drive.

 

Hybrid drives combine large, persistent flash memory caches with disk drives in a single package. When they were first announced, I was excited by them but I got over the excitement once I started benchmarking. It was what looked to be a good idea but the performance really was unimpressive.

 

Hybrid rives still looks like a good idea but now we actually have respectable performance with the Seagate Momentus XT. This part is actually aimed at the laptop market but I’m always watching client progress to understand what can be applied to servers. This finally looks like its heading in the right direction.  See the AnandTech article on this drive for more performance data: Seagate's Momentus XT Reviewed, Finally a Good Hybrid HDD. I still slightly prefer the Facebook FlashCache approach but these hybrid drives are worth watching.

 

Thanks to Greg Linden for sending this one my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, July 09, 2010 8:30:14 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Sunday, June 13, 2010

I’ve been talking about the application low-power, low-cost processors to server workloads for years starting with The Case for Low-Cost, Low-Power Servers. Subsequent articles get into more detail: Microslice Servers, Low-Power Amdahl Blades for Data Intensive Computing, and Successfully Challenging the Server Tax

 

Single dimensional measures of servers like “performance” without regard to server cost or power dissipation are seriously flawed. The right way to measure server performance is work done per dollar and work done by joule. If you adopt these measures of workload performance, we find that cold storage workload and highly partitionable workloads run very well on low-cost, low-power servers. And we find the converse as well. Database workloads run poorly on these servers (see When Very Low-Power, Low-Cost Servers Don’t Make Sense).

 

The reasons why scale-up workloads in general and database workload specifically run poorly on low-cost, low-powered servers are fairly obvious.  Workloads that don’t scale-out, need bigger single servers to scale (duh). And workloads that are CPU bound tend to run more cost effectively on higher powered nodes. The later isn’t strictly true. Even with scale-out losses, many CPU bound workloads still run efficiently on low-cost, low-powered servers because what is lost on scaling is sometimes more than gained by lower-cost and lower power consumption.

 

I find the bounds where a technology ceases to work efficiently to be the most interesting area to study for two reasons: 1) these boundaries teach us why current solutions don’t cross the boundary and often gives us clues on how to make the technology apply more broadly, and most important, 2) you really need to know where not to apply a new technology. It is rare that a new technology is a uniform across-the board win. For example, many of the current applications of flash memory make very little economic sense. It’s a wonderful solution for hot I/O-bound workloads where it is far superior to spinning media. But flash is a poor fit for many of the applications where it ends up being applied. You need to know where not to use a technology.

 

Focusing on the bounds of why low-cost, low-power servers don’t run a far broader class of workloads also teaches us what needs to change to achieve broader applicability. For example, if we ask what if the processor cost and power dissipation was zero, we quickly see, when scaling down processors costs and power, it is what surrounds the processor that begins to dominate. We need to get enough work done on each node to pay for the cost and power of all the surrounding components from northbridge, through  memory, networking, power supply, etc. Each node needs to get enough done to pay for the overhead components.

 

This shows us an interesting future direction: what if servers shared the infrastructure and the “all except the processor” tax was spread over more servers? It turns out this really is a great approach and applying this principle opens up the number of workloads that can be hosted on low-cost, low-power servers. Two examples of this direction are the Dell Fortuna and Rackable CloudRack C2. Both these shared infrastructure servers take a big step in this direction.

 

SeaMicro Releases Innovative Intel Atom Server

One of the two server startups I’m currently most excited about is SeaMicro. Up until today, they have been in stealth mode and I haven’t been able to discuss what they are building. It’s been killing me. They are targeting the low-cost, low-power server market and they have carefully studied the lessons above and applied the learning deeply. SeaMicro has built a deeply integrated, shared infrastructure, low-cost, low-power server solution with a broader potential market than any I’ve seen so far. They are able to run the Intel x86 instruction set avoiding the adoption friction of using different ISAs and they have integrated a massive number servers very deeply into an incredibly dense package. I continue to point out that rack density for densities sake is a bug not a feature (see Why Blades aren’t the Answer to All Questions) but the SeaMicro server module density is “good density” that reduces cost and increases efficiency. At under 2kw for a 10RU module, it is neither inefficient or challenging from a cooling perspective.

 

Potential downsides of the SeaMicro approach is that the Intel Atom CPU is not quite as power efficient as some of the ARM-based solutions and it doesn’t currently support ECC memory. However, the SeaMicro design is available now and it is a considerable advancement over what is currently in the market. See You Really do Need ECC Memory in Servers for more detail on why ECC can be important. What SeaMicro has built is actually CPU independent and can integrate other CPUs as other choice become available and the current Intel Atom-based solution will work well for many server workloads. I really like what they have done.

 

SeaMicro have taken shared infrastructure to a entirely new level in building a 512 server module that takes just 10 RU and dissipates just under 2Kw. Four of these modules will fit in an industry standard rack, consume a reasonable 8kW, and deliver more work done joule, work done per dollar, and more work done per rack than the more standard approaches currently on the market.

The SeaMicro server module is comprised of:

·         512 1.6Ghz Intel Atoms (2048 CPUs/rack)

·         1 TB DRAM

·         1.28 Tbps networking fabric

·         Up to 16x 10Gbps ingress/egress network or up to 64 1Gbps if running 1GigE

·         0 to 64 SATA SSD or HDDs

·         Standard x86 instruction set architecture (no recompilation)

·         Integrated software load-balancer

·         Integrated layer 2 networking switch

·         Under 2kW power

 

The server nodes are built 8 to a board in one of the nicest designs I’ve seen for years:

 

 

The SeaMicro 10 RU chassis can be hosted 4 to an industry standard rack:

 

This is an important hardware advancement and it is great to see the faster pace of innovation sweeping the server world driven by innovative startups like SeaMicro.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, June 13, 2010 8:01:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware
 Tuesday, May 18, 2010

I am excited by very low power, very low cost servers and the impact they will have on our industry. There are many workloads where CPU is in excess and lower power and lower cost servers are the right answer. These are workloads that don’t fully exploit the capabilities of the underlying server. For these workloads, the server is out of balance with excess CPU capability (both power and cost). There are workloads were less is more. But, with technology shifts, it’s easy to get excited and try to apply the new solution too broadly.

 

We can see parallels in the Flash memory world. At first there was skepticism that Flash had a role to play in supporting server workloads. More recently, there is huge excitement around flash and I keep coming across applications of the technology that really don’t make economic sense. Not all good ideas apply to all problems. In going after this issue I wrote When SSDs make sense in Server applications and then later When SSDs Don’t Make Sense in Server Applications. Sometimes knowing where not to apply a technology is more important than knowing where to apply it. Looking at the negative technology applications is useful.

 

Returning to very low-cost, low-power servers, I’ve written a bit about where they make sense and why:

·         Very Low-Power Server Progress

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the Microslice Computer

·         Linux/Apache on ARM Servers

·         ARM Cortex-A9 SMP Design Announced

 

But I haven’t looked much at where very low-power, low-cost servers do not make sense. When aren’t they a win when looking at work done per dollar and work done per joule? Last week Dave DeWitt sent me a paper that looks the application of Wimpy (from the excellent  FAWN, Fast Array of Wimpy Nodes, project at CMU) servers and their application to database workloads. In Wimpy Node Clusters: What About Non-Wimpy Workloads Willis Lang, Jignesh Patel, and Srinanth Shankar find that Intel Xeon E5410 is slightly better than Intel Atom when running parallel clustered database workloads including TPC-E and TPC-H. The database engine in this experiment is IBM DB2 DB-X (yet another new name for the product originally called DB2 Parallel Edition – see IBM DB2 for information on DB2 but the Wikipedia page is not yet caught up to the latest IBM name change).

 

These results show us that on complex, clustered database workloads, server processors can win over low-power parts. For those interested in probing the very low-cost, low-power processor space, the paper is worth a read: Wimpy Node Clusters: What About Non-Wimpy Workloads.

 

The generalization of their finding that I’ve been using is CPU intensive and workloads with poor scaling characteristic are poor choices to be hosted on very low-power, low-cost servers. CPU intensive workloads are a lose because these workloads are CPU-bound so run best where there is maximum CPU per-server in the cluster. Or worded differently, the multi-server cluster overhead is minimized by having fewer, more-powerful nodes.  Workloads with poor scaling characteristics are another category not well supported by wimpy nodes and the explanation is similar. Although these workloads may not be CPU-bound, they don’t run well over clusters with large server counts. Generally, more resources per node is the best answer if the workload can’t be scaled over large server counts.

 

Where very low-power, low-cost servers win is:

1.      Very cold storage workloads. I last posted on these workloads last year Successfully Challenging the Server Tax. The core challenge with cold storage apps is that overall system cost is dominated by disk but the disk needs to be attached to a server. We have to amortize the cost of the server over the attached disk storage. The more disk we attach to a single server, the lower the cost. But, the more disk we attach to a single server, the larger the failure zone. Nobody wants to have to move 64 to 128 TB every time a server fails. The tension is more disk to server ratio drives down costs but explodes the negative impact of server failures. So, if we have a choice of more disks to a given server or, instead, to use a smaller, cheaper server, the conclusion is clear. Smaller wins. This is a wonderful example of where low-power servers are a win.

2.      Workloads with good scaling characteristics and non-significant local resource requirements.  Web workloads that just accept connections and dispatch can run well on these processors. However, we still need to consider the “and non-significant local resource” clause. If the workload scales perfectly but each interaction needs access to very large memories for example, it  may be poor choice for Wimpy nodes.  If the workload scales with CPU and local resources are small, Wimpy nodes are a win.

The first example above is a clear win. The second is more complex. Some examples will be a win but others will not. The better the workload scales and the less fixed resources (disk or memory) required, the bigger the win.

 

Good job by Willis Lang, Jignesh Patel, and Srinanth Shankar in showing us where wimpy nodes lose with detailed analysis.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, May 18, 2010 4:09:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware
 Friday, May 14, 2010

I recently came across a nice data center cooling design by Alan Beresford of EcoCooling Ltd. In this approach, EcoCooling replaces the CRAC units with a combined air mover, damper assembly, and evaporative cooler. I’ve been interested by evaporative coolers and their application to data center cooling for years and they are becoming more common in modern data center deployments (e.g. Data Center Efficiency Summit).

 

An evaporative cooler is a simple device that cools air through taking water through a state change from fluid to vapor. They are incredibly cheap to run and particularly efficient in locals with lower humidity. Evaporative coolers can allow the power intensive process-based cooling to be shut off for large parts of the year. And, when combined with favorable climates or increased data center temperatures can entirely replace air conditioning systems. See Chillerlesss Datacenter at 95F, for a deeper discussion see Costs of Higher Temperature Data Centers, and for a discussion on server design impacts: Next Point of Server Differentiation: Efficiency at Very High Temperature.

 

In the EcoCooling solution, they take air from the hot aisle and release it outside the building. Air from outside the building is passed through an evaporative cooler and then delivered to the cold aisle. For days too cold outside for direct delivery to the datacenter, outside air is mixed with exhaust air to achieve the desired inlet temperature.

 

 

This is a nice clean approach to substantially reducing air conditioning hours. For more information see: Energy Efficient Data Center Cooling or the EcoCooling web site: http://www.ecocooling.co.uk/.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Friday, May 14, 2010 7:06:00 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware
 Monday, May 10, 2010

Wide area network costs and bandwidth shortage are the most common reasons why many enterprise applications run in a single data center. Single data center failure modes are common. There are many external threats to single data center deployments including utility power loss, tornado strikes, facility fire, network connectivity loss,  earthquake, break in, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot.  All it takes is one well intentioned networking engineer to black hole the entire facilities networking traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or cascading failure modes.  And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work.  Even with incredibly redundancy, just having the redundant components in the same room, means that a catastrophic failure of one system, could possibly eliminate the second. It’s very hard to engineer redundancy with high independence and physical separate of all components in a single datacenter.

 

With incredible redundancy, comes incredible cost. Even with incredible costs, failure modes remain that can eliminate the facility entirely. The only cost effective solution is to run redundantly across multiple data centers.  Redundancy without physical separation is not sufficient and making a single facility bullet proof has expenses asymptotically heading towards infinity with only tiny increases in availability as the expense goes up. The only way to get the next nine is have redundancy between two data centers. This approach is both more available and considerably more cost effective.

 

Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are 3 main blocker for the customer I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and costs.

 

The scale problem is, stated simply, most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a collocation facility.  Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability.  Cloud service providers like Amazon Web Services, run 10s of data centers. You can leverage the AWS scale economics to allow even very low scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each datacenter is what AWS calls an Availability Zone. Assuming the scale economics allow it, the second blocker to cross data center replication is the availability of WAN bandwidth and its cost.

 

There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can be run. This limitation is real but won’t prevent redundancy data centers from getting far “enough” away to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and costs is the prime reason why most customers don’t run geo-diverse.

 

I’ve argued that latency need not be a blocker. So, if the application has the scale to be able to be run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithsm and appliances. These are systems that do compression between branch offices and central enterprise IT centers. Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data type specific optimizations, 2) dedupe, and 3) block-based compression.

 

Data type specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocol, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression since they are the easiest to apply to cross data center traffic replication traffic.

 

Broadly, dedupe breaks the data to be transferred between datacenters into either fixed or variable sized blocks. Variable blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote datacenter, just send the hash signature. If that block is already in the remote system block index, then it or its clone has already been sent sometime in the past and nothing need to be sent now. In employing this technique we are exploiting data redundancy at a course scale. We are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and inherent redundancy in the data. But, done right, the compression ratios can be phenomenal with 30 to 50:1 not being uncommon. This, by the way, is the same basic technology being applied in storage deduplications by companies like Data Domain.

 

If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second level compression technique. Usually a block-oriented compression algorithm and frequently some variant of LZ. The combination of dedupe and block compression is very effective. But, the system I’ve described above introduces latency. And, for highly latency sensitive workloads like EMC SRDF, this can be a problem. Many latency sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression. 

 

Last week I ran across a company targeting latency sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to get the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up three ways: 1) hardware implementation based upon FPGA, 2) fixed-sized, full packet block size,  3) bounded index exploiting locality, and 4) heuristic signatures.

 

The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost and power effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration and, for an application view,  High Scale Network Research.

 

The second technique is another good one. Rather than spend time computing block boundaries, just use the network packet as the block boundary. Essentially they are using the networking system to find the block boundaries. This has the downside of not being as effective as variable sized block systems and they don’t exploit type specific knowledge but they can run very fast at low overhead and close to the higher compression rates yielded by these more computationally intensive techniques.  They are exploiting the fact that 20% of the work produces 80% of the gain.

 

The third technique helps reduce the index size. Rather than having a full index of all blocks that have even been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. This smaller index is much less resource intensive requiring much less memory and no disk accesses. Avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality.  Redundant data packets often show up near each other. Clearly this is not always the case and they won’t get the maximum possible compression but they claim to get most of the compression possible in full block index systems without the latency penalty of a disk access and without less memory overhead.

 

The final technique wasn’t described in enough detail for me to fully understand it.  What Infineta is doing is avoiding the cost of fully hashing each packet but taking an approximate signature of carefully chosen packet offsets. Clearly you can take a fast signature on less than the full packet and this signature can be used to know that the packet is not in the index on the other side. But, if the fast hash is present, it doesn’t prove the packet has already been sent. Two different packets can have the same fast hash. Infineta were a bit cagey on this point but what they might be doing is using the very fast approx has to find those that have not yet been sent unambiguously. Using this technique, a fast hash can be used to find those packets that absolutely need to be sent so we can start to compress and send those and waste no more resources on hashing. For those that may not need to be sent, take a full signature and check to see if it is on the remote site. If my guess is correct, the fast hash is being used to avoid spending resources quickly on packets that are not in the index on the other side.

 

Infineta looks like an interesting solution.  More data on them at:

·         Press release: http://www.infineta.com/news/news_releases/press_release:5585,15851,445

·         Web site: http://www.infineta.com

·         Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding http://tinyurl.com/2g4zhbc

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 10, 2010 12:07:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Hardware
 Friday, April 30, 2010

Rich Miller of Datacenter Knowledge covered this last week and it caught my interest. I’m super interested in modular data centers (Architecture for Modular Datacenters) and highly efficient infrastructure (Data Center Efficiency Best Practices) so the Yahoo! Computing Coop caught my interest.

 

As much as I like the cost, strength, and availability of ISO standard shipping containers, 8’ is an inconvenient width. It’s not quite wide enough for two rows of standard racks and there are cost and design advantages in having at least two rows in a container. With two rows, air can be pulled in each side with a single hot aisle in the middle with large central exhaust fans. Its an attractive design point and there is nothing magical about shipping containers. What we want is commodity, prefab, and a moderate increment of growth.

 

The Yahoo design is a nice one. They are using a shell borrowed from a Tyson foods design. Tyson is the grower of a large part of the North American chicken production.  These prefab facilities are essentially giant air handlers with the shell making up a good part of the mechanical plant. They pull air in either side of the building, it passes through two rows of servers into the center of the building. The roof slopes to the center from both side with central exhaust fans. Each unit is 120’ x 60’ and houses 3.6 MW of critical load.

 

Because of the module width they have 4 rows of servers. It’s not clear if the air from outside has to pass through both rows to get the central hot aisle but it sounds like that is the approach. Generally serial cooling where the hot air from one set of servers is routed through another is worth avoiding. It certainly can work but requires more air flow than single pass cooling using the same approach temperature.

 

Yahoo! believes they will be able to bring a new building online in 6 months at a cost of $5M per megawatt. In the Buffalo New York location, they expect to only use process-based cooling 212 hours/year and have close to zero water consumption when the air conditioning is not in use. See the Data Center Knowledge article for more detail: Yahoo Computing Coop: Shape of Things to Come?

More pictures at: A Closer Look at Yahoo’s New Data Center. Nice design Yahoo.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, April 30, 2010 9:44:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Thursday, April 29, 2010

Facebook released Flashcache yesterday: Releasing Flashcache. The authors of Flashcache, Paul Saab and Mohan Srinivasan, describe it as “a simple write back persistent block cache designed to accelerate reads and writes from slower rotational media by caching data in SSD's.”

 

There are commercial variants of flash-based write caches available as well. For example, LSI has a caching controller that operates at the logical volume layer. See LSI and Seagate take on Fusion-IO with Flash. The way these systems work is, for a given logical volume, page access rates are tracked.  Hot pages are stored on SSD while cold pages reside back on spinning media. The cache is write-back and pages are written back to their disk resident locations in the background.

 

For benchmark workloads with evenly distributed, 100% random access patterns, these solutions don’t contribute all that much. Fortunately, the world is full of data access pattern skew and some portions of the data are typically very cold while others are red hot. 100% even distributions really only show up in benchmarks – most workloads have some access pattern skew. And, for those with skew, a flash cache can substantially reduce disk I/O rates at lower cost than adding more memory.

 

What’s interesting about the Facebook contribution is that its open source and supports Linux.  From: http://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt:

 

Flashcache is a write back block cache Linux kernel module. [..]Flashcache is built using the Linux Device Mapper (DM), part of the Linux Storage Stack infrastructure that facilitates building SW-RAID and other components. LVM, for example, is built using the DM.

 

The cache is structured as a set associative hash, where the cache is divided up into a number of fixed size sets (buckets) with linear probing within a set to find blocks. The set associative hash has a number of advantages (called out in sections below) and works very well in practice.

 

The block size, set size and cache size are configurable parameters, specified at cache creation. The default set size is 512 (blocks) and there is little reason to change this.

 

More information on usage: http://github.com/facebook/flashcache/blob/master/doc/flashcache-sa-guide.txt.  Thanks to Grant McAlister for pointing me to the Facebook release of Flashcache. Nice work Paul and Mohan.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, April 29, 2010 6:36:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Friday, April 09, 2010

High scale network research is hard.  Running a workload over a couple of hundred servers says little of how it will run over thousands or tens of thousands servers. But, having 10’s of thousands of nodes dedicated to a test cluster is unaffordable. For systems research the answer is easy: use Amazon EC2. It’s an ideal cloud computing application. Huge scale is needed during some parts of the research project but the servers aren’t needed 24 hours/day and certainly won’t be needed for the three year amortization life of the servers.

 

However, for high-scale network research, finding the solution is considerably more difficult. In some dimensions, it’s no different from systems research in that purchasing a few thousand servers for the research projects makes no sense. But the easy answer of simply using EC2 doesn’t work in that EC2 nodes come fully provisioned with networking.  One solution that works well for many networking research problems is to use an overlay and test at scale in EC2. But, when new hardware devices are being investigated, unless they can be emulated with high fidelity using with software implementations running on EC2, this solution breaks down. 

 

For all but a few folks at Cisco and Juniper, running a multi-thousand node physical cluster to test new network gear is impractical. And it’s even less practical in academic settings. I’m lucky enough to work near many thousands of server nodes and a huge networking infrastructure. But, even then, installing a parallel network to do network research is difficult to afford. High-scale network research at credible scale is difficult.

 

Zhangxi Tan of Cal Berkeley came up to visit a couple of weeks back. I’m interested in Zhangxi’s work for two reasons: 1) its based upon reconfigurable computing -- a technology ready for commercial application and 2) the application of FPGA to network simulation might be a solution to the problem of how to test networking gear at credible scale.

 

Reconfigurable computing maintains the flexibility of reprogrammable software systems with the performance of high performance hardware implementations.  Or, worded differently, most of the performance of Application Specific Integrated Circuits (ASIC) with the flexibility of software. Most reconfigurable computing designs are based upon Field Programmable Gate Arrays (FPGA) and some high level instruction set or programming language to allow device reconfiguration.  Recently, C and C++ subset compilers have emerged that allow a constrained version C or C++ to be compiled directly to a FPGA and, once the software is stable, directly to an ASIC. See Platform-based Electronic Systems Level (ESL) Synthesis for more on reconfigurable computing and see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration.

 

In the work that Zhangxi presented, the Cal Berkeley team is taking the RAMP gold FPGA-based many-core simulator (Research Accelerator for Multiple Processors) and applying it to the problem of high-scale network simulation with a goal of simulating an O(10k) server network. Zhangxi’s slides are here: Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale and my rough notes follow:

·         Lots of work going on in data center network research: VL2, Dcell, PortLand,…

·         But:

o   the test scale is usually WAY smaller than the problem targeted by these systems

o   Often synthetic benchmarks are used rather than actual workloads

·         RAMP Gold is:

o   Full 32-bit SPARC v8 ISA support, including FP, traps and MMU.

o   Use abstract models with enough detail, but fast enough to run real apps/OS

o   Provide cycle-level accuracy

o   Cost-efficient: hundreds of nodes plus switches on a single FPGA

·         RAMP Gold implementation:

o   Based upon Xilinx XUP V5 board ($750)

o   Able to simulate 64 core, 2GB DDR2, FP and run production Linix

·         Tested using trace data from Facebook and Yahoo Hadoop runs

·         Demonstrating the incast TCP collapse problem and showed simulated results that closely matched actual measured results

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, April 09, 2010 6:57:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Thursday, January 07, 2010

There is a growing gap between memory bandwidth and CPU power and this growing gap makes low power servers both more practical and more efficient than current designs.   Per-socket processor performance continues to increase much more rapidly than memory bandwidth and this trend applies across the application spectrum from mobile devices, through client, to servers. Essentially we are getting more compute than we have memory bandwidth to feed. 

 

We can attempt to address this problem two ways: 1) more memory bandwidth and 2) less fast processors. The former solution will be used and Intel Nehalem is a good example of this but costs increase non-linearly so the effectiveness of this technique will be  bounded. The second technique has great promise to reduce both cost and power consumption.

 

For more detail on this trend:

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the MicroSlice Servers

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

 

This morning GigOm reported that SeaMicro has just obtained a $9.3M Department of Energy grant to improve data center efficiency (SeaMicro’s Secret Server Changes Computing Economics).  SeaMicro is a Santa Clara based start-up that is building a 512 processor server based upon Intel Atom. Also mentioned was Smooth Stone who is designing a high-scale server based upon ARM processors. ARMs processors are incredibly power efficient, commonly used in embedded devices and by far the most common processor used in cell phones.

 

Over the past year I’ve met with both Smooth Stone and SeaMicro frequently and it’s great to see more information about both available broadly. The very low power server trend is real and advancing quickly. When purchasing servers, it needs to be all about work done per dollar and work done per joule

 

Congratulations to SeaMicro on the DoE grant.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, January 07, 2010 10:38:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Hardware

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<September 2010>
SunMonTueWedThuFriSat
2930311234
567891011
12131415161718
19202122232425
262728293012
3456789

Categories
This Blog
Member Login
All Content © 2010, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton