Monday, May 10, 2010

Wide area network costs and bandwidth shortage are the most common reasons why many enterprise applications run in a single data center. Single data center failure modes are common. There are many external threats to single data center deployments including utility power loss, tornado strikes, facility fire, network connectivity loss,  earthquake, break in, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot.  All it takes is one well intentioned networking engineer to black hole the entire facilities networking traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or cascading failure modes.  And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work.  Even with incredibly redundancy, just having the redundant components in the same room, means that a catastrophic failure of one system, could possibly eliminate the second. It’s very hard to engineer redundancy with high independence and physical separate of all components in a single datacenter.


With incredible redundancy, comes incredible cost. Even with incredible costs, failure modes remain that can eliminate the facility entirely. The only cost effective solution is to run redundantly across multiple data centers.  Redundancy without physical separation is not sufficient and making a single facility bullet proof has expenses asymptotically heading towards infinity with only tiny increases in availability as the expense goes up. The only way to get the next nine is have redundancy between two data centers. This approach is both more available and considerably more cost effective.


Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are 3 main blocker for the customer I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and costs.


The scale problem is, stated simply, most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a collocation facility.  Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability.  Cloud service providers like Amazon Web Services, run 10s of data centers. You can leverage the AWS scale economics to allow even very low scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each datacenter is what AWS calls an Availability Zone. Assuming the scale economics allow it, the second blocker to cross data center replication is the availability of WAN bandwidth and its cost.


There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can be run. This limitation is real but won’t prevent redundancy data centers from getting far “enough” away to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and costs is the prime reason why most customers don’t run geo-diverse.


I’ve argued that latency need not be a blocker. So, if the application has the scale to be able to be run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithsm and appliances. These are systems that do compression between branch offices and central enterprise IT centers. Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data type specific optimizations, 2) dedupe, and 3) block-based compression.


Data type specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocol, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression since they are the easiest to apply to cross data center traffic replication traffic.


Broadly, dedupe breaks the data to be transferred between datacenters into either fixed or variable sized blocks. Variable blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote datacenter, just send the hash signature. If that block is already in the remote system block index, then it or its clone has already been sent sometime in the past and nothing need to be sent now. In employing this technique we are exploiting data redundancy at a course scale. We are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and inherent redundancy in the data. But, done right, the compression ratios can be phenomenal with 30 to 50:1 not being uncommon. This, by the way, is the same basic technology being applied in storage deduplications by companies like Data Domain.


If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second level compression technique. Usually a block-oriented compression algorithm and frequently some variant of LZ. The combination of dedupe and block compression is very effective. But, the system I’ve described above introduces latency. And, for highly latency sensitive workloads like EMC SRDF, this can be a problem. Many latency sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression. 


Last week I ran across a company targeting latency sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to get the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up three ways: 1) hardware implementation based upon FPGA, 2) fixed-sized, full packet block size,  3) bounded index exploiting locality, and 4) heuristic signatures.


The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost and power effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration and, for an application view,  High Scale Network Research.


The second technique is another good one. Rather than spend time computing block boundaries, just use the network packet as the block boundary. Essentially they are using the networking system to find the block boundaries. This has the downside of not being as effective as variable sized block systems and they don’t exploit type specific knowledge but they can run very fast at low overhead and close to the higher compression rates yielded by these more computationally intensive techniques.  They are exploiting the fact that 20% of the work produces 80% of the gain.


The third technique helps reduce the index size. Rather than having a full index of all blocks that have even been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. This smaller index is much less resource intensive requiring much less memory and no disk accesses. Avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality.  Redundant data packets often show up near each other. Clearly this is not always the case and they won’t get the maximum possible compression but they claim to get most of the compression possible in full block index systems without the latency penalty of a disk access and without less memory overhead.


The final technique wasn’t described in enough detail for me to fully understand it.  What Infineta is doing is avoiding the cost of fully hashing each packet but taking an approximate signature of carefully chosen packet offsets. Clearly you can take a fast signature on less than the full packet and this signature can be used to know that the packet is not in the index on the other side. But, if the fast hash is present, it doesn’t prove the packet has already been sent. Two different packets can have the same fast hash. Infineta were a bit cagey on this point but what they might be doing is using the very fast approx has to find those that have not yet been sent unambiguously. Using this technique, a fast hash can be used to find those packets that absolutely need to be sent so we can start to compress and send those and waste no more resources on hashing. For those that may not need to be sent, take a full signature and check to see if it is on the remote site. If my guess is correct, the fast hash is being used to avoid spending resources quickly on packets that are not in the index on the other side.


Infineta looks like an interesting solution.  More data on them at:

·         Press release:,15851,445

·         Web site:

·         Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding


James Hamilton



b: /


Monday, May 10, 2010 12:07:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
 Thursday, May 06, 2010

Earlier this week Clustrix announced a MySQL compatible, scalable database appliance that caught my interest. Key features supported by Clustrix:

·         MySQL protocol emulation (MySQL protocol supported so MySQL apps written to the MySQL client libraries just work)

·         Hardware appliance delivery package in a 1U package including both NVRAM and disk

·         Infiniband interconnect

·         Shared nothing, distributed database

·         Online operations including alter table add column


I like the idea of adopting a MySQL programming model. But, it’s incredibly hard to be really MySQL compatible unless each node is actually based upon the MySQL execution engine. And it’s usually the case that a shared nothing, clustered DB will bring some programming model constraints. For example, if global secondary indexes aren’t implemented, it’s hard to support uniqueness constraints on non-partition key columns and it’s hard to enforce referential integrity. Global secondary indexes maintenance implies a single insert, update, or delete that would normally only require a single node change would require atomic updates across many nodes in the cluster making updates more expensive and susceptible to more failure modes. Essentially, making a cluster look exactly the same as a single very large machine with all the same characteristics isn’t possible. But, many jobs that can’t be done perfectly are still well worth doing. If Clustrix delivers all they are describing, it should be successful.


I also like the idea of delivering the product as a hardware appliance. It keep the support model simple, reduces install and initial setup complexity, and enables application specific hardware optimizations.


Using Infiniband as a cluster interconnect is a nice choice as well. I believe that 10GigE with RDMA support will provide better price performance than Infiniband but commodity 10GigE volumes and quality RDMA support is still 18 to 24 months away so Inifiband is a good choice for today.


Going with a shared nothing architecture avoids dependence on expensive shared storage area networks and the scaling bottleneck of distributed lock managers.  Each node in the cluster is an independent database engine with its own physical (local) metadata, storage engine, lock manager, buffer manager, etc. Each node has full control of the table partitions that reside on that node. Any access to those partitions must go through that node. Essentially, bringing the query to the data rather than the data to the query. This is almost always the right answer and it scales beautifully.  

In operation, a client connects to one of the nodes in the cluster and submits a SQL statement. The statement is parsed and compiled. During compilation, the cluster-wide (logical) metadata is accessed as needed and an execution plan is produced. The cluster-wide (logical) metadata is either replicated to all nodes or stored centrally with local caching. The execution plan produced by the query compilation will be run on as many nodes as needed with the constraint that table or index access be on the nodes that house those table or index partitions. Operators higher in the execution plan can run on any node in the cluster.  Rows flow between operators that span node boundaries over the infiniband network.  The root of the query plan runs on the node where the query was started and the results are returned to client program using the MySQL client protocol


As described, this is a very big engineering project. I’ve worked on teams that have taken exactly this approach and they took several years to get to the first release and even subsequent releases had programming model constraints. I don’t know how far along Clustrix is a this point but I like the approach and I’m looking forward to learning more about their offering.


White paper: Clustrix: A New Approach

Press Release: Clustrix Emerges from Stealth Mode with Industry’s First Clustered DB



James Hamilton



b: /


Thursday, May 06, 2010 9:49:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
 Tuesday, May 04, 2010

 Dave Patterson did a keynote at Cloud Futures 2010.  I wasn’t able to attend but I’ve heard it was a great talk so I asked Dave to send the slides my way. He presented Cloud Computing and the Reliable Adaptive Systems Lab.


The Berkeley RAD Lab principal investigators include: Armando FoxRandy Katz & Dave Patterson (systems/networks), Michael Jordan (machine learning), Ion Stoica (networks & P2P), Anthony Joseph (systems/security), Michael Franklin (databases), and Scott Shenker (networks) in addition to 30 Phd students, 10 undergrads, and 2 postdocs.


The talk starts by arguing that cloud computing actually is a new approach drawing material from the Above the Clouds paper that I mentioned early last year: Berkeley Above the Clouds. Then walked through why pay-as-you-go computing with small granule time increments allow SLAs to be hit without stranding valuable resources.




The slides are up at: Cloud Computing and the RAD Lab and if you want to read more about the RAD lab: If you haven’t already read it, this is worth reading: Above the Clouds: A Berkeley View.




James Hamilton



b: /


Tuesday, May 04, 2010 5:48:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, April 30, 2010

Rich Miller of Datacenter Knowledge covered this last week and it caught my interest. I’m super interested in modular data centers (Architecture for Modular Datacenters) and highly efficient infrastructure (Data Center Efficiency Best Practices) so the Yahoo! Computing Coop caught my interest.


As much as I like the cost, strength, and availability of ISO standard shipping containers, 8’ is an inconvenient width. It’s not quite wide enough for two rows of standard racks and there are cost and design advantages in having at least two rows in a container. With two rows, air can be pulled in each side with a single hot aisle in the middle with large central exhaust fans. Its an attractive design point and there is nothing magical about shipping containers. What we want is commodity, prefab, and a moderate increment of growth.


The Yahoo design is a nice one. They are using a shell borrowed from a Tyson foods design. Tyson is the grower of a large part of the North American chicken production.  These prefab facilities are essentially giant air handlers with the shell making up a good part of the mechanical plant. They pull air in either side of the building, it passes through two rows of servers into the center of the building. The roof slopes to the center from both side with central exhaust fans. Each unit is 120’ x 60’ and houses 3.6 MW of critical load.


Because of the module width they have 4 rows of servers. It’s not clear if the air from outside has to pass through both rows to get the central hot aisle but it sounds like that is the approach. Generally serial cooling where the hot air from one set of servers is routed through another is worth avoiding. It certainly can work but requires more air flow than single pass cooling using the same approach temperature.


Yahoo! believes they will be able to bring a new building online in 6 months at a cost of $5M per megawatt. In the Buffalo New York location, they expect to only use process-based cooling 212 hours/year and have close to zero water consumption when the air conditioning is not in use. See the Data Center Knowledge article for more detail: Yahoo Computing Coop: Shape of Things to Come?

More pictures at: A Closer Look at Yahoo’s New Data Center. Nice design Yahoo.




James Hamilton



b: /


Friday, April 30, 2010 9:44:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, April 29, 2010

Facebook released Flashcache yesterday: Releasing Flashcache. The authors of Flashcache, Paul Saab and Mohan Srinivasan, describe it as “a simple write back persistent block cache designed to accelerate reads and writes from slower rotational media by caching data in SSD's.”


There are commercial variants of flash-based write caches available as well. For example, LSI has a caching controller that operates at the logical volume layer. See LSI and Seagate take on Fusion-IO with Flash. The way these systems work is, for a given logical volume, page access rates are tracked.  Hot pages are stored on SSD while cold pages reside back on spinning media. The cache is write-back and pages are written back to their disk resident locations in the background.


For benchmark workloads with evenly distributed, 100% random access patterns, these solutions don’t contribute all that much. Fortunately, the world is full of data access pattern skew and some portions of the data are typically very cold while others are red hot. 100% even distributions really only show up in benchmarks – most workloads have some access pattern skew. And, for those with skew, a flash cache can substantially reduce disk I/O rates at lower cost than adding more memory.


What’s interesting about the Facebook contribution is that its open source and supports Linux.  From:


Flashcache is a write back block cache Linux kernel module. [..]Flashcache is built using the Linux Device Mapper (DM), part of the Linux Storage Stack infrastructure that facilitates building SW-RAID and other components. LVM, for example, is built using the DM.


The cache is structured as a set associative hash, where the cache is divided up into a number of fixed size sets (buckets) with linear probing within a set to find blocks. The set associative hash has a number of advantages (called out in sections below) and works very well in practice.


The block size, set size and cache size are configurable parameters, specified at cache creation. The default set size is 512 (blocks) and there is little reason to change this.


More information on usage:  Thanks to Grant McAlister for pointing me to the Facebook release of Flashcache. Nice work Paul and Mohan.




James Hamilton



b: /


Thursday, April 29, 2010 6:36:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Wednesday, April 21, 2010

There have been times in past years when it really looked like we our industry was on track to supporting only a single relevant web browser. Clearly that’s not the case today.  In a discussion with a co-working today on the importance of “other” browsers, I wanted to put some data on the table so I looked up the browser stats for this web site (  I hadn’t looked for a while and found the distribution  truly interesting:


Admittedly, those that visit this site clearly don’t represent the broader population well.  Nonetheless, the numbers are super interesting.  Firefox eclipsing Internet Explorer and by such a wide margin was surprising to me. You can’t see it in the data above but the IE share continues to decline.  Chrome is already up to 17%.


Looking at the share data posted on Wikipedia ( and using the Net Market Share data) we see that IE has declined from over 91.4% to  61.4% in just 5 years. Again a surprisingly rapid change.


Focusing on client operating systems, from the skewed sample that accesses this site, we see several interesting trends: 1) Mac share continues to climb sharply at 16.6%, 2) Linux at 9%, 3) iphone, ipod and ipad in aggregate at over 5 ¼%, and 4) Android already over a ¼%.


Overall we are seeing more browser diversity,  more O/S diversity, and unsurprisingly, more mobile devices.




James Hamilton



b: /


Wednesday, April 21, 2010 12:25:25 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Saturday, April 17, 2010

We live on a boat which has lots of upside but broadband connectivity isn’t one of them. As it turns out, our marina has WiFi but it is sufficiently unreliable that we needed another solution. I wish there was a Starbucks hotspot across the street – actually there is one within a block but we can’t quite pick up the signal even with an external antennae (Syrens). 


WiFi would have been a nice solution but didn’t work so we decided to go with WiMAX. We have used ClearWire for over a year on the boat and, generally, it has worked acceptably well. Not nearly as fast as WiFi but better than 3G cellular.  Recently ClearWire changed its name to Clear and “upgraded” the connectivity technology to full WiMAX. Unfortunately, the upgrade substantially reduced the coverage area, has been fairly unstable, and the Customer support although courteous and friendly is so far away from the engineering team that they basically just can’t make a difference no matter how hard they try.


We decided we had to find a different solution. I use AT&T 3G cellular with tethering and would have been fine with that as a solution. It’s a bit slower than Clear but its stable and coverage is very broad. Unfortunately, the “unlimited” plan we got some years ago is very limited to 5Gig/month and we move far more data than that. I can’t talk AT&T into offering a solution so, again, we needed something else.


Sprint now has a WiMAX service that offers good performance (although they can be a bit aggressive on throttling) and they have fairly broad coverage in our area and are expanding quickly (Sprint announces seven new WiMAX markets). Sprint has the additional nice feature on some modems where, if WiMAX is unavailable, it transparently falls back to 3G. The 3G service is still limited to 5Gig but, as long as we are on WiMAX a substantial portion of the month, we’re fine.


The remaining challenge was Virtual Private Networks (VPN) over WiMAX can be unstable. I really wish my work place supported Exchange RPC over HTTP (one of the coolest Outlook/Exchange features of all time). However, many companies believe that Exchange RPC over HTTP is insecure in that it doesn’t’ require 2 factor authentication. Ironically, many of these companies allow Blackberries’ and iPhones to access email without 2 factor auth. I won’t try to explain why one is unsafe and the other is fine but I think it might have something to do with the popularity of iPhones and Blackberries with execs and senior technical folks :-).


In the absence of RPC over HTTP, logging into the work network via VPN is the only answer. My work place uses Aventail but there are a million solutions out there. I’ve used many and love none.  There are many reasons why these systems can be unstable, cause blue screens, and otherwise negatively impact the customer experience. But one that has been driving me especially nuts is frequent dropped connections and hangs when using the VPN over WiMAX. It appears to happen more frequently when there is more data in flight but to lose a connection every few minutes is quite common. 


It turns out the problem is the default MTU on most client systems is 1500 but the WiMAX default is often smaller. It should still work and just be super inefficient but it doesn’t. For more details see


To check Vista MTUs:


netsh interface ipv4 show subinterfaces


To change the MTU to 1400:


netsh interface ipv4 set subinterface "your vpn interface here" mtu=1400 store=persistent


I’m using an MTU of 1400 with Sprint and its working well. Thanks to for the easy MTU update. If you are having flakey VPN support especially if running over WiMAX, check your MTU.




James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Saturday, April 17, 2010 5:43:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Monday, April 12, 2010

Standards and benchmarks have driven considerable innovation. The most effective metrics are performance-based.  Rather than state how to solve the problem, they say what needs to be achieved and leave the innovation open.


I’m an ex-auto mechanic and was working as a wrench in a Chevrolet dealership in the early 80. I hated the emission controls that were coming into force at that time because they caused the cars to run so badly. A 1980 Chevrolet 305 CID with 4 BBL carburetor would barely idle in perfect tune. It was a mess. But, the emission standards didn’t say it had to run badly only what needed to be achieved. And, competition to achieve those goals produced compliant vehicles that ran well. Ironically, as emission standards forced more precise engine management, both fuel economy and power density has improved as well. Initially both suffered as did drivability but competition brought many innovations to market and we ended up seeing emissions compliance to increasingly strict standards at the same time that both power density and fuel economy improved.


What’s key is the combination of competition and performance-based standards.  If we set high goals and allow companies to innovate in how they achieve those goals, great things happen. We need to take that same lesson and apply it to data centers.


Recently, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) added data centers to their building efficiency standard, ASHRAE Standard 90.1. This standard defines the energy efficiency for most types of buildings in America and is often incorporated into building codes across the country. Unfortunately, as currently worded, this document is using a prescriptive approach. To comply, you must use economizers and other techniques currently in common practice. But, are economizers the best way to achieve the stated goal?  What about a system that harvested waste heat and applied it growing cash crops like Tomatoes? What about systems using heat pumps to scavenge low grade heat (see Data Center Waste Heat Reclaimation)? Both these innovations would be precluded by the proposed spec as they don’t use economizers.


Urs Hoelzle, Google’s Infrastructure SVP, recently posted Setting Efficiency Goals for Data Centers where he argues we need goal-based environmental targets that drive innovation rather than prescriptive standards that prevent it.  Co-signatures with Urs include:

·         Chris Crosby, Senior Vice President, Digital Realty Trust

·         Hossein Fateh, President and Chief Executive Officer, Dupont Fabros Technology

·         James Hamilton, Vice President and Distinguished Engineer, Amazon

·         Urs Hoelzle, Senior Vice President, Operations and Google Fellow, Google

·         Mike Manos, Vice President, Service Operations, Nokia

·         Kevin Timmons, General Manager, Datacenter Services, Microsoft


I thinks we’re all excited by the rapid pace of innovation in high scale data centers. We know its good for the environment and for customers.  And I think we’re all uniformly in agreement with ASHRAE in the intent of 90.1. What’s needed to make it a truly influential and high-quality standard is that it be changed to be performance-based rather than prescriptive. But, otherwise, I think we’re all heading in the same direction.




James Hamilton



b: /


Monday, April 12, 2010 9:39:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Friday, April 09, 2010

High scale network research is hard.  Running a workload over a couple of hundred servers says little of how it will run over thousands or tens of thousands servers. But, having 10’s of thousands of nodes dedicated to a test cluster is unaffordable. For systems research the answer is easy: use Amazon EC2. It’s an ideal cloud computing application. Huge scale is needed during some parts of the research project but the servers aren’t needed 24 hours/day and certainly won’t be needed for the three year amortization life of the servers.


However, for high-scale network research, finding the solution is considerably more difficult. In some dimensions, it’s no different from systems research in that purchasing a few thousand servers for the research projects makes no sense. But the easy answer of simply using EC2 doesn’t work in that EC2 nodes come fully provisioned with networking.  One solution that works well for many networking research problems is to use an overlay and test at scale in EC2. But, when new hardware devices are being investigated, unless they can be emulated with high fidelity using with software implementations running on EC2, this solution breaks down. 


For all but a few folks at Cisco and Juniper, running a multi-thousand node physical cluster to test new network gear is impractical. And it’s even less practical in academic settings. I’m lucky enough to work near many thousands of server nodes and a huge networking infrastructure. But, even then, installing a parallel network to do network research is difficult to afford. High-scale network research at credible scale is difficult.


Zhangxi Tan of Cal Berkeley came up to visit a couple of weeks back. I’m interested in Zhangxi’s work for two reasons: 1) its based upon reconfigurable computing -- a technology ready for commercial application and 2) the application of FPGA to network simulation might be a solution to the problem of how to test networking gear at credible scale.


Reconfigurable computing maintains the flexibility of reprogrammable software systems with the performance of high performance hardware implementations.  Or, worded differently, most of the performance of Application Specific Integrated Circuits (ASIC) with the flexibility of software. Most reconfigurable computing designs are based upon Field Programmable Gate Arrays (FPGA) and some high level instruction set or programming language to allow device reconfiguration.  Recently, C and C++ subset compilers have emerged that allow a constrained version C or C++ to be compiled directly to a FPGA and, once the software is stable, directly to an ASIC. See Platform-based Electronic Systems Level (ESL) Synthesis for more on reconfigurable computing and see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration.


In the work that Zhangxi presented, the Cal Berkeley team is taking the RAMP gold FPGA-based many-core simulator (Research Accelerator for Multiple Processors) and applying it to the problem of high-scale network simulation with a goal of simulating an O(10k) server network. Zhangxi’s slides are here: Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale and my rough notes follow:

·         Lots of work going on in data center network research: VL2, Dcell, PortLand,…

·         But:

o   the test scale is usually WAY smaller than the problem targeted by these systems

o   Often synthetic benchmarks are used rather than actual workloads

·         RAMP Gold is:

o   Full 32-bit SPARC v8 ISA support, including FP, traps and MMU.

o   Use abstract models with enough detail, but fast enough to run real apps/OS

o   Provide cycle-level accuracy

o   Cost-efficient: hundreds of nodes plus switches on a single FPGA

·         RAMP Gold implementation:

o   Based upon Xilinx XUP V5 board ($750)

o   Able to simulate 64 core, 2GB DDR2, FP and run production Linix

·         Tested using trace data from Facebook and Yahoo Hadoop runs

·         Demonstrating the incast TCP collapse problem and showed simulated results that closely matched actual measured results


James Hamilton



b: /


Friday, April 09, 2010 6:57:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, April 07, 2010

Mike Stonebraker published an excellent blog posting yesterday at the CACM site: Errors in Database Systems, Eventual Consistency, and the CAP Theorem. In this article, Mike challenges the application of Eric Brewer’s CAP Theorem by the NoSQL database community. Many of the high-scale NoSQL system implementers have argued that the CAP theorem forces them to go with an eventual consistent model.


Mike challenges this assertion pointing that some common database errors are not avoided by eventual consistency and CAP really doesn’t apply in these cases. If you have an application error, administrative error, or database implementation bug that losses data, then it is simply gone unless you have an offline copy. This, by the way, is why I’m a big fan of deferred delete.  This is a technique where deleted items are marked as deleted but not garbage collected until some days or preferably weeks later.  Deferred delete is not full protection but it has saves my butt more than once and I’m a believer. See On Designing and Deploying Internet-Scale Services for more detail.


CAP and the application of eventual consistency doesn’t directly protect us against application or database implementation errors. And, in the case of a large scale disaster where the cluster is lost entirely, again, neither eventual consistency nor CAP offer a solution. Mike also notes that network partitions are fairly rare.  I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts, remain a popular discussion point in post mortems industry-wide. I see this improving over the next 5 years but we have a long way to go. In Networking: the Last Bastion of Mainframe Computing, I argue that net gear is still operating on the mainframe business model: large, vertically integrated and expensive equipment, deployed in pairs. When it comes to redundancy at scale, 2 is a poor choice.


Mike’s article questions whether eventual consistency is really the right answer for these workloads. I made some similar points in “I love eventual consistency but…” In that posting, I argued that many applications are much easier to implement with full consistency and full consistency can be practically implemented at high scale. In fact, Amazon SimpleDB recently announced support for full consistency. Apps needed full consistency are now easier to write and, where only eventual consistency is needed, its available as well.


Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.




Thanks to Deepak Singh for pointing me to this article.


James Hamilton



b: /


Wednesday, April 07, 2010 11:58:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
 Tuesday, March 23, 2010

Every so often, I come across a paper that just nails it and this one is pretty good.. Using a market Economy to Provision Resources across a Planet-wide Clusters doesn’t fully investigate the space but it’s great to progress on this important area and the paper is a strong step in the right direction.


I spend much of my time working on driving down infrastructure costs. There is lots of great work that can be done in datacenter infrastructure, networking, and server design. It’s both a fun and important area. But, an even bigger issue is utilization. As an industry, we can and are driving down the cost of computing and yet it remains true that most computing resources never get used. Utilization levels at large and small companies typically run in the 10 to 20% range. I occasionally hear reference to 30% but it’s hard to get data to support it. Most compute cycles go wasted. Most datacenter power doesn’t get useful work done. Most datacenter cooling is not spent supporting productive work. Utilization is a big problem. Driving down the cost of computing certainly helps but it doesn’t address the core issue: low utilization.


That’s one of the reasons I work at a cloud computing provider.  When you have very large, very diverse workloads, wonderful things happen.  Workload peaks are not highly correlated. For example, tax preparation software is busy around tax time. Retail software towards the end of the year. Social networking while folks in the region are awake. All these peaks and valleys overlay to produce a much flatter peak to trough curve. As the peak to trough ratio decreases, utilization sky rockets.  You can only get these massively diverse workloads in public clouds and its one of the reasons why private clouds are a bit depressing (see Private Clouds are not the Future). Private clouds are so close to the right destination and yet that last turn was a wrong one and the potential gains won’t be achieved. I hate wasted work as much as I hate low utilization.


The techniques above smooth the aggregated utilization curve and, the flatter that curve gets, the higher the utilization, the lower the cost, and are better it is for the environment. Large public clouds get this curve flattened the workload peaks considerably but the goal of steady unchanging load 24 hours a day, 7 days a week isn’t achievable.  Even power companies have base load and peak load. What to do with the remaining utilization valleys? The next technique is to use a market economy to incent developers and users to use resources that aren’t currently fully utilized.  


In The Cost of A Cloud: Research Problems in Datacenter Networks, we argued that turning servers off is a mistake in that the most you can hope to achieve is to save the cost of the power which is tiny when compared to the cost of the servers, the cost of power distribution gear, and the cost of the mechanical systems. See the Cost of Power in Large-Scale Datacenters (I’ve got an update of this work coming – the changes are interesting but the cost of power remains the minority cost). Rather than shutting off servers, the current darling idea of the industry, we should be productively using the servers. If we can run any workload worth more than the marginal cost of power, we should. Again, a strong argument for public clouds with large pools of resources on which a market can be made.


Continuing with making a market and offering computing resources not under supply crunch (under-utilized) at lower costs, Amazon Web Services has a super interesting offering called spot instances. Spot instances allow customers to bid on unused EC2 capacity and allow them to run those instances as long as their bids exceed the current instance spot price.


The paper I mentioned above is heading in a similar direction but this time working on the Google MapReduce cluster utilization problem.  Technically the paper actually is working on a private cloud but its still nice work and it is using the biggest private cloud in the world at well over a million servers so I can’t complain too much. I really like the paper. From the conclusion:


In this paper, we have thus proposed a framework for allocating and pricing resources in a grid-like environment. This framework employs a market economy with prices adjusted in periodic clock auctions. We have implemented a pilot allocation system within Google based on these ideas. Our preliminary experiments have resulted in significant improvements in overall utilization; users were induced to make their services more mobile, to make disk/memory/network tradeoffs as appropriate in different clusters, and to fully utilize each resource dimension, among other desirable outcomes. In addition, these auctions have resulted in clear price signals, information that the company and its engineering teams can take advantage of for more efficient future provisioning.


It’s worth reading the full paper: One of the authors, Murray Stokely of Google, also wrote an interesting blog entry Fun with Amazon Web Services where he developed many of the arguments above. Thanks to Greg Linden and Deepak Singh for pointing me to this paper. It made for a good read this morning and I hope you enjoy it as well.




James Hamilton



b: /


Tuesday, March 23, 2010 6:13:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
 Wednesday, February 24, 2010

I love eventual consistency but there are some applications that are much easier to implement with strong consistency. Many like eventual consistency because it allows us to scale-out nearly without bound but it does come with a cost in programming model complexity. For example, assume your program needs to assign work order numbers uniquely and without gaps in the sequence.  Eventual consistency makes this type of application difficult to write.


Applications built upon eventually consistent stores have to be prepared to deal with update anomalies like lost updates. For example, assume there is an update at time T1 where a given attribute is set to 2. Later, at time T2, the same attribute is set to a value of 3. What will the value of this attribute be at a subsequent time T3?  Unfortunately, the answer is we’re not sure. If T1 and T2 are well separated in time, it will almost certainly be 3. But it might be 2. And it is conceivable that it could be some value other than 2 or 3 even if there have been no subsequent updates. Coding to eventual consistency is not the easiest thing in the world. For many applications its fine and, with care, most applications can be written correctly on an eventually consistent model. But it is often more difficult.


What I’ve learned over the years is that strong consistency, if done well, can scale to very high levels. The trick is implementing it well. The naïve approach to achieve full consistency is to route all updates through a single master server but clearly this won’t scale. Instead divide the update space into a large number of partitions, each with its own master. That scales but there is still a tension between the number of partitions and the cost of maintaining many partitions and avoiding hot spots.  The obvious way to avoid hot spots is to use a large number of partitions but this increases partition management overhead.  The right answer is to be able to dynamically repartition to maintain a sufficient number of partitions and to be able to adapt to load increases on any single server by further spreading the update load.


There are many approaches to support dynamic hot sport management. One is to divide the workload into 10 to 100x more partitions than expected servers and make these fixed-sized partitions be the unit of migration. Servers with hot partitions will end up serving less partitions while servers with cold partitions will manage more. The other class of approaches, is to dynamically repartition. Start with large partitions and divide hot partitions to multiple smaller partitions to spread the load over multiple servers.


There are many variants of these techniques with different advantages and disadvantages. The constant is that full consistency is more affordable than many think. Clearly, eventual consistency remains a very good thing for workloads that don’t need full consistency and for workloads where the overhead of the above techniques is determined to be unaffordable. Both higher consistency models are quite useful.


This morning SimpleDB announced support for two new features that make it much easier to write many classes of applications: 1) consistent Reads, 2) Conditional put and delete.  Consistent reads allows applications that need full consistency to be easily written against SimpleDB. So, for example, if you wanted to implement an inventory management system that didn’t lose parts in the warehouse, doesn’t sell components twice, or place multiple orders, it would now be trivial to write this application against SimpleDB using the consistent read support. Consistent read is implemented as an optional Boolean flag on SimpleDB GetAttributes or select statements. Absence of the flag continues to deliver the familiar eventually consistent behavior with which many of you are very familiar with. If the flag is present and set, you get strong consistency. 


SimpleDB conditional PutAttributes and DeleteAttributes are a related feature that makes it much easier to write applications where the new value of an attribute are functionally related to the old value. Conditional update support allows a programmer to read the value of an attribute, operate upon it, and then write it back only if the value hasn’t changed in the interim which would render the planned update invalid. For example, say you were implementing a counter (+1). If the value of the counter at time T0 was 0, and subsequently an increment was applied at time T1 and another at increment was applied at time T2, what is the value of the counter? Using eventual consistency and, for simplicity, assuming no concurrent updates, the resulting value is probably is 2. Unfortunately, the value might be 1. With conditional updates, it will be 2.  Again, conditional puts and deletes are just another great tool to help write correct SimpleDB applications quickly and efficiently.


For more information on consistent reads and conditional put and delete, see SimpleDB Consistency Enhancements.


These two SimpleDB features have been in the works for some time and so it is exciting to see them announced and available today. It’s great to now be able to talk about these features publically. If you are interested in giving them a try, you can for free.  There is no charge for SimpleDB use for database sizes under 1GB (and silly close to free above that level). Go for it.




James Hamilton



b: /


Wednesday, February 24, 2010 3:17:57 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Monday, February 15, 2010

MySpace makes the odd technology choice that I don’t fully understand.  And, from a distance, there are times when I think I see opportunity to drop costs substantially. But, let’s ignore that, and tip our hat to the MySpace for incredibly scale they are driving. It’s a great social networking site and you just can’t argue with the scale they are driving. Their traffic is monstrous and, consequently, it’s a very interesting site to understand in more detail.


Lubor Kollar of SQL Server just sent me this super interesting overview of the MySpace service. My notes follow and the original article is at:


I particularly like social networking sites like Facebook and MySpace because they are so difficult to implement.  Unlike highly partitionable workloads like email, social networking sites work hard to find as many relationships, across as many  dimensions, amongst as many users as possible. I refer to this as the hairball problem. There are no nice clean data partitions which makes social networking sites amongst the most interesting of the high scale internet properties.  More articles on the hairball problem:

·         FriendFeed use of MySQL

·         Geo-Replication at Facebook

·         Scaling LinkedIn


The combination of the hairball problem and extreme scale makes the largest social networking sites like MySpace some of the toughest on the planet to scale.  Focusing on MySpace scale, it is prodigious:

·         130M unique monthly users

·         40% of the US population has MySpace accounts

·         300k new users each day


The MySpace Infrastructure:

·         3,000 Web Servers

·         800 cache servers

·         440 SQL Servers


Looking at the database tier in more detail:

·         440 SQL Server Systems hosting over 1,000 databases

·         Each running on an HP ProLiant DL585

o   4 dual core AMD procs

o   64 GB RAM

·         Storage tier: 1,100 disks on a distributed SAN (really!)

·         1PB of SQL Server hosted data


As ex-member of the SQL Server development team and perhaps less than completely unbiased, I’ve got to say that 440 database servers across a single cluster is a thing of beauty.


More scaling stores:


Hats off to MySpace for delivering a reliable service, in high demand, with high availability. Very impressive.



James Hamilton



b: /


Monday, February 15, 2010 12:11:19 PM (Pacific Standard Time, UTC-08:00)  #    Comments [19] - Trackback
 Saturday, February 13, 2010

Last week, I posted Scaling Second Life. Royans sent me a great set of scaling stories: Scaling Web Architectures and Vijay Rao of AMD pointed out How FarmVille Scales to Harvest 75 Million Players a Month. I find the Farmville example particularly interesting in that it’s “only” a casual game. Having spent most of my life (under a rock) working on high-scale servers and services, I naively would never have guessed that casual gaming was big business. But it is. Really big business. To put a scale point on what "big" means in this context, Zynga, the company responsible for Farmville, is estimated to have a valuation of between $1.5B and $3B (Zynga Raising $180M on Astounding Valuation) with annual revenues of roughly $250M (Zynga Revenues Closer to $250).


The Zynga games portfolio includes 24 games, the best known of which are Mafia Wars and FarmVille. The Farmville scaling story is an great example of how fast internet properties can need to scale. The game had 1M players after 4 days and 10M after 60 days.  


In this interview with FarmVille’s Luke Rajich (How FarmVille Scales to Harvest 75 Million Players a Month), Luke talks about scaling what he refers to as both the largest game in the world and the largest application on a web platform. FarmVille is a Facebook application and peak bandwidth between FarmVille and Facebook can run as high as 3Gbps. The FarmVille team has to manage both incredibly fast growth and very spikey traffic patterns. They have implemented what I call graceful degradation mode(Designing and Deploying Internet-Scale Services) and are able to shed load as load as gaming traffic increases push them towards their resource limits. In Luke’s words “the application has the ability to dynamically turn off any calls back to the platform. We have a dial that we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game. […]The way in which services degrade are to rate limit errors to that service and to implement service usage throttles. The key ideas are to isolate troubled and highly latent services from causing latency and performance issues elsewhere through use of error and timeout throttling, and if needed, disable functionality in the application using on/off switches and functionality based throttles.”  These are good techniques that can be applied to all services.


Lessons Learned from scaling Farmville:

1.      Interactive games are write-heavy. Typical web apps read more than they write so many common architectures may not be sufficient. Read heavy apps can often get by with a caching layer in front of a single database. Write heavy apps will need to partition so writes are spread out and/or use an in-memory architecture.

2.    Design every component as a degradable service. Isolate components so increased latencies in one area won't ruin another. Throttle usage to help alleviate problems. Turn off features when necessary.

3.    Cache Facebook data. When you are deeply dependent on an external component consider caching that component's data to improve latency.

4.    Plan ahead for new release related usage spikes.

5.      Sample. When analyzing large streams of data, looking for problems for example, not every piece of data needs to be processed. Sampling data can yield the same results for much less work.


Check out the High Scalability article How FarmVille Scales to Harvest 75 Million Players a Month for more details.




James Hamilton



b: /

Saturday, February 13, 2010 8:05:33 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Sunday, February 07, 2010

As many of you know I collect high-scale scaling war stories. I’ve appended many of them below. Last week Ars Technica published a detailed article on Scaling Second Life: What Second Life can Teach your Datacenter About Scaling Web Apps. This article by Ian Wilkes who worked at Second Life from 2001 to 2009 where he was director of operations. My rough notes follow:

·         Understand scale required:

o   Billing system serving US and EU where each user interacts annually and the system has 10% penetration: 2 to 3 events/second

o   Chat system serving UE and EU where each user sends 10 message/day during workday: 20k messages/second

·         Does the system have to be available 24x7 and understand the impact of downtime (beware of over-investing in less important dimensions at the expense of those more important)

·         Understand the resource impact of features. Be especially cautious around relational database systems and object relational mapping frameworks. If nobody knows the resource requirements, expect trouble in the near future.

·         Database pain: “Almost all online systems use an SQL-based RDBMS, or something very much like one, to store some or all of their data, and this is often the first and biggest bottleneck. Depending on your choice of vendor, scaling a single database to higher capacity can range from very expensive to outright impossible. Linden's experience with uber-popular MySQL is illustrative: we used it for storing all manner of structured data, ultimately totaling hundreds of tables and billions of rows, and we ran into a variety of limitations which were not expected.”

·         MySQL specific issues:

o   Lacks alter table statement

o   Write heavy workload can run heavy CPU spikes due to internal lock conflicts

o   Lack of effective per-user governors means a single application can bring the system to its knees

·         Interchangeable parts :” A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It's not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they'll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more heavily interchangeable the parts are, the more quickly the team can respond to failures or new demands.”

·         Instrument, propagate, isolate errors:

o   It is important not to overlook transient, temporary errors in favor of large-scale failures; keeping good data about errors and dealing with them in an organized way is essential to managing system reliability.

o   Second Life has a large number of highly asynchronous back-end systems, which are heavily interdependent. Unfortunately, it had the property that under the right load conditions, localized hotspots could develop, where individual nodes could fall behind and eventually begin silently dropping requests, leading to lost data.

·          Batch jobs, the silent killer: Batch jobs bring two challenges: 1) sudden workload spikes and 2) inability to complete the job within the batch window.

·         Keep alerts under control: “I can't count the number of system operations people I've talked to (usually in job interviews as they sought a new position) who, at a growing firm, suffered from catastrophic over-paging.”

·         Beware of the “grand rewrite”


If you are interested in reading more from Ian at Second Life: Interview with Ian Wilkes From Linden Lab.


More from the Scaling-X series:

·         Scaling Second Life:

·         Scaling Google:

·         Scaling LinkedIn:

·         Scaling Amazon:

·         Scaling Second Life:

·         Scaling Technorati:

·         Scaling Flickr:

·         Scaling Craigslist:

·         Scaling Findory:

·         Scaling Myspace:

·         Scaling Twitter, Flickr, Live Journal, Six Apart, Bloglines,, SlideShare, and eBay:


A very comprehensive list from Royans:  Scaling Web Architectures


Some time back for USENIX LISA, I brought together a set of high-scale services best practices:

·         Designing and Deploying Internet-Scale Services


If you come across other scaling war stories, send them my way:




James Hamilton



b: /


Sunday, February 07, 2010 11:51:21 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
 Sunday, January 17, 2010

Cloud computing is an opportunity to substantially improve the economics of enterprise IT. We really can do more with less. 


I firmly believe that enterprise IT is a competitive weapon and, in all industries, the leaders are going to be those that invest deeply in information processing. The best companies in each market segment are going to be information processing experts and because of this investment, are going to know their customer better, will chose their suppliers better, will have deep knowledge and control of their supply chains, and will have an incredibly efficient distribution system. They will do everything better and more efficiently because of their information processing investment.  This is the future reality for retail companies, for financial companies, for petroleum exploration, for pharmaceutical, for sports teams, and for logistics companies. No market segment will be spared and, for many, it’s their reality today.  Investment in IT is the only way to serve customers and shareholders better than competitors.


It’s clear to me that investing in information technology is the future of all successful companies and it’s the present for most. The good news is that it really can be done more cost effectively, more efficiently, and with less environmental impact using cloud computing. We really can do more with less.


The argument for cloud computing is gaining acceptance industry-wide. But, private clouds are being embraced by some enterprises and analysts as the solution and the right way to improve the economics of enterprise IT infrastructure. Private clouds may feel like a step in the right direction but scale-economics make private clouds far less efficient than real cloud computing. What’s the difference? At scale, in a shared resource fabric, better services can be offered at lower cost with much higher resource utilization. We’ll look at both the cost and resource utilization advantages in more detail below.


At very high-scale it’s both affordable and efficient to have teams of experts in power distribution and mechanical systems on staff.  The major cloud computing providers have these teams and are inventing new techniques to lower costs, improve efficiency, and provide more environmentally sound solutions. This is very hard to do cost effectively at scale of less than 10s of megawatts.  Continuing that same argument to other domains, cloud computing providers have teams specialized in server and storage design.  And they are deeply invested in networking gear hardware and software. All of this is hard to justify at private cloud scales.


Cloud computing providers have 24x7 staff to monitor the services and to respond to customer issues. Doing service monitoring right is incredibly difficult and I’ve never seen it done well at anything less than multi-megawatt scales.


Cloud computing providers have some of the best distributed systems specialists in the world. They also have open source experts and depend deeply upon both open source and internally produced software.  They do this for two reasons: 1) at high-scale, things fail in new and interesting ways – operational excellence only comes from intimate knowledge of the entire hardware and software stack, and 2) when running at the high scale needed for efficiency, software licensing costs give up much of the excellent economics of a cloud service.


Resource utilization is even a stronger argument to move to a high-scale, shared infrastructure cloud. At scale, with high customer diversity, a wonderful property emerges: non-correlated peaks. Whereas each company has to provision to support their peak workload, when running in a shared cloud the peaks and valleys smooth.  The retail market peaks in November, taxation in April, some financial business peak on quarter ends and many of these workloads have many cycles overlaid some daily, some weekly, some yearly and some event specific.  For example, the death of Michael Jackson drove heavy workloads in some domains but had zero impact in others. A huge eastern seaboard storm drives massive peaks in a few businesses but has no impact on most. Large numbers of diverse workloads tend to average out and yield much higher utilization levels than are possible at low scale.  Private clouds can never achieve the utilization levels of shared clouds.


Last week Alistair Croll wrote an excellent InformationWeek article arguing that “the true cloud operators will have an unavoidable cost advantage because it's all they worry about. They'll also be closer to consumers (because they have POPs everywhere and partnerships with content delivery systems), and connecting with consumers and partners will become an increasingly essential part of any enterprise IT strategy.”  Have a look at Private Clouds are a Fix, Not the Future.


Private clouds are better than nothing but an investment in a private cloud is an investment in a temporary fix that will only slow the path to the final destination: shared clouds. A decision to go with a private cloud is a decision to run lower utilization levels, consume more power, be less efficient environmentally, and to run higher costs.


James Hamilton



b: /


Sunday, January 17, 2010 9:20:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [24] - Trackback
 Thursday, January 07, 2010

There is a growing gap between memory bandwidth and CPU power and this growing gap makes low power servers both more practical and more efficient than current designs.   Per-socket processor performance continues to increase much more rapidly than memory bandwidth and this trend applies across the application spectrum from mobile devices, through client, to servers. Essentially we are getting more compute than we have memory bandwidth to feed. 


We can attempt to address this problem two ways: 1) more memory bandwidth and 2) less fast processors. The former solution will be used and Intel Nehalem is a good example of this but costs increase non-linearly so the effectiveness of this technique will be  bounded. The second technique has great promise to reduce both cost and power consumption.


For more detail on this trend:

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the MicroSlice Servers

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced


This morning GigOm reported that SeaMicro has just obtained a $9.3M Department of Energy grant to improve data center efficiency (SeaMicro’s Secret Server Changes Computing Economics).  SeaMicro is a Santa Clara based start-up that is building a 512 processor server based upon Intel Atom. Also mentioned was Smooth Stone who is designing a high-scale server based upon ARM processors. ARMs processors are incredibly power efficient, commonly used in embedded devices and by far the most common processor used in cell phones.


Over the past year I’ve met with both Smooth Stone and SeaMicro frequently and it’s great to see more information about both available broadly. The very low power server trend is real and advancing quickly. When purchasing servers, it needs to be all about work done per dollar and work done per joule


Congratulations to SeaMicro on the DoE grant.


James Hamilton



b: /


Thursday, January 07, 2010 10:38:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Saturday, January 02, 2010

In this month’s Communications of the Association of Computing Machinery, a rematch of the MapReduce debate was staged.  In the original debate, Dave Dewitt and Michael Stonebraker, both giants of the database community, complained that:


1.    MapReduce is a step backwards in database access

2.    MapReduce is a poor implementation

3.    MapReduce is not novel

4.    MapReduce is missing features

5.    MapReduce is incompatible with the DBMS tools


Unfortunately, the original article appear to be no longer available but you will find the debate branching out from that original article by searching on the title Map Reduce: A Major Step Backwards. The debate was huge, occasionally entertaining, but not always factual.  My contribution was MapReduce a Minor Step forward.

Update: In comments, csliu offered updated URLs for the original blog post and a follow-on article:

·  MapReduce: A Major Step Backwards

· MapReduce II


I like MapReduce for a variety of reasons the most significant of which is that it allows non-systems programmers to write very high-scale, parallel programs with comparative ease.  There have been many attempts to allow mortals to write parallel programs but there really have only been two widely adopted solutions that allow modestly skilled programmers to write highly concurrent executions: SQL and MapReduce. Ironically the two communities participating in the debate,  Systems and Database, have each produced a great success by this measure. 


More than 15 years ago, back when I worked on IBM DB2, we had DB2 Parallel Edition running well over a 512 server cluster.  Even back then you could write a SQL Statement that would run over a ½ thousand servers.  Similarly, programmers without special skills can run MapReduce programs that run over thousands of serves. The last I checked Yahoo, was running MapReduce jobs over a 4,000 node cluster: Scaling Hadoop to 4,000 nodes at Yahoo!.


The update on the MapReduce debate is worth reading but, unfortunately, the ACM has marked the first article as “premium content” so you can only read it if you are a CACM subscriber:

·         MapReduce and Parallel DBMSs: Friend or Foe

·         MapReduce: A Flexible Data Processing Tool

Update: Moshe Vardi, Editor in Chief of the Communications of the Association of Computing Machinery has kindly decided to make both the of the above articles freely available for all whether or not CACM member. Thank you Moshe.


Even more important to me than the MapReduce debate is seeing this sort of content made widely available. I hate seeing it classified as premium content restricted to members only. You really all should be members but, with the plunging cost of web publishing, why can’t the above content be made freely available? But, while complaining about the ACM publishing policies, I should hasten to point out that the CACM has returned to greatness.  When I started in this industry, the CACM was an important read each month. Well, good news, the long boring hiatus is over.  It’s now important reading again and has been for the last couple of years. I just wish the CACM would follow the lead of ACM Queue and make the content more broadly available outside of the membership community.


Returning to the MapReduce discussion, in the second CACM article above, MapReduce: A Flexible Data Processing Tool, Jeff Dean and Sanjay Ghemawat, do a thoughtful job of working through some of the recent criticism of MapReduce.


If you are interested in MapReduce, I recommend reading the original Operating Systems Design and Implementation MapReduce paper: MapReduce: Simplied Data Processing on Large Clusters and the detailed MapReduce vs database comparison paper: A Comparison of Approaches to Large-Scale Data Analysis.




James Hamilton



b: /


Saturday, January 02, 2010 8:15:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
 Saturday, December 19, 2009

The networking world remains one of the last bastions of the mainframe computing design point. Back in 1987 Garth Gibson, Dave Patterson, and Randy Katz showed we could aggregate low-cost, low-quality commodity disks into storage subsystems far more reliable and much less expensive than the best purpose-built storage subsystems (Redundant Array of Inexpensive Disks). The lesson played out yet again where we learned that large aggregations of low-cost, low-quality commodity servers are far more reliable and less expensive than the best purpose-built scale up servers. However, this logic has not yet played out in the networking world.


The networking equipment world looks just like mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are stack are single sourced and vertically integrated.  Just as you couldn’t run IBM MVS on a Burrows computer, you can’t run Cisco IOS on Juniper equipment.

When networking gear is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.


Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges. In the networking world, we have a long way to go but small steps are being made. Broadcom, Fulcrum, Marvell, Dune (recently purchased by Broadcom), Fujitsu and others all produce ASICs (the data plane CPU of the networking world). These ASICS are available for any hardware designer to pick up and use. Unfortunately, there is no standardization and hardware designs based upon one part can’t easily be adapted to use another.


In the X86 world, the combination of the X86 ISA, hardware platform, and the BIOS forms a De facto standard interface.  Any server supporting this low level interface can host the wide variety of different Linux systems, Windows, and many embedded O/Ss.  The existence of this layer allows software innovation above and encourages nearly unconstrained hardware innovation below.  New hardware designs work with existing software.  New software extensions and enhancements work with all the existing hardware platforms. Hardware producers get a wider variety of good quality operating systems.  Operating systems authors get a broad install base of existing hardware to target. Both get bigger effective markets. High volumes encourage greater investment and drive down costs.


This standardized layer hasn’t existed in the networking ecosystem as it has in the commodity server world. As a consequence, we don’t have high quality networking stacks able to run across a wide variety of networking devices. A potential solution is near: OpenFlow. This work originating out of the Stanford networking team driven by Nick McKeown. It is a low level hardware independent interface for updating network routing tables in a hardware independent-way. It is sufficiently rich to support current routing protocols and it also can support research protocols optimized at high-scale data center networking systems such as VL2 and PortLand. Current OpenFlow implementations exist on X86 hardware running linux, Broadcom, NEC, NetFPGA, Toroki, and many others.


The ingredients of an open stack are coming together. We have merchant silicon ASIC from Broadcom, Fulcrum, Dune and others. We have commodity, high-radix routers available from Broadcom (shipped by many competing OEMs), Arista, and others.  We have the beginnings of industry momentum behind OpenFlow which has a very good chance of being that low level networking interface we need. A broadly available, low-level interface may allow a high-quality, open source networking stack to emerge. I see the beginnings of the right thing happening.


·         OpenFlow web site:

·         OpenFlow paper:  Enabling Innovation in Campus Networks

·         My Stanford Clean Slate Talk Slides: DC Networks are in my way


James Hamilton



b: /


Saturday, December 19, 2009 10:00:08 AM (Pacific Standard Time, UTC-08:00)  #    Comments [14] - Trackback
 Friday, December 18, 2009

I'm on the technical program committe for ACM Science Cloud 2010. You should consider both submitting a paper and attending the conference. The conference will be held in Chicago on June21st, 2010 colocated with  ACM HPDC 2010 (High Performance Distributed Computing).

The call for papers abstracst are due Feb 22 with final papers due March 1st:

Workshop Overview:

The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. The support for data intensive computing is critical to advancing modern science as storage systems have experienced an increasing gap between their capacity and bandwidth by more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize and interpret large datasets. Scientific computing involves a broad range of technologies, from high-performance computing (HPC) which is heavily focused on compute-intensive applications, high-throughput computing (HTC) which focuses on using many computing resources over long periods of time to accomplish its computational tasks, many-task computing (MTC) which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time, to data-intensive computing which is heavily focused on data distribution and harnessing data locality by scheduling of computations close to the data.

The 1st workshop on Scientific Cloud Computing (ScienceCloud) will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. The ScienceCloud workshop will focus on the use of cloud-based technologies to meet new compute intensive and data intensive scientific challenges that are not well served by the current supercomputers, grids or commercial clouds. What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation and sensor ensembles are both important new science pathways and tremendous challenges for current HPC/HTC/MTC technologies.  How can cloud technologies enable these new scientific approaches? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers?  What factors are limiting clouds use or would make them more usable/efficient?

This workshop encourages interaction and cross-pollination between those developing applications, algorithms, software, hardware and networking, emphasizing scientific computing for such cloud platforms. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and define architectures and services for future science clouds. 

James Hamilton



b: /


Friday, December 18, 2009 11:24:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

<May 2010>

This Blog
Member Login
All Content © 2015, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton