Monday, May 31, 2010

I did a talk at the Usenix Tech conference last year, Where does the Power Go in High Scale Data Centers. After the talk I got into a more detailed discussion with several folks from Netflix and Canada’s Research in Motion, the maker of the Blackberry. The discussion ended up in a long lunch over a big table with folks from both teams. The common theme of the discussion was, predictably given the companies and folks involved, innovation in high-scale services and how to deal with incredible growth rates. Both RIM and Netflix are very successful and, until you have experienced and attempted to manage internet growth rates, you really just don’t know how hard it is. I’m impressed with what they are doing. Growth brings super interesting problems; I learned from both teams and really enjoyed spending time with them.

 

I recently came across an interesting talk by Santosh Rau, the Netflix Cloud Infrastructure Engineering Manager. The fact that Netflix actually has a cloud infrastructure engineering manager is what caught my attention. Netflix continues to innovate quickly and is moving fast with cloud computing.

 

My notes from Rau’s talk:

·         Details on Netflix

o   More than 10m subscribers

o   Over 100,000 DVD titles

o   50 distribution centers

o   Over 12,000 instant watch titles

·         Why is Netflix going to the cloud

o   Elastic infrastructure

o   Pay for what you use

o   Simple to deploy and maintain

o   Leverage datacenter geo-diversity

o   Leverage application services (queuing, persistence, security, etc.)

·         Why did Netflix choose Amazon Web Services

o   Massive scale

o   More mature services

o   Thriving, active developer community of over 400,000 developers with excellent support

·         Netflix goals for move to the cloud:

o   Improved availability

o   Operational simplicity

o   Architect to exploit the characteristics of the cloud

·         Services in cloud:

o   Streaming control service: stream movie content to customers

§  Architecture: Three Netflix services running in EC2 (replication, queueing, and streaming) with inter-service communication via SQS and persistent state in SimpleDB (a rough sketch of this pattern follows these notes)

§  A good cloud workload in that usage can vary greatly, and a better customer experience is possible by streaming content from regional data centers located near users

o   Encoding Service: Encodes movies in format required by diverse set of supported devices.

§  A good cloud workload in that it’s very computationally intensive and, as new formats are introduced, massive encoding work needs to be done and there is value in doing it quickly (more servers for less time).

o   AWS Services used by Netflix

§  Elastic Compute Cloud

§  Elastic Block Storage

§  Simple Queue Service

§  SimpleDB

§  Simple Storage Service

§  Elastic Load Balancing

§  Elastic MapReduce

o   Developer Challenges:

§  Reliability and capacity

§  Persistence strategy

·         Oracle on EC2 over EBS vs MySQL vs SimpleDB

·         SimpleDB: highly available, replicated across zones

·         Eventually consistent (now also supports full consistency; see I love eventual consistency but…)

§  Data encryption and key management

§  Data replication and consistency
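
The streaming control architecture noted above (services in EC2 communicating via SQS with state in SimpleDB) maps to a simple pattern. Here’s a minimal sketch using the Python boto library; the queue name, domain name, and message format are hypothetical placeholders, and this is an illustration of the SQS/SimpleDB pattern, not Netflix’s implementation.

import boto
from boto.sqs.message import Message

# Connect to SQS and SimpleDB (assumes AWS credentials are configured).
sqs = boto.connect_sqs()
sdb = boto.connect_sdb()

# Hypothetical queue and domain names, for illustration only.
queue = sqs.create_queue('stream-control-requests')
domain = sdb.create_domain('stream-sessions')

# Producer: enqueue a control message for another service to pick up.
msg = Message()
msg.set_body('start-stream:customer-1234:title-5678')
queue.write(msg)

# Consumer: read the message and record session state in SimpleDB.
received = queue.read(visibility_timeout=30)
if received is not None:
    action, customer, title = received.get_body().split(':')
    domain.put_attributes(customer, {'action': action, 'title': title})
    queue.delete_message(received)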

 

Predictably, the talk ended with “Netflix is hiring” but, in this case, it is actually worth mentioning. They are doing very interesting work and moving lightning fast. RIM is hiring as well: http://www.rim.com/careers/index.shtml.

 

The slides for the talk are at: slideshare.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 31, 2010 5:59:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Tuesday, May 25, 2010

PUE is still broken and I still use it. For more on why PUE has definite flaws, see: PUE and Total Power Usage Efficiency. I keep using it because it’s an easy-to-compute summary of data center efficiency: it can be gamed endlessly, but it’s easy to compute and it does provide some value.

 

Improvements are underway in locking down the most egregious abuses of PUE. Three were recently summarized in Technical Scribblings RE Harmonizing Global Metrics for Data Center Energy Efficiency. In this report from John Stanley, the following recommendations were presented:

·         Total energy should include all forms of energy, whether electric or otherwise (e.g. a gas-fired chiller must include the chemical energy being consumed). I like it but it’ll be a challenge to implement.

·         Total energy should include lighting, cooling, and all support infrastructure. We already knew this but it’s worth clarifying since it’s a common “fudge” employed by smaller operators.

·         PUE should be calculated using source energy. This is energy at the source, prior to high-voltage distribution losses and including all losses in energy production. For example, for gas plants, it’s the fuel energy used including heat losses and other inefficiencies. This one seems hard to compute with precision: I’m not sure how I could figure out source energy where some power is base-load power, some is from peak plants, and some is from out-of-state purchases. This recommendation seems a bit weird.
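
As a rough illustration of how the source-energy recommendation changes the math, here’s a small sketch. The source-energy weighting factors below are hypothetical placeholders; real factors depend on the generation mix serving the facility.

# Rough sketch of PUE with and without source-energy weighting.
# The weighting factors below are hypothetical, for illustration only.

it_energy_mwh = 10000.0         # annual energy delivered to IT equipment
facility_electric_mwh = 14000.0 # total electric energy at the utility meter
facility_gas_mwh = 1000.0       # e.g., gas-fired chiller, converted to MWh

# Conventional PUE: total facility electric energy / IT energy.
pue = facility_electric_mwh / it_energy_mwh
print('site PUE: %.2f' % pue)                     # 1.40

# Proposed approach: include all energy forms and weight by source energy.
grid_source_factor = 3.0   # hypothetical: 1 MWh delivered ~ 3 MWh of fuel
gas_source_factor = 1.0    # gas burned on site is already source energy

source_total = (facility_electric_mwh * grid_source_factor +
                facility_gas_mwh * gas_source_factor)
source_it = it_energy_mwh * grid_source_factor
print('source-energy PUE: %.2f' % (source_total / source_it))  # ~1.43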

 

As with my recommendations in PUE and Total Power Usage Efficiency, these proposed changes add complexity while increasing precision. Mostly I think the increased complexity is warranted although the last, computing source energy, looks hard to do and I don’t fully buy that the complexity is justified.

 

It’s a good short read: Technical Scribblings  RE Harmonizing Global Metrics for Data Center Energy Efficiency. Thanks to Vijay Rao of AMD for sending this one my way.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, May 25, 2010 9:07:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings

Federal and state governments are prodigious information technology users.  Federal Chief Information Officer Vivek Kundra reports that the United States government is spending $76B annually on 10,000 different systems. In his recently released report, State of Public Sector Cloud Computing, Kundra summarizes the benefits of cloud computing:

 

There was a time when every household, town, farm or village had its own water well.  Today, shared public utilities give us access to clean water by simply turning on the tap; cloud computing works in a similar fashion.  Just like the water from the tap in your kitchen, cloud computing services can be turned on or off quickly as needed.  Like at the water company, there is a team of dedicated professionals making sure the service provided is safe and available on a 24/7 basis.  Best of all, when the tap isn’t on, not only are you saving water, but you aren’t paying for resources you don’t currently need.

§  Economical.  Cloud computing is a pay-as-you-go approach to IT, in which a low initial investment is required to get going.  Additional investment is incurred as system use increases and costs can decrease if usage decreases.  In this way, cash flows better match total system cost.

§  Flexible.  IT departments that anticipate fluctuations in user load do not have to scramble to secure additional hardware and software.  With cloud computing, they can add and subtract capacity as its network load dictates, and pay only for what they use.

§  Rapid Implementation.  Without the need to go through the procurement and certification processes, and with a near-limitless selection of services, tools, and features, cloud computing helps projects get off the ground in record time. 

§  Consistent Service.  Network outages can send an IT department scrambling for answers.  Cloud computing can offer a higher level of service and reliability, and an immediate response to emergency situations. 

§  Increased Effectiveness.  Cloud computing frees the user from the finer details of IT system configuration and maintenance, enabling them to spend more time on mission-critical tasks and less time on IT operations and maintenance. 

§  Energy Efficient.  Because resources are pooled, each user community does not need to have its own dedicated IT infrastructure.  Several groups can share computing resources, leading to higher utilization rates, fewer servers, and less energy consumption.  

This document defines cloud computing, describes the federal government approach, and then goes on to cover 30 case studies. The case studies are the most interesting part of the report in that they provide a sampling of the public sector move to cloud computing, showing it’s real, projects are underway, and substantial progress is being made.

 

It’s good to see the federal government showing leadership at a time when the need for federal services is undiminished but the burgeoning federal deficit needs to be brought under control. The savings possible through cloud computing are substantial and the federal IT spending base is enormous, so it’s particularly good to see this new technology delivery platform being adopted at scale.

 

·         Document: State of Public Sector Cloud Computing

·         Executive Summary: State of Public Sector Cloud Computing

 

Thanks to Werner Vogels for sending this article my way.

 

                                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, May 25, 2010 5:17:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Monday, May 24, 2010

Economic forces are more powerful than politics.  Political change is slow.  Changing laws takes time.  Lobbyists water down the intended legislation.  Companies find loopholes.  The population as a whole lacks the strength of conviction to make the tough decisions and stick with them.

 

Economic forces are far more powerful and certainly more responsive than political forces. Effectively, what I’m observing is that great good can be done if there is a business model and profit encouraging it. Here are my two favorite examples, partly because they are both doing great things and partly because they are so different in their approach, but both share the common thread of using the free market to improve the world.

 

Google RE<C

As a society, we can attempt to limit greenhouse gas emissions by trading carbon credits or passing laws attempting to force change but, in the end, it seems we just keep burning coal.  In my view, the Google approach to tackling this problem is wonderful: invest in renewable energy technologies that can be cheaper than coal.  More on the program: Google's Goal: Renewable Energy Cheaper than Coal and Plug into a Greener Grid: RE<C and RechargeIT Initiatives. They are working on solar thermal, high-altitude wind, and geothermal.  The core idea is that, if renewable sources were cheaper than coal, economic forces would quickly make the right thing happen and we actually would stop burning coal. I love the approach but it’s fiendishly difficult.

 

Bill & Melinda Gates Foundation

Here’s a related approach. The problem set is totally different but there are some parallels with the previous example in that they are attempting to set up an economic system where it can be profitable to do good for society.

 

I attended a small presentation by Bill Gates about 5 years ago.  By my measure, it was by far the best talk I’ve ever seen Gates give.  I suspect Bill wouldn’t agree that it was his best but it had a huge impact on me. No press was there and I saw nothing written about it afterwards, but two things caught my interest: 1) Gates’ understanding of world health problems is astoundingly deep, and 2) I loved his technique of applying free-market principles to battle world problems ranging from disease through population growth.

 

In this talk Bill noted that North American disease has a very profitable business model and consequently attracts heavy investment.  Third-world disease lacks a business model and, as a result, there is very little investment. It’s clear that many diseases are easy to control or even eradicate but there is no economic incentive and so there is no sustained progress. There are charity donations but no deep and sustained R&D investment since there are no obvious profits to be made. Bill proposed that we encourage business models that allow drug companies to invest R&D into third-world health problems. They should be able to invest knowing they will be able to make money on the outcome.

 

Current drug costs are driven almost exclusively by R&D costs. The manufacturing costs are quite low by comparison. Does this remind you of anything? It’s the software world all over again. So, the question that brings up is: Can we create a model where drugs are sold in huge volume at very low cost?  I recall buying a copy of Unix for an IBM XT back in 1985 and it was $1,000 (Mark Williams Coherent).  Today 1/10 of that will buy an O/S and many are free with the business model being built on services.  Can we do the same thing to the drug world?  Where else could this technique play out?

 

Using the free market is the most leveraged approach I’ve ever seen to drive change. Where else can we cost-effectively change the economic model and drive a better outcome for society as a whole?

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 24, 2010 4:40:26 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Ramblings
 Tuesday, May 18, 2010

I am excited by very low-power, very low-cost servers and the impact they will have on our industry. There are many workloads where CPU is in excess and lower-power, lower-cost servers are the right answer. These are workloads that don’t fully exploit the capabilities of the underlying server. For these workloads, the server is out of balance with excess CPU capability (both power and cost). There are workloads where less is more. But, with technology shifts, it’s easy to get excited and try to apply the new solution too broadly.

 

We can see parallels in the Flash memory world. At first there was skepticism that Flash had a role to play in supporting server workloads. More recently, there is huge excitement around flash and I keep coming across applications of the technology that really don’t make economic sense. Not all good ideas apply to all problems. In going after this issue I wrote When SSDs make sense in Server applications and then later When SSDs Don’t Make Sense in Server Applications. Sometimes knowing where not to apply a technology is more important than knowing where to apply it. Looking at the negative technology applications is useful.

 

Returning to very low-cost, low-power servers, I’ve written a bit about where they make sense and why:

·         Very Low-Power Server Progress

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the Microslice Computer

·         Linux/Apache on ARM Servers

·         ARM Cortex-A9 SMP Design Announced

 

But I haven’t looked much at where very low-power, low-cost servers do not make sense. When aren’t they a win when looking at work done per dollar and work done per joule? Last week Dave DeWitt sent me a paper that looks at the application of wimpy servers (from the excellent FAWN, Fast Array of Wimpy Nodes, project at CMU) to database workloads. In Wimpy Node Clusters: What About Non-Wimpy Workloads, Willis Lang, Jignesh Patel, and Srinath Shankar find that the Intel Xeon E5410 is slightly better than the Intel Atom when running parallel clustered database workloads including TPC-E and TPC-H. The database engine in this experiment is IBM DB2 DB-X (yet another new name for the product originally called DB2 Parallel Edition – see IBM DB2 for information on DB2, but the Wikipedia page has not yet caught up with the latest IBM name change).

 

These results show us that on complex, clustered database workloads, server processors can win over low-power parts. For those interested in probing the very low-cost, low-power processor space, the paper is worth a read: Wimpy Node Clusters: What About Non-Wimpy Workloads.

 

The generalization of their finding that I’ve been using is that CPU-intensive workloads and workloads with poor scaling characteristics are poor choices to be hosted on very low-power, low-cost servers. CPU-intensive workloads lose because they are CPU-bound and so run best where there is maximum CPU per server in the cluster. Or, worded differently, the multi-server cluster overhead is minimized by having fewer, more-powerful nodes.  Workloads with poor scaling characteristics are another category not well supported by wimpy nodes and the explanation is similar. Although these workloads may not be CPU-bound, they don’t run well over clusters with large server counts. Generally, more resources per node is the best answer if the workload can’t be scaled over large server counts.

 

Where very low-power, low-cost servers win is:

1.      Very cold storage workloads. I last posted on these workloads last year in Successfully Challenging the Server Tax. The core challenge with cold storage apps is that overall system cost is dominated by disk, but the disk needs to be attached to a server, so we have to amortize the cost of the server over the attached disk storage. The more disk we attach to a single server, the lower the cost. But, the more disk we attach to a single server, the larger the failure zone. Nobody wants to have to move 64 to 128 TB every time a server fails. The tension: a higher disk-to-server ratio drives down cost but magnifies the negative impact of a server failure. So, if we have a choice of attaching more disks to a given server or, instead, using a smaller, cheaper server, the conclusion is clear. Smaller wins (see the rough cost sketch below). This is a wonderful example of where low-power servers are a win.

2.      Workloads with good scaling characteristics and insignificant local resource requirements.  Web workloads that just accept connections and dispatch them can run well on these processors. However, we still need to consider the “and insignificant local resource” clause. If the workload scales perfectly but each interaction needs access to a very large memory, for example, it may be a poor choice for wimpy nodes.  If the workload scales with CPU and local resources are small, wimpy nodes are a win.

The first example above is a clear win. The second is more complex. Some examples will be a win but others will not. The better the workload scales and the less fixed resources (disk or memory) required, the bigger the win.
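
To see why the disk-to-server ratio matters so much for the cold storage case, here’s a back-of-envelope cost sketch. All the prices below are made-up placeholders, not quotes; the point is the shape of the curve, not the absolute numbers.

# Back-of-envelope cost per TB for cold storage at different disk-to-server
# ratios. All prices are hypothetical placeholders.

def cost_per_tb(server_cost, disks_per_server, disk_cost, tb_per_disk):
    total = server_cost + disks_per_server * disk_cost
    return total / (disks_per_server * tb_per_disk)

big_server = 2500.0    # hypothetical conventional server price
wimpy_server = 500.0   # hypothetical low-cost, low-power server price
disk = 100.0           # hypothetical disk price
tb = 2.0               # TB per disk

for disks in (12, 24, 48):
    print('%2d disks: big server $%.1f/TB, wimpy server $%.1f/TB' %
          (disks, cost_per_tb(big_server, disks, disk, tb),
                  cost_per_tb(wimpy_server, disks, disk, tb)))

# More disks per server amortizes the server cost better, but also grows the
# failure zone; a cheaper server reaches a low cost/TB at a much smaller ratio.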

 

Good job by Willis Lang, Jignesh Patel, and Srinath Shankar in showing us, with detailed analysis, where wimpy nodes lose.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, May 18, 2010 4:09:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware
 Friday, May 14, 2010

I recently came across a nice data center cooling design by Alan Beresford of EcoCooling Ltd. In this approach, EcoCooling replaces the CRAC units with a combined air mover, damper assembly, and evaporative cooler. I’ve been interested in evaporative coolers and their application to data center cooling for years, and they are becoming more common in modern data center deployments (e.g. Data Center Efficiency Summit).

 

An evaporative cooler is a simple device that cools air by taking water through a state change from liquid to vapor. They are incredibly cheap to run and particularly effective in locales with lower humidity. Evaporative coolers can allow the power-intensive process-based cooling to be shut off for large parts of the year. And, when combined with favorable climates or increased data center temperatures, they can entirely replace air conditioning systems. See Chillerless Data Center at 95F, for a deeper discussion see Costs of Higher Temperature Data Centers, and for a discussion of server design impacts: Next Point of Server Differentiation: Efficiency at Very High Temperature.

 

In the EcoCooling solution, air from the hot aisle is exhausted outside the building, while air from outside is passed through an evaporative cooler and then delivered to the cold aisle. On days too cold for direct delivery to the data center, outside air is mixed with exhaust air to achieve the desired inlet temperature.
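
The mixing step is just an energy balance on the two air streams. A minimal sketch, ignoring humidity effects and assuming both streams have the same density and specific heat (and that outside air is colder than the target inlet, which is colder than the exhaust):

# Fraction of recirculated (hot aisle) air needed to warm cold outside air
# up to the target inlet temperature. Simple mixing model only.

def recirculation_fraction(outside_c, exhaust_c, target_inlet_c):
    if exhaust_c <= outside_c:
        return 0.0  # outside air is already warm enough; no mixing needed
    return (target_inlet_c - outside_c) / (exhaust_c - outside_c)

# Example: -5C outside, 35C hot-aisle exhaust, 18C desired inlet.
f = recirculation_fraction(-5.0, 35.0, 18.0)
print('recirculate %.1f%% exhaust, %.1f%% outside air' % (f * 100, (1 - f) * 100))
# -> recirculate 57.5% exhaust, 42.5% outside air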

 

 

This is a nice clean approach to substantially reducing air conditioning hours. For more information see: Energy Efficient Data Center Cooling or the EcoCooling web site: http://www.ecocooling.co.uk/.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Friday, May 14, 2010 7:06:00 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware
 Monday, May 10, 2010

Wide area network costs and bandwidth shortages are the most common reasons why many enterprise applications run in a single data center. Single data center failure modes are common. There are many external threats to single data center deployments including utility power loss, tornado strikes, facility fires, network connectivity loss, earthquakes, break-ins, and many others I’ve not yet been “lucky” enough to have seen. And, inside a single facility, there are simply too many ways to shoot one’s own foot.  All it takes is one well-intentioned networking engineer to black-hole the entire facility’s network traffic. Even very high quality power distribution systems can have redundant paths taken out by fires in central switch gear or cascading failure modes.  And, even with very highly redundant systems, if the redundant paths aren’t tested often, they won’t work.  Even with incredible redundancy, just having the redundant components in the same room means that a catastrophic failure of one system could eliminate the second. It’s very hard to engineer redundancy with high independence and physical separation of all components in a single data center.

 

With incredible redundancy comes incredible cost. And even with incredible costs, failure modes remain that can eliminate the facility entirely. The only cost-effective solution is to run redundantly across multiple data centers.  Redundancy without physical separation is not sufficient, and making a single facility bulletproof has costs heading asymptotically towards infinity while yielding only tiny increases in availability. The only way to get the next nine is to have redundancy between two data centers. This approach is both more available and considerably more cost effective.

 

Given that cross-datacenter redundancy is the only effective way to achieve cost-effective availability, why don’t all workloads run in this mode? There are three main blockers for the customers I’ve spoken with: 1) scale, 2) latency, and 3) WAN bandwidth and cost.

 

The scale problem is, stated simply, that most companies don’t run enough IT infrastructure to be able to afford multiple data centers in different parts of the country. In fact, many companies only really need a small part of a collocation facility.  Running multiple data centers at low scale drives up costs. This is one of the many ways cloud computing can help drive down costs and improve availability.  Cloud service providers like Amazon Web Services run tens of data centers. You can leverage the AWS scale economics to allow even very low-scale applications to run across multiple data centers with diverse power, diverse networking, different fault zones, etc. Each data center is what AWS calls an Availability Zone. Assuming the scale economics allow it, the remaining blockers to cross-data center replication are latency and the availability and cost of WAN bandwidth.

 

There are also physical limits – mostly the speed of light in fiber – on how far apart redundant components of an application can be run. This limitation is real but won’t prevent redundant data centers from getting far “enough” away to achieve the needed advantages. Generally, 4 to 5 msec is tolerable for most workloads and replication systems. Bandwidth availability and cost are the prime reasons why most customers don’t run geo-diverse deployments.
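
As a rough check on how much separation a few milliseconds buys: light in fiber propagates at roughly two thirds the speed of light in a vacuum, about 200 km per millisecond one way, and real routes are longer than straight lines, so treat the numbers below as an upper bound.

# Rough distance budget for replication given a round-trip latency budget.

km_per_ms_one_way = 200.0   # approximate propagation speed of light in fiber

for rtt_budget_ms in (1, 2, 5):
    one_way_ms = rtt_budget_ms / 2.0
    print('%d ms round trip -> at most ~%.0f km of separation' %
          (rtt_budget_ms, one_way_ms * km_per_ms_one_way))

# A 5 ms round trip still allows data centers roughly 500 km apart, which is
# far enough for independent power grids, weather systems, and fault zones.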

 

I’ve argued that latency need not be a blocker. So, if the application has the scale to be run over multiple data centers, the major limiting factor remaining is WAN bandwidth and cost. It is for this reason that I’ve long been interested in WAN compression algorithms and appliances. These are systems that do compression between branch offices and central enterprise IT centers. Riverbed is one of the largest and most successful of the WAN accelerator providers. Naïve application of block-based compression is better than nothing but compression ratios are bounded and some types of traffic compress very poorly. Most advanced WAN accelerators employ three basic techniques: 1) data type specific optimizations, 2) dedupe, and 3) block-based compression.

 

Data type specific optimizations are essentially a bag of closely guarded heuristics that optimize for Exchange, SharePoint, remote terminal protocol, or other important application data types. I’m going to ignore these type-specific optimizations and focus on dedupe followed by block-based compression since they are the easiest to apply to cross-data center replication traffic.

 

Broadly, dedupe breaks the data to be transferred between data centers into either fixed or variable sized blocks. Variable blocks are slightly better but either works. Each block is cryptographically hashed and, rather than transferring the block to the remote data center, just the hash signature is sent. If that block is already in the remote system’s block index, then it or its clone has already been sent sometime in the past and nothing needs to be sent now. In employing this technique we are exploiting data redundancy at a coarse scale. We are essentially remembering what is on both sides of the WAN and only sending blocks that have not been seen before. The effectiveness of this broad technique is very dependent upon the size and efficiency of the indexing structures, the choice of block boundaries, and the inherent redundancy in the data. But, done right, the compression ratios can be phenomenal, with 30 to 50:1 not being uncommon. This, by the way, is the same basic technology being applied to storage deduplication by companies like Data Domain.

 

If a block has not been sent before, then we actually do have to transfer it. That’s when we apply the second-level compression technique: usually a block-oriented compression algorithm, frequently some variant of LZ. The combination of dedupe and block compression is very effective (a minimal sketch of the combined pipeline follows below). But the system I’ve described above introduces latency. And, for highly latency-sensitive workloads like EMC SRDF, this can be a problem. Many latency-sensitive workloads can’t employ the tricks I’m describing here and either have to run single data center or run at higher cost without compression.
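
Here is a minimal sketch of the dedupe-then-compress pipeline described above. A real accelerator adds far more sophisticated indexing, variable block boundaries, and protocol framing, but the core idea fits in a few lines.

# Minimal sketch of WAN dedupe followed by block compression. Blocks already
# seen by the receiver are replaced by their hash; new blocks are compressed
# and sent, then remembered on both sides.

import hashlib
import zlib

BLOCK_SIZE = 4096  # fixed-size blocks; variable-sized blocks dedupe better

def send_stream(data, sent_index):
    """Yield ('ref', digest) for known blocks, ('data', digest, payload) otherwise."""
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if digest in sent_index:
            yield ('ref', digest)                         # only the signature crosses the WAN
        else:
            sent_index.add(digest)
            yield ('data', digest, zlib.compress(block))  # compress blocks we must send

def receive_stream(messages, block_store):
    out = []
    for msg in messages:
        if msg[0] == 'ref':
            out.append(block_store[msg[1]])
        else:
            block = zlib.decompress(msg[2])
            block_store[msg[1]] = block
            out.append(block)
    return b''.join(out)

# Example: a second transfer of mostly-unchanged data sends mostly references.
sent_index, block_store = set(), {}
payload = b'x' * 100000 + b'header-v1'
wire1 = list(send_stream(payload, sent_index))
assert receive_stream(wire1, block_store) == payload
wire2 = list(send_stream(b'x' * 100000 + b'header-v2', sent_index))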

 

Last week I ran across a company targeting latency-sensitive cross-datacenter replication traffic. Infineta Systems announced this morning a solution targeting this problem: Infineta Unveils Breakthrough Acceleration Technology for Enterprise Data Centers. The Infineta Velocity engine is a dedupe appliance that operates at 10Gbps line rate with latencies under 100 microseconds per network packet. Their solution aims to get the bulk of the advantages of the systems I described above at much lower overhead and latency. They achieve their speed-up four ways: 1) a hardware implementation based upon FPGAs, 2) fixed-sized, full-packet blocks, 3) a bounded index exploiting locality, and 4) heuristic signatures.

 

The first technique is fairly obvious and one I’ve talked about in the past. When you have a repetitive operation that needs to run very fast, the most cost and power effective solution may be a hardware implementation. It’s getting easier and easier to implement common software kernels in FPGAs or even ASICs. See Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration and, for an application view, High Scale Network Research.

 

The second technique is another good one. Rather than spending time computing block boundaries, just use the network packet as the block boundary. Essentially they are using the networking system to find the block boundaries. This has the downside of not being as effective as variable-sized block systems, and they don’t exploit type-specific knowledge, but they can run very fast at low overhead and come close to the higher compression rates yielded by the more computationally intensive techniques.  They are exploiting the fact that 20% of the work produces 80% of the gain.

 

The third technique helps reduce the index size. Rather than keeping a full index of all blocks that have ever been sent, just keep the last N. This allows the index structure to be 100% memory resident without huge, expensive memories. This smaller index is much less resource intensive, requiring much less memory and no disk accesses. Avoiding disk is the only way to get anything approaching 100 microsecond latency. Infineta is exploiting temporal locality: redundant data packets often show up near each other. Clearly this is not always the case and they won’t get the maximum possible compression, but they claim to get most of the compression possible in full block index systems without the latency penalty of a disk access and with less memory overhead.
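
The bounded index is essentially a fixed-size, most-recently-seen cache of block signatures. A rough sketch of the idea (not Infineta’s implementation) using an LRU-style structure; the capacity is arbitrary:

# Sketch of a bounded, memory-resident signature index that keeps only the
# most recently seen N block signatures (exploiting temporal locality).

from collections import OrderedDict

class BoundedIndex(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # signature -> None, in recency order

    def seen(self, signature):
        """Return True if the signature is in the index; refresh its recency."""
        if signature in self.entries:
            self.entries.move_to_end(signature)
            return True
        self.entries[signature] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the oldest signature
        return False

For this to work, sender and receiver must apply the same bounded-index policy so that every reference the sender emits still resolves on the far side.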

 

The final technique wasn’t described in enough detail for me to fully understand it.  What Infineta is doing is avoiding the cost of fully hashing each packet by taking an approximate signature over carefully chosen packet offsets. Clearly you can take a fast signature on less than the full packet and this signature can be used to know that the packet is not in the index on the other side. But, if the fast hash is present, it doesn’t prove the packet has already been sent: two different packets can have the same fast hash. Infineta were a bit cagey on this point but what they might be doing is using the very fast approximate hash to find those packets that unambiguously have not yet been sent. Using this technique, a fast hash can be used to find those packets that absolutely need to be sent, so we can start to compress and send those and waste no more resources on hashing them. For those that may not need to be sent, take a full signature and check whether it is on the remote site. If my guess is correct, the fast hash is being used to avoid spending full-hash resources on packets that are known not to be in the index on the other side.

 

Infineta looks like an interesting solution.  More data on them at:

·         Press release: http://www.infineta.com/news/news_releases/press_release:5585,15851,445

·         Web site: http://www.infineta.com

·         Announcing $15m Series A funding: Infineta Comes Out of Stealth and Closes $15 Million Round of Funding http://tinyurl.com/2g4zhbc

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 10, 2010 12:07:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Hardware
 Thursday, May 06, 2010

Earlier this week Clustrix announced a MySQL compatible, scalable database appliance that caught my interest. Key features supported by Clustrix:

·         MySQL protocol emulation (MySQL protocol supported so MySQL apps written to the MySQL client libraries just work)

·         Hardware appliance delivered in a 1U package including both NVRAM and disk

·         Infiniband interconnect

·         Shared nothing, distributed database

·         Online operations including alter table add column

 

I like the idea of adopting a MySQL programming model. But, it’s incredibly hard to be really MySQL compatible unless each node is actually based upon the MySQL execution engine. And it’s usually the case that a shared-nothing, clustered DB will bring some programming model constraints. For example, if global secondary indexes aren’t implemented, it’s hard to support uniqueness constraints on non-partition key columns and it’s hard to enforce referential integrity. Global secondary index maintenance implies that a single insert, update, or delete that would normally only require a single-node change would instead require atomic updates across many nodes in the cluster, making updates more expensive and susceptible to more failure modes. Essentially, making a cluster look exactly the same as a single very large machine with all the same characteristics isn’t possible. But, many jobs that can’t be done perfectly are still well worth doing. If Clustrix delivers all they are describing, it should be successful.

 

I also like the idea of delivering the product as a hardware appliance. It keeps the support model simple, reduces install and initial setup complexity, and enables application-specific hardware optimizations.

 

Using Infiniband as a cluster interconnect is a nice choice as well. I believe that 10GigE with RDMA support will provide better price/performance than Infiniband, but commodity 10GigE volumes and quality RDMA support are still 18 to 24 months away, so Infiniband is a good choice for today.

 

Going with a shared nothing architecture avoids dependence on expensive shared storage area networks and the scaling bottleneck of distributed lock managers.  Each node in the cluster is an independent database engine with its own physical (local) metadata, storage engine, lock manager, buffer manager, etc. Each node has full control of the table partitions that reside on that node. Any access to those partitions must go through that node. Essentially, bringing the query to the data rather than the data to the query. This is almost always the right answer and it scales beautifully.  

In operation, a client connects to one of the nodes in the cluster and submits a SQL statement. The statement is parsed and compiled. During compilation, the cluster-wide (logical) metadata is accessed as needed and an execution plan is produced. The cluster-wide (logical) metadata is either replicated to all nodes or stored centrally with local caching. The execution plan produced by the query compilation will be run on as many nodes as needed, with the constraint that table or index access be on the nodes that house those table or index partitions. Operators higher in the execution plan can run on any node in the cluster.  Rows flow between operators that span node boundaries over the Infiniband network.  The root of the query plan runs on the node where the query was started and the results are returned to the client program using the MySQL client protocol.
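
Because the appliance speaks the MySQL wire protocol, existing client libraries should just work against it. A hypothetical sketch using the PyMySQL library; the host name, credentials, and schema are placeholders, and PyMySQL is just one example of a stock MySQL client, not anything Clustrix specifies.

# Hypothetical example: the appliance emulates the MySQL protocol, so a stock
# MySQL client library connects as if to a single MySQL server, even though
# the statement is compiled and run across the cluster.

import pymysql

conn = pymysql.connect(host='clustrix-vip.example.com',   # placeholder address
                       user='app', password='secret', database='orders')
try:
    with conn.cursor() as cur:
        # A filter on the partition key lets the plan run on just the nodes
        # that own the relevant table partitions.
        cur.execute('SELECT order_id, total FROM orders WHERE customer_id = %s',
                    (1234,))
        for order_id, total in cur.fetchall():
            print(order_id, total)
finally:
    conn.close()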

 

As described, this is a very big engineering project. I’ve worked on teams that have taken exactly this approach and they took several years to get to the first release, and even subsequent releases had programming model constraints. I don’t know how far along Clustrix is at this point but I like the approach and I’m looking forward to learning more about their offering.

 

White paper: Clustrix: A New Approach

Press Release: Clustrix Emerges from Stealth Mode with Industry’s First Clustered DB

 

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, May 06, 2010 9:49:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
Software
 Tuesday, May 04, 2010

 Dave Patterson did a keynote at Cloud Futures 2010.  I wasn’t able to attend but I’ve heard it was a great talk so I asked Dave to send the slides my way. He presented Cloud Computing and the Reliable Adaptive Systems Lab.

 

The Berkeley RAD Lab principal investigators include: Armando Fox, Randy Katz & Dave Patterson (systems/networks), Michael Jordan (machine learning), Ion Stoica (networks & P2P), Anthony Joseph (systems/security), Michael Franklin (databases), and Scott Shenker (networks), in addition to 30 PhD students, 10 undergrads, and 2 postdocs.

 

The talk starts by arguing that cloud computing actually is a new approach, drawing material from the Above the Clouds paper that I mentioned early last year: Berkeley Above the Clouds. It then walks through why pay-as-you-go computing with fine-grained time increments allows SLAs to be hit without stranding valuable resources.

 

 

 

The slides are up at: Cloud Computing and the RAD Lab and if you want to read more about the RAD lab:  http://radlab.cs.berkeley.edu. If you haven’t already read it, this is worth reading: Above the Clouds: A Berkeley View.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, May 04, 2010 5:48:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Friday, April 30, 2010

Rich Miller of Datacenter Knowledge covered this last week and it caught my interest. I’m super interested in modular data centers (Architecture for Modular Datacenters) and highly efficient infrastructure (Data Center Efficiency Best Practices), so the Yahoo! Computing Coop naturally got my attention.

 

As much as I like the cost, strength, and availability of ISO standard shipping containers, 8’ is an inconvenient width. It’s not quite wide enough for two rows of standard racks, and there are cost and design advantages in having at least two rows in a container. With two rows, air can be pulled in each side with a single hot aisle in the middle and large central exhaust fans. It’s an attractive design point, and there is nothing magical about shipping containers. What we want is commodity, prefab, and a moderate increment of growth.

 

The Yahoo design is a nice one. They are using a shell borrowed from a Tyson Foods design; Tyson grows a large part of the North American chicken supply.  These prefab facilities are essentially giant air handlers, with the shell making up a good part of the mechanical plant. They pull air in either side of the building and it passes through two rows of servers into the center of the building. The roof slopes to the center from both sides with central exhaust fans. Each unit is 120’ x 60’ and houses 3.6 MW of critical load.

 

Because of the module width they have 4 rows of servers. It’s not clear if the air from outside has to pass through both rows to get to the central hot aisle, but it sounds like that is the approach. Generally, serial cooling, where the hot air from one set of servers is routed through another, is worth avoiding. It certainly can work but requires more air flow than single-pass cooling at the same approach temperature.

 

Yahoo! believes they will be able to bring a new building online in 6 months at a cost of $5M per megawatt. In the Buffalo, New York location, they expect to use process-based cooling only 212 hours/year and to have close to zero water consumption when the air conditioning is not in use. See the Data Center Knowledge article for more detail: Yahoo Computing Coop: Shape of Things to Come?
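
Working the stated figures gives a feel for the scale. This is just arithmetic on the numbers quoted above, not additional data about the facility:

# Rough arithmetic on the figures quoted above.

critical_load_mw = 3.6
cost_per_mw = 5e6          # $5M per megawatt
footprint_sqft = 120 * 60  # 120' x 60' module
chiller_hours = 212
hours_per_year = 8760

print('cost per module: $%.0fM' % (critical_load_mw * cost_per_mw / 1e6))   # ~$18M
print('critical load density: %.0f W/sqft of footprint' %
      (critical_load_mw * 1e6 / footprint_sqft))                            # ~500 W/sqft
print('fraction of the year on process cooling: %.1f%%' %
      (100.0 * chiller_hours / hours_per_year))                             # ~2.4%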

More pictures at: A Closer Look at Yahoo’s New Data Center. Nice design Yahoo.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, April 30, 2010 9:44:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Thursday, April 29, 2010

Facebook released Flashcache yesterday: Releasing Flashcache. The authors of Flashcache, Paul Saab and Mohan Srinivasan, describe it as “a simple write back persistent block cache designed to accelerate reads and writes from slower rotational media by caching data in SSD's.”

 

There are commercial variants of flash-based write caches available as well. For example, LSI has a caching controller that operates at the logical volume layer. See LSI and Seagate take on Fusion-IO with Flash. The way these systems work is, for a given logical volume, page access rates are tracked.  Hot pages are stored on SSD while cold pages reside back on spinning media. The cache is write-back and pages are written back to their disk resident locations in the background.

 

For benchmark workloads with evenly distributed, 100% random access patterns, these solutions don’t contribute all that much. Fortunately, the world is full of data access pattern skew and some portions of the data are typically very cold while others are red hot. 100% even distributions really only show up in benchmarks – most workloads have some access pattern skew. And, for those with skew, a flash cache can substantially reduce disk I/O rates at lower cost than adding more memory.

 

What’s interesting about the Facebook contribution is that it’s open source and supports Linux.  From: http://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt:

 

Flashcache is a write back block cache Linux kernel module. [..]Flashcache is built using the Linux Device Mapper (DM), part of the Linux Storage Stack infrastructure that facilitates building SW-RAID and other components. LVM, for example, is built using the DM.

 

The cache is structured as a set associative hash, where the cache is divided up into a number of fixed size sets (buckets) with linear probing within a set to find blocks. The set associative hash has a number of advantages (called out in sections below) and works very well in practice.

 

The block size, set size and cache size are configurable parameters, specified at cache creation. The default set size is 512 (blocks) and there is little reason to change this.
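
To make the set-associative layout concrete, here’s a rough sketch of the lookup path. This is illustration only, not the Flashcache code (which is a C kernel module); it just shows a set-associative hash with linear probing, using the default set size mentioned above.

# Illustration only: a set-associative cache where a disk block number hashes
# to a fixed-size set, then linear probing within that set finds the cached
# block or a free slot.

SET_SIZE = 512                      # blocks per set (the default noted above)
NUM_SETS = 1024                     # total cache = NUM_SETS * SET_SIZE blocks
EMPTY = None

cache = [[EMPTY] * SET_SIZE for _ in range(NUM_SETS)]   # cached block numbers

def lookup(block_number):
    """Return (set_index, slot) on a hit, or (set_index, None) on a miss."""
    set_index = hash(block_number) % NUM_SETS
    target_set = cache[set_index]
    for slot in range(SET_SIZE):                 # linear probe within the set
        if target_set[slot] == block_number:
            return set_index, slot               # hit
        if target_set[slot] is EMPTY:
            break                                # probe only to first empty slot
    return set_index, None                       # miss; caller picks a victim

def insert(block_number):
    set_index, slot = lookup(block_number)
    if slot is None:
        target_set = cache[set_index]
        slot = target_set.index(EMPTY) if EMPTY in target_set else 0  # naive eviction
        target_set[slot] = block_number
    return set_index, slot

The point of the fixed-size sets is that a lookup touches only one small region of the index, which keeps probe costs bounded and predictable.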

 

More information on usage: http://github.com/facebook/flashcache/blob/master/doc/flashcache-sa-guide.txt.  Thanks to Grant McAlister for pointing me to the Facebook release of Flashcache. Nice work Paul and Mohan.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, April 29, 2010 6:36:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Wednesday, April 21, 2010

There have been times in past years when it really looked like our industry was on track to supporting only a single relevant web browser. Clearly that’s not the case today.  In a discussion with a co-worker today on the importance of “other” browsers, I wanted to put some data on the table, so I looked up the browser stats for this web site (http://mvdirona.com).  I hadn’t looked for a while and found the distribution truly interesting:

 

Admittedly, those that visit this site clearly don’t represent the broader population well.  Nonetheless, the numbers are super interesting.  Firefox eclipsing Internet Explorer, and by such a wide margin, was surprising to me. You can’t see it in the data above but the IE share continues to decline.  Chrome is already up to 17%.

 

Looking at the share data posted on Wikipedia (http://en.wikipedia.org/wiki/Usage_share_of_web_browsers#Summary_table and using the Net Market Share data) we see that IE has declined from over 91.4% to  61.4% in just 5 years. Again a surprisingly rapid change.

 

Focusing on client operating systems, from the skewed sample that accesses this site, we see several interesting trends: 1) Mac share continues to climb sharply at 16.6%, 2) Linux at 9%, 3) iPhone, iPod and iPad in aggregate at over 5 ¼%, and 4) Android already over ¼%.

 

Overall we are seeing more browser diversity,  more O/S diversity, and unsurprisingly, more mobile devices.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, April 21, 2010 12:25:25 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Ramblings
 Saturday, April 17, 2010

We live on a boat, which has lots of upside, but broadband connectivity isn’t one of them. As it turns out, our marina has WiFi but it is sufficiently unreliable that we needed another solution. I wish there was a Starbucks hotspot across the street – actually there is one within a block but we can’t quite pick up the signal even with an external antenna (Syrens).

 

WiFi would have been a nice solution but didn’t work out, so we decided to go with WiMAX. We have used ClearWire for over a year on the boat and, generally, it has worked acceptably well. Not nearly as fast as WiFi but better than 3G cellular.  Recently ClearWire changed its name to Clear and “upgraded” the connectivity technology to full WiMAX. Unfortunately, the upgrade substantially reduced the coverage area and has been fairly unstable, and the customer support, although courteous and friendly, is so far removed from the engineering team that they basically just can’t make a difference no matter how hard they try.

 

We decided we had to find a different solution. I use AT&T 3G cellular with tethering and would have been fine with that as a solution. It’s a bit slower than Clear but it’s stable and coverage is very broad. Unfortunately, the “unlimited” plan we got some years ago is actually limited to 5 GB/month and we move far more data than that. I can’t talk AT&T into offering a solution so, again, we needed something else.

 

Sprint now has a WiMAX service that offers good performance (although they can be a bit aggressive on throttling), they have fairly broad coverage in our area, and they are expanding quickly (Sprint announces seven new WiMAX markets). Sprint has the additional nice feature on some modems where, if WiMAX is unavailable, the modem transparently falls back to 3G. The 3G service is still limited to 5 GB but, as long as we are on WiMAX for a substantial portion of the month, we’re fine.

 

The remaining challenge was that Virtual Private Network (VPN) connections over WiMAX can be unstable. I really wish my workplace supported Exchange RPC over HTTP (one of the coolest Outlook/Exchange features of all time). However, many companies believe that Exchange RPC over HTTP is insecure in that it doesn’t require two-factor authentication. Ironically, many of these companies allow BlackBerrys and iPhones to access email without two-factor auth. I won’t try to explain why one is unsafe and the other is fine, but I think it might have something to do with the popularity of iPhones and BlackBerrys with execs and senior technical folks :-).

 

In the absence of RPC over HTTP, logging into the work network via VPN is the only answer. My workplace uses Aventail but there are a million solutions out there. I’ve used many and love none.  There are many reasons why these systems can be unstable, cause blue screens, and otherwise negatively impact the customer experience. But the one that has been driving me especially nuts is frequent dropped connections and hangs when using the VPN over WiMAX. It appears to happen more frequently when there is more data in flight, but losing a connection every few minutes is quite common.

 

It turns out the problem is that the default MTU on most client systems is 1500 but the WiMAX default is often smaller. It should still work, just less efficiently, but it doesn’t. For more details see http://www.amazon.com/Sierra-Wireless-Overdrive-Mobile-Hotspot/dp/B0032JTPMK.

 

To check Vista MTUs:

 

netsh interface ipv4 show subinterfaces

 

To change the MTU to 1400:

 

netsh interface ipv4 set subinterface "your vpn interface here" mtu=1400 store=persistent

 

I’m using an MTU of 1400 with Sprint and it’s working well. Thanks to Kitz.co.uk for the easy MTU update. If you are having flaky VPN connections, especially when running over WiMAX, check your MTU.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, April 17, 2010 5:43:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Ramblings
 Monday, April 12, 2010

Standards and benchmarks have driven considerable innovation. The most effective metrics are performance-based.  Rather than state how to solve the problem, they say what needs to be achieved and leave the innovation open.

 

I’m an ex-auto mechanic and was working as a wrench in a Chevrolet dealership in the early 80s. I hated the emission controls that were coming into force at that time because they caused the cars to run so badly. A 1980 Chevrolet 305 CID with a 4 BBL carburetor would barely idle in perfect tune. It was a mess. But, the emission standards didn’t say the car had to run badly, only what needed to be achieved. And, competition to achieve those goals produced compliant vehicles that ran well. Ironically, as emission standards forced more precise engine management, both fuel economy and power density improved as well. Initially both suffered, as did drivability, but competition brought many innovations to market and we ended up seeing emissions compliance to increasingly strict standards at the same time that both power density and fuel economy improved.

 

What’s key is the combination of competition and performance-based standards.  If we set high goals and allow companies to innovate in how they achieve those goals, great things happen. We need to take that same lesson and apply it to data centers.

 

Recently, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) added data centers to their building efficiency standard, ASHRAE Standard 90.1. This standard defines the energy efficiency for most types of buildings in America and is often incorporated into building codes across the country. Unfortunately, as currently worded, this document is using a prescriptive approach. To comply, you must use economizers and other techniques currently in common practice. But, are economizers the best way to achieve the stated goal?  What about a system that harvested waste heat and applied it to growing cash crops like tomatoes? What about systems using heat pumps to scavenge low-grade heat (see Data Center Waste Heat Reclamation)? Both these innovations would be precluded by the proposed spec since they don’t use economizers.

               

Urs Hoelzle, Google’s Infrastructure SVP, recently posted Setting Efficiency Goals for Data Centers where he argues we need goal-based environmental targets that drive innovation rather than prescriptive standards that prevent it.  Co-signatories with Urs include:

·         Chris Crosby, Senior Vice President, Digital Realty Trust

·         Hossein Fateh, President and Chief Executive Officer, Dupont Fabros Technology

·         James Hamilton, Vice President and Distinguished Engineer, Amazon

·         Urs Hoelzle, Senior Vice President, Operations and Google Fellow, Google

·         Mike Manos, Vice President, Service Operations, Nokia

·         Kevin Timmons, General Manager, Datacenter Services, Microsoft

 

I think we’re all excited by the rapid pace of innovation in high-scale data centers. We know it’s good for the environment and for customers.  And I think we’re all uniformly in agreement with ASHRAE on the intent of 90.1. What’s needed to make it a truly influential and high-quality standard is for it to be performance-based rather than prescriptive. But, otherwise, I think we’re all heading in the same direction.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, April 12, 2010 9:39:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Friday, April 09, 2010

High-scale network research is hard.  Running a workload over a couple of hundred servers says little about how it will run over thousands or tens of thousands of servers. But having tens of thousands of nodes dedicated to a test cluster is unaffordable. For systems research the answer is easy: use Amazon EC2. It’s an ideal cloud computing application. Huge scale is needed during some parts of the research project but the servers aren’t needed 24 hours/day and certainly won’t be needed for the three-year amortization life of the servers.

 

However, for high-scale network research, finding the solution is considerably more difficult. In some dimensions, it’s no different from systems research in that purchasing a few thousand servers for the research project makes no sense. But the easy answer of simply using EC2 doesn’t work in that EC2 nodes come fully provisioned with networking.  One solution that works well for many networking research problems is to use an overlay and test at scale in EC2. But, when new hardware devices are being investigated, unless they can be emulated with high fidelity using software implementations running on EC2, this solution breaks down.

 

For all but a few folks at Cisco and Juniper, running a multi-thousand node physical cluster to test new network gear is impractical. And it’s even less practical in academic settings. I’m lucky enough to work near many thousands of server nodes and a huge networking infrastructure. But, even then, installing a parallel network to do network research is difficult to afford. High-scale network research at credible scale is difficult.

 

Zhangxi Tan of Cal Berkeley came up to visit a couple of weeks back. I’m interested in Zhangxi’s work for two reasons: 1) it’s based upon reconfigurable computing, a technology ready for commercial application, and 2) the application of FPGAs to network simulation might be a solution to the problem of how to test networking gear at credible scale.

 

Reconfigurable computing maintains the flexibility of reprogrammable software systems with the performance of high-performance hardware implementations.  Or, worded differently, most of the performance of Application Specific Integrated Circuits (ASICs) with the flexibility of software. Most reconfigurable computing designs are based upon Field Programmable Gate Arrays (FPGAs) and some high-level instruction set or programming language to allow device reconfiguration.  Recently, C and C++ subset compilers have emerged that allow a constrained version of C or C++ to be compiled directly to an FPGA and, once the software is stable, directly to an ASIC. See Platform-based Electronic Systems Level (ESL) Synthesis for more on reconfigurable computing and see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration.

 

In the work that Zhangxi presented, the Cal Berkeley team is taking the RAMP gold FPGA-based many-core simulator (Research Accelerator for Multiple Processors) and applying it to the problem of high-scale network simulation with a goal of simulating an O(10k) server network. Zhangxi’s slides are here: Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale and my rough notes follow:

·         Lots of work going on in data center network research: VL2, Dcell, PortLand,…

·         But:

o   the test scale is usually WAY smaller than the problem targeted by these systems

o   Often synthetic benchmarks are used rather than actual workloads

·         RAMP Gold is:

o   Full 32-bit SPARC v8 ISA support, including FP, traps and MMU.

o   Use abstract models with enough detail, but fast enough to run real apps/OS

o   Provide cycle-level accuracy

o   Cost-efficient: hundreds of nodes plus switches on a single FPGA

·         RAMP Gold implementation:

o   Based upon Xilinx XUP V5 board ($750)

o   Able to simulate 64 cores, 2GB DDR2, FP, and run production Linux

·         Tested using trace data from Facebook and Yahoo Hadoop runs

·         Demonstrated the TCP incast collapse problem and showed simulated results that closely matched actual measured results

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, April 09, 2010 6:57:05 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Wednesday, April 07, 2010

Mike Stonebraker published an excellent blog posting yesterday at the CACM site: Errors in Database Systems, Eventual Consistency, and the CAP Theorem. In this article, Mike challenges the application of Eric Brewer’s CAP Theorem by the NoSQL database community. Many of the high-scale NoSQL system implementers have argued that the CAP theorem forces them to go with an eventually consistent model.

 

Mike challenges this assertion, pointing out that some common database errors are not avoided by eventual consistency and that CAP really doesn’t apply in these cases. If you have an application error, administrative error, or database implementation bug that loses data, then it is simply gone unless you have an offline copy. This, by the way, is why I’m a big fan of deferred delete. This is a technique where deleted items are marked as deleted but not garbage collected until some days or preferably weeks later. Deferred delete is not full protection but it has saved my butt more than once and I’m a believer. See On Designing and Deploying Internet-Scale Services for more detail.
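To make the deferred delete idea concrete, here is a minimal sketch in Python against a toy in-memory key-value store. The store, the two-week retention window, and all the names are illustrative assumptions, not any particular system’s implementation: deletes only mark the item, reads hide marked items, and a background garbage collector removes them for real once the retention window has passed.

import time

RETENTION_SECONDS = 14 * 24 * 3600  # illustrative: keep "deleted" items for two weeks

class DeferredDeleteStore:
    """Toy key-value store that marks items deleted instead of removing them."""

    def __init__(self):
        self._items = {}  # key -> (value, deleted_at timestamp or None)

    def put(self, key, value):
        self._items[key] = (value, None)

    def get(self, key):
        value, deleted_at = self._items.get(key, (None, None))
        return None if deleted_at is not None else value

    def delete(self, key):
        # Mark as deleted; the data stays recoverable until garbage collection runs.
        if key in self._items:
            value, _ = self._items[key]
            self._items[key] = (value, time.time())

    def undelete(self, key):
        # The whole point: an errant delete can be reversed before GC removes the data.
        if key in self._items:
            value, _ = self._items[key]
            self._items[key] = (value, None)

    def garbage_collect(self):
        now = time.time()
        for key, (value, deleted_at) in list(self._items.items()):
            if deleted_at is not None and now - deleted_at > RETENTION_SECONDS:
                del self._items[key]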

 

CAP and the application of eventual consistency don’t directly protect us against application or database implementation errors. And, in the case of a large-scale disaster where the cluster is lost entirely, again, neither eventual consistency nor CAP offers a solution. Mike also notes that network partitions are fairly rare. I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts remain a popular discussion point in post mortems industry-wide. I see this improving over the next 5 years but we have a long way to go. In Networking: the Last Bastion of Mainframe Computing, I argue that net gear is still operating on the mainframe business model: large, vertically integrated, and expensive equipment, deployed in pairs. When it comes to redundancy at scale, 2 is a poor choice.

 

Mike’s article questions whether eventual consistency is really the right answer for these workloads. I made some similar points in “I love eventual consistency but…”. In that posting, I argued that many applications are much easier to implement with full consistency and that full consistency can be practically implemented at high scale. In fact, Amazon SimpleDB recently announced support for full consistency. Apps needing full consistency are now easier to write and, where only eventual consistency is needed, it’s available as well.

 

Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.

 

                                                                --jrh

 

Thanks to Deepak Singh for pointing me to this article.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, April 07, 2010 11:58:15 AM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, March 23, 2010

Every so often, I come across a paper that just nails it, and this one is pretty good. Using a Market Economy to Provision Resources across Planet-wide Clusters doesn’t fully investigate the space, but it’s great to see progress in this important area and the paper is a strong step in the right direction.

 

I spend much of my time working on driving down infrastructure costs. There is lots of great work that can be done in datacenter infrastructure, networking, and server design. It’s both a fun and important area. But, an even bigger issue is utilization. As an industry, we can and are driving down the cost of computing and yet it remains true that most computing resources never get used. Utilization levels at large and small companies typically run in the 10 to 20% range. I occasionally hear reference to 30% but it’s hard to get data to support it. Most compute cycles go wasted. Most datacenter power doesn’t get useful work done. Most datacenter cooling is not spent supporting productive work. Utilization is a big problem. Driving down the cost of computing certainly helps but it doesn’t address the core issue: low utilization.

 

That’s one of the reasons I work at a cloud computing provider. When you have very large, very diverse workloads, wonderful things happen. Workload peaks are not highly correlated. For example, tax preparation software is busy around tax time, retail software towards the end of the year, and social networking while folks in the region are awake. All these peaks and valleys overlay to produce a much flatter peak-to-trough curve. As the peak-to-trough ratio decreases, utilization skyrockets. You can only get these massively diverse workloads in public clouds, and it’s one of the reasons why private clouds are a bit depressing (see Private Clouds are not the Future). Private clouds are so close to the right destination and yet that last turn was a wrong one and the potential gains won’t be achieved. I hate wasted work as much as I hate low utilization.
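To see why uncorrelated peaks matter, here is a toy Python calculation with synthetic numbers (not real traffic data): three workloads, each with a 10:1 peak-to-trough ratio, peaking at different times of day. The perfectly disjoint peaks are an idealization, but they show the direction of the effect.

# Toy illustration: three synthetic workloads that peak at different hours.
hours = range(24)

def load(peak_hours):
    return lambda h: 100 if h in peak_hours else 10

tax    = load(range(9, 18))    # business hours
retail = load(range(18, 24))   # evenings
social = load(range(0, 9))     # mornings/overnight in this toy model

def peak_to_trough(series):
    return max(series) / min(series)

for name, f in (("tax", tax), ("retail", retail), ("social", social)):
    print(name, peak_to_trough([f(h) for h in hours]))          # 10.0 each

aggregate = [tax(h) + retail(h) + social(h) for h in hours]
print("aggregate", peak_to_trough(aggregate))                   # 1.0 -- perfectly flat
# Real workloads overlap, so the aggregate is never this flat, but the more
# uncorrelated the workloads, the flatter the combined curve gets.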

 

The techniques above smooth the aggregated utilization curve and, the flatter that curve gets, the higher the utilization, the lower the cost, and the better it is for the environment. Large public clouds flatten the workload peaks considerably, but the goal of steady, unchanging load 24 hours a day, 7 days a week isn’t achievable. Even power companies have base load and peak load. What to do with the remaining utilization valleys? The next technique is to use a market economy to incent developers and users to use resources that aren’t currently fully utilized.

 

In The Cost of A Cloud: Research Problems in Datacenter Networks, we argued that turning servers off is a mistake in that the most you can hope to achieve is to save the cost of the power, which is tiny when compared to the cost of the servers, the cost of power distribution gear, and the cost of the mechanical systems. See the Cost of Power in Large-Scale Datacenters (I’ve got an update of this work coming; the changes are interesting but the cost of power remains the minority cost). Rather than shutting off servers, the current darling idea of the industry, we should be productively using them. If we can run any workload worth more than the marginal cost of power, we should. Again, this is a strong argument for public clouds with large pools of resources on which a market can be made.

 

Continuing the theme of making a market and offering under-utilized computing resources (those not under a supply crunch) at lower cost, Amazon Web Services has a super interesting offering called Spot Instances. Spot Instances allow customers to bid on unused EC2 capacity and run those instances as long as their bids exceed the current spot price.
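The core mechanic is simple: an instance keeps running while the customer’s bid is at or above the current spot price and is interrupted when the spot price rises above the bid. A toy Python sketch of that rule, with made-up prices:

def spot_decisions(bid, spot_prices):
    """Hour by hour, does an instance with this bid keep running? (Toy model.)"""
    return ["run" if bid >= price else "interrupted" for price in spot_prices]

hourly_spot_price = [0.03, 0.04, 0.12, 0.05, 0.03]  # illustrative $/hour, not real prices
print(spot_decisions(bid=0.06, spot_prices=hourly_spot_price))
# ['run', 'run', 'interrupted', 'run', 'run']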

 

The paper I mentioned above is heading in a similar direction, but this time working on the Google MapReduce cluster utilization problem. Technically the paper is working on a private cloud, but it’s still nice work, and it is using the biggest private cloud in the world at well over a million servers, so I can’t complain too much. I really like the paper. From the conclusion:

 

In this paper, we have thus proposed a framework for allocating and pricing resources in a grid-like environment. This framework employs a market economy with prices adjusted in periodic clock auctions. We have implemented a pilot allocation system within Google based on these ideas. Our preliminary experiments have resulted in significant improvements in overall utilization; users were induced to make their services more mobile, to make disk/memory/network tradeoffs as appropriate in different clusters, and to fully utilize each resource dimension, among other desirable outcomes. In addition, these auctions have resulted in clear price signals, information that the company and its engineering teams can take advantage of for more efficient future provisioning.

 

It’s worth reading the full paper: http://www.stokely.org/papers/google-cluster-auctions.pdf. One of the authors, Murray Stokely of Google, also wrote an interesting blog entry Fun with Amazon Web Services where he developed many of the arguments above. Thanks to Greg Linden and Deepak Singh for pointing me to this paper. It made for a good read this morning and I hope you enjoy it as well.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, March 23, 2010 6:13:22 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, February 24, 2010

I love eventual consistency but there are some applications that are much easier to implement with strong consistency. Many like eventual consistency because it allows us to scale out nearly without bound, but it does come with a cost in programming-model complexity. For example, assume your program needs to assign work order numbers uniquely and without gaps in the sequence. Eventual consistency makes this type of application difficult to write.

 

Applications built upon eventually consistent stores have to be prepared to deal with update anomalies like lost updates. For example, assume there is an update at time T1 where a given attribute is set to 2. Later, at time T2, the same attribute is set to 3. What will the value of this attribute be at a subsequent time T3? Unfortunately, the answer is we’re not sure. If T1 and T2 are well separated in time, it will almost certainly be 3. But it might be 2. And it is conceivable that it could be some value other than 2 or 3, even if there have been no subsequent updates. Coding to eventual consistency is not the easiest thing in the world. For many applications it’s fine and, with care, most applications can be written correctly on an eventually consistent model. But it is often more difficult.
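A toy model in Python shows why the answer is “we’re not sure”: writes land on one of two replicas immediately and reach the other only when replication catches up, while reads can be served by either replica. This is a deliberately simplified illustration, not how any particular store is implemented.

import random

class EventuallyConsistentToy:
    """Toy two-replica store: writes propagate asynchronously, reads hit either replica."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        target = random.randrange(2)
        self.replicas[target][key] = value
        # The other replica sees the update only when replication catches up.
        self.pending.append((1 - target, key, value))

    def replicate_one(self):
        if self.pending:
            idx, key, value = self.pending.pop(0)
            self.replicas[idx][key] = value

    def read(self, key):
        # An eventually consistent read may be served by a stale replica.
        return self.replicas[random.randrange(2)].get(key)

store = EventuallyConsistentToy()
store.write("x", 2)     # time T1
store.write("x", 3)     # time T2
print(store.read("x"))  # time T3: usually 3, sometimes 2, possibly None if the
                        # replica that served the read hasn't seen either update yet

# Draining the replication queue lets each replica eventually see every update,
# but with no ordering guarantees the replicas can still end up disagreeing
# (one holding 2, the other 3): a lost update.
while store.pending:
    store.replicate_one()
print(store.replicas)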

 

What I’ve learned over the years is that strong consistency, if done well, can scale to very high levels. The trick is implementing it well. The naïve approach to full consistency is to route all updates through a single master server, but clearly this won’t scale. Instead, divide the update space into a large number of partitions, each with its own master. That scales, but there is a tension: avoiding hot spots argues for a large number of partitions, while partition management overhead argues for fewer. The right answer is to be able to dynamically repartition, maintaining a sufficient number of partitions and adapting to load increases on any single server by further spreading the update load.

 

There are many approaches to dynamic hot spot management. One is to divide the workload into 10 to 100x more partitions than expected servers and make these fixed-size partitions the unit of migration. Servers with hot partitions end up serving fewer partitions while servers with cold partitions manage more. The other class of approaches is to dynamically repartition: start with large partitions and divide hot partitions into multiple smaller partitions to spread the load over multiple servers.
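Here is a minimal Python sketch of the first approach (many fixed-size partitions as the unit of migration), assuming per-partition load numbers come from some monitoring source; all names and numbers are illustrative. Hot partitions end up alone or nearly alone on a server, while cold partitions pack many to a server.

def assign_partitions(partition_load, num_servers):
    """Greedy placement: hottest partitions first, each onto the least-loaded server."""
    servers = [{"partitions": [], "load": 0.0} for _ in range(num_servers)]
    for part, load in sorted(partition_load.items(), key=lambda kv: -kv[1]):
        target = min(servers, key=lambda s: s["load"])
        target["partitions"].append(part)
        target["load"] += load
    return servers

# 16 partitions for 4 servers (a 4x ratio; 10 to 100x is suggested above).
partition_load = {"p%d" % i: 1.0 for i in range(16)}
partition_load["p0"] = 12.0   # one hot partition
partition_load["p1"] = 6.0    # one warm partition

for i, s in enumerate(assign_partitions(partition_load, num_servers=4)):
    print("server %d: load=%.1f partitions=%s" % (i, s["load"], s["partitions"]))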

 

There are many variants of these techniques, each with different advantages and disadvantages. The constant is that full consistency is more affordable than many think. Clearly, eventual consistency remains a very good thing for workloads that don’t need full consistency and for workloads where the overhead of the above techniques is determined to be unaffordable. Both consistency models are quite useful.

 

This morning SimpleDB announced support for two new features that make it much easier to write many classes of applications: 1) consistent reads, and 2) conditional put and delete. Consistent reads allow applications that need full consistency to be easily written against SimpleDB. So, for example, if you wanted to implement an inventory management system that didn’t lose parts in the warehouse, sell components twice, or place multiple orders, it would now be trivial to write this application against SimpleDB using the consistent read support. Consistent read is implemented as an optional Boolean flag on SimpleDB GetAttributes or Select statements. Absence of the flag continues to deliver the eventually consistent behavior with which many of you are familiar. If the flag is present and set, you get strong consistency.
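For illustration, here is roughly what the read-side difference looks like from Python using the boto library’s SimpleDB interface as I understand it; the domain name, item name, and exact parameter spelling are assumptions on my part, so check the current SDK documentation rather than treating this as the authoritative API.

import boto

sdb = boto.connect_sdb()              # credentials taken from the environment
domain = sdb.get_domain("inventory")  # hypothetical domain name

# Default: an eventually consistent read that may not reflect the latest write.
item = domain.get_attributes("widget-42")

# With the new flag: a consistent read that reflects all prior successful writes.
item = domain.get_attributes("widget-42", consistent_read=True)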

 

SimpleDB conditional PutAttributes and DeleteAttributes are a related feature that makes it much easier to write applications where the new value of an attribute is functionally related to the old value. Conditional update support allows a programmer to read the value of an attribute, operate upon it, and then write it back only if the value hasn’t changed in the interim, which would render the planned update invalid. For example, say you were implementing a counter (+1). If the value of the counter at time T0 was 0, and an increment was applied at time T1 and another increment was applied at time T2, what is the value of the counter? Using eventual consistency and, for simplicity, assuming no concurrent updates, the resulting value is probably 2. Unfortunately, the value might be 1. With conditional updates, it will be 2. Again, conditional puts and deletes are just another great tool to help write correct SimpleDB applications quickly and efficiently.
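A sketch of the optimistic read-modify-write loop that conditional puts enable. The store interface here (a get plus a conditional_put that succeeds only if the attribute still holds the expected value) is a hypothetical wrapper, not the literal SimpleDB API; it’s the pattern that matters.

def increment_counter(store, item, attr="count", retries=10):
    """Optimistic increment: retry with a fresh read if a concurrent writer got there first.

    Assumes the counter attribute already exists on the item.
    """
    for _ in range(retries):
        current = int(store.get(item, attr))
        # The write succeeds only if attr still equals the value we read;
        # otherwise another writer intervened and we loop to re-read.
        if store.conditional_put(item, attr,
                                 new_value=str(current + 1),
                                 expected=str(current)):
            return current + 1
    raise RuntimeError("too much write contention; giving up")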

 

For more information on consistent reads and conditional put and delete, see SimpleDB Consistency Enhancements.

 

These two SimpleDB features have been in the works for some time, so it is exciting to see them announced and available today. It’s great to now be able to talk about these features publicly. If you are interested in giving them a try, you can for free. There is no charge for SimpleDB use for database sizes under 1GB (and it’s silly close to free above that level). Go for it.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, February 24, 2010 3:17:57 PM (Pacific Standard Time, UTC-08:00)
Services
 Monday, February 15, 2010

MySpace makes the odd technology choice that I don’t fully understand. And, from a distance, there are times when I think I see opportunities to drop costs substantially. But let’s ignore that and tip our hat to MySpace for the incredible scale they are driving. It’s a great social networking site, their traffic is monstrous, and, consequently, it’s a very interesting site to understand in more detail.

 

Lubor Kollar of SQL Server just sent me this super interesting overview of the MySpace service. My notes follow and the original article is at: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004532.

 

I particularly like social networking sites like Facebook and MySpace because they are so difficult to implement. Unlike highly partitionable workloads like email, social networking sites work hard to find as many relationships, across as many dimensions, amongst as many users as possible. I refer to this as the hairball problem. There are no nice, clean data partitions, which makes social networking sites amongst the most interesting of the high-scale internet properties. More articles on the hairball problem:

·         FriendFeed use of MySQL

·         Geo-Replication at Facebook

·         Scaling LinkedIn

 

The combination of the hairball problem and extreme scale makes the largest social networking sites like MySpace some of the toughest on the planet to scale.  Focusing on MySpace scale, it is prodigious:

·         130M unique monthly users

·         40% of the US population has MySpace accounts

·         300k new users each day

 

The MySpace Infrastructure:

·         3,000 Web Servers

·         800 cache servers

·         440 SQL Servers

 

Looking at the database tier in more detail:

·         440 SQL Server Systems hosting over 1,000 databases

·         Each running on an HP ProLiant DL585

o   4 dual core AMD procs

o   64 GB RAM

·         Storage tier: 1,100 disks on a distributed SAN (really!)

·         1PB of SQL Server hosted data

 

As an ex-member of the SQL Server development team, and perhaps less than completely unbiased, I’ve got to say that 440 database servers across a single cluster is a thing of beauty.

 

More scaling stories: http://perspectives.mvdirona.com/2010/02/07/ScalingSecondLife.aspx.

 

Hats off to MySpace for delivering a reliable service, in high demand, with high availability. Very impressive.

 

                                                                --jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, February 15, 2010 12:11:19 PM (Pacific Standard Time, UTC-08:00)
Services
 Saturday, February 13, 2010

Last week, I posted Scaling Second Life. Royans sent me a great set of scaling stories: Scaling Web Architectures, and Vijay Rao of AMD pointed out How FarmVille Scales to Harvest 75 Million Players a Month. I find the FarmVille example particularly interesting in that it’s “only” a casual game. Having spent most of my life (under a rock) working on high-scale servers and services, I naively would never have guessed that casual gaming was big business. But it is. Really big business. To put a scale point on what "big" means in this context, Zynga, the company responsible for FarmVille, is estimated to have a valuation of between $1.5B and $3B (Zynga Raising $180M on Astounding Valuation) with annual revenues of roughly $250M (Zynga Revenues Closer to $250).

 

The Zynga games portfolio includes 24 games, the best known of which are Mafia Wars and FarmVille. The FarmVille scaling story is a great example of how fast internet properties can need to scale. The game had 1M players after 4 days and 10M after 60 days.

 

In this interview with FarmVille’s Luke Rajich (How FarmVille Scales to Harvest 75 Million Players a Month), Luke talks about scaling what he refers to as both the largest game in the world and the largest application on a web platform. FarmVille is a Facebook application and peak bandwidth between FarmVille and Facebook can run as high as 3Gbps. The FarmVille team has to manage both incredibly fast growth and very spiky traffic patterns. They have implemented what I call graceful degradation mode (Designing and Deploying Internet-Scale Services) and are able to shed load as gaming traffic increases push them towards their resource limits. In Luke’s words: “the application has the ability to dynamically turn off any calls back to the platform. We have a dial that we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game. […]The way in which services degrade are to rate limit errors to that service and to implement service usage throttles. The key ideas are to isolate troubled and highly latent services from causing latency and performance issues elsewhere through use of error and timeout throttling, and if needed, disable functionality in the application using on/off switches and functionality based throttles.” These are good techniques that can be applied to all services.
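The pattern Luke describes, on/off switches plus error and timeout throttles around every platform call, can be sketched in a few lines of Python. The thresholds, window size, and names below are illustrative assumptions, not FarmVille’s actual implementation.

class DegradableCall:
    """Wrap a call to an external service so it can be throttled or switched off."""

    def __init__(self, name, error_threshold=0.5, window=100, enabled=True):
        self.name = name
        self.enabled = enabled            # the manual dial: turn this call off entirely
        self.error_threshold = error_threshold
        self.window = window
        self.recent = []                  # rolling record of recent successes/failures

    def _error_rate(self):
        return (self.recent.count(False) / len(self.recent)) if self.recent else 0.0

    def call(self, fn, *args, **kwargs):
        # Skip the call if it has been switched off or is failing too often;
        # the application keeps working, just with this feature degraded.
        if not self.enabled or self._error_rate() > self.error_threshold:
            return None
        try:
            result = fn(*args, **kwargs)  # in practice, also enforce a timeout here
            self.recent.append(True)
            return result
        except Exception:
            self.recent.append(False)
            return None
        finally:
            self.recent = self.recent[-self.window:]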

 

Lessons learned from scaling FarmVille:

1.      Interactive games are write-heavy. Typical web apps read more than they write, so many common architectures may not be sufficient. Read-heavy apps can often get by with a caching layer in front of a single database. Write-heavy apps will need to partition so writes are spread out and/or use an in-memory architecture.

2.    Design every component as a degradable service. Isolate components so increased latencies in one area won't ruin another. Throttle usage to help alleviate problems. Turn off features when necessary.

3.    Cache Facebook data. When you are deeply dependent on an external component consider caching that component's data to improve latency.

4.    Plan ahead for new release related usage spikes.

5.      Sample. When analyzing large streams of data (looking for problems, for example), not every piece of data needs to be processed. Sampling can yield the same results for much less work.

 

Check out the High Scalability article How FarmVille Scales to Harvest 75 Million Players a Month for more details.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Saturday, February 13, 2010 8:05:33 AM (Pacific Standard Time, UTC-08:00)
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
