Saturday, April 17, 2010

We live on a boat which has lots of upside but broadband connectivity isn’t one of them. As it turns out, our marina has WiFi but it is sufficiently unreliable that we needed another solution. I wish there was a Starbucks hotspot across the street – actually there is one within a block but we can’t quite pick up the signal even with an external antennae (Syrens). 

 

WiFi would have been a nice solution but didn’t work so we decided to go with WiMAX. We have used ClearWire for over a year on the boat and, generally, it has worked acceptably well. Not nearly as fast as WiFi but better than 3G cellular.  Recently ClearWire changed its name to Clear and “upgraded” the connectivity technology to full WiMAX. Unfortunately, the upgrade substantially reduced the coverage area, has been fairly unstable, and the Customer support although courteous and friendly is so far away from the engineering team that they basically just can’t make a difference no matter how hard they try.

 

We decided we had to find a different solution. I use AT&T 3G cellular with tethering and would have been fine with that as a solution. It’s a bit slower than Clear but its stable and coverage is very broad. Unfortunately, the “unlimited” plan we got some years ago is very limited to 5Gig/month and we move far more data than that. I can’t talk AT&T into offering a solution so, again, we needed something else.

 

Sprint now has a WiMAX service that offers good performance (although they can be a bit aggressive on throttling) and they have fairly broad coverage in our area and are expanding quickly (Sprint announces seven new WiMAX markets). Sprint has the additional nice feature on some modems where, if WiMAX is unavailable, it transparently falls back to 3G. The 3G service is still limited to 5Gig but, as long as we are on WiMAX a substantial portion of the month, we’re fine.

 

The remaining challenge was Virtual Private Networks (VPN) over WiMAX can be unstable. I really wish my work place supported Exchange RPC over HTTP (one of the coolest Outlook/Exchange features of all time). However, many companies believe that Exchange RPC over HTTP is insecure in that it doesn’t’ require 2 factor authentication. Ironically, many of these companies allow Blackberries’ and iPhones to access email without 2 factor auth. I won’t try to explain why one is unsafe and the other is fine but I think it might have something to do with the popularity of iPhones and Blackberries with execs and senior technical folks :-).

 

In the absence of RPC over HTTP, logging into the work network via VPN is the only answer. My work place uses Aventail but there are a million solutions out there. I’ve used many and love none.  There are many reasons why these systems can be unstable, cause blue screens, and otherwise negatively impact the customer experience. But one that has been driving me especially nuts is frequent dropped connections and hangs when using the VPN over WiMAX. It appears to happen more frequently when there is more data in flight but to lose a connection every few minutes is quite common. 

 

It turns out the problem is the default MTU on most client systems is 1500 but the WiMAX default is often smaller. It should still work and just be super inefficient but it doesn’t. For more details see http://www.amazon.com/Sierra-Wireless-Overdrive-Mobile-Hotspot/dp/B0032JTPMK.

 

To check Vista MTUs:

 

netsh interface ipv4 show subinterfaces

 

To change the MTU to 1400:

 

netsh interface ipv4 set subinterface "your vpn interface here" mtu=1400 store=persistent

 

I’m using an MTU of 1400 with Sprint and its working well. Thanks to Kitz.co.uk for the easy MTU update. If you are having flakey VPN support especially if running over WiMAX, check your MTU.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, April 17, 2010 5:43:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Ramblings
 Monday, April 12, 2010

Standards and benchmarks have driven considerable innovation. The most effective metrics are performance-based.  Rather than state how to solve the problem, they say what needs to be achieved and leave the innovation open.

 

I’m an ex-auto mechanic and was working as a wrench in a Chevrolet dealership in the early 80. I hated the emission controls that were coming into force at that time because they caused the cars to run so badly. A 1980 Chevrolet 305 CID with 4 BBL carburetor would barely idle in perfect tune. It was a mess. But, the emission standards didn’t say it had to run badly only what needed to be achieved. And, competition to achieve those goals produced compliant vehicles that ran well. Ironically, as emission standards forced more precise engine management, both fuel economy and power density has improved as well. Initially both suffered as did drivability but competition brought many innovations to market and we ended up seeing emissions compliance to increasingly strict standards at the same time that both power density and fuel economy improved.

 

What’s key is the combination of competition and performance-based standards.  If we set high goals and allow companies to innovate in how they achieve those goals, great things happen. We need to take that same lesson and apply it to data centers.

 

Recently, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) added data centers to their building efficiency standard, ASHRAE Standard 90.1. This standard defines the energy efficiency for most types of buildings in America and is often incorporated into building codes across the country. Unfortunately, as currently worded, this document is using a prescriptive approach. To comply, you must use economizers and other techniques currently in common practice. But, are economizers the best way to achieve the stated goal?  What about a system that harvested waste heat and applied it growing cash crops like Tomatoes? What about systems using heat pumps to scavenge low grade heat (see Data Center Waste Heat Reclaimation)? Both these innovations would be precluded by the proposed spec as they don’t use economizers.

               

Urs Hoelzle, Google’s Infrastructure SVP, recently posted Setting Efficiency Goals for Data Centers where he argues we need goal-based environmental targets that drive innovation rather than prescriptive standards that prevent it.  Co-signatures with Urs include:

·         Chris Crosby, Senior Vice President, Digital Realty Trust

·         Hossein Fateh, President and Chief Executive Officer, Dupont Fabros Technology

·         James Hamilton, Vice President and Distinguished Engineer, Amazon

·         Urs Hoelzle, Senior Vice President, Operations and Google Fellow, Google

·         Mike Manos, Vice President, Service Operations, Nokia

·         Kevin Timmons, General Manager, Datacenter Services, Microsoft

 

I thinks we’re all excited by the rapid pace of innovation in high scale data centers. We know its good for the environment and for customers.  And I think we’re all uniformly in agreement with ASHRAE in the intent of 90.1. What’s needed to make it a truly influential and high-quality standard is that it be changed to be performance-based rather than prescriptive. But, otherwise, I think we’re all heading in the same direction.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, April 12, 2010 9:39:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Friday, April 09, 2010

High scale network research is hard.  Running a workload over a couple of hundred servers says little of how it will run over thousands or tens of thousands servers. But, having 10’s of thousands of nodes dedicated to a test cluster is unaffordable. For systems research the answer is easy: use Amazon EC2. It’s an ideal cloud computing application. Huge scale is needed during some parts of the research project but the servers aren’t needed 24 hours/day and certainly won’t be needed for the three year amortization life of the servers.

 

However, for high-scale network research, finding the solution is considerably more difficult. In some dimensions, it’s no different from systems research in that purchasing a few thousand servers for the research projects makes no sense. But the easy answer of simply using EC2 doesn’t work in that EC2 nodes come fully provisioned with networking.  One solution that works well for many networking research problems is to use an overlay and test at scale in EC2. But, when new hardware devices are being investigated, unless they can be emulated with high fidelity using with software implementations running on EC2, this solution breaks down. 

 

For all but a few folks at Cisco and Juniper, running a multi-thousand node physical cluster to test new network gear is impractical. And it’s even less practical in academic settings. I’m lucky enough to work near many thousands of server nodes and a huge networking infrastructure. But, even then, installing a parallel network to do network research is difficult to afford. High-scale network research at credible scale is difficult.

 

Zhangxi Tan of Cal Berkeley came up to visit a couple of weeks back. I’m interested in Zhangxi’s work for two reasons: 1) its based upon reconfigurable computing -- a technology ready for commercial application and 2) the application of FPGA to network simulation might be a solution to the problem of how to test networking gear at credible scale.

 

Reconfigurable computing maintains the flexibility of reprogrammable software systems with the performance of high performance hardware implementations.  Or, worded differently, most of the performance of Application Specific Integrated Circuits (ASIC) with the flexibility of software. Most reconfigurable computing designs are based upon Field Programmable Gate Arrays (FPGA) and some high level instruction set or programming language to allow device reconfiguration.  Recently, C and C++ subset compilers have emerged that allow a constrained version C or C++ to be compiled directly to a FPGA and, once the software is stable, directly to an ASIC. See Platform-based Electronic Systems Level (ESL) Synthesis for more on reconfigurable computing and see Heterogeneous Computing using GPGPUs and FPGAs for related discussions on the application of hardware acceleration.

 

In the work that Zhangxi presented, the Cal Berkeley team is taking the RAMP gold FPGA-based many-core simulator (Research Accelerator for Multiple Processors) and applying it to the problem of high-scale network simulation with a goal of simulating an O(10k) server network. Zhangxi’s slides are here: Using FPGAs to Simulate Novel Datacenter Network Architectures at Scale and my rough notes follow:

·         Lots of work going on in data center network research: VL2, Dcell, PortLand,…

·         But:

o   the test scale is usually WAY smaller than the problem targeted by these systems

o   Often synthetic benchmarks are used rather than actual workloads

·         RAMP Gold is:

o   Full 32-bit SPARC v8 ISA support, including FP, traps and MMU.

o   Use abstract models with enough detail, but fast enough to run real apps/OS

o   Provide cycle-level accuracy

o   Cost-efficient: hundreds of nodes plus switches on a single FPGA

·         RAMP Gold implementation:

o   Based upon Xilinx XUP V5 board ($750)

o   Able to simulate 64 core, 2GB DDR2, FP and run production Linix

·         Tested using trace data from Facebook and Yahoo Hadoop runs

·         Demonstrating the incast TCP collapse problem and showed simulated results that closely matched actual measured results

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, April 09, 2010 6:57:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Wednesday, April 07, 2010

Mike Stonebraker published an excellent blog posting yesterday at the CACM site: Errors in Database Systems, Eventual Consistency, and the CAP Theorem. In this article, Mike challenges the application of Eric Brewer’s CAP Theorem by the NoSQL database community. Many of the high-scale NoSQL system implementers have argued that the CAP theorem forces them to go with an eventual consistent model.

 

Mike challenges this assertion pointing that some common database errors are not avoided by eventual consistency and CAP really doesn’t apply in these cases. If you have an application error, administrative error, or database implementation bug that losses data, then it is simply gone unless you have an offline copy. This, by the way, is why I’m a big fan of deferred delete.  This is a technique where deleted items are marked as deleted but not garbage collected until some days or preferably weeks later.  Deferred delete is not full protection but it has saves my butt more than once and I’m a believer. See On Designing and Deploying Internet-Scale Services for more detail.

 

CAP and the application of eventual consistency doesn’t directly protect us against application or database implementation errors. And, in the case of a large scale disaster where the cluster is lost entirely, again, neither eventual consistency nor CAP offer a solution. Mike also notes that network partitions are fairly rare.  I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts, remain a popular discussion point in post mortems industry-wide. I see this improving over the next 5 years but we have a long way to go. In Networking: the Last Bastion of Mainframe Computing, I argue that net gear is still operating on the mainframe business model: large, vertically integrated and expensive equipment, deployed in pairs. When it comes to redundancy at scale, 2 is a poor choice.

 

Mike’s article questions whether eventual consistency is really the right answer for these workloads. I made some similar points in “I love eventual consistency but…” In that posting, I argued that many applications are much easier to implement with full consistency and full consistency can be practically implemented at high scale. In fact, Amazon SimpleDB recently announced support for full consistency. Apps needed full consistency are now easier to write and, where only eventual consistency is needed, its available as well.

 

Don’t throw full consistency out too early. For many applications, it is both affordable and helps reduce application implementation errors.

 

                                                                --jrh

 

Thanks to Deepak Singh for pointing me to this article.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, April 07, 2010 11:58:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Services
 Tuesday, March 23, 2010

Every so often, I come across a paper that just nails it and this one is pretty good.. Using a market Economy to Provision Resources across a Planet-wide Clusters doesn’t fully investigate the space but it’s great to progress on this important area and the paper is a strong step in the right direction.

 

I spend much of my time working on driving down infrastructure costs. There is lots of great work that can be done in datacenter infrastructure, networking, and server design. It’s both a fun and important area. But, an even bigger issue is utilization. As an industry, we can and are driving down the cost of computing and yet it remains true that most computing resources never get used. Utilization levels at large and small companies typically run in the 10 to 20% range. I occasionally hear reference to 30% but it’s hard to get data to support it. Most compute cycles go wasted. Most datacenter power doesn’t get useful work done. Most datacenter cooling is not spent supporting productive work. Utilization is a big problem. Driving down the cost of computing certainly helps but it doesn’t address the core issue: low utilization.

 

That’s one of the reasons I work at a cloud computing provider.  When you have very large, very diverse workloads, wonderful things happen.  Workload peaks are not highly correlated. For example, tax preparation software is busy around tax time. Retail software towards the end of the year. Social networking while folks in the region are awake. All these peaks and valleys overlay to produce a much flatter peak to trough curve. As the peak to trough ratio decreases, utilization sky rockets.  You can only get these massively diverse workloads in public clouds and its one of the reasons why private clouds are a bit depressing (see Private Clouds are not the Future). Private clouds are so close to the right destination and yet that last turn was a wrong one and the potential gains won’t be achieved. I hate wasted work as much as I hate low utilization.

 

The techniques above smooth the aggregated utilization curve and, the flatter that curve gets, the higher the utilization, the lower the cost, and are better it is for the environment. Large public clouds get this curve flattened the workload peaks considerably but the goal of steady unchanging load 24 hours a day, 7 days a week isn’t achievable.  Even power companies have base load and peak load. What to do with the remaining utilization valleys? The next technique is to use a market economy to incent developers and users to use resources that aren’t currently fully utilized.  

 

In The Cost of A Cloud: Research Problems in Datacenter Networks, we argued that turning servers off is a mistake in that the most you can hope to achieve is to save the cost of the power which is tiny when compared to the cost of the servers, the cost of power distribution gear, and the cost of the mechanical systems. See the Cost of Power in Large-Scale Datacenters (I’ve got an update of this work coming – the changes are interesting but the cost of power remains the minority cost). Rather than shutting off servers, the current darling idea of the industry, we should be productively using the servers. If we can run any workload worth more than the marginal cost of power, we should. Again, a strong argument for public clouds with large pools of resources on which a market can be made.

 

Continuing with making a market and offering computing resources not under supply crunch (under-utilized) at lower costs, Amazon Web Services has a super interesting offering called spot instances. Spot instances allow customers to bid on unused EC2 capacity and allow them to run those instances as long as their bids exceed the current instance spot price.

 

The paper I mentioned above is heading in a similar direction but this time working on the Google MapReduce cluster utilization problem.  Technically the paper actually is working on a private cloud but its still nice work and it is using the biggest private cloud in the world at well over a million servers so I can’t complain too much. I really like the paper. From the conclusion:

 

In this paper, we have thus proposed a framework for allocating and pricing resources in a grid-like environment. This framework employs a market economy with prices adjusted in periodic clock auctions. We have implemented a pilot allocation system within Google based on these ideas. Our preliminary experiments have resulted in significant improvements in overall utilization; users were induced to make their services more mobile, to make disk/memory/network tradeoffs as appropriate in different clusters, and to fully utilize each resource dimension, among other desirable outcomes. In addition, these auctions have resulted in clear price signals, information that the company and its engineering teams can take advantage of for more efficient future provisioning.

 

It’s worth reading the full paper: http://www.stokely.org/papers/google-cluster-auctions.pdf. One of the authors, Murray Stokely of Google, also wrote an interesting blog entry Fun with Amazon Web Services where he developed many of the arguments above. Thanks to Greg Linden and Deepak Singh for pointing me to this paper. It made for a good read this morning and I hope you enjoy it as well.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, March 23, 2010 6:13:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Services
 Wednesday, February 24, 2010

I love eventual consistency but there are some applications that are much easier to implement with strong consistency. Many like eventual consistency because it allows us to scale-out nearly without bound but it does come with a cost in programming model complexity. For example, assume your program needs to assign work order numbers uniquely and without gaps in the sequence.  Eventual consistency makes this type of application difficult to write.

 

Applications built upon eventually consistent stores have to be prepared to deal with update anomalies like lost updates. For example, assume there is an update at time T1 where a given attribute is set to 2. Later, at time T2, the same attribute is set to a value of 3. What will the value of this attribute be at a subsequent time T3?  Unfortunately, the answer is we’re not sure. If T1 and T2 are well separated in time, it will almost certainly be 3. But it might be 2. And it is conceivable that it could be some value other than 2 or 3 even if there have been no subsequent updates. Coding to eventual consistency is not the easiest thing in the world. For many applications its fine and, with care, most applications can be written correctly on an eventually consistent model. But it is often more difficult.

 

What I’ve learned over the years is that strong consistency, if done well, can scale to very high levels. The trick is implementing it well. The naïve approach to achieve full consistency is to route all updates through a single master server but clearly this won’t scale. Instead divide the update space into a large number of partitions, each with its own master. That scales but there is still a tension between the number of partitions and the cost of maintaining many partitions and avoiding hot spots.  The obvious way to avoid hot spots is to use a large number of partitions but this increases partition management overhead.  The right answer is to be able to dynamically repartition to maintain a sufficient number of partitions and to be able to adapt to load increases on any single server by further spreading the update load.

 

There are many approaches to support dynamic hot sport management. One is to divide the workload into 10 to 100x more partitions than expected servers and make these fixed-sized partitions be the unit of migration. Servers with hot partitions will end up serving less partitions while servers with cold partitions will manage more. The other class of approaches, is to dynamically repartition. Start with large partitions and divide hot partitions to multiple smaller partitions to spread the load over multiple servers.

 

There are many variants of these techniques with different advantages and disadvantages. The constant is that full consistency is more affordable than many think. Clearly, eventual consistency remains a very good thing for workloads that don’t need full consistency and for workloads where the overhead of the above techniques is determined to be unaffordable. Both higher consistency models are quite useful.

 

This morning SimpleDB announced support for two new features that make it much easier to write many classes of applications: 1) consistent Reads, 2) Conditional put and delete.  Consistent reads allows applications that need full consistency to be easily written against SimpleDB. So, for example, if you wanted to implement an inventory management system that didn’t lose parts in the warehouse, doesn’t sell components twice, or place multiple orders, it would now be trivial to write this application against SimpleDB using the consistent read support. Consistent read is implemented as an optional Boolean flag on SimpleDB GetAttributes or select statements. Absence of the flag continues to deliver the familiar eventually consistent behavior with which many of you are very familiar with. If the flag is present and set, you get strong consistency. 

 

SimpleDB conditional PutAttributes and DeleteAttributes are a related feature that makes it much easier to write applications where the new value of an attribute are functionally related to the old value. Conditional update support allows a programmer to read the value of an attribute, operate upon it, and then write it back only if the value hasn’t changed in the interim which would render the planned update invalid. For example, say you were implementing a counter (+1). If the value of the counter at time T0 was 0, and subsequently an increment was applied at time T1 and another at increment was applied at time T2, what is the value of the counter? Using eventual consistency and, for simplicity, assuming no concurrent updates, the resulting value is probably is 2. Unfortunately, the value might be 1. With conditional updates, it will be 2.  Again, conditional puts and deletes are just another great tool to help write correct SimpleDB applications quickly and efficiently.

 

For more information on consistent reads and conditional put and delete, see SimpleDB Consistency Enhancements.

 

These two SimpleDB features have been in the works for some time and so it is exciting to see them announced and available today. It’s great to now be able to talk about these features publically. If you are interested in giving them a try, you can for free.  There is no charge for SimpleDB use for database sizes under 1GB (and silly close to free above that level). Go for it.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, February 24, 2010 3:17:57 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Monday, February 15, 2010

MySpace makes the odd technology choice that I don’t fully understand.  And, from a distance, there are times when I think I see opportunity to drop costs substantially. But, let’s ignore that, and tip our hat to the MySpace for incredibly scale they are driving. It’s a great social networking site and you just can’t argue with the scale they are driving. Their traffic is monstrous and, consequently, it’s a very interesting site to understand in more detail.

 

Lubor Kollar of SQL Server just sent me this super interesting overview of the MySpace service. My notes follow and the original article is at: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004532.

 

I particularly like social networking sites like Facebook and MySpace because they are so difficult to implement.  Unlike highly partitionable workloads like email, social networking sites work hard to find as many relationships, across as many  dimensions, amongst as many users as possible. I refer to this as the hairball problem. There are no nice clean data partitions which makes social networking sites amongst the most interesting of the high scale internet properties.  More articles on the hairball problem:

·         FriendFeed use of MySQL

·         Geo-Replication at Facebook

·         Scaling LinkedIn

 

The combination of the hairball problem and extreme scale makes the largest social networking sites like MySpace some of the toughest on the planet to scale.  Focusing on MySpace scale, it is prodigious:

·         130M unique monthly users

·         40% of the US population has MySpace accounts

·         300k new users each day

 

The MySpace Infrastructure:

·         3,000 Web Servers

·         800 cache servers

·         440 SQL Servers

 

Looking at the database tier in more detail:

·         440 SQL Server Systems hosting over 1,000 databases

·         Each running on an HP ProLiant DL585

o   4 dual core AMD procs

o   64 GB RAM

·         Storage tier: 1,100 disks on a distributed SAN (really!)

·         1PB of SQL Server hosted data

 

As ex-member of the SQL Server development team and perhaps less than completely unbiased, I’ve got to say that 440 database servers across a single cluster is a thing of beauty.

 

More scaling stores: http://perspectives.mvdirona.com/2010/02/07/ScalingSecondLife.aspx.

 

Hats off to MySpace for delivering a reliable service, in high demand, with high availability. Very impressive.

 

                                                                --jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, February 15, 2010 12:11:19 PM (Pacific Standard Time, UTC-08:00)  #    Comments [19] - Trackback
Services
 Saturday, February 13, 2010

Last week, I posted Scaling Second Life. Royans sent me a great set of scaling stories: Scaling Web Architectures and Vijay Rao of AMD pointed out How FarmVille Scales to Harvest 75 Million Players a Month. I find the Farmville example particularly interesting in that it’s “only” a casual game. Having spent most of my life (under a rock) working on high-scale servers and services, I naively would never have guessed that casual gaming was big business. But it is. Really big business. To put a scale point on what "big" means in this context, Zynga, the company responsible for Farmville, is estimated to have a valuation of between $1.5B and $3B (Zynga Raising $180M on Astounding Valuation) with annual revenues of roughly $250M (Zynga Revenues Closer to $250).

 

The Zynga games portfolio includes 24 games, the best known of which are Mafia Wars and FarmVille. The Farmville scaling story is an great example of how fast internet properties can need to scale. The game had 1M players after 4 days and 10M after 60 days.  

 

In this interview with FarmVille’s Luke Rajich (How FarmVille Scales to Harvest 75 Million Players a Month), Luke talks about scaling what he refers to as both the largest game in the world and the largest application on a web platform. FarmVille is a Facebook application and peak bandwidth between FarmVille and Facebook can run as high as 3Gbps. The FarmVille team has to manage both incredibly fast growth and very spikey traffic patterns. They have implemented what I call graceful degradation mode(Designing and Deploying Internet-Scale Services) and are able to shed load as load as gaming traffic increases push them towards their resource limits. In Luke’s words “the application has the ability to dynamically turn off any calls back to the platform. We have a dial that we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game. […]The way in which services degrade are to rate limit errors to that service and to implement service usage throttles. The key ideas are to isolate troubled and highly latent services from causing latency and performance issues elsewhere through use of error and timeout throttling, and if needed, disable functionality in the application using on/off switches and functionality based throttles.”  These are good techniques that can be applied to all services.

 

Lessons Learned from scaling Farmville:

1.      Interactive games are write-heavy. Typical web apps read more than they write so many common architectures may not be sufficient. Read heavy apps can often get by with a caching layer in front of a single database. Write heavy apps will need to partition so writes are spread out and/or use an in-memory architecture.

2.    Design every component as a degradable service. Isolate components so increased latencies in one area won't ruin another. Throttle usage to help alleviate problems. Turn off features when necessary.

3.    Cache Facebook data. When you are deeply dependent on an external component consider caching that component's data to improve latency.

4.    Plan ahead for new release related usage spikes.

5.      Sample. When analyzing large streams of data, looking for problems for example, not every piece of data needs to be processed. Sampling data can yield the same results for much less work.

 

Check out the High Scalability article How FarmVille Scales to Harvest 75 Million Players a Month for more details.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Saturday, February 13, 2010 8:05:33 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, February 07, 2010

As many of you know I collect high-scale scaling war stories. I’ve appended many of them below. Last week Ars Technica published a detailed article on Scaling Second Life: What Second Life can Teach your Datacenter About Scaling Web Apps. This article by Ian Wilkes who worked at Second Life from 2001 to 2009 where he was director of operations. My rough notes follow:

·         Understand scale required:

o   Billing system serving US and EU where each user interacts annually and the system has 10% penetration: 2 to 3 events/second

o   Chat system serving UE and EU where each user sends 10 message/day during workday: 20k messages/second

·         Does the system have to be available 24x7 and understand the impact of downtime (beware of over-investing in less important dimensions at the expense of those more important)

·         Understand the resource impact of features. Be especially cautious around relational database systems and object relational mapping frameworks. If nobody knows the resource requirements, expect trouble in the near future.

·         Database pain: “Almost all online systems use an SQL-based RDBMS, or something very much like one, to store some or all of their data, and this is often the first and biggest bottleneck. Depending on your choice of vendor, scaling a single database to higher capacity can range from very expensive to outright impossible. Linden's experience with uber-popular MySQL is illustrative: we used it for storing all manner of structured data, ultimately totaling hundreds of tables and billions of rows, and we ran into a variety of limitations which were not expected.”

·         MySQL specific issues:

o   Lacks alter table statement

o   Write heavy workload can run heavy CPU spikes due to internal lock conflicts

o   Lack of effective per-user governors means a single application can bring the system to its knees

·         Interchangeable parts :” A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It's not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they'll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more heavily interchangeable the parts are, the more quickly the team can respond to failures or new demands.”

·         Instrument, propagate, isolate errors:

o   It is important not to overlook transient, temporary errors in favor of large-scale failures; keeping good data about errors and dealing with them in an organized way is essential to managing system reliability.

o   Second Life has a large number of highly asynchronous back-end systems, which are heavily interdependent. Unfortunately, it had the property that under the right load conditions, localized hotspots could develop, where individual nodes could fall behind and eventually begin silently dropping requests, leading to lost data.

·          Batch jobs, the silent killer: Batch jobs bring two challenges: 1) sudden workload spikes and 2) inability to complete the job within the batch window.

·         Keep alerts under control: “I can't count the number of system operations people I've talked to (usually in job interviews as they sought a new position) who, at a growing firm, suffered from catastrophic over-paging.”

·         Beware of the “grand rewrite”

 

If you are interested in reading more from Ian at Second Life: Interview with Ian Wilkes From Linden Lab.

 

More from the Scaling-X series:

·         Scaling Second Life: http://perspectives.mvdirona.com/2010/02/07/ScalingSecondLife.aspx

·         Scaling Google: http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx

·         Scaling LinkedIn: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         Scaling Myspace: http://perspectives.mvdirona.com/2008/12/27/MySpaceArchitectureAndNet.aspx

·         Scaling Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

A very comprehensive list from Royans:  Scaling Web Architectures

 

Some time back for USENIX LISA, I brought together a set of high-scale services best practices:

·         Designing and Deploying Internet-Scale Services

 

If you come across other scaling war stories, send them my way: jrh@mvdirona.com.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 07, 2010 11:51:21 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Sunday, January 17, 2010

Cloud computing is an opportunity to substantially improve the economics of enterprise IT. We really can do more with less. 

 

I firmly believe that enterprise IT is a competitive weapon and, in all industries, the leaders are going to be those that invest deeply in information processing. The best companies in each market segment are going to be information processing experts and because of this investment, are going to know their customer better, will chose their suppliers better, will have deep knowledge and control of their supply chains, and will have an incredibly efficient distribution system. They will do everything better and more efficiently because of their information processing investment.  This is the future reality for retail companies, for financial companies, for petroleum exploration, for pharmaceutical, for sports teams, and for logistics companies. No market segment will be spared and, for many, it’s their reality today.  Investment in IT is the only way to serve customers and shareholders better than competitors.

 

It’s clear to me that investing in information technology is the future of all successful companies and it’s the present for most. The good news is that it really can be done more cost effectively, more efficiently, and with less environmental impact using cloud computing. We really can do more with less.

 

The argument for cloud computing is gaining acceptance industry-wide. But, private clouds are being embraced by some enterprises and analysts as the solution and the right way to improve the economics of enterprise IT infrastructure. Private clouds may feel like a step in the right direction but scale-economics make private clouds far less efficient than real cloud computing. What’s the difference? At scale, in a shared resource fabric, better services can be offered at lower cost with much higher resource utilization. We’ll look at both the cost and resource utilization advantages in more detail below.

 

At very high-scale it’s both affordable and efficient to have teams of experts in power distribution and mechanical systems on staff.  The major cloud computing providers have these teams and are inventing new techniques to lower costs, improve efficiency, and provide more environmentally sound solutions. This is very hard to do cost effectively at scale of less than 10s of megawatts.  Continuing that same argument to other domains, cloud computing providers have teams specialized in server and storage design.  And they are deeply invested in networking gear hardware and software. All of this is hard to justify at private cloud scales.

 

Cloud computing providers have 24x7 staff to monitor the services and to respond to customer issues. Doing service monitoring right is incredibly difficult and I’ve never seen it done well at anything less than multi-megawatt scales.

 

Cloud computing providers have some of the best distributed systems specialists in the world. They also have open source experts and depend deeply upon both open source and internally produced software.  They do this for two reasons: 1) at high-scale, things fail in new and interesting ways – operational excellence only comes from intimate knowledge of the entire hardware and software stack, and 2) when running at the high scale needed for efficiency, software licensing costs give up much of the excellent economics of a cloud service.

 

Resource utilization is even a stronger argument to move to a high-scale, shared infrastructure cloud. At scale, with high customer diversity, a wonderful property emerges: non-correlated peaks. Whereas each company has to provision to support their peak workload, when running in a shared cloud the peaks and valleys smooth.  The retail market peaks in November, taxation in April, some financial business peak on quarter ends and many of these workloads have many cycles overlaid some daily, some weekly, some yearly and some event specific.  For example, the death of Michael Jackson drove heavy workloads in some domains but had zero impact in others. A huge eastern seaboard storm drives massive peaks in a few businesses but has no impact on most. Large numbers of diverse workloads tend to average out and yield much higher utilization levels than are possible at low scale.  Private clouds can never achieve the utilization levels of shared clouds.

 

Last week Alistair Croll wrote an excellent InformationWeek article arguing that “the true cloud operators will have an unavoidable cost advantage because it's all they worry about. They'll also be closer to consumers (because they have POPs everywhere and partnerships with content delivery systems), and connecting with consumers and partners will become an increasingly essential part of any enterprise IT strategy.”  Have a look at Private Clouds are a Fix, Not the Future.

 

Private clouds are better than nothing but an investment in a private cloud is an investment in a temporary fix that will only slow the path to the final destination: shared clouds. A decision to go with a private cloud is a decision to run lower utilization levels, consume more power, be less efficient environmentally, and to run higher costs.

                                                                                                                          

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, January 17, 2010 9:20:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [24] - Trackback
Services
 Thursday, January 07, 2010

There is a growing gap between memory bandwidth and CPU power and this growing gap makes low power servers both more practical and more efficient than current designs.   Per-socket processor performance continues to increase much more rapidly than memory bandwidth and this trend applies across the application spectrum from mobile devices, through client, to servers. Essentially we are getting more compute than we have memory bandwidth to feed. 

 

We can attempt to address this problem two ways: 1) more memory bandwidth and 2) less fast processors. The former solution will be used and Intel Nehalem is a good example of this but costs increase non-linearly so the effectiveness of this technique will be  bounded. The second technique has great promise to reduce both cost and power consumption.

 

For more detail on this trend:

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the MicroSlice Servers

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

 

This morning GigOm reported that SeaMicro has just obtained a $9.3M Department of Energy grant to improve data center efficiency (SeaMicro’s Secret Server Changes Computing Economics).  SeaMicro is a Santa Clara based start-up that is building a 512 processor server based upon Intel Atom. Also mentioned was Smooth Stone who is designing a high-scale server based upon ARM processors. ARMs processors are incredibly power efficient, commonly used in embedded devices and by far the most common processor used in cell phones.

 

Over the past year I’ve met with both Smooth Stone and SeaMicro frequently and it’s great to see more information about both available broadly. The very low power server trend is real and advancing quickly. When purchasing servers, it needs to be all about work done per dollar and work done per joule

 

Congratulations to SeaMicro on the DoE grant.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, January 07, 2010 10:38:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Hardware
 Saturday, January 02, 2010

In this month’s Communications of the Association of Computing Machinery, a rematch of the MapReduce debate was staged.  In the original debate, Dave Dewitt and Michael Stonebraker, both giants of the database community, complained that:

 

1.    MapReduce is a step backwards in database access

2.    MapReduce is a poor implementation

3.    MapReduce is not novel

4.    MapReduce is missing features

5.    MapReduce is incompatible with the DBMS tools

 

Unfortunately, the original article appear to be no longer available but you will find the debate branching out from that original article by searching on the title Map Reduce: A Major Step Backwards. The debate was huge, occasionally entertaining, but not always factual.  My contribution was MapReduce a Minor Step forward.


Update: In comments, csliu offered updated URLs for the original blog post and a follow-on article:

·  MapReduce: A Major Step Backwards

· MapReduce II

 

I like MapReduce for a variety of reasons the most significant of which is that it allows non-systems programmers to write very high-scale, parallel programs with comparative ease.  There have been many attempts to allow mortals to write parallel programs but there really have only been two widely adopted solutions that allow modestly skilled programmers to write highly concurrent executions: SQL and MapReduce. Ironically the two communities participating in the debate,  Systems and Database, have each produced a great success by this measure. 

 

More than 15 years ago, back when I worked on IBM DB2, we had DB2 Parallel Edition running well over a 512 server cluster.  Even back then you could write a SQL Statement that would run over a ½ thousand servers.  Similarly, programmers without special skills can run MapReduce programs that run over thousands of serves. The last I checked Yahoo, was running MapReduce jobs over a 4,000 node cluster: Scaling Hadoop to 4,000 nodes at Yahoo!.

 

The update on the MapReduce debate is worth reading but, unfortunately, the ACM has marked the first article as “premium content” so you can only read it if you are a CACM subscriber:

·         MapReduce and Parallel DBMSs: Friend or Foe

·         MapReduce: A Flexible Data Processing Tool


Update: Moshe Vardi, Editor in Chief of the Communications of the Association of Computing Machinery has kindly decided to make both the of the above articles freely available for all whether or not CACM member. Thank you Moshe.

 

Even more important to me than the MapReduce debate is seeing this sort of content made widely available. I hate seeing it classified as premium content restricted to members only. You really all should be members but, with the plunging cost of web publishing, why can’t the above content be made freely available? But, while complaining about the ACM publishing policies, I should hasten to point out that the CACM has returned to greatness.  When I started in this industry, the CACM was an important read each month. Well, good news, the long boring hiatus is over.  It’s now important reading again and has been for the last couple of years. I just wish the CACM would follow the lead of ACM Queue and make the content more broadly available outside of the membership community.

 

Returning to the MapReduce discussion, in the second CACM article above, MapReduce: A Flexible Data Processing Tool, Jeff Dean and Sanjay Ghemawat, do a thoughtful job of working through some of the recent criticism of MapReduce.

 

If you are interested in MapReduce, I recommend reading the original Operating Systems Design and Implementation MapReduce paper: MapReduce: Simplied Data Processing on Large Clusters and the detailed MapReduce vs database comparison paper: A Comparison of Approaches to Large-Scale Data Analysis.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, January 02, 2010 8:15:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Software
 Saturday, December 19, 2009

The networking world remains one of the last bastions of the mainframe computing design point. Back in 1987 Garth Gibson, Dave Patterson, and Randy Katz showed we could aggregate low-cost, low-quality commodity disks into storage subsystems far more reliable and much less expensive than the best purpose-built storage subsystems (Redundant Array of Inexpensive Disks). The lesson played out yet again where we learned that large aggregations of low-cost, low-quality commodity servers are far more reliable and less expensive than the best purpose-built scale up servers. However, this logic has not yet played out in the networking world.

 

The networking equipment world looks just like mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are stack are single sourced and vertically integrated.  Just as you couldn’t run IBM MVS on a Burrows computer, you can’t run Cisco IOS on Juniper equipment.

When networking gear is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.

 

Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges. In the networking world, we have a long way to go but small steps are being made. Broadcom, Fulcrum, Marvell, Dune (recently purchased by Broadcom), Fujitsu and others all produce ASICs (the data plane CPU of the networking world). These ASICS are available for any hardware designer to pick up and use. Unfortunately, there is no standardization and hardware designs based upon one part can’t easily be adapted to use another.

 

In the X86 world, the combination of the X86 ISA, hardware platform, and the BIOS forms a De facto standard interface.  Any server supporting this low level interface can host the wide variety of different Linux systems, Windows, and many embedded O/Ss.  The existence of this layer allows software innovation above and encourages nearly unconstrained hardware innovation below.  New hardware designs work with existing software.  New software extensions and enhancements work with all the existing hardware platforms. Hardware producers get a wider variety of good quality operating systems.  Operating systems authors get a broad install base of existing hardware to target. Both get bigger effective markets. High volumes encourage greater investment and drive down costs.

 

This standardized layer hasn’t existed in the networking ecosystem as it has in the commodity server world. As a consequence, we don’t have high quality networking stacks able to run across a wide variety of networking devices. A potential solution is near: OpenFlow. This work originating out of the Stanford networking team driven by Nick McKeown. It is a low level hardware independent interface for updating network routing tables in a hardware independent-way. It is sufficiently rich to support current routing protocols and it also can support research protocols optimized at high-scale data center networking systems such as VL2 and PortLand. Current OpenFlow implementations exist on X86 hardware running linux, Broadcom, NEC, NetFPGA, Toroki, and many others.

 

The ingredients of an open stack are coming together. We have merchant silicon ASIC from Broadcom, Fulcrum, Dune and others. We have commodity, high-radix routers available from Broadcom (shipped by many competing OEMs), Arista, and others.  We have the beginnings of industry momentum behind OpenFlow which has a very good chance of being that low level networking interface we need. A broadly available, low-level interface may allow a high-quality, open source networking stack to emerge. I see the beginnings of the right thing happening.

 

·         OpenFlow web site: http://www.openflowswitch.org/

·         OpenFlow paper:  Enabling Innovation in Campus Networks

·         My Stanford Clean Slate Talk Slides: DC Networks are in my way

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, December 19, 2009 10:00:08 AM (Pacific Standard Time, UTC-08:00)  #    Comments [14] - Trackback
Hardware
 Friday, December 18, 2009

I'm on the technical program committe for ACM Science Cloud 2010. You should consider both submitting a paper and attending the conference. The conference will be held in Chicago on June21st, 2010 colocated with  ACM HPDC 2010 (High Performance Distributed Computing).

The call for papers abstracst are due Feb 22 with final papers due March 1st: http://dsl.cs.uchicago.edu/ScienceCloud2010/

Workshop Overview:

The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. The support for data intensive computing is critical to advancing modern science as storage systems have experienced an increasing gap between their capacity and bandwidth by more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize and interpret large datasets. Scientific computing involves a broad range of technologies, from high-performance computing (HPC) which is heavily focused on compute-intensive applications, high-throughput computing (HTC) which focuses on using many computing resources over long periods of time to accomplish its computational tasks, many-task computing (MTC) which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time, to data-intensive computing which is heavily focused on data distribution and harnessing data locality by scheduling of computations close to the data.

The 1st workshop on Scientific Cloud Computing (ScienceCloud) will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. The ScienceCloud workshop will focus on the use of cloud-based technologies to meet new compute intensive and data intensive scientific challenges that are not well served by the current supercomputers, grids or commercial clouds. What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation and sensor ensembles are both important new science pathways and tremendous challenges for current HPC/HTC/MTC technologies.  How can cloud technologies enable these new scientific approaches? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers?  What factors are limiting clouds use or would make them more usable/efficient?

This workshop encourages interaction and cross-pollination between those developing applications, algorithms, software, hardware and networking, emphasizing scientific computing for such cloud platforms. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and define architectures and services for future science clouds. 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, December 18, 2009 11:24:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, December 16, 2009

There were three big announcements this week at Amazon Web Services. All three announcements are important but the first is the one I’m most excited about in that it is a fundamental innovation in how computation is sold.

 

The original EC2 pricing model was  on-demand pricing. This is the now familiar pay-as-you-go and pay-as-you-grow pricing model that has driven much of the success of EC2.  Subsequently reserved instances were introduced. In the reserved instance pricing model, customers have the option of paying an up-front charge to reserve a server. There is still no obligation to use that instance but it is guaranteed to be available if needed by the customer. Much like a server you have purchased but turned off. Its not consuming additional resources but it is available when you need. Drawing analogy from the power production world, reserved instances are best for base load. This is capacity that is needed most of the time. 

 

On-demand instances are ideal for Peak Load. This is capacity that is needed to meet peak demand over the constant base load demand. Spot instances are a new, potentially very low cost instance type ideal for computing capacity that can be run with some time flexibility. This instance type will often allow workloads with soft deadline requirements to be run at very low cost.  What makes Spot particularly interesting is the Spot instance price fluctuates with the market demand. When demand is low, the spot instance price is low. When demand is higher, the price will increase exactly as the energy spot market functions.

 

Also announced this week were the Virtual Private Cloud unlimited beta and CloudFront streaming support.

 

Elastic Cloud Compute Spot Instances: Amazon EC2 Spot Instances are a new way to purchase and consume Amazon EC2 Instances. Spot Instances allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and customers whose bids meet or exceed it gain access to the available Spot Instances. Spot Instances are complementary to On-Demand Instances and Reserved Instances, providing another option for obtaining compute capacity. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs. Additionally, Spot Instances can provide access to large amounts of additional capacity for applications with urgent needs. To learn more, please visit the Amazon EC2 Spot Instances detail page.

 

Amazon Virtual Private Cloud Unlimited Beta: Amazon Virtual Private Cloud (Amazon VPC) is a secure and seamless bridge between a company’s existing IT infrastructure and the AWS cloud. Since August 2009, Amazon VPC has been in a limited beta, during which we’ve selectively granted access. Starting today, all current and future Amazon EC2 customer accounts are enabled to use Amazon VPC, but customers will not be charged for Amazon VPC until they begin using it. Amazon VPC enables enterprises to connect their existing infrastructure to a set of isolated AWS compute resources via a Virtual Private Network (VPN) connection, and to extend their existing management capabilities such as security services, firewalls, and intrusion detection systems to include their AWS resources. To get started with the service, please visit the Amazon VPC detail page.

 

Amazon CloudFront Streaming: Amazon CloudFront, the easy-to-use content delivery service, now supports the ability to stream audio and video files. Traditionally, world-class streaming has been out of reach of for many customers – running streaming servers was technically complex, and customers had to negotiate long- term contracts with minimum commitments in order to have access to the global streaming infrastructure needed to give high performance.

Amazon CloudFront is designed to make streaming accessible to anyone with media content. Streaming with Amazon CloudFront is exceptionally easy: with only a few clicks on the AWS Management Console or a simple API call, you’ll be able to stream your content using a world-wide network of edge locations running Adobe’s Flash® Media Server. And, like all AWS services, Amazon CloudFront streaming requires no up-front commitments or long-term contracts. There are no additional charges for streaming with Amazon CloudFront; you simply pay normal rates for the data that you transfer using the service. Visit the Amazon CloudFront page to learn more.

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, December 16, 2009 5:41:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Monday, December 14, 2009

Want to join a startup team within Amazon Web Services?  I’m deeply involved and excited about this project and another couple of talented engineers could really make a difference.  We are looking for:

 

User Interface Software Development Engineer

We are looking for an experienced engineer with a proven track record of building high quality, AJAX enabled websites. HTML, JavaScript, AJAX, and CSS experience is critical, along with Java and Tomcat. Experience with languages such as PHP, Perl, Ruby, Python, etc. is also useful. You must have significant experience in designing highly reliable and scalable distributed systems, including building front end website facing applications.  You must thrive in a hyper-growth environment where priorities shift fast, have strong OO design and implementation experience, knowledge of web protocols, and in-depth knowledge of Linux tools and Java EE architectures.

 

For more information: https://us-amazon.icims.com/jobs/107700/job

 

Senior Software Development Engineer

We are looking for a Senior Software Engineer with a strong track record of building production scalable, high end, reliable, data driven distributed website systems. You must be able to tackle tough challenges and feel strongly not only about building good software but about making that software achieve its goals in an operational reality. You must thrive in a hyper-growth environment where priorities shift fast, have strong OO design and implementation experience, knowledge of web protocols, and in-depth knowledge of Linux tools and Java EE architectures.

 

For more information: https://us-amazon.icims.com/jobs/109479/job

 

If you are interested, send a resume to aws-jobinfo-lee@amazon.com. I’m looking forward to working with you.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, December 14, 2009 8:01:45 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, December 02, 2009

For several years I’ve been interested in PUE<1.0 as a rallying cry for the industry around increased efficiency. From PUE and Total Power Usage Efficiency (tPUE) where I talked about PUE<1.0:

 

In the Green Grid document [Green Grid Data Center Power Efficiency Metrics: PUE and DCiE], it says that “the PUE can range from 1.0 to infinity” and goes on to say “… a PUE value approaching 1.0 would indicate 100% efficiency (i.e. all power used by IT equipment only).   In practice, this is approximately true. But PUEs better than 1.0 is absolutely possible and even a good idea.  Let’s use an example to better understand this.  I’ll use a 1.2 PUE facility in this case. Some facilities are already exceeding this PUE and there is no controversy on whether its achievable. 

 

Our example 1.2 PUE facility is dissipating 16% of the total facility power in power distribution and cooling. Some of this heat may be in transformers outside the building but we know for sure that all the servers are inside which is to say that at least 83% of the dissipated heat will be inside the shell. Let’s assume that we can recover 30% of this heat and use it for commercial gain.  For example, we might use the waste heat to warm crops and allow tomatoes or other high value crops to be grown in climates that would not normally favor them.  Or we can use the heat as part of the process to grow algae for bio-diesel.  If we can transport this low grade heat and net only 30% of the original value, we can achieve a 0.90 PUE.  That is to say if we are only 30% effective at monetizing the low-grade waste heat, we can achieve a better than 1.0 PUE.

 

Less than 1.0 PUE are possible and I would love to rally the industry around achieving a less than 1.0 PUE.  In the database world years ago, we rallied around the achieving 1,000 transactions per second.  The High Performance Transactions Systems conference was originally conceived with a goal of achieving these (at the time) incredible result.  1,000 TPS was eclipsed decades ago but HPTS remains a fantastic conference. We need to do the same with PUE and aim to get below 1.0 before 2015. A PUE less than 1.0 is hard but it can and will be done.

 

So, a PUE of less than 1.0 is totally possible but doing it efficiently and economically has proven elusive so far. The challenge is finding a process that can make use of the very low grade heat produced by data centers and turn it into economic gain. The challenge is producing economic gain from the low grade heat where the economic gain exceeds the combined capital and operational expense of recovering that energy.

 

In the posting Is Sandia National Lab's Red Sky Really Able to Deliver a PUE of 1.035?, I pointed to an innovative sewage waste heat reclamation system in Norway: Flush the loo, warm your house. In this system,  heat pumps are used to reclaim waste heat from sewage and convert to home heat. 

 

Other possible applications of waste heat are heating green houses to allow the growth of valuable crops in adverse climates.  See Vertical Farming for most radical extension of these ideas. Another possible approach is to grow biodiesel from microbes and use the low grade heat as a heat source for the culture. See A Better Biofuel for an example of this approach.

 

Yesterday, I came across an interesting application of waste heat reclamation from datacenters from Helsingin Energia (Helsinki public energy company).

In this proof of concept datacenter that will come on line next month, they have a conventional datacenter water cooling design but rather than releasing the waste heat to the atmosphere via a cooling tower or related technique, they run it through a heat pump to add heat to a heating loop to heat homes in the Finnish capital. The data center is located in an unused bomb shelter.

 

In a conversation I had earlier today, the project manager Sipilia Juha said:

 

We provide facilities for datacenter operators including underground property, electricity and cooling. We can capture almost 100% of the heat that comes out of the datacenter and put it in to the district heating system to heat buildings in Helsinki. Our customers make the detailed planning inside the premises and bring their own IT-equipment.

 

The cooling costs for the customer from 7€ to 20€ per MWh depending on the size of the center and of the time in the year. We can do it very ecologically and economically.

 

Computerworld also talked to Juha: Green Data Center Recycles Waste Heat.

 

I’ve been unable to get the details on the capital cost, the operational costs and the estimated cost recovery time and model used.  The facility won’t be live until January so, even with good cost models, they wouldn’t yet be calibrated by real operational experience.

 

They are aiming for a PUE of around 1.0 and its quite conceivable they will get there:

The energy efficiency of computer halls is quantified by the so-called efficiency factor which expresses the ratio of the total energy consumption and the energy used for actual computing. The efficiency factor of ordinary computer halls is between 1.5 and 2, with the figure for computer halls deemed to be extremely ecoefficient possibly under 1.5. The efficiency factor of Academica's and Helsingin Energia's hall is around one, and it is possible to get even below this figure.

 

The next test is to see if this level of efficiency can be achieved in a economically positively or at least without loss. It’s an interesting project. I’ll continue to watch this and similar proof of concept facilities closely.

 

A brochure from Helsingin Energia is at: Hel_En_Eco-efficient_computer_hall.pdf (1.93 MB).

 

               --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, December 02, 2009 7:54:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Monday, November 30, 2009

Very low-power scale-out servers -- it’s an idea whose time has come. A few weeks ago Intel announced it was doing Microslice servers: Intel Seeks new ‘microserver’ standard. Rackable Systems (I may never manage to start calling them ‘SGI’ – remember the old MIPS-based workstation company?) was on this idea even earlier: Microslice Servers. The Dell Data Center Solutions team has been on a similar path: Server under 30W.

 

Rackable has been talking about very low power servers as physicalization: When less is more: the basics of physicalization. Essentially they are arguing that rather than buying, more-expensive scale-up servers and then virtualizing the workload onto those fewer servers, buy many smaller servers. This saves the virtualization tax which can run 15% to 50% in I/O intensive applications and smaller and low-scale servers can produce more work done per joule and better work done per dollar. I’ve been a believer in this approach for years and wrote it up for the Conference on Innovative Data Research last year in The Case for Low-Cost, Low-Power Servers.

 

I’ve recently been very interested in the application of ARM processors to web-server workloads:

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

 

ARMs are an even more radical application of the Microslice approach.

 

Scale-down servers easily win on many workloads when looking at work done per dollar and work done per joule and I claim, if you are looking at single dimensional metrics, like performance, you aren’t looking hard enough. However, there are workloads where scale-up wins. They are absolutely required when the workload won’t partition and scale near linearly. Database workloads are classic examples of partition-resistant workloads that really do often run better on more-expensive, scale-up servers.

 

The other limit is administration. Non-automated IT shops believe they are better off with fewer, more-expensive servers although they often achieve this goal by running many operating system images on a single server.  Given that the bulk of administration is spent on the software stack, it’s not clear that this approach of running the same number of O/S images and software stacks on a single server is a substantial savings. However, I do agree that administration costs are important at low-scale. If, at high-scale, admin costs are over 10% of overall operational costs, go fix it rather than buying bigger, more expensive servers.

 

When do scale-up servers win economically? 1) very low-scale workloads where administration costs dominate, and 2) workloads that partition poorly and suffer highly-sub-linear scale-out.  Simple web workloads and other partition-tolerant applications should look to scale-down severs. Make sure your admin costs are sub-10% and don’t scale with server count. Then use work done per dollar and work done per joule and you’ll be amazed to see scale-down gets more done at lower cost and lower power consumption.

 

2010 is the year of the low-cost, scale-down server.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, November 30, 2009 7:04:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware
 Sunday, November 22, 2009

Sometime back I whined that Power Usage Efficiency (PUE) is a seriously abused term: PUE and Total Power Usage Efficiency.  But I continue to use it because it gives us a rough way to compare the efficiency of different data centers.  It’s a simple metric that takes the total power delivered to a facility (total power) and divides it by the amount of power delivered to the servers (critical power or IT load).  A PUE of 1.35 is very good today. Some datacenter owners have claimed to be as good as 1.2.  Conventionally designed data centers operated conservatively are in the 1.6 to 1.7 range.  Unfortunately most of the industry has a PUE of over 2.0, some are as bad as 3.0, and the EPA reports the industry average is 2.0 (Report to Congress on Server Data Center Efficiency). A PUE of 2.0 means that for each watt delivered to the IT load (servers, net gear, and storage), one watt is lost in cooling and in power distribution.

 

Whenever a metric becomes important, managers ask about it and marketing people use it.  Eventually we start seeing data points that are impossibly good. The recent Red Sky installation is one of these events. Sandia National Lab’s Red Sky supercomputer is reported to be delivering a PUE of 1.035 in a system without waste heat recovery. In Red Sky at Night, Sandia’s New Computer Might it is reported “The power usage effectiveness of Red Sky is an almost unheard-of 1.035”. The video referenced below also reports Red Sky at a 1.035 PUE. in response to the claimed PUE of 1.035, Rich Miller of Data Center Knowledge astutely asked “How’s this possible?” (see Red Sky: Supercomputing and Efficiency Meet).  

 

The data center knowledge article links to a blog posting Building Red Sky by Marc Hamilton which includes a wonderful time lapse video showing the building of Red Sky: http://www.youtube.com/watch?v=mNW9cYY4tqc. You should watch the 4 min and 51 second video and I’ll include my notes and observations from the video below. But, before we get to the video, let’s look more closely at the widely reported 1.035 PUE and what it would mean.

 

A PUE of 1.035 implies that for each 1 watt delivered to the servers, 0.035 is lost in power distribution and mechanical systems. For a facility of this size, I suspect they will get delivered high voltage in the 115kV range. In a conventional power distribution design, they will take 115kV and transform it to mid-voltage (13kV range), then to 480V 3p, then to 208V to be delivered to the servers. In addition to all these conversions, there is some loss in the conductors themselves. And there is considerable loss in even the very best uninterruptable power supply (UPS) systems.  In fact, a UPS alone with 3.5% loss is excellent. Excellent power distribution designs will avoid 1 or perhaps 2 of the conversions above and will use a full bypass UPS. But, getting these excellent power distribution designs to even within a factor of 2 of the reported 3.5% loss is incredibly difficult and I’m very skeptical that they are going to get much below 6% to 7%. In fact, if anyone knows how to get down below 6% loss in the power distribution system measured fully, I’m super interested and would love to see what you have done, buy you lunch, and do a datacenter a tour.

 

A 6% loss in power distribution would limit the PUE to nothing lower than 1.06. But, we still have the cooling system to account for. Air is an expensive fluid to move long distances. Consequently, Red Sky brings the water to the server racks using Sun Cooling Door Systems (similar to the IBM iDataPlex Rear Door Cooling system).

 

The Sun Cooling Door System is a nice designs that will significantly improve PUE over more conventional CRAC-based mechanical designs. Generally, bringing water close to the heat load in systems that use water (rather than aggressive free-air only designs) is a good approach. The Sun advertising material credibly reports that “A highly efficient datacenter utilizing a holistic design for closely coupled cooling using Sun Cooling Door Systems can reach a PUE of 1.3”.

 

I know of no way to circulate air through a heat exchanger, pump water to the outside of the building, and then cool the water using any of the many technologies available that can be done at only a 3.5% loss.  Which is to say that a PUE of 1.035 can’t be done with the Red Sky mechanical system design even if power distribution losses were ignored completely. I like Red Sky but suspect we’re looking at a 1.35 PUE system rather than the reported 1.035.  But, that’s OK, 1.35 is quite good and, for a top 10 super computer, it’s GREAT.   

 

Note that a PUE of 1.035 is technically possible with waste heat recovery and, in fact, even less than 1.0 can be achieved with waste heat recovery. See the “PUE less than 1.0” section of PUE and Total Power Usage Efficiency for more data on waste heat recovery.  Remember this is “technically possible” rather than achieved in production today. It’s certainly possible to do today but doing it cost effectively is the challenge.  I have seen it applied to related domains that also have large quantities of low grade heat. For example, a city in Norway is experimenting with waste heat recovery from Sewage: Flush the loo, warm your house.

 

My notes from the Red Sky Video follow:

·         47,232 cores of Intel EM64T Xeon X55xx (Nehalem-EP) 2930 MHz (11.72 GFlops)

o   553 Teraflops

·         Infiniband QDR interconnect

o   1,440 cables totally 9.1 miles

·         Operating System: CentOS

·         Main Memory: 22,104 GB

·         266 VA [jrh: this is clearly incorrect unless they are talking about each server]

o   Each reach is 32kW

·         96 JBOD enclosures

o   2,304 1TB disks

·         12 GB RAM/note & 70TB total

·         PUE 1.035 [jrh: I strongly suspect they meant 1.35]

·         328 tons cooling

·         7.3million gallons of water per year

 

The video is worth watching although if you play with cross referencing the numbers above, there appear to be many mistakes: Red Sky time Lapse.  Thanks to Jeff Bar for sending this one my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, November 22, 2009 8:30:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Friday, November 20, 2009

I’m on the program committee for the ACM Symposium on Cloud Computing. The conference will be held June 10th and 11th 2010 in Indianapolis Indiana. SOCC brings together database and operating systems researchers and practitioners interested in cloud computing. It is jointly sponsored by the ACM Special Interest Group on Management of Data (SIGMOD) and the ACM Special Interest Group on Operating Systems (SIGOPS). The conference will be held in conjunction with ACM SIGMOD in 2010 and with SOSP in 2011 continuing to alternate between SIGMOD and SOSP in subsequent years.

 

Joe Hellerstein is the SOCC General Chair and Surajit Chaudhuri and Mendel Rosenblum are Program Chairs. The rest of the SOCC organizers are at: http://research.microsoft.com/en-us/um/redmond/events/socc2010/organizers.htm.  If you are interested in cloud computing in general and especially if you are interested in systems or database issues and their application to cloud computing, consider submitting a paper (copied below). The paper submission deadline for SOCC is January 15, 2010. Get writing!

 

                                                                --jrh

 

The ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) is the first in a new series of symposia with the aim of bringing together researchers, developers, users, and practitioners interested in cloud computing. This series is co-sponsored by the ACM Special Interest Groups on Management of Data (ACM SIGMOD) and on Operating Systems (ACM SIGOPS). ACM SOCC will be held in conjunction with ACM SIGMOD and ACM SOSP Conferences in alternate years, starting with ACM SIGMOD in 2010.

The scope of SOCC Symposia will be broad and will encompass diverse systems topics such as software as a service, virtualization, and scalable cloud data services. Many facets of systems and data management issues will need to be revisited in the context of cloud computing. Suggested topics for paper submissions include but are not limited to:

 

  Administration and Manageability

  Data Privacy

  Data Services Architectures

  Distributed and Parallel Query Processing

  Energy Management

  Geographic Distribution

  Grid Computing

  High Availability and Reliability

  Infrastructure Technologies

  Large Scale Cloud Applications

  Multi-tenancy

  Provisioning and Metering

  Resource management and Performance

  Scientific Data Management

  Security of Services

  Service Level Agreements

  Storage Architectures

  Transactional Models

  Virtualization Technologies

 

Organizers


General Chair:
Joseph M. Hellerstein, U. C. Berkeley

Program Chairs:
Surajit Chaudhuri, Microsoft Research
Mendel Rosenblum, Stanford University

Treasurer:
Brian Cooper, Yahoo! Research

Publicity Chair:
Aman Kansal, Microsoft Research

Steering Committee
Phil Bernstein, Microsoft Research
Ken Birman, Cornell University
Joseph M. Hellerstein, U. C. Berkeley
John Ousterhout, Stanford University
Raghu Ramakrishnan, Yahoo! Research
Doug Terry, Microsoft Research
John Wilkes, Google

Technical Program Committee:
Anastasia Ailamaki, EPFL
Brian Bershad, Google
Michael Carey, UC Irvine
Felipe Cabrera, Amazon
Jeff Chase, Duke
Dilma M da Silva, IBM
David Dewitt, Microsoft
Shel Finkelstein, SAP
Armando Fox, UC Berkeley
Tal Garfinkel, Stanford
Alon Halevy, Google
James Hamilton, Amazon
Jeff Hammerbacher, Cloudera
Joe Hellerstein, UC Berkeley
Alfons Kemper, Technische Universität München
Donald Kossman, ETH
Orran Krieger, Vmware
Jeffrey Naughton, University of Wisconsin, Madison
Hamid Pirahesh, IBM
Raghu Ramakrishnan, Yahoo!
Krithi Ramamritham, Indian Institute of Technology, Bombay
Donovan Schneider, Salesforce.com
Andy Warfield, University of British Columbia
Hakim Weatherspoon, Cornell

Paper Submission

Authors are invited to submit original papers that are not being considered for publication in any other forum. Manuscripts should be submitted in PDF format and formatted using the ACM camera-ready templates available at http://www.acm.org/sigs/pubs/proceed/template.html. See the Paper Submission page for details on the submission procedure.

A submission to the symposium may be one of the following three types:
(a) Research papers: We seek papers on original research work in the broad area of cloud computing. The length of research papers is limited to twelve pages.
(b) Industrial papers: The symposium will also be a forum for high quality industrial presentations on innovative cloud computing platforms, applications and experiences on deployed systems. Submissions for industrial presentations can either be an extended abstract (1-2 pages) or an industrial paper up to 6 pages long.
(c) Position papers: The purpose of a position paper is to expose a new problem or advocate a new approach to an old idea. Participants will be invited based on the submission's originality, technical merit, topical relevance, and likelihood of leading to insightful technical discussions at the symposium. A position paper can be no more than 6 pages long.

Important Dates

Paper Submission: Jan 15, 2010 (11:59pm, PST)
Notification: Feb 22, 2010
Camera-Ready: Mar 22, 2010

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, November 20, 2009 6:24:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<April 2010>
SunMonTueWedThuFriSat
28293031123
45678910
11121314151617
18192021222324
2526272829301
2345678

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton