Thursday, July 09, 2009

MIT's Barbara Liskov was awarded the 2008 Association for Computing Machinery Turing Award. The Turing Award is the highest distinction in computer science and is often referred to as the Nobel Prize of computing. Past award winners are listed at: http://en.wikipedia.org/wiki/Turing_Award.

The full award citation:

Barbara Liskov has led important developments in computing by creating and implementing programming languages, operating systems, and innovative systems designs that have advanced the state of the art of data abstraction, modularity, fault tolerance, persistence, and distributed computing systems.

The Venus operating system was an early example of principled operating system design. The CLU programming language was one of the earliest and most complete programming languages based on modules formed from abstract data types and incorporating unique intertwining of both early and late binding mechanisms. ARGUS extended many of the CLU ideas to distributed programming, and incorporated the first versions of nested transactions to maintain predictable consistencies. Other advances include solutions elegantly combining theory and pragmatics in the areas of decentralized information flow, replicated storage and caching of persistent objects, and modular upgrading of distributed systems. Her contributions have been incorporated into the practice of programming, thereby influencing many of the most important systems used today: for programming, specification, systems design, and distributed architectures.

From: http://awards.acm.org/citation.cfm?id=1108679&srt=year&year=2008&aw=140&ao=AMTURING

 

The cover article in the July Communications of the ACM was on the award: http://cacm.acm.org/magazines/2009/7/32083-liskovs-creative-joy/fulltext.

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, July 09, 2009 8:43:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, July 08, 2009

Our industry has always moved quickly but the internet and high-scale services have substantially quickened the pace. Search is an amazingly powerful productivity tool and is effectively available for free to all. The internet makes nearly all information available to anyone who can obtain time on an internet connection. Social networks and interest-area specific discussion groups are bringing together individuals of like interest from all over the globe. The cost of computing is falling rapidly and new services are released daily. The startup community has stayed viable through one of the most severe economic downturns since the Great Depression. Infrastructure-as-a-service offerings allow new businesses to be built with very little seed investment. I'm amazed at the quality of companies I'm seeing that have 100% bootstrapped without VC funding. Everything is changing.

 

Netbooks have made low-end computers close to free and, in fact, some are released on the North American cell phone model where a multi-year service contract subsidizes the device. I've seen netbooks offered for free with a three-year wireless contract. This morning I came across yet more evidence of healthy change: a new client operating system alternative. The Wall Street Journal reports that Google Plans to Launch Operating System for PCs (http://online.wsj.com/article/SB124702911173210237.html). Other articles: http://news.google.com/news?q=google+to+launch+operating+system&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&um=1&ie=UTF-8&hl=en&ei=s5hUSsTlO4PUsQPX7dCaDw&sa=X&oi=news_group&ct=title&resnum=1.

 

The new O/S is Linux based and Linux has long been an option on netbooks. What's different in this case is that a huge commercial interest is behind advancing the O/S and intends to make it a viable platform on more capable client systems rather than just netbooks. These new lightweight, connected products are made viable by the combination of widespread connectivity and the proliferation of very high-quality, high-function services. Having a new O/S player in the game will almost certainly increase the rate of improvement.

 

Alternatives continue to emerge, the cost of computing continues to fall, the pace of change continues to quicken, and everyone from individual consumers through the largest enterprises is gaining from the increased pace of innovation. It's a fun time to participate in this industry.

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 08, 2009 5:16:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, June 30, 2009

Microsoft announced yesterday that it was planning to bring both Chicago and Dublin online next month. Chicago is initially to be a 30MW critical load facility with a plan to build out to a booming 60MW. Two-thirds of the facility is a high-scale containerized deployment. It's great to see the world's second modular data center going online (see http://perspectives.mvdirona.com/2009/04/01/RoughNotesDataCenterEfficiencySummitPosting3.aspx for details on an earlier Google facility).

 

The containers in Chicago will hold 1,800 to 2,500 servers each. Assuming 200W/server, that's roughly 1/2 MW for each container with 80 containers on the first floor and a 40MW container critical load. The PUE estimate for the containers is 1.22, which is excellent, but it's very common to include all power conversions below 480VAC and all air moving equipment in the container as critical load, so these numbers can end up not meaning much. See http://perspectives.mvdirona.com/2009/06/15/PUEAndTotalPowerUsageEfficiencyTPUE.aspx for more details on why a better definition of what is infrastructure and what is critical load is needed.
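
The arithmetic behind those numbers is straightforward. Here is a minimal Python sketch using the assumptions above (200W/server and 80 containers; both are estimates, not Microsoft-published figures):

# Rough container and facility power math for the Chicago containers.
# Assumes the 200W/server estimate and 80-container count quoted above.
servers_per_container = 2500      # upper end of the 1,800-2,500 range
watts_per_server = 200            # assumed average server draw
containers = 80                   # first-floor container count

container_kw = servers_per_container * watts_per_server / 1000.0
facility_mw = containers * container_kw / 1000.0
print(f"Per-container critical load: {container_kw:.0f} kW")    # ~500 kW
print(f"Total container critical load: {facility_mw:.0f} MW")   # ~40 MW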

 

Back on April 10th, Data Center Knowledge asked Is Microsoft still committed to containers?  It looks like the answer is unequivocally YES!

 

Dublin is a non-containerized facility initially 5MW with plans to grow to 22MW as demand requires it. The facility is heavily dependent on air-side economization which should be particularly effective in Dublin.

 

More from:

·         Microsoft Blog: http://blogs.technet.com/msdatacenters/archive/2009/06/29/microsoft-brings-two-more-mega-data-centers-online-in-july.aspx

·         Data Center Knowledge: http://www.datacenterknowledge.com/archives/2009/06/29/microsoft-to-open-two-massive-data-centers/

·         MJF: http://blogs.zdnet.com/microsoft/?p=3200

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, June 30, 2009 5:44:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Wednesday, June 24, 2009

I presented the keynote at the International Symposium on Computer Architecture 2009 yesterday.  Kathy Yelick kicked off the conference with the other keynote on Monday: How to Waste a Parallel Computer.

 

Thanks to ISCA Program Chair Luiz Barroso for the invitation and for organizing an amazingly successful conference. I'm just sorry I had to leave a day early to attend a customer event this morning. My slides: Internet-Scale Service Infrastructure Efficiency.

 

Abstract: High-scale cloud services provide economies of scale of five to ten over small-scale deployments, and are becoming a large part of both enterprise information processing and consumer services. Even very large enterprise IT deployments have quite different cost drivers and optimization points from internet-scale services. The former are people-dominated from a cost perspective whereas internet-scale service costs are driven by server hardware and infrastructure with people costs fading into the noise at less than 10%.

 

In this talk we inventory where the infrastructure costs are in internet-scale services. We track power distribution from 115kV at the property line through all conversions into the data center, tracking the losses to final delivery at semiconductor voltage levels. We track cooling and all the energy conversions from power dissipation through release to the environment outside of the building. Understanding where the costs and inefficiencies lie, we'll look more closely at cooling and overall mechanical system design, server hardware design, and software techniques including graceful degradation mode, power yield management, and resource consumption shaping.
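
As a rough illustration of the loss tracking the talk walks through, the Python sketch below chains per-stage conversion efficiencies from the utility feed down to the silicon. The stage list and efficiency numbers are illustrative assumptions, not figures from the talk:

# Hypothetical power distribution chain from the property line to the board.
# Stage efficiencies are illustrative assumptions, not measured values.
stages = {
    "high-voltage step-down (115kV -> 13.2kV)": 0.995,
    "medium-voltage step-down (13.2kV -> 480V)": 0.98,
    "UPS / power conditioning": 0.94,
    "PDU (480V -> 208V)": 0.98,
    "server PSU (208VAC -> 12VDC)": 0.90,
    "voltage regulators (12V -> silicon voltages)": 0.88,
}

delivered = 1.0  # one unit of power at the property line
for stage, efficiency in stages.items():
    delivered *= efficiency
    print(f"after {stage}: {delivered:.3f}")

print(f"fraction of input power reaching the silicon: {delivered:.1%}")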


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, June 24, 2009 6:21:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Monday, June 22, 2009

Title: Ten Ways to Waste a Parallel Computer

Speaker: Katherine Yelick

 

An excellent keynote talk at ISCA 2009 in Austin this morning. My rough notes follow:

·         Moore’s law continues

o   Frequency growth replaced by core count growth

·         HPC has been working on this for more than a decade but HPC is concerned as well

·         New World Order

o   Performance through parallelism

o   Power is overriding h/w concern

o   Performance is now a software concern

·         What follows are Yelick's top 10 ways to waste a parallel computer

·         #1: Build system with insufficient memory bandwidth

o   Multicore puts us on the wrong side of the memory wall

o   Key metrics to look at:

§  Memory size/bandwidth (time to fill memory)

§  Memory size * alg intensity / op-per-sec (time to process memory)

·         #2: Don’t Take Advantage of hardware performance features

o   Showed example of speedup from tuning nearest-neighbor 7 point stencil on a 3D array

o   Huge gains but hard to do by hand.  Need to do it automatically at code gen time.

·         #3: Ignore Little’s Law

o   Required concurrency = bandwidth * latency (a worked sketch follows these notes)

o   Observation is that most apps are running WAY less than full memory bandwidth [jrh: this isn’t because these apps aren’t memory bound. They are waiting on memory with small requests. Essentially they are memory request latency bound rather than bandwidth bound. They need larger requests or more outstanding requests]

o   To make effective use of the machine, you need:

§  S/W prefetch

§  Pass memory around caches in some cases

·         #4: Turn functional problems into performance problems

o   Fault resilience introduces inhomogeneity in execution rates

o   Showed a graph that showed ECC recovery rates (very common) but that the recovery times are substantial and the increased latency of correction is substantially slowing the computation. [jrh: more evidence that non-ECC designs such as current Intel Atom are not workable in server applications.  Given ECC correction rates, I’m increasingly becoming convinced that non-ECC client systems don’t make sense.]

·         #5: Over-Synchronize Applications

o   View parallel executions as directed acyclic graphs of the computation

o   Hiding parallelism in a library tends to over serialize (too many barriers)

o   Showed work from Jack Dongarra on PLASMA as an example

·         #6: Over-synchronize Communications

o   Use a programming model in which you can’t utilize b/w or “low” latency

o   As an example, compared GASNet and MPI with GASNet delivering far higher bandwidth

·         #7: Run Bad Algorithms

o   Algorithmic gains have far outstripped Moore’s law over the last decade

o   Examples: 1) adaptive meshes rather than uniform, 2) sparse matrices rather than dense, and 3) reformulation of problem back to basics.

·         #8: Don’t rethink your algorithms

o   Showed examples of sparse iterative methods and optimizations possible

·         #9: Choose “hard” applications

o   Examples of such systems

§  Elliptic: steady state, global space dependence

§  Hyperbolic: time dependent, local space dependence

§  Parabolic: time dependent, global space dependence

o   There is often no choice – we can’t just ignore hard problems

·         #10: Use heavy-weight cores optimized for serial performance

o   Used Power5 as an example of a poor design by this measure and showed a stack of "better" performance/power

§  Power5:

·         389 mm^2

·         120W @ 1900 MHz

§  Intel Core2 sc

·         130 mm^2

·         15W @ 1000 MHz

§  PowerPC450 (BlueGene/P)

·         8mm^2

·         3W @ 850 MHz

§  Tensilica (cell phone processor)

·         0.8mm^2

·         0.09W @ 650 MHz

o   [jrh: This last point is not nearly well enough understood. Far too many systems are purchased on performance when they should be purchased on work done per $ and work done per joule.]

·         Note: Large scale machines have 1 unrecoverable memory error (UME) per day [jrh: again more evidence that no-ECC server designs such as current Intel Atom boards simply won’t be acceptable in server applications, nor embedded, and with memory sizes growing evidence continues to mount that we need to move to ECC on client machines as well]

·         HPC community shows that parallelism is key but serial performance can’t be ignored.

·         Each factor of 10 increase in performance tends to require algorithmic rethinks
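
A quick worked example of the Little's Law point above. The bandwidth and latency figures are round illustrative numbers, not measurements from the talk:

# Little's Law: required concurrency = bandwidth * latency.
memory_bandwidth_bytes_per_s = 10e9   # 10 GB/s per socket (assumed)
memory_latency_s = 100e-9             # 100 ns load-to-use latency (assumed)
cache_line_bytes = 64

bytes_in_flight = memory_bandwidth_bytes_per_s * memory_latency_s
lines_in_flight = bytes_in_flight / cache_line_bytes
print(f"bytes that must be in flight to saturate memory: {bytes_in_flight:.0f}")  # ~1000
print(f"outstanding cache-line requests needed: {lines_in_flight:.0f}")           # ~16

# An app issuing one cache-line miss at a time sees roughly
# cache_line_bytes / memory_latency_s = 640 MB/s, a small fraction of peak,
# which is the latency-bound behavior noted in the [jrh] comment above.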

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, June 22, 2009 7:04:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Sunday, June 14, 2009

I like Power Usage Effectiveness (PUE) as a coarse measure of data center infrastructure efficiency. It gives us a way of speaking about the efficiency of the data center power distribution and mechanical equipment without having to qualify the discussion on the basis of the servers and storage used, utilization levels, or other issues not directly related to data center design. But there are clear problems with the PUE metric. Any single metric that attempts to reduce a complex system to a single number is going to fail to model important details and is going to be easy to game. PUE suffers from some of both; nonetheless, I find it useful.

 

In what follows, I give an overview of PUE, talk about some of the issues I have with it as currently defined, and then propose some improvements in PUE measurement using a metric called tPUE.

 

What is PUE?

PUE is defined in Christian Belady's Green Grid Data Center Power Efficiency Metrics: PUE and DCiE. It's a simple metric; that simplicity is part of why it's useful and also the source of some of the flaws in the metric. PUE is defined to be:

 

                                PUE = Total Facility Power / IT Equipment Power

 

Total Facility Power is defined to be “power as measured at the utility meter”.  IT Equipment Power is defined as “the load associated with all of the IT equipment”. Stated simply, PUE is the ratio of the power delivered to the facility divided by the power actually delivered to the servers, storage, and networking gear. It gives us a measure of what percentage of the power actually gets to the servers with the rest being lost in the infrastructure.  These infrastructure losses include power distribution (switch gear, uninterruptable power supplies, Power Distribution Units, Remote Power Plugs, etc.) and mechanical systems (Computer Room Air Handlers/Computer Room Air Conditioners, cooling water pumps, air moving equipment outside of the servers, chillers, etc.).   The inverse of PUE is called Data Center Infrastructure Efficiency (DCiE):

 

                                DCiE = IT Equipment Power / Total Facility Power * 100%

 

So, if we have a PUE of 1.7 that’s a DCiE of 59%.  In this example, the data center infrastructure is dissipating 41% of the power and the IT equipment the remaining 59%.
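
For concreteness, a minimal sketch of that arithmetic (the 1.7 PUE example above, with an assumed 1,000kW IT load):

# PUE and DCiE from the definitions above, using the 1.7 PUE example.
def pue(total_facility_kw, it_equipment_kw):
    return total_facility_kw / it_equipment_kw

def dcie(total_facility_kw, it_equipment_kw):
    return it_equipment_kw / total_facility_kw * 100.0

total_kw = 1700.0   # assumed facility draw, for illustration only
it_kw = 1000.0      # assumed IT equipment draw

print(f"PUE  = {pue(total_kw, it_kw):.2f}")    # 1.70
print(f"DCiE = {dcie(total_kw, it_kw):.0f}%")  # 59%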

 

This is useful to know in that it allows us to compare different infrastructure designs and understand their relative value. Unfortunately, where money is spent, we often see metrics games and this is no exception. Let's look at some of the issues with PUE and then propose a partial solution.

 

Issues with PUE

Total Facility Power: The first issue is the definition of total facility power. The original Green Grid document defines total facility power as "power as measured at the utility meter". This sounds fairly complete at first blush but it's not nearly tight enough. Many smaller facilities meter at 480VAC but some facilities meter at mid-voltage (around 13.2kVAC in North America). And a few facilities meter at high voltage (~115kVAC in North America). Still others purchase and provide the land for the 115kVAC to 13.2kVAC step-down transformer layer but still meter at mid-voltage.

 

Some UPSs are installed at medium voltage whereas others are at low voltage (480VAC). Clearly the UPS has to be part of the infrastructure overhead.

 

The implication of the above observations is that some PUE numbers include the losses of two voltage conversion layers getting down to 480VAC, some include one conversion, and some don't include any of them. This muddies the water considerably, makes small facilities look somewhat better than they should, and is just another opportunity to inflate numbers beyond what the facility can actually produce.

 

Container Game: Many modular data centers are built upon containers that take 480VAC as input. I've seen modular data center suppliers that chose to call the connection to the container "IT equipment", which means the normal conversion from 480VAC to 208VAC (or sometimes even to 110VAC) is not included. This seriously skews the metric but the negative impact is even worse on the mechanical side. The containers often have the CRAH or CRAC units in the container. This means that large parts of the mechanical infrastructure are being included under "IT load" and this makes these containers look artificially good. Ironically, the container designs I'm referring to here actually are pretty good. They really don't need to play metrics games but it is happening, so read the fine print.

 

Infrastructure/Server Blur: Many rack-based modular designs use large rack-level fans rather than multiple inefficient fans in the servers. For example, the Rackable CloudRack C2 (SGI is still Rackable to me :)) moves the fans out of the servers and puts them at the rack level. This is a wonderful design that is much more efficient than tiny 1RU fans. Normally the server fans are included as "IT load" but in these modern designs that move fans out of the servers, it's considered infrastructure load.

 

In extreme cases, fan power can be upwards of 100W (please don't buy these servers). This makes a data center running more efficient servers potentially have to report a higher (worse) PUE number. We don't want to push the industry in the wrong direction. Here's one more. The IT load normally includes the server Power Supply Unit (PSU) but in many designs such as IBM iDataPlex the individual PSUs are moved out of the server and placed at the rack level. Again, this is a good design and one we're going to see a lot more of, but it takes losses that were previously IT load and makes them infrastructure load. PUE doesn't measure the right thing in these cases.

 

PUE less than 1.0: The Green Grid document says that "the PUE can range from 1.0 to infinity" and goes on to say "… a PUE value approaching 1.0 would indicate 100% efficiency (i.e. all power used by IT equipment only)." In practice, this is approximately true. But PUEs better than 1.0 are absolutely possible and even a good idea. Let's use an example to better understand this. I'll use a 1.2 PUE facility in this case. Some facilities are already exceeding this PUE and there is no controversy about whether it's achievable.

 

Our example 1.2 PUE facility is dissipating 16% of the total facility power in power distribution and cooling. Some of this heat may be in transformers outside the building but we know for sure that all the servers are inside, which is to say that at least 83% of the dissipated heat will be inside the shell. Let's assume that we can recover 30% of this heat and use it for commercial gain. For example, we might use the waste heat to warm crops and allow tomatoes or other high-value crops to be grown in climates that would not normally favor them. Or we can use the heat as part of the process to grow algae for bio-diesel. If we can transport this low-grade heat and net only 30% of the original value, we can achieve a 0.90 PUE. That is to say, if we are only 30% effective at monetizing the low-grade waste heat, we can achieve a better than 1.0 PUE.
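
Working through that arithmetic explicitly (a sketch of the example above, treating monetized waste heat as an offset against facility power; the 30% recovery figure is the assumption stated above):

# PUE-below-1.0 arithmetic for the 1.2 PUE example above.
it_load = 1.0                      # normalize IT load to 1.0 unit
pue = 1.2
facility_power = it_load * pue     # 1.2 units arrive at the property line
heat_inside_shell = it_load        # server heat dissipated inside the building
recovery_fraction = 0.30           # net 30% of that heat's value (assumption)

effective_facility_power = facility_power - recovery_fraction * heat_inside_shell
print(f"effective PUE: {effective_facility_power / it_load:.2f}")   # 0.90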

 

PUEs of less than 1.0 are possible and I would love to rally the industry around achieving a less than 1.0 PUE. In the database world years ago, we rallied around achieving 1,000 transactions per second. The High Performance Transaction Systems conference was originally conceived with a goal of achieving this (at the time) incredible result. 1,000 TPS was eclipsed decades ago but HPTS remains a fantastic conference. We need to do the same with PUE and aim to get below 1.0 before 2015. A PUE less than 1.0 is hard but it can and will be done.

 

tPUE Defined

Christian Belady, the editor of the Green Grid document, is well aware of the issues I raise above. He proposes that PUE be replaced over the long haul by the Data Center Productivity (DCP) index. DCP is defined as:

 

                                DCP = Useful Work / Total Facility Power

 

I love the approach but the challenge is defining "useful work" in a general way. How do we come up with a measure of useful work that spans all interesting workloads over all host operating systems? Some workloads use floating point and some don't. Some use special-purpose ASICs and some run on general-purpose hardware. Some software is efficient and some is very poorly written. I think the goal is the right one but there never will be a way to measure it in a fully general way. We might be able to define DCP for a given workload type but I can't see a way to use it to speak about infrastructure efficiency in a fully general way.

 

Instead I propose tPUE, a modification of PUE that mitigates some of the issues above. Admittedly it is more complex than PUE but it has the advantage of equalizing different infrastructure designs and allows comparison across workload types. Using tPUE, an HPC facility can compare how it is doing against commercial data processing facilities.

 

tPUE standardizes where the total facility power is to be measured and precisely where the IT equipment starts and what portions of the load are infrastructure vs. server. With tPUE we attempt to remove some of the negative incentive around the blurring of the lines between IT equipment and infrastructure. Generally, this blurring is a very good thing. 1RU fans are incredibly inefficient so replacing them with large rack- or container-level impellers is a good thing. Multiple central PSUs can be more efficient, so moving the PSU from the server out to the module or rack again is a good thing. We want a metric that measures the efficiency of these changes correctly. PUE, as currently defined, will actually show a negative "gain" in both examples.

 

We define tPUE as:

 

tPUE = Total Facility Power / Productive IT Equipment Power

 

This is almost identical to PUE. It's the next level of definitions that is important. The tPUE definition of "Total Facility Power" is fairly simple: it's the power delivered to the medium voltage (~13.2kVAC) source prior to any UPS or power conditioning. Most big facilities are delivered power at this voltage level or higher. Smaller facilities may get 480VAC delivered, in which case this number is harder to get. We solve the problem by using a transformer manufacturer specified number if measurement is not possible. Fortunately, the efficiency numbers for high voltage transformers are accurately specified by manufacturers.

 

For tPUE the facility voltage must be actually measured at medium voltage if possible. If not possible, it is permissible to measure at low voltage (480VAC in North America and 400VAC in many other geographies) as long as the efficiency loss of the medium voltage transformer(s) is included. Of course, all measurements must be before UPS or any form of power conditioning. This definition permits using a non-measured, manufacturer-specified efficiency number for the medium voltage to low transformer but it does ensure that all measurements are using medium voltage as the baseline.

 

The tPUE definition of "Productive IT Equipment Power" is somewhat more complex. PUE measures IT load as the power delivered to the IT equipment. But high-scale data center IT equipment is breaking the rules. Some servers have fans inside and some use the infrastructure fans. Some have no PSU and are delivered 12VDC by the infrastructure whereas most still have some form of PSU. tPUE "charges" all fans and all power conversions to the infrastructure component. I define "Productive IT Equipment Power" to be all power delivered to semiconductors (memory, CPU, northbridge, southbridge, NICs), disks, ASICs, FPGAs, etc. Essentially we're moving the PSU losses, the voltage regulator down (VRD) and/or voltage regulator module (VRM) losses, and cooling fans from "IT load" to infrastructure. In this definition, infrastructure losses unambiguously include all power conversions, UPS, switch gear, and other losses in distribution. And it includes all cooling costs whether they are in the server or not.

 

The hard part is how to measure tPUE. It achieves our goal of being comparable since everyone would be using the same definitions. And it doesn't penalize innovative designs that blur the conventional lines between server and infrastructure. I would argue we have a better metric, but the challenge is how to measure it. Will data center operators be able to measure it, track improvements in their facilities, and understand how they compare with others?

 

We've discussed how to measure total facility power. The short summary is that it must be measured prior to all UPS and power conditioning at medium voltage. If high voltage is delivered directly to your facility, you should measure after the first step-down transformer. If your facility is delivered low voltage, then ask your power supplier, whether it be the utility, the colo facility owner, or your company's infrastructure group, for the efficiency of the medium-to-low step-down transformer at your average load. Add this value in mathematically. This is not perfect but it's better than where we are right now when we look at a PUE.

 

At the low voltage end where we are delivering "productive IT equipment power" we're also forced to mix estimates with our measurements. What we want to measure is the power delivered to individual components: the power delivered to memory, CPU, etc. Our goal is to get power after the last conversion and this is quite difficult since VRDs are often on the board near the component they are supplying. Given that non-destructive power measurement at this level is not easy, we use an inductive ammeter on each conductor delivering power to the board. Then we get the VRD efficiencies from the system manufacturer (you should be asking for these anyway – they are an important factor in server efficiency). In this case, we often can only get efficiency at rated power and the actual efficiency of the VRD will be less in your usage. Nonetheless, we use this single efficiency number since it at least is an approximation and more detailed data is either unavailable or very difficult to obtain. We don't include fan power (server fans typically run on a 12 volt rail). Essentially what we are doing is taking the definition of IT Equipment load used by the PUE definition and subtracting off VRD, PSU, and fan losses. These measurements need to be taken at full server load.
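
A minimal sketch of this estimation procedure for a single server, assuming per-rail currents measured with an inductive ammeter and a manufacturer-specified VRD efficiency (every number below is hypothetical, for illustration only):

# Estimating Productive IT Equipment Power for one server, per the procedure above.
rails = {                     # measured with an inductive ammeter at full load
    "12V": (12.0, 18.0),      # (volts, amps) -> 216 W delivered to the board
    "5V": (5.0, 4.0),         # 20 W
    "3.3V": (3.3, 3.0),       # ~10 W
}
vrd_efficiency = 0.88         # manufacturer-specified, at rated power
fan_power_w = 12.0            # server fans (12V rail), charged to infrastructure

board_input_w = sum(volts * amps for volts, amps in rails.values())
productive_w = (board_input_w - fan_power_w) * vrd_efficiency

# Facility power allocated to this server, measured at medium voltage before
# UPS and power conditioning (hypothetical per-server share).
facility_w_per_server = 420.0

print(f"productive IT equipment power: {productive_w:.0f} W")
print(f"tPUE (per-server share): {facility_w_per_server / productive_w:.2f}")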

 

The measurements above are not as precise as we might like but I argue the techniques will produce a much more accurate picture of infrastructure efficiency than the current PUE definitions and yet these metrics are both measurable and workload independent.

 

Summary:

We have defined tPUE to be:

 

tPUE = Total Facility Power / Productive IT Equipment Power

 

We defined total facility power to be measured before all UPS and power conditioning at medium voltage.  And we defined Productive IT Equipment Power to be server power not including PSU, VRD and other conversion losses nor including fan or cooling power consumption.

 

Please consider helping to evangelize tPUE and use tPUE. And, for you folks designing and building commercial servers, if you can help by measuring the Productive IT Equipment Power for one or more of your SKUs, I would love to publish your results.  If you can supply Productive IT Equipment Power measurement for one of your newer servers, I’ll publish it here with a picture of the server.

 

Let’s make the new infrastructure rallying cry achieving a tPUE<1.0.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, June 14, 2009 4:53:34 PM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
Hardware
 Saturday, June 13, 2009

Erasure coding provides redundancy for greater than single disk failure without 3x or higher redundancy. I still like full mirroring for hot data but the vast majority of the world's data is cold and much of it never gets referenced after it is written: Measurement and Analysis of Large-Scale Network File System Workloads. For less-than-hot workloads, erasure coding is an excellent solution. Companies such as EMC, Data Domain, Maidsafe, Allmydata, Cleversafe, and Panasas are all building products based upon erasure coding.

 

At FAST 2009 in late February, A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage was presented. This paper looks at five open source erasure coding libraries and compares their relative performance. The open source erasure coding packages implement Reed-Solomon, Cauchy Reed-Solomon, Even-Odd, Row-Diagonal Parity (RDP), and Minimal Density RAID-6 codes.

 

The authors found:

·         The special-purpose RAID-6 codes vastly outperform their general-purpose counterparts. RDP performs the best of these by a narrow margin.

·         Cauchy Reed-Solomon coding outperforms classic Reed-Solomon coding significantly, as long as attention is paid to generating good encoding matrices.

·         An optimization called Code-Specific Hybrid Reconstruction  is necessary to achieve good decoding speeds in many of the codes.

·         Parameter selection can have a huge impact on how well an implementation performs. Not only must the number of computational operations be considered, but also how the code interacts with the memory hierarchy, especially the caches.

·         There is a need to achieve the levels of improvement that the RAID-6 codes show for higher numbers of failures.

 

The paper also provides a good introduction to how erasure coding works. Recommended. I expect erasure codes to spring up in many more applications in the near future.
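
For readers new to the idea, the toy Python example below shows the simplest possible erasure code: a single XOR parity block (RAID-4/5 style) that lets any one lost block be rebuilt from the survivors. The RAID-6 and Reed-Solomon codes evaluated in the paper generalize this to tolerate two or more failures; this sketch is only an illustration, not one of the codes benchmarked:

# Toy single-parity erasure code: any ONE lost block can be rebuilt by
# XORing the surviving data blocks with the parity block.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"block-A!", b"block-B!", b"block-C!"]   # three equal-sized data blocks
parity = xor_blocks(data)                        # one parity block

# Simulate losing block 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
print("recovered:", rebuilt)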

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, June 13, 2009 9:42:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Wednesday, June 03, 2009

Don MacAskill gave one of his usual excellent talks at MySQL Conf 09. My rough notes follow.

 

Speaker: Don MacAskill

Video at: http://mysqlconf.blip.tv/file/2037101

·         SmugMug:

o   Bootstrapped in ’02 and still operating without external funding

o   Profitable and without debt

o   Top 400 website

o   Doubling yearly

·         SmugMug Challenge:

o   Users get unlimited storage & bandwidth            

o   Photos up to 48Mpix (more than 500m)

o   Video up to 1920x1080p

·         300+ four core hosts (mostly diskless)

o   Mostly AMD but really excited by Intel Nehalem [JRH: so am I]

·         5 datacenters (3 in Silicon Valley, 1 in Seattle, and 1 in Virginia) [JRH: corrected from 4 to 5 -- thanks Modesto Alexandre]

·         Only 2 ops guys

·         Lots of AWS use (Simple Storage Service, Elastic Compute Cloud, etc.)

·         Service deployment model: servers automatically load their config from a central role database. On reboot, the configured role is loaded.  Role change is a DB update followed by a reboot. [JRH: very nice]

·         Binary data all stored in Amazon S3 (PB of data at this point)

·         Akamai for content distribution network

·         Structured data

o   MySQL (InnoDB mostly)

o   Scaled up and out using cheap multi-core CPUs with lots of memory

o   4+ cores, 64GB memory, >2TB storage

·         Heavy use of MemcacheD (over 1TB of memory)

o   Over 96% hit rate and fall back to MySQL for cold data access

o   Been using it since first released 4 to 5 years back

·         Compute:

o   Amazon EC2 for photo and video processing and encoding

o   Depend upon EC2 for scaling up to high-traffic times and, more importantly, being able to scale down during low-traffic times such as the middle of the night (SmugMug is predominantly a North American service at this point). During scale-down periods they run 10s of cores and during scale-up periods 100s if not 1000s of cores.

§  Totally autonomous scaling up and down using SkyNet (written by SmugMug)

·         Web Servers:

o   Diskless with PXE boot

·         MySQL:

o   Most important technology in use at SmugMug

o   Super dependent on replication for performance, reliability, and high availability

o   No data loss in over 7 years

o   No joins or other 4.x+ features

§  Like the Drizzle project (http://en.wikipedia.org/wiki/Drizzle_(database_server)) since it re-focuses MySQL on the core they actually use – lean and mean.

o   Vertically partitioned. They have looked at sharding several times but have always managed to find a way to avoid it so far

·         InnoDB

o   Running 1.0.3+ patches (Percona XtraDB) in production (great for concurrency bound issues)

§  Great relationship with Percona (“Crazy concentration of talent under 1 roof”) who does MySQL support

·         MySQL Details:

o   Data integrity is number 1 issue

o   Next most important is write latency since scaling reads is relatively easy.

o   Replication kept at less than 1sec behind

o   Big RAM (64GB+) to keep indexes in memory

o   Previously had many concurrency issues (better now).

·         MySQL Usage:

o   Not very relational. Mostly a key-value store

o   Very denormalized

o   No  joins or complex selects

o   96% MemcacheD hit rate to cool MySQL

·         MySQL Issues:

o   Need a better filesystem:

§  They use the CentOS linux distro

§  MySQL is storage intensive (IOPS & capacity)

§  Ext3 is broken and sucks. Fsck sucks as well

§  Ext4 is also old and busted

§  Want good volume management

§  Ext3 serializes writes to a given file

§  Love ZFS

·         Transactional, copy-on-write, end-to-end data integrity, on the fly corruption detection and repair, integrated volume management, snapshots and clones supported, and open source software

·         Unfortunately ZFS doesn’t run on Linux and SmugMug is a Linux shop

o   Replication:

§  Unknown state on crash

§  Did *.info get written at commit or 2 months out of date (in one instance)?

·         Transactional replication to the rescue

§  Bringing up TB+ slaves is slow

§  Backups using LVM/ZFS a pain

§  Single thread for replication can fall behind

§  Transactional replication patches from Google are GREAT and solves these issues

·         InnoDB only

·         Taking these patches to production next week.

·         Sun Sushi Toro aka S7410

o   NAS box with a few twists:

§  2x quad-core Opterons with 64GB RAM

§  100GB Readzilla SSD

§  2x 18GB Writezilla SSDs (20k write IOPS)

§  22x 1TB 7200 RPM HDD

§  Clustered for HA

§  SSD performance with HDD economy

§  Toro supports ZFS on Linux

§  Can access using: NFS, iSCSI, CIFS, HTTP, FTP, etc.

§  Supports compression (1.5 compression ratio on their workload)

§  Cost: $80k ($142k clustered) – nobody pays list price though

§  SmugMug has 5 of these devices

§  5 different MySQL workloads hosted on a single shared cluster

§  Backups are a breeze (great snapshot support with roll back)

·         Rollback can selectively skip operations

·         Investigating 10GigE and actively testing

o   Intel NICS with Arista switches at less than $500/port

o   Using copper twinax SFP+

·         Expect 100% SSD in the future (not for bulk data)

·         Excited about Drizzle (scaled down MySQL)

·         Request from Oracle:

o   MySQL is a crown jewel – take care of it

o   GPL ZFS (lots of applause)

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, June 03, 2009 6:57:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Thursday, May 28, 2009

I’ve brought together links to select past postings and posted them to: http://mvdirona.com/jrh/AboutPerspectives/. It’s linked to the blog front page off the “about” link. I’ll add to this list over time. If there is a Perspectives article not included that you think should be, add a comment or send me email.

 

Talks and Presentations

Data Center Architecture and Efficiency

Service Architectures

Storage

Server Hardware

High-Scale Service Optimizations, Techniques, & Random Observations

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, May 28, 2009 4:45:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Ramblings
 Friday, May 22, 2009

Two years ago I met with the leaders of the newly formed Dell Data Center Solutions team and they explained they were going to invest deeply in R&D to meet the needs of very high scale data center solutions.  Essentially Dell was going to invest in R&D for a fairly narrow market segment. “Yeah, right” was my first thought but I’ve been increasingly impressed since then. Dell is doing very good work and the announcement of Fortuna this week is worthy of mention.

  

Fortuna, the Dell XS11-VX8, is an innovative server design. I actually like the name as proof that the DCS team is an engineering group rather than a marketing team. What marketing team would choose XS11-VX8 as a name unless they just didn't like the product?

 

The name aside, this server is excellent work. It is based on the Via Nano and the entire server draws just over 15W at idle and just under 30W at full load. It's a real server with 1GigE ports and full remote management via IPMI 2.0 (stick with the DCMI subset). In a fully configured rack, they can house 252 servers requiring only 7.3kW. Nice work DCS!

 

 

6 min video with more data: http://www.youtube.com/watch?v=QT8wEgjwr7k.

 

                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

Friday, May 22, 2009 10:15:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware
 Thursday, May 21, 2009

Cloud services provide excellent value but it's easy to underestimate the challenge of getting large quantities of data to the cloud. When moving very large quantities of data, even the fastest networks are surprisingly slow. And many companies have incredibly slow internet connections. Back in 1996, Minix author and networking expert Andrew Tanenbaum said "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway". For large data transfers, it's faster (and often cheaper) to write to local media and ship the media via courier.
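
A quick back-of-envelope comparison shows why. The data size, link speed, and courier time below are illustrative assumptions:

# Back-of-envelope: ship the media vs. push the bits over the wire.
data_tb = 10.0
link_mbps = 100.0                  # a generous sustained uplink (assumed)

data_bits = data_tb * 1e12 * 8
transfer_days = data_bits / (link_mbps * 1e6) / 86400
print(f"{data_tb:.0f} TB over {link_mbps:.0f} Mbps: {transfer_days:.1f} days")  # ~9 days

courier_days = 1.0                 # overnight shipment of the drive (assumed)
print("courier wins" if courier_days < transfer_days else "network wins")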

 

This morning the beta release of AWS Import/Export was announced. This service essentially implements sneakernet, allowing the efficient transfer of very large quantities of data into or out of the AWS Simple Storage Service. This initial beta release only supports import but the announcement reports that "the service will be expanded to include export in the coming months".

 

To use the service, the data is copied to a portable storage device formatted with the NTFS, FAT, ext2, or ext3 file systems. The manifest that describes the data load job is digitally signed using the sending user's AWS secret access key and the device is shipped to Amazon for loading. Load charges are:

Device Handling

·         $80.00 per storage device handled.

Data Loading Time

·         $2.49 per data-loading-hour. Partial data-loading-hours are billed as full hours.

Amazon S3 Charges

·         Standard Amazon S3 Request and Storage pricing applies.

·         Data transferred between AWS Import/Export and Amazon S3 is free of charge (i.e. $0.00 per GB).

In addition to allowing much faster data ingestion, AWS Import/Export reduces networking costs since there is no charge for the transfer of data between the Import/Export service and S3. A calculator is provided to compare estimated electronic transfer costs vs. import/export costs. It's a clear win for larger data sets.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, May 21, 2009 5:49:27 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Wednesday, May 20, 2009

From an interesting article in Data Center Knowledge Who has the Most Web Servers:

The article goes on to speculate on server counts at the companies that don't publicly disclose them but likely have over 50k. Google is likely around a million, Microsoft is over 200k, and "Amazon says very little about its data center operations".

 

                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, May 20, 2009 4:45:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Tuesday, May 19, 2009

 

Our 1999 Mitsubishi 3000 VR4 For Sale. Black-on-black with 80,000 miles. $12,500 OBO. Fewer than 300 1999 VR-4s were produced for North America, and only 101 in black-on-black.

 

We love this car and hate to sell it, but we are living in downtown Seattle and no longer need a car. It's a beautiful machine with 320 HP and it handles incredibly well. We're often stopped on the street and asked if we would sell it, and now we are.

 

Details and pictures at: http://www.mvdirona.com/somerset/vr4.html.

 

Our house in Bellevue is for sale as well: 4509 Somerset Pl SE, Bellevue, Wa. Virtual tour: http://vifp.com/presentation/video_flash.php?PresID=U6I3JN153KNBW7625R826E0KP2411X18&CustID=0&Branded=logo14909.jpg.


 

                                    --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com 

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, May 19, 2009 5:25:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Ramblings
 Monday, May 18, 2009

Earlier this morning Amazon Web Services announced the public beta of Amazon CloudWatch, Auto Scaling, and Elastic Load Balancing. Amazon CloudWatch is a web service for monitoring AWS resources. Auto Scaling automatically grows and shrinks Elastic Compute Cloud capacity based upon demand. Elastic Load Balancing distributes workload over a fleet of EC2 servers.

  • Amazon CloudWatch – Amazon CloudWatch is a web service that provides monitoring for AWS cloud resources, starting with Amazon EC2. It provides you with visibility into resource utilization, operational performance, and overall demand patterns—including metrics such as CPU utilization, disk reads and writes, and network traffic. To use Amazon CloudWatch, simply select the Amazon EC2 instances that you’d like to monitor; within minutes, Amazon CloudWatch will begin aggregating and storing monitoring data that can be accessed using web service APIs or Command Line Tools. See Amazon CloudWatch for more details.
  • Auto Scaling – Auto Scaling allows you to automatically scale your Amazon EC2 capacity up or down according to conditions you define. With Auto Scaling, you can ensure that the number of Amazon EC2 instances you’re using scales up seamlessly during demand spikes to maintain performance, and scales down automatically during demand lulls to minimize costs. Auto Scaling is particularly well suited for applications that experience hourly, daily, or weekly variability in usage. Auto Scaling is enabled by Amazon CloudWatch and available at no additional charge beyond Amazon CloudWatch fees. See Auto Scaling for more details.
  • Elastic Load Balancing – Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances. It enables you to achieve even greater fault tolerance in your applications, seamlessly providing the amount of load balancing capacity needed in response to incoming application traffic. Elastic Load Balancing detects unhealthy instances within a pool and automatically reroutes traffic to healthy instances until the unhealthy instances have been restored. You can enable Elastic Load Balancing within a single Availability Zone or across multiple zones for even more consistent application performance. Amazon CloudWatch can be used to capture a specific Elastic Load Balancer’s operational metrics, such as request count and request latency, at no additional cost beyond Elastic Load Balancing fees. See Elastic Load Balancing for more details.

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, May 18, 2009 5:16:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, May 16, 2009

A couple of weeks back, a mini-book by Luiz André Barroso and Urs Hölzle of the Google infrastructure team was released. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines is just over 100 pages long but an excellent introduction into very high scale computing and the issues important at scale.

 

From the Abstract:

As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board.

 

Some of the points I found particularly interesting (a quick sketch of the rack-level oversubscription arithmetic follows these excerpts):

·         Networking:

o   Commodity switches in each rack provide a fraction of their bi-section bandwidth for interrack communication through a handful of uplinks to the more costly cluster-level switches. For example, a rack with 40 servers, each with a 1-Gbps port, might have between four and eight 1-Gbps uplinks to the cluster-level switch, corresponding to an oversubscription factor between 5 and 10 for communication across racks. In such a network, programmers must be aware of the relatively scarce cluster-level bandwidth resources and try to exploit rack-level networking locality, complicating software development and possibly impacting resource utilization. Alternatively, one can remove some of the cluster-level networking bottlenecks by spending more money on the interconnect fabric.

·         Server Power Usage:

·         Buy vs Build:

Traditional IT infrastructure makes heavy use of third-party software components such as databases and system management software, and concentrates on creating software that is specific to the particular business where it adds direct value to the product offering, for example, as business logic on top of application servers and database engines. Large-scale Internet services providers such as Google usually take a different approach in which both application-specific logic and much of the cluster-level infrastructure software is written in-house. Platform-level software does make use of third-party components, but these tend to be open-source code that can be modified inhouse as needed. As a result, more of the entire software stack is under the control of the service developer.

 

This approach adds significant software development and maintenance work but can provide important benefits in flexibility and cost efficiency. Flexibility is important when critical functionality or performance bugs must be addressed, allowing a quick turn-around time for bug fixes at all levels. It is also extremely advantageous when facing complex system problems because it provides several options for addressing them. For example, an unwanted networking behavior might be very difficult to address at the application level but relatively simple to solve at the RPC library level, or the other way around.
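
The oversubscription arithmetic in the networking excerpt above is easy to reproduce:

# Rack-level oversubscription from the networking excerpt above.
servers_per_rack = 40
server_port_gbps = 1.0
uplink_gbps = 1.0

for uplinks in (4, 8):
    rack_demand = servers_per_rack * server_port_gbps      # 40 Gbps
    uplink_capacity = uplinks * uplink_gbps                 # 4 or 8 Gbps
    print(f"{uplinks} uplinks: oversubscription factor "
          f"{rack_demand / uplink_capacity:.0f}")           # 10 and 5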

 

The full paper: http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, May 16, 2009 9:30:04 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Tuesday, May 05, 2009

Higher data center temperatures are the next frontier for server competition (see pages 16 through 22 of my Data Center Efficiency Best Practices talk: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf and 32C (90F) in the Data Center). At higher temperatures the difference between good and sloppy mechanical designs is much more pronounced and needs to be a purchasing criterion.

 

The infrastructure efficiency gains of running at higher temperatures are obvious. In a typical data center, 1/3 of the power arriving at the property line is consumed by cooling systems. Large operational expenses can be avoided by raising the temperature set point. In most climates, raising data center set points to the 95F range will allow a facility to move to a pure air-side economizer configuration, eliminating 10% to 15% of the overall capital expense, with the latter number being more typical.

 

These savings are substantial and exciting. But there are potential downsides: 1) increased server mortality, 2) higher semi-conductor leakage current at higher temperatures, and 3) increased air movement costs driven by higher fan speeds at higher temperatures. The first, increased server mortality, has very little data behind it. I've seen some studies that confirm higher failure rates at higher temperatures and I've seen some that actually show the opposite. For all servers there clearly is some maximum temperature beyond which failure rates will increase rapidly. What's unclear is what that temperature point actually is.

 

We also know that the knee of the curve where failures start to get more common is heavily influenced by the server components chosen and the mechanical design. Designs that cool more effectively will operate without negative impact at higher temperatures. We could try to understand all the details of each server and build a failure prediction model for different temperatures, but this task is complicated by the diversity of servers and components and the near complete lack of data at higher temperatures.

 

So, not being able to build a model, I chose to lean on a different technique that I've come to prefer: incent the server OEMs to produce the models themselves. If we ask the server OEMs to warrant the equipment at the planned operating temperature, we're giving the modeling problem to the folks that have both the knowledge and the skills to model the problem faithfully and, much more importantly, the ability to change designs if they aren't faring well in the field. The technique of transferring the problem to the party most capable of solving it and financially incenting them to solve it will bring success.

 

My belief is that this approach of transferring the risk, failure modeling, and field result tracking to the server vendor will control point 1 above (increased server mortality rate). We also know that the telecom world has been operating at 40C (104F) for years (see NEBS), so clearly equipment can be designed to operate correctly at these temperatures and last longer than current servers are used. This issue looks manageable.

 

The second issue raised above was increased semi-conductor current leakage at higher temperatures. This principle is well understood and certainly measurable. However, in the crude measurements I've seen, the increased leakage is lost in the noise of higher fan power losses. And the semi-conductor leakage costs are dependent upon semi-conductor temperature rather than air inlet temperature. Better cooling designs or higher air volumes can help prevent substantial increases in actual semi-conductor temperatures. Early measurements with current servers suggest that this issue is minor so I'll set it aside as well.

 

The final issue is hugely important and certainly not lost in the noise. As server inlet temperatures go up, the required cooling air flow will increase. Moving more air consumes more power and, as it turns out, air is an incredibly inefficient fluid to move. More fan speed is a substantial and very noticeable cost. What this tells us is that the savings of higher temperature will get eaten up, slowly at first and more quickly as the temperature increases, until some crossover point where fan power increases dominate conventional cooling system operational costs.
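
A rough way to see where that crossover sits is to model fan power with the standard fan affinity law, under which fan power scales roughly with the cube of fan speed (and airflow). The numbers below are illustrative assumptions, not measurements from any particular server:

# Illustrative fan-power model: power scales roughly with the cube of airflow.
baseline_fan_w = 10.0        # fan power at the baseline inlet temperature (assumed)
server_other_w = 200.0       # rest of the server load, assumed flat

for airflow_multiple in (1.0, 1.25, 1.5, 2.0):
    fan_w = baseline_fan_w * airflow_multiple ** 3
    total_w = server_other_w + fan_w
    print(f"{airflow_multiple:.2f}x airflow: fans {fan_w:5.1f} W, "
          f"server total {total_w:5.1f} W")

# Doubling airflow takes the fans from 10 W to 80 W; at some point that added
# fan power outweighs the cooling energy avoided by the higher set point.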

 

Where is the knee of the curve where increased fan power crosses over and dominates the operational savings of running at higher temperatures? Well, like many things in engineering, the answer is "it depends." But it depends in very interesting ways. Poor mechanical designs built by server manufacturers who think mechanical engineers are a waste of money will be able to run perfectly well at 95F. Even I'm a good enough mechanical engineer to pass this bar. The trick is to put a LARGE fan in the chassis and move lots of air. This approach is very inefficient and wastes much power but it'll work perfectly well at cooling the server. The obvious conclusion is that points 1 and 2 above really don't matter. We clearly CAN use 95F approach air to cool servers and maintain them at the same temperature they run today, which eliminates server mortality issues and potential semi-conductor leakage issues. But eliminating these two issues with a sloppy mechanical design will be expensive and waste much power.

 

A well-designed server with careful part placement, good mechanical design, and careful impeller selection and control will perform dramatically better than a poor design. The combination of good mechanical engineering and intelligent component selection can allow a server to run at 95F with only a nominal increase in power from the higher air movement requirements. A poorly designed system will be expensive to run at elevated temperatures. This is a good thing for the server industry because it’s a chance to differentiate and compete on engineering talent rather than all building the same thing and chasing the gray-box server cost floor.

 

In past postings, I’ve said that server purchases should be made on the basis of work done per dollar and work done per joule (see slides at http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf). Measure work done using your workload, a kernel of your workload, or a benchmark you feel is representative of your workload. When measuring work done per dollar and work done per joule (a joule is one watt for one second), do it at your planned data center air supply temperature. Higher temperatures will save you big operational costs and, at the same time, measuring and comparing servers at high temperatures will show much larger differentiation between server designs. Good servers will be very visibly better than poor designs. And, if we all measure work done per joule (or just power consumption under load) at high inlet temperatures, we’ll quickly get efficient servers that run reliably at high temperature.
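As one way to make the metric concrete, here is a minimal sketch of the calculation itself. The benchmark runner, the measured average wall power, and the amortized server price below are placeholders and assumptions; substitute your own workload kernel and your own measurements taken at your planned inlet temperature:

# Minimal sketch: work done per joule and per dollar for a candidate server.
# All numbers below are placeholder assumptions; run your own workload kernel
# at your planned inlet temperature and substitute measured values.

import time

def run_benchmark_kernel():
    """Hypothetical stand-in for your workload kernel; returns units of work done."""
    time.sleep(1)          # pretend the kernel ran (requests served, records sorted, ...)
    return 1_000_000       # assumed units of work completed

AVG_POWER_W = 250.0        # assumed average wall power measured at the planned inlet (e.g., from a metered PDU)
SERVER_COST_USD = 2500.0   # assumed fully burdened server price
AMORTIZATION_S = 3 * 365 * 24 * 3600   # assume a 3-year service life

start = time.time()
work = run_benchmark_kernel()
elapsed_s = time.time() - start

energy_j = AVG_POWER_W * elapsed_s
dollars = SERVER_COST_USD * (elapsed_s / AMORTIZATION_S)   # capital share only, for simplicity

print(f"work/joule:  {work / energy_j:,.1f}")
print(f"work/dollar: {work / dollars:,.1f}")

The comparison only means something if every candidate server runs the same kernel at the same planned supply temperature; measured that way, the difference between good and sloppy mechanical designs shows up directly in the denominator.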

 

Make the server suppliers compete for work done per joule at 95F approach temperatures and the server world will evolve quickly. It’s good for the environment and is perhaps the largest and easiest to obtain cost reduction on the horizon.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, May 05, 2009 6:53:55 AM (Pacific Standard Time, UTC-08:00)  #    Comments [13] - Trackback
Hardware
 Sunday, May 03, 2009

Chris Dagdigian of BioTeam presented the keynote at this year’s Bio-IT World Conference. I found this presentation interesting for at least two reasons: 1) it’s a very broad and well-reasoned look at many of the issues in computational science, and 2) it presents an innovative example of cloud computing where BioTeam and Pfizer implement protein docking using Amazon AWS.

 

The presentation is posted at: http://blog.bioteam.net/wp-content/uploads/2009/04/bioitworld-2009-keynote-cdagdigian.pdf and I summarize some of what caught my interest below:

·         Argues that virtualization is “still the lowest hanging fruit in most shops” yielding big gains for operators, users, the environment, and budgets

·         Storage:

o   Storage still cheap and getting cheaper but operational costs largely unchanged

o   Data Triage needed: volume of data production is outpacing declining fully burdened cost of storage (including operational costs)

o   Lessons learned from a data loss event (10+TB lost)

§  Double disk failure on RAID5 volume holding SAN FS metadata with significant operational errors

§  Need more redundancy than RAID5

§  Need SNMP and email error reporting

§  Need storage subsystems to actively scrub, verify, and correct  errors

o   Concludes the storage discussion by pointing out that cloud services offer excellent fully burdened storage costs

·         Utility Computing

o   It is expensive to design for peak demand in-house

o   Pay-as-you-go can be compelling for some workloads

o   Explained why he “drank the Amazon EC2 Kool-Aid”: saw it, used it, solved actual customer problems with it. As an example, Chris looked at a protein docking project done by Pfizer & BioTeam.

·         Protein Docking project architecture:

o   Borrows heavily from Rightscale Grid Edition

o   Inbound and outbound queues in Amazon SQS

o   Job specification in JSON

o   Data stored in Amazon S3

o   Job provenance and metadata stored in SimpleDB

o   Worker instances dynamically spawned in EC2, where structures are scored (see the sketch after this list)

o   All results stored in S3 (EC2 <-> S3 bandwidth free)

o   Download the top ranked docked complexes

o   Launch post-processing EC2 instances to score, rank, filter, and cluster results into S3 (bring the computation to the data)

·         Don’t want to belittle the security concerns, but there’s a whiff of hypocrisy in the air

o   Is your staff really concerned, or just protecting their turf?

o   It is funny to see people demanding security measures they don’t practice internally across their own infrastructure

·         Next-Gen & utility storage

o   Primary analysis onsite; data moved to remote utility storage service after passing QC tests

o   Data would rarely (if ever) move back

o   Need to reprocess or rerun?

§  Spin up cloud servers to re-analyze in situ

§  Terabyte data transit not required
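As a rough sketch of the worker pattern described in the docking bullets above, here is what the EC2-side loop might look like. This is my illustration, not BioTeam’s or Pfizer’s code: the queue URL, bucket names, and score_structure function are hypothetical, and boto3 is a present-day stand-in for whatever tooling the project actually used.

# Illustrative worker loop for the pattern above: pull a job from SQS, fetch the
# input structure from S3, score it, and write results back to S3.
# Queue URL, bucket names, and score_structure() are hypothetical placeholders.

import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/docking-inbound"  # hypothetical
INPUT_BUCKET = "docking-structures-input"     # hypothetical
RESULT_BUCKET = "docking-structures-results"  # hypothetical

def score_structure(path):
    """Hypothetical stand-in for the docking/scoring computation."""
    return {"score": 0.0, "structure": path}

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])            # job specification is JSON, per the talk
        local_in = "/tmp/" + job["input_key"].replace("/", "_")
        s3.download_file(INPUT_BUCKET, job["input_key"], local_in)

        result = score_structure(local_in)       # score the candidate docking

        local_out = local_in + ".result.json"
        with open(local_out, "w") as f:
            json.dump(result, f)
        s3.upload_file(local_out, RESULT_BUCKET, job["input_key"] + ".result.json")

        # (The real system also recorded job provenance in SimpleDB; omitted here.)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])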

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, May 03, 2009 8:58:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Wednesday, April 29, 2009

In the Randy Katz on High Scale Data Centers posting, the article brought up Google Dalles. The article reported that Dalles used air-side economization, but I hadn’t seen the large intakes or louvers I would expect from a facility of that scale.

 

Cary Roberts, ex-TellMe Networks and all-around smart guy, produced a picture of Google Dalles that clearly shows air-side economization (thanks, Cary).

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, April 29, 2009 1:46:27 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Tuesday, April 28, 2009

Earlier this week I got a thought-provoking comment from Rick Cockrell in response to the posting 32C (90F) in the Data Center. I found the points raised interesting and worthy of more general discussion, so I pulled the thread out of the comments into a separate blog entry. Rick posted:

 

Guys, to be honest I am in the HVAC industry. Now, what the Intel study told us is that yes this way of cooling could cut energy use, but what is also said is that there was more than a 100% increase in server component failure in 8 months (2.45% to 4.46%) over the control study with cooling... Now with that said if anybody has been watching the news lateley or Wall-e, we know that e-waste is overwhlming most third world nations that we ship to and even Arizona. Think?

I see all kinds of competitions for energy efficiency, there should be a challenge to create sustainable data center. You see data centers use over 61 billion kWh annually (EPA and DOE), more than 120 billion gallons of water at the power plant (NREL), more than 60 billion gallons of water onsite (BAC) while producing more than 200,000 tons of e-waste annually (EPA). So for this to be a fair game we can't just look at the efficiency. It's SUSTAINABILITY!

It would be easy to just remove the mechanical cooling (I.E. Intel) and run the facility hotter, but the e-waste goes up by more than 100% (Intel Report and Fujitsu hard drive testing), It would be easy to not use water cooled equipment, to reduce water onsite use but the water at the power plant level goes up, as well as the energy use. The total solution has to be a solution of providing the perfect environment, the proper temperatures, while reducing e-waste.

People really need to do more thinking and less talking. There is a solution out there that can do almost everything that needs to be done for the industry. You just have to look! Or maybe call me I'll show you.

 

Rick, you commented that “it’s time to do more thinking and less talking” and argued that the additional server failures seen in the Intel report would create 100% more e-waste, so running hotter simply wouldn’t make sense. I’m willing to do some thinking with you on this one.

 

I see two potential issues with your assumption. The first is that the Intel report showed “100% more e-waste”. What they saw in an 8-rack test is a server mortality rate of 4.46%, whereas their standard data centers were at 3.83%. This is far from double and, with only 8 racks, may not be statistically significant. As further evidence that the difference may not be significant, the control experiment, where they had 8 racks in the other half of the container running on DX cooling, showed a failure rate of 2.45%. The control differed from the standard data center by about as much as the test data set did, which suggests it may be noise. And it’s a small sample.
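To make the “may not be statistically significant” point a bit more concrete, here is a quick two-proportion z-test. I don’t have the exact per-compartment server counts in front of me, so the sample size below is an assumption; plug in the real numbers if you have them and treat the output as illustrative:

# Quick check of whether 4.46% vs. 2.45% (or vs. 3.83%) failure rates could be
# noise at this sample size. The per-compartment server count is an assumption.

import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z-test; returns (z, two-sided p-value)."""
    x1, x2 = p1 * n1, p2 * n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

N = 450  # assumed servers per 8-rack compartment; substitute the real count

for label, rate_a, rate_b in [("economizer vs. control", 0.0446, 0.0245),
                              ("economizer vs. fleet",   0.0446, 0.0383)]:
    z, p = two_proportion_z(rate_a, N, rate_b, N)
    print(f"{label}: z = {z:.2f}, two-sided p ~ {p:.3f}")

With those assumed counts, neither comparison clears conventional significance thresholds, which is consistent with reading the difference as noise.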

 

Let’s assume for a second that the increase in failure rates actually was significant. Neither the investigators nor I are convinced this is the case, but let’s make the assumption and see where it takes us. They saw 0.63% more failures than their normal data centers and 2.01% more than the control. Let’s take the 2% number and think it through, assuming these are annualized rates. The most important observation I’ll make is that 85% to 90% of servers are replaced BEFORE they fail, which is to say that obsolescence is the leading cause of server replacement: servers are no longer power efficient and get replaced after 3 to 5 years. Would I accept an additional 2% in server failures each year if I could save 10% of the overall data center capital expense and 25%+ of the operating expense? Absolutely yes. Further driving this answer home, Dell, Rackable, and ZT Systems will replace, under warranty, early failures on servers run at up to 35C (95F).
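Here’s the same trade expressed as simple arithmetic. Every input below (server price, repair cost, per-server opex, fleet size) is an assumption I’m making for illustration; the structure of the calculation is the point, not the specific values:

# Back-of-envelope: does ~2% more annual server failures outweigh the capital
# and operating savings of running hotter? All inputs are illustrative assumptions.

SERVERS = 10_000
SERVER_PRICE_USD = 2_500.0
ANNUAL_OPEX_PER_SERVER_USD = 700.0   # assumed power + amortized facility cost per server-year

CAPEX_SAVINGS = 0.10        # assumed: 10% lower facility capital expense (folded into opex above)
OPEX_SAVINGS = 0.25         # assumed: 25% lower operating expense
EXTRA_FAILURE_RATE = 0.02   # assumed: 2 extra failures per 100 servers per year
REPAIR_COST_FRACTION = 0.5  # assumed: a repair/replacement costs half a new server

savings = SERVERS * ANNUAL_OPEX_PER_SERVER_USD * (OPEX_SAVINGS + CAPEX_SAVINGS)
failure_cost = SERVERS * EXTRA_FAILURE_RATE * SERVER_PRICE_USD * REPAIR_COST_FRACTION

print(f"annual savings:        ${savings:,.0f}")
print(f"annual extra failures: ${failure_cost:,.0f}")
print(f"net:                   ${savings - failure_cost:,.0f}")

Under those assumptions the savings exceed the extra repair cost by roughly an order of magnitude, and during the warranty period the repair cost largely drops out anyway.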

 

So the increased server mortality rate is actually free during the warranty period, but let’s ignore that and focus on what’s better for the environment. If 2% of the servers need repair early and I spend the carbon footprint to buy replacement parts but save 25%+ of my overall data center power consumption, is that a gain for the environment? I’ve not got a great way to estimate the true carbon footprint of repair parts, but it sure looks like a clear win to me.

 

On the basis of the small increase in server mortality weighed against the capital and operating expense savings, running hotter looks like a clear win to me. I suspect we’ll see at least a 10F average rise over the next 5 years and I’ll be looking for ways to make that number bigger. I’m arguing it’s a substantial expense reduction and great for the environment.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, April 28, 2009 8:01:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [20] - Trackback
Hardware
 Saturday, April 25, 2009

This IEEE Spectrum article was published in February but I’ve been busy and haven’t had a chance to blog it. The author, Randy Katz, is a UC Berkeley researcher and member of the Reliable Adaptive Distributed Systems Lab (RAD Lab). Katz was a coauthor on the recently published RAD Lab article on cloud computing: Berkeley Above the Clouds.

 

The IEEE Spectrum article focuses on data center infrastructure: Tech Titans Building Boom. In this article, Katz looks at the Google, Microsoft, Amazon, and Yahoo data center building boom. Some highlights from my read:

·         Microsoft Quincy is 48MW of total load with 48,600 m2 of space. It has 4.8 km of chiller pipe, 965 km of electrical wire, 92,900 m2 of drywall, and 1.5 metric tons of backup batteries.

·         Yahoo Quincy is somewhat smaller at 13,000 m2. This not-yet-complete facility will include free air cooling.

·         Google Dalles is a two-building facility on the Columbia River, each building at 6,500 m2. I’ve been told that this facility does make use of air-side economization but, in carefully studying all the pictures I’ve come across, I can’t find air intakes or louvers, so I’m skeptical. From the outside, the facilities look fairly conventional.

·         Google is also building in Pryor, Okla.; Council Bluffs, Iowa; Lenoir, N.C.; and Goose Creek, S.C.

·         Aerial picture of Google Dalles: http://www.spectrum.ieee.org/feb09/7327/2

·         McKinsey estimates that the world has 44M servers and that they consume 0.5% of all electricity and produce 0.2% of all carbon dioxide. However, in a separate article McKinsey also speculates that Cloud Computing may be more expensive for enterprise customers, a claim that most of the community had trouble understanding or finding data to support.

·         Google uses conventional multicore processors. To reduce the machines’ energy appetite, Google fitted them with high-efficiency power supplies and voltage regulators, variable-speed fans, and system boards stripped of all unnecessary components like graphics chips. Google has also experimented with a CPU power-management feature called dynamic voltage/frequency scaling. It reduces a processor’s voltage or frequency during certain periods (for example, when you don’t need the results of a computing task right away). The server executes its work more slowly, thus reducing power consumption (a first-order model of this effect is sketched after this list). Google engineers have reported energy savings of around 20 percent on some of their tests. For more recently released data on Google’s servers, see Data Center Efficiency Summit (Posting #4).

·         Katz reports that the average data center runs at 14C and that newer centers are pushing to 27C. I’m interested in going to 35C and eliminating process-based cooling: Data Center Efficiency Best Practices.

·         Containers: The most radical change taking place in some of today’s mega data centers is the adoption of containers to house servers. Instead of building raised-floor rooms, installing air-conditioning systems, and mounting rack after rack, wouldn’t it be great if you could expand your facility by simply adding identical building blocks that integrate computing, power, and cooling systems all in one module? That’s exactly what vendors like IBM, HP, Sun Microsystems, Rackable Systems, and Verari Systems have come up with. These modules consist of standard shipping containers, which can house some 3000 servers, or more than 10 times as many as a conventional data center could pack in the same space. Their main advantage is that they’re fast to deploy. You just roll these modules into the building, lower them to the floor, and power them up. And they also let you refresh your technology more easily—just truck them back to the vendor and wait for the upgraded version to arrive.

·         Microsoft Chicago will have 200 containers on its lower floor (it’s a two-floor facility). It’s expected to be well over 45MW and will be 75MW if built out to the full 200 containers planned (First Containerized Data Center Announcement). The Chicago, Dublin, and Des Moines facilities have all been delayed by Microsoft, presumably due to economic conditions: Microsoft Delays Chicago, Dublin, and Des Moines Data Centers.
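On the dynamic voltage/frequency scaling bullet above: the usual first-order model is that dynamic CPU power scales roughly as capacitance x voltage^2 x frequency, so modest voltage and frequency reductions during slack periods buy disproportionate power savings. The sketch below is that textbook model with made-up operating points, not Google’s data:

# First-order DVFS model: dynamic power ~ C * V^2 * f. The voltage/frequency
# pairs below are made-up illustrative operating points, not measured values.

def relative_dynamic_power(v_rel, f_rel):
    """Dynamic power relative to the nominal operating point (V=1.0, f=1.0)."""
    return v_rel ** 2 * f_rel

operating_points = [
    ("nominal",         1.00, 1.00),
    ("slight slowdown", 0.95, 0.90),  # assumed pairing of voltage and frequency
    ("deep slowdown",   0.85, 0.70),
]

for name, v, f in operating_points:
    p = relative_dynamic_power(v, f)
    print(f"{name:15s}: ~{p * 100:.0f}% of nominal dynamic power at {f * 100:.0f}% of nominal speed")

Under those assumed operating points, a modest slowdown lands in the same neighborhood as the roughly 20 percent savings reported above.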

 

Check out Tech Titans Building Boom: http://www.spectrum.ieee.org/feb09/7327.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, April 25, 2009 6:40:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
