Wednesday, November 05, 2008

Butler Lampson, one of the founding members of Xerox PARC, Turing award winner, and one of the most practical engineering thinkers I know spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow.  Overall Butler argues that “embodiment” is the next big phase of computing after simulation and communications.  Butler defines embodiment as computers interacting directly with the physical world.  For example, autonomously driven vehicles.  Butler argues that this class of applications are only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.

 

He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications.  Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete.  Dependability in that these systems are directly interacting with the physical world and actions in the physical world can have live critical failure modes. 

 

Butler’s recommendation on how to build incredibly complex systems that directly interact with the physical world and yet have these systems be dependable is to build them two tier.  At the core, is a small, simple kernel that doesn’t do a great job of its task but doesn’t hard fail and won’t kill anyone.  He calls this “catastrophe mode”.  For example, an autonomous vehicle may slow down to 10 MPH or just safely stop in catastrophe mode. 

 

The software stack is designed in two layers where the top layer is responsible for the complex, real time interaction the system is designed to deliver. The inner or lower layer is catastrophe mode designed to be simple and, as only simple systems can be, correct.  I like the approach.

 

Butlers Slides are: ButlerLampson_China_Microsoft2008 (1.49 MB).

 

                                                                --jrh

 

Title: The Uses of Computers: What's Past is Merely Prologue

Speaker: Butler Lampson

 

Implication of Moore

·         Spend hardware to simplify software

·         Hardware enables new applications

·         Pull complexity up into software (if unavoidable)

The uses of computers:

·         1950: Simulation

·         1980: Communications

·         2010: Embodiment (computers interacting directly with the physical world)

Argument: embodiment is now possible and there are some grand challenges that fall into this category:

·         Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)

·         Butler  example: Reduce highway traffic deaths to zero

What do we need to learn how to deal with to achieve embodiment in general and zero traffic deaths in particular:

·         Dealing with uncertainty

o   Need good models of what can happen (what is possible)

o   Need boundaries for models (where they don’t apply)

·         Dependability

o   The system meets its spec

o   Measure: probability(failure) x Cost(failure)

o   Had to model dependability. Recommends using “no catastrophes”

o   Must have a threat model of what can go wrong

o   Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers but a reliable systems needs to be able to fail back to the reliable base kernel (less than 50k loc?)

Conclusions for Engineers:

·         Understand Moore’s Law

·         Aim for mass markets

·         Learn how to deal with uncertainty

·         Learn how to avoid catastrophe (avoiding fault not possible in systems at scale)

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, November 05, 2008 1:07:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Tuesday, November 04, 2008

Tony Hoare spoke yesterday at the Computing in the 21st Century Conference in Beijing. Tony is a Turing award winner, Quicksort inventor, author of the influential Communication Sequential Processes (CSP) formal language, and long time advocate of program verification and tools to help produce reliable software systems. In his talk he argues that programming should be and can be a science and the goals should be correct programs that stay correct through change. Zero defect software. 

 

He explains that engineers will accept that there will be defects but the scientist should pursue perfection far beyond that for which there is a commercial need. Tony has spent a big part of his successful career in pursuit of techniques and tools to produce reliable complex systems.

 

Tony ended his talk on an a practical engineering note hoping that we can advance our field to the point that “Software will contain no more errors than other engineering disciplines”.  We’re not there yet.

 

My rough notes from the talk follow.

 

Title: The Science of Programming

Speaker: Tony Hoare

 

The Vision:

·         Computer software contains no more errors

o   Software is the most reliable component of any device that contains it

·         Programmers make no mistakes

o   Programs work the first time they run

o   They run forever after, even after changing

·         Programming is an engineering discipline

o   Respected for its delivered benefits and it’s foundation on basic science

·         Semantics is the science of programming

o   Explores the meaning of computer programs

o   Operational: correctness of implementation

o   Algebraic: Correctness of optimization

o   Axiomatic

The Insight:

·         Computer programs are mathematical formulae

o   They don’t suffer from rust, wear, decay, fatigue

o   If a correct program is started in a correct state, they it will stay correct

·         Their correctness is a mathematical conjecture

o   To be proved by logic and calculation

o   Checked by the computer itself

History of the idea:

·         Aristotle (350bc): Syllogistic logic

·         Euclid (300bc): geometry

·         Leibnitz (1700): calculus

·         Boole (1850): laws of thought

·         Frege (1880): predicate logic

·         Russel (1920): Principia

·         Hao Wang (1956): Computer checks

Basic Science:

·         Answers fundamental questions

·         What does it do?

·         How does it work?

·         Why does it work?

·         How do we know?

What does it do?

·         Answered by its behavioral specification

How does it work?

·         Answer by it’s internal interface contracts

Why does the program work?

·         Answered by programming theory

How do we know?

·         By logical/mathematical proof

Ideals in Basic Science

·         Pursued for the sake of scientific glory far in advance of commercial need

·         Physics: accuracy of measurement

·         Chemistry: purity of materials

·         Computing Science: zero defect programs

Unifying Theory

·         Basic science seeks unifying theories

·         Explains diverse phenomena

·         Supported by evidence

Overall, industry is not heavily using software verification along the lines that Tony wants to see but there are some in use. For example, some tools in use at Microsoft:

·         PREfix and PREfast

·         Static Driver Verifier

·         ESP (locates potential buffer overflows)

The Hope:

·         Software will contain no more errors than other engineering disciplines.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, November 04, 2008 2:23:57 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Saturday, October 25, 2008

Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn anything about a problem your service is experiencing from a customer.  How could they possibly know first when there is a service outage or issue? And, yet it happens frequently. The reason it happens is most sites don’t have close to an adequate level of instrumentation.  Without this instrumentation, you are flying blind.

 

Systems monitoring data can be used to drive alerts, to compute SLAs, to drive capacity planning, to find latencies, to understand customer access patterns, and some sites use it to drive billing although the later is probably a mistake.

 

In the rare cases where I’ve come across high quality monitoring systems that actually do fine-grained data collection, its often not looked at or underutilized.  It turns out that fully using and exploiting very large amounts of  monitoring data isn’t much easier than collecting it.

 

Returning the challenge of efficiently collecting fine grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday in making Scribe available as an open source project: Facebook's Scribe technology now open source.  Scribe is used at Facebook to monitor their more than 10k servers across multiple data centers.  Scribe is a Sourceforge project at: http://sourceforge.net/projects/scribeserver/.

 

Facebook continues to both develop interesting and broadly useful software and often contributes it to the community by making it open source. For example, Facebook Releases Cassandra as Open Source.

 

Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important

 

Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).

 

·         Instrument everything. Measure every customer interaction or transaction that flows through the system and report anomalies. There is a place for “runners” (synthetic workloads that simulate user interactions with a service in production) but they aren’t close to sufficient. Using runners alone, we’ve seen it take days to even notice a serious problem, since the standard runner workload was continuing to be processed well, and then days more to know why.

 

·         Data is the most valuable asset. If the normal operating behavior isn’t well-understood, it’s hard to respond to what isn’t. Lots of data on what is happening in the system needs to be gathered to know it really is working well. Many services have gone through catastrophic failures and only learned of the failure when the phones started ringing.

 

·         Have a customer view of service. Perform end-to-end testing. Runners are not enough, but they are needed to ensure the service is fully working. Make sure complex and important paths such as logging in a new user are tested by the runners. Avoid false positives. If a runner failure isn’t considered important, change the test to one that is. Again, once people become accustomed to ignoring data, breakages won’t get immediate attention.

 

·         Instrumentation required for production testing. In order to safely test in production, complete monitoring and alerting is needed. If a component is failing, it needs to be detected quickly.

 

·         Latencies are the toughest problem. Examples are slow I/O and not quite failing but processing slowly. These are hard to find, so instrument carefully to ensure they are detected.

 

·         Have sufficient production data. In order to find problems, data has to be available. Build fine grained monitoring in early or it becomes expensive to retrofit later. The  most important data that we’ve relied upon includes:

 

Thanks to Sriram Krishnan for pointing me to the release of Scribe.

 

                                                                --jrh

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, October 25, 2008 8:33:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, October 19, 2008

In When SSDs Make Sense in Server Applications, we looked at where Solid State Drives (SSDs) were practical in servers and services. On the client side, there are even more reasons to use SSDs and I expect that within three years, more than half of enterprise laptops will have NAND Flash as at least part of their storage subsystems. This estimate has SSDs in 38% of all laptops by 2011: Flash SSD in 38% of Laptops by 2011.  

 

What follows is a quick summary of SSD advantages on the client side, followed by the disadvantages, and then a closer look at the write endurance (wear-out) problem that has been the topic of much discussion recently.

 

Client SSD Advantages:

·         Random IOPS:  Laptop I/O patterns are dominated by random workloads and, as argued in When SSDs Make Sense in Server Applications, these workloads run cost effectively on SSDs

·         Low Power: SSD power  consumption is typically in the under 2W range and often under 1W. Enterprise disk can run 15 to 18W, desktop parts are typically in the 10W range but laptop drives usually run a more modest 2.5W when active.  So, on one hand this is represents an exciting reduction in storage power of a factor of 2 but, on the other, it’s actually only a 1W saving when the HDD is active and even less when idle. A savings but a small one overall. If you are interested in more data on laptop power consumption see Client-Side Power Consumption. Some very efficient HDDs actually have less idle power consumption than some SSDs so it’s not even the case that SSDs are all better under all conditions from a power consumption perspective.

·         Quiet. HDDs can be noisy. They are mechanical parts with precision bearings spinning at high speeds and they make noise.  Semi-conductor-based SSDs avoid this.

·         Small Form Factors: SSDs can be small and light weight.

·         Scale Down Floor: Disks have a price floor where further lowering the capacity of the device doesn’t save money. This price floor changes over time but, at this point, it’s hard to get much below $30 for a disk regardless of how small. The fixed costs of the mechanical parts dominate the media and the cost of the disk doesn’t scale down. SSD costs scale down well and for applications with modest storage requirements, they can be less expensive.  This makes them interesting for very low-end laptops, netPCs, ultra-mobile PCs, and, of course, NAND Flash is the storage of choice in cell phones, music players, cameras, and other related applications.

·         Shock and Vibration: HDDs usually spec max shock in the 50g to as high as 100G range and vibration in the ¼G to ½G.  SSD specs run well over 1,000G shock and around 20G vibration. The are much more durable to this common threat in the laptop world.

·         Latency: I/O latency is far lower on an SSD than a HDD and this is particularly noticeable when I/O queues get deep as they often do on single disk laptops.

·         Reliability: HDDs are the number one failing component on clients (and servers). This is particularly a problem on laptops as they are (usually) single drive devices and often not well backed up. HDD failures represent a substantial service cost in most enterprises so eliminating them is appealing.  Our operational history with SSDs is fairly short so far but we expect they will exhibit less frequent failures that hard disks.  However, like all new components, they bring additional failure modes  as well as eliminating a few.  The biggest concern around SSDs is write endurance with SLC part lifetimes typically in the range of 10^5 writes and MLC parts down around 10^4 write cycles (some even lower). We’ll look at that in more depth below.

·         Temperature: SSDs have a much wider temperatures and humidity operating range than HDDs.

 

Client SSD Disadvantages:

·         Capacity/$: Flash devices can deliver excellent random I/O performance and laptops, with only a single disks are frequently random I/O bound rather than capacity limited.  In fact, many enterprises customers actually want LESS storage on their laptop fleet. For them, having less capacity is often either not a problem or even a potential advantage. For my uses and for many consumer usage patterns, capacity remains important with pictures, audio, and other media files driving space requirements up to the point where SSDs can be tough to afford.  As a direct consequence, I expect that we’ll see more enterprise than consumer use of SSDs in clients.

·         Performance Degradation: There have been many reports of SSDs initially performing well and then degrading over time. See Laptop SSD Performance Degradation Problems for more detail.

·         Endurance: This is the most common concern I’ve heard of late with MLC write endurance only around 10,000 writes.

 

Write Endurance

I keep hearing anecdotal reports that SSDs in laptops are going to fail in the first year due to the poor write endurance of MLC SSDs. The typical MLC write endurance is usually quoted at around 10,000 cycles which I agree does sound quite low.

 

Let’s do a quick back of the envelope on MLC SSD write endurance (SLC parts are typically more expensive but have longer write endurance specifications). Assume a client system is used four hours a day and that it spends ¼ of that time at the max I/O rate of 100 IOPS.  My gut feel says this number very likely errs high.  Let’s include write amplification. Write amplification is a side effect of Flash memory designs having larger blocks as the unit of erase and smaller pages as the unit of read and programming (write).  This combined with wear leveling leads to the device having to do some overhead housekeeping writes when servicing writes from the host system. Assume an average write amplification of 3x over three years of life which again seems high.  To make it really aggressive we’ll assume a write to read ration of 1:1 (50% writes) which is very high. Finally let assume it’s a 64GB MLC device and that my writes are all to 4k pages and the overheads are all accounted for by my 3x write amplification number.

 

4*60*60*365*3*.25*.5*3*100 => 591m

 

Reading left to write, that’s 4 hours a day * 60 to get minutes * 60 to get seconds * 365 to get seconds use per year * 3 to get seconds use in three years, *25% of time at max I/O, *50% of I/Os are writes, *3 write amplification, *100 I/Os per second.

 

In aggregate, that’s about ½ billion write I/Os is needed by each laptop living three years.  But, a 64GB device has 16m pages. If you spread ½ billion writes over 16m pages with perfect wear leveling, you would have 36 write I/Os per page. Very low. With terrible wear leveling, it could go up an order of magnitude but it’s still a very low number. Move write amplification up to 5x and the wear/page still looks tiny.   Move the usage pattern up from 4 hours a day to an aggressive 16 hours a day and it’s still only 147 writes per page. Perhaps we’ll use more lifetime I/Os with an SSD than my magnetic disk model above assuming we spend less time waiting but, still, it’s not looking that big a number of lifetime writes required.

 

If we use very low endurance MLC where write endurance is specified down around 1,000 cycles rather than the more common 10,000, it’s still not a problem. But is within on order of magnitude so arguably a concern over a three year life. And it would be definitively a concern over a 5 year life.

 

Because client systems spend such a small percentage of their working lifetimes at 100% I/O rates, it’s hard to see a credible usage model that has MLC write endurance as a serious problem if using parts specified at 10,000 write cycles.

 

In a subsequent post, we’ll look back at server applications and, in contrast with When SSDs Make Sense in Server Applications, we’ll look at where SSDs don’t make sense on the server side.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, October 19, 2008 9:04:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Wednesday, October 15, 2008

In past posts, I’ve talked a lot about Solid State Drives.  I’ve mostly discussed about why they are going to be relevant on the server side and the shortest form of the argument is based on extremely hot online transaction processing systems (OLTP).  There are potential applications as reliable boot disks in blade servers and other small data applications but I’m focused on high-scale OLTP in this discussion. OLTP applications random I/O bound workloads such as ecommerce systems, airline reservation systems, and any data intensive application that does lots of small reads and writes, usually on a database where future access patterns are unknown. When sizing a server for one of these workloads, the key dimension is the number of small random I/Os per second.  You need to add memory  to increase the memory hit rate and reduce the number I/Os or you need to add disks to support the application-required I/O rates.  The problem with adding memory is that it has linear cost – the last DIMM costs as much as the first DIMM – but only logarithmic value.  Because the workloads are random, adding memory only delivers a reduction in I/Os roughly proportional to the square root of the memory size. Cheap memory helps but, even then, the costs add up as does the power consumption as memory is added.  Alternatively, you can add disk but each disk added gives only another roughly 200 I/Os per second (IOPS) when using very expensive, 15k RPM disks.

 

The problem is best summarized by my favorite chart these days from Dave Patterson of Berkeley:

This chart is from an amazingly useful paper, Latency Lags Bandwidth (if you know of no-charge location for this paper, let me know).  In this chart, Dave tracks the trend of bandwidth and latency over the last 20+ years. For the purposes of this discussion ignore the latency row and focus on bandwidth. Disk bandwidth is growing slower than DRAM and CPU bandwidth.  I love looking for divergent trends in that they direct us to the more fundamental problems needing innovation.

 

Understanding disk bandwidth growth is a growing problem, let’s compare disk sequential bandwidth with random I/O rates over-time. In the chart below, I graph sequential bandwidth growth against random bandwidth growth over the same period:

 

We know that disk sequential bandwidth growth lags the rest of the system. This graph shows that random IOPS bandwidth is growing even more slowly. Across the industry, we have a huge problem and the trend lines above make it crystal clear that the problem won’t be cost-effectively solved by disk alone. More detail on one dimension of the disk limits problem in: Why Disk Speeds aren’t Increasing.

 

Disks clearly aren’t the full solution. Ever larger memory sub-systems actually are part of the solution but the logarithmic (or worse) payback with linear cost and power consumption makes memory an expensive approach if we use it as the only tool.  Many have argued for the last couple of years that solid state disks are the solution to filling the chasm between memory and disk random IOPS rates.  Jim Gray was one of the first to make this observation in: Tape is Dead, Disk is Tape, Flash is Disk, Ram Locality is King.

 

The first generation, server-side SSDs were slow random write performers but we’re now seeing great components released to the market. See 100,000 IOPS and 1,000,000 IOPS. These are great performers but they are far from commodity pricing at this point.  Intel has been doing some great work on SSDs and I really like this one: Intel X25-E Extreme SATA Solid-State Drive.  It’s a step towards commodity pricing. Overall the industry now has great performing parts available and the price/performance equation is very rapidly improving since this is a semi-conductor component rather than a mechanical one.

 

When should we expect the crossover? At what price point are SSDs a win over HDDs?  Unfortunately, it’s an application specific answer.  It depends upon I/O density of the workload, the number of I/Os per GB of data. Bob Fitzgerald has done a great job of analyzing different workloads to understand what level of application I/O heat (IOPS per GB) are needed to justify a SSD.  Building on Fitz’s work, I have a quick test you can use to figure out how cheap an SSD will have to get before it is a win in your application.

 

My observation goes like this. Disks have an abundance of capacity and are short of IOPS so, on random IOPS intensive workloads, the limiting factor using HDDs will be IOPS.  SSDs have an abundance of IOPS and are short of capacity, so the limiting factor using SSDs will be capacity.  SSDs are cost effective for your application when the cost of the disk farm adequate to support the IOPS you need is more than the SSD farm required to support the capacity you need. As a formula:

 

current#hdd * hdd$ > CapacityNeeded / Capacity_ssd * ssd$

 

Let’s try an example. This example application is hosted on several hundred database servers and it’s a red hot transaction processing system.  Each system has 53 disks of which 40 are used to store data and 8 for log and a few for admin purposes.  Leave the log on magnetic media since disks sequential bandwidth is cheaper than SSD sequential bandwidth. The database size on each server is 572GB.  The disks used by this application are 15k RPM, 3 ½ disks that price out at $333 each. Understanding this, the disk budget per server for this application is 40 * 3333 which is $13,320.  We know we need 572GB and let’s assume we are trying out 64 GB SSDs.  Using that equation, 572/64 is 8.9 so we’ll need 9 SSDs to support this workload. 

 

Taking the disk budget of $13,320 and dividing by the 9 SSDs we have computed we need, we can afford to pay up to $1,480 for each SSD. If the SSDs cost is less than this, it’s worth doing.  This model ignores the power savings (SSDS usually run under 1/5 the power of HDDs and fewer are needed) and other factors like service costs but it’s a quick check to see if SSDs are worth considering.

 

We also need more data on SSD longevity in high write-rate workloads.  In the absence of historical data, ask your vendor to stand behind their product with full warrantee in your usage model before jumping in.

 

Speaking of wear-out rates, for the next posting I’ll investigate client-side MLC NAND-flash wear out rates.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, October 15, 2008 6:27:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Saturday, October 11, 2008

Albert Greenberg and I missed Hotnets 2008 last week due to a conflicting meeting down in California but Ken Church was there to present our On Delivering Embarrassingly Distributed Cloud Services paper.  I summarized the paper in a recent blog entry:  Embarrassingly Distributed Cloud Services and the abstract from the paper follows:

 

 Very large data centers are very expensive (servers, power/cooling, networking, physical plant.) Newer, geo-diverse, distributed or containerized designs offer a more economical alternative. We argue that a significant portion of cloud services are embarrassingly distributed – meaning there are high performance realizations that do not require massive internal communication among large server pools. We argue further that these embarrassingly distributed applications are a good match for realization in small distributed data center designs. We consider email delivery as an illustrative example. Geo-diversity in the design not only im-proves costs, scale and reliability, but also realizes advantages stemming from edge processing; in applications such as spam filtering, unwanted traffic can be blocked near the source to reduce transport costs.

 

The Hotnets agenda and all the papers present5ed are up at: Seventh ACM Workshop on Hot Topics in Networks.

 

The slides ken presented are posted at:

·         pptx form:  http://conferences.sigcomm.org/hotnets/2008/slides/EmbarrassinglyDistributed6.pptx

·         pdf form:  EmbarrassinglyDistributedFinalSlides.pdf (724.17 KB)

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Saturday, October 11, 2008 1:40:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, October 04, 2008

Google has long enjoyed a reputation for running efficient data centers. I suspect this reputation is largely deserved but, since it has been completely shrouded in secrecy, that’s largely been a guess built upon respect for the folks working on the infrastructure team rather than anything that’s been published.  However, some of the shroud of secrecy was lifted last week and a few interesting tidbits were released in Google Commitment to Sustainable Computing.

 

On server design (Efficient Servers), the paper documents the use of high-efficiency power supplies and voltage regulators, and the removal of components not relevant in a service-targeted server design.  A key point is the use of efficient, variable-speed fans.  I’ve seen servers that spend as much as 60W driving the fans alone. Using high efficiency fans running at the minimum speed necessary based upon current heat load can bring big savings.  An even better approach is employed by Rackable Systems in their ICE Cube Modular Data Center design (First Containerized Data Center Announcement) where they eliminate server fans entirely.

 

The paper also argues for energy proportionality a concept introduced by Luiz Barroso and Urs Holzle of Google. Energy proportionality is a call to the industry to produce servers where the amount of energy consumed is proportional to the server load.  Sadly, many current server designs consume more than 60% of their full load power when idle.  None of us will talk publically about the average utilizations of our servers farms but the quick summary is that achieving very high utilizations is incredibly difficult. Or, worded differently, most servers are on average closer to idle than to full load. Even small steps towards energy proportionality make a huge difference and, of course, getting utilization up remains the holy grail of the industry.

 

It’s good to see water conservation brought up beside energy efficiency.  It’s the next big problem for our industry and the consumption rates are prodigious.  To achieve efficiency, most centers have cooling towers which allow them to avoid the use of energy-intensive direct-expansion chillers except under unusually hot and humid conditions. This is great news from an energy efficiency perspective, but cooling towers consume water in two significant ways. The first are evaporative losses which are hard to avoid in wet tower designs (other less water-intensive designs exist). The second is caused by the first. As water evaporates from the closed system, the concentrations of dissolved solids and other contaminants present in the supply water left behind by evaporation continue to rise. These high concentrations are dumped from the system to protect it and this dumping is referred to as blow-down water.  Between make-up and blow-down water, a medium-sized, 10MW facility, built to current industry conventions, can go through ¼ to ½ million gallons of water a day.

 

The paper describes a plan to address this problem in the future by moving to recycled water sources.  This is good to see but I argue the industry needs to reduce overall water consumption, whether the source is fresh or recycled.  The combination of higher data center temperatures and aggressive use of air-side economization are both good steps in that direction and industry-wide we’re all working hard on new techniques and approaches to reduce water consumption.

 

The section on PUE is the most interesting in that the are documenting an at-scale facility running at a PUE of 1.13 during a quarter. Generally, you want full-year numbers since these numbers are very load and weather dependent. The best annual number quoted in the paper is 1.15 which is excellent. That means that for every watt delivered to servers 0.15W is lost in power distribution and cooling. 

 

This number, with pure air-side cooling and good overall center design, is quite attainable.  But, elsewhere in the document, they described the use of cooling towers.  Attaining a PUE of 1.15 with a conventional water-based cooling system is considerably more difficult.  On the power distribution side, conventional designs waste about 8% to 9% of the power delivered.  A rough breakdown of where it goes is 3 transformers taking  115KV down to 13.2KV down to 480KV and then down to 208KV for delivery to the load. Good transformer designs run around 99.7% efficiency.  The uninterruptable power supply can be as poor as 94%, and roughly 1% is lost in switching and conductors. That approach gets us to 8% lost in distribution. We can easily eliminate one layer of transformers and either use a high efficiency bypass UPS.   Let’s use 97% efficiency for the UPS. Those two changes will get us 4% to 5% lost in distribution.  Let’s assume we can reliably hit 5% power distribution losses.  That leaves us with 10% for all the losses to the mechanical systems.  Powering the Computer Room Air Handlers, the water pumps etc. at only 10% overhead would be both difficult and more impressive. 

 

The 1.15 PUE with pure air-side economization in the right climate looks quite reasonable, but powering a conventional, high-scale, air and water, multi-conversion cooling system at this efficiency looks considerably harder to me.  Unfortunately, there is no data published in the paper on the approach and whether it was simply attained by relying on favorable weather conditions and air-side economization with the water loops idle.

 

The paper closes with An Efficient and Clean Energy Future, a discussion of the Renewable Energy Less Than Coal (RE<C) project.  The RE<C project isn’t part of the Google infrastructure team, and they aren’t building data centers, but it is perhaps the coolest project I’ve ever come across.  It’s just amazing. The core premise of this project is to do research into renewable energy sources that can be harnessed less expensively than coal and then let capitalism take care of the rest.  Environmental policy lags reality and is influenced by special interest groups.  If renewable energy can be made less expensive than coal, the free market system will help eliminate the burning of coal. Why fight a powerful market force if an alternative may exist. This is great research and, the more I hear about it, the more I like it.

 

The paper concludes that “if all data centers operated at the same efficiency as ours, the U.S. alone would save enough electricity to power every household within the city limits of Atlanta, Los Angeles, Chicago, and Washington, D.C.”. This is hard to independently verify without much more information than offered by the paper. Most of the techniques employed are not discussed in the paper published last week.  If the large service providers like Google, Microsoft, Yahoo, Beidu, Amazon and a handful of others don’t publish the details, the rest of the world’s data centers will never run as efficiently as described in the paper.  Only high-scale datacenter users can afford the R&D program to spend on increased efficiency and water consumption elimination.  I’m arguing it’s up to all of us working in the industry to publish the details to allow smaller-scale deployments to operate at similar efficiency levels.  If we don’t, it’ll continue to be the case that US data centers alone will be needlessly spending enough power to support every household in Atlanta, Los Angeles, Chicago, and Washington DC.  Each day, every day.

 

                                                --jrh

 

Thanks to Alex Mallet, Mike Neil, Janine Harrison, and many others who sent this article my way last week.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

 

Saturday, October 04, 2008 1:55:28 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, September 28, 2008

An interesting file system study is at this year’s USENIX Annual Technical Conference.  The paper Measurement and Analysis of Large-Scale Network File System Workloads looks at CIFS  remote file system access patterns from two populations. The first a large file store of 19TB serving 500 software developers and the second a medium sized file store of 3TB used by 1,000 marketing, sales, and finance users.

 

The authors found that file access patterns have changed since previous studies and offer 10 observations:

·         Both workloads are more write-heavy than workloads studied previously

·         Read-write [rather than pure read or pure write] access patterns are much frequent compared to past studies

·         Bytes are transferred in much longer sequential runs than in previous studies [the lengths of sequential runs is increasing but note that the percentage of random access is increasing]

·         Bytes are transferred from much larger files than previous studies [files are getting bigger]

·         Files live an order of magnitude longer than in previous studies

·         Most files are not re-opened once they are closed

·         If a file is re-opened, it is temporally related to the previous close

·         A small fraction of the clients account for a large fraction of the activity

·         Files are infrequently accessed by more than one client

·         Files sharing is rarely concurrent and mostly read-only

·         Most file types do not have a single pattern of access

 

The comments in brackets above are mine. Some of the important points that spring out for me: the percentage of random access is increasing; for those accesses that are sequential, the runs are longer; file sizes are increasing, data is getting colder; file lifetimes are increasing; and client usage has very high skew.

 

Overall, file data has been getting colder and the write to read ratio has been increasing.  The authors conclude that substantial increases in the client file caches are unlikely to help significantly based upon this data. But, since file metadata requests make up roughly 50% of all operations, larger metadata caches could be very beneficial.  Log Structured File systems look increasingly like the write answer. Increasingly random access patterns make NAND flash an interesting approach. The authors didn’t directly mention it but log structured block stores (below the filesystem) is also interesting in that, like LFS, it’s a write optimized organization.  And, in addition, a log structured block store tends to sequentialize writes while randomizing reads which is ideal for NAND Flash.

 

Thanks to Vlad Sadovsky for sending this paper my way.

 

                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, September 28, 2008 7:02:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Sunday, September 21, 2008

Ken Church, Albert Greenberg, and I just finished On Delivering Embarrassingly Distributed Cloud Services which has been accepted for presentation at ACM Hotnets 2008 in Calgary, Alberta October 6th and 7th.  This paper followed from the discussion and debate around a blog entry that Ken and I did some time back: Diseconomies of scale where we argue that the industry trend towards mega-datacenters needs to be questioned and, in many cases, is simply not cost effective.

 

There are times when Mega-datacenters do makes sense. Very large data analysis jobs and large, multi-server workloads with considerable inter-node communications traffice run best against large central data stores. MapReduce jobs are the classic example of this sort of workload.  However, we argue that other types of workloads actually run better in distributed micro-datacenters.  Highly partitionable applications with light inter-partition traffic can be better hosted in distributed micro-datacenters. Highly interactive applications such as Google Docs need to be close to their users.  Network round trip latencies can make highly interactive applications frustrating to use. We collectively refer to applications can be partitioned effectively and run close to the edge (the users) as Embarrassingly Distributed. Essentially, these are the easy applications when it comes to running them close to the edge.

 

In the paper, we argue that the class of applications that are embarrassingly distributed and therefore run well on distributed micro-datacenters is large and we are go on to show that distributed micro-datacenters can offer considerable advantage over mega-centers.  Essentially the point is that you can run many applications over distributed micr-datacenters and, if you can, you should.

 

Micro-datacenters are made possible by containerization that I wrote about in a 2007 Conference on Innovative Data Research Paper: Architecture for Modular Data Centers.  When that paper was published Rackable Systems had just shipped their first containerized design and Sun Microsystems had announced Black Box but it wasn’t yet shipping. Two years later, containerized designs are offered by most of the major datacenter server vendors:

·         IBM Scalable modular data center

·         Rackable ICE Cube™ Modular Data Center

·         Sun Modular Datacenter S20 (project Blackbox)

·         Dell Insight

·         Verari Forest Container Solution

 

Microsoft recently announced the first containerized data center in Chicago: First Containerized Data Center Announcement. The Chicago announcement is a mega-center but it does show that containerized designs are now ready for primetime.

 

Mega-datacenters remain useful and aren’t going away any time soon but, in Delivering Embarrassingly Distributed Cloud Services, we argue that distributed micro-datacenters are appropriate for many workloads and can reduce costs, improve the quality of service, and increase the speed of deployment.

 

                                    --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, September 21, 2008 6:22:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Tuesday, September 16, 2008

Earlier today, I gave a talk at LADIS 2008 (Large Scale Distributed Systems & Middleware) in Yorktown Heights, New York. The program for LADIS is at: http://www.cs.cornell.edu/projects/ladis2008/program.html.  The slides presented are posted to:  http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf.

 

The quick summary of the talk: Hosted services will be a large part of enterprise information processing and consumer services with economies of scale of 5 to 10x over small scale deployments.  Enterprise deployments are very different from high scale services. The former is people-dominated from a cost perspective whereas people-costs are not in the top 4 major factors in the services world. 

 

The talk looks at limiting factors in the economic application of resources to services, one of which is power.  Looking at power in more detail, we go through where power goes in a modern data center inventorying power disapation in power distribution, cooling and server load in a high-scale data center.

 

Then it steps through a sampling of high scale services implementation techniques and possible optimizations including modular data centers, multi-data center failover replacing single data center redundancy, NAND flash bridging the memory to disk chasm, graceful degradation mode, admission control, power yield management, and resource consumption shaping.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, September 16, 2008 10:06:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, September 11, 2008

This note describes a conversation I’ve had multiple times with data center owners and concludes that blade servers frequently don’t help and they sometimes hurt, easy data center power utilization improvements are available independent of the blade server premium, and enterprise data center owners have a tendency to buy gadgets from the big suppliers rather than think through overall data center design. We’ll dig into each.

 

In talking to data center owners, I’ve learned a lot but every once in a while I come across a point that just doesn’t make sense.  My favorite example is server density.  I’ve talked to many DC owners (and I’ll bet I’ll hear from many after this note) that have just purchased blades servers.  The direction of conversation is always the same. “We just went with blades and now have 25+kW racks”. I ask if their data center has open floor and it almost always does. We’ll come back to that.  Hmmm, I’m thinking. They now have much higher power density racks at higher purchase cost in order to get more computing per square foot but the data center already has open floor space (since almost all well designed centers are power and cooling bound rather than floor space bound).  Why?

 

Earlier, we observed that most well designed data centers are power and cooling bound rather than space bound.  Why is that anyway?  There is actually very little choice.  Here’s the math: Power and Cooling make up roughly 70% of the cost of the data center while the shell (the building) is just over 10%. As a designer, you need to design a data center to lasts for 15 years. Who has a clue of the needed power density (usually expressed in W/sq ft) 15 years from today? It depends upon the server technology, the storage ratio, and many other factors.  The only thing we know for sure is we don’t know and almost any choice will inevitably be wrong.  So a designer is going to have too much power and cooling or too much floor space.  One or the other will be wasted no matter what.  Wasting floor space is a 10% mistake whereas stranding power and cooling is a 70% mistake.  This 10% number applies to large scale data centers of over 10MW not in the center of New York – we’ll come back to that. Any designer that strands power and cooling by running out of floor space should have been fired years ago.  Most avoid this by providing more floor space than needed in any reasonable usage and that’s why most data centers have vast open spaces. Its insurance against the expensive mistake of stranding power.

 

There are rare exceptions to this rule of well designed data centers being power and cooling rather than floor space limited. But the common case is that a DC owner just paid the blade server premium to get yet again more unused data center floor space. They were power and cooling limited before and now, with the addition of higher density servers, even more so.  No gain visible yet so the conversation then swings over to efficiency. When talking about the amazing efficiency of the new racks, we usually talk about PUE.  PUE is Power Usage Effectiveness and it’s actually simpler than it sounds.  It’s the total power that comes into the data center divided by the power delivered to the critical load (the servers themselves). As an example, a PUE of 1.7 means that for every watt delivered to the load 0.7 W  is lost in power distribution and cooling.  Some data centers, especially those that have accreted over time rather than having been designed as a whole, can be as bad as 3.0 but achieving numbers this bad takes work and focus so we’ll stick with the 1.7 example as a baseline.

 

So, in this conversation about the efficiency of blade servers, we hear the PUE improved PUE from 1.7 to 1.4. Sounds like a fantastic deal and, if true, that kind of efficiency gain will more than pay the blade premium and is also good for society.  That would be good news all around but let’s dig deeper. I first congratulate them on the excellent PUE and ask if they had data center cooling problems when the new blade racks were first installed.  Usually they experienced exactly that and eventually bought water cooled racks from APC, Rittal, or others.  Some purchased blade racks with back-of-rack water cooling like the nicely designed IBM iDataPlex. But the story is always the same: they purchased blade servers and, at the same time, moved to water cooling at the rack. New generation servers can be more efficient than the previous generation and better cooling designs are more efficient whether or not blade servers are part of the equation. Turning the servers over onto their sides didn’t make them more efficient.

 

They key part of that PUE improvement above is they replaced the inefficiency of conventional data center cooling with water at the racks. Here’s an example of a medium to large scale deployment that went with blades and water cooled racks: One Datacenter to Rule Them All. There is nothing magical about water at the rack cooling designs.  Many other approach yield similar or even better efficiency. The important factor is that they used something other than the most common data center cooling system design which is amazingly inefficient as deployed in most centers. Conventional data centers typically move air from a water cooled CRAC unit through a narrow raised floor choked with cabling.  The air comes up into the cold aisle through perforated tiles.  In some aisles there are too many perforated tiles and in others too few.  Sometimes someone on the ops staff has put a perforated tile into the hot aisle to “cool things down” or to make it more habitable.  This innocent decision unfortunately reduces cooling efficiency greatly. The cool air that comes up into the cold aisle is pulled through the servers to cool them but some spills over the top of the rack and some around the ends.  Some goes through open rack positions without blanking panels. All these flows not going through the servers reduces cooling system efficiency.  After flowing through the servers, the air rises to the ceiling and returns to the CRAC. Moving air that distance with so many paths that don’t go through the servers, is inefficient.  If you move the water directly to the rack in what I call a CRAC-at-the-Rack design, the overall cooling design can be made much more efficient mostly through the avoidance of all these not-through-the-server air paths and avoiding the expense of pumping air long distances. It’s mostly not the blades that are more efficient, it’s the cooling systems redesign required as a side effect of deploying the high power density servers.

 

Rather than moving to blades and paying the blade premium, just changing the cooling system design to avoid the problems in the previous paragraph will yield big efficiency improvements.

 

Why are some data centers in expensive locations?  Sometimes for good reason in that the communications latency to low cost real estate is too high for a very small number of applications. But, for most data centers, having them in expensive locations is simply a design mistake.  Many time it’s to allow easy access to the data center but you shouldn’t need to be in data center frequently. In fact, if people are in the DC frequently, you are almost assured to have mistakes and outages.  Placing DCs in hard to get to locations substantially reduces costs and improves reliability. For those few that need to have them located in New York, Tokyo, London, etc., there aren’t very many of you and you all know who you are.  The remainder are spending too much.  Remember my first law of data centers: if you have a windows to see in, you are almost certainly paying too much for servers, network gear, etc. Keep it cheap and ugly.

 

What about data centers that are out of cooling capacity but can’t use all their power or floor space.  It’s bad design to strand power and simply shouldn’t happen.  We know that for every watt we bring into the building we need to get it back out again. It has got to go somewhere.  If the cooling system isn’t designed to dissipate the power being brought into the building, it’s bad design.

 

Now a more common cooling system problem is someone brought a 30kW rack into the data center and an otherwise fine cooling system that is appropriately sized overall, can’t manage that hot spot. This isn’t bad data center design but it does raise a question: why is a 30kW rack a good idea?  We’re now back to asking “why” on the blade server question.  Generally, unless you are getting value for extreme high power density, don’t buy it. High power density drives more expensive cooling.  Unless you are getting measurable value from the increased density, don’t pay for it. 

 

Summary so far: Blade servers allow for very high power density but they cost more than commodity, low power density servers. Why buy blades?  They save space and there are legitimate reasons to locate data centers where the floor space is expensive. For those, more density is good.  However, very few data center owners with expensive locations are able to credibly explain why all their servers NEED to be there.  Many data centers are in poorly chosen locations driven by excessively manual procedures and the human need to see and touch that for which you paid over 100 million dollars.  Put your servers where humans don’t want to be. Don’t worry, attrition won’t go up. Servers really don’t care about life style, how good the schools are, and related quality of life issues.

 

We’ve talked about increased efficiency possible with blades by bringing water cooling directly to the rack but this really has nothing to do with blades. Any DC designer can employ this technique or a myriad of other mechanical designs and substantially improve their data centers cooling efficiency.  For those choosing modular data centers like the Rackable Ice Cube, you get the efficiency of water at the rack it as a side effect of the design. See Architecture for Modula Data Centers for more on container-based approaches and First Containerized Data Center Announced for information on the Microsoft modular DC deployment in Chicago.

 

We’ve talked about the high heat density of blade servers and argued that increased heat density increases operational or capital cooling expense and usually both.  Generally, don’t buy increased density unless there is a tangible gain from it that actually offsets the cooling cost penalty.  Basically, do the math. And then check it. And then make sure that there isn’t some cheaper way to get the same gain.

 

There are many good reasons to want higher density racks.  One good one is that you are using very high speed, low latency communications between servers in the cluster – I know of examples of this from the HPC world but I’ve not found them in many commercial data centers.  Another reason to go dense is the value of floor space is high.  We’ve argued above that a very small number of centers need to be located in expensive locations due to wide-area communications delays but, again, these are rare. The vast majority of folks buying high density, blade servers aren’t able to articulate why they are buying them in a way that stands up to scrutiny.  In these usage patterns, blades are not the best price/performing solutions.  In fact, that’s why the world’s largest data center operator, Google, doesn’t use blade servers. When you are deploying 10’s of thousands of servers a month, all that matters is work done per dollar. And, at today price points, blade servers do not yet make sense for these high scale, high efficiency deployments.

 

I’m not saying that there aren’t good reason to buy high density server designs.  I’ve seen many. What I’m arguing is that many folks that purchase blades, don’t need them. The arguments explaining the higher value often don’t stand scrutiny. Many experience cooling problems after purchasing blade racks.  Some experience increased cooling efficiency but, upon digging more deeply, you’ll see they made cooling system design changes to increase cooling system efficiency after installation but these excellent design changes could have been deployed without paying the blade premium.  In short, many data center purchases don’t really get the “work done per dollar” scrutiny that they should get. 

 

Density is fine but don’t pay a premium for it unless there is a measurable gain and make sure that the gain can’t be achieved by cheaper means.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, September 11, 2008 5:01:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Saturday, September 06, 2008

IBM just announced achieving over one million Input-output operations per second: IBM Breaks Performance Records Through Systems Innovation. That’s an impressive number.  To put the achievement in context, a very good (and way too expensive) enterprise disk will deliver somewhere between 180 to just over 200 IOPS. A respectable, but commodity, SATA disk will usually drive somewhere in the 70 to 100 IOPS range.

 

To achieve this goal, IBM actually used a Fusion-IO NAND flash based storage component.  It’s unfortunate that the original press release from IBM didn’t include FusionIO. However, an excellent blog write-up on the performance run by Barry Whyte of IBM offers the details behind the effort: 1M IOPs from Flash - actions speak louder than words.  The Fusion-IO press release is at: Fusion-io and IBM Team to Improve Enterprise Storage Performance.

 

 FusionIO is a PCIe storage subsystem based upon NAND flash.  I mentioned them in 100,000 IOPS.  It’s a bit expensive at this point but a very nice part nonetheless.  NAND prices continue to free-fall based upon mammoth volumes driven by usage in consumer devices and some over-capacity in the NAND market. As the base technology prices fall and sales of enterprise Flash-based storage devices increases, I expect we’ll see pricing improvements as well over the near term.  And, for the very hottest online transaction workloads where IOPS are the primary limiting factor, even current prices work and we’re starting to see some high I/O rate workloads migrate from spinning media to NAND flash.  Some have already moved and I know of many more that have devices in test.

 

Digging deeper into the IBM result, we see that the Fusion-IO part in this run was mounted behind a SAN.  I’ve already taken a bit of heat on this point as it’s well known that I’m not a lover of SANs. Actually, its not really true that I hate SANs. What I hate are expensive, scale-up solutions and it is true that many SAN fall into this catagory.  I want servers, storage, and networking to all be built from clusters of commodity components.   Quarter million dollar network routers just don’t make sense to me and most SANs are not affordable at internet service scale. Essentially, high end network routers and SAN storage arrays are the last bastion of the mainframe -- very high quality, very expensive, scale-up solutions.  As an example, consider the Symmetrix DMX3000.  At full scale, it has 576 disk drives, ¼ TB of memory and over 100 1GHz embedded PowerPC processors. When it was announced back in 2003, the starting price was $1.7M (in lightly configured form– the sky is the limit).

 

It’s really mainframe priced storage subsystems that I’m objecting to.  SANs could be great if built from commodity parts and priced to sell in volume. The ability to migrate storage between machines is clearly useful.  I’m not in love with an entire networking and switching infrastructure dedicated to storage (Fibre Channel) but that’s not inheriently required by SANs either.  FCOE should solve that problem and iSCSI does.

 

The IBM Million IOPS number built upon Fusion-IO NAND Flash components and a virtual SAN over a cluster of Intel-based servers is very interesting.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, September 06, 2008 10:06:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Sunday, August 31, 2008

In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures.  Very few do and I see related down-time in the news every month or so.

 

The Windows Genuine Advantage failure (WGA Meltdown...) from a year ago is a good example in that the degraded operations modes possible for that service are unusually simple and the problem and causes were well documented.  The obvious degraded operations model for WGA is allow users to continue as  “WGA Authorized” when the service isn’t healthy enough to fully check their O/S authenticity.  In the case of WGA, this actually is the intended operation and it is actually designed to do this.  This should have worked but services rarely have the good sense to fail.  They normally just run very, very slowly or otherwise misbehave.

 

The actual cause of the WGA issues are presented in detail here: So What Happened?. This excellent post even includes some of the degraded operation modes that the WGA team have implemented.  This is close to the right answer.  However, the problem with the implemented approach is: 1) it doesn’t detect unacceptable rises in latency or failure rate via deep monitoring and automatically fall back to degraded mode, and 2) it doesn’t allow the service to be repaired and  retested in production selectively with different numbers of users (slow restart).  It’s either on or off in this design.  A better model is one where 100% of the load can be directed to a backup service that just says “yes”.  And then real service that actually does the full check can be brought back live incrementally by switching more and more load from the “yes” service to the real, deep check service.  Here again, deep real time monitoring is needed to measure whether the service is performing properly.  Implementing and production testing a degraded operation mode is hard but I’ve never talked to a service who had invested in this work and later regretted it.

 

15 years ago I worked on a language compiler which, amongst others,  targeted a Navy fire control system.  This embedded system had a large red switch tagged as “Battle Ready”.  This switch would disable all emergency shutdowns and put the server into a mode where it would continue to run when the room was on fire or water is beginning to rise up the base of the computer.  In this state, the computer runs until it dies.  In the services world, this isn’t exactly what we’re after but it’s closely related.  We want all system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.  We need to design and, most important, we need to test these degraded modes of operation in at least limited production or they won’t work when we really need them.  Unfortunately, almost all services but the least successful will need these degraded operations modes at least once.

 

Degraded operation modes are service specific and, for many services, the initial gut reaction is that everything is mission critical and there exist no meaningful degraded modes.  But, they are always there if you take it seriously and look hard.  The first level is to stop all batch processing and periodic jobs.  That’s an easy one and almost all services have some batch jobs that are not time critical.   Run them later.  That one is fairly easy but most are hard to come up with.  It’s hard to produce a lower quality customer experience that is still useful but I’ve yet to find an example where none were available. As an example, consider Exchange Hosted Services.  In that service, the mail must get through.  What is the degraded operation mode?  They actually can be found in mission critical applications such as EHS as well.  Here’s some examples: turn up the aggressiveness of edge blocks, defer processing of mail classified as Spam until later, process mail from users of the service ahead of non-known users, prioritize premium customers ahead of others.  There actually are quite a few options.  The important point is to think what they should be ahead of time and ensure they are developed and tested prior to Operations needing them in the middle of the night.

 

Some time back Skype recently had a closely related problem where the entire service went down or mostly down for more than a day.  What they report happened was that Windows Update forced many reboots and it lead to a flood of Skype login requests as the clients were coming back up and “that when combined with lack of peer to peer resources had a critical impact” (What Happened on August 16th?).  There are at least two interesting factors here, one generic to all services and one Skype specific.  Generically, it’s very common for login operations to be MUCH more expensive than steady state operation so all services need to engineer for login storms after service interruption.  The WinLive Messenger team has given this considerable thought and has considerable experience with this issue.  They know there needs to be an easy way to throttle login requests such that you can control the rate with which they are accepted (a fine grained admission control for login).  All services need this or something like this but it’s surprising how few have actually implemented this protection and tested it to ensure it works in production.  The Skype-specific situation is not widely documented put is hinted at by the “lack of peer-to-peer” resources note in the above referenced quote.  In Skype’s implementation, the lack of an available supernode will cause client to report login failure (this is documented in An Analysis of the Skype Internet Peer-to-Peer Internet Telephony Protocol which was sent to me by Sharma Kunapalli of IW Services Marketing team).  This means that nodes can’t login unless they can find a supernode.  This has a nasty side effect in that the fewer clients that can successfully login, the more likely it is that other clients won’t successfully find a supernode since a super-node is a just a well connected client.  If they can’t find a supernode, they won’t be able to login either.  Basically, the entire network is unstable due to the dependence on finding a supernode to successfully log a client into the network.  For Skype, a great “degraded operation” mode would be to allow login even when a supernode can’t be found. Let the client get on and perhaps establish peer connectivity later.

 

Why wait for failure and the next post-mortem to design in AND production test degraded operations for your services? 

 

                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, August 31, 2008 7:47:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, August 26, 2008

Facebooks F8 conference was held last month in San Francisco. During his mid-day keynote Mark Zuckerberg reported that the Facebook platform now has 400,000 developers and 90 million users of which 32% are from the United States.  The platforms US user population grew 2.4x last year while the international population grew at an astounding 5.1x.

Vladimir Fedorov (Windows Live Mesh) attended F8 and brought together this excellent sent of notes on the conference.

                                                --jrh

Summary:

I spent the day on Wednesday at Facebook (F8) conference and talked to some of the companies building facebook applications today. Overall I was pleasantly surprised by overall sense of direction/messaging and organization of the conference itself.

There were only 12 talks divided into 3 tracks - Technical/User Experience/Business, so I was able to attend a third of all talks.

The was focus throughout the day was on making it easier for applications that increase the value of the Facebook ecosystem and stopping abusing applications that detract value. The event itself was organized through a Facebook application. Here are the main changes in the Facebook application platform:

  1. Improve visibility of applications and allow users to observe functionality offered by an application without user taking an explicit install action
  2. Lower the barrier to using the application i.e. remove the necessity of a dialog granting rights to the application prior to any functionality being available
  3. Make the rights granting to application more granular i.e. remove the necessity of granting the application an extensive set of rights prior to using it. Grant specific permission at the time the application performs an operation.
  4. Allow external websites to act as applications on the Facebook platform by using Facebook as an identity provider, using social graph from Facebook and submitting data to Facebook news feed
  5. Allow the internalization method used for Facebook itself (translation by users) to be used by applications

The statistics given at keynote were 400k developers, 90 million users (32 % US / 68 % International) as compared to 24 million last year (50% US / 50% international), 200 million in venture capital given to facebook applications. Note that while the number of international users increased by 5.1x (by 49.2 million), the number of US users only increased by 2.4x (16.8 million).  

I went through the booths and talked to a number of Facebook application companies. I was primarily focusing on what they do and how they plan to make money. The business models are:

  1. Transaction fees - charging a small percentage per transaction for organizing events or coordinating travel
  2. Software as service - sell packages to organizations such as donation drive or car pooling applications
  3. Indirect advertising - large companies want to drive brand awareness through the social graph, but don't know how. There were different methods here - branded gifts i.e. Gunness beer, full featured brand campaigns, games which incorporate brand info in them, etc
  4. Direct advertising - trip planning, activity planning, wedding planning, reviews, etc

There were a number of companies that didn't have a real business model, but are still adding value to the ecosystem especially when combined with an offering from a different company.

The major features released are new application authorization model, new news feed (with new backend),  Facebook Connect, new look to the site and opening of internalization support used for Facebook itself to applications.

News Feed/News backend

They decided to do fan out on read in order to minimize storage costs and maximize the ability to tinker with the algorithm that decides which news events are shown to the user. The backend is made of two classes of machines – transient storage machines and aggregator machines. The users are assigned to buckets using a hashing algorithm and the buckets are assigned to transient storage machines using a DB table. For each user they store 30 days of events generated by the user in the transient storage machines. There are two replicas of the data in transient storage machines (replicas are on different racks). Each transient storage machine has 40GB of RAM and they use 40 machines for 90 million users. They also use 40 aggregator machines which actually construct the news feed that is shown on the website (in <50ms) by reading the events for each friend of the user from the transient storage machines and aggregating them. There are two racks each with twenty aggregators and twenty transient storage machines, where each rack has a complete copy of the data. The aggregators have affinity to transient storage machines in the same rack, but will go to the other  rack when local machine fails. There is no affinity between users and aggregators. They report they have 8x to 10x extra capacity in this solution. Facebook doesn’t have any geo-partitioning, which interesting given that majority of users are international. [JRH: they now have some geo-redundancy to serve read only queries nearer to users and to backup the primary site: Geo-Replication at Facebook]

The transient data is updated by another process called the “tailor” which reads the tail of a file on a network file system which actually contains the persisted copy of the data. The “tailor” periodically updates each user in the transient storage system via a system of dirty flags. Any of the transient machines can be restarted and reloaded from the persisted store in 10-20 minutes. This is different from the normal MySQL solution they use for the rest of their metadata.

They now allow comments on the news feed items. They also formalized 3 formats for the entry – one liner, summary, and picture plus text. The developers can register templates for each format (i.e. “author” has listened to “track” on MyMusicFoo) and then post just the data together with the template id instead of the whole message. The coalescing is black box – the system requires the developers to register multiple templates and will choose between them depending on the event volume. The event volume is throttles but the throttles change dynamically on the basis of user behavior i.e. if your applications event is marked as spam by some percentage of users the throttle is lowered.

Facebook Connect

In order to merge other websites into the ecosystem, Facebook is providing identity services to third party registered websites. A good example is integration with CitySearch. If you are logged into Facebook, you are automatically logged into CitySearch if you “CitySearch” enabled your Facebook account. Whenever you do a CitySearch review you have an option of spamming your friends news feed with it. You can also view reviews by your friends, who have “CitySearch” enabled their  Facebook account. Through a system of exchanging hashes for email addresses, there is a UI to invite your friends who are already on CitySearch to “CitySearch” enabled their  Facebook account. The end result is that in addition to providing identity services they also provide social network services and drive extra traffic to your site, making it more desirable for third party web sites to offer integration. In exchange the Facebook pages become more content rich and third party websites start acting almost like Facebook application.

 

International Support

 

Facebook is translated by the users themselves via voting system, where a user suggests a translation and the rest of the users vote on it. They opened this system up to applications, where application strings can be translated in the same way. While Facebook itself has had success with this model (complete translation to a  new language in <24 hours) it is less clear that application with smaller user bases will be translated quickly.

 

 

Tuesday, August 26, 2008 5:09:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, August 23, 2008

Kevin Clark, Director of IT Operations at Lucasfilm was interviewed by On-Demand Enterprise in We’ve Come a Long Way Since Star Wars.  His organization owns IT for LucasArts, Lucasfilm, and Industrial Light and Magic.

 

Lucasfilm runs a 4,500 server dedicated rendering farm and they expand this farm with workstations when  they are not in use to 5,500 servers in total.  The servers are dual socket, dual core Opterons with 32GB of memory.  Nothing unusual except the memory configuration is a bit larger than the current average.  They have 400TB of storage and produce 10 to 20TB of new and changed data each day.

 

Clark expects the big investment next year is making their datacenter more efficient. Partly for environmental reasons and partly because, like all businesses, they are power and cooling rather than floor space constrained. This is becoming the number one issue industry-wide and I’m glad to see. Current data center designs leave a lot of room for improvement.  At this year’s Foo Camp, I lead a short session on large scale data center power consumption: Where Does the Power Go and What Can We Do About It?

 

This cluster is medium sized but the data change rate is unusually high at 10 to 20TB a day.  It’s mostly batch work with each job being quite large.  It would be interesting to see more detail on the workload scheduler they have written to manage this workload.  It’s a bit ironic that IBM MVS (now called Z/OS) had a great scheduler 40 years ago. In the 10 years I worked for IBM, they constantly were requesting that a high quality batch scheduler be added to AIX.  And in the 11 years I’ve worked at Microsoft, there has been great interest in improving batch scheduling to the MVS-like levels.  More recently, Apache Hadoop has been used to run mega-jobs and, guess what?  It too needs a high-quality, prioritized, multi-job scheduler.  At the Hadoop Summit, Yahoo said they are working on one.  They typically contribute their Hadoop work to open source so Hadoop may have a better scheduler coming.

 

Thanks to Jeff Hammerbacher for pointing me to the note on Lucasfilm.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, August 23, 2008 6:50:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, August 21, 2008

Last Friday I arrived back from vacation (Back from the Outside Passage in BC) to 3,600 email messages.  I’ve been slogging through them through the weekend to now and I’m actually starting to catch up. 

 

Yesterday Tom Kleinpeter pointed me to this excellent posting from Jason Sobel of Facebook: Scaling Out. This excellent post describes the geo-replication support recently added to Facebook.  Rather than having a single west coast data center they added an east coast center both to be near to East Coast users and to provide redundancy for the west coast site.

 

It’s a cool post for two reasons: 1) it’s a fairly detailed description of how one large scale service implemented geo-redundancy, and 2) they are writing about it externally.  Way too many of these techniques are never discussed outside the implementing company and so everyone needs to keep re-inventing.  Facebook continues to share both how they engineer aspects of their service in addition to contributing some of the code used to do it to open source (e.g. Facebook Releases Cassandra as Open Source).  Good to see.

 

I’ve long been interested in geo-replication, geo-partitioning, and all other forms of cross data center operation because it’s a problem that every high scale service needs to solve and yet there really are no general, just-do-this recipes to easily operate over multiple data centers. A few common design patters have emerged but all solutions of reasonable complexity end up being application specific.  I would love to see general infrastructure emerge to generally support geo-redundancy but it’s a hard problem and solutions invariably exploit knowledge of application semantics. Since most solutions tend to be ad hoc and application specific today, it’s worth studying what they do while looking for common patterns. That was the other reason I enjoyed seeing this posting by Jason. He described how Facebook is currently addressing the issue and its always worth understanding as many existing solutions as possible before attempting a generalization.

 

The solution they have adopted is one where the California data center is primary and all writes are routed to it. The Virginia data center is secondary and serves read-only traffic.  A load balancer does layer 7 packet inspection and routes URIs for pages that support writes to CA.  If the page is read-only, as much of the Facebook traffic is, it gets routed to the east coast site (assuming you are “near” to it for some definition of near).  

 

All writes are done through the California data center. Its serves as primary for the entire service. When a write is done in the Facebook architecture, the corresponding memcached layer is invalidated after the write is completed.  MySQL replication is used to replicate the changes to the remote data center and solved the problem of only invalidating the remote memcached entries after the remote MySQL update by modifying MySQL.  They changed MySQL to clear the local memcached of the appropriate keys once the replicated write is complete. 

 

It’s a simple and fairly elegant approach and there is no doubt that simple is good. I would prefer an approach that scales out both reads and writes and there is a slight robustness risk that some engineer in the future may sometime add a page to the site that does a write and forget to update the this-page-writes URI list.  If MySQL supports running a secondary replica in read-only mode, then that potential issue can be quickly and easily detected. 

 

An approach that would allow multi-data center updates is to replicate the entire DB contents everywhere as Jason described in this post but rather than routing all writes through the California DB, partition the user-base on userID and route their traffic to a fixed center. Allow updates for that user at the data center they were routed to and replicate these changes to the other centers.  Essentially it’s the same solution that was described except rather than routing all requests to the single primary database in CA, the primary database is distributed over multiple datacenters partitioned on userID.  This approach would have many advantages but it is much more difficult if not all the data is cleanly partitioned by userID.  And, in social networks, it isn’t.  I refer to this as the “Hairball Problem” (Scaling LinkedIn).

 

The alternative approach I describe above of partitioning the primary database by userID would be reasonably easy to do in a service like an email system with most state partitioned cleanly by user but in a social network with lots of cross-user, shared state (the hairball problem), it’s harder to do.  Nonetheless, it’s probably the right place for FB to end up but the current solution is clean and works and it’s hard to argue with that.

 

If you come across other articles on geo-support or want to contribute one here, drop me a note jrh@mvdirona.com.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, August 21, 2008 5:36:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, July 17, 2008

Going  boating: http://mvdirona.com/ so I’ll be taking a break from blogging until mid-august when I’m back and caught back up.  Enjoy,

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Thursday, July 17, 2008 4:53:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, July 13, 2008

I just got back from O’Reilly’s Foo Camp.   Foo is an interesting conference format in that there is no set agenda. It’s basically self organized as a open space-type event but that’s not what makes it special.  What makes Foo a very cool conference is the people.  Lots of conferences invite good people but few invite such a  diverse a set of attendees.  It was a lot of fun.

 

Here’s a picture from Saturday night of (right to left) Jesse Robbins (Seattle entrepreneur and co-chair of O’Reilly’s Velocity conference), Pat Helland (Microsoft Developer Division Architect) and myself.

 

  

From: http://flickr.com/photos/jesserobbins/2663653582/

Foo is an invitational event loosely organized around folks O’Reilly find interesting or want to get to know better.  Tim O’Reilly describes the invite process in this blog comment: http://gigaom.com/2005/08/16/foocampfighting/#comment-19709 .

 

The conference starts around 5pm on Friday with drinks followed by dinner. After dinner, the sign-up boards for Saturday and Sunday talk sessions are put up. When Tim O’Reilly announced the sign-up boards were up, attendees rose from their tables and casually started ambling towards the sign-up boards as though there really was no rush. We’ll just be doing some signing up over the course of the evening.  We might do it now or perhaps we’ll do it later. No big rush.  And then some folks break into a slight trot towards but still not really an overt run  Suddenly, it’s a full raging stampede.  We became a single flow than as a discrete set of individuals. The flow accelerated and then crashed up against the sign-up boards spilling to both sides with folks madly negotiating for pens, better locations, sharing sessions, a shift of 6” to one side or the other, or  requesting to join two sessions together. Welcome to foo camp.  It didn’t slow down much from that point for the next 44 hours.

 

Sessions are 1 hour long but it’s good form to share a session if you don’t need the full hour. Speakers use their judgment to sign up for a small, medium or large room.  Some rooms have AV equipment and some don’t.

 

I shared a 1 hour ssession with Jeff Hammerbacher who leads the Facebook data team.  Earlier last week Jeff announced that he will be leaving Facebok in September: http://valleywag.com/5024169/yet-another-hoodie+wearing-harvard-kid-drops-out-of-facebook.  Jeff and I got a medium sized room in a tent beside the main building without A/V. 

 

The title for my session was Where Does the Power go in Data Centers and How to get it Back?  I didn’t show slides but much of what we covered is posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_DCPowerSavingsFooCamp08.ppt.  In the session, we talked through how contemporary large data centers work first looking at power distribution. We tracked the power from the feed to the substation at 115,000 volts through numerous conversions before arriving at the CPU at 1.2 volts. We then talked about power saving server design techniques.  And then the mechanical systems used to get the heat back out.  In each section we discussed what could be done to improve the design and how much could be saved.

 

Our conclusion from the session was that power savings of nearly 4x where both possible and affordable using only current technology.  For those participated in the session, thanks for your contribution and  for your help. It was fun.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, July 13, 2008 5:58:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Saturday, July 12, 2008

Last week the Facebook Data team released Cassandra as open source. Cassandra is an structured store with write ahead logging and indexing. Jeff Hammerbacher, who leads the Facebook Data team described Cassandra as a BigTable data model running on a Dynamo-like infrastructure.

 

Google Code for Cassandra (Apache 2.0 License): http://code.google.com/p/the-cassandra-project/.

 

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented at SIGMOD 2008 this year: Cassandra: Structured Storage System over a P2P Network.  From the presentation:

 

Cassandra design goals:

·         High availability

·         Eventual consistency

·         Incremental scalability

·         Optimistic replication

·         Knobs to “tune” tradeoffs between consistency, durability, and latecy

·         Low cost of ownership

·         Minimal administration

 

Write operation: write to arbitrary node in Cassandra cluster, request sent to node owning the data, node writes to log first and then applied to in-memory copy. Properties of write: no locks in critical path, sequential disk accesses, behaves like a write through cache, atomicity guarantee for a key, and always writable.

 

Cluster membership is maintained via gossip protocol.

 

Lessons learned:

·         Add fancy features only when required

·         Many types of failures are possible

·         Big systems need proper systems-level monitoring

·         Value simple designs

 

Future work:

·         Atomicity guarantees across multiple keys

·         Distributed transactions (I’ll try to talk them out of this one)

·         Compression support

·         Fine grained security via ACLs

 

It looks like a well engineered system.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, July 12, 2008 3:17:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<November 2008>
SunMonTueWedThuFriSat
2627282930311
2345678
9101112131415
16171819202122
23242526272829
30123456

Categories
This Blog
Member Login
All Content © 2012, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton