Wednesday, November 05, 2008

Butler Lampson, one of the founding members of Xerox PARC, a Turing award winner, and one of the most practical engineering thinkers I know, spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow.  Overall, Butler argues that “embodiment” is the next big phase of computing after simulation and communications.  Butler defines embodiment as computers interacting directly with the physical world, for example, autonomously driven vehicles.  He argues that this class of applications is only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.

 

He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications.  Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete.  Dependability is important because these systems interact directly with the physical world, and actions in the physical world can have life-critical failure modes.

 

Butler’s recommendation on how to build incredibly complex systems that directly interact with the physical world, and yet have these systems be dependable, is to build them in two tiers.  At the core is a small, simple kernel that doesn’t do a great job of its task but doesn’t hard fail and won’t kill anyone.  He calls this “catastrophe mode”.  For example, an autonomous vehicle may slow down to 10 MPH or just safely stop in catastrophe mode.

 

The software stack is designed in two layers, where the top layer is responsible for the complex, real-time interaction the system is designed to deliver. The inner or lower layer is catastrophe mode, designed to be simple and, as only simple systems can be, correct.  I like the approach.
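As a rough illustration of the two-layer idea, here is a minimal Python sketch. It is not from Butler's talk: the names Command, catastrophe_kernel, and drive_step are hypothetical, and the 10 MPH crawl speed is simply borrowed from the example above. The point is only that the complex top layer can fail in any way it likes and the simple base layer still produces a safe command.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Command:
    speed_mph: float
    source: str  # which layer produced the command


def catastrophe_kernel(current_speed_mph: float) -> Command:
    """Simple base layer: no planning, just cap speed at a safe crawl."""
    # Assumption for illustration: 10 MPH is the safe crawl speed.
    safe_speed = min(current_speed_mph, 10.0)
    return Command(speed_mph=safe_speed, source="kernel")


def drive_step(planner: Callable[[float], Optional[Command]],
               current_speed_mph: float) -> Command:
    """Try the complex top layer; fall back to the kernel on any failure."""
    try:
        command = planner(current_speed_mph)
    except Exception:
        command = None  # a crash in the top layer must not propagate
    # Treat "no answer" or an obviously invalid answer as a failure.
    if command is None or command.speed_mph < 0:
        return catastrophe_kernel(current_speed_mph)
    return command
```

The kernel is deliberately a few lines with no dependencies, which is what makes it plausible to review it exhaustively while the planner remains arbitrarily complex.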

 

Butler’s slides are: ButlerLampson_China_Microsoft2008 (1.49 MB).

 

                                                                --jrh

 

Title: The Uses of Computers: What's Past is Merely Prologue

Speaker: Butler Lampson

 

Implications of Moore’s Law:

·         Spend hardware to simplify software

·         Hardware enables new applications

·         Pull complexity up into software (if unavoidable)

The uses of computers:

·         1950: Simulation

·         1980: Communications

·         2010: Embodiment (computers interacting directly with the physical world)

Argument: embodiment is now possible and there are some grand challenges that fall into this category:

·         Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)

·         Butler’s example: Reduce highway traffic deaths to zero

What do we need to learn how to deal with to achieve embodiment in general and zero traffic deaths in particular:

·         Dealing with uncertainty

o   Need good models of what can happen (what is possible)

o   Need boundaries for models (where they don’t apply)

·         Dependability

o   The system meets its spec

o   Measure: probability(failure) x Cost(failure)

o   Hard to model dependability. Recommends using “no catastrophes”

o   Must have a threat model of what can go wrong

o   Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers, but a reliable system needs to be able to fall back to the reliable base kernel (less than 50k loc?)
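The dependability measure in the notes above, probability(failure) x cost(failure), can be sketched as a sum over a threat model. This is only an illustration: the failure modes, probabilities, and costs below are made-up placeholders, not figures from the talk.

```python
def expected_failure_cost(threat_model):
    """Sum probability(failure) x cost(failure) over every failure mode."""
    total = 0.0
    for name, (probability, cost) in threat_model.items():
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability for {name!r} must be in [0, 1]")
        total += probability * cost
    return total


# Hypothetical threat model: name -> (probability of failure, cost of failure).
example_threat_model = {
    "sensor dropout": (0.01, 1_000.0),
    "planner crash": (0.001, 50_000.0),
}

print(expected_failure_cost(example_threat_model))
```

A rare but very expensive failure can dominate the total, which is one way to read the recommendation to focus on "no catastrophes" rather than on eliminating every fault.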

Conclusions for Engineers:

·         Understand Moore’s Law

·         Aim for mass markets

·         Learn how to deal with uncertainty

·         Learn how to avoid catastrophe (avoiding faults is not possible in systems at scale)

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, November 05, 2008 1:07:21 PM (Pacific Standard Time, UTC-08:00)
Software
 Tuesday, November 04, 2008

Tony Hoare spoke yesterday at the Computing in the 21st Century Conference in Beijing. Tony is a Turing award winner, Quicksort inventor, author of the influential Communicating Sequential Processes (CSP) formal language, and long-time advocate of program verification and tools to help produce reliable software systems. In his talk he argues that programming should be and can be a science, and that the goal should be correct programs that stay correct through change: zero defect software.

 

He explains that engineers will accept that there will be defects but the scientist should pursue perfection far beyond that for which there is a commercial need. Tony has spent a big part of his successful career in pursuit of techniques and tools to produce reliable complex systems.

 

Tony ended his talk on a practical engineering note, hoping that we can advance our field to the point that “Software will contain no more errors than other engineering disciplines”.  We’re not there yet.

 

My rough notes from the talk follow.

 

Title: The Science of Programming

Speaker: Tony Hoare

 

The Vision:

·         Computer software contains no more errors

o   Software is the most reliable component of any device that contains it

·         Programmers make no mistakes

o   Programs work the first time they run

o   They run forever after, even after changing

·         Programming is an engineering discipline

o   Respected for its delivered benefits and its foundation on basic science

·         Semantics is the science of programming

o   Explores the meaning of computer programs

o   Operational: correctness of implementation

o   Algebraic: Correctness of optimization

o   Axiomatic

The Insight:

·         Computer programs are mathematical formulae

o   They don’t suffer from rust, wear, decay, fatigue

o   If a correct program is started in a correct state, it will stay correct

·         Their correctness is a mathematical conjecture

o   To be proved by logic and calculation

o   Checked by the computer itself

History of the idea:

·         Aristotle (350bc): Syllogistic logic

·         Euclid (300bc): geometry

·         Leibniz (1700): calculus

·         Boole (1850): laws of thought

·         Frege (1880): predicate logic

·         Russell (1920): Principia

·         Hao Wang (1956): Computer checks

Basic Science:

·         Answers fundamental questions

·         What does it do?

·         How does it work?

·         Why does it work?

·         How do we know?

What does it do?

·         Answered by its behavioral specification

How does it work?

·         Answered by its internal interface contracts

Why does the program work?

·         Answered by programming theory

How do we know?

·         By logical/mathematical proof
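To make the first two questions concrete, here is a small Python sketch of my own, not something from Tony's slides. The docstring plays the role of the behavioral specification, and the assertions stand in for interface contracts (precondition, loop invariant, postcondition). Note the swap: runtime assertions only check the runs you happen to execute, whereas the talk argues for proof that covers all runs. The function name integer_sqrt is just an example.

```python
def integer_sqrt(n: int) -> int:
    """Behavioral specification: return the largest r such that r*r <= n."""
    # Precondition: part of the contract the caller must meet.
    assert n >= 0, "precondition: n must be non-negative"
    r = 0
    while (r + 1) * (r + 1) <= n:
        # Loop invariant: r*r <= n holds on every iteration.
        assert r * r <= n
        r += 1
    # Postcondition: r is exactly the floor of the square root.
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r


print(integer_sqrt(10))
```

A verifier would try to prove that the postcondition follows from the precondition and the invariant for every non-negative n, rather than observing it for a handful of inputs.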

Ideals in Basic Science

·         Pursued for the sake of scientific glory far in advance of commercial need

·         Physics: accuracy of measurement

·         Chemistry: purity of materials

·         Computing Science: zero defect programs

Unifying Theory

·         Basic science seeks unifying theories

·         Explains diverse phenomena

·         Supported by evidence

Overall, industry is not heavily using software verification along the lines that Tony wants to see, but some verification tools are in use. For example, some of the tools in use at Microsoft:

·         PREfix and PREfast

·         Static Driver Verifier

·         ESP (locates potential buffer overflows)

The Hope:

·         Software will contain no more errors than other engineering disciplines.

 


 

Tuesday, November 04, 2008 2:23:57 PM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, October 25, 2008

Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn anything about a problem your service is experiencing from a customer.  How could they possibly know first when there is a service outage or issue? And yet it happens frequently. The reason it happens is that most sites don’t have anything close to an adequate level of instrumentation.  Without this instrumentation, you are flying blind.

 

Systems monitoring data can be used to drive alerts, to compute SLAs, to drive capacity planning, to find latencies, and to understand customer access patterns. Some sites also use it to drive billing, although the latter is probably a mistake.

 

In the rare cases where I’ve come across high quality monitoring systems that actually do fine-grained data collection, the data is often not looked at or is underutilized.  It turns out that fully using and exploiting very large amounts of monitoring data isn’t much easier than collecting it.

 

Returning to the challenge of efficiently collecting fine-grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday in making Scribe available as an open source project: Facebook's Scribe technology now open source.  Scribe is used at Facebook to monitor their more than 10k servers across multiple data centers.  Scribe is a Sourceforge project at: http://sourceforge.net/projects/scribeserver/.

 

Facebook continues to both develop interesting and broadly useful software and often contributes it to the community by making it open source. For example, Facebook Releases Cassandra as Open Source.

 

Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important:

 

Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).
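The two tracking metrics named in the excerpt above are easy to compute once alerts, trouble tickets, and health issues are recorded somewhere. The following Python sketch assumes a very simple record shape; the HealthIssue type, its field names, and the sample data are hypothetical and not part of any real monitoring system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HealthIssue:
    name: str
    alert_id: Optional[str]  # None when no alert fired for this issue


def alerts_to_ticket_ratio(alert_count: int, ticket_count: int) -> float:
    """Metric 1: alerts per trouble ticket (goal: near one)."""
    if ticket_count == 0:
        # Assumption: alerts with no tickets are all noise; none of either is ideal.
        return float("inf") if alert_count else 1.0
    return alert_count / ticket_count


def unalerted_issues(issues: Sequence[HealthIssue]) -> int:
    """Metric 2: health issues with no corresponding alert (goal: near zero)."""
    return sum(1 for issue in issues if issue.alert_id is None)


issues = [
    HealthIssue("disk full on db-17", alert_id="A-1041"),
    HealthIssue("slow login path", alert_id=None),
]
print(alerts_to_ticket_ratio(120, 100), unalerted_issues(issues))
```

A ratio well above one suggests alerts that nobody acts on, and a non-zero count of unalerted issues points at gaps in instrumentation.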

 

·         Instrument everything. Measure every customer interaction or transaction that flows through the system and report anomalies. There is a place for “runners” (synthetic workloads that simulate user interactions with a service in production) but they aren’t close to sufficient. Using runners alone, we’ve seen it take days to even notice a serious problem, since the standard runner workload was continuing to be processed well, and then days more to know why.

 

·         Data is the most valuable asset. If the normal operating behavior isn’t well-understood, it’s hard to respond to what isn’t. Lots of data on what is happening in the system needs to be gathered to know it really is working well. Many services have gone through catastrophic failures and only learned of the failure when the phones started ringing.

 

·         Have a customer view of service. Perform end-to-end testing. Runners are not enough, but they are needed to ensure the service is fully working. Make sure complex and important paths such as logging in a new user are tested by the runners. Avoid false positives. If a runner failure isn’t considered important, change the test to one that is. Again, once people become accustomed to ignoring data, breakages won’t get immediate attention.

 

·         Instrumentation required for production testing. In order to safely test in production, complete monitoring and alerting is needed. If a component is failing, it needs to be detected quickly.

 

·         Latencies are the toughest problem. Examples are slow I/O and not quite failing but processing slowly. These are hard to find, so instrument carefully to ensure they are detected.

 

·         Have sufficient production data. In order to find problems, data has to be available. Build fine grained monitoring in early or it becomes expensive to retrofit later. The  most important data that we’ve relied upon includes:

 

Thanks to Sriram Krishnan for pointing me to the release of Scribe.

 

                                                                --jrh


 

Saturday, October 25, 2008 8:33:23 AM (Pacific Standard Time, UTC-08:00)
Services
 Sunday, October 19, 2008

In When SSDs Make Sense in Server Applications, we looked at where Solid State Drives (SSDs) were practical in servers and services. On the client side, there are even more reasons to use SSDs and I expect that within three years, more than half of enterprise laptops will have NAND Flash as at least part of their storage subsystems. This estimate has SSDs in 38% of all laptops by 2011: Flash SSD in 38% of Laptops by 2011.  

 

What follows is a quick summary of SSD advantages on the client side, followed by the disadvantages, and then a closer look at the write endurance (wear-out) problem that has been the topic of much discussion recently.

 

Client SSD Advantages:

·         Random IOPS:  Laptop I/O patterns are dominated by random workloads and, as argued in When SSDs Make Sense in Server Applications, these workloads run cost effectively on SSDs

·         Low Power: SSD power consumption is typically under 2W and often under 1W. Enterprise disks can run 15 to 18W and desktop parts are typically in the 10W range, but laptop drives usually run a more modest 2.5W when active.  So, on one hand, this represents an exciting reduction in storage power of a factor of 2 but, on the other, it’s actually only about a 1W saving when the HDD is active and even less when idle. A savings, but a small one overall. If you are interested in more data on laptop power consumption, see Client-Side Power Consumption. Some very efficient HDDs actually have lower idle power consumption than some SSDs, so it’s not even the case that SSDs are better under all conditions from a power consumption perspective.

·         Quiet. HDDs can be noisy. They are mechanical parts with precision bearings spinning at high speeds and they make noise.  Semi-conductor-based SSDs avoid this.

·         Small Form Factors: SSDs can be small and lightweight.

·         Scale Down Floor: Disks have a price floor where further lowering the capacity of the device doesn’t save money. This price floor changes over time but, at this point, it’s hard to get much below $30 for a disk regardless of how small. The fixed costs of the mechanical parts dominate the media and the cost of the disk doesn’t scale down. SSD costs scale down well and for applications with modest storage requirements, they can be less expensive.  This makes them interesting for very low-end laptops, netPCs, ultra-mobile PCs, and, of course, NAND Flash is the storage of choice in cell phones, music players, cameras, and other related applications.

·         Shock and Vibration: HDDs usually spec max shock in the 50G to as high as 100G range and vibration in the ¼G to ½G range.  SSD specs run well over 1,000G shock and around 20G vibration. They are much more durable against this common threat in the laptop world.

·         Latency: I/O latency is far lower on an SSD than a HDD and this is particularly noticeable when I/O queues get deep as they often do on single disk laptops.

·         Reliability: HDDs are the number one failing component on clients (and servers). This is particularly a problem on laptops as they are (usually) single drive devices and often not well backed up. HDD failures represent a substantial service cost in most enterprises, so eliminating them is appealing.  Our operational history with SSDs is fairly short so far, but we expect they will exhibit less frequent failures than hard disks.  However, like all new components, they bring additional failure modes as well as eliminating a few.  The biggest concern around SSDs is write endurance, with SLC part lifetimes typically in the range of 10^5 write cycles and MLC parts down around 10^4 write cycles (some even lower). We’ll look at that in more depth below.

·         Temperature: SSDs have a much wider temperature and humidity operating range than HDDs.

 

Client SSD Disadvantages:

·         Capacity/$: Flash devices can deliver excellent random I/O performance and laptops, with only a single disk, are frequently random I/O bound rather than capacity limited.  In fact, many enterprise customers actually want LESS storage on their laptop fleet. For them, having less capacity is often either not a problem or even a potential advantage. For my uses and for many consumer usage patterns, capacity remains important, with pictures, audio, and other media files driving space requirements up to the point where SSDs can be tough to afford.  As a direct consequence, I expect that we’ll see more enterprise than consumer use of SSDs in clients.

·         Performance Degradation: There have been many reports of SSDs initially performing well and then degrading over time. See Laptop SSD Performance Degradation Problems for more detail.

·         Endurance: This is the most common concern I’ve heard of late with MLC write endurance only around 10,000 writes.

 

Write Endurance

I keep hearing anecdotal reports that SSDs in laptops are going to fail in the first year due to the poor write endurance of MLC SSDs. The typical MLC write endurance is usually quoted at around 10,000 cycles which I agree does sound quite low.

 

Let’s do a quick back of the envelope on MLC SSD write endurance (SLC parts are typically more expensive but have longer write endurance specifications). Assume a client system is used four hours a day and that it spends ¼ of that time at the max I/O rate of 100 IOPS.  My gut feel says this number very likely errs high.  Let’s include write amplification. Write amplification is a side effect of Flash memory designs having larger blocks as the unit of erase and smaller pages as the unit of read and programming (write).  This, combined with wear leveling, leads to the device having to do some overhead housekeeping writes when servicing writes from the host system. Assume an average write amplification of 3x over three years of life, which again seems high.  To make it really aggressive, we’ll assume a write to read ratio of 1:1 (50% writes), which is very high. Finally, let’s assume it’s a 64GB MLC device, that my writes are all to 4k pages, and that the overheads are all accounted for by my 3x write amplification number.

 

4*60*60*365*3*.25*.5*3*100 => 591m

 

Reading left to right, that’s 4 hours a day * 60 to get minutes * 60 to get seconds * 365 to get seconds of use per year * 3 to get seconds of use in three years, * 25% of time at max I/O, * 50% of I/Os are writes, * 3 write amplification, * 100 I/Os per second.

 

In aggregate, that’s about ½ billion write I/Os needed by each laptop over a three-year life.  But a 64GB device has 16m 4k pages. If you spread ½ billion writes over 16m pages with perfect wear leveling, you would have about 36 write I/Os per page. Very low. With terrible wear leveling, it could go up an order of magnitude but it’s still a very low number. Move write amplification up to 5x and the wear/page still looks tiny.   Move the usage pattern up from 4 hours a day to an aggressive 16 hours a day and it’s still only 147 writes per page. Perhaps we’ll use more lifetime I/Os with an SSD than my magnetic disk model above assumes, since we spend less time waiting but, still, it’s not looking like that big a number of lifetime writes required.
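For anyone who wants to vary the assumptions, the back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is just a sketch of the arithmetic in this post: the function and parameter names are mine, and the page count uses the rounded 16m figure from the text rather than an exact binary size.

```python
def lifetime_writes(hours_per_day=4, years=3, busy_fraction=0.25,
                    write_fraction=0.5, write_amplification=3, iops=100):
    """Total device-level write I/Os over the life of the laptop."""
    seconds_in_use = hours_per_day * 60 * 60 * 365 * years
    return (seconds_in_use * busy_fraction * write_fraction
            * write_amplification * iops)


def writes_per_page(total_writes, pages=16_000_000):
    """Average writes per page, assuming perfect wear leveling."""
    return total_writes / pages


base = lifetime_writes()                   # 4 hours a day, 3x amplification
heavy = lifetime_writes(hours_per_day=16)  # the aggressive 16 hour case

print(f"{base:,.0f} writes, {writes_per_page(base):.1f} per page")
print(f"{heavy:,.0f} writes, {writes_per_page(heavy):.1f} per page")
```

The per-page figures come out fractionally higher than the 36 and 147 quoted above because the text drops the fraction. Either way they sit far below a 10,000 cycle endurance specification.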

 

If we use very low endurance MLC where write endurance is specified down around 1,000 cycles rather than the more common 10,000, it’s still not a problem. But it is within an order of magnitude, so arguably a concern over a three-year life. And it would definitely be a concern over a five-year life.

 

Because client systems spend such a small percentage of their working lifetimes at 100% I/O rates, it’s hard to see a credible usage model that has MLC write endurance as a serious problem if using parts specified at 10,000 write cycles.

 

In a subsequent post, we’ll look back at server applications and, in contrast with When SSDs Make Sense in Server Applications, we’ll look at where SSDs don’t make sense on the server side.

 


 

Sunday, October 19, 2008 9:04:40 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Wednesday, October 15, 2008

In past posts, I’ve talked a lot about Solid State Drives.  I’ve mostly discussed why they are going to be relevant on the server side, and the shortest form of the argument is based on extremely hot online transaction processing (OLTP) systems.  There are potential applications as reliable boot disks in blade servers and other small data applications, but I’m focused on high-scale OLTP in this discussion. OLTP applications are random I/O bound workloads such as ecommerce systems, airline reservation systems, and any data intensive application that does lots of small reads and writes, usually on a database where future access patterns are unknown. When sizing a server for one of these workloads, the key dimension is the number of small random I/Os per second.  You need to add memory to increase the memory hit rate and reduce the number of I/Os, or you need to add disks to support the application-required I/O rates.  The problem with adding memory is that it has linear cost (the last DIMM costs as much as the first DIMM) but only logarithmic value.  Because the workloads are random, adding memory only delivers a reduction in I/Os roughly proportional to the square root of the memory size. Cheap memory helps but, even then, the costs add up, as does the power consumption, as memory is added.  Alternatively, you can add disks, but each disk added gives only another roughly 200 I/Os per second (IOPS) when using very expensive, 15k RPM disks.

 

The problem is best summarized by my favorite chart these days from Dave Patterson of Berkeley:

This chart is from an amazingly useful paper, Latency Lags Bandwidth (if you know of a no-charge location for this paper, let me know).  In this chart, Dave tracks the trend of bandwidth and latency over the last 20+ years. For the purposes of this discussion, ignore the latency row and focus on bandwidth. Disk bandwidth is growing slower than DRAM and CPU bandwidth.  I love looking for divergent trends in that they direct us to the more fundamental problems needing innovation.

 

Understanding that lagging disk bandwidth growth is a growing problem, let’s compare disk sequential bandwidth with random I/O rates over time. In the chart below, I graph sequential bandwidth growth against random bandwidth growth over the same period:

 

We know that disk sequential bandwidth growth lags the rest of the system. This graph shows that random IOPS bandwidth is growing even more slowly. Across the industry, we have a huge problem and the trend lines above make it crystal clear that the problem won’t be cost-effectively solved by disk alone. More detail on one dimension of the disk limits problem in: Why Disk Speeds aren’t Increasing.

 

Disks clearly aren’t the full solution. Ever larger memory sub-systems actually are part of the solution but the logarithmic (or worse) payback with linear cost and power consumption makes memory an expensive approach if we use it as the only tool.  Many have argued for the last couple of years that solid state disks are the solution to filling the chasm between memory and disk random IOPS rates.  Jim Gray was one of the first to make this observation in: Tape is Dead, Disk is Tape, Flash is Disk, Ram Locality is King.

 

The first generation, server-side SSDs were slow random write performers but we’re now seeing great components released to the market. See 100,000 IOPS and 1,000,000 IOPS. These are great performers but they are far from commodity pricing at this point.  Intel has been doing some great work on SSDs and I really like this one: Intel X25-E Extreme SATA Solid-State Drive.  It’s a step towards commodity pricing. Overall the industry now has great performing parts available and the price/performance equation is very rapidly improving since this is a semi-conductor component rather than a mechanical one.

 

When should we expect the crossover? At what price point are SSDs a win over HDDs?  Unfortunately, it’s an application specific answer.  It depends upon the I/O density of the workload: the number of I/Os per GB of data. Bob Fitzgerald has done a great job of analyzing different workloads to understand what level of application I/O heat (IOPS per GB) is needed to justify an SSD.  Building on Fitz’s work, I have a quick test you can use to figure out how cheap an SSD will have to get before it is a win in your application.

 

My observation goes like this. Disks have an abundance of capacity and are short of IOPS so, on random IOPS intensive workloads, the limiting factor using HDDs will be IOPS.  SSDs have an abundance of IOPS and are short of capacity, so the limiting factor using SSDs will be capacity.  SSDs are cost effective for your application when the cost of the disk farm adequate to support the IOPS you need is more than the SSD farm required to support the capacity you need. As a formula:

 

current#hdd * hdd$ > CapacityNeeded / Capacity_ssd * ssd$

 

Let’s try an example. This example application is hosted on several hundred database servers and it’s a red hot transaction processing system.  Each system has 53 disks, of which 40 are used to store data, 8 for log, and a few for admin purposes.  Leave the log on magnetic media since disk sequential bandwidth is cheaper than SSD sequential bandwidth. The database size on each server is 572GB.  The disks used by this application are 15k RPM, 3½-inch disks that price out at $333 each. Understanding this, the disk budget per server for this application is 40 * $333, which is $13,320.  We know we need 572GB, and let’s assume we are trying out 64GB SSDs.  Using that equation, 572/64 is 8.9, so we’ll need 9 SSDs to support this workload.

 

Taking the disk budget of $13,320 and dividing by the 9 SSDs we have computed we need, we can afford to pay up to $1,480 for each SSD. If the SSD’s cost is less than this, it’s worth doing.  This model ignores the power savings (SSDs usually run under 1/5 the power of HDDs and fewer are needed) and other factors like service costs, but it’s a quick check to see if SSDs are worth considering.
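The quick test above is simple enough to script so you can plug in your own numbers. The sketch below follows the formula and the worked example in this post; the function names are mine, and like the model in the text it ignores power, service costs, and the log disks that stay on magnetic media.

```python
import math


def ssds_needed(capacity_needed_gb, ssd_capacity_gb):
    """SSDs are capacity limited, so size the SSD farm by capacity."""
    return math.ceil(capacity_needed_gb / ssd_capacity_gb)


def break_even_ssd_price(hdd_count, hdd_price, capacity_needed_gb,
                         ssd_capacity_gb):
    """Highest per-SSD price at which the SSD farm matches the HDD budget."""
    hdd_budget = hdd_count * hdd_price  # disks are IOPS limited
    return hdd_budget / ssds_needed(capacity_needed_gb, ssd_capacity_gb)


# The example from the text: 40 data disks at $333, 572GB on 64GB SSDs.
count = ssds_needed(572, 64)
price = break_even_ssd_price(40, 333, 572, 64)
print(f"{count} SSDs, worth buying below ${price:,.0f} each")
```

Rounding the SSD count up matters: a workload that needs 8.9 devices still has to buy 9, which lowers the break-even price slightly.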

 

We also need more data on SSD longevity in high write-rate workloads.  In the absence of historical data, ask your vendor to stand behind their product with a full warranty in your usage model before jumping in.

 

Speaking of wear-out rates, for the next posting I’ll investigate client-side MLC NAND-flash wear out rates.

 

                                                                --jrh

 


 

Wednesday, October 15, 2008 6:27:15 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Saturday, October 11, 2008

Albert Greenberg and I missed Hotnets 2008 last week due to a conflicting meeting down in California but K