Sunday, November 09, 2008

Abolade Gbadegesin, Windows Live Mesh Architect, gave a great talk at the Microsoft Professional Developers Conference on Windows Live Mesh (talk video, talk slides). Live Mesh is a service that supports p2p file sharing amongst your devices, file storage in the cloud, remote access to all your devices (through firewalls and NATs), and web access to those files you choose to store in the cloud. Live Mesh is a good service and worth investigating in its own right, but what makes this talk particularly interesting is that Abolade gets into the architecture of how the system is written and, in many cases, why it is designed that way.

 

I’ve been advocating redundant, partitioned, fail-fast service designs based upon Recovery Oriented Computing for years. For example, see Designing and Deploying Internet Scale Services (paper, slides). Live Mesh is a great example of such a service. It’s designed with enough redundancy and monitoring that service anomalies are detected and, when detected, it auto-recovers by first restarting, then rebooting, and finally re-imaging the failing system.
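
The restart, reboot, re-image escalation can be sketched in a few lines. This is a minimal illustration and not how Live Mesh implements it: the `recover` function, the `actions` hooks, and the health check are all hypothetical names I've made up for the example.

```python
from typing import Callable, Sequence

# Escalation order from the talk summary: restart, then reboot, then re-image.
RECOVERY_STEPS = ("restart", "reboot", "reimage")


def recover(is_healthy: Callable[[], bool],
            actions: dict[str, Callable[[], None]],
            steps: Sequence[str] = RECOVERY_STEPS) -> str:
    """Apply each recovery action in order until the health check passes.

    Returns the step that fixed the node, "healthy" if no action was
    needed, or "failed" if every step was tried without success.
    """
    if is_healthy():
        return "healthy"
    for step in steps:
        actions[step]()      # hypothetical hook into the deployment system
        if is_healthy():
            return step
    return "failed"          # nothing worked: hand off to an operator


# Usage: simulate a node that only recovers after a reboot.
state = {"ok": False}
log = []
actions = {
    "restart": lambda: log.append("restart"),
    "reboot": lambda: (log.append("reboot"), state.update(ok=True)),
    "reimage": lambda: log.append("reimage"),
}
result = recover(lambda: state["ok"], actions)
print(result, log)
```

The point of the ordering is that each step is more disruptive and slower than the one before it, so the cheap fix is always tried first.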

 

It’s partitioned across multiple data centers and, in each data center, across many symmetric commodity servers, each of which is a 2 core, 4 disk, 8 GB system. The general design principles are:

·         Commodity hardware

·         Partitioning for scaling out, redundancy for availability

·         Loose coupling across roles

·         Xcopy deployment and configuration

·         Fail-fast, recovery-oriented error handling

·         Self-monitoring and self-healing

 

The scale out strategy is to:

·         Partition by user, device, and Mesh Object

·         Use soft state to minimize I/O load

·         Leverage HTTP 1.1 semantics for caching, change notification, and incremental state transfer

·         Leverage client-side resources for holding state

·         Leverage peer connectivity for content replication
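
The talk summary doesn't say how keys are mapped to partitions, so the following is only one common way to do it: hash the user, device, or Mesh Object identifier and take it modulo the partition count. The function name and key format are assumptions for illustration.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a user, device, or Mesh Object identifier to a partition number.

    Uses a stable hash (not Python's per-process hash()) so every server
    computes the same partition for the same key.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Usage: the same key always lands on the same partition.
p = partition_for("user:alice", 16)
print(p)
```

Plain modulo hashing remaps most keys when the partition count changes, which is why production systems often use a directory or consistent hashing instead.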

 

Experiences and lessons learned on availability:

·         Design for loosely coupled dependence on building blocks

·         Diligently validate client/cloud upgrade scenarios

·         Invest in pre-production stress and functional coverage in environments that look like production

·         Design for throttling based on both dynamic thresholds and static bounds
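
As a rough sketch of the last point, throttling on both dynamic thresholds and static bounds: the limit moves with observed latency, but is always clamped between a fixed floor and ceiling. The function, its parameters, and the specific back-off policy are my assumptions, not details from the talk.

```python
def admission_limit(recent_latencies_ms, target_ms, current_limit,
                    floor=10, ceiling=1000):
    """Return the new per-interval request limit.

    Dynamic threshold: halve the limit when average latency exceeds the
    target, otherwise grow it slowly.
    Static bounds: the result never leaves [floor, ceiling], no matter
    what the dynamic rule proposes.
    """
    if not recent_latencies_ms:
        return current_limit
    avg = sum(recent_latencies_ms) / len(recent_latencies_ms)
    if avg > target_ms:
        proposed = current_limit // 2    # back off quickly under load
    else:
        proposed = current_limit + 10    # recover slowly
    return max(floor, min(ceiling, proposed))


print(admission_limit([200, 300], target_ms=100, current_limit=400))
```

The static bounds protect against a misbehaving dynamic rule: bad latency data can't throttle the service to zero or open it up without limit.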

 

Experiences and lessons learned on monitoring:

·         Continuously refine performance counters, logs, and log processing tools

·         Monitor end-user-visible operations (Keynote)

·         Build end-to-end tracing across tiers

·         Self-healing is hard:  Invest in tuning watchdogs and thresholds

 

Experiences and lessons learned on deployment:

·         Deployments every other week, client upgrades every month

·         Major functionality roughly each quarter

·         Took advantage of gradual ramp to learn lessons early

 

--jrh

 

Thanks to Andrew Enfield for sending this one my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, November 09, 2008 10:05:38 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, November 05, 2008

Butler Lampson, one of the founding members of Xerox PARC, Turing award winner, and one of the most practical engineering thinkers I know, spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow. Overall, Butler argues that “embodiment” is the next big phase of computing after simulation and communications. Butler defines embodiment as computers interacting directly with the physical world. For example, autonomously driven vehicles. Butler argues that this class of applications is only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.

 

He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications. Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete. Dependability is important in that these systems are directly interacting with the physical world, and actions in the physical world can have life-critical failure modes.

 

Butler’s recommendation on how to build incredibly complex systems that directly interact with the physical world, and yet have these systems be dependable, is to build them two tier. At the core is a small, simple kernel that doesn’t do a great job of its task but doesn’t hard fail and won’t kill anyone. He calls this “catastrophe mode”. For example, an autonomous vehicle may slow down to 10 MPH or just safely stop in catastrophe mode.

 

The software stack is designed in two layers where the top layer is responsible for the complex, real time interaction the system is designed to deliver. The inner or lower layer is catastrophe mode designed to be simple and, as only simple systems can be, correct.  I like the approach.
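
A minimal sketch of the two-layer idea, using the vehicle example above. Everything here is hypothetical: the function names, the 70 MPH envelope, and the notion of a `planner` callable are mine, not from Butler's slides. The complex layer is trusted only while its output stays inside an envelope the simple layer can check.

```python
SAFE_SPEED_MPH = 10.0   # the "slow down to 10 MPH" example from the talk


def safe_kernel(sensor_ok: bool) -> float:
    """Simple inner layer: crawl if sensors are usable, otherwise stop."""
    return SAFE_SPEED_MPH if sensor_ok else 0.0


def choose_speed(planner, sensor_ok: bool, max_speed_mph: float = 70.0) -> float:
    """Use the complex planner when it behaves; otherwise enter catastrophe mode."""
    try:
        speed = planner()
    except Exception:
        return safe_kernel(sensor_ok)
    # Reject any output outside the envelope the simple layer can verify.
    if not isinstance(speed, (int, float)) or not 0.0 <= speed <= max_speed_mph:
        return safe_kernel(sensor_ok)
    return float(speed)


print(choose_speed(lambda: 55, sensor_ok=True))       # planner output accepted
print(choose_speed(lambda: 1 / 0, sensor_ok=True))    # planner crashed
print(choose_speed(lambda: 500, sensor_ok=False))     # implausible output, bad sensors
```

The inner layer is deliberately too simple to be good at the task; its only job is to be small enough to get right.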

 

Butler’s slides are: ButlerLampson_China_Microsoft2008 (1.49 MB).

 

                                                                --jrh

 

Title: The Uses of Computers: What's Past is Merely Prologue

Speaker: Butler Lampson

 

Implications of Moore’s Law

·         Spend hardware to simplify software

·         Hardware enables new applications

·         Pull complexity up into software (if unavoidable)

The uses of computers:

·         1950: Simulation

·         1980: Communications

·         2010: Embodiment (computers interacting directly with the physical world)

Argument: embodiment is now possible and there are some grand challenges that fall into this category:

·         Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)

·         Butler’s example: Reduce highway traffic deaths to zero

What do we need to learn how to deal with to achieve embodiment in general and zero traffic deaths in particular:

·         Dealing with uncertainty

o   Need good models of what can happen (what is possible)

o   Need boundaries for models (where they don’t apply)

·         Dependability

o   The system meets its spec

o   Measure: probability(failure) x Cost(failure)

o   Hard to model dependability. Recommends using “no catastrophes”

o   Must have a threat model of what can go wrong

o   Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers, but a reliable system needs to be able to fall back to the reliable base kernel (less than 50k loc?)
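
The dependability measure in the notes, probability(failure) x cost(failure), can be applied across a threat model to rank what matters most. A small sketch with made-up failure modes and numbers, purely for illustration:

```python
def rank_risks(threat_model):
    """Return (name, probability * cost) pairs, largest expected cost first."""
    risks = [(name, p * cost) for name, (p, cost) in threat_model.items()]
    return sorted(risks, key=lambda item: item[1], reverse=True)


# Hypothetical failure modes: name -> (probability of failure, cost of failure).
threat_model = {
    "sensor dropout": (0.25, 400.0),
    "planner timeout": (0.5, 10.0),
}
ranked = rank_risks(threat_model)
total = sum(risk for _, risk in ranked)
print(ranked, total)
```

Note that the more frequent failure is not the most important one here: a rare failure with a high cost dominates the total.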

Conclusions for Engineers:

·         Understand Moore’s Law

·         Aim for mass markets

·         Learn how to deal with uncertainty

·         Learn how to avoid catastrophe (avoiding faults is not possible in systems at scale)

 


 

Wednesday, November 05, 2008 1:07:21 PM (Pacific Standard Time, UTC-08:00)
Software
 Tuesday, November 04, 2008

Tony Hoare spoke yesterday at the Computing in the 21st Century Conference in Beijing. Tony is a Turing award winner, Quicksort inventor, author of the influential Communicating Sequential Processes (CSP) formal language, and long-time advocate of program verification and tools to help produce reliable software systems. In his talk he argues that programming should be and can be a science, and that the goal should be correct programs that stay correct through change. Zero defect software.

 

He explains that engineers will accept that there will be defects but the scientist should pursue perfection far beyond that for which there is a commercial need. Tony has spent a big part of his successful career in pursuit of techniques and tools to produce reliable complex systems.

 

Tony ended his talk on a practical engineering note, hoping that we can advance our field to the point that “Software will contain no more errors than other engineering disciplines”. We’re not there yet.

 

My rough notes from the talk follow.

 

Title: The Science of Programming

Speaker: Tony Hoare

 

The Vision:

·         Computer software contains no more errors

o   Software is the most reliable component of any device that contains it

·         Programmers make no mistakes

o   Programs work the first time they run

o   They run forever after, even after changing

·         Programming is an engineering discipline

o   Respected for its delivered benefits and its foundation on basic science

·         Semantics is the science of programming

o   Explores the meaning of computer programs

o   Operational: correctness of implementation

o   Algebraic: Correctness of optimization

o   Axiomatic

The Insight:

·         Computer programs are mathematical formulae

o   They don’t suffer from rust, wear, decay, fatigue

o   If a correct program is started in a correct state, then it will stay correct

·         Their correctness is a mathematical conjecture

o   To be proved by logic and calculation

o   Checked by the computer itself

History of the idea:

·         Aristotle (350bc): Syllogistic logic

·         Euclid (300bc): geometry

·         Leibniz (1700): calculus

·         Boole (1850): laws of thought

·         Frege (1880): predicate logic

·         Russell (1920): Principia

·         Hao Wang (1956): Computer checks

Basic Science:

·         Answers fundamental questions

·         What does it do?

·         How does it work?

·         Why does it work?

·         How do we know?

What does it do?

·         Answered by its behavioral specification

How does it work?

·         Answered by its internal interface contracts

Why does the program work?

·         Answered by programming theory

How do we know?

·         By logical/mathematical proof

Ideals in Basic Science

·         Pursued for the sake of scientific glory far in advance of commercial need

·         Physics: accuracy of measurement

·         Chemistry: purity of materials

·         Computing Science: zero defect programs

Unifying Theory

·         Basic science seeks unifying theories

·         Explains diverse phenomena

·         Supported by evidence

Overall, industry is not heavily using software verification along the lines that Tony wants to see but there are some in use. For example, some tools in use at Microsoft:

·         PREfix and PREfast

·         Static Driver Verifier

·         ESP (locates potential buffer overflows)

The Hope:

·         Software will contain no more errors than other engineering disciplines.

 


 

Tuesday, November 04, 2008 2:23:57 PM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, October 25, 2008

Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn anything about a problem your service is experiencing from a customer. How could they possibly know first when there is a service outage or issue? And yet, it happens frequently. The reason it happens is that most sites don’t have close to an adequate level of instrumentation. Without this instrumentation, you are flying blind.

 

Systems monitoring data can be used to drive alerts, to compute SLAs, to drive capacity planning, to find latencies, and to understand customer access patterns. Some sites also use it to drive billing, although the latter is probably a mistake.

 

In the rare cases where I’ve come across high quality monitoring systems that actually do fine-grained data collection, the data is often not looked at or is underutilized. It turns out that fully exploiting very large amounts of monitoring data isn’t much easier than collecting it.

 

Returning to the challenge of efficiently collecting fine-grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday by making Scribe available as an open source project: Facebook's Scribe technology now open source. Scribe is used at Facebook to monitor their more than 10k servers across multiple data centers. Scribe is a Sourceforge project at: http://sourceforge.net/projects/scribeserver/.

 

Facebook continues to develop interesting and broadly useful software and often contributes it to the community by making it open source. For example, Facebook Releases Cassandra as Open Source.

 

Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important:

 

Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).
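
The two metrics in the excerpt above are easy to compute once alerts, tickets, and health issues share an identifier. A small sketch; the function name and the idea of keying everything by an issue id are assumptions for illustration, not something the paper prescribes.

```python
def alerting_metrics(alerts, tickets, health_issues):
    """Compute the two alert-quality metrics from the excerpt.

    alerts:        list of issue ids, one entry per alert fired
    tickets:       set of issue ids that resulted in a trouble ticket
    health_issues: set of issue ids for all known system health issues

    Returns (alerts-to-ticket ratio, number of health issues with no alert).
    """
    ratio = len(alerts) / len(tickets) if tickets else float("inf")
    missed = health_issues - set(alerts)
    return ratio, len(missed)


# Hypothetical week: four alerts, two tickets, one issue nobody was alerted on.
ratio, missed = alerting_metrics(
    alerts=["db", "db", "disk", "noise"],
    tickets={"db", "disk"},
    health_issues={"db", "disk", "latency"},
)
print(ratio, missed)
```

A ratio well above one means the operations team is being trained to ignore alerts; a non-zero missed count means customers may find problems first.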

 

·         Instrument everything. Measure every customer interaction or transaction that flows through the system and report anomalies. There is a place for “runners” (synthetic workloads that simulate user interactions with a service in production) but they aren’t close to sufficient. Using runners alone, we’ve seen it take days to even notice a serious problem, since the standard runner workload was continuing to be processed well, and then days more to know why.

 

·         Data is the most valuable asset. If the normal operating behavior isn’t well-understood, it’s hard to respond to what isn’t. Lots of data on what is happening in the system needs to be gathered to know it really is working well. Many services have gone through catastrophic failures and only learned of the failure when the phones started ringing.

 

·         Have a customer view of service. Perform end-to-end testing. Runners are not enough, but they are needed to ensure the service is fully working. Make sure complex and important paths such as logging in a new user are tested by the runners. Avoid false positives. If a runner failure isn’t considered important, change the test to one that is. Again, once people become accustomed to ignoring data, breakages won’t get immediate attention.

 

·         Instrumentation required for production testing. In order to safely test in production, complete monitoring and alerting is needed. If a component is failing, it needs to be detected quickly.

 

·         Latencies are the toughest problem. Examples are slow I/O and not quite failing but processing slowly. These are hard to find, so instrument carefully to ensure they are detected.

 

·         Have sufficient production data. In order to find problems, data has to be available. Build fine grained monitoring in early or it becomes expensive to retrofit later. The  most important data that we’ve relied upon includes:

 

Thanks to Sriram Krishnan for pointing me to the release of Scribe.

 

                                                                --jrh


 

Saturday, October 25, 2008 8:33:23 AM (Pacific Standard Time, UTC-08:00)
Services
 Sunday, October 19, 2008

In When SSDs Make Sense in Server Applications, we looked at where Solid State Drives (SSDs) were practical in servers and services. On the client side, there are even more reasons to use SSDs and I expect that within three years, more than half of enterprise laptops will have NAND Flash as at least part of their storage subsystems. This estimate has SSDs in 38% of all laptops by 2011: Flash SSD in 38% of Laptops by 2011.  

 

What follows is a quick summary of SSD advantages on the client side, followed by the disadvantages, and then a closer look at the write endurance (wear-out) problem that has been the topic of much discussion recently.

 

Client SSD Advantages:

·         Random IOPS:  Laptop I/O patterns are dominated by random workloads and, as argued in When SSDs Make Sense in Server Applications, these workloads run cost effectively on SSDs

·         Low Power: SSD power consumption is typically under 2W and often under 1W. Enterprise disks can run 15 to 18W and desktop parts are typically in the 10W range, but laptop drives usually run a more modest 2.5W when active. So, on one hand, this represents an exciting reduction in storage power of a factor of 2 but, on the other, it’s actually only a 1W saving when the HDD is active and even less when idle. A savings, but a small one overall. If you are interested in more data on laptop power consumption, see Client-Side Power Consumption. Some very efficient HDDs actually have less idle power consumption than some SSDs, so it’s not even the case that SSDs are better under all conditions from a power consumption perspective.

·         Quiet: HDDs can be noisy. They are mechanical parts with precision bearings spinning at high speeds and they make noise. Semiconductor-based SSDs avoid this.

·         Small Form Factors: SSDs can be small and light weight.

·         Scale Down Floor: Disks have a price floor where further lowering the capacity of the device doesn’t save money. This price floor changes over time but, at this point, it’s hard to get much below $30 for a disk regardless of how small. The fixed costs of the mechanical parts dominate the media and the cost of the disk doesn’t scale down. SSD costs scale down well and for applications with modest storage requirements, they can be less expensive.  This makes them interesting for very low-end laptops, netPCs, ultra-mobile PCs, and, of course, NAND Flash is the storage of choice in cell phones, music players, cameras, and other related applications.

·         Shock and Vibration: HDDs usually spec max shock in the 50G to as high as 100G range and vibration in the ¼G to ½G range. SSD specs run well over 1,000G shock and around 20G vibration. They are much more durable against this common threat in the laptop world.

·         Latency: I/O latency is far lower on an SSD than a HDD and this is particularly noticeable when I/O queues get deep as they often do on single disk laptops.

·         Reliability: HDDs are the number one failing component on clients (and servers). This is particularly a problem on laptops as they are (usually) single drive devices and often not well backed up. HDD failures represent a substantial service cost in most enterprises, so eliminating them is appealing. Our operational history with SSDs is fairly short so far, but we expect they will exhibit less frequent failures than hard disks. However, like all new components, they bring additional failure modes as well as eliminating a few. The biggest concern around SSDs is write endurance, with SLC part lifetimes typically in the range of 10^5 writes and MLC parts down around 10^4 write cycles (some even lower). We’ll look at that in more depth below.

·         Temperature: SSDs have a much wider temperature and humidity operating range than HDDs.

 

Client SSD Disadvantages:

·         Capacity/$: Flash devices can deliver excellent random I/O performance and laptops, with only a single disk, are frequently random I/O bound rather than capacity limited. In fact, many enterprise customers actually want LESS storage on their laptop fleet. For them, having less capacity is often either not a problem or even a potential advantage. For my uses and for many consumer usage patterns, capacity remains important, with pictures, audio, and other media files driving space requirements up to the point where SSDs can be tough to afford. As a direct consequence, I expect that we’ll see more enterprise than consumer use of SSDs in clients.

·         Performance Degradation: There have been many reports of SSDs initially performing well and then degrading over time. See Laptop SSD Performance Degradation Problems for more detail.

·         Endurance: This is the most common concern I’ve heard of late with MLC write endurance only around 10,000 writes.

 

Write Endurance

I keep hearing anecdotal reports that SSDs in laptops are going to fail in the first year due to the poor write endurance of MLC SSDs. The typical MLC write endurance is usually quoted at around 10,000 cycles which I agree does sound quite low.

 

Let’s do a quick back of the envelope on MLC SSD write endurance (SLC parts are typically more expensive but have longer write endurance specifications). Assume a client system is used four hours a day and that it spends ¼ of that time at the max I/O rate of 100 IOPS. My gut feel says this number very likely errs high. Let’s include write amplification. Write amplification is a side effect of Flash memory designs having larger blocks as the unit of erase and smaller pages as the unit of read and programming (write). This, combined with wear leveling, leads to the device having to do some overhead housekeeping writes when servicing writes from the host system. Assume an average write amplification of 3x over three years of life, which again seems high. To make it really aggressive, we’ll assume a write to read ratio of 1:1 (50% writes), which is very high. Finally, let’s assume it’s a 64GB MLC device, that my writes are all to 4k pages, and that the overheads are all accounted for by my 3x write amplification number.

 

4*60*60*365*3*.25*.5*3*100 => 591m

 

Reading left to right, that’s 4 hours a day * 60 to get minutes * 60 to get seconds * 365 to get seconds of use per year * 3 to get seconds of use in three years, *25% of time at max I/O, *50% of I/Os are writes, *3 write amplification, *100 I/Os per second.

 

In aggregate, that’s about ½ billion write I/Os needed by each laptop over a three-year life. But a 64GB device has 16m pages. If you spread ½ billion writes over 16m pages with perfect wear leveling, you would have 36 write I/Os per page. Very low. With terrible wear leveling, it could go up an order of magnitude, but it’s still a very low number. Move write amplification up to 5x and the wear/page still looks tiny. Move the usage pattern up from 4 hours a day to an aggressive 16 hours a day and it’s still only 147 writes per page. Perhaps we’ll use more lifetime I/Os with an SSD than in my magnetic disk model above, assuming we spend less time waiting, but it’s still not looking like a big number of lifetime writes.
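
The back of the envelope above is easy to re-run with different assumptions. This sketch just encodes the same arithmetic; the function and parameter names are mine, and `PAGES` uses the rounded 16m pages figure for a 64GB device with 4k pages.

```python
def lifetime_writes(hours_per_day=4, years=3, busy_fraction=0.25,
                    write_fraction=0.5, write_amplification=3, iops=100):
    """Total device-level page writes over the life of the laptop."""
    seconds_in_use = hours_per_day * 60 * 60 * 365 * years
    return (seconds_in_use * busy_fraction * write_fraction
            * write_amplification * iops)


PAGES = 16_000_000          # 64GB of 4k pages, rounded as in the text
MLC_ENDURANCE = 10_000      # typical quoted MLC write cycles

base = lifetime_writes()
per_page = base / PAGES
heavy_per_page = lifetime_writes(hours_per_day=16) / PAGES

print(f"{base:,.0f} lifetime writes")
print(f"{per_page:.1f} writes per page at 4 hours/day")
print(f"{heavy_per_page:.1f} writes per page at 16 hours/day")
print(f"{per_page / MLC_ENDURANCE:.2%} of rated MLC endurance used")
```

With perfect wear leveling this works out to roughly 37 writes per page at 4 hours a day and roughly 148 at 16 hours a day, which the figures in the text round down to 36 and 147. Either way it is well under 2% of a 10,000 cycle rating.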

 

If we use very low endurance MLC where write endurance is specified down around 1,000 cycles rather than the more common 10,000, it’s still not a problem. But it is within an order of magnitude, so arguably a concern over a three-year life. And it would definitely be a concern over a 5-year life.

 

Because client systems spend such a small percentage of their working lifetimes at 100% I/O rates, it’s hard to see a credible usage model that has MLC write endurance as a serious problem if using parts specified at 10,000 write cycles.

 

In a subsequent post, we’ll look back at server applications and, in contrast with When SSDs Make Sense in Server Applications, we’ll look at where SSDs don’t make sense on the server side.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W: