Sunday, February 17, 2008

Yet another argument in favor of Degraded Operations Mode (http://mvdirona.com/jrh/perspectives/2008/01/22/DegradedOperationsMode.aspx) emerged last week. All of Amazon AWS (S3, SimpleDB, Simple Queue Service, EC2, etc.) was down for several hours: http://mvdirona.com/jrh/perspectives/2008/02/15/DowntimeAmazonS3SimpleDBSQS.aspx. The outage was reportedly due to an authentication storm: http://www.highscalability.com/s3-failed-because-authentication-overload (Mike Neil sent this my way).

 

Remember, you’ll never have the capacity for the biggest load inrush and, no matter how hard you try, your capacity planning will continue to be only slightly better than the weather report for next week. When you don’t know what’s coming, design systems to operate through adversity: Degraded Operations Mode.
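
To make the idea concrete, here is a minimal sketch of one way a service could keep operating through a load spike: it admits everything in normal operation, only critical requests once a threshold is crossed, and nothing new at full capacity. The class name, the `critical` flag, and the thresholds are all hypothetical and chosen for illustration; this is not a description of how AWS or any other service implements it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    critical: bool  # hypothetical flag: must this request be served even under stress?


class DegradedModeGate:
    """Admission gate with three modes: normal, degraded, shed (illustrative only)."""

    def __init__(self, capacity: int, degrade_threshold: int):
        self.capacity = capacity                    # max requests in flight
        self.degrade_threshold = degrade_threshold  # in-flight count that triggers degraded mode
        self.in_flight = 0

    def mode(self) -> str:
        if self.in_flight >= self.capacity:
            return "shed"
        if self.in_flight >= self.degrade_threshold:
            return "degraded"
        return "normal"

    def admit(self, request: Request) -> bool:
        current = self.mode()
        if current == "shed":
            return False  # at capacity: reject everything rather than fall over
        if current == "degraded" and not request.critical:
            return False  # under stress: drop non-critical work first
        self.in_flight += 1
        return True

    def done(self) -> None:
        # Call when an admitted request completes.
        self.in_flight = max(0, self.in_flight - 1)


gate = DegradedModeGate(capacity=10, degrade_threshold=8)
browse = Request("browse-catalog", critical=False)
login = Request("authenticate", critical=True)
admitted = [gate.admit(browse) for _ in range(8)]  # fills the gate up to the threshold
```

The point of the sketch is the ordering of the checks: the service decides what to drop before it is overwhelmed, instead of letting an inrush take everything down at once.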

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Sunday, February 17, 2008 12:14:11 AM (Pacific Standard Time, UTC-08:00)
Services
 Friday, February 15, 2008

I recently was in a meeting with several physicians. One of them reported a result I have always suspected, but at a magnitude I never would have guessed. The core observation was that 80% of medical diagnoses were incorrect. The other doctors in the room confirmed this number to be roughly consistent with their experience. Less anecdotal support for this high estimated error rate is found in those cases where a “gold standard” diagnostic test is discovered where there previously wasn’t one. What has been found in many of these cases is that upwards of 80% of the previous diagnoses were incorrect. Several examples were given from different disease populations where a gold standard test has emerged.

 

How could one of the best funded medical systems in the world possibly be misdiagnosing so many patients? The speculation was that it’s a combination of two factors: 1) doctors have VERY little information on the patient, often having never seen them before and, if they have met in the past, it is usually only for an hour or so a year, and 2) insufficient diagnostic information is available. Tests take time, cost money, and are sometimes misapplied (e.g. poor X-rays), and some medical issues lack affordable, highly reliable tests.

 

At first glance this incredible inaccuracy is shocking and hard to accept but, upon reflection, I have seen similar problems in my distant past as a professional auto mechanic. Misdiagnosis and incorrect parts replacement are common. Repeat, returning, and unsolved problems are not uncommon. Automobiles are complex systems, but much less complex than human beings, so it’s believable that medicine sees the same problems in a more exaggerated form. In the automotive world, expensive misdiagnoses are battled on two fronts. The first is through high quality data acquisition and diagnostic equipment to pinpoint the problem. The second approach is to move from a repair model to a parts replacement model. The size of the replaceable component is increasing, which both minimizes labor costs and reduces the likelihood of error (replacing large complex components as a whole normally succeeds at the cost of some wastage). This second technique doesn’t apply well to the medical world but the former does: collect massive amounts of information to improve diagnostic success rates.

 

I get three things out of this discussion: 1) taking an active role in the collection and management of your medical records is worth the investment, 2) it pays to be better informed (just as knowing a bit about a car helps you communicate symptoms to an auto mechanic, the same is true of medical issues), and 3) well-executed diagnostic tests are the most important part of any diagnosis. In fact, well-executed diagnostic tests can be more important than the skill and experience level of the diagnosing physician.

 

We’ve recently announced HealthVault (http://www.healthvault.com/), a site supporting 1) health related  content and search, 2) a central data storage system for health related information, and 3) connectivity to monitoring devices such as blood glucose readers, blood pressure monitors, heart rate monitors.  More information on HealthVault is at: HealthBlog.  This is early stage work but the combination of a central data repository and automated health information gathering has huge potential.  Technical, social, and legal issues must be overcome to realize the full potential of this service, but if we found a way to directly acquire diagnostic data from hospitals and clinics, this service could become truly amazing.

 

                                                --jrh

 

Friday, February 15, 2008 10:42:46 AM (Pacific Standard Time, UTC-08:00)
Services

If you run a big service and claim to have never had down time you either 1) have close to zero customers or 2) are lying. It’s almost that simple. 

 

There is considerable concern that Amazon’s AWS service was down for several hours:

 

·         http://www.roughtype.com/archives/2008/02/amazons_s3_util.php

·         http://gigaom.com/2008/02/15/amazon-s3-service-goes-down/

·         http://www.centernetworks.com/amazon-s3-down-error

 

Thanks to Jeff Currier and Soumitra Sengupta who told me about the downtime as it was happening last week. The service was reported to be down at 4:30AM. At 10:17, they reported it was resolved.  There are a couple of lessons in here but the first is that internal IT goes down, high scale services go down, client systems fail, networks stop operating, power failures happen, etc.  That’s just the way it is.  You can spend to reduce these factors and you can try to take complete control of the IT infrastructure to avoid them impacting you.  Ironically, in my experience, those that take over and run the entire infrastructure typically do it at lower scale with less experience and have downtime as well.  These small scale services end up costing much more and yet deliver very little additional uptime.  You read about commodity priced, high scale services when they go down. For example, RIM was down last week.  But, the good ones really don’t go down that frequently.  High scale, commodity infrastructure is actually pretty solid and compares very well to vertical, control-all-aspects-of-the-IT-infrastructure approaches.  Amazon AWS generally has earned a pretty good reliability record.

 

The second lesson is perhaps the hardest to learn and the most important: customers need information. If a service goes down – actually, I should say, when a service goes down – you need to tell customers what is happening and set expectations on service restoration right away.  There is a temptation to hide the facts because, well, downtime is embarrassing.  Hiding it simply doesn’t work. When people don’t know what is happening, they assume the worst and think you are trying to hide something or aren’t responding properly. Tell them what is happening, invest resources in keeping them up to date with progress, and tell them when you expect to be back up.

 

It’s hard, it’s embarrassing, but this one matters more than any other. Long after the downtime is forgotten, people will remember how you handled it. Transparency wins when it comes to service operation – customers who have decided to bet their jobs on your service need timely information for their customers. If you embarrass your customers, they remember forever. A little downtime is unfortunate and you need to be getting better all the time, but that’s forgivable. Just get them the information they need for their dependent businesses.

 

                                --jrh

 

Friday, February 15, 2008 12:15:07 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, February 13, 2008

Google has published an interesting study of Mobile Search trends (sent my way by Tren Griffin). In this study the authors looked at over 1M queries submitted to Google Mobile web search over the course of a one month period. They found that the average search query was 2.56 words. (This is surprisingly similar to the average desktop query at 2.6 and the average PDA query at 2.35). They predictably found a uniform relationship between query length in characters and the length of time it took to enter it. The average query took 44.8 seconds including network interactions. They estimate the overhead to be roughly 5 seconds, meaning the user is willing to spend nearly 40 seconds entering a query. This is amazingly high. It gives an idea of how valuable the query results are if users are willing to take that long to enter a query. The researchers found less query diversity in the mobile world than the desktop world. The mobile click-through rate on queries was over 50%.

 

I also found it interesting that users are entering queries faster this year than the comparative data from 2005. The average query time fell from 66.3 seconds to 44.8 seconds (including communications overhead).  The paper speculates this is a combination of improved keyboards and a population more comfortable with using devices.
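
As a quick check on those figures, the short calculation below works out the time spent actually entering a query and the year-over-year improvement. It uses only the numbers quoted above; the variable names are mine, and the 5 second overhead is the paper's rough estimate, so treat the results as approximate.

```python
# Figures quoted from the study; the overhead value is the paper's rough estimate.
total_2005_s = 66.3   # average query time in 2005, including communications overhead
total_2008_s = 44.8   # average query time in the current study, including overhead
overhead_s = 5.0      # approximate network overhead per query

# Time the user spends typing: total time minus network overhead.
entry_time_s = total_2008_s - overhead_s

# Relative drop in total query time between the two data sets.
reduction = (total_2005_s - total_2008_s) / total_2005_s

print(f"entry time: {entry_time_s:.1f} s")
print(f"reduction in total query time: {reduction:.1%}")
```

So the entry time comes out just under 40 seconds, and total query time dropped by roughly a third between the two studies.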

 

The full paper is available from: http://www.maryamkamvar.com/publications/KamvarBalujaComputerMagazine.pdf.

 

                                                --jrh

 

Wednesday, February 13, 2008 12:15:49 AM (Pacific Standard Time, UTC-08:00)
Services
 Sunday, February 10, 2008

I upgraded my Samsung SGH-i607 to Windows Mobile 6 earlier today.  I had held off upgrading until today having been told that Internet Connection Sharing doesn’t work on the AT&T Windows Mobile 6 build.  Actually, it’s even a bit of a hassle to make it work on Win Mobile 5 but it can be done on both WM5 and 6.  ICS isn’t actually removed from the WM6 build, it’s just not exposed in the user interface as was done with WM5 and, in WM6, there are security settings preventing it from operating.  So it is more work to enable it but not really all that much (details on the page referenced below).

 

Overall I’m happy with the upgrade. I’ve updated my Blackjack Hack, Tip, Techniques & Utilities page to include WM6 installation instructions, application unlocking instructions, a pointer to the Internet Connection Sharing enabling procedure, instructions on how to move email and IE temp files to the storage card, a higher information density home page, and a few utilities. If you have accumulated other interesting tricks, send them my way. For example, I’ve not yet managed to SIM unlock this one.

 

                                    --jrh

 

Sunday, February 10, 2008 12:16:47 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Thursday, February 07, 2008

I was down at Amazon last week speaking at their Internal Developers conference. It was a fun trip in that I got to catch up with a bunch of old friends – a great many of whom seem to be working on S3 these days.

 

I presented Designing and Deploying Internet Scale Services.  Essentially best practices on writing service-based applications. Additional detail can be found in the paper on which the talk was based: http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf.

 

                                --jrh

 

Thursday, February 07, 2008 12:17:46 AM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, February 05, 2008

A few months back I was in a debate about the value of shared code segments between virtual machines. In my view there is no question that shared code across VMs has some value, but code is small compared to data, so the impact will be visible but not fundamental. What follows is an inventory of a typical client-side system.

 

This experiment was done on an IBM T43 laptop with 1GB of memory running Vista RTM, desktop search, Foldershare (it rocks), and Outlook.  Outlook was in use prior to and during the measurement.  The system has been running for three days since the last boot.  The summary stats are:

 

Classification        Pages      MB          %
Kernel:               65824      257.125     25%
User:                 195913     765.2852    75%
Total:                261737     1022.41

Kernel Pages
Kernel Image:         7395       28.88672    11%
Kernel Pure Data:     58429      228.2383    89%
Kernel Total:         65824      257.125

User Pages
User Code:            32348      126.3594    17%
User Data:            163565     638.9258    83%
User Total:           195913     765.2852

 

Immediately after boot, 22% of the memory was code, which makes sense. As the O/S and apps come up, all constructors and initializers run. After being memory resident for a few days, only those pages currently in use stay loaded and the user code percentage fell to 17%. Ironically, code load time is an issue at start-up time but the actual percentage of code resident in memory over longer runs is fairly small. Vista Superfetch helps with the code load times but, from looking at this data, it’s clear that flash memory could make a huge difference to O/S boot and application load times.

 

The percentage of memory holding code pages is not that high so when going after memory bloat, look first to the data.
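
The summary numbers can be reproduced from the raw page counts with a few lines of Python. The sketch below assumes 4 KB pages, which is consistent with the page and megabyte columns in the measurements (65824 pages works out to 257.125 MB); the dictionary keys and function name are mine.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KB pages, consistent with the pages/MB columns above

# Page counts taken from the measurements above.
pages = {
    "kernel_image": 7395,
    "kernel_data": 58429,
    "user_code": 32348,
    "user_data": 163565,
}


def pages_to_mb(page_count: int) -> float:
    """Convert a page count to megabytes (1 MB = 1024 KB)."""
    return page_count * PAGE_SIZE_KB / 1024


total_pages = sum(pages.values())
code_pages = pages["kernel_image"] + pages["user_code"]
code_share = code_pages / total_pages  # fraction of resident memory holding code

print(f"total: {total_pages} pages = {pages_to_mb(total_pages):.2f} MB")
print(f"code:  {code_pages} pages = {pages_to_mb(code_pages):.2f} MB ({code_share:.1%})")
```

Counting the kernel image and user code together, code accounts for roughly 15% of resident memory on this system, which is why data is the first place to look.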

 

                                                --jrh

 

Tuesday, February 05, 2008 12:18:52 AM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, February 02, 2008

Yesterday, Intel and Micron announced a generational step forward in NAND Flash Write I/O performance.  From the Intel Press release:

 

The new high speed NAND can reach speeds up to 200 megabytes per second (MB/s) for reading data and 100 MB/s for writing data, achieved by leveraging the new ONFI 2.0 specification and a four-plane architecture with higher clock speeds. In comparison, conventional single level cell NAND is limited to 40 MB/s for reading data and less than 20 MB/s for writing data.

 

They don’t actually say it’s an SLC device but they compare it to SLC and it has the typical wear characteristics of SLC (100,000 cycles).  More data from the Micron web site:

 

 

Features                                                Benefits
Density               8Gb–16Gb                          Industry-standard densities
Performance           200 MB/s Sustained READ           Delivers the fastest read and write throughputs
                      100 MB/s Sustained WRITE          ever for a NAND Flash device
                      1.5ms (TYP) Erase Performance
Endurance (cycles)    100,000                           High-endurance enables applications that require intensive
                                                        program and erase operation while prolonging memory life
Interface             Async/Sync                        Standard interface enables a high degree of interoperability
                      ONFI 1.0/2.0
Temperature Range     −25˚C to +85˚C                    Wide temperature range is ideal for rugged environments
Configuration         1.8V, x8                          Industry-standard configuration enables easy system design
Package               100-ball BGA                      Industry-standard packaging enables easier density migration

 

Expect shipments in the latter half of 2008. We should start seeing interesting applications of this technology in SSDs and other devices this year.

 

Intel Press Release: http://www.intel.com/pressroom/archive/releases/20080201corp.htm

More data from Micron: http://www.micron.com/products/nand/high_speed/index

 

                                                --jrh

 

Saturday, February 02, 2008 12:19:37 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Friday, February 01, 2008

I saw a video earlier today titled “Great Ideas are a Dime a Dozen” and I just loved it. Unfortunately it’s a Microsoft internal-only video so I can’t post it here, but I can point to some related talks and videos. The speaker was Bill Buxton of Microsoft Research.

 

I fell in love with this talk for a variety of reasons: 1) I love and agree with the principle that ideas are cheap but it’s the communicating of the ideas and making them real that is truly hard and where the greatest talent is required. 2) He argues that you need to get a user experience running quickly and you need to keep it evolving quickly. You need a lightweight experimentation platform to take the user experience from good to great. I’ve long believed that the difference between the iPhone and some other designs is not being satisfied when it’s “done” and, rather than triaging to ship, just keep polishing. Get it running, then get it better. Then throw it out and try again. Change it some more. Get it 100% functionally correct and as good as you can possibly get it. Then keep polishing. Polish and refine further. And 3) he points out that we never have time to properly invest in design at the beginning when the team is small. Yet, we DO have time to be months or even years late, partly as a consequence of not doing the design up front. Late projects are when the team is fully staffed and at its biggest and most expensive. Neither he nor I is arguing for waterfall design. What Bill is arguing for is human centric design up front. Ray Ozzie calls this experience-first design. Invest in really getting the experience fully understood with super lightweight development methods. If you REALLY understand the user experience and it’s really right, developing the product may be the easiest and perhaps most predictable part of the process. I’ve seen large software teams working on ill-defined and only barely designed products more than once. As an industry, we need to take some of Bill’s advice.

 

Bill’s talks and videos are posted at: http://www.billbuxton.com.  The closest external example of the video I’m describing above is perhaps: What if Leopold Didn't Have a Piano.  Recommended whether you are a designer or a developer.

 

                                                --jrh

 

Friday, February 01, 2008 12:24:56 AM (Pacific Standard Time, UTC-08:00)
Software
 Thursday, January 31, 2008

Earlier today, Microsoft held an internal tribute to Jim Gray, celebrating his contributions to the industry and to each of us personally. It’s been just over a year since Jim last sailed out of San Francisco Harbor in Tenacious. He’s been missing since.

 

Speakers at the tribute to Jim included Rich Rashid, Butler Lampson, Peter Spiro, Tony Hey, David Vaskevitch, Bill Gates, Gordon Bell and myself. Some touched upon Jim’s broad technical contributions across many sub-fields, while others recounted how Jim has influenced or guided them personally through the years.  I focused on the latter, and described Jim’s contribution as a mentor, a connector of people, and a bridger of fields (included below).

 

                                                --jrh

 

 

What Would Jim Do?

 

Jim has many skills, but where he has most influenced and impressed me is as a mentor, a connector of people, and a bridger of fields. He became a mentor to me, whether he knew it or not, more than 13 years ago, before either of us had joined Microsoft. Jim is a phenomenal mentor. He invests deeply in understanding the problem you are trying to solve and always has time for deep discussion and debate. Later, I discovered that Jim was an uncanny connector. He knows everyone, and they all want to show him the details of what they are doing. He sees a vast amount of work and forwards the best broadly. He is a nexus for interesting papers, for valuable results, and for new discoveries across many fields. Over time I learned that one of his unique abilities is a bridger of fields. He can take great work in one field and show how it can be applied in others. He knows that many of the world’s most useful discoveries have been made in the gap between fields, and that some of the most important work has been the application of the technology from one field to the problems of another.

 

Back in 1994, Pat Selinger decided that I needed to meet Jim Gray, and we went to visit him in San Francisco. Pat and I spent an afternoon chatting with Jim about database performance benchmarks, what we were doing with DB2, compiler design, RISC System 6000 and hardware architecture in general. The discussion was typical for Jim. He’s deeply interested in every technical field from aircraft engine design through heart transplants. His breadth is amazing, and the conversation ranged far and wide. It seemed he just about always knew someone working in any field that came up.

 

A few months later, Bruce Lindsay and I went to visit Jim while he was teaching at Berkeley. Jim and I didn’t get much of a chance to chat during the course of the day—things were pretty hectic around his office at Berkeley—but he and I drove back into San Francisco together. As we drove into the sunset over the city, approaching the Bay Bridge, Jim talked about his experience at Digital Equipment Corporation. He believed a hardware company could sell software, but would never be able to really make software the complete focus it needed to be. He talked of DEC’s demise and said, “They were bound and determined to fail as a hardware company rather than excel as a software company.”

 

The sunset, the city, and the Bay Bridge were stretched across the windscreen. It was startlingly beautiful. Instead of making conversation with Jim, I was mostly just listening, reflecting and contemplating. At the time, I was the lead architect on IBM DB2. And yes, I too worked for a hardware company. Everything Jim was relating of his DEC experience sounded eerily familiar to me. It was as though Jim was summarizing my own experiences rather than his. I hadn’t really thought this deeply about it before, but the more I did, the more I knew he was right. This was the beginnings of me thinking that probably I should be working at a software company.

 

He didn’t say it at the time, and, knowing Jim much better now, I’m not sure he would have even thought it, but the discussion left me thinking that I needed to aim higher. I needed to know more about all aspects of the database world, more about technology in general, and to think more about how it all fit together. Having some time to chat deeply with Jim changed how I looked at my job and where my priorities were. I left the conversation pondering responsibility and the industry, and believing I needed to do more, or at least to broaden the scope of my thinking.

 

I met Jim again later that year at the High Performance Transaction Systems workshop. During the conference, Jim came over, sat down beside me, and said “How are you doing James Hamilton?” This is signature Jim. I’ll bet nearly everyone he knows has had one of those visits during the course of a conference. He drops by, sits down, matches eyes, and you have 110% of his attention for the next 15 to 20 minutes. Jim’s style is not to correct or redirect. Yet, after each conversation, I’ve typically decided to do something differently. It just somehow becomes clear and obviously the right thing to do by the end of the discussion.

 

In 2006 I got a note from Jim with the subject “Mentor—I need to say I’m helping someone so…”  While it was an honor to officially be Jim’s mentee, I didn’t really expect this to change our relationship much.  And, of course, I was wrong. Jim approaches formal mentorship with his typical thoroughness and, in this role, he believes he has signed up to review and assist with absolutely everything you are involved with, even if not work-related. For example, last year I had two articles published in boating magazines and Jim insisted on reviewing them both. His comments included the usual detailed insights we are all used to getting from him, and the articles were much better for it. How does he find the time?

 

For years, I’ve read every paper Jim sent my way. Jim has become my quality filter in that, as much as I try, I can’t cast my net nearly as wide nor get through close to as much as I should. Like him, I’m interested in just about all aspects of technology but, unlike him, I actually do need to sleep. I can’t possibly keep up. There are hundreds of engineers who receive papers and results from him on a regular basis. Great research is more broadly seen as a result of his culling and forwarding. Many of us read more than we would have otherwise, and are exposed to ideas we wouldn’t normally have seen so early or, in some cases, wouldn’t have seen at all.

 

Jim’s magic as a mentor, connector and bridger is his scaling. The stories above can be repeated by hundreds of people, each of whom feels as though they had Jim’s complete and undivided attention. To contribute deeply to others at this level is time-consuming, to do it while still getting work done personally is even harder, and to do it for all callers is simply unexplainable. Anyone can talk to Jim, and an astonishing number frequently do. And because his review comments are so good, and he’s so widely respected, a mammoth amount is sent his way. He receives early papers and important new results across a breadth of fields from computer architecture, operating system design, networking, database, transaction processing, astronomy, and particle physics. The most interesting work he comes across is forwarded widely. He ignores company bounds, international bounds, bounds of seniority, and simply routes people and useful data together. Jim effectively is a routing nexus where new ideas and really interesting results are distributed more broadly.

 

Over the past year I’ve received no papers from Jim. There has been no advice. I’ve not had anything reviewed by him. And, I’ve not been able to talk to him about the projects I’m working on. When I attend conferences, there have been no surprise visits from Jim. Instead, I’ve been operating personally on the basis of “What would Jim do?”


Each of us has opportunities to be mentors, connectors and bridgers. These are our chances to help Jim scale even further. Each of these opportunities is a chance to pass on some of the gift Jim has given us over the years. When you are asked for a review or to help with a project, just answer “GREAT!!!” as he has so many times, and the magic will keep spreading.

 

This year, when I face tough questions, interesting issues, or great opportunities, I just ask, “What would Jim do?”  And then I dive in with gusto.

 

James Hamilton, 2006-01-16.

 

Thursday, January 31, 2008 12:26:09 AM (Pacific Standard Time, UTC-08:00)
Ramblings
 Wednesday, January 30, 2008

Founders at Work (http://www.amazon.com/Founders-Work-Stories-Startups-Early/dp/1590597141) is a series of 32 interviews with founders of well-known startups. Some have become very s