Wednesday, February 20, 2008

This isn’t directly related to high scale services or saving power in the data center but it’s a great video.  Bill Gates Last Days (6:54) from CES.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, February 20, 2008 12:10:52 AM (Pacific Standard Time, UTC-08:00)
Ramblings

Yesterday, Data Center Knowledge reported that Sun was working on a Cloud Platform to compete with Amazon AWS: Project Caroline.  The data behind the report comes from an upcoming JavaOne 2008 presentation by Sun Distinguished Engineer Bob Scheifler.  The talk announcement and synopsis are posted at: http://research.sun.com/projects/caroline/ and, even better, the slides are already up at: http://developers.sun.com/learning/javaoneonline/2007/pdf/TS-1991.pdf.

 

The full functionality supported by Caroline actually goes beyond that offered by Amazon AWS. Included are:

·         Virtualization of key resources such as network and compute, with a horizontally scaled pool for each

o   Programmatic control of resource allocation, increasing or decreasing without human interaction

·         Java VMs (rather than offer fully general Virtual Machines as Amazon does with EC2, the Java APIs are offered as the only programming abstraction)

·         Identity provider

·         Eclipse based dev tools

·         ZFS file system with storage reservation, access controls, snapshots (with rollback) and quotas

·         Database (PostgreSQL)

·         Networking: VLAN control, VPN support, dynamic NAT, L4 and L7 load balancing, DNS config

 

Overall it looks pretty interesting – the concepts all look good. The true test, and the measure of whether this will actually be an AWS competitor, won’t come until the service is made available at scale.  Nonetheless, it’s great seeing more pay-as-you-go service offerings becoming available.

 

                        --jrh

 


Wednesday, February 20, 2008 12:09:57 AM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, February 19, 2008

I’ve got nothing against for-fee software – that’s what has paid the bills around our home for more than 20 years. Nonetheless, when it comes to education, it’s hard not to love free.  Yesterday Microsoft announced a great program.  Universities and high schools can now make use of Microsoft professional development tools for games, cell phones, and enterprise applications for free. I think this is a wonderful program.

 

The press release, including a Bill Gates interview: http://www.microsoft.com/presspass/features/2008/feb08/02-18DreamSpark.mspx?rss_fdn=Top%20Stories

 

Two other related articles:

·         TechCrunch: http://www.techcrunch.com/2008/02/18/microsoft-to-give-students-dev-software-for-free/

·         Merc: http://www.mercurynews.com/ci_8302312?source=rss_emailed

 

--jrh

 


Tuesday, February 19, 2008 12:12:59 AM (Pacific Standard Time, UTC-08:00)
Ramblings
 Sunday, February 17, 2008

Yet another argument in favor of Degraded Operations Mode (http://mvdirona.com/jrh/perspectives/2008/01/22/DegradedOperationsMode.aspx) emerged last week.  All of Amazon AWS (S3, SimpleDB, Simple Queue Service, EC2, etc.) was down for several hours: http://mvdirona.com/jrh/perspectives/2008/02/15/DowntimeAmazonS3SimpleDBSQS.aspx. The outage was reportedly due to an authentication storm: http://www.highscalability.com/s3-failed-because-authentication-overload (Mike Neil sent this my way).

 

Remember, you’ll never have the capacity for the biggest load inrush and, no matter how hard you try, your capacity planning will continue to be only slightly better than the weather report for next week. When you don’t know what’s coming, design systems to operate through adversity: Degraded Operations Mode.
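
To make the idea of a Degraded Operations Mode a little more concrete, here is a minimal Python sketch of one way it might be approached: shed less important work as load approaches capacity rather than failing everything at once. The class names, the three priority levels, and the 80% and 100% thresholds are all assumptions chosen for illustration; they do not describe how AWS or any other real service implements this.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    """Hypothetical request classes, most important first."""
    CRITICAL = 0   # work the service must keep doing under any load
    NORMAL = 1     # ordinary requests
    OPTIONAL = 2   # nice-to-have work such as batch reporting


@dataclass
class AdmissionController:
    """Decides whether to accept a request given the current load."""
    capacity: int                  # requests per interval the service can handle
    degrade_at: float = 0.8        # assumed threshold: start shedding OPTIONAL work
    critical_only_at: float = 1.0  # assumed threshold: only CRITICAL work runs

    def admit(self, priority: Priority, current_load: int) -> bool:
        utilization = current_load / self.capacity
        if utilization >= self.critical_only_at:
            # At or over capacity: keep only the most important work alive.
            return priority is Priority.CRITICAL
        if utilization >= self.degrade_at:
            # Approaching capacity: drop optional work first.
            return priority is not Priority.OPTIONAL
        # Normal operation: accept everything.
        return True


controller = AdmissionController(capacity=1000)
print(controller.admit(Priority.OPTIONAL, 500))    # normal operation
print(controller.admit(Priority.OPTIONAL, 850))    # degraded: optional work shed
print(controller.admit(Priority.CRITICAL, 1200))   # overloaded: critical work only
```

The point of the sketch is only that the service decides in advance what to give up under overload, so an unexpected inrush degrades it rather than taking it down.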

 

                                    --jrh

 


Sunday, February 17, 2008 12:14:11 AM (Pacific Standard Time, UTC-08:00)
Services
 Friday, February 15, 2008

I was recently in a meeting with several physicians. One of them reported a result I have always suspected, but at a magnitude I never would have guessed.  The core observation was that 80% of medical diagnoses were incorrect. The other doctors in the room confirmed this number to be roughly consistent with their experience.  Less anecdotal support for this estimated high error rate is found in those cases where a “gold standard” diagnostic test is discovered where there previously wasn’t one.  What has been found in many of these cases is that upwards of 80% of the previous diagnoses were incorrect. Several examples were given from different disease populations where a gold standard test has emerged.

 

How could one of the best funded medical systems in the world possibly be misdiagnosing so many patients?  The speculation was that it’s a combination of two factors: 1) doctors have VERY little information on the patient, often having never seen them before and, if they have met in the past, it is usually only for an hour or so a year, and 2) insufficient diagnostic information is available. Tests take time, cost money, and are sometimes misapplied (e.g. poor X-rays), and some medical issues lack affordable and highly reliable tests.

 

At first glance this incredible inaccuracy is shocking and hard to accept but, upon reflection, I have seen similar problems in my distant past as a professional auto mechanic.  Misdiagnosis and incorrect parts replacement are common. Repeat, returning, and unsolved problems are not uncommon.  Automobiles are complex systems, but much less complex than human beings, so it’s believable that medicine sees the same problems in a more exaggerated form.  In the automotive world, expensive misdiagnoses are battled on two fronts.  The first is high quality data acquisition and diagnostic equipment to pinpoint the problem.  The second is to move from a repair model to a parts replacement model. The size of the replaceable component is increasing, which both minimizes labor costs and reduces the likelihood of error (replacing large complex components as a whole normally succeeds at the cost of some wastage). This second technique doesn’t apply well to the medical world but the first does: collect massive amounts of information to improve diagnostic success rates.

 

I get three things out of this discussion: 1) taking an active role in the collection and management of your medical records is worth the investment, 2) it pays to be better informed (just as knowing a bit about a car helps you communicate symptoms to an auto mechanic, the same is true with medical issues), and 3) well-executed diagnostic tests are the most important part of any diagnosis.  In fact, well-executed diagnostic tests can be more important than the skill and experience level of the diagnosing physician.

 

We’ve recently announced HealthVault (http://www.healthvault.com/), a site supporting 1) health related content and search, 2) a central data storage system for health related information, and 3) connectivity to monitoring devices such as blood glucose readers, blood pressure monitors, and heart rate monitors.  More information on HealthVault is at: HealthBlog.  This is early stage work but the combination of a central data repository and automated health information gathering has huge potential.  Technical, social, and legal issues must be overcome to realize the full potential of this service, but if we found a way to directly acquire diagnostic data from hospitals and clinics, this service could become truly amazing.

 

                                                --jrh

 


Friday, February 15, 2008 10:42:46 AM (Pacific Standard Time, UTC-08:00)
Services

If you run a big service and claim to have never had downtime, you either 1) have close to zero customers or 2) are lying. It’s almost that simple.

 

There is considerable concern that Amazon’s AWS service was down for several hours:

 

·         http://www.roughtype.com/archives/2008/02/amazons_s3_util.php

·         http://gigaom.com/2008/02/15/amazon-s3-service-goes-down/

·         http://www.centernetworks.com/amazon-s3-down-error

 

Thanks to Jeff Currier and Soumitra Sengupta who told me about the downtime as it was happening last week. The service was reported to be down at 4:30AM. At 10:17, they reported it was resolved.  There are a couple of lessons in here but the first is that internal IT goes down, high scale services go down, client systems fail, networks stop operating, power failures happen, etc.  That’s just the way it is.  You can spend to reduce these factors and you can try to take complete control of the IT infrastructure to avoid them impacting you.  Ironically, in my experience, those that take over and run the entire infrastructure typically do it at lower scale with less experience and have downtime as well.  These small scale services end up costing much more and yet deliver very little additional uptime.  You read about commodity priced, high scale services when they go down. For example, RIM was down last week.  But, the good ones really don’t go down that frequently.  High scale, commodity infrastructure is actually pretty solid and compares very well to vertical, control-all-aspects-of-the-IT-infrastructure approaches.  Amazon AWS generally has earned a pretty good reliability record.

 

The second lesson is perhaps the hardest to learn and the most important: customers need information. If a service goes down – actually, I should say, when a service goes down – you need to tell customers what is happening and set expectations on service restoration right away.  There is a temptation to hide the facts because, well, downtime is embarrassing.  Hiding it simply doesn’t work. When people don’t know what is happening, they assume the worst and think you are trying to hide something or aren’t responding properly. Tell them what is happening, invest resources in keeping them up to date with progress, and tell them when you expect to be back up.

 

It’s hard, it’s embarrassing, but this one matters more than any other.  Long after the downtime is forgotten, people will remember how you handled it. Transparency wins when it comes to service operation – customers who have decided to bet their jobs on your service need timely information for their customers.  If you embarrass your customers, they remember forever.  A little downtime is unfortunate and you need to be getting better all the time, but that’s forgivable.  Just get them the information they need for their dependent businesses.

 

                                --jrh

 


Friday, February 15, 2008 12:15:07 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, February 13, 2008

Google has published an interesting study of mobile search trends (sent my way by Tren Griffin).  In this study the authors looked at over 1M queries submitted to Google Mobile web search over the course of a one-month period.  They found that the average search query was 2.56 words.  (This is surprisingly similar to the average desktop query at 2.6 and the average PDA query at 2.35.)  They predictably found a uniform relationship between query length in characters and the length of time it took to enter it. The average query took 44.8 seconds including network interactions.  They estimate the overhead to be roughly 5 seconds, meaning the user is willing to spend nearly 40 seconds entering a query. This is amazingly high.  It gives an idea of how valuable the query results are if users are willing to take that long to enter a query.  The researchers found less query diversity in the mobile world than the desktop world. The mobile click-through rate on queries was over 50%.

 

I also found it interesting that users are entering queries faster this year than the comparative data from 2005. The average query time fell from 66.3 seconds to 44.8 seconds (including communications overhead).  The paper speculates this is a combination of improved keyboards and a population more comfortable with using devices.
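
As a quick check on the numbers quoted above, the following Python sketch works through the arithmetic. The figures come from the post; the function names are made up for illustration, and the 5-second overhead is the study's rough estimate, so the results should be read as approximate.

```python
TOTAL_2005_S = 66.3   # average query time in the 2005 data, including overhead
TOTAL_NOW_S = 44.8    # average query time in the current study, including overhead
OVERHEAD_S = 5.0      # rough network overhead estimated by the authors


def entry_time(total_s: float, overhead_s: float = OVERHEAD_S) -> float:
    """Time the user spends actually entering the query."""
    return total_s - overhead_s


def relative_drop(old: float, new: float) -> float:
    """Fractional decrease from old to new."""
    return (old - new) / old


print(f"entry time now: {entry_time(TOTAL_NOW_S):.1f} s")
print(f"drop in total query time since 2005: {relative_drop(TOTAL_2005_S, TOTAL_NOW_S):.1%}")
```

That works out to just under 40 seconds of typing per query, and roughly a one-third reduction in total query time since 2005.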

 

The full paper is available from: http://www.maryamkamvar.com/publications/KamvarBalujaComputerMagazine.pdf.

 

                                                --jrh

 


Wednesday, February 13, 2008 12:15:49 AM (Pacific Standard Time, UTC-08:00)
Services
 Sunday, February 10, 2008

I upgraded my Samsung SGH-i607 to Windows Mobile 6 earlier today.  I had held off upgrading until today having been told that Internet Connection Sharing doesn’t work on the AT&T Windows Mobile 6 build.  Actually, it’s even a bit of a hassle to make it work on Win Mobile 5 but it can be done on both WM5 and 6.  ICS isn’t actually removed from the WM6 build, it’s just not exposed in the user interface as was done with WM5 and, in WM6, there are security settings preventing it from operating.  So it is more work to enable it but not really all that much (details on the page referenced below).

 

Overall I’m happy with the upgrade.  I’ve added to my Blackjack Hack, Tip, Techniques & Utilities page to include WM6 installation instructions, application unlocking instructions, a pointer to the Internet Connection Sharing enabling procedure, instructions on how to move email and IE temp files to the storage card, a higher information density home page, and a few utilities.  If you have accumulated other interesting tricks, send them my way.  For example, I’ve not yet managed to SIM unlock this one.

 

                                    --jrh

 


Sunday, February 10, 2008 12:16:47 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Thursday, February 07, 2008

I was down at Amazon last week speaking at their internal developers conference.  It was a fun trip in that I got to catch up with a bunch of old friends – a great many of whom seem to be working on S3 these days.

 

I presented Designing and Deploying Internet Scale Services, which is essentially a set of best practices for writing service-based applications. Additional detail can be found in the paper on which the talk was based: http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf.

 

                                --jrh

 


Thursday, February 07, 2008 12:17:46 AM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, February 05, 2008

A few months back I was in a debate about the value of shared code segments between virtual machines. In my view there is no question that shared code across VMs has some value, but code is small compared to data so the impact will be visible but not fundamental. What follows is an inventory of a typical client-side system.

 

This experiment was done on an IBM T43 laptop with 1GB of memory running Vista RTM, desktop search, Foldershare (it rocks), and Outlook.  Outlook was in use prior to and during the measurement.  The system has been running for three days since the last boot.  The summary stats are:

 

Classification | Pages | MB | %
Kernel: | 65824 | 257.125 | 25%
User: | 195913 | 765.2852 | 75%
Total: | 261737 | 1022.41 |

Kernel Pages | Pages | MB | %
Kernel Image: | 7395 | 28.88672 | 11%
Kernel Pure Data: | 58429 | 228.2383 | 89%
Kernel Total: | 65824 | 257.125 |

User Pages | Pages | MB | %
User Code: | 32348 | 126.3594 | 17%
User Data: | 163565 | 638.9258 | 83%
User Total: | 195913 | 765.2852 |

 

Immediately after boot, 22% of the memory was code, which makes sense.  As the O/S and apps come up, all constructors and initializers run.  After being memory resident for a few days, only those pages currently in use stay loaded and the user code percentage fell to 17%.  Ironically, code load time is an issue at start-up time but the actual percentage of code resident in memory over longer runs is fairly small.  Vista Superfetch helps with the code load times but, from looking at this data, it’s clear that flash memory could make a huge difference to O/S boot and application load times.

 

The percentage of memory holding code pages is not that high so when going after memory bloat, look first to the data.
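
The summary figures can be reproduced from the raw page counts with a few lines of Python. This is only a sketch: the 4 KB page size is an assumption (it is consistent with 65824 pages coming to 257.125 MB), and counting the kernel image entirely as code is a simplification made here for illustration.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KB pages, which matches the MB column in the table

# Page counts taken from the inventory above.
pages = {
    "kernel_image": 7395,
    "kernel_data": 58429,
    "user_code": 32348,
    "user_data": 163565,
}


def to_mb(page_count: int) -> float:
    """Convert a page count to megabytes."""
    return page_count * PAGE_SIZE_KB / 1024


total_pages = sum(pages.values())

# Simplification: treat the whole kernel image as code.
code_pages = pages["kernel_image"] + pages["user_code"]
code_share = code_pages / total_pages
user_code_share = pages["user_code"] / (pages["user_code"] + pages["user_data"])

print(f"total memory: {to_mb(total_pages):.2f} MB")
print(f"user code share of user pages: {user_code_share:.1%}")
print(f"code share of all pages: {code_share:.1%}")
```

Under these assumptions, code accounts for roughly 15% of all resident pages, which is why the data side is the first place to look.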

 

                                                --jrh

 


Tuesday, February 05, 2008 12:18:52 AM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, February 02, 2008

Yesterday, Intel and Micron announced a generational step forward in NAND Flash Write I/O performance.  From the Intel Press release:

 

The new high speed NAND can reach speeds up to 200 megabytes per second (MB/s) for reading data and 100 MB/s for writing data, achieved by leveraging the new ONFI 2.0 specification and a four-plane architecture with higher clock speeds. In comparison, conventional single level cell NAND is limited to 40 MB/s for reading data and less than 20 MB/s for writing data.

 

They don’t actually say it’s an SLC device but they compare it to SLC and it has the typical wear characteristics of SLC (100,000 cycles).  More data from the Micron web site:

 

 

Parameter | Features | Benefits
Density | 8Gb–16Gb | Industry-standard densities
Performance | 200 MB/s Sustained READ; 100 MB/s Sustained WRITE; 1.5ms (TYP) Erase Performance | Delivers the fastest read and write throughputs ever for a NAND Flash device
Endurance (cycles) | 100,000 | High-endurance enables applications that require intensive program and erase operation while prolonging memory life
Interface | Async/Sync; ONFI 1.0/2.0 | Standard interface enables a high degree of interoperability
Temperature Range | −25˚C to +85˚C | Wide temperature range is ideal for rugged environments
Configuration | 1.8V, x8 | Industry-standard configuration enables easy system design
Package | 100-ball BGA | Industry-standard packaging enables easier density migration

 

Expect shipments in the latter half of 2008. We should start seeing interesting applications of this technology in SSDs and other devices this year.
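
To put the quoted throughput figures in perspective, here is a small back-of-the-envelope calculation in Python. The function names are invented for this sketch, decimal units (1 Gb = 1000 Mb) are an assumption, and the transfer times ignore erase time and controller overhead, so treat them as best-case estimates rather than measured results.

```python
def speedup(new_mb_s: float, old_mb_s: float) -> float:
    """Ratio of the new throughput to the old throughput."""
    return new_mb_s / old_mb_s


def seconds_to_transfer(capacity_gbit: float, rate_mb_s: float) -> float:
    """Best-case time to stream a whole device at a sustained rate.

    Assumes decimal units: 1 Gb = 1000 Mb, and 8 bits per byte.
    """
    capacity_mb = capacity_gbit * 1000 / 8
    return capacity_mb / rate_mb_s


# Figures from the press release: 200/100 MB/s versus 40/20 MB/s for conventional SLC.
print(f"read speedup: {speedup(200, 40):.0f}x")
# Conventional SLC writes at "less than 20 MB/s", so this is a lower bound.
print(f"write speedup: at least {speedup(100, 20):.0f}x")
print(f"read a 16Gb device: {seconds_to_transfer(16, 200):.0f} s")
print(f"write a 16Gb device: {seconds_to_transfer(16, 100):.0f} s")
```

So both reads and writes improve by about a factor of five over the conventional SLC figures quoted in the press release.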

 

Intel Press Release: http://www.intel.com/pressroom/archive/releases/20080201corp.htm

More data from Micron: http://www.micron.com/products/nand/high_speed/index

 

                                                --jrh

 


Saturday, February 02, 2008 12:19:37 AM (Pacific Standard Time, UTC-08:00)
Hardware