Wednesday, February 20, 2008

This isn’t directly related to high scale services or saving power in the data center but it’s a great video: Bill Gates’ Last Days (6:54) from CES.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859

Wednesday, February 20, 2008 12:10:52 AM (Pacific Standard Time, UTC-08:00)

Yesterday, Data Center Knowledge reported that Sun was working on a cloud platform to compete with Amazon AWS: Project Caroline.  The data behind the report comes from an upcoming JavaOne 2008 presentation by Sun Distinguished Engineer, Bob Scheifler.  The talk announcement and synopsis are posted online and, even better, the slides are already up.


The full functionality supported by Caroline actually goes beyond what Amazon AWS offers. Included are:

·         Virtualizes key resources such as network and compute, and provides a horizontally scaled pool for each

o   Programmatic control of resource allocation, increasing or decreasing without human interaction

·         Java VMs (rather than offering fully general virtual machines as Amazon does with EC2, the Java APIs are the only programming abstraction offered)

·         Identity provider

·         Eclipse-based dev tools

·         ZFS file system with storage reservation, access controls, snapshots (with rollback) and quotas

·         Database (PostgreSQL)

·         Networking: VLAN control, VPN support, dynamic NAT, L4 and L7 load balancing, DNS config
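To make the first two bullets concrete, here’s a minimal sketch of what Caroline-style programmatic allocation could look like. All names and the scaling policy here are invented for illustration; the real Project Caroline API is Java-based and differs:

```python
class ResourcePool:
    """Hypothetical sketch of Caroline-style programmatic allocation:
    the application grows or shrinks its own pool of hosted Java VMs
    under program control, with no operator in the loop. All names and
    the policy here are invented, not the real Project Caroline API."""

    def __init__(self):
        self.replicas = 0

    def set_replicas(self, n):
        while self.replicas < n:
            self.replicas += 1    # would start another hosted JVM
        while self.replicas > n:
            self.replicas -= 1    # would retire one

    def autoscale(self, queue_depth, per_replica=100):
        # simple policy: one replica per 100 queued requests, minimum 1
        self.set_replicas(max(1, -(-queue_depth // per_replica)))
```

With this toy policy, `pool.autoscale(250)` leaves three replicas running and a later `pool.autoscale(40)` shrinks the pool back to one, all without a human filing a ticket.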


Overall it looks pretty interesting – the concepts all look good. The true test, and the measure of whether this will actually be an AWS competitor, won’t come until the service is made available at scale.  Nonetheless, it’s great to see more pay-as-you-go service offerings becoming available.





Wednesday, February 20, 2008 12:09:57 AM (Pacific Standard Time, UTC-08:00)
 Tuesday, February 19, 2008

I’ve got nothing against for-fee software – that’s what has paid the bills around our home for more than 20 years. Nonetheless, when it comes to education, it’s hard not to love free.  Yesterday Microsoft announced a great program.  Universities and high schools can now make use of Microsoft professional development tools for games, cell phones, and enterprise applications for free. I think this is a wonderful program.


The press release, including a Bill Gates interview:


Two other related articles:

·         TechCrunch:

·         Merc:





Tuesday, February 19, 2008 12:12:59 AM (Pacific Standard Time, UTC-08:00)
 Sunday, February 17, 2008

Yet another argument in favor of Degraded Operations Mode emerged last week: all of Amazon AWS (S3, SimpleDB, Simple Queuing Service, EC2, etc.) was down for several hours. The outage was reportedly due to an authentication storm (Mike Neil sent this my way).


Remember, you’ll never have the capacity for the biggest load inrush and, no matter how hard you try, your capacity planning will continue to be only slightly better than the weather report for next week. When you don’t know what’s coming, design systems to operate through adversity: Degraded Operations Mode.
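A minimal sketch of the idea, with invented thresholds and mode names: as load rises, shed optional work first so the core request path keeps answering rather than every request slowing down and timing out.

```python
class DegradedMode:
    """Toy degraded-operations switch. As concurrent load rises past
    (invented) thresholds, optional features are shed first, then the
    service falls back to the cheapest possible responses, instead of
    letting every request slow down and eventually time out."""

    def __init__(self, soft_limit, hard_limit):
        self.soft_limit = soft_limit   # above this, drop optional features
        self.hard_limit = hard_limit   # above this, serve cached/static only
        self.in_flight = 0             # current concurrent requests

    def plan(self):
        if self.in_flight >= self.hard_limit:
            return "static-only"       # cheapest possible response
        if self.in_flight >= self.soft_limit:
            return "core-only"         # skip recommendations, analytics, etc.
        return "full"
```

With limits of (100, 500), 50 in-flight requests get full service, 250 get core-only, and 600 get static-only; the service stays up through the whole range.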





Sunday, February 17, 2008 12:14:11 AM (Pacific Standard Time, UTC-08:00)
 Friday, February 15, 2008

I recently was in a meeting with several physicians. One of them reported a result I have always suspected, but at a magnitude I never would have guessed.  The core observation was that 80% of medical diagnoses were incorrect. The other doctors in the room confirmed this number to be roughly consistent with their experience.  Less anecdotal support for this estimated high error rate is found in those cases where a “gold standard” diagnostic test is discovered where there previously wasn’t one.  What has been found in many of these cases is that upwards of 80% of the previous diagnoses were incorrect. Several examples were given from different disease populations where a gold standard test has emerged.


How could one of the best funded medical systems in the world possibly be misdiagnosing so many patients?  The speculation was that it’s a combination of two factors: 1) doctors have VERY little information on the patient, often having never seen them before and, if they have met in the past, it is usually only for an hour or so a year, and 2) insufficient diagnostic information is available. Tests take time, cost money, sometimes are misapplied (e.g. poor X-rays), and some medical issues lack affordable and highly reliable tests.


At first glance this incredible inaccuracy is shocking and hard to accept but, upon reflection, I have seen similar problems in my distant past as a professional auto mechanic.  Misdiagnosis and incorrect parts replacement are common. Repeat, returning, and unsolved problems are not uncommon.  Automobiles are complex systems, but much less complex than human beings, so it’s believable that medicine sees the same problems in a more exaggerated form.  In the automotive world, expensive misdiagnoses are battled on two fronts.  The first is through high quality data acquisition and diagnostic equipment to pinpoint the problem.  The second approach is to move from a repair model to a parts replacement model. The size of the replaceable component is increasing, which both minimizes labor costs and reduces the likelihood of error (replacing large complex components as a whole normally succeeds at the cost of some wastage). This second technique doesn’t apply well to the medical world but the former does: collect massive amounts of information to improve diagnostic success rates.


I get three things out of this discussion: 1) taking an active role in the collection and management of your medical records is worth the investment, 2) being better informed helps (just as knowing a bit about a car can help you communicate symptoms to an auto mechanic, the same is true with medical issues), and 3) well-executed diagnostic tests are the most important part of any diagnosis.  In fact, well-executed diagnostic tests can be more important than the skill and experience level of the diagnosing physician.


We’ve recently announced HealthVault, a site supporting 1) health-related content and search, 2) a central data storage system for health-related information, and 3) connectivity to monitoring devices such as blood glucose readers, blood pressure monitors, and heart rate monitors.  More information on HealthVault is at HealthBlog.  This is early-stage work but the combination of a central data repository and automated health information gathering has huge potential.  Technical, social, and legal issues must be overcome to realize the full potential of this service, but if we found a way to directly acquire diagnostic data from hospitals and clinics, this service could become truly amazing.





Friday, February 15, 2008 10:42:46 AM (Pacific Standard Time, UTC-08:00)

If you run a big service and claim to have never had down time you either 1) have close to zero customers or 2) are lying. It’s almost that simple. 


There is considerable concern that Amazon’s AWS service was down for several hours:






Thanks to Jeff Currier and Soumitra Sengupta who told me about the downtime as it was happening last week. The service was reported to be down at 4:30AM. At 10:17, they reported it was resolved.  There are a couple of lessons in here but the first is that internal IT goes down, high scale services go down, client systems fail, networks stop operating, power failures happen, etc.  That’s just the way it is.  You can spend to reduce these factors and you can try to take complete control of the IT infrastructure to avoid them impacting you.  Ironically, in my experience, those that take over and run the entire infrastructure typically do it at lower scale with less experience and have downtime as well.  These small scale services end up costing much more and yet deliver very little additional uptime.  You read about commodity priced, high scale services when they go down. For example, RIM was down last week.  But, the good ones really don’t go down that frequently.  High scale, commodity infrastructure is actually pretty solid and compares very well to vertical, control-all-aspects-of-the-IT-infrastructure approaches.  Amazon AWS generally has earned a pretty good reliability record.
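Back-of-envelope on the numbers above: the 4:30 AM to 10:17 AM window is about five and three-quarter hours and, even if it were the only outage all year, it would cap availability near three nines.

```python
from datetime import timedelta

# outage window reported above: down at 4:30 AM, resolved at 10:17 AM
outage = timedelta(hours=10, minutes=17) - timedelta(hours=4, minutes=30)
year = timedelta(days=365)

# availability if this were the only outage in a full year
availability = 1 - outage / year
print(f"outage length: {outage}")            # 5:47:00
print(f"best-case availability: {availability:.4%}")
```

Just under six hours of downtime in a year works out to roughly 99.93% availability, which is why "never goes down" claims deserve skepticism.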


The second lesson is perhaps the hardest to learn and the most important: customers need information. If a service goes down – actually, I should say, when a service goes down – you need to tell customers what is happening and set expectations on service restoration right away.  There is a temptation to hide the facts because, well, downtime is embarrassing.  Hiding it simply doesn’t work. When people don’t know what is happening, they assume the worst and think you are trying to hide something or aren’t responding properly. Tell them what is happening, invest resources in keeping them up to date with progress, and tell them when you expect to be back up.


It’s hard, it’s embarrassing, but this one matters more than any other.  Long after the downtime is forgotten, people will remember how you handled it. Transparency wins when it comes to service operation – customers who have decided to bet their jobs on your service need timely information for their customers.  If you embarrass your customers, they remember forever.  A little downtime is unfortunate and you need to be getting better all the time, but that’s forgivable.  Just get them the information they need for their dependent businesses.





Friday, February 15, 2008 12:15:07 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, February 13, 2008

Google has published an interesting study of mobile search trends (sent my way by Tren Griffin).  In this study the authors looked at over 1M queries submitted to Google mobile web search over the course of one month.   They found that the average search query was 2.56 words (surprisingly similar to the average desktop query at 2.6 words and the average PDA query at 2.35).   They predictably found a consistent relationship between query length in characters and the length of time it took to enter. The average query took 44.8 seconds including network interactions.  They estimate the overhead to be roughly 5 seconds, meaning the user is willing to spend nearly 40 seconds entering a query. This is amazingly high.  It gives an idea of how valuable the query results are if users are willing to take that long to enter them.  The researchers found less query diversity in the mobile world than the desktop world. The mobile click-through rate on queries was over 50%.


I also found it interesting that users are entering queries faster this year than the comparative data from 2005. The average query time fell from 66.3 seconds to 44.8 seconds (including communications overhead).  The paper speculates this is a combination of improved keyboards and a population more comfortable with using devices.


The full paper is available from:





Wednesday, February 13, 2008 12:15:49 AM (Pacific Standard Time, UTC-08:00)
 Sunday, February 10, 2008

I upgraded my Samsung SGH-i607 to Windows Mobile 6 earlier today.  I had held off upgrading until now, having been told that Internet Connection Sharing doesn’t work on the AT&T Windows Mobile 6 build.  Actually, it’s even a bit of a hassle to make it work on Windows Mobile 5, but it can be done on both WM5 and WM6.  ICS isn’t actually removed from the WM6 build; it’s just not exposed in the user interface, as was the case with WM5, and, in WM6, there are security settings preventing it from operating.  So it is more work to enable, but not really all that much (details on the page referenced below).


Overall I’m happy with the upgrade.  I’ve updated my Blackjack Hacks, Tips, Techniques & Utilities page to include WM6 installation instructions, application unlocking instructions, a pointer to the Internet Connection Sharing enabling procedure, instructions on how to move email and IE temp files to the storage card, a higher information density home page, and a few utilities.  If you have accumulated other interesting tricks, send them my way.  For example, I’ve not yet managed to SIM unlock this one.





Sunday, February 10, 2008 12:16:47 AM (Pacific Standard Time, UTC-08:00)
 Thursday, February 07, 2008

I was down at Amazon last week speaking at their internal developers conference.  It was a fun trip in that I got to catch up with a bunch of old friends – a great many of whom seemed to be working on S3 these days.


I presented Designing and Deploying Internet Scale Services, essentially best practices for writing service-based applications. Additional detail can be found in the paper on which the talk was based:





Thursday, February 07, 2008 12:17:46 AM (Pacific Standard Time, UTC-08:00)
 Tuesday, February 05, 2008

A few months back I was in a debate about the value of shared code segments between virtual machines. In my view there is no question that shared code across VMs has some value, but code is small compared to data, so the impact will be visible but not fundamental. What follows is an inventory of a typical client-side system.


This experiment was done on an IBM T43 laptop with 1GB of memory running Vista RTM, desktop search, Foldershare (it rocks), and Outlook.  Outlook was in use prior to and during the measurement.  The system had been running for three days since the last boot.  The summary stats are:


Kernel Pages

   Kernel Image:

   Kernel Pure Data:

   Kernel Total:


User Pages

   User Code:

   User Data:

   User Total:
Immediately after boot, 22% of the memory was code, which makes sense: as the O/S and apps come up, all constructors and initializers run.  After being memory resident for a few days, only those pages currently in use stay loaded, and the user code percentage fell to 17%.  Ironically, code load time is an issue at start-up, but the actual percentage of code resident in memory over longer runs is fairly small.   Vista Superfetch helps with code load times but, from looking at this data, it’s clear that flash memory could make a huge difference to O/S boot and application load times.


The percentage of memory holding code pages is not that high so when going after memory bloat, look first to the data.
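The measurement above was done with Windows tooling. As a rough analogue, on Linux you can bucket a single process’s resident pages into code vs. data by walking /proc/self/smaps; this is a sketch of the same kind of inventory, not the tool used for the numbers above:

```python
import re

def resident_code_vs_data(smaps_path="/proc/self/smaps"):
    """Sum resident (Rss) kilobytes of executable mappings (code) vs.
    everything else (data, stacks, heap) for one process on Linux."""
    code_kb = data_kb = 0
    perms = ""
    with open(smaps_path) as f:
        for line in f:
            # mapping header lines look like: "7f..-7f.. r-xp ... /usr/bin/..."
            m = re.match(r"[0-9a-f]+-[0-9a-f]+\s+(\S+)", line)
            if m:
                perms = m.group(1)
            elif line.startswith("Rss:"):
                kb = int(line.split()[1])
                if "x" in perms:
                    code_kb += kb      # executable mapping: code
                else:
                    data_kb += kb      # everything else: data
    return code_kb, data_kb

code_kb, data_kb = resident_code_vs_data()
print(f"resident code: {code_kb} KB, data: {data_kb} KB "
      f"({100 * code_kb / (code_kb + data_kb):.0f}% code)")
```

For a long-running process the data share typically dominates, consistent with the point above.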





Tuesday, February 05, 2008 12:18:52 AM (Pacific Standard Time, UTC-08:00)
 Saturday, February 02, 2008

Yesterday, Intel and Micron announced a generational step forward in NAND Flash Write I/O performance.  From the Intel Press release:


The new high speed NAND can reach speeds up to 200 megabytes per second (MB/s) for reading data and 100 MB/s for writing data, achieved by leveraging the new ONFI 2.0 specification and a four-plane architecture with higher clock speeds. In comparison, conventional single level cell NAND is limited to 40 MB/s for reading data and less than 20 MB/s for writing data.
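A quick sense of scale from the quoted rates: the time to write a 1 GB file at the new versus the conventional sustained write speed.

```python
def seconds_to_write(size_mb, mb_per_s):
    # transfer time at a sustained rate, ignoring setup and erase overhead
    return size_mb / mb_per_s

GB = 1024  # MB
print(f"new high-speed NAND (100 MB/s): {seconds_to_write(GB, 100):.1f} s")
print(f"conventional SLC NAND (20 MB/s): {seconds_to_write(GB, 20):.1f} s")
```

Roughly 10 seconds versus 51: a 5x improvement on the write path.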


They don’t actually say it’s an SLC device but they compare it to SLC and it has the typical wear characteristics of SLC (100,000 cycles).  More data from the Micron web site:







·         Industry-standard densities

·         200 MB/s sustained READ, 100 MB/s sustained WRITE, 1.5 ms (typ) erase performance: delivers the fastest read and write throughputs ever for a NAND Flash device

·         Endurance (100,000 cycles): high endurance enables applications that require intensive program and erase operations while prolonging memory life

·         ONFI 1.0/2.0: standard interface enables a high degree of interoperability

·         Temperature range (−25˚C to +85˚C): wide temperature range is ideal for rugged environments

·         1.8V, x8: industry-standard configuration enables easy system design

·         100-ball BGA: industry-standard packaging enables easier density migration

Expect shipments in the latter half of 2008. We should start seeing interesting applications of this technology in SSDs and other devices this year.


Intel Press Release:

More data from Micron:





Saturday, February 02, 2008 12:19:37 AM (Pacific Standard Time, UTC-08:00)
 Friday, February 01, 2008

I saw a video earlier today titled “Great Ideas are a Dime a Dozen” and I just loved it. Unfortunately it’s a Microsoft internal-only video, so I can’t post it here, but I can point to some related talks and videos. The speaker was Bill Buxton of Microsoft Research.


I fell in love with this talk for a variety of reasons: 1) I love and agree with the principle that ideas are cheap; it’s communicating the ideas and making them real that is truly hard and where the greatest talent is required. 2) He argues that you need to get a user experience running quickly and you need to keep it evolving quickly; you need a lightweight experimentation platform to take the user experience from good to great.  I’ve long believed that the difference between the iPhone and some other designs is not being satisfied when it’s “done” and, rather than triaging to ship, just continuing to polish.  Get it running, then get it better. Then throw it out and try again.  Change it some more.  Get it 100% functionally correct and as good as you can possibly get it. Then keep polishing and refining. And 3) he points out that we never have time to properly invest in design at the beginning when the team is small, yet we DO have time to be months or even years late, partly as a consequence of not doing the design up front.  Late projects are when the team is fully staffed and at its biggest and most expensive.  Neither he nor I is arguing for waterfall design.  What Bill is arguing for is human-centric design up front.  Ray Ozzie calls this experience-first design.  Invest in really getting the experience fully understood with super-lightweight development methods.  If you REALLY understand the user experience and it’s really right, developing the product may be the easiest and perhaps most predictable part of the process.  I’ve seen large software teams working on ill-defined and only barely designed products more than once.  As an industry, we need to take some of Bill’s advice.


Bill’s talks and videos are posted online. The closest external example of the video I’m describing above is perhaps What if Leopold Didn't Have a Piano.  Recommended whether you are a designer or a developer.





Friday, February 01, 2008 12:24:56 AM (Pacific Standard Time, UTC-08:00)
 Thursday, January 31, 2008

Earlier today, Microsoft held an internal tribute to Jim Gray, celebrating his contributions to the industry and to each of us personally.  It’s been just over a year since Jim last sailed out of San Francisco Harbor in Tenacious.  He’s been missing since.


Speakers at the tribute to Jim included Rich Rashid, Butler Lampson, Peter Spiro, Tony Hey, David Vaskevitch, Bill Gates, Gordon Bell and myself. Some touched upon Jim’s broad technical contributions across many sub-fields, while others recounted how Jim has influenced or guided them personally through the years.  I focused on the latter, and described Jim’s contribution as a mentor, a connector of people, and a bridger of fields (included below).





What Would Jim Do?


Jim has many skills, but where he has most influenced and impressed me is as a mentor, a connector of people, and a bridger of fields. He became a mentor to me, whether he knew it or not, more than 13 years ago, before either of us had joined Microsoft. Jim is a phenomenal mentor. He invests deeply in understanding the problem you are trying to solve and always has time for deep discussion and debate. Later, I discovered that Jim was an uncanny connector. He knows everyone, and they all want to show him the details of what they are doing. He sees a vast amount of work and forwards the best broadly. He is a nexus for interesting papers, for valuable results, and for new discoveries across many fields. Over time I learned that one of his unique abilities is bridging fields. He can take great work in one field and show how it can be applied in others. He knows that many of the world’s most useful discoveries have been made in the gap between fields, and that some of the most important work has been the application of the technology from one field to the problems of another.


Back in 1994, Pat Selinger decided that I needed to meet Jim Gray, and we went to visit him in San Francisco. Pat and I spent an afternoon chatting with Jim about database performance benchmarks, what we were doing with DB2, compiler design, RISC System 6000 and hardware architecture in general. The discussion was typical for Jim. He’s deeply interested in every technical field from aircraft engine design through heart transplants. His breadth is amazing, and the conversation ranged far and wide. It seemed he just about always knew someone working in any field that came up.


A few months later, Bruce Lindsay and I went to visit Jim while he was teaching at Berkeley. Jim and I didn’t get much of a chance to chat during the course of the day—things were pretty hectic around his office at Berkeley—but he and I drove back into San Francisco together. As we drove into the sunset over the city, approaching the Bay Bridge, Jim talked about his experience at Digital Equipment Corporation. He believed a hardware company could sell software, but would never be able to really make software the complete focus it needed to be. He talked of DEC’s demise and said, “They were bound and determined to fail as a hardware company rather than excel as a software company.”


The sunset, the city, and the Bay Bridge were stretched across the windscreen. It was startlingly beautiful. Instead of making conversation with Jim, I was mostly just listening, reflecting and contemplating. At the time, I was the lead architect on IBM DB2. And yes, I too worked for a hardware company. Everything Jim was relating of his DEC experience sounded eerily familiar to me. It was as though Jim was summarizing my own experiences rather than his. I hadn’t really thought this deeply about it before, but the more I did, the more I knew he was right. This was the beginning of my thinking that probably I should be working at a software company.


He didn’t say it at the time, and, knowing Jim much better now, I’m not sure he would have even thought it, but the discussion left me thinking that I needed to aim higher. I needed to know more about all aspects of the database world, more about technology in general, and to think more about how it all fit together. Having some time to chat deeply with Jim changed how I looked at my job and where my priorities were. I left the conversation pondering responsibility and the industry, and believing I needed to do more, or at least to broaden the scope of my thinking.


I met Jim again later that year at the High Performance Transaction Systems workshop. During the conference, Jim came over, sat down beside me, and said “How are you doing, James Hamilton?” This is signature Jim. I’ll bet nearly everyone he knows has had one of those visits during the course of a conference. He drops by, sits down, matches eyes, and you have 110% of his attention for the next 15 to 20 minutes. Jim’s style is not to correct or redirect. Yet, after each conversation, I’ve typically decided to do something differently. It just somehow becomes clear, and obviously the right thing to do, by the end of the discussion.


In 2006 I got a note from Jim with the subject “Mentor—I need to say I’m helping someone so…”  While it was an honor to officially be Jim’s mentee, I didn’t really expect this to change our relationship much.  And, of course, I was wrong. Jim approaches formal mentorship with his typical thoroughness and, in this role, he believes he has signed up to review and assist with absolutely everything you are involved with, even if not work-related. For example, last year I had two articles published in boating magazines and Jim insisted on reviewing them both. His comments included the usual detailed insights we are all used to getting from him, and the articles were much better for it. How does he find the time?


For years, I’ve read every paper Jim sent my way. Jim has become my quality filter in that, as much as I try, I can’t cast my net nearly as wide nor get through close to as much as I should. Like him, I’m interested in just about all aspects of technology but, unlike him, I actually do need to sleep. I can’t possibly keep up. There are hundreds of engineers who receive papers and results from him on a regular basis. Great research is more broadly seen as a result of his culling and forwarding. Many of us read more than we would have otherwise, and are exposed to ideas we wouldn’t normally have seen so early or, in some cases, wouldn’t have seen at all.


Jim’s magic as a mentor, connector and bridger is his scaling. The stories above can be repeated by hundreds of people, each of whom feels as though they had Jim’s complete and undivided attention. To contribute deeply to others at this level is time-consuming, to do it while still getting work done personally is even harder, and to do it for all callers is simply unexplainable. Anyone can talk to Jim, and an astonishing number frequently do. And because his review comments are so good, and he’s so widely respected, a mammoth amount is sent his way. He receives early papers and important new results across a breadth of fields from computer architecture, operating system design, networking, database, transaction processing, astronomy, and particle physics. The most interesting work he comes across is forwarded widely. He ignores company bounds, international bounds, bounds of seniority, and simply routes people and useful data together. Jim effectively is a routing nexus where new ideas and really interesting results are distributed more broadly.


Over the past year I’ve received no papers from Jim. There has been no advice. I’ve not had anything reviewed by him. And, I’ve not been able to talk to him about the projects I’m working on. When I attend conferences, there have been no surprise visits from Jim. Instead, I’ve been operating personally on the basis of “What would Jim do?”

Each of us has opportunities to be mentors, connectors and bridgers. These are our chances to help Jim scale even further. Each of these opportunities is a chance to pass on some of the gift Jim has given us over the years. When you are asked for a review or to help with a project, just answer “GREAT!!!” as he has so many times, and the magic will keep spreading.


This year, when I face tough questions, interesting issues, or great opportunities, I just ask, “What would Jim do?”  And then I dive in with gusto.


James Hamilton, 2006-01-16.



Thursday, January 31, 2008 12:26:09 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, January 30, 2008

Founders at Work is a series of 32 interviews with founders of well-known startups. Some have become very successful as independent companies, such as Apple (Steve Wozniak was interviewed), Adobe Systems (Charles Geschke), and Research in Motion (Mike Lazaridis).  Others were major successes through acquisition, including Mitch Kapor (Lotus Development), Max Levchin (PayPal), Steve Perlman (WebTV), and Ray Ozzie (Iris Associates & Groove Networks).  Some are still startups, and some failed long ago.  The book itself is not amazingly well written, but I found the interviewees captivating, and the book was great by that measure.


The book gives a detailed window into how startups are made, how some have succeeded, and how some have failed.  Portions of the book open small windows into the VC community; the story of how Draper Fisher Jurvetson (DFJ) worked with Sabeer Bhatia (Hotmail) was revealing.


Some common themes emerged for me as I read through the book.  One was that success often came from great people coming together without much funding but with considerable motivation; they just kept trying things, evolving, failing, trying again, trying some more, and then changing again.  Often success comes not from a brilliant, well-funded idea but from intense drive, trying things quickly, and failing fast.  Often the VC funded idea X and the money was used to develop a completely unrelated idea.  Often success was found as the last dollar was spent.  I’m quite certain that the ones we didn’t read about were the ones where the last dollar was spent just before success was found.  The lesson for us is to spend small when investigating a new idea.  Move fast, spend little, keep the team small, and keep evolving.  Admit when it’s not working and keep trying related ideas.  It was an enjoyable read.





Wednesday, January 30, 2008 12:27:11 AM (Pacific Standard Time, UTC-08:00)
 Monday, January 28, 2008

Exactly one year ago, Jim Gray guided his sailboat Tenacious out of San Francisco’s Gashouse Cove Marina into the Bay. He sailed under the Golden Gate Bridge and continued towards the Farallon Islands, some 27 miles off the California coastline. Until that morning, I chatted with Jim via email or phone several days a week.  He has reviewed everything I’ve written of substance for many years. When I consider job changes, I’ll always bounce them off him first.  When I come across something particularly interesting, I’ll always send it Jim’s way.  Every month or so, he’ll send me an interesting pre-publication paper.  If a conference deadline like CIDR or HPTS is approaching, he’ll start pushing me to write something up and keep pushing until it happens. Every four to six months, he’ll decide “I just have to meet” someone with overlapping interests, someone whose work is particularly interesting, or who is perhaps just a super-clear thinker and worth getting to know.


What’s truly remarkable is that tens and perhaps hundreds of people can say exactly the same thing. He has time for everyone and everyone has similar stories of mentorship, advice, detailed explanation, patience, and insightful reviews. Jim’s magic is that he does this for a huge cross-section of our industry. He knows no bounds and always manages to find the time to help without regard for who’s asking.


Jim is still missing.  Over the past year I’ve received no papers from Jim.  There’s been no advice.  I’ve not had anything reviewed by him. And, I’ve not been able to talk to him about projects I’ve been working on.  When I attend a conference, I don’t get the usual surprise visits from Jim. It’s been exactly one year and we know no more today than we did a year ago. Jim remains missing. We all miss him deeply.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Monday, January 28, 2008 10:43:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Friday, January 25, 2008

If you are interested in boating, the Seattle Boat Show opened yesterday.  Jennifer and I will be presenting on the red stage at 4:15 on Saturday, February 2nd.  Our presentation will cover some of our favorite anchorages and cruising areas selected from Cruising the Secret Coast: Unexplored Anchorages on British Columbia’s Inside Passage, which just went to the printer a couple of weeks back.  Drop by if you want to talk about boating (or high-scale distributed systems).





Friday, January 25, 2008 12:29:55 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, January 22, 2008

In Designing and Deploying Internet Scale Services, I’ve argued that all services should expect to be overloaded and all services should expect to have to manage mass failures.


Degraded operations mode is a means of dealing with the excess load that will happen at some point in the life of your service. Sooner or later, you’ll get an unpredicted number of new customers or more concurrent users, or you’ll have part of the server fleet down and get hit with unexpected load.  Sooner or later you’ll have more customer requests than you have resources to satisfy. When this occurs, many services just run slower and slower and eventually start failing with timeouts.  Basically, every user in the system gets a very bad experience.  A more serious example is a login storm. For most services, steady state operation is much less resource intensive than user login.  So, in the event of a global or broad service failure, millions of users will arrive back at once attempting to log in.  The service fails again under the load and the cycle repeats. It’s not a good place to be. A more drastic approach to avoiding this problem is admission control: only allow users into the service when you have resources left to serve them.  Essentially, give a few customers a bad experience by not letting them onto the service in order to avoid giving all customers a bad experience.
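As a concrete illustration, a fine-grained admission control for the login path can be as simple as a token bucket in front of login. The sketch below is hypothetical: the class name and the rate and burst numbers are invented for illustration, not taken from any production service.

```python
import time

class LoginAdmissionControl:
    """Token-bucket throttle for login requests (illustrative sketch).

    Steady-state traffic passes untouched; a post-outage login storm is
    clipped to `rate` logins/second, and turned-away clients are asked
    to retry later rather than timing out inside the service.
    """

    def __init__(self, rate, burst):
        self.rate = rate           # sustainable logins per second
        self.capacity = burst      # short bursts allowed above the rate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True            # admit the login
        return False               # reject politely; client backs off and retries

# A simulated storm: 1,000 simultaneous login attempts against a
# bucket sized for 100 logins/second with a burst of 20.
throttle = LoginAdmissionControl(rate=100, burst=20)
admitted = sum(1 for _ in range(1000) if throttle.try_admit())
```

The point of the sketch is the shape, not the numbers: the service keeps serving the admitted fraction well instead of serving everyone badly.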


There is much that can be done between the first option, service failure under high load, and the other end of the spectrum, admission control.  I call this middle ground degraded operations mode.  In the limit, all services need admission control to avoid complete and repeating service failure under extreme load, but you hope that admission control is never used.  Degraded operations mode allows a service to continue to take on new load after it reaches capacity by shedding unnecessary tasks.  Most services have batch jobs running tasks that need to be done but that no customer is actually waiting on: reporting, backup, index creation, system maintenance, copying data to warehouse servers, etc.  In most services a substantial amount of this work can be deferred without negatively impacting the service.  Clearly these operations need to be run eventually, and how long each can be delayed is task and service specific.  Temporarily shedding these batch jobs allows more customers to be served.  The next level of degraded operations mode is to restrict the quality of service in some way. If some operations are far more expensive, you may only allow users to access a subset of the full service functionality.  For example, you might allow transactions but not reporting, if that makes sense for your service.  Finding these degraded modes of operation is difficult and very application specific, but they are always there and it’s always worth finding them.  There WILL be a time when you have more users than resources.
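To make the spectrum above concrete, here is a minimal sketch of graduated load shedding. The request classes and load thresholds are invented for illustration; a real service would derive them from measurement and make them operator-tunable.

```python
# Hypothetical request classes; the names and thresholds are illustrative.
BATCH = "batch"          # reporting, backup, index rebuilds: defer first
EXPENSIVE = "expensive"  # costly optional features: shed second
CORE = "core"            # the transactions customers are waiting on

def admit(request_class, load):
    """Decide whether to run a request given current load (1.0 = capacity)."""
    if load < 0.7:
        return True                       # normal operation: run everything
    if load < 0.9:
        return request_class != BATCH     # degraded mode 1: defer batch work
    if load < 1.0:
        return request_class == CORE      # degraded mode 2: core service only
    # Beyond capacity: admission control even for core requests.
    return request_class == CORE and load < 1.2
```

The useful property is that customers see a gradually narrowing service rather than a cliff, and the batch backlog simply drains later when load drops.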


15 years ago I worked on an Ada language compiler, and one of the target hardware platforms for this compiler was a Navy fire control system.  This embedded system had a large red switch tagged “Battle Ready Mode”.  This switch would disable all automatic shutdowns and put the server into a mode where it would continue to run when the room was on fire or water was beginning to rise up the base of the computer.  In this mode, it runs until it dies.  In the services world, this isn’t exactly what we’re after, but it’s closely related.  We want all systems to be able to drop back to a degraded operation mode that allows them to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures.  We need to design and, most importantly, we need to test these degraded modes of operation in at least limited production, or they won’t work when we really need them.  Unfortunately, all but the very least successful services will need these degraded operations modes at least once.


Degraded operation modes are service specific and, for many services, the initial developer gut reaction is that everything is mission critical and there exist no meaningful degraded modes for their specific service.  But they are always there if you take it seriously and look hard.  The first level is to stop all batch processing and periodic jobs.  That’s an easy one: almost all services have some batch jobs that are not time critical.  Run them later.  That one is fairly easy, but the rest are harder to come up with.  It’s hard to produce a lower-quality customer experience that is still useful, but I’ve yet to find a service where none was available. As an example, consider Exchange Hosted Services (an email anti-malware and archiving service).  In that service, the mail must get delivered.  What is the degraded operation mode?  Degraded modes can actually be found there as well.  Here are some examples: turn up the aggressiveness of email edge blocks, defer processing of mail classified as spam until later, process mail from known users of the service ahead of unknown users, prioritize platinum customers ahead of others.  There actually are quite a few options.  The important point is to think through what they are and ensure they are developed and tested before the operations team needs them in the middle of the night.


A few months back Skype had a problem where the entire service went down, or mostly down, for more than a day.  What they report happened was that Windows Update forced many reboots, which led to a flood of Skype login requests “that when combined with lack of peer to peer resources had a critical impact”.  There are at least two interesting factors here, one generic to all services and one Skype specific.  Generically, it’s very common for login operations to be MUCH more expensive than steady state operation, so all services need to engineer for login storms after service interruption.  The WinLive Messenger team has given this considerable thought and has considerable experience with the issue.  They know there needs to be an easy way to throttle login requests such that you can control the rate at which they are accepted (a fine-grained admission control for login).  All services need this or something like it, but it’s surprising how few have actually implemented this protection and tested it to ensure it works in production.  The Skype-specific situation is not widely documented but hinted at by the “lack of peer-to-peer resources” note in the quote above.  In Skype’s implementation, the lack of an available supernode will cause a client to report login failure (sent to me by Sharma Kunapalli).  This means that nodes can’t log in unless they can find a supernode.  This has a nasty side effect: the fewer clients that successfully log in, the more likely it is that other clients won’t find a supernode, and if they can’t find a supernode, they won’t be able to log in either.  Basically, the entire network can become unstable due to the dependence on finding a supernode to log a client into the network.  For Skype, a great degraded operation mode would be to allow login even when a supernode can’t be found: let the client get on and establish peer connectivity later.


Why wait for failure and the next post-mortem to design in and production test degraded operations for your services?  Make it part of your next release.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Tuesday, January 22, 2008 12:30:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, January 21, 2008

A couple of weeks back I attended the Berkeley RAD Lab Retreat.  At this retreat, the RAD Lab grad students present their projects and, as is typical of Berkeley retreats, the talks were quite good.  It was held up at Lake Tahoe, which was great for the skiers but also made for an interesting drive.  Chains were required for the drive from Reno to Lake Tahoe, and I was in a rental car with less-than-great summer tires and, of course, no chains.


It snowed hard for much of the retreat.  When leaving I took a picture of a pickup truck completely buried in the parking lot:


The talks included: Scalable Consistent Document Store, Prototype of the Instrumentation Backplane, Response time modeling for power-aware resource allocation, Using Machine Learning to Predict Performance of Parallel DB Systems, Diagnosing Performance Problems from Trace data using probabilistic models, Xtrace to find Flaws in UC Berkeley Wireless LAN, Exposing Network Service Failures with Datapath Traces, Owning Your Own Inbox: Attacks on Spam Filters, Declarative Distributed Debugging (D3), Policy Aware Switching Layer, Tracing Hadoop, Machine-Learning-Enabled Router to Deal with Local-Area Congestion, A Declarative API for Secure Network Applications, and Deterministic Replay on multi-processor systems.


Basically, the list of talks presented came pretty close to what I would list as the most interesting challenges in services and service design.  Great stuff.  In addition to the talks, there is always an interesting group of folks from industry, and this year was no exception.  I had a good conversation over dinner with Luiz Barroso and a brief chat with John Ousterhout.


The flight back was more than a bit interesting as well.  We left Reno heading towards Seattle in a small prop plane.  Thirty minutes into the trip, I was starting to wonder what was wrong in that I could see the aircraft landing gear doors opening and closing repeatedly from my wing-side seat.  Shortly thereafter the pilot announced that we had a gear problem and needed to return to Reno.  We returned and did a low pass over the Reno airport so that the tower could check the landing gear position via binoculars.  Then we circled back and landed with a fire truck chasing us down the runway.  We stayed out on the active taxiways, with the airport closed to incoming and outgoing traffic, while a crew came out to the aircraft and pinned the gear in the down position before moving the plane to the terminal.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Monday, January 21, 2008 12:34:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Friday, January 18, 2008

Dave Dewitt and Michael Stonebraker posted an article worth reading yesterday titled MapReduce: A Major Step Backwards (thanks to Kevin Merrit and Sriram Krishnan for sending this one my way). Their general argument is that MapReduce isn’t better than current generation RDBMSs, which is certainly true in many dimensions, and that it isn’t a new invention, which is also true.  I’m not in agreement with the conclusion that MapReduce is a major step backwards, but I’m fully in agreement with many of the points building toward that conclusion.  Let’s look at some of the major points made by the article:


1. MapReduce is a step backwards in database access

In this section, the authors argue that schema is good, separation of schema and application is good, and high level language access is good. On the first two points, I agree schema is good, and there is no question that application/schema separation long ago proved to be a good thing.  The thing to keep in mind is that MapReduce is only an execution framework.  The data store is GFS, or sometimes Bigtable, in the case of Google, and HDFS or HBase in the case of Hadoop. Since MapReduce is only the execution framework, it’s not 100% correct to argue that MapReduce doesn’t support schema: that’s a store issue, and it is true that most stores MapReduce runs over don’t implement these features today.


I argue that a separation of the execution framework from the store and indexing technology is a good thing, in that MapReduce can be run over many stores.  You can use MapReduce over either BigTable (which happens to be implemented on GFS) or over GFS directly, depending upon the type of data you have at hand.  I think that Dewitt and Stonebraker would both agree that breaking up monolithic database management systems into extensible components is a very good thing to do. In fact, much of the early work in extensible database management systems was done by David Dewitt.  The point here is that Dewitt and Stonebraker would like to see schema enforcement as part of the store and, generally, I agree that this would be useful.  However, MapReduce is not a store.


They also argue that high level languages are good.  I agree, and any language can be used with MapReduce systems, so this isn’t a problem and is supported today.


2. MapReduce is a poor implementation

The argument here is that any reasonable structured store will support indexes.  I agree that for many workloads you absolutely must have indexes. However, many data mining and analysis algorithms access all the data in a data set.  Indexes, in these cases, don’t help.  This is one of the reasons why many data mining algorithms run poorly over an RDBMS: if all they are going to do is repeatedly scan the same data, a flat file is faster.  It depends upon the application access pattern and the amount of data that is accessed.  A common execution approach for data mining algorithms is to export the data to a flat file and then operate on it there.  An index helps when you are looking at a small subset of the data, and there is a point N where, if you are looking at less than N% of the data, the index helps and should be used; if you are looking at more than N%, you are better off table scanning.  The point N is implementation dependent, but storage technology trends have been pushing this number down over the years.  Basically, some algorithms look at all the data and aren’t helped by indexes, and some look at only a portion of the data; for those that look at more than N% of the data, the index again won’t help.
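The break-even point N falls out of a simple cost comparison. The toy model below is illustrative only: the 10x random-I/O penalty and the table sizes are assumptions chosen to make the arithmetic visible, not measurements.

```python
# Toy cost model: a full scan pays one cheap sequential read per page,
# while an index pays one expensive random read per matching row.
def scan_cost(pages):
    return pages * 1.0                          # sequential reads are cheap

def index_cost(rows, selectivity, random_penalty=10.0):
    return rows * selectivity * random_penalty  # one random read per hit

def prefer_index(rows, pages, selectivity):
    return index_cost(rows, selectivity) < scan_cost(pages)

# With 1M rows packed onto 10K pages and a 10x random-I/O penalty,
# break-even is N = 10,000 / (1,000,000 * 10) = 0.1% selectivity:
# below that the index wins, above it the table scan wins.
```

As the narrative above notes, faster sequential bandwidth relative to seeks (a larger random_penalty here) pushes N down over time.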


There is no question that indexes are a good thing, and there is no arguing that much of the world’s persistent storage access is done through indexes.  Indexes are good.  But they are not good for all workloads and all access patterns.  Remember, MapReduce is not a store, only an execution framework. Implementing indexing in a store used by MapReduce would be easy, and presumably someone will when its need is broadly noticed.  In the interim, indexes can be built using MapReduce jobs and then used by subsequent MapReduce jobs.  That’s certainly more of a hassle than stores that automatically maintain indexes, but it is acceptable for some workloads.
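As a sketch of building an index with a MapReduce job, here is a tiny in-process map/shuffle/reduce that produces an inverted index. The record IDs, sample data, and helper names are invented for illustration; a real job would of course run distributed across a cluster rather than in one process.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper over every record and group emitted pairs by key (shuffle)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Invented sample data: (record_id, text) pairs.
records = [("r1", "alpha beta"), ("r2", "beta gamma")]

# Mapper emits (term, record_id); reducer sorts into a posting list.
index = reduce_phase(
    map_phase(records, lambda r: [(term, r[0]) for term in r[1].split()]),
    lambda term, ids: sorted(ids),
)
# index maps each term to the records containing it: "beta" -> ["r1", "r2"]
```

A subsequent job can then read this index instead of rescanning the raw data, which is exactly the hand-maintained arrangement described above.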


3. MapReduce is not novel

This is clearly true. These ideas have been fully and deeply investigated by the database community in the distant past.  What is innovative is scale.  I’ve seen MapReduce clusters of 3,000 nodes and I strongly suspect that clusters of 5,000+ servers can be found if you look in the right places.  I’ve been around parallel database management systems for many years but have never seen multi-thousand node clusters of Oracle RAC or IBM DB2 Parallel Edition.  The innovative part of MapReduce is that it REALLY scales and, for where MapReduce is used today, scale matters more than everything else.  I’ll claim that 3,000 server query engines ARE novel but I agree that the constituent technologies have been around for some time.


4.  MapReduce is missing features

All of the missing features (bulk loader, indexing, updates, transactions, RI, views) are features that could be implemented in a store used by MapReduce.  As these features become important in domains over which MapReduce is used, they can be implemented in the underlying stores.  I suspect that, as long as MapReduce is used for analysis and data mining workloads, the pressing need for RI may never get strong enough to motivate someone to implement it.  However, it clearly could be done, and the absence of RI in many stores is not a shortcoming of MapReduce.


5.  MapReduce is incompatible with the DBMS tools


I 100% agree. Tools are useful, and today many of these tools target RDBMSs.  It’s not mentioned by the authors, but another useful characteristic of an RDBMS is that developers understand it and many people know how to write SQL.  It’s a data access and manipulation language that is broadly understood.  The thing to keep in mind is that MapReduce is part of a componentized system.  It’s just the execution framework. I could easily write a SQL compiler that emitted MapReduce jobs (SQL doesn’t dictate or fundamentally restrict the execution engine design).  MapReduce can be run over simple stores, as it mostly is today, or over stores with near database-level functionality if needed.
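To illustrate the point that SQL doesn’t dictate the engine, here is what a hand-compiled GROUP BY with COUNT(*) might look like when expressed as map and reduce functions. The table, column names, and the tiny in-process runner are all invented for illustration; an actual SQL-to-MapReduce compiler would emit distributed jobs.

```python
from itertools import groupby

def run_mapreduce(rows, mapper, reducer):
    """Minimal in-process MapReduce: map, sort-based shuffle, reduce."""
    pairs = sorted(kv for row in rows for kv in mapper(row))
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# The query being "compiled":  SELECT dept, COUNT(*) FROM emp GROUP BY dept
emp = [{"name": "a", "dept": "dev"},
       {"name": "b", "dept": "dev"},
       {"name": "c", "dept": "test"}]

counts = run_mapreduce(
    emp,
    mapper=lambda row: [(row["dept"], 1)],  # emit (group key, 1) per row
    reducer=lambda dept, ones: sum(ones),   # COUNT(*) within each group
)
# counts == {"dev": 2, "test": 1}
```

The GROUP BY key becomes the shuffle key and the aggregate becomes the reducer, which is the whole translation for queries of this shape.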


I’m arguing that the languages with which MapReduce jobs are expressed could be higher level, and there have been research projects to do this.  Even a SQL compiler is possible over MapReduce.  And I’m arguing that MapReduce could be run over very rich stores with indexes and integrity constraints, should that become broadly interesting.  MapReduce is just an execution engine that happens to scale extremely well. For example, in the MapReduce-like system used around Microsoft, there exist layers of languages above the execution engine that offer different levels of abstraction and control on the same engine.


An execution engine that runs on multi-thousand node clusters really is an important step forward.  The separation of execution engine and storage engine into extensible parts isn’t innovative but it is a very flexible approach that current generation commercial RDBMS could profit from.


I love MapReduce because I love high scale data manipulation.  What can be frustrating for database folks is that 1) most of the ideas of MapReduce have been around for years and 2) there have been decades of good research in the DB community on execution engine techniques and algorithms that haven’t yet been applied to the MapReduce engines. Many of these optimizations from the DB world will help make better MapReduce engines.  But, for all these faults, MapReduce sure does scale, and it’s hard not to love being able to submit a job and see several thousand nodes churning over several petabytes of data.  Priceless.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Friday, January 18, 2008 12:34:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, January 15, 2008

The article below is a restricted version of what I view to be the next big thing.  If I were doing a start-up today, it would be data analysis and optimization as a service.  The ability to run real-time optimization over an understandable programming platform on a multi-thousand node cluster is valuable to almost every industry.  The smaller their margins, the less they can afford not to use yield management and optimization systems.  The airlines showed us years ago that yield management systems are a big part of what it takes to optimize the profitability of an airline.  Walmart did the same thing to brick-and-mortar retail.  Amazon applied the same to online commerce.  The financial community has been doing this for years.  The online advertising business is driven by data analysis.  What’s different as a service is that the technology can be made available to the remaining 99% of the business world, those that don’t have the scale to afford doing their own.


Storing blobs in the sky is fine but pretty reproducible by any competitor.  Storing structured data as well as blobs is considerably more interesting, but what has even more lasting business value is storing data in the cloud AND providing a programming platform for multi-thousand node data analysis.  Almost every reasonable business on the planet has a complex set of dimensions that need to be optimized. The example below is a good one, but it’s only one.  Almost every business can be made more profitable by employing some form of yield management.  And it’s ideally set up for a service-based solution: the output data is small, the algorithms to be run are small, and the data that forms the input is HUGE but accumulates slowly over time, historically large in aggregate but not that large per unit time.  The input data per unit time is of manageable size and could be sent to the service.  The service provides petabytes of storage, thousands of processors, and a platform to run data analysis and optimization programs.  This service would also be great for ISVs, with companies springing up all over to write interesting algorithms that can be rented and run in the infrastructure.




From: Jeff Carnahan
Subject: Fluidity better than perfection?

Good reading for a Saturday morning…


Flight Plan

The math wizards at Dayjet are building a smarter air taxi--and it could change the way you do business.

From: Issue 115 | May 2007 | Page 100 | By: Greg Lindsay | Photographs By: Jill Greenberg and Courtesy DayJet

It's only fitting that a service pitched to traveling salesmen should find itself confronting an especially nasty version of what's known as the "traveling-salesman problem." Stated simply: Given a salesman and a certain number of cities, what's the shortest possible path he should take before returning home? It's a classic conundrum of resource allocation that rears its ugly head in industries ranging from logistics (especially trucking) to circuit design to, yes, flesh-and-blood traveling salesmen: How do you minimize the cost and maximize your efficiency of movement?

Solving for X: Once the FAA clears the way for the Eclipse 500, Iacobucci will get to see how good his models really are.

Back in 2002, that was the question facing DayJet, a new air-taxi service hoping to take off this spring. Based in Delray Beach, Florida, DayJet will fly planes, but its business model isn't built around its growing fleet of spanking-new Eclipse 500 light jets. It's built on math and silicon, and the near-prophetic powers that have in turn emerged from them. "We're a software and logistics company that only happens to make money flying planes," insists Ed Iacobucci, an IBM (NYSE:IBM) veteran and cofounder of Citrix Systems (NASDAQ:CTXS), who started DayJet as his third act.

The advent of affordable air taxis has been heralded by a steady drumbeat of press over the past few years, with an understandable fixation on the sexy new technology that's generally credited with making the market possible: the planes. The Eclipse 500 is a clean-sheet design for a tiny jet that seats up to six and costs about $1.5 million (the Federal Aviation Administration may clear it for mass production as early as next month). It is also the most fuel-efficient certified jet in the sky. Cessna, meanwhile, has rolled out its own, if pricier, "very light jet" (VLJ), with Honda's (NYSE:HMC) set to appear in 2010. No less an authority than The Innovator's Dilemma author and Harvard Business School professor Clayton Christensen has mused in print that the E500 and its ilk "could radically change the airline industry" by disrupting the hub-and-spoke system we all know and despise.

But Iacobucci, who wrote a check long ago for more than 300 orders and options on Eclipse's first planes, isn't relying on the aircraft to make or break him. Instead, it's his company's software platform--and the novel way it attacks the traveling-salesman problem--that will set DayJet apart. On day one of operations, flying from just five cities in Florida with only 12 planes, DayJet's dispatchers will already have millions of interlocking flight plans to choose from. As the company's geographic footprint spreads (with luck) across the Southeast--and as its fleet expands as well--the computational challenge only gets worse. Factor in such variables as pilot availability, plane maintenance schedules, and the downpours that drench the peninsula like clockwork in the summer, and well, you get the idea: Finding the shortest, fastest, and least-expensive combination of routes could take every computer in the universe until the end of time.

"I knew what the complexities were and how the problem degenerates once you reach a threshold," Iacobucci says. So he didn't try to find the optimal solution. Instead, DayJet began looking for a family of options that create positive (if imperfect) results--following a discipline known as "complexity science."
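To give a feel for the “positive if imperfect” approach the article describes, the classic nearest-neighbor shortcut for the traveling-salesman problem trades optimality for speed. The sketch below uses invented city coordinates and is in no way DayJet’s actual algorithm; it just shows the flavor of a cheap heuristic versus an exact search.

```python
from math import dist

def nearest_neighbor_tour(cities, start=0):
    """Greedy heuristic: always fly to the closest unvisited city next."""
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        here = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: dist(here, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(cities, tour):
    # Include the leg back home, as the problem statement requires.
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

# Invented coordinates: four "cities" on a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
tour = nearest_neighbor_tour(cities)
# On this trivial layout the heuristic happens to find the optimal
# tour of length 4.0; on large instances it is merely decent but is
# computable in milliseconds rather than the lifetime of the universe.
```

This runs in O(n²) against the exact problem’s exponential blowup, which is exactly the trade DayJet’s optimizer makes at a vastly more sophisticated level.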

For the past five years, with no planes, pilots, or customers, DayJet has been running every aspect of its business thousands of times a day, every day, in silicon. Feeding in whatever data they could find, Iacobucci and his colleagues were determined to see how the business would actually someday behave. When DayJet finally starts flying, they'll switch to real-time flight data, using their operating system to shuttle planes back and forth the way computers shuttle around bits and bytes.

Iacobucci is an expert at building operating systems--he did it for decades at IBM and Citrix. Because of that, he has zero interest in the loosey-goosey world of Web 2.0. He sees the next great opportunities in business as a series of operating systems designed to model activities in the real world. DayJet looks to be the first, but he has no doubt there will be others, and that new companies, and even new industries, will appear overnight as computers tease answers out of previously intractable problems.

Which brings us back to the traveling salesmen. Iacobucci says his computer models predict that DayJet's true competitors are not the airlines, but Bimmers and Benzes--he says 80% of his revenue will come from business travelers who would otherwise drive. In other words, DayJet, which closed an additional $50 million round of financing in March, is creating a market where none exists, an astonishing mathematical feat. To get there, all Iacobucci needed was five years, a professor with a bank of 16 parallel processors, two so-called Ant Farmers, and a pair of "Russian rocket scientists" who, it turns out, are neither Russian nor rocket scientists.

"This is way nastier than any of the other airline-scheduling work we've ever done," says Georgia Tech professor George Nemhauser, whose PhD students have been helping to map the scope of DayJet's mountain-sized scheduling dilemma. "You can think of this as a traveling-salesman problem with a million cities, and that's a problem DayJet has to solve every day."

Tapping into the school's computing power, Nemhauser and his students have figured out how to calculate a near-perfect solution for 20 planes in a few seconds' worth of computing time and a solution for 300 planes in 30 hours. But as impressive as that is, in the real world, it's not nearly enough. That's because in order for DayJet's reservations system to succeed, Iacobucci and company need an answer and a price in less than five seconds, the limit for anyone conditioned to Orbitz or Expedia (NASDAQ:EXPE). Because DayJet has no preset schedule--and because overbooking is out of the question (DayJet will fly two pilots and three passengers maximum)--any request to add another customer to a given day's equation requires its software to crunch the entire thing again.

One of Iacobucci's oldest pals and investors, former Microsoft (NASDAQ:MSFT) CFO and Nasdaq chairman Mike Brown, pointed him toward a shortcut--a way to cheat on the math. Brown had retired with his stock options to pursue his pet projects in then bleeding-edge topics such as pattern recognition, artificial intelligence, nonlinear optimization, and computational modeling. His dabblings led him first to Wall Street, where he invested in a trading algorithm named FATKAT and eventually to Santa Fe, New Mexico, ground zero for complexity science.

Iacobucci says 80% of his revenues will come from travelers who would otherwise drive. DayJet, in other words, is creating a market where none existed, an astonishing mathematical feat.

Invented by scientists at the nearby Los Alamos National Laboratory in the 1980s, complexity science is a gumbo of insights drawn from fields as diverse as biology, physics, and economics. At its core is the belief that any seemingly complex and utterly random system or phenomenon--from natural selection to the stock market--emerges from the simple behavior of thousands or millions of individuals. Using computer algorithms to stand in for those individual "agents," scientists discovered they could build fantastically powerful and detailed models of these systems if only they could nail down the right set of rules.

When Brown arrived in town in the late 1990s, many of the scientists-in-residence at the Santa Fe Institute--the serene think tank dedicated to the contemplation of complexity--were rushing to commercialize their favorite research topics. The Prediction Co. was profitably gaming Wall Street by spotting and exploiting small pockets of predictability in capital flows. An outfit called Complexica was working on a simulator that could basically model the entire insurance industry, acting as a giant virtual brain to foresee the implications of any disaster. And the BiosGroup was perfecting agent-based models that today would fall under the heading of "artificial life."

By the time Iacobucci mentioned his logistical dilemma to Brown in 2002, however, most of Santa Fe's Info Mesa startups were bobbing in the dotcom wreckage. But Brown knew that Bios had produced astonishingly elegant solutions a few years earlier by creating virtual "ants" that, when turned loose, revealed how a few false assumptions or bottlenecks could throw an entire system out of whack. A model Bios built of Southwest's cargo operations, for example, cost $60,000 and found a way to save the airline $2 million a year.

Brown proposed that Iacobucci supplement his tool kit with a healthy dose of complexity science. Iacobucci was already hard at work building an "optimizer" program that employed nonlinear algorithms and other mathematical shortcuts to generate scheduling solutions in seconds. But what he really needed, Brown suggested, was an agent-based model (ABM) that would supply phantom traveling salesmen to train the optimizer. Without it, he'd essentially be guessing at the potential number and behavior of his future customers. "Eddy took no convincing," Brown says. "He was telling me, 'Get some guys down here and let's do this.'"

Brown dug up the Ant Farmers, a pair of Bios refugees and expert modelers named Bruce Sawhill and Jim Herriot. Sawhill had been a theoretical physicist at the Santa Fe Institute, while Herriot had been a member of the original team that invented Java at Sun Microsystems (NASDAQ:SUNW). Together, they're DayJet's own Mutt and Jeff, with Herriot playing congenial science professor and Sawhill his mischievous sidekick.

Meanwhile, to build the optimizer, Iacobucci recruited his pair of Russian rocket scientists: Alex Khmelnitsky and Eugene Taits, mathematical wizards he'd hired once before at Citrix. Rather than tackle every scheduling contingency via brute-force computing, the Russians cheated by slicing and dicing the problem into more manageable chunks. They used opaque mathematical techniques such as heuristics and algebraic multigrids, which elegantly subdivide a sprawling problem like this one into discrete patches that can be solved (within limits) simultaneously.
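
The divide-and-conquer instinct behind that approach can be sketched in a few lines. This is emphatically not DayJet's algorithm — just a minimal illustration of trading a guaranteed optimum for speed: partition a routing problem into geographic patches and solve each with a cheap nearest-neighbor heuristic instead of brute-forcing the whole thing at once.

```python
import math
import random

random.seed(1)
# 200 random stops in a unit square, standing in for airports/requests
stops = [(random.random(), random.random()) for _ in range(200)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbor_tour(points):
    # greedy heuristic: always fly to the closest unvisited stop
    tour, remaining = [points[0]], set(points[1:])
    while remaining:
        nxt = min(remaining, key=lambda p: dist(tour[-1], p))
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

# partition into a 2x2 grid of patches, solve each patch independently --
# the patches could in principle be solved simultaneously
patches = {}
for p in stops:
    patches.setdefault((p[0] > 0.5, p[1] > 0.5), []).append(p)

tours = [nearest_neighbor_tour(pts) for pts in patches.values()]
total = sum(dist(t[i], t[i + 1]) for t in tours for i in range(len(t) - 1))
print(f"{len(tours)} patch tours, total length {total:.2f}")
```

Each patch is tiny compared with the full problem, so solutions come back in milliseconds rather than hours — at the cost of never examining routes that cross patch boundaries.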

Ironically, the more they slaved over the problem, the less it seemed that throwing a perfect bull's-eye every time was the key to their salvation. The speed of their solutions was proving to be more crucial. If they could provide DayJet with a minute-to-minute snapshot of near-perfect solutions, the system could essentially run the company for them. DayJet would become faster--both in the air and operationally--than any of its competitors could ever hope to be.

With one team working on modeling demand and the other calculating baroque flight plans, Iacobucci and his engineers then concocted a third software system called the Virtual Operation Center. The VOC runs the company in silicon, feeding the phantom customers inside the ABM into the optimizer, which does its best to meet each of their demands with optimal efficiency and maximum gain. Seen on-screen, the VOC is a time-lapse photograph of DayJet's daily operations, also drawing upon maintenance and real-time weather information to produce a final data feed that factors in nearly every facet of the business. Iacobucci compares each run of the VOC with a game of baseball in which the ABM is continually pitching to the optimizer; DayJet has already played several thousand lifetimes' worth of seasons.
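
The pitcher-and-batter structure of the VOC reduces to a closed simulation loop: a demand model generates synthetic requests, a scheduler does its best to serve them, and the results feed the next round. The sketch below is hypothetical — every name and number is invented — but it shows the shape of such a loop.

```python
import random

random.seed(7)
AIRPORTS, PLANES, DAYS = 10, 12, 30

def synthetic_demand():
    # the agent-based model stand-in: each simulated day produces a random
    # batch of (origin, destination) trip requests
    n = random.randint(20, 60)
    return [(random.randrange(AIRPORTS), random.randrange(AIRPORTS))
            for _ in range(n)]

def greedy_schedule(requests, plane_positions):
    # the optimizer stand-in: serve a request with any plane already
    # waiting at the origin, moving that plane to the destination
    served = 0
    for origin, dest in requests:
        for i, pos in enumerate(plane_positions):
            if pos == origin:
                plane_positions[i] = dest
                served += 1
                break
    return served

positions = [random.randrange(AIRPORTS) for _ in range(PLANES)]
served = sum(greedy_schedule(synthetic_demand(), positions)
             for _ in range(DAYS))
print("served", served, "requests over", DAYS, "simulated days")
```

A real VOC would layer in maintenance windows, weather feeds, and revenue, but the principle is the same: run thousands of simulated seasons and watch where the system breaks.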

Armed with its real-time operating system, DayJet is pursuing a very different idea of optimality than, say, the airlines. With their decades of expertise in the dark arts of yield management, the airlines know exactly how to squeeze every last dollar out of their seats, which is indeed pretty optimal. But they also lack an effective plan B--let alone a plan C or D--in the event that the weather intervenes and schedules collapse. In fact, while, say, JetBlue (NASDAQ:JBLU) may now finally have a contingency plan or two, DayJet's business model is nothing but contingency plans.

Herriot offers another sports metaphor: "Total soccer," popularized by the Dutch in the 1970s, replaced brute-force attacks to the goal with continuous ball movement. "Moving straight to the goal is an excellent way to score, except for one slight problem--the other team," Herriot says. "They're a human version of Murphy's Law. In total soccer, you continually place the ball in a position with not the straightest but the greatest number of ways to reach the goal, the richest set of pathways."

"Each individual pathway may have a lower possibility of reaching the goal than a straight shot," Sawhill chimes in, "but the combinatorial multiplicity overwhelms the other team." The Dutch discovered that a better strategy was a series of good, seamlessly connected solutions rather than a single brittle one.

"The Dutch won a lot of games that way," Herriot adds. "It also created a different kind of player, a more agile, intelligent one. In some sense, we're teaching DayJet how to play total soccer."

In complexity lingo, a chart of all the pathways those Dutch teams exploited would be called a "fitness landscape," a sort of topographical map of every theoretical solution in which the best are visualized as peaks and the worst as deep valleys. "We're dealing with a problem where the problem specification itself is changing as you go along," Sawhill says. "You no longer want to find the best solution--you want to be living in a space of good solutions, so when the problem changes, you're still there." Fluidity is the greater goal than perfection.
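
Sawhill's preference for "a space of good solutions" over the single best one can be made concrete with a toy fitness landscape. In this illustrative sketch (my construction, not DayJet's), one candidate sits on a narrow spike and another on a broad mesa; scoring each point by the average fitness of its neighborhood — a proxy for surviving a change in the problem — flips the winner.

```python
import math

def fitness(x):
    # toy landscape: a narrow tall spike at x=2 and a broad lower mesa at x=7
    spike = 1.0 / (1.0 + 100 * (x - 2.0) ** 2)
    mesa = 0.8 * math.exp(-((x - 7.0) / 2.0) ** 2)
    return spike + mesa

def robust_fitness(x, radius=0.5, samples=11):
    # average fitness over a neighborhood of x: rewards plateaus, not spikes
    pts = [x - radius + 2 * radius * i / (samples - 1) for i in range(samples)]
    return sum(fitness(p) for p in pts) / samples

candidates = [i / 100 for i in range(1001)]  # 0.00 .. 10.00
best_raw = max(candidates, key=fitness)
best_robust = max(candidates, key=robust_fitness)
# the spike wins on raw fitness; the mesa wins once robustness counts
print(best_raw, best_robust)
```

When the problem shifts — a storm, a cancellation — the spike-dweller falls off a cliff, while the mesa-dweller is still standing on good ground.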

To that end, the company has been changing the problem inside its simulators every day for the past four and a half years, looking for those broad mesas of good solutions. And after a million or so spins of the VOC, DayJet has produced a clear vision of the total market and its likely place in it. Iacobucci expects to siphon off somewhere between 1% and 1.5% of all regional business trips within DayJet's markets by 2008, with "regional trips" defined as being between 100 and 500 miles. In the southeast states the company initially has its eye on, that's 500,000 to 750,000 trips a year, out of a total of 52 million, more than 80% of which are currently traversed by car. Yes, DayJet's life-or-death competition is Florida's SUV dealerships, not the airlines. DayJet may even help the airlines slightly: The model predicts some customers who fly DayJet one way will take a commercial flight back home.

The reams of data produced by the VOC have already coalesced into a thick sheaf of battle plans framing best- to worst-case scenarios. And having run the scenarios so relentlessly for so long, Iacobucci is now utterly sanguine about his prospects. When I ask over dinner for the dozenth time about DayJet's presumptive break-even number, he flat out admits there isn't one. "Within the realm of all realistic possibilities--at least 25% of our projected demand to 125% demand--we maintain profitability." Even at 25%? "Sure," Iacobucci replies, "it just takes longer, and takes more [airports], and the margin is much lower. But this isn't going to be what the venture capitalists call the 'walking dead.' If it's a hit, it's going to be a hit pretty quickly."

I'm not the only one who has trouble wrapping his head around the numbers, or lack thereof. Iacobucci tells the story of one analyst asked to crunch the numbers ahead of an investment. "He asked a direct question: 'All I want to know is, what formula do I put into this cell to tell me how you come up with a revenue number?'" Iacobucci says. "I told him, 'There ain't no formula to put in that cell! It can't be done! We'll sit you down with our modelers, who will explain the range of numbers we came up with, but they can't be encapsulated in a spreadsheet.'" The would-be investors passed.

Not everyone is so put out by the math involved. Esther Dyson, the veteran technologist and venture capitalist, now runs an annual conference called "Flight School," in which DayJet has played a starring role. "I have no doubt it will work," she says, referring to the software, "and I have no doubt they will spend time refining it and that there will be glitches here and there. But I do think Ed knows how to design very highly available systems"--a reference to his days building operating systems--"and that's exactly what they're doing."

Mike Brown, the former Microsoft CFO who did ante up and today sits on DayJet's board, is convinced that businesses big and small will increasingly turn to modeling as a way of developing--or troubleshooting--their business plans, mapping out strategies and market expectations that go far, far beyond spreadsheets and PowerPoint (NASDAQ:MSFT) decks. "We'll see more and more companies integrate modeling into the heart of their business. This is just like the Internet: One day no one had heard of it, the next day we were all using it."

Since Iacobucci sees himself as being in the operating-systems business, he has no intention of giving that system away. (He learned that lesson the hard way at IBM.) He doesn't want to build what he calls "horizontal" software that gets shared, e.g., Web 2.0 and Windows, the two great platforms for which every programmer in Silicon Valley seems to be writing widgets these days. Where everyone else in the business sees limitless opportunities in snap-together applications, Iacobucci sees a playing field so flat as to have no barriers to entry at all, and he doesn't like it.

According to Dyson, DayJet's competitors have so far pooh-poohed its software, assuming they'll be able to buy their own off the shelf at some point. Eclipse Aviation's Vern Raburn hopes Iacobucci might be persuaded to license his tools, because Raburn's own business model depends upon air taxis' taking off. Iacobucci says that isn't going to happen. "There's a shift away from building another platform toward building highly integrated, vertical, special-purpose, high-performance systems," he argues. Iacobucci envisions more companies like his own, in which the competitive advantage resides in custom-built, deeply proprietary, real-world operating systems that don't just streamline accounting, but become the central nervous systems of entirely new, scalable businesses. He's looking to build barriers to entry out of brainpower--so much of it that rivals can never catch up. ("It's like in Dr. Strangelove," Sawhill quips. "'Our German scientists are better than their German scientists.'")

Iacobucci points to Google (NASDAQ:GOOG) as an example of what a vertical system can accomplish. While everyone raves about Google's free services, the supercomputers in its data centers are invisibly tackling a variation on the traveling-salesman problem: How do you solve millions of searches in parallel at any given second? "When you get into mesh computing," the name for Google's technique, "that's what it's all about: managing the complexity," Iacobucci insists.

But no company has ever built a business model around complexity from the ground up--until DayJet. Thumbing his nose at the prevailing ethos in software circles of "the wisdom of crowds," let alone that "IT doesn't matter," Iacobucci has set out to first invent and then dominate a market he might have otherwise just sold software to. "When we built generic software at IBM and Citrix, the other side would always reverse-engineer it," he says. "The only thing the customer sees here is an incredible service. This is 'software as a service.'"



James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859

Tuesday, January 15, 2008 12:36:57 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton