Downtime: Amazon S3, SimpleDB, SQS, …

If you run a big service and claim to have never had downtime, you either 1) have close to zero customers or 2) are lying. It’s almost that simple.

There is considerable concern that Amazon’s AWS service was down for several hours:

· http://www.roughtype.com/archives/2008/02/amazons_s3_util.php

· http://gigaom.com/2008/02/15/amazon-s3-service-goes-down/

· http://www.centernetworks.com/amazon-s3-down-error

Thanks to Jeff Currier and Soumitra Sengupta, who told me about the downtime as it was happening last week. The service was reported to be down at 4:30AM. At 10:17, they reported it was resolved.

There are a couple of lessons in here. The first is that internal IT goes down, high scale services go down, client systems fail, networks stop operating, power failures happen, etc. That’s just the way it is. You can spend to reduce these factors, and you can try to take complete control of the IT infrastructure to keep them from impacting you. Ironically, in my experience, those that take over and run the entire infrastructure typically do it at lower scale with less experience, and they have downtime as well. These small scale services end up costing much more and yet deliver very little additional uptime. You read about commodity priced, high scale services when they go down. For example, RIM was down last week. But the good ones really don’t go down that frequently. High scale, commodity infrastructure is actually pretty solid and compares very well to vertical, control-all-aspects-of-the-IT-infrastructure approaches. Amazon AWS has generally earned a pretty good reliability record.

The second lesson is perhaps the hardest to learn and the most important: customers need information. If a service goes down – actually, I should say, when a service goes down – you need to tell customers what is happening and set expectations on service restoration right away. There is a temptation to hide the facts because, well, downtime is embarrassing. Hiding it simply doesn’t work. When people don’t know what is happening, they assume the worst and think you are trying to hide something or aren’t responding properly. Tell them what is happening, invest resources in keeping them up to date with progress, and tell them when you expect to be back up.

It’s hard, it’s embarrassing, but this one matters more than any other. Long after the downtime is forgotten, people will remember how you handled it. Transparency wins when it comes to service operation – customers who have decided to bet their jobs on your service need timely information for their customers. If you embarrass your customers, they remember forever. A little downtime is unfortunate, and you need to be getting better all the time, but that’s forgivable. Just get them the information they need for their dependent businesses.

–jrh

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh
