Yet another argument in favor of Degraded Operations Mode (http://mvdirona.com/jrh/perspectives/2008/01/22/DegradedOperationsMode.aspx) emerged last week. All of Amazon AWS (S3, SimpleDB, Simple Queuing Service, EC2, etc.) down for several hours last week: http://mvdirona.com/jrh/perspectives/2008/02/15/DowntimeAmazonS3SimpleDBSQS.aspx. The outage was reportedly due to a authentication storm: http://www.highscalability.com/s3-failed-because-authentication-overload (Mike Neil sent this my way).
Remember, you’ll never have the capacity for the biggest load inrush and, no matter how hard you try, your capacity planning will continue to only slightly better than the weather report for next week. When you don’t know what’s coming, design systems to operate through adversity: Degraded Operations Mode.
–jrh
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com