I’m a boater, and I view reading about boating accidents as important. The best source that I’ve come across is the UK’s Marine Accident Investigation Branch (MAIB). I’m also an engineer, and again, I view it as important to read about engineering failures and disasters. One of the best sources I know of is Peter G. Neumann’s RISKS Digest.
There is no question that firsthand experience is a powerful teacher, but few of us have time (or enough lives) to make every possible mistake. There are just too many ways to screw up. Clearly, it’s worth learning from others when trying to make our own systems safer or more reliable. Based on that belief, I’m an avid reader of service post mortems. I love understanding what went wrong, thinking about how those same issues could impact a service in which I’m involved, and working out what should be done to avoid the class of problems under discussion. Some of what I’ve learned about services over the years is written up in this best practices document: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf (originally published at USENIX LISA).
One post mortem I came across recently and enjoyed was “Information Regarding 2 July 2009 outage”. I liked it because there was enough detail to educate, and it presented many lessons. If you own or operate a service or mission-critical application, it’s worth a read.
–jrh
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | james@amazon.com
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
As educational as it is to get operational experience in managing disasters, you’re right: most of us would rather read about one than be in the middle of one.
I figured your reasons might include "Because they didn’t happen to me."