I’m a boater, and I view reading about boating accidents as important. The best source that I’ve come across is the UK’s Marine Accident Investigation Branch (MAIB). I’m also an engineer and, again, I view it as important to read about engineering failures and disasters. One of the best sources I know of is Peter G. Neumann’s RISKS Digest.
There is no question that firsthand experience is a powerful teacher, but few of us have time (or enough lives) to make every possible mistake. There are just too many ways to screw up. Clearly, it’s worth learning from others when trying to make our own systems safer or more reliable. On that belief, I’m an avid reader of service post mortems. I love understanding what went wrong, thinking about how those same issues could impact a service in which I’m involved, and working out what should be done to avoid the class of problems under discussion. Some of what I’ve learned around services over the years is written up in this best practices document: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf, originally published at USENIX LISA.
One post mortem I came across recently and enjoyed was “Information Regarding 2 July 2009 outage.” I liked it because it had enough detail to educate and it presented many lessons. If you own or operate a service or mission-critical application, it’s worth a read.
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | email@example.com
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.