This is a simple little article that’s worth reading. I don’t agree with every point made, but all 18 are interesting and every one of them leads to some introspection on how it compares with the situations I have come across over the years. It’s nice and concise, with an unusually good ratio of value to reading time.
Thanks to Rodolfo Velasco for pointing this article out. Somewhat related, some time back I wrote up some of what I’ve come across over the years on how to avoid systems failing:
Some nice corresponding presentations:
Both good talks on complex systems and how they fail. Thanks for posting them.
FYI, the link to the paper (web.mit.edu) now requires MIT authentication.
Links tend to be ephemeral. Ours are stable but not everyone feels committed to keeping all content live. Thanks for letting me know.
It seems to be available at https://how.complexsystems.fail/
Interesting read. I can’t help but think that failure-driven design, combined with a simple view of complex systems as energy distribution/discharge/management managed by capital-growth-driven strategies, would upend a lot of these points. I also think that “root cause” is not a useful term by itself because it is too reductionist, so I strongly agree with that point. Instead, it should be managed in terms of ‘story’ and ‘counter story’ to balance change vs. future risk.
I’m still partial to post mortems and feel that we often don’t learn enough from our mistakes. In fact, I like reading failure analyses from other fields as well, feeling that the more I understand failure, the better my chance of avoiding some. But I understand that any technique can be misapplied and no approach is perfect.
I advocate a failure acceptance posture to designing complex systems – that is, failure will be inevitable, so try to choose HOW your system’s failure will play out. I’d expect that broad understanding of post-mortem failure types would permit great skill in choosing failure modes?
You don’t always have full control over how your system fails, but the principle of failing safe is a common technique to achieve at least part of what you recommend above. When failing, fail to an operating mode that is safe. None of these approaches is perfect or can cover all failure modes, but I agree that it’s worth a lot of time and focus to influence how the system will fail and to choose less dangerous failure modes.
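As a rough illustration of failing to a safe operating mode, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for the example: the `HeaterController` class, its `read_temp` callback, and the `target` and `limit` parameters are assumptions, not anything from the article. The idea is simply that any sensor failure or implausible reading drops the controller into a mode where the heater stays off.

```python
import math
from enum import Enum
from typing import Callable


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"  # degraded but harmless: heater stays off


class HeaterController:
    """Hypothetical controller that fails to a safe mode (heater off)."""

    def __init__(self, read_temp: Callable[[], float], target: float, limit: float):
        self.read_temp = read_temp  # assumed sensor callback; may raise
        self.target = target        # temperature we try to reach
        self.limit = limit          # reading at or above this is treated as a fault
        self.mode = Mode.NORMAL

    def step(self) -> bool:
        """Return True if the heater should be on for this control cycle."""
        if self.mode is Mode.SAFE:
            # Once in safe mode, stay there until someone resets explicitly.
            return False
        try:
            temp = self.read_temp()
        except Exception:
            # Sensor failure: we no longer know the state, so choose the safe one.
            self.mode = Mode.SAFE
            return False
        if math.isnan(temp) or temp >= self.limit:
            # Implausible or dangerous reading: also fail safe.
            self.mode = Mode.SAFE
            return False
        return temp < self.target
```

The design choice worth noting is that the safe mode is sticky: the controller does not try to recover on its own, on the assumption that a human should look at why it tripped.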
Hey James, thanks for sharing the article!
I have read about hindsight bias in “Thinking, Fast and Slow” as well. How can organizations and teams protect against it, given that it may reduce team morale and lead to a pathological work environment? On the flip side, the book also talks about over-indexing on Bias for Action and Deliver Results, where risk takers are rewarded over safe practitioners. The work done by safe practitioners may go unnoticed because their efforts prevent the very operational events that would draw attention.
For new feature development, I can relate to the new failure paths that we add to a well-understood system. I think this could be attributed to increased entropy in the system and to a lack of operator training and operational tooling to reduce it. So far, I think reviews of various sorts (design, operational, etc.) and thorough testing and validation provide some safeguard against failures. However, low-fidelity test environments and time-to-market pressure may make it hard to predict underlying system behaviors that could lead to catastrophic failures. Again borrowing from the book, because the system is new, the area may lack experts who know the edge of the envelope, as they haven’t had enough repetitions of operating the new features of the system and understanding its behavior.
Which points do you disagree with and what are your thoughts on those?
You asked about risk takers being rewarded over safe practitioners. I’ve heard that mentioned by many in the past, but I’ve not seen it in the roles I have had. We really like stable systems and hate firefighting. At Amazon we actively discourage firefighting with the mantra “just roll back” and by working hard to ensure we always can roll back. Firefighting is a great skill but, generally, we prefer not to have fires. Firefighting probably is rewarded in some places, and I could see it happening, but it’s not been an issue where I work.
The second really rings true for me, and everyone discounts the risk brought by new features, new hardware, and new operational approaches. Nobody should let that slow them down, but it does underline the importance of not making changes that don’t materially improve the product, since every change of any sort has a cost. The best way to argue this point with data is to get the data on service reliability and note that it’s much better during Christmas (or other idle periods) when there are fewer deployments.
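One way to make that comparison concrete is sketched below in Python. It assumes you already have a per-day record of deployment and incident counts; the `Day` record and the function names are hypothetical and not tied to any particular tooling. It splits days into those with and without deployments and reports the average number of incidents per day for each group.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Day:
    """Hypothetical per-day record pulled from deployment and incident logs."""
    deployments: int
    incidents: int


def incident_rate(days: List[Day]) -> float:
    """Average incidents per day; 0.0 if there are no days in the group."""
    if not days:
        return 0.0
    return sum(d.incidents for d in days) / len(days)


def quiet_vs_busy(days: Iterable[Day]) -> Tuple[float, float]:
    """Return (rate on days with no deployments, rate on days with deployments)."""
    days = list(days)
    quiet = [d for d in days if d.deployments == 0]
    busy = [d for d in days if d.deployments > 0]
    return incident_rate(quiet), incident_rate(busy)
```

A simple split like this only shows correlation. Idle periods usually differ in traffic and staffing as well, so treat the result as a starting point for discussion rather than proof.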
Thanks for the response James, makes sense!
By “Which points do you disagree with and what are your thoughts on those?” I was referring to “I don’t agree with every point made but all 18 are interesting….” Apologies for the ambiguity :). I still have to go through the second linked paper; maybe it has more details.
Overall, it’s a good article. It makes a couple of points I don’t agree with, but I found it useful overall. Some points on which I don’t fully agree are:
7) Post-accident attribution to a ‘root cause’ is fundamentally wrong. On this one, I’ve seen lots of mistakes made, but it’s super important to know what actually happened, what the second fault was, and all the faults that led to the eventual failure of the system. Understanding failures in depth is a great way to avoid them. I make an effort to read about failures from many different domains as a way to learn to spot similar weaknesses in the systems I work on. Knowing exactly what went wrong, and every step from operating correctly to failing to operate correctly, is of critical importance. There often isn’t a single root cause but, however many there are, they need to be found. It’s amazing how many faults repeat because not enough was done about the problem the previous time.
Closely related, I’m not a big fan of 15 for much the same reason. It’s worth knowing every detail that led to an outage, and it’s OK if some of those errors are human errors. By far the most common cause of outages is human error, which is why, on all systems where I’m involved, we work hard to automate away the boring stuff that humans get wrong, reserving people for hard problems that software can’t reliably work around. No matter what is done, there are some issues that really do require humans. And it will be the case that some humans are trained better or manage the pressure better than others. There will be times when the fault really is a human fault and the solution really is taking a step to make that part of the chain stronger. I agree that many are too quick to blame people when working with complex systems. But, understanding that, I strongly believe that the way to make systems more reliable is to work on all aspects of those systems, including how human operators interact with them and the training and skills possessed by those human operators.
My belief is that you have to chase every nit, even apparently small ones that hardly had anything to do with the failure, all the way down to first principles, whether they be in hardware, software, or operator actions. Otherwise you risk the complex system decaying at an accelerating pace and losing availability over time.
I enjoyed watching Dr. Cook’s presentation/keynote on YouTube (Velocity conference). From there I learned about Jens Rasmussen’s “boundary of acceptable performance” model (available here: https://backend.orbit.dtu.dk/ws/portalfiles/portal/158016663/SAFESCI.PDF )
Thanks Yury. The author’s core point appears to be that fault understanding and modeling is best done by a cross-functional team, which seems reasonable and not particularly controversial.
Interesting read. Thanks for sharing, James. I felt like one of the major points being communicated is that success is mostly a result of experience and positive recalibration.
It’s true that few lessons have been learned without a cost. That’s one of the reasons I’m so interested in learning about system failures of all types from all domains. Sometimes we can learn from someone else’s mistakes so systems can get better without having to experience all failures directly.
thanks, James – will share this with my Failure class. hope you are staying well and healthy.
Hey Andrew. Glad you can use it. I’ll bet your class is a fun one.