Last night, Tom Klienpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan, in that the executive summary runs 86 pages in length. Overall, it’s an interesting document, but I only managed to read into the first page before starting to feel disappointed. What I was hoping for was a deep dive into why the reactors failed, the root causes of the failures, and what can be done to rectify them.
Because of the nature of my job, I’ve spent considerable time investigating hardware and software system failures, and what I find most difficult and really time consuming is getting to the real details. It’s easy to say there was a tsunami, it damaged the reactor complex, and loss of power caused radiation release. But why did loss of power cause radiation release? Why didn’t the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addressed. Good post mortems are detailed and get to the root cause, and it’s rare that a detailed investigation of any complex system doesn’t yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root cause both technical and operational, and making detailed recommendations.
On the second page of this report, the committee members were enumerated. The committee includes 1) a seismologist, 2) two medical doctors, 3) a chemist, 4) a journalist, 5) two lawyers, 6) a social system designer, 7) one politician, and 8) no nuclear scientists, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami were clearly the seed for the event but, since we can’t prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why both the reactor and nuclear material storage designs were not stable in the presence of cooling system failure. It’s weird that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically, we can’t stop earthquakes and tsunamis, so we need to ensure that systems remain safe in the presence of them.
Obviously the investigative team is very qualified to deal with the follow-on events: assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, it feels like the core problem is that cooling system flow was lost and both the reactors and nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.
Personally, were it my investigation, the largest part of my interest would be focused on achieving designs stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea but still less important than ensuring these reactors and others in the country fail into a safe state.
The report reads like a political document. It’s heavy on blame, light on root cause and the technical details of the root cause failure, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn’t go after the core issues that led to the nuclear release. From my perspective, the key issues are 1) scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full load operational state without external input of power, cooling water, or administrative input.
The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time. They can’t depend upon pumped water cooling and have to be 100% passive and stable for long periods without tending.
My third recommendation is arguably less important than my first two but applies to all systems: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …), detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.
The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:
Recommendation 1:
Monitoring of the nuclear regulatory body by the National Diet
A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:
1. To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.
2. To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.
3. To continue investigations on other relevant issues.
4. To make regular reports on their activities and the implementation of their recommendations.
Recommendation 2:
Reform the crisis management system
A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:
1. A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.
2. National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.
3. The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.
Recommendation 3:
Government responsibility for public health and welfare
Regarding the responsibility to protect public health, the following must be implemented as soon as possible:
1. A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.
2. Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.
3. The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options.
Recommendation 4:
Monitoring the operators
TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.
1. The government must set rules and disclose information regarding its relationship with the operators.
2. Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.
3. TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.
4. All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.
Recommendation 5:
Criteria for the new regulatory body
The new regulatory organization must adhere to the following conditions. It must be:
1. Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.
2. Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.
3. Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.
4. Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.
5. Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.
Recommendation 6:
Reforming laws related to nuclear energy
Laws concerning nuclear issues must be thoroughly reformed.
1. Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.
2. The roles for operators and all government agencies involved in emergency response activities must be clearly defined.
3. Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.
4. New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.
Recommendation 7:
Develop a system of independent investigation commissions
A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.
Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:
1. Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.
2. All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.
3. The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
4. Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.
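For the software systems this blog usually covers, the fourth point can be sketched as a small bookkeeping rule: a failure mode that has never been exercised, or was last exercised too long ago, is treated as not working. The following is only an illustrative sketch, not anything taken from the report. The class name `FailureModeRegistry`, the 30-day window, and the failure mode names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class FailureModeRegistry:
    """Tracks when each failure mode was last exercised successfully (hypothetical helper)."""

    max_age: timedelta
    last_passed: Dict[str, Optional[datetime]] = field(default_factory=dict)

    def register(self, mode: str) -> None:
        # A newly registered mode has never been tested, so it starts untrusted.
        self.last_passed.setdefault(mode, None)

    def record_pass(self, mode: str, when: datetime) -> None:
        # Record the time of the most recent successful test of this mode.
        self.last_passed[mode] = when

    def untrusted(self, now: datetime) -> List[str]:
        # Never-tested and stale modes are both assumed not to work.
        return sorted(
            mode
            for mode, passed in self.last_passed.items()
            if passed is None or now - passed > self.max_age
        )


if __name__ == "__main__":
    now = datetime(2012, 7, 10)
    registry = FailureModeRegistry(max_age=timedelta(days=30))  # window is an assumption
    registry.register("total_power_loss")
    registry.register("monitoring_outage")
    registry.record_pass("total_power_loss", now - timedelta(days=5))
    print(registry.untrusted(now))  # lists only the mode that was never tested
```

The design choice worth noting is the default: a mode is untrusted until a test proves otherwise, rather than trusted until a failure proves otherwise.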
The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at: http://naiic.go.jp/wp-content/uploads/2012/07/NAIIC_report_lo_res2.pdf.
Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
–jrh
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Vovan commented: "regarding your recommendation #4 – this is what guys at Chernobyl were doing on 26 April 1986 – testing failure modes"
This is a great point. There is a tension between a test that lacks fidelity and is not realistic on one end of the spectrum and, on the other, a very realistic test that puts the service at risk or negatively impacts the customer experience. Making the system live testable without service risk or customer impact requires skill and engineering investment. Unfortunately, as hard as it is to do well, I’m 100% convinced that if it’s not tested, it won’t work.
A common technique is to build the system with a small set of frequently exercised error handling paths and to engineer the system so those paths can be tested live. Typically some paths can’t be live tested without customer impact. For these, you need to either convince yourself that the customer impact during periodic testing is lower than the customer impact of a non-tested system, or go through a multi-stage process where real customer workload is first moved and then the system is tested under synthetic load.
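The multi-stage process described above might look roughly like the sketch below. Everything here is hypothetical: the function name `run_failure_test` and the four callbacks stand in for whatever a real service uses to inject faults, move customer workload, and generate synthetic load. One added assumption is that customers are only moved back if the test passes.

```python
from typing import Callable, List


def run_failure_test(
    live_testable: bool,
    inject_fault: Callable[[], bool],
    drain_customers: Callable[[], None],
    start_synthetic_load: Callable[[], None],
    restore_customers: Callable[[], None],
) -> List[str]:
    """Exercise one error-handling path and return the steps taken, in order."""
    steps: List[str] = []

    if not live_testable:
        # Stage 1: move real customer workload away from the system under test.
        drain_customers()
        steps.append("drained")
        # Stage 2: replace it with synthetic load so the test is still realistic.
        start_synthetic_load()
        steps.append("synthetic_load")

    # Inject the fault; the callback reports whether the system handled it.
    passed = inject_fault()
    steps.append("passed" if passed else "failed")

    # Assumption: only bring customers back onto a system that passed the test.
    if not live_testable and passed:
        restore_customers()
        steps.append("restored")

    return steps
```

Paths that can be exercised live skip the drain and synthetic load stages entirely, which is what makes them cheap enough to run frequently.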
–jrh
regarding your recommendation #4 – this is what guys at Chernobyl were doing on 26 April 1986 – testing failure modes
You are right that many systems, especially software systems, fail completely when faced with unlikely events. I’ve been in many debates/discussions where I’ve been arguing we need to protect against a certain failure mode and the counter argument has been "what are the odds that anyone will ever use it that way or that situation will ever arise?" On heavily used systems, I’ve learned over the years that these unlikely events are just about guaranteed to happen.
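The arithmetic behind "just about guaranteed" is simple if we assume, purely for illustration, that each operation independently hits the rare event with the same probability p. The chance of seeing it at least once in n operations is then 1 - (1 - p)^n. The function name and the numbers below are hypothetical.

```python
import math


def prob_at_least_once(p_per_operation: float, operations: int) -> float:
    """Probability that a rare event occurs at least once in n independent operations."""
    if not 0.0 <= p_per_operation <= 1.0:
        raise ValueError("p_per_operation must be between 0 and 1")
    if operations < 0:
        raise ValueError("operations must be non-negative")
    if p_per_operation == 1.0:
        return 1.0 if operations > 0 else 0.0
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(operations * math.log1p(-p_per_operation))


if __name__ == "__main__":
    # A one-in-a-million event over one million operations: roughly 63%.
    print(round(prob_at_least_once(1e-6, 1_000_000), 3))
    # The same event over ten million operations is close to certain.
    print(prob_at_least_once(1e-6, 10_000_000) > 0.9999)
```

Real failures are rarely independent, so this is a rough guide rather than a prediction, but it shows why "what are the odds?" is the wrong question on a heavily used system.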
<blockquote cite="jrh">
Without constant focus on operational excellence, testing of all mission critical systems and fail-safes, detailed monitoring, and frequent review of operational results, tests, and monitor outputs, long term success is just good luck.
</blockquote>
JRH you have absolutely hit the nail on the head there. Too many large scale systems have simply counted on good luck.
You could have also added "a design in place that meets the design objective" to your aphorism. One would think that went without saying if there weren’t so many real world counterexamples.
The approach in many of these reports is to push the auditing function to more senior levels. This makes some sense, but pushing all the way up to a political body like the Diet, which can’t stay focused on an issue once it’s no longer front of mind for the people, can’t work. I’m not sure what happened with the Foot and Mouth disease report, but audit by political bodies isn’t a good long term solution. Their attention will wander over time.
I think you are right, Mike. There need to be systems in place to find problems (what you called "gain knowledge") and systems in place to ensure it continues to be applied (ensure knowledge continues to have impact).
I work in a much simpler domain without most of the life safety implications, but the approach I take to getting reliable services usually revolves around 1) get a design in place or a plan with dates to have a design in place that meets the design objective. The plants at Fukushima would have to be at least heavily upgraded if not fully replaced to meet the fail-safe design objectives I’ve advocated (incredibly expensive and time consuming); 2) get good monitoring in place to ensure we have visibility into the health of the systems and whether they are meeting the design goals day-to-day and in test; 3) run frequent tests of all mission critical systems on the belief that an untested system won’t work; and 4) get good operational practices in place, training, and audit regularly in front of the senior leadership.
Without constant focus on operational excellence, testing of all mission critical systems and fail-safes, detailed monitoring, and frequent review of operational results, tests, and monitor outputs, long term success is just good luck.
–jrh
Brilliant discussion. I particularly like how you point out the different dimensions of such a report: technical content on root-cause, vs political statements. Ultimately, what counts is the impact factor.
In 2001, foot-and-mouth disease led to a real crisis in the UK, with tens of thousands of affected animals being burnt. One problem led to another. Apparently, in the sixties, there was a similar outbreak of the same disease. There were lots of learnings from how that outbreak was handled, summarized in an excellent report from some ministry or regulatory body.
My point is that this excellent report from the 60’s had little impact in 2001. Knowledge is irrelevant if it is not retrieved when it can make a difference.
I wonder whether it would be worthwhile to put together an anthology of disasters and how we learn from them (or not). The ambition would be to establish behaviours that ultimately contribute to
a) creating knowledge
b) creating an environment where knowledge can have the impact it deserves.
Thanks for the additional data, Dave. I hadn’t seen that the storage pools were over capacity at the time of the tsunami.
I understand the reactor design at Fukushima can’t meet the "stable under SCRAM without external active input" requirement. But it is one of the criteria I would stubbornly stick to. It can be met but wasn’t in the older General Electric reactors installed at Fukushima. Nonetheless, I’ve been around enough highly redundant systems to know that if you roll the dice frequently enough, you will occasionally get a bad outcome. The negative impact in this case just feels too high to allow operation without proving the system stable on SCRAM without any active input.
A tremendous amount of technical analysis of the Fukushima disaster was posted on the Union of Concerned Scientists web site. http://allthingsnuclear.org/tagged/Japan+nuclear/page/6 might be a good place to start if you are interested in reading it, and then move to the newer posts. There is other stuff on ucsusa.org as well.
Your proposed recommendations would be good, but the unfortunate fact is that the type of reactor used at Fukushima is inherently incapable of meeting those requirements, even more so when the amount of used fuel stored on site is much more than the storage pools were originally designed for. Perhaps the Japanese commission report was designed to obscure this embarrassing fact, since solving it would be impossibly expensive.
The basic technical issue with nuclear fission reactors is that they cannot be turned off. More precisely, once a fuel rod has been used for some time, stopping the chain reaction will diminish but not stop heat generation. The "exhaust" (fission products) is itself highly radioactive, so it generates heat. For the same reason it cannot be discharged to the environment like the exhaust of a fossil fuel generation plant, but must be retained in the fuel rod. A used fuel rod is a heat generator that cannot be turned off, cannot be approached by living things, remains in that state for several years, remains poisonous for thousands of years, and has a tendency to evolve hydrogen gas. Sounds like the opposite of safe!