Last night, Tom Klienpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan in that the executive summary runs 86 pages in length. Overall, It’s an interesting document but I only managed to read in to the first page before starting to feel disappointed. What I was hoping for is a deep dive into why the reactors failed, the root causes of the failures, and what can be done to rectify it.
Because of the nature of my job, I’ve spent considerable time investigating hardware and software system failures and what I find most difficult and really time consuming is getting to the real details. It’s easy to say there was a tsunami and it damaged the reactor complex and loss of power caused radiation release. But why did loss of power cause radiation release? Why didn’t the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes the time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addresses. Good post mortems are detailed, get to the root cause, and it’s rare that a detailed investigation of any complex system doesn’t yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root cause both technical and operational, and making detailed recommendations.
On the second page of this report, the committee members were enumerated. The committed includes 1) seismologist, 2) 2 medical doctors, 3) chemist, 4) journalist, 5) 2 lawyers, 6) social system designer, 7) one politician, and 8) no nuclear scientist, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami was clearly the seed for the event but since we can’t prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why the both the reactor and nuclear material storage design were not stable in the presence of cooling system failure. It's weird that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically we can’t stop earthquakes and tsunamis so we need to ensure that systems remain safe in the presence of them.
Obviously the investigative team is very qualified to deal with the follow-on events both in assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, it feels like the core problem is that cooling system flow was lost and the both the reactors and nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.
Personally, the largest part of my interest were it my investigation, would be focused on achieving designs stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea but still less important than ensuring these reactors and others in the country fail into a safe state.
The report reads like a political document. Its heavy on blame, light on root cause and the technical details of the root cause failure, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn't go after the core issues that lead to the nuclear release. From my perspective, the key issues are 1) scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full load operational state without external input of power, cooling water, or administrative input.
The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time. They can't depends upon pumped water cooling and have to 100% passive and stable for long periods without tending.
My third recommendation is arguably less important than my first two but applies to all systems: operators can’t figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …) , detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.
The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:
Monitoring of the nuclear regulatory body by the National Diet
A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:
1. To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.
2. To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.
3. To continue investigations on other relevant issues.
4. To make regular reports on their activities and the implementation of their recommendations.
Reform the crisis management system
A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:
1. A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.
2. National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.
3. The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.
Government responsibility for public health and welfare
Regarding the responsibility to protect public health, the following must be implemented as soon as possible:
1. A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.
2. Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.
3. The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options.
Monitoring the operators
TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.
1. The government must set rules and disclose information regarding its relationship with the operators.NAIIC 23
2. Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.
3. TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.
4. All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.
Criteria for the new regulatory body
The new regulatory organization must adhere to the following conditions. It must be:
1. Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.
2. Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.
3. Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.
4. Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.
5. Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.
Reforming laws related to nuclear energy
Laws concerning nuclear issues must be thoroughly reformed.
1. Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.
2. The roles for operators and all government agencies involved in emergency response activities must be clearly defined.
3. Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.
4. New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.
Develop a system of independent investigation commissions
A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.
Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:
1. Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.
2. All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.
3. The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.
4. Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.
The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at: http://naiic.go.jp/wp-content/uploads/2012/07/NAIIC_report_lo_res2.pdf.
Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com