Monday, July 09, 2012

Last night, Tom Kleinpeter sent me The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary. They must have hardy executives in Japan in that the executive summary runs 86 pages in length. Overall, it's an interesting document, but I only made it into the first page before starting to feel disappointed. What I was hoping for was a deep dive into why the reactors failed, the root causes of those failures, and what can be done to rectify them.


Because of the nature of my job, I've spent considerable time investigating hardware and software system failures, and what I find most difficult and most time consuming is getting to the real details. It's easy to say there was a tsunami, it damaged the reactor complex, and loss of power caused a radiation release. But why did loss of power cause a radiation release? Why didn't the backup power systems work? Why does the design depend upon the successful operation of backup power systems? Digging to the root cause takes time, requires that all assumptions be challenged, and invariably leads to many issues that need to be addressed. Good post mortems are detailed, get to the root cause, and it's rare that a detailed investigation of any complex system doesn't yield a long, detailed list of design and operational changes. The Rogers Commission on the Space Shuttle Challenger failure is perhaps the best example of digging deeply, finding root causes both technical and operational, and making detailed recommendations.


On the second page of the report, the committee members are enumerated. The committee includes a seismologist, two medical doctors, a chemist, a journalist, two lawyers, a social system designer, and a politician, but no nuclear scientists, no reactor designers, and no reactor operators. The earthquake and subsequent tsunami were clearly the seed for the event but, since we can't prevent these, I would argue that they should only play a contextual role in the post mortem. What we need to understand is exactly why both the reactor and the nuclear material storage designs were not stable in the presence of cooling system failure. It's strange that there were no experts in the subject area where the most dangerous technical problems were encountered. Basically, we can't stop earthquakes and tsunamis, so we need to ensure that systems remain safe in their presence.


Obviously the investigative team is well qualified to deal with the follow-on events: assessing radiation exposure risk, how the evacuation was carried out, and regulatory effectiveness. And it is clear these factors are all important. But still, the core problem is that cooling system flow was lost and both the reactors and the nuclear material storage ponds overheated. Using materials that, when overheated, release explosive hydrogen gas is a particularly important area of investigation.


Personally, were it my investigation, the largest part of my focus would be on achieving designs stable in the presence of failure. Failing that, getting really good at evacuation seems like a good idea, but it's still less important than ensuring that these reactors and others in the country fail into a safe state.


The report reads like a political document. It's heavy on blame, light on root cause and the technical details of the root-cause failures, and the recommended solution depends upon more regulatory oversight. The document focuses on more oversight by the Japanese Diet (a political body) and regulatory agencies but doesn't go after the core issues that led to the nuclear release. From my perspective, the first key issue is 1) scramming the reactor has to 100% stop the reaction, and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods. Good SCRAM systems automatically deploy and stop the nuclear reaction. This is common. What is uncommon is ensuring the system can successfully cool from a full-load operational state without external input of power, cooling water, or administrative input.


The second key point that this nuclear release drove home for me is 2) all nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and be able to stay stable and safe without active input or supervision for long periods of time. They can't depend upon pumped water cooling and have to be 100% passive and stable for long periods without tending.


My third recommendation is arguably less important than my first two but applies to all systems: operators can't figure out what is happening or take appropriate action without detailed visibility into the state of the system. The monitoring system needs to be independent (power, communications, sensors, …), detailed, and able to operate correctly with large parts of the system destroyed or inoperative.


My fourth recommendation is absolutely vital and I would never trust any critical system without this: test failure modes frequently. Shut down all power to the entire facility at full operational load and establish that temperatures fall rather than rise and no containment systems are negatively impacted. Shut off the monitoring system and ensure that the system continues to operate safely. Never trust any system in any mode that hasn’t been tested.
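Since the focus here is usually reliable hardware and software systems, the same "never trust an untested failure mode" principle translates directly to services: inject the failure deliberately and assert the system lands in a safe state. The sketch below is purely illustrative (all class and method names are hypothetical, not any real framework):

```python
# Hypothetical sketch of "test failure modes frequently" applied to a
# software service: deliberately kill the primary path, then verify the
# system degrades to a known-safe state with no operator input.

class Service:
    def __init__(self):
        self.primary_up = True
        self.state = "serving"

    def fail_primary(self):
        # Simulated fault injection: the primary power/compute path dies.
        self.primary_up = False
        self.on_failure()

    def on_failure(self):
        # The passive fallback must engage automatically -- the analogue of
        # cooling that works without external power or administrative input.
        self.state = "read-only"  # safe, degraded mode


def test_primary_loss_fails_safe():
    s = Service()
    s.fail_primary()
    # Verify the safe state directly rather than assuming it.
    assert s.state == "read-only"
    assert not s.primary_up


test_primary_loss_fails_safe()
```

The point is not the toy class but the habit: the failure path is exercised as routinely as the success path, so an untested fallback never gets trusted in production.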


The recommendations from the Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary follow:

Recommendation 1:

Monitoring of the nuclear regulatory body by the National Diet

A permanent committee to deal with issues regarding nuclear power must be established in the National Diet in order to supervise the regulators to secure the safety of the public. Its responsibilities should be:

1.       To conduct regular investigations and explanatory hearings of regulatory agencies, academics and stakeholders.

2.       To establish an advisory body, including independent experts with a global perspective, to keep the committee’s knowledge updated in its dealings with regulators.

3.       To continue investigations on other relevant issues.

4.       To make regular reports on their activities and the implementation of their recommendations.

Recommendation 2:

Reform the crisis management system

A fundamental reexamination of the crisis management system must be made. The boundaries dividing the responsibilities of the national and local governments and the operators must be made clear. This includes:

1.       A reexamination of the crisis management structure of the government. A structure must be established with a consolidated chain of command and the power to deal with emergency situations.

2.       National and local governments must bear responsibility for the response to off-site radiation release. They must act with public health and safety as the priority.

3.       The operator must assume responsibility for on-site accident response, including the halting of operations, and reactor cooling and containment.

Recommendation 3:

Government responsibility for public health and welfare

Regarding the responsibility to protect public health, the following must be implemented as soon as possible:

1.       A system must be established to deal with long-term public health effects, including stress-related illness. Medical diagnosis and treatment should be covered by state funding. Information should be disclosed with public health and safety as the priority, instead of government convenience. This information must be comprehensive, for use by individual residents to make informed decisions.

2.       Continued monitoring of hotspots and the spread of radioactive contamination must be undertaken to protect communities and the public. Measures to prevent any potential spread should also be implemented.

3.       The government must establish a detailed and transparent program of decontamination and relocation, as well as provide information so that all residents will be knowledgeable about their compensation options. 

Recommendation 4:

Monitoring the operators

TEPCO must undergo fundamental corporate changes, including strengthening its governance, working towards building an organizational culture which prioritizes safety, changing its stance on information disclosure, and establishing a system which prioritizes the site. In order to prevent the Federation of Electric Power Companies (FEPC) from being used as a route for negotiating with regulatory agencies, new relationships among the electric power companies must also be established—built on safety issues, mutual supervision and transparency.

1.       The government must set rules and disclose information regarding its relationship with the operators.

2.       Operators must construct a cross-monitoring system to maintain safety standards at the highest global levels.

3.       TEPCO must undergo dramatic corporate reform, including governance and risk management and information disclosure—with safety as the sole priority.

4.       All operators must accept an agency appointed by the National Diet as a monitoring authority of all aspects of their operations, including risk management, governance and safety standards, with rights to on-site investigations.

Recommendation 5:

Criteria for the new regulatory body

The new regulatory organization must adhere to the following conditions. It must be:

1.       Independent: The chain of command, responsible authority and work processes must be: (i) Independent from organizations promoted by the government (ii) Independent from the operators (iii) Independent from politics.

2.       Transparent: (i) The decision-making process should exclude the involvement of electric power operator stakeholders. (ii) Disclosure of the decision-making process to the National Diet is a must. (iii) The committee must keep minutes of all other negotiations and meetings with promotional organizations, operators and other political organizations and disclose them to the public. (iv) The National Diet shall make the final selection of the commissioners after receiving third-party advice.

3.       Professional: (i) The personnel must meet global standards. Exchange programs with overseas regulatory bodies must be promoted, and interaction and exchange of human resources must be increased. (ii) An advisory organization including knowledgeable personnel must be established. (iii) The no-return rule should be applied without exception.

4.       Consolidated: The functions of the organizations, especially emergency communications, decision-making and control, should be consolidated.

5.       Proactive: The organizations should keep up with the latest knowledge and technology, and undergo continuous reform activities under the supervision of the Diet.

Recommendation 6:

Reforming laws related to nuclear energy

Laws concerning nuclear issues must be thoroughly reformed.

1.       Existing laws should be consolidated and rewritten in order to meet global standards of safety, public health and welfare.

2.       The roles for operators and all government agencies involved in emergency response activities must be clearly defined.

3.       Regular monitoring and updates must be implemented, in order to maintain the highest standards and the highest technological levels of the international nuclear community.

4.       New rules must be created that oversee the backfit operations of old reactors, and set criteria to determine whether reactors should be decommissioned.

Recommendation 7:

Develop a system of independent investigation commissions

A system for appointing independent investigation committees, including experts largely from the private sector, must be developed to deal with unresolved issues, including, but not limited to, the decommissioning process of reactors, dealing with spent fuel issues, limiting accident effects and decontamination.


Many of the report recommendations are useful but they fall short of addressing the root cause. Here’s what I would like to see:

1.       Scramming the reactor has to 100% stop the reaction and the passive cooling has to be sufficient to ensure the system can cool from full operating load without external power, operational oversight, or other input beyond dropping the rods.

2.       All nuclear material storage areas must be seismically stable, above flood water height, maintain integrity through natural disasters, and must be able to stay stable and safe without active input or supervision for long periods of time.

3.       The monitoring system needs to be independent, detailed, and able to operate correctly with large parts of the system destroyed or inoperative.

4.       Test all failure modes frequently. Assume that all systems that haven’t been tested will not work. Surprisingly frequently, they don’t.



The Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive Summary can be found at:


Since our focus here is primarily on building reliable hardware and software systems, this best practices document may be of interest: Designing & Deploying Internet-Scale Services:




James Hamilton 


Monday, July 09, 2012 6:39:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Hardware | Process | Services
 Wednesday, May 07, 2008

Some time back I got a question on what I look for when hiring a Program Manager from the leader of a 5 to 10 person startup.  I make no promise that what I look for is typical of what others look for – it almost certainly is not.  However, when I'm leading an engineering team and interviewing for a Program Manager role, these are the attributes I look for.  My response to the original query is below:


The good news is that you're the CEO, not me.  But, were our roles reversed, I would be asking you why you think you need a PM at this point.  A PM is responsible for making things work across groups and teams.  Essentially they are the grease that helps a big company ship products that work together and get them delivered through a complicated web of dependencies.  Does a single-product startup in the pre-beta phase actually need a PM?  Given my choice, I would always go with one more great developer at this phase of the company's life and have the developers take more design ownership, spend more time with customers, etc.  I love the "many hats" model and it's one of the advantages of a start-up. With a bunch of smart engineers wearing as many hats as needed, you can go with less overhead and fewer fixed roles, and operate more efficiently. The PM role is super important but it's not the first role I would staff in an early-stage startup.


But you were asking for what I look for in a PM rather than advice on whether you should fill the role at this point in the company's life.  I don't believe in non-technical PMs, so what I look for in a PM is similar to what I look for in a developer.  I'm slightly more willing to put up with somewhat rusty code in a PM, but that's not a huge difference.  With a developer, I'm more willing to put up with minor skill deficits in certain areas if they are excellent at writing code.  For example, a talented developer who isn't comfortable with public speaking, or may only be barely comfortable in group meetings, can be fine. I'll never do anything to screw up team chemistry or bring in a prima donna but, with an excellent developer, I'm more willing to look at greatness around systems building and be OK with some other skills simply not being there, as long as their absence doesn't hurt the team chemistry overall.  With a PM, those skills need to be there and it just won't work without them.


It's mandatory that PMs not get "stuck in the weeds". They need to be able to look at the big picture and yet, at the same time, understand the details, even if they aren't necessarily writing the code that implements the details.  A PM is one of the folks on the team responsible for the product hanging together and having conceptual integrity.  They are one of the folks responsible for staying realistic and not letting the project scope grow and release dates slip. They are one of the team members who need to think customer first, to really know who the product is targeting, to keep the project focused on that target, and to get the product shipped.


So, in summary: what I look for in a PM is similar to what I look for in a developer, but I'll tolerate their coding possibly being a bit rusty. I expect they will have development experience. I'm pretty strongly against hiring a PM straight out of university -- a PM needs experience in a direct engineering role first to be effective in the PM role. I expect PMs to put the customer first, understand how a project comes together, keep it focused on the right customer set, not let feature creep set in, and have the skill, knowledge, and experience to know when a schedule is based upon reality and when it's more of a dream.  Essentially I have all the expectations of a PM that I have of a senior developer, except that I need them to have a broad view of how the project comes together as a whole, in addition to knowing many of the details. They must be more customer focused, have a deeper view of the overall project schedule and how the project will come together, be a good communicator, perhaps a less sharp coder, but have excellent design skills. Finally, they must be good at getting a team making decisions, moving on to the next problem, and feeling good about it.






Wednesday, May 07, 2008 4:40:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Saturday, April 12, 2008

The only thing worse than no backups is restoring bad backups. A database guy should get these things right.  But, I didn’t, and earlier today I made some major site-wide changes and, as a side effect, this blog was restored to December 4th, 2007.  I’m working on recovering the content and will come up with something over the next 24 hours. However it’s very likely that comments between Dec 4th and earlier today will be lost.  My apologies.


Update 2008.04.13: I was able to restore all content other than comments between 12/4/2007 and yesterday morning.  All else is fine.  I'm sorry about the RSS noise during the restore and for the lost comments.  The backup/restore procedure problem is resolved.  Please report any broken links or lingering issues. Thanks,





James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | blog:



Saturday, April 12, 2008 11:16:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware | Process | Ramblings | Services | Software
 Saturday, December 22, 2007

I’m online over the holidays but everyone’s so busy there isn’t much point in blogging during this period.  More important things dominate so I won’t be posting until early January. 


Have a great holiday.





Saturday, December 22, 2007 8:59:11 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware | Process | Ramblings | Services
 Monday, November 26, 2007

There are few things we do more important than interviewing and yet it's done very unevenly across the company.  Some are amazing interviewers and others, well, I guess they write good code or something else :-).


Fortunately, interviewing can be learned and, whatever you do and wherever you do it, interviewing with insight pays off.  Some time back, a substantial team was merged with the development group I lead and, as part of the merger, we gained a bunch of senior folks, many of whom do As Appropriate (AA) interviews and all of whom were frequent contributors on our interview loops.  The best way to get in sync on interviewing techniques and leveling is to talk about it, so I brought us together several times to talk about interviewing, to learn from each other, and to set some standards on how we're going to run our loops.  In preparation for those meetings, I wrote up some notes on what I view as best practices for AAs, but these apply to all interviewers and I typically send them out whenever I join a team.


Some of these are specific to Microsoft but many apply much more broadly.  There is some internal Microsoft jargon used in the doc.  For example, at Microsoft, AA is short for "As Appropriate" and the AA is the final decision maker on whether an offer will be made. However, most of what is here is company invariant.


The doc: JamesRH_AA_Interview_NotesX.doc (37 KB).


I also pulled some of the key points into a ppt: AA Interview NotesX.ppt (156.5 KB).






Monday, November 26, 2007 5:44:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, November 06, 2007

I wrote this back in March of 2003 when I led the SQL Server WebData team, but its applicability extends beyond that team. What's below is a set of Professional Engineering principles that I've built up over the years. Many of the concepts are incredibly simple and most are easy to implement, but it's a rare team that does them all.


More important than the specific set of rules I outline below is to periodically stop, think in detail about what’s going well and what isn’t; think about what you want to personally do differently and what you would like to help your team do differently. I don’t do this as often as I should – we’re all busy with deadlines looming – but, each time I do, I get something significant out of it.


The latest word document is stored at: and the current version is inline below.  Send your debates and suggestions my way.






Professional Engineering

James Hamilton, 2003.03.13

Update: 2007.03.09


·          Security and data integrity: The data business is about storing, managing, querying, transforming, and analyzing customer data and, without data integrity and security, we would have nothing of value to customers.  No feature matters more than data integrity and security, and failures along either of these two dimensions are considered the most serious. Our reputation with our customers and within our company depends upon us being successful by this measure, and our ability to help grow this business is completely dependent upon it.

·          Code ownership: Every line of code in the system should have a non-ambiguous owner and there should be a back-up within that department. Generally, each team should be organized roughly along architectural boundaries, avoiding components spread over multiple teams. There should be no files "jointly" owned by multiple teams. If necessary, split files to get non-ambiguous ownership.

·          Design excellence: Utilize the collective expertise of the entire team and, where appropriate, experience from other teams. Design for the long haul. Think through cross-component and inter-feature interactions before proceeding. Never accept quick hacks that we can't maintain over the long term and don't rush something out if it can't be made to work. A good general rule: "never promise what you don't know how to do." Of course, all designs must be peer reviewed.

·          Peer review: Peer review is one of the most important aspects of the engineering process and it's through code and design review that we get more IQ involved in all important and long-lasting decisions. All designs and all code changes must be reviewed before being checked in. Make sure your reviewer has the understanding of the code needed to do a good job and don't accept rushed or sloppy reviews. Reviewing is a responsibility and those that do an excellent job deserve, and will get, special credit. All teams should know who their best reviewers are and should go to them when risk or complexity is higher than normal. When there are problems, I will always ask who reviewed the work. Reviewing is a responsibility we all need to take seriously.

·          Personal integrity: It's impossible for a team to function and be healthy if we're not honest with each other and especially with ourselves. We should be learning from our mistakes and improving constantly and this simply isn't possible unless we admit our failures. If a commitment is made, it should be taken seriously and delivered upon.  And, when things go wrong, we need to be open about it.

·          Engineering process clarity: The engineering process, including Development, PM, and Test, should be simple and documented in sufficient detail that a new team member can join the team and rapidly be effective. It should be maintained and kept up-to-date and, when we decide to do something differently, it should be documented here.

·          Follow-through and commitment to complete: In engineering, the first 10% of the job is often the most interesting while the last 10% is the most important. Professional engineering is about getting the job done. Completely.

·          Schedule integrity: Schedules on big teams are often looked at as "guidance" rather than commitments. Schedules, especially those released externally, must be taken seriously and, as a consequence, external commitments need to be more carefully buffered than commitments made within your team. One of the best measures of engineering talent is how early scheduling problems are detected, admitted, and corrected. Ensure that there is sufficient time for "fit and finish" work. Ensure that the spec is solid early. Complete tests in parallel. Don't declare a feature to be done until at least 70% of the planned functional tests are passing (a SQL Server-specific metric that I believe was originally suggested by Peter Spiro) and the code is checked in. Partner with dependent components for early private testing. When a feature is declared done, there should be very few bugs found subsequently, and none of these should be obvious.

·          Code base quality: Code owners are expected to have a multiple-release plan for where the component is going. Component owners need to understand competitors and current customer requirements, and are expected to know where the current implementation is weak and have a plan to improve it over the next release or so.  Code naturally degrades over time and, without a focus on improvement, it becomes difficult to maintain. We need to invest 15 to 20% of our overall team resources in code hygiene. It's just part of being invested in winning over multiple releases. We can't afford to get slowed or wiped out by compounding code entropy as Sybase was.

·          Contributing to and mentoring others: All members of the team bring different skills and all of us have an obligation to help others grow. Leads and more experienced members of the team should be helping other team members grow and gain good engineering habits. All team members have a responsibility to help others get better at their craft, and part of doing well in this organization is helping the team as a whole become stronger. Each of us has unique skills and experiences -- look for ways to contribute and mentor other members of the team.

·          QFEs: QFEs must be on time and of top quality. QFEs are one of the few direct contact points we have with paying customers and we take them very seriously, prioritizing QFEs above all other commitments. Generally, we put paying customers first. When a Pri-1 QFE comes in, drop everything and, if necessary, get help. When a Pri-2 or Pri-3 comes in, start within the next one or two days at worst. Think hard about QFEs -- don't just assume that what is requested represents what the customer needs, nor that the solution proposed is the right one. We intend to find a solution for the customer but we must choose a fix that we can continue to support over multiple releases. Private QFEs are very dangerous and I'm generally not in support of them. Almost invariably they lead to errors or regressions in a future SP or release. The quality of QFEs can make or break a customer relationship, and regressions in a "fix" absolutely destroy customer confidence.

·          Shipped quality: This one is particularly tough to measure but it revolves around a class of decision that we have to make every day when we get close to a shipment: did we allow ourselves enough time to fix bugs that will have customer impact, or were we failing and madly triaging serious bugs into the next release, trying to convince ourselves that each bug "wasn't very likely"? (When I spend time with customers, I'm constantly amazed at what they actually do in their shops -- just about everything is likely across a sufficiently broad customer base.) And there's the flip side: did we fix bugs close to a release that destabilized the product or otherwise hurt customer satisfaction? On one side, triaging too much; on the other, not enough. The only good way out of the squeeze is to always think of the customer when making the decision and to make sure that you always have enough time to do the right thing.

·          Check-in quality: The overall quality of the source tree impacts the effectiveness and efficiency of all team members. Check-in test suites must be maintained, new features should get check-in test suite coverage, and the suites must run prior to checking in. To be effective, check-in test suites can't run much longer than 20 to 40 minutes so, typically, additional tests are required. Two approaches I've seen work in the past: 1) gauntlet/snap pre-checkin automation, or 2) autobuilder post-checkin testing.
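A pre-checkin gate of the sort described above can be sketched in a few lines: run the check-in suite, and refuse the checkin if the suite fails or blows past the time budget. This is a minimal illustration with hypothetical names, not any particular team's gauntlet/snap tooling:

```python
# Minimal sketch of a pre-checkin gate: the checkin is allowed only if the
# check-in test suite passes within the budget. Names are illustrative.
import subprocess
import sys
import time

BUDGET_SECONDS = 40 * 60  # upper end of the 20-40 minute guideline above


def checkin_gate(test_cmd):
    """Run the check-in suite; return (passed, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = subprocess.run(test_cmd, timeout=BUDGET_SECONDS)
    except subprocess.TimeoutExpired:
        # Suite ran past the budget: treat as a gate failure.
        return False, BUDGET_SECONDS
    return result.returncode == 0, time.monotonic() - start


# Stand-in for a real test suite: a trivial command that passes.
ok, elapsed = checkin_gate([sys.executable, "-c", "print('checkin tests pass')"])
assert ok and elapsed < BUDGET_SECONDS
```

Keeping the suite inside the budget is what makes developers actually run it on every checkin; anything slower gets pushed to post-checkin autobuilder runs.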

·          Bug limits: Large bug counts hide schedule slippage, the bug count represents a liability that must be paid before shipping, and large bug counts introduce a prodigious administrative cost. Each milestone, leads need to triage bugs, and this consumes the resources of productive members of the team who could be moving the product forward rather than taking care of the bug base. We will set limits for the max number of bugs carried by each team. Limits that I've used and found useful in the past are: each team limits active defects to less than 3 times the number of engineers on the team, and no engineer should carry more than 5 active defects.

·          Responsibility: Never blame other teams or others on your team for failures. If your feature isn't coming together correctly, it's up to you to fix it. I never want to hear that Test didn't test a feature sufficiently, the spec was sloppy, or the developer wasn't any good. If you own a feature, whether you work in Test, Dev, or PM, then you are responsible for the feature being done well and delivered on time. You own the problem. If something is going wrong in some other part of the team and that problem may prevent success for the feature, find a solution or involve your lead and/or their manager. "Not my department" is not an option.

·          Learn from the past: When work is complete or results come in, consider as a team what can be learned from these results. Post mortems are a key component of healthy engineering. Learn to broadly apply techniques that work well and take quick action when we get results back that don't meet our expectations.

·          Challenge without failure: A healthy team should be giving all team members new challenges and pushing the limits for everyone. However, to make this work, you have to know when you are beyond your limits and get help before a problem is no longer solvable. Basically, everyone should step up to the plate but, before taking the last strike, get your lead involved. If that doesn't work, get their manager involved. Keep applying this rule until success is found or the goal no longer appears worth achieving.

·          Wear as many hats as needed: In startups, everyone on the team does whatever is necessary for the team to be successful and, unfortunately, this attribute is sometimes lost on larger, more mature teams. If testing is behind, become a tester. If the specs aren’t getting written, start writing. Generally, development can always out-pace test and sometimes can run faster than specs can be written. So self-regulate by not allowing development to run more than a couple of weeks ahead of test (don’t check in until 70% of the planned tests are passing) and, if work needs to be done, don’t wait – just jump in and help regardless of which discipline is in short supply.

·          Treat other team members with respect: No team member is so smart as to be above treating others on the team with respect. But do your homework before asking for help – show respect for the time of the person whose help you are seeking.

·          Represent your team professionally: When other teams ask questions, send notes, or leave phone messages ensure that they get quality answers. It’s very inefficient to have to call a team three times to get an answer and it doesn’t inspire confidence nor help teams work better together. Take representing your team seriously and don’t allow your email quotas to be hit or phone messages to go unanswered.

·          Customer Focus: Understand how customers are going to use your feature. Ensure that it works in all scenarios, with all data types, and supports all operating modes. Avoid half-done features. For example, don’t add features to Windows that won’t run over Terminal Server and don’t add features to SQL Server that don’t support all data types. Think about how a customer is going to use the feature and don’t take the easy way out and add a special UI for this feature only. If it’s administrative functionality, ensure that it is fully integrated into the admin UI and has API access consistent with the rest of the product. Avoid investing in a feature but not in how a customer uses the feature. For example, in SQL Server there is a temptation to expose new features as yet another stored procedure rather than adding full DDL and integrating them into the management interface.

·         Code Serviceability & Self Test: All code should extensively self-check. Rather than simple asserts, a central product- or service-wide component should handle error recording and reporting. On failure, this component is called. Key internal structures are saved to disk along with a mini-dump and stack trace. This state forms the core of the Watson return data, and the central component is responsible for sending data back (if enabled). Whether or not Watson reporting is enabled, the last N failures should be maintained on disk for analysis. There are two primary goals: 1) errors are detected early, before persistent state is damaged, and 2) sufficient state is written to disk that problem determination is possible on the saved state alone and no repro is required. SQL Server helped force this during the development of SQL Server 2005 by insisting that all failures during system test yield either 1) a fix based upon the stored failure data, or 2) a bug opened against the central bug tracking agent to record more state data so this class of issues can be more fully understood if it happens subsequently. If a customer calls service, the state of the last failure is recorded and can be easily sent in without asking the customer to step through error-prone data acquisition steps and without asking for a repro.

·         Direct Customer feedback: Feedback directed systems like Watson and SQM are amazingly powerful and are strongly recommended for all products.

·         Ship often and incrementally: Products that ship frequently stay in touch with their customers and respond more quickly to changes in the market and in competitive offerings. Shipping infrequently tends to encourage bad behavior in the engineering community where partly done features are jammed in and V1 ends up being a good-quality beta test rather than a product ready for production. Infrastructure and systems should ship every 18 months, applications at least every 12 months, and services every three months.

·         Keep asking why and polish everything: It’s easy to get cynical when you see things going wrong around you and, although I’ve worked on some very fine teams, I’ve never seen a product or organization that didn’t need to improve. Push for change across the board. Find a way to improve all aspects of your product and don’t accept mediocrity anywhere. Fit and finish come only when craftsmen across the team care about the entire product as a whole. Look at everything and help improve everywhere. Don’t spend weeks polishing your feature and then fail to read the customer documentation carefully and critically. Use the UI and API even if you didn’t write them and spend time thinking about how they or your feature could be presented better or more clearly to customers. Never say “not my department” or “not my component” … always polish everything you come near.


ProfessionalEngineering.docx (19.79 KB)
Tuesday, November 06, 2007 4:55:16 PM (Pacific Standard Time, UTC-08:00)

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton