
I’m an avid reader of engineering disasters since one of my primary roles in my day job is to avoid them. And, away from work, we are taking a small boat around the world with only two people on board, and that too needs to be done with care since an engineering or operational mistake could conceivably be terminal. In Why I enjoy reading about Engineering Accidents, Failures, and Disasters I talk about some of the advantages of reading about and learning from disasters across all domains. Past topics have included Studying the Costa Concordia Grounding, What Went Wrong at Fukushima Dai-1, Official Report of the Fukushima Nuclear Accident Independent Investigation Commission Executive, The Power Failure Seen Around the World, and an operational mistake from my personal experience, 69.1 degrees.

Almost all engineering or operational disasters have a large people-related component even if the system is largely automated. People, and less than ideal decisions made by people, are at the core of many of these complex system failures. Knowing this, I’m always looking to learn more about what can cause bad operational practices to set in over time, and how leadership can set the right values and audit sufficiently closely to ensure the right thing happens every day even when there have been no problems for years. How do we keep the team vigilant and actively avoiding mistakes?

With this backdrop, the Volkswagen emission fiasco is a highly interesting example. In this situation it appears that Volkswagen intentionally configured their emissions control system to special case the emissions tests, so that the cars pass while being tested even though the system does not attain the standard in normal driving.

It’s easy to see why a company might find this tempting.  Emissions requirements force tough engineering compromises. Hitting them can reduce engine drivability and engine power, and can both consume more fuel and substantially increase costs. To be sure, many emissions requirements have actually improved engine efficiency, drivability, and power output but, as with all things in engineering, the more a single dimension is optimized, the harder it is to not give up ground in other dimensions.  Modern emissions standards bring compromises and put high pressure on the engineering teams.

For those not fully up to speed on the VW Emissions Fiasco, the summary from Volkswagen Emissions Scandal is worth including here:

On 18 September 2015, the United States Environmental Protection Agency (EPA) issued a notice of violation of the Clean Air Act to German automaker Volkswagen Group, after it was found that the car maker had intentionally programmed turbocharged direct injection (TDI) diesel engines to activate certain emissions controls only during laboratory emissions testing. The programming caused the vehicles’ nitrogen oxide (NOx) output to meet US standards during regulatory testing, but emit up to 40 times more NOx in real-world driving.[10] Volkswagen put this programming in about eleven million cars worldwide, and in 500,000 in the United States, during model years 2009 through 2015

Hitting automotive emissions requirements today is a very difficult engineering challenge and almost always forces tough compromises. It follows then that it could be a real competitive advantage to not meet the emission requirements. Whenever the challenges are great and the competitive pressures high, the temptation to do the wrong thing can be very large. This temptation is somewhat curtailed by the knowledge that there will be independent government tests that must be passed. But what about special casing for those tests and still not meeting the emissions standards the rest of the time?

When there are many millions of dollars on the line, company leadership can be very tempted and individual engineers can feel enormous pressure to succeed, even to the point of saving their job by cheating. What usually prevents cheating is that big engineering projects, even though highly secretive, still involve tens or even hundreds of engineers. A single engineer simply can’t put a conditional test into the code that says, for example, if the OBD-II connection is live (it will be during emissions tests) or if the car is not moving when under acceleration (on a chassis dynamometer), then comply with the emission standards but otherwise don’t. There are few secrets on large engineering projects. If company leadership asks the team to cheat, everyone on the team knows it. If a single engineer puts in a code change to do something like what I outlined above, it’ll need to be reviewed by other engineers and it’s close to impossible that nobody will notice the illegal code. There is a good chance that any substantial engineering project will have at least one person who feels personally committed to doing the right thing on emissions and not cheating their customers. Hopefully a lot more than one, but one is all it takes. On large teams, and most automotive engineering projects are quite large, it’s almost impossible that there will not be at least one honest or environmentally committed engineer.
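To make the point concrete, the kind of blatant conditional described above might look like the following minimal Python sketch. Every name in it (the function names, the sensor inputs, the two mode strings) is hypothetical and invented for illustration; none of it is taken from real ECU firmware. The only point is that a check this explicit would be very hard to get past a code review.

```python
def looks_like_emissions_test(obd2_connected: bool,
                              driven_wheel_kph: float,
                              undriven_wheel_kph: float) -> bool:
    """Hypothetical, deliberately obvious emissions-test detector."""
    # On a chassis dynamometer the driven wheels spin while the car stays put.
    # That is approximated here by the undriven wheels reading zero (an assumption).
    on_dynamometer = driven_wheel_kph > 0.0 and undriven_wheel_kph == 0.0
    return obd2_connected or on_dynamometer


def select_emissions_mode(obd2_connected: bool,
                          driven_wheel_kph: float,
                          undriven_wheel_kph: float) -> str:
    """Return the (hypothetical) mode such a cheat would choose."""
    if looks_like_emissions_test(obd2_connected, driven_wheel_kph, undriven_wheel_kph):
        return "compliant"
    return "non_compliant"
```

Any reviewer reading `select_emissions_mode` would immediately ask why compliance depends on whether a diagnostic tool is plugged in.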

This event really caught my interest. How did VW intentionally implement a non-compliant emissions system without it being reported or detected for years? It just seems impossible that someone on the team would not at least anonymously report it. But, since nobody did, we all need to understand what went wrong and find ways to avoid similar failures on our own engineering projects.

Lawsuits and other legal action, both civil and criminal, make it very difficult to get information and learn from this event. Volkswagen as a company is facing an estimated $18B liability (Putting a Price on Volkswagen’s Emission-Fraud Mess) so it’s particularly difficult to get data on the event. If VW management asked the team to cheat, how did they keep the knowledge of this so tightly controlled? If the decision was made by a rogue engineer under enormous pressure to hit the emissions standards without giving up cost, drivability, fuel economy, or power, then how did the changes go into the firmware without being broadly seen by other engineers on the team? I’ve still not found a definitive answer for any of those questions but did find what appears to be a very credible explanation of exactly what happened at VW. Understanding what was done gives us some clues into how this escaped broad notice by the engineering team and why nobody reported it publicly.

In Inside the Volkswagen Emissions Cheating, Jake Edge reports on a talk given at the 32nd Chaos Communication Congress (32C3). From Edge’s posting:

The 32nd Chaos Communication Congress (32C3) held at the end of December, Daniel Lange and Felix Domke gave a detailed look into the Volkswagen emissions scandal—from the technical side. Lange gave an overview of the industry, the testing regime, and the regulatory side in the first half, while Domke presented the results of his reverse-engineering effort on the code in the engine electronic control unit (ECU), as well as tests he ran on his own affected VW car. The presentation and accompanying slides [PDF] provide far more detail than has previously been available.

One of the authors of the presentation, Lange, is a security researcher. These are folks that crack software and hardware systems looking for security weaknesses. Some of these problems are reported to the company that produced the system and then fixed, which is a service to the industry. Some are sold to the companies involved, which isn’t a business model I particularly like but it arguably also contributes to the industry. Some of these security flaws are sold on the open market and get used illegally. Again, this fringe aspect of the security research community is not my favorite but, whether or not we like all the business models, it’s still very important to stay current with the security research community if you work in the commercial hardware, software, or services world.

I really like this application of security research to understand what was actually done when a company isn’t being forthcoming, due to legal complications, on what went wrong and exactly what happened. What Lange and Domke found is super interesting and is the best source I have come across so far on what actually happened at VW. What these researchers found involved a component of the emission control system that injects controlled amounts of urea and water into the exhaust. This is used by modern Selective Catalytic Reduction (SCR) diesel engines to control nitrogen oxide (NOx) emissions. But, like many things in the control system world, choosing the right amount to inject can be difficult. Insufficient urea injection levels will allow excessive NOx emissions, which would fail the emission test. But excessive injection levels will produce high levels of ammonia which, of course, is highly undesirable.

Since correct injection levels are incredibly difficult to achieve under all circumstances, some conditions are treated specially. In the following excerpt from Jake Edge’s posting, ECU stands for Engine Control Unit and AdBlue is the German nomenclature for the urea that is injected into SCR diesel engines to meet emission requirements in many jurisdictions:

The SCR is also modeled in the ECU. It takes sensor readings and outputs from other models and produces an amount of AdBlue to use. Ideally, that would be the right amount to eliminate NOx, but emit no ammonia. There is also a separate monitoring function that will trigger an OBD-II error if the efficiency of the conversions is too low. That might cause a “check engine” condition so that the owner takes the car in for service.

It turns out that the standard SCR model does not work under all conditions (e.g. if the engine is too hot), so there is an alternative model that runs in parallel. It is a much simpler model, with fewer inputs, that has the goal of never adding too much AdBlue. There is code in the ECU that determines which model to use, and that code depends on the data provided by the car maker. In addition, the ECU stores information about which model is chosen at each ten-millisecond interval.

At this point we have an expected and accepted exception in the engine management systems so nobody will be surprised to see this second, more conservative urea injection curve in the ECU injection maps.  And nobody will be surprised to see a complex set of conditions on whether to use the standard map or the exception map. Again, the existence of this code is unsurprising and normal.

What the researchers found is that their test car was consuming roughly 24% of the urea it would have been expected to consume under compliant emissions operation. So, being security researchers, they disassembled the engine management system code and went through the sets of conditions that select the more conservative urea injection model and found these conditions were broader than they should be. Specifically, the alternative conservative urea injection curve should be used whenever any of a variety of operating conditions tests true, but one of the conditions being ORed was that engine temperature is above -3276.8K. For the non-physicists amongst you, that test will always be true since no temperature can fall below 0K. Essentially the alternative injection model is always used. This would clearly fail emissions tests so they knew it couldn’t be that simple.
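A short Python sketch shows why a threshold like this turns an OR of conditions into a constant. It rests on an assumption that is mine and not confirmed by the source: that the threshold is stored as a signed 16-bit value with 0.1 K resolution, in which case -3276.8 is simply the most negative value the field can hold. The function and constant names are hypothetical.

```python
INT16_MIN = -(2 ** 15)      # -32768, the most negative signed 16-bit value
KELVIN_PER_COUNT = 0.1      # assumed resolution of the calibration field


def threshold_kelvin(raw_count: int = INT16_MIN) -> float:
    """Convert a raw calibration count into kelvin (assumed encoding)."""
    return raw_count * KELVIN_PER_COUNT


def use_alternative_model(engine_temp_k: float, other_conditions=()) -> bool:
    """OR together the selection conditions for the conservative model."""
    # Any physical temperature is >= 0 K, so this comparison is always true
    # and the remaining conditions can never change the outcome.
    temp_condition = engine_temp_k > threshold_kelvin()
    return temp_condition or any(other_conditions)
```

Under this reading, the odd-looking number is what a calibration field looks like when it is set to its minimum, which effectively disables the temperature check as a real constraint.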

In digging deeper, Lange and Domke found another set of conditions that would force the system back to the standard, emissions-compliant urea injection model. These conditions included a complex set of linear curves that, if all were matched, would force the system back to the compliant model. As you could probably guess, the emissions tests happen to just barely be contained in these curves while almost all normal driving will fall outside them.
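The ANDing of curves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source says the conditions are linear curves that must all match, but the choice of axes (distance driven versus time since start), the envelope numbers, and all names below are hypothetical.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Piecewise-linear interpolation over sorted (x, y) points, clamped at the ends."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Each envelope is (lower_curve, upper_curve): meters driven vs. seconds since start.
# All numbers are made up for illustration.
ENVELOPES = [
    ([(0, 0), (600, 2000), (1200, 9000)], [(0, 500), (600, 4500), (1200, 12000)]),
    ([(0, 0), (1200, 5000)], [(0, 1000), (1200, 11000)]),
]


def inside_all_envelopes(seconds, meters, envelopes=ENVELOPES):
    """True only if the driving profile sits inside every envelope (logical AND)."""
    return all(
        interpolate(lower, seconds) <= meters <= interpolate(upper, seconds)
        for lower, upper in envelopes
    )


def select_scr_model(seconds, meters):
    """Pick the compliant model only when the profile matches every curve."""
    return "standard" if inside_all_envelopes(seconds, meters) else "alternative"
```

With these made-up envelopes, `select_scr_model(600, 3000)` stays on the standard model, while `select_scr_model(600, 5000)` falls outside the first envelope and drops to the alternative model. No single curve looks suspicious on its own; only their intersection traces out a test cycle.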

Essentially the cheat was hidden in plain sight. An alternative curve was expected to be there. It’s unsurprising that the system of tests that selects the curve is complex and that, rather than specific, easy to think through discrete levels, the tests are curves that must all be matched. It wouldn’t be surprising if nobody thought through the ANDing of all the required curves.

It seems conceivable that everyone on the team could see this code and yet not realize that it is non-compliant. Clearly we can’t know how many people were aware of the emissions test optimization but it’s conceivable that the group was small. More detail will come out during the numerous civil and criminal actions that follow but, in the interim, I take two lessons:

  • Any metric that is used by a jurisdictional body or that we use internally to monitor our systems is going to be incomplete. Metrics necessarily abstract away some of the complexity of reality to allow us to use a small number of numbers or curves to understand how a system is performing. Without some testing to ensure your metrics are complete and not being optimized around, there is risk they are missing important details. The application of this learning for jurisdictions wanting to do emissions testing is that they need to randomize some component of the emissions test conditions to ensure that the results are close to expected after averaging. One rule we use at Amazon to help ensure our metrics are sufficiently inclusive is to say that no customer should have a bad day without it showing up in at least one metric. This rule forces us to have an incredibly dense mesh of metrics but, with fewer, important exceptions will be missed. The VW violation should have been caught. I’m not trying to relieve VW of responsibility but it certainly is the case that emissions tests are poor representatives of real world automobile usage and there need to be more checks to ensure the real world results match the legal intentions.
  • It’s important for leadership to set aggressive goals for individual engineers and for teams. This is how great things are achieved and this tension helps deliver great products to customers. But what this shows is that very detailed auditing is needed. Leaders need to set aggressive goals but they also need to be in the details asking lots of questions. There need to be strong metrics in place to detect quality, performance, and legal compliance issues early. These tests and metrics may run into the thousands of discrete data points in order to have the fidelity to prevent the tension of high expectations from allowing even a single engineer to take a shortcut. Without a real focus on company values, constant questions and auditing, and a dense web of metrics to detect problems early, these violations will certainly happen.
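The first lesson, that a fixed and predictable test can be optimized around while a randomized one is much harder to game, can be illustrated with a small Python sketch. Everything in it is hypothetical: the normalized NOx limit, the speed band, the emission values, and the function names are invented for illustration and do not describe any real test procedure or vehicle.

```python
import random

NOX_LIMIT = 1.0  # hypothetical normalized limit


def nox_emitted(speed_kph: float) -> float:
    """Hypothetical vehicle that is clean only in the band the fixed test exercises."""
    return 0.8 if 30.0 <= speed_kph <= 50.0 else 5.0


def fixed_cycle():
    """A published, fully predictable test cycle (made-up speeds)."""
    return [30.0, 35.0, 40.0, 45.0, 50.0]


def randomized_cycle(rng: random.Random, samples: int = 20):
    """A cycle whose speeds are drawn at random across the normal driving range."""
    return [rng.uniform(0.0, 130.0) for _ in range(samples)]


def passes(cycle) -> bool:
    """Pass if average emissions over the cycle stay at or under the limit."""
    average = sum(nox_emitted(speed) for speed in cycle) / len(cycle)
    return average <= NOX_LIMIT
```

Here `passes(fixed_cycle())` succeeds because every sample sits in the optimized band, while `passes(randomized_cycle(random.Random(0)))` fails as soon as a single sample lands outside it. The same idea applies to internal metrics: a check whose inputs are fully known in advance invites special casing.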

I’m looking forward to learning far more about what happened in this case but the data already unearthed by Lange and Domke and reported by Edge gives several important lessons for anyone in an engineering or engineering leadership role. A mistake of this nature is enough to cause a great company to fail so it’s worth spending significantly to avoid the risk of these issues happening where we work. If the metrics are weak, even good people will get complacent and gaming will set in.

For more information:

 

The last set of slides is particularly worth studying.