Brad Porter is a Director and Senior Principal Engineer at Amazon. We work in different parts of the company, but I have known him for years, and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the LISA paper “On Designing and Deploying Internet-Scale Services.”


Prioritizing the Principles in “On Designing and Deploying Internet-Scale Services”

By Brad Porter

James published what I consider to be the single best paper to come out of the highly-available systems world in many years. He gave simple, practical advice for delivering on the promise of high availability. James presented “On Designing and Deploying Internet-Scale Services” at LISA 2007.

A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.

1. Keep it simple

Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience, complexity is the number one cause of human error.

Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I’m going to add three new sub-principles to this.

Have Well-Defined Architectural Roles and Responsibilities: Robust systems are often described as having “good bones”: the structural skeleton upon which the system has evolved and grown is solid. Good architecture starts with a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a whiteboard.

Minimize Variance: Variance arises most often when engineering or operations teams use partitioning, typically through branching or forking, as a way to handle different use cases or requirement sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.
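As a small illustration of keeping variance in one visible place (my own sketch, not something from the paper), a single shared baseline with minimal per-environment overrides can replace forked copies of a configuration. The names below, such as render_config and the field names, are hypothetical.

```python
# Sketch: one parameterized configuration instead of a forked copy per use case.
# All names here (render_config, BASE, OVERRIDES) are illustrative only.

BASE = {
    "heap_mb": 2048,
    "request_timeout_ms": 500,
    "log_level": "INFO",
}

# Per-environment deltas are kept small and explicit, so the variance that
# does exist is visible in one place rather than hidden in forked copies.
OVERRIDES = {
    "prod":  {"heap_mb": 8192},
    "gamma": {"log_level": "DEBUG"},
}

def render_config(environment: str) -> dict:
    """Merge the shared baseline with the minimal per-environment delta."""
    config = dict(BASE)
    config.update(OVERRIDES.get(environment, {}))
    return config

if __name__ == "__main__":
    print(render_config("prod"))
    print(render_config("gamma"))
```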

Clean Up Cruft: Cruft can be defined as those things that clearly should be fixed, but that no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low-priority “bugs” that no one has fixed. Cleaning up cruft is a constant task, but it is necessary to minimize complexity.

2. Expect failures

At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To “expect failures” is to recognize that “off” is always a valid state. A host or component may switch to the “off” state at any time without warning.

If you’re willing to turn a component off at any time, you’re immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.
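As a minimal sketch of what treating “off” as a valid state can look like in practice (my own illustration, assuming a simple worker loop; none of this code is from the paper), the component below treats SIGTERM as a normal request to go “off,” finishes its current unit of work, and exits.

```python
# Sketch: a worker that treats "off" as a normal state. Receiving SIGTERM
# simply flips the state; the loop finishes its current unit of work and exits.
import signal
import time

shutting_down = False

def request_off(signum, frame):
    """Mark the process as 'off'; in-flight work is not abandoned."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, request_off)
signal.signal(signal.SIGINT, request_off)

def handle_next_item():
    # Placeholder for real work (poll a queue, serve a request, etc.).
    time.sleep(0.1)

while not shutting_down:
    handle_next_item()

# At this point the component is cleanly "off" and can be upgraded,
# replaced, or restarted without special-case coordination.
```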

3. Support version roll-back

Roll-back is similarly liberating. Many system problems are introduced on change boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decrease dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.
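One common way to make roll-back cheap, offered here as an illustrative sketch rather than anything the paper prescribes, is to keep prior releases on disk and flip a “current” pointer. The paths and function names below are hypothetical.

```python
# Sketch: keep prior releases on disk and flip a "current" symlink, so a
# roll-back is a pointer swap rather than a rebuild. Paths are illustrative.
import os

RELEASES_DIR = "/opt/myservice/releases"   # hypothetical layout
CURRENT_LINK = "/opt/myservice/current"

def activate(version: str) -> None:
    """Point 'current' at the given release atomically via rename."""
    target = os.path.join(RELEASES_DIR, version)
    tmp_link = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)     # atomic replace on POSIX filesystems

def roll_back(previous_version: str) -> None:
    """Rolling back is just activating the version that was running before."""
    activate(previous_version)
```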

4. Maintain forward-and-backward compatibility

Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn’t possible as customers may be unable or unwilling to upgrade at the same time.

If you have forward-and-backward compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. It also allows a subset of machines in the system to be upgraded and to receive real production traffic alongside older versions of the component as a last phase of the test cycle.
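A small sketch of the “tolerant reader” side of this, assuming a JSON message format of my own invention (the field names are illustrative, not from the paper): old readers ignore fields they don’t know, and new readers supply defaults for fields old writers never sent, so old and new versions can run side by side.

```python
# Sketch: a tolerant reader for a JSON message. Old readers ignore fields they
# don't know; new readers supply defaults for fields old writers don't send.
# The message shape and field names are illustrative only.
import json

def parse_order(raw: str) -> dict:
    msg = json.loads(raw)
    return {
        "order_id": msg["order_id"],            # required in every version
        "quantity": msg.get("quantity", 1),     # added later; the default keeps
                                                # old writers compatible
        # Extra fields from newer writers are simply not read here, which
        # keeps old readers compatible with new writers.
    }

# A message from a newer writer, with a field this reader doesn't know about:
print(parse_order('{"order_id": "42", "quantity": 3, "gift_wrap": true}'))
# A message from an older writer, missing the newer field:
print(parse_order('{"order_id": "43"}'))
```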

5. Give enough information to diagnose

Once you have the big-ticket bugs out of the system, the persistent bugs will only happen one time in a million, or even less frequently. These problems are almost impossible to reproduce cost-effectively. With sufficient production data, you can perform a forensic diagnosis of the issue. Without it, you’re blind.

Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug, and it gives you the means to answer exactly what happened quickly, rather than guessing based on the results of a multi-day or multi-week simulation.
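As one illustration of capturing enough production data (a sketch of my own, not the paper’s prescription), emitting a structured, machine-parseable record per request with a correlation ID gives you something to perform forensics on when a one-in-a-million failure finally occurs. The service and field names below are hypothetical.

```python
# Sketch: structured per-request log records with a correlation id, so a rare
# failure can be reconstructed after the fact. Field names are illustrative.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("myservice")       # hypothetical service name

def handle_request(payload: dict) -> None:
    record = {
        "request_id": str(uuid.uuid4()),   # correlation id for forensic tracing
        "received_at": time.time(),
        "payload_size": len(json.dumps(payload)),
    }
    try:
        # Placeholder for real work on the request.
        result = sum(payload.get("values", []))
        record.update(status="ok", result=result)
    except Exception as exc:               # capture the failure with full context
        record.update(status="error", error=repr(exc))
        raise
    finally:
        log.info(json.dumps(record))       # one machine-parseable line per request

handle_request({"values": [1, 2, 3]})
```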

I rank these five as the most important because they liberate you to continue evolving the system, as time and resources permit, to address the other dimensions the paper describes. If you fail to do the first five, you’ll be endlessly fighting operational overhead costs as you attempt to make forward progress.

  • If you haven’t kept it simple, then you’ll spend much of your time dealing with system dependencies, arguing over roles & responsibilities, managing variants, or sleuthing through data/config or code that is difficult to follow.
  • If you haven’t expected failures, then you’ll be reacting when the system does fail. You may also be dealing with complicated change-management processes designed to keep the system up and running while you’re attempting to change it.
  • If you haven’t implemented roll-back, then you’ll live in fear of your next upgrade. After one or two failures, you will hesitate to make any further system change, no matter how beneficial.
  • Without forward-and-backward compatibility, you’ll spend much of your time trying to force dependent customers through migrations.
  • Without enough information to diagnose, you’ll spend substantial amounts of time debugging or attempting to reproduce difficult-to-find bugs.

I’ll end with another encouragement to read the paper if you haven’t already: “On Designing and Deploying Internet-Scale Services.”