Brad Porter is Director and Senior Principal engineer at Amazon. We work in different parts of the company but I have known him for years and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the Lisa paper On Designing and Deploying Internet-Scale Services.
--jrh
Prioritizing the Principles in "On Designing and Deploying Internet-Scale Services"
By Brad Porter
James published what I consider to be the single best paper to come out of the highly-available systems world in many years. He gave simple practical advice for delivering on the promise of high-availability. James presented “On Designing and Deploying Internet-Scale Services” at Lisa 2007.
A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.
1. Keep it simple
Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience complexity is the number one cause of human error.
Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I'm going to add three new sub-principles to this.
Have Well Defined Architectural Roles and Responsibilities: Robust systems are often described as having "good bones." The structural skeleton upon which the system has evolved and grown is solid. Good architecture starts from having a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a white-board.
Minimize Variance: Variance arises most often when engineering or operations teams use partitioning typically through branching or forking as a way to handle different use cases or requirements sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.
Clean-Up Cruft: Cruft can be defined as those things that clearly should be fixed, but no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low priority "bugs" that no one has fixed. Cleaning up cruft is a constant task, but it is a necessary to minimize complexity.
2. Expect failures
At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To "expect failures" is to recognize that "off" is always a valid state. A host or component may switch to the "off" state at any time without warning.
If you're willing to turn a component off at any time, you're immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.
3. Support version roll-back
Roll-back is similarly liberating. Many system problems are introduced on change-boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decreases dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.
4. Maintain forward-and-backward compatibility
Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn't possible as customers may be unable or unwilling to upgrade at the same time.
If you have forward-and-backwards compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. This also allows a subset of machines in the system to be upgraded and receive real production traffic as a last phase of the test cycle simultaneously with older versions of the component.
5. Give enough information to diagnose
Once you have the big ticket bugs out of the system, the persistent bugs will only happen one in a million times or even less frequently. These problems are almost impossible to reproduce cost effectively. With sufficient production data, you can perform forensic diagnosis of the issue. Without it, you're blind.
Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug and it gives you the tools to answer exactly what happened quickly rather than guessing based on the results of a multi-day or multi-week simulation.
I rank these five as the most important because they liberate you to continue to evolve the system as time and resource permit to address the other dimensions the paper describes. If you fail to do the first five, you'll be endlessly fighting operational overhead costs as you attempt to make forward progress.
I'll end with another encouragement to read the paper if you haven't already: "On Designing and Deploying Internet-Scale Services"
Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.