Blur the Development/Operations Boundary

I’ve long argued that the firm, clear division between development and operations common in many companies is a mistake. Development doesn’t feel the operational pain or learn what it takes to make its services more efficient to operate, and operations tends to hire more people to deal with the mess. It doesn’t work, it isn’t efficient, and it’s slow. The right model has the development/ops line very blurred. The Amazon model takes this to an extreme with essentially a “you wrote it, you run it” approach. This is nimble and delivers the pain right back where it can be most efficiently solved: development. But it can be a bit tough on the engineering team. The approach I like most is a hybrid, where we have a very small tier 1 operations support team, with everything else going back to development.

However, if I had to err to one extreme or the other, I would head down the Amazon path. A clear division between ops and dev leads to an over-the-wall approach that is too slow and too inefficient. What’s below is more detail on the Amazon approach, not so much because they do it perfectly but because they represent an extreme and are therefore a good data point to think through. Also note the reference to test-in-production found in the last paragraph. This is 100% the right approach in my opinion.

–jrh

When I was at Amazon (and I don’t think anything has changed) services were owned soup to nuts by the product team. There were no testers and no operations people. Datacenter operations, security, and operations were separate, centralized groups.

So ‘development’ was responsible for much of product definition and specification, as well as the development, deployment, and operation of services. This model came down directly from Jeff Bezos; his intention was to centralize responsibility and take away excuses. If something went wrong, it was very clear who was responsible for fixing it. Having said that, Amazon encouraged a no-blame culture. If things went wrong, the owner was expected to figure out the root cause and come up with a remediation plan.

This had many plusses and a few negatives. On the positive side:

· Clear ownership of problems

· Much shorter time to get features into production. There was no hand-off from dev to test to operations.

· Much less need for documentation (both a plus and a minus)

· Very fast response to operational issues, since the people who really knew the code were engaged up-front.

· Significant focus by developers on reliability and operability, since they were the people responsible for running the service

· Model works really well for new products

Negatives:

· Developers have to carry pagers. There are on-call rotations, and the person on call is required to respond to a sev-1 within 15 minutes of being paged, 24 hours a day. This could lead to burn-out.

· The ramp-up curve for new developers is extremely steep because of the lack of documentation and process.

· For some teams the operations load completely dominated, making it very difficult to tackle new features or devote time to infrastructure work.

· Coordinating large feature development across multiple teams is tough.

The one surprise in this is that, in my opinion, code quality was as good as or better than here, despite having no testers. People developed an approach of deploying to a few machines and observing their behavior for a few days before starting a large rollout. Code could be reverted extremely quickly.
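The deploy-to-a-few-machines-then-observe pattern described above can be sketched in a few lines. This is only an illustrative model, not Amazon's actual tooling: the `Fleet` class, the `staged_rollout` function, and the `is_healthy` callback are all hypothetical names, and the days-long observation period is collapsed into a single health check per machine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fleet:
    """Hypothetical in-memory model of a fleet: host name -> deployed version."""
    versions: Dict[str, str] = field(default_factory=dict)

    def deploy(self, hosts: List[str], version: str) -> None:
        for host in hosts:
            self.versions[host] = version


def staged_rollout(
    fleet: Fleet,
    new_version: str,
    canary_count: int,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Deploy to a few machines first; roll out widely only if they look healthy.

    `is_healthy` stands in for the observation period (assumption: the real
    check would watch metrics for days, not return immediately).
    Returns True if the full rollout happened, False if it was reverted.
    """
    hosts = sorted(fleet.versions)
    previous = dict(fleet.versions)  # remembered so the revert is quick
    canaries = hosts[:canary_count]

    # Stage 1: a few machines only.
    fleet.deploy(canaries, new_version)

    # Observe the canaries; revert them if any looks unhealthy.
    if not all(is_healthy(host) for host in canaries):
        for host in canaries:
            fleet.versions[host] = previous[host]
        return False

    # Stage 2: the large rollout to the remaining machines.
    fleet.deploy(hosts[canary_count:], new_version)
    return True
```

Calling `staged_rollout(fleet, "v2", 2, check)` on a five-machine fleet would touch only two machines until `check` passes for both. Keeping the previous versions in hand is what makes the revert path cheap in this sketch.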

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh
