Challenges in Designing at Scale: Formal Methods in Building Robust Distributed Systems

We all know that when designing and operating applications at scale, it is persistent state management that brings the most difficult challenges. Delivering state-free applications has always been (fairly) easy. But most interesting commercial and consumer applications need to manage persistent state. Advertising needs to be delivered, customer activity needs to be tracked, and products need to be purchased. Interesting applications that run at scale all have difficult persistent state problems. That’s why Amazon.com, other AWS customers, and even other AWS services make use of the various AWS database platform services. Delegating the challenge of managing high-performance, transactional, and distributed data management to underlying services makes applications more robust while reducing operational overhead and design complexity. But, it makes these mega-scale database systems absolutely critical resources that need to be rock solid with very well understood scaling capabilities.

One such service that is heavily used by AWS, Amazon.com, and external services is DynamoDB. On the surface, DynamoDB looks fairly simple. It has only 13 APIs and has been a service that is being widely adopted due to its ease of use and doing the undifferentiated heavy lifting of dealing with disk failures, host issues or network partitions etc. However, that external simplicity is built on a hidden substrate of complex distributed systems. Such complex internals are required to achieve high-availability while running on cost-efficient infrastructure, and also to cope with rapid business-growth. As an example of this growth, DynamoDB is handling millions of transactions every second in a single region while continuing to provide single digit millisecond latency even in the midst of disk failures, host failures and rack power disruptions.

Services like DynamoDB rely on fault-tolerant distributed algorithms for replication, consistency, concurrency control, provisioning, and auto scaling. There are many such algorithms in the literature, but combining them into a cohesive system is a significant challenge, as the algorithms usually need to be modified in order to interact properly in a real-world production system. In addition, we have found it necessary to invent algorithms of our own.

We all know that increasing complexity greatly magnifies the probability of human error in design, code, and operations. Errors in the core of the system could potentially cause service disruptions and impact our service availability goals. We work hard to avoid unnecessary complexity, but the complexity of the task remains high. Before launching a service or a new feature to service like DynamoDB, we need to reach extremely high confidence that the core of the system is correct.

Historically, software companies do this using a combination of design documents, design reviews, static code analysis, code reviews, stress testing, fault injection testing and many other techniques. While these techniques are necessary they may not be sufficient. If you are building a highly concurrent replication algorithm that serves as the backbone for systems like DynamoDB, you want to be able to model partial failures in a highly concurrent distributed systems. Moreover, you want to capture the failures at design level even as these might be harder to do in testing.

This is where AWS teams and Amazon DynamoDB and transactional services teams embarked on a path of building precise designs for the services they are building. While one can argue that traditional methods of writing design docs serve a similar purpose we found that design docs lack precision. This is because they are written in prose and diagrams and they are not easy to test in an automated fashion. At the other end of the spectrum, once we have implemented the system in code, it becomes much too complex to establish algorithm correctness or to debug subtle issues. To this end, we looked for a way to express designs that will eliminate hand waving and that has sufficient tools that can be applied to check for errors in the design.

We wanted a language that allowed us to express things like replication algorithms or distributed algorithms in hundreds of lines of code. We wanted the tool to have existing ecosystems that allowed us to test various failure conditions at design level quickly. We found what we were looking for in TLA+, a formal specification language invented by ACM Turing award winner, Leslie Lamport. TLA+ is based on simple discrete math, basic set theory and predicates with which all engineers are quite familiar. A TLA+ specification simply describes the set of all possible legal behaviors (execution traces) of a system. While TLA+ syntax was unfamiliar to our engineers, TLA+ is accompanied by PlusCal, a pseudo code language that is closer to a C-style programming language. Several engineers at Amazon have found they are more productive in PlusCal than TLA+. However, in other cases, the additional flexibility of plain TLA+ has been very useful.

PlusCal and TLA+ have proven very effective at establishing and maintaining the correctness through change of the fundamental components on which DynamoDB is based. We believe having these core components correct from day one has allowed the DynamoDB system to evolve more quickly and scale fast while avoiding the difficult times often experienced by engineers and customers early in a distributed systems life.

I’ve always been somewhat skeptical of formal methods in that some bring too much complexity to actually be applied to commercial systems while others tend to abstract much of the complexity but, in abstracting away complexity, give up precision. TLA+ and PlusCal appear to skirt these limitations and we believe that having the core and most fundamental algorithms on which DynamoDB is based provably correct helps speed innovation, ease the inherent complexity of scaling and, overall, improves both the customer experience and the experience of engineers working on the DynamoDB system at AWS. While this is an important part of our system, there are hundreds of innovations we do in building and operating robust scalable distributed systems on how we design, build, test, deploy, monitor and operate these large scale services.

If you are interested in reading more about the usage of TLA+ within AWS, this is a pretty good source: Use of Formal Methods at Amazon Web Services. And, if you are interested in joining the AWS database team and, especially if you have the experience, background, and interest to lead a team in our database group, drop me a note with your current resume (james@amazon.com). We’re expanding into new database areas quickly and growing existing services even faster.

James Hamilton, jrh@mvdirona.com

Services

11 comments on “Challenges in Designing at Scale: Formal Methods in Building Robust Distributed Systems”

Tim Rath says:

August 16, 2014 at 12:39 am

Hi Adrien, I’m one of the Amazon developers using TLA+, and to answer your question, the bugs found generally pointed to very subtle (extremely hard to see), yet fundamental algorithm flaws. The traces that TLC provides make the issue very clear, and easy to understand. The bugs I found were actually very easy to fix as well (though YMMV.) They were super corner cases, and I don’t think any amount of "thinking my way through it" would have found them without the tool showing it to me. They were also the type of bug that could have sat latent for years without being hit, but with extremely high scale systems, probabilities that today would require years to ever be seen can be uncovered in months/days as the system scales up, so we really like to discover these things up-front.

Also, James mentions that TLA+ doesn’t currently have an automated theorem prover. It actually does now. The TLAPS proof system is supported by the TLA toolbox today. Here’s a paper describing it: http://research.microsoft.com/en-us/um/people/lamport/pubs/tlaps.pdf

Reply
Adrian says:

August 14, 2014 at 7:04 pm

Neither the post nor the paper talk about how much effort was devoted to fix the bugs found. Were them easy to understand and fix once found? I am interested in whether having limited synthesis capabilities on top of TLA+ would be helpful in your setting. Thanks for the post!
Adrian

Reply
James Hamilton says:

July 7, 2014 at 9:17 pm

You are right,Petri-Nets are a component of many model checkers and TLA+ does not currently have an automated theorem prover.

Reply
Adrien M says:

July 7, 2014 at 7:38 pm

Interesting. But don’t you think formal models/proofs such as Petri Net could be more beneficial than TLA+ or +Cal?

Reply
James Hamilton says:

July 5, 2014 at 1:45 am

Craig suggests "it woudl be intsructive to compare solution to a given problem in, say, TLA+, Z3, and Cog and show this is the case [TLA+ is useful in production]".

I’ve not used Z3 and Cog so can’t say with any authority that TLA+ is better. What we can say is that TLA+ is useful on real-world, production applications.

Reply
Craig Stuntz says:

July 4, 2014 at 1:45 am

"TLA+ and PlusCal appear to skirt these limitations…"

This is a very interesting claim. It would be instructive to compare solutions to a given problem in, say, TLA+, Z3,and Coq, and show why this is the case.

Reply
James Hamilton says:

July 3, 2014 at 11:49 pm

The question was "why point to a Microsoft Research location for the paper since they are all Amazon authors." The paper is published so, of course, it is available from the publisher. You are correct the paper doesn’t directly involve msft. I pointed to Leslie Lamport’s web page because it’s available there online and without charge.

Reply
KLDavenport says:

July 3, 2014 at 6:13 pm

in reply to curious: A colleague at Microsoft says a thousand plus researchers work in the research labs with the goal of solving global problems, that is to say the end goal of their work isn’t necessarily something Microsoft can monetize. They’ve even contributed to the excellent http://ipython.org/ project.

Reply
Donnie Berkholz says:

July 3, 2014 at 5:38 pm

The report is on Microsoft because TLA+ is theirs and it’s basically a case study of Amazon using it.

Reply
Charles says:

July 3, 2014 at 10:46 am

Very interesting article as ever James. :)

Reply
curious says:

July 3, 2014 at 9:47 am

All the authors work at Amazon, all the systems described were Amazon systems… so why is the tech report on a Microsoft site?

Reply

Perspectives

Challenges in Designing at Scale: Formal Methods in Building Robust Distributed Systems

11 comments on “Challenges in Designing at Scale: Formal Methods in Building Robust Distributed Systems”

Leave a Reply Cancel reply