We all know that when designing and operating applications at scale, it is persistent state management that brings the most difficult challenges. Delivering state-free applications has always been (fairly) easy. But most interesting commercial and consumer applications need to manage persistent state. Advertising needs to be delivered, customer activity needs to be tracked, and products need to be purchased. Interesting applications that run at scale all have difficult persistent state problems. That’s why Amazon.com, other AWS customers, and even other AWS services make use of the various AWS database platform services. Delegating the challenge of managing high-performance, transactional, and distributed data management to underlying services makes applications more robust while reducing operational overhead and design complexity. But, it makes these mega-scale database systems absolutely critical resources that need to be rock solid with very well understood scaling capabilities.
One such service that is heavily used by AWS, Amazon.com, and external services is DynamoDB. On the surface, DynamoDB looks fairly simple. It has only 13 APIs and has been a service that is being widely adopted due to its ease of use and doing the undifferentiated heavy lifting of dealing with disk failures, host issues or network partitions etc. However, that external simplicity is built on a hidden substrate of complex distributed systems. Such complex internals are required to achieve high-availability while running on cost-efficient infrastructure, and also to cope with rapid business-growth. As an example of this growth, DynamoDB is handling millions of transactions every second in a single region while continuing to provide single digit millisecond latency even in the midst of disk failures, host failures and rack power disruptions.
Services like DynamoDB rely on fault-tolerant distributed algorithms for replication, consistency, concurrency control, provisioning, and auto scaling. There are many such algorithms in the literature, but combining them into a cohesive system is a significant challenge, as the algorithms usually need to be modified in order to interact properly in a real-world production system. In addition, we have found it necessary to invent algorithms of our own.
We all know that increasing complexity greatly magnifies the probability of human error in design, code, and operations. Errors in the core of the system could potentially cause service disruptions and impact our service availability goals. We work hard to avoid unnecessary complexity, but the complexity of the task remains high. Before launching a service or a new feature to service like DynamoDB, we need to reach extremely high confidence that the core of the system is correct.
Historically, software companies do this using a combination of design documents, design reviews, static code analysis, code reviews, stress testing, fault injection testing and many other techniques. While these techniques are necessary they may not be sufficient. If you are building a highly concurrent replication algorithm that serves as the backbone for systems like DynamoDB, you want to be able to model partial failures in a highly concurrent distributed systems. Moreover, you want to capture the failures at design level even as these might be harder to do in testing.
This is where AWS teams and Amazon DynamoDB and transactional services teams embarked on a path of building precise designs for the services they are building. While one can argue that traditional methods of writing design docs serve a similar purpose we found that design docs lack precision. This is because they are written in prose and diagrams and they are not easy to test in an automated fashion. At the other end of the spectrum, once we have implemented the system in code, it becomes much too complex to establish algorithm correctness or to debug subtle issues. To this end, we looked for a way to express designs that will eliminate hand waving and that has sufficient tools that can be applied to check for errors in the design.
We wanted a language that allowed us to express things like replication algorithms or distributed algorithms in hundreds of lines of code. We wanted the tool to have existing ecosystems that allowed us to test various failure conditions at design level quickly. We found what we were looking for in TLA+, a formal specification language invented by ACM Turing award winner, Leslie Lamport. TLA+ is based on simple discrete math, basic set theory and predicates with which all engineers are quite familiar. A TLA+ specification simply describes the set of all possible legal behaviors (execution traces) of a system. While TLA+ syntax was unfamiliar to our engineers, TLA+ is accompanied by PlusCal, a pseudo code language that is closer to a C-style programming language. Several engineers at Amazon have found they are more productive in PlusCal than TLA+. However, in other cases, the additional flexibility of plain TLA+ has been very useful.
PlusCal and TLA+ have proven very effective at establishing and maintaining the correctness through change of the fundamental components on which DynamoDB is based. We believe having these core components correct from day one has allowed the DynamoDB system to evolve more quickly and scale fast while avoiding the difficult times often experienced by engineers and customers early in a distributed systems life.
I’ve always been somewhat skeptical of formal methods in that some bring too much complexity to actually be applied to commercial systems while others tend to abstract much of the complexity but, in abstracting away complexity, give up precision. TLA+ and PlusCal appear to skirt these limitations and we believe that having the core and most fundamental algorithms on which DynamoDB is based provably correct helps speed innovation, ease the inherent complexity of scaling and, overall, improves both the customer experience and the experience of engineers working on the DynamoDB system at AWS. While this is an important part of our system, there are hundreds of innovations we do in building and operating robust scalable distributed systems on how we design, build, test, deploy, monitor and operate these large scale services.
If you are interested in reading more about the usage of TLA+ within AWS, this is a pretty good source: Use of Formal Methods at Amazon Web Services. And, if you are interested in joining the AWS database team and, especially if you have the experience, background, and interest to lead a team in our database group, drop me a note with your current resume (email@example.com). We’re expanding into new database areas quickly and growing existing services even faster.
James Hamilton, firstname.lastname@example.org