Last week AWS announced the Amazon Relational Database Service (Amazon RDS) and I blogged that it was a big step forward for the cloud storage world: Amazon RDS, More Memory, and Lower Prices. This really is an important step forward in that a huge percentage of commercial applications are written to depend upon relational databases. But I was a bit surprised to get a couple of notes asking about the status of SimpleDB and whether the new service was a replacement. These questions were perhaps best characterized by the forum thread The End is Nigh for SimpleDB. I can understand why some might conclude that just having a relational database would be sufficient, but the world of structured storage extends far beyond relational systems. In essence, one size does not fit all and both SimpleDB and RDS are important components in addressing the needs of the broader database market.
Relational databases have become so ubiquitous that the term “database” is often treated as synonymous with relational databases like Oracle, SQL Server, MySQL, or DB2. However, the term preceded the invention and implementation of the relational model and non-relational data stores remain important today.
Relational databases are incredibly rich and able to support a very broad class of applications, but with incredible breadth comes significant complexity. Many applications don’t need the rich programming model of relational systems and some applications are better served by lighter-weight, easier-to-administer, and easier-to-scale solutions. Both relational and non-relational structured storage systems are important and no single solution is appropriate for all applications. I’ll refer to this broader, beyond-relational database market as “structured storage” to differentiate it from file stores and blob stores.
There are a near infinite number of different taxonomies for the structured storage market, but one I find useful is a simple one based upon customer intent: 1) feature-first, 2) scale-first, 3) simple structured storage, and 4) purpose-optimized stores. In the discussion that follows, I assume that no database would ever be considered viable if it wasn’t secure and didn’t maintain data integrity. These are base requirements of any reasonable solution.
Feature-First
The feature-first segment is perhaps the simplest to talk about in that there is near universal agreement. After 35 to 40 years, depending upon how you count, Relational Database Management Systems (RDBMSs) are the structured storage system of choice when a feature-rich solution is needed. Common Feature-First workloads are enterprise financial systems, human resources systems, and customer relationship management systems. In even very large enterprises, a single database instance can often support the entire workload and nearly all of these workloads are hosted on non-sharded relational database management systems.
Examples of products that meet this objective well include Oracle, SQL Server, DB2, MySQL, and PostgreSQL, amongst others. The Amazon Relational Database Service announced last week is a good example of a cloud-based solution. Generally, the feature-first segment uses RDBMSs.
Scale-First
The Scale-first segment is considerably less clear and the source of much more debate. Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.
Looking first at sharding over multiple RDBMS instances, this model requires that the programming model be significantly constrained to not expect cross-database instance joins, aggregations, globally unique secondary indexes, global stored procedures, and all the other relational database features that are incredibly hard to scale. Effectively, in this first usage mode, an RDBMS is being used as the implementation but the full relational model is not being exposed to the developer since the full model is incredibly difficult to scale. In this approach, the data is sharded over 10s or even 100s of independent database instances. The Windows Live Messenger group store is an excellent example of the Sharded RDBMS model of Scale-First.
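To make the constraint concrete, here is a minimal Python sketch of the sharded RDBMS approach. It is an illustration under stated assumptions rather than a description of any production system: in-memory SQLite databases stand in for independent RDBMS instances, and the ShardedUserStore class, the users table, and the CRC32-modulo routing are all hypothetical choices made for this example. Single-key operations touch one shard, while an aggregate has to be computed by the application across every shard.

```python
import sqlite3
import zlib


class ShardedUserStore:
    """Hypothetical example: routes each row to one of several independent databases."""

    def __init__(self, shard_count):
        # Each in-memory connection stands in for a separate RDBMS instance.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for db in self.shards:
            db.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, country TEXT)")

    def _shard_for(self, user_id):
        # A stable hash keeps a given key on the same shard across runs.
        index = zlib.crc32(user_id.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, user_id, country):
        db = self._shard_for(user_id)
        db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, country))
        db.commit()

    def get(self, user_id):
        # Single-key lookups touch exactly one shard.
        row = self._shard_for(user_id).execute(
            "SELECT country FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

    def count_by_country(self, country):
        # No single instance can answer this query, so the application
        # asks every shard and combines the partial results itself.
        total = 0
        for db in self.shards:
            (partial,) = db.execute(
                "SELECT COUNT(*) FROM users WHERE country = ?", (country,)
            ).fetchone()
            total += partial
        return total


store = ShardedUserStore(shard_count=4)
store.put("alice", "US")
store.put("bob", "CA")
store.put("carol", "US")
```

Note that count_by_country is the cheap case: a join between rows living on different shards, or a globally unique secondary index, would need considerably more application logic, which is why those features are typically given up in this model.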
There may be some who will jump in and say that DB2 Parallel Edition (DB2 PE, now part of the DB2 Enterprise Edition) and Oracle Real Application Clusters (Oracle RAC) actually do scale the full relational model. I was lucky enough to work closely with the DB2 PE team when I was Lead Architect on DB2, so I know it well. There is no question that both DB2 and RAC are great products but, as good as they are, very high scale sites still typically choose to either 1) shard over multiple instances or 2) use a high-scale key-value store.
This first option, using an RDBMS as an implementation component and sharding data over many instances, is a perfectly reasonable and rational approach and one that is frequently used. The second option is to use a scalable key-value store. Some key-value store product examples include Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase, and Hypertable (see Key Value Stores). Amazon SimpleDB is a good example of a cloud-based offering.
Simple Structured Storage
There are many applications that have a structured storage requirement but really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key-value store. A file system or BLOB store is not sufficiently rich, in that simple query and index access is needed, but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.
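As a rough illustration of what "simple query and index access" can mean in practice, the following Python sketch keeps items as attribute dictionaries under a key and maintains one equality index over attribute values. It is a toy in-memory model built on assumptions, not the design of SimpleDB or of any product named here: the SimpleStore class and its put, get, delete, and query methods are hypothetical names, and attribute values are assumed to be hashable.

```python
from collections import defaultdict


class SimpleStore:
    """Hypothetical example: key-value items plus an equality index on attributes."""

    def __init__(self):
        self.items = {}                 # key -> attribute dict
        self.index = defaultdict(set)   # (attribute, value) -> set of keys

    def put(self, key, attributes):
        # Replace any earlier version so the index never holds stale entries.
        self.delete(key)
        self.items[key] = dict(attributes)
        for pair in attributes.items():
            self.index[pair].add(key)

    def delete(self, key):
        old = self.items.pop(key, None)
        if old:
            for pair in old.items():
                self.index[pair].discard(key)

    def get(self, key):
        return self.items.get(key)

    def query(self, attribute, value):
        # Equality lookup through the index; no joins, no transactions.
        return sorted(self.index.get((attribute, value), ()))


songs = SimpleStore()
songs.put("song1", {"artist": "A", "year": 2009})
songs.put("song2", {"artist": "A", "year": 2008})
songs.put("song3", {"artist": "B", "year": 2009})
```

With this data, songs.query("artist", "A") looks up matching keys through the index rather than scanning every item, which is more than a file system or blob store offers while remaining far short of a relational query processor.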
Uses of Simple Structured Storage are unremarkable and, as a consequence, there are fewer visible examples at the low end of the scale spectrum to reference. Towards the high end, we have email inbox search at Facebook (using Cassandra), Last.fm, which reports it will be using Project Voldemort, and the Amazon retail shopping cart (using Dynamo). Perhaps the most widely used example of this class of storage system is Berkeley DB. On the cloud side, SimpleDB again is a good example (AdaptiveBlue, Livemocha, and Alexa).
Purpose-Optimized Stores
Recently Mike Stonebraker wrote an influential paper titled One Size Fits All: An Idea Whose Time Has Come and Gone. In this paper, Mike argued that the existing commercial RDBMS offerings do not meet the needs of many important market segments. In a presentation with the same title, Stonebraker argues that StreamBase, a special purpose stream processing system, beat the RDBMS solutions in benchmarks by 27x; that Vertica, a special purpose data warehousing product, beat the RDBMS incumbents by never less than 30x; and that H-Store (now VoltDB), a special purpose transaction processing system, beat the standard RDBMS offerings by a full 82x.
Many other Purpose-Optimized stores have emerged (for example, Aster Data, Netezza, and Greenplum) and this category continues to grow quickly. Clearly there is space and customer need for more than a single solution.
Where do SimpleDB and RDS Fit in?
The Amazon RDS service is aimed squarely at the first category above, Feature-First. This is a segment that needs features and mostly uses RDBMS databases. And RDS is amongst the easiest ways to bring up one or more databases quickly and efficiently without needing to hire a database administrator.
Amazon SimpleDB is a good solution for the third category, Simple Structured Storage. SimpleDB is there when you need it, is incredibly easy to use, and is inexpensive. The SimpleDB team will continue to focus on 1) very high availability, 2) supporting scale without bound, 3) simplicity and ease of use, and 4) lowest possible cost and this service will continue to evolve.
The second category, scale-first, is served by both SimpleDB and RDS. Solutions based upon RDS will shard the data over multiple, independent RDS database instances. Solutions based upon SimpleDB will either use the service directly or shard the data over multiple SimpleDB Domains. Of the two approaches, SimpleDB is the easier to use and more directly targets this usage segment.
The SimpleDB team is incredibly busy right now getting ready for several big announcements over the next 6 to 9 months. Expect to see SimpleDB continue to get easier to use while approaching the goal of scaling without bound. The team is working hard and I’m looking forward to the new features being released.
The AWS solution for the final important category, purpose-optimized storage, is based upon the Elastic Compute Cloud (EC2) and the Elastic Block Store (EBS). EC2 provides the capability to host specialized data engines and EBS provides virtualized storage for the data engine hosted in EC2. This combination is sufficiently rich to support Purpose-Optimized Stores such as Aster Data, Vertica, or Greenplum, or any of the commonly used RDBMS offerings such as Oracle, SQL Server, DB2, MySQL, or PostgreSQL.
The Amazon Web Services plan is to continue to invest deeply in both SimpleDB and RDS as direct structured storage solutions and to continue to rapidly enhance EC2 and EBS to ensure that broadly-used database solutions as well as purpose-built stores run extremely well in the cloud. This year has been a busy one in AWS storage and I’m looking forward to the same pace next year.