One Size Does Not Fit All

Last week AWS announced the Amazon Relational Database Service (Amazon RDS) and I blogged that it was a big step forward for the cloud storage world: Amazon RDS, More Memory, and Lower Prices. This really is an important step forward in that a huge percentage of commercial applications are written to depend upon relational databases. But I was a bit surprised to get a couple of notes asking about the status of SimpleDB and whether the new service was a replacement. These questions were perhaps best characterized by the forum thread The End is Nigh for SimpleDB. I can understand why some might conclude that just having a relational database would be sufficient, but the world of structured storage extends far beyond relational systems. In essence, one size does not fit all and both SimpleDB and RDS are important components in addressing the needs of the broader database market.

Relational databases have become so ubiquitous that the term “database” is often treated as synonymous with relational databases like Oracle, SQL Server, MySQL, or DB2. However, the term preceded the invention and implementation of the relational model and non-relational data stores remain important today.

Relational databases are incredibly rich and able to support a very broad class of applications but with incredible breadth comes significant complexity. Many applications don’t need the rich programming model of relational systems and some applications are better serviced by lighter-weight, easier-to-administer, and easier-to-scale solutions. Both relational and non-relational structured storage systems are important and no single solution is appropriate for all applications. I’ll refer to this broader, beyond-relational database market as “structured storage” to differentiate it from file stores and blob stores.

There are a near infinite number of different taxonomies for the structured storage market, but one I find useful is a simple one based upon customer intent: 1) feature-first, 2) scale-first, 3) simple structured storage, and 4) purpose-optimized stores. In the discussion that follows, I assume that no database would ever be considered viable if it wasn’t secure and didn’t maintain data integrity. These are base requirements of any reasonable solution.

Feature-First

The feature-first segment is perhaps the simplest to talk about in that there is near universal agreement. After 35 to 40 years, depending upon how you count, Relational Database Management Systems (RDBMSs) are the structured storage system of choice when a feature-rich solution is needed. Common Feature-First workloads are enterprise financial systems, human resources systems, and customer relationship management systems. In even very large enterprises, a single database instance can often support the entire workload and nearly all of these workloads are hosted on non-sharded relational database management systems.

Examples of products that meet this objective well include Oracle, SQL Server, DB2, MySQL, and PostgreSQL, amongst others. And the Amazon Relational Database Service announced last week is a good example of a cloud-based solution. Generally, the feature-first segment uses RDBMSs.

Scale-First

The Scale-first segment is considerably less clear and the source of much more debate. Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.

Looking first at sharding over multiple RDBMS instances, this model requires that the programming model be significantly constrained to not expect cross-database instance joins, aggregations, globally unique secondary indexes, global stored procedures, and all the other relational database features that are incredibly hard to scale. Effectively, in this first usage mode, an RDBMS is being used as the implementation but the full relational model is not being exposed to the developer since the full model is incredibly difficult to scale. In this approach, the data is sharded over 10s or even 100s of independent database instances. The Windows Live Messenger group store is an excellent example of the Sharded RDBMS model of Scale-First.
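To make the constraint concrete, here is a minimal sketch of the sharded RDBMS model, assuming a hypothetical users table and using in-memory SQLite databases as stand-ins for the independent instances. The names ShardedUserStore, shard_index, and count_by_country are invented for illustration and do not come from any of the systems mentioned above. Single-key operations go to exactly one shard, while an aggregate has to be assembled by the application because no instance can see the others' data.

```python
import hashlib
import sqlite3


def shard_index(key, shard_count):
    """Map a key to a shard number with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedUserStore:
    """Hypothetical store that spreads one logical table over independent databases."""

    def __init__(self, shard_count):
        # Each in-memory SQLite database stands in for a separate RDBMS instance.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for db in self.shards:
            db.execute(
                "CREATE TABLE users (user_id TEXT PRIMARY KEY, country TEXT)"
            )

    def _shard(self, user_id):
        return self.shards[shard_index(user_id, len(self.shards))]

    def put(self, user_id, country):
        # A single-key write touches exactly one instance.
        db = self._shard(user_id)
        db.execute(
            "INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, country)
        )
        db.commit()

    def get(self, user_id):
        # A single-key read also touches exactly one instance.
        return self._shard(user_id).execute(
            "SELECT user_id, country FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()

    def count_by_country(self, country):
        # There is no cross-instance aggregation, so the application
        # queries every shard and combines the partial results itself.
        total = 0
        for db in self.shards:
            (partial,) = db.execute(
                "SELECT COUNT(*) FROM users WHERE country = ?", (country,)
            ).fetchone()
            total += partial
        return total
```

Simple modulo placement like this forces most keys to move when the shard count changes, so a real deployment would likely use a directory or consistent hashing instead; the sketch only illustrates where the relational features stop at the instance boundary.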

There may be some that will jump in and say that DB2 Parallel Edition (DB2 PE, now part of the DB2 Enterprise Edition) and Oracle Real Application Clusters (Oracle RAC) actually do scale the full relational model. I was lucky enough to work closely with the DB2 PE team when I was Lead Architect on DB2 so I know it well. There is no question that both DB2 and RAC are great products but, as good as they are, very high scale sites still typically choose to either 1) shard over multiple instances or 2) use a high-scale, key-value store.

This first option, that of using an RDBMS as an implementation component, and sharding data over many instances is a perfectly reasonable and rational approach and one that is frequently used. The second option is to use a scalable key-value store. Some key-value store product examples include Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase and Hypertable (see Key Value Stores). Amazon SimpleDB is a good example of a cloud-based offering.

Simple Structured Storage

There are many applications that have a structured storage requirement but they really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key value store. A file system or BLOB-store is not sufficiently rich in that simple query and index access is needed but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.
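As a rough illustration of how little this segment asks for, here is a toy in-memory store, assuming items are just a key plus a flat set of attributes. The class SimpleStructuredStore and its methods are hypothetical names for this sketch and are not the API of SimpleDB or any other product. It offers put, get, delete, and an equality query backed by an index, which is more than a file or blob store gives you and far less than a full RDBMS.

```python
from collections import defaultdict


class SimpleStructuredStore:
    """Toy key-value store with attribute items and a simple equality index."""

    def __init__(self):
        self._items = {}                 # key -> attribute dict
        self._index = defaultdict(set)   # (attribute, value) -> set of keys

    def put(self, key, attributes):
        # Replace any previous version so the index never holds stale entries.
        self.delete(key)
        self._items[key] = dict(attributes)
        for attr, value in attributes.items():
            self._index[(attr, value)].add(key)

    def get(self, key):
        item = self._items.get(key)
        return dict(item) if item is not None else None

    def delete(self, key):
        old = self._items.pop(key, None)
        if old:
            for attr, value in old.items():
                self._index[(attr, value)].discard(key)

    def query(self, attr, value):
        # Indexed equality lookup: no joins, no transactions, no schema.
        return sorted(self._index.get((attr, value), set()))
```

A usage example, with made-up data:

```python
store = SimpleStructuredStore()
store.put("song1", {"artist": "A", "year": 2009})
store.put("song2", {"artist": "B", "year": 2009})
matching_keys = store.query("year", 2009)   # keys of items whose year is 2009
```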

Uses of Simple Structured Storage are unremarkable and, as a consequence, there are fewer visible examples at the low end of the scale spectrum to reference. Towards the high end, Facebook uses Cassandra for email inbox search, Last.fm reports they will be using Project Voldemort, and Amazon uses Dynamo for the retail shopping cart. Perhaps the most widely used example of this class of storage system is Berkeley DB. On the cloud side, SimpleDB again is a good example (AdaptiveBlue, Livemocha, and Alexa).

Purpose-Optimized Stores

Recently Mike Stonebraker wrote an influential paper titled One Size Fits All: An Idea Whose Time Has Come and Gone. In this paper, Mike argued that the existing commercial RDBMS offerings do not meet the needs of many important market segments. In a presentation with the same title, Stonebraker argues that StreamBase, a special purpose stream processing system, beat the RDBMS solutions in benchmarks by 27x; that Vertica, a special purpose data warehousing product, beat the RDBMS incumbents by never less than 30x; and that H-Store (now VoltDB), a special purpose transaction processing system, beat the standard RDBMS offerings by a full 82x.

Many other Purpose-Optimized stores have emerged (for example, Aster Data, Netezza, and Greenplum) and this category continues to grow quickly. Clearly there is space and customer need for more than a single solution.

Where do SimpleDB and RDS Fit in?

The Amazon RDS service is aimed squarely at the first category above, Feature-First. This is a segment that needs features and mostly uses RDBMS databases. And RDS is amongst the easiest ways to bring up one or more databases quickly and efficiently without needing to hire a database administrator.

Amazon SimpleDB is a good solution for the third category, Simple Structured Storage. SimpleDB is there when you need it, is incredibly easy to use, and is inexpensive. The SimpleDB team will continue to focus on 1) very high availability, 2) supporting scale without bound, 3) simplicity and ease of use, and 4) lowest possible cost and this service will continue to evolve.

The second category, scale-first, is served by both SimpleDB and RDS. Solutions based upon RDS will shard the data over multiple, independent RDS database instances. Solutions based upon SimpleDB will either use the service directly or shard the data over multiple SimpleDB Domains. Of the two approaches, SimpleDB is the easiest to use and more directly targets this usage segment.

The SimpleDB team is incredibly busy right now getting ready for several big announcements over the next 6 to 9 months. Expect to see SimpleDB continue to get easier to use while approaching the goal of scaling without bound. The team is working hard and I’m looking forward to the new features being released.

The AWS solution for the final important category, purpose-optimized stores, is based upon the Elastic Compute Cloud (EC2) and the Elastic Block Store (EBS). EC2 provides the capability to host specialized data engines and EBS provides virtualized storage for the data engine hosted in EC2. This combination is sufficiently rich to support Purpose-Optimized Stores such as Aster Data, Vertica, or Greenplum, or any of the commonly used RDBMS offerings such as Oracle, SQL Server, DB2, MySQL, and PostgreSQL.

The Amazon Web Services plan is to continue to invest deeply in both SimpleDB and RDS as direct structured storage solutions and to continue to rapidly enhance EC2 and EBS to ensure that broadly-used database solutions as well as purpose-built stores run extremely well in the cloud. This year has been a busy one in AWS storage and I’m looking forward to the same pace next year.

–jrh

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

12 comments on “One Size Does Not Fit All”
  1. Thanks for the comment Vivek. You are right, you can have a single programming API over data stores with very different capabilities as long as you are fine with the API not being symmetrically supported by all of them. There will be some features that are only exposed in a subset and, even for those features exposed in them all, there will be performance differences based upon what implementation technique was used that may be relevant for the application. But, with those caveats, it’s absolutely possible, and the JDBC, ODBC, and JDO APIs, as examples, have been implemented over very different database systems.

    –jrh
    jrh@mvdirona.com

  2. Vivek Juneja says:

    James, I understand from your reply that achieving
    that kind of requirement is a difficult task.

    But instead, there could be scenarios where more than one type of database solution is required.
    An application database could be spread across SimpleDB and RDS, wherein the parts of the database that prefer true scalability over other aspects would go to SimpleDB, and the other parts would use the features provided by RDS.

    Implementing such a system would require any designer to think about what needs to scale and what requires feature. I believe there could be an abstraction layer that could be built over SimpleDB and RDS providing configurable interface to partition the system over these.

    Provided such a configurable access, the application designer would need to then decide parts of the Database (Table Instances, and other DB objects) that would require SimpleDB functionalities and what parts require RDS features. And of course these DB objects have to talk between each other.

    What is, according to you, the scope for such a configurable abstraction? And, as an engineer, what needs to be done to achieve the same? It will be interesting for me to hear from you on this.

  3. Vivek, what you are asking for is the holy grail of relational systems. Is it possible to combine the power of SQL and all the functionality of a full relational database in a highly scalable, multi-user service? That would be very hard and, so far, it has been an elusive goal industry wide.

    I argue that today no single product can meet this broad class of needs and I suspect it actually may stay that way.

    –jrh
    jrh@mvdirona.com

  4. Vivek Juneja says:

    Amazing read. Thanks James for a wonderful explanation.

    I had a question in mind for some time, about a potential
    interop between SimpleDB and RDS. If we want to have the
    relational power of RDS with the scalability of SimpleDB,
    could we not devise a kind of mixin?

    What is your perspective on something like that? Is it
    wise to build a product around a feature request
    like that? Or do we have to look for alternatives like
    Purpose-Optimized Stores targeting audiences looking
    for SimpleDB + RDS features?

  5. Thanks for the comment Bradford. There is no question that RDBMSs are the Swiss Army knife of computing. They can do nearly everything but specialized tools do many things better.

    Having spent 20+ years working on RDBMSs I’m still convinced they are very useful but, clearly, there are many problems where more specialized tools are a better choice.

    –jrh
    jrh@mvdirona.com

  6. This article is all kinds of awesome. It’ll have me thinking for days.

    I don’t have a mental taxonomy, but I have noticed that what people *do* with databases is usually very simple.

    1. Store and retrieve values (think K,V)
    2. Filter data
    3. Aggregate/Additive calculations (Group, Sum, Count).

    Entire companies are based around these operations — I’d say 90% of users are utilizing about 10% of the power of an RDBMS. That feels suboptimal. This is a recurring theme in my writings: there’s an impedance mismatch between storage (RDBMS) and use cases (‘aggregate my data’).

    It’s why I’m such a huge fan of faceted search – you can fulfill those 90% of use cases in an extremely scalable and intuitive way. When you optimize for what people are actually doing with their data, you get something that’s actually scalable.

  7. Glad it was helpful. Thanks David.

    –jrh
    jrh@mvdirona.com

  8. This was timely! I was just talking to a potential customer about SimpleDB vs RDS, or other solutions. Thanks!

  9. David, you were asking about where I would place LDAP stores like Microsoft AD, OpenLDAP, OpenDS, etc. Directory services are super important systems, but I don’t view them or recommend that they be treated as general purpose stores. I have seen some folks use them that way but it wouldn’t be my first choice or recommendation.

    –jrh
    jrh@mvdirona.com

  10. I agree that read replication is a good option as long as stale reads are permissible. It’s a good and inexpensive solution for many workloads.

    Overall, the right tool for each problem. Those that blindly apply relational systems to all problems are making a mistake as are those that avoid them by reflex.

    Thanks for the comment Chris.

    –jrh
    jrh@mvdirona.com

  11. David Hart says:

    Where would you place X.500 directory-based databases in your scheme? For example MS Active Directory, NDS/eDirectory, RHDS (formerly NSDS), OpenLDAP, OpenDS, ADS, etc.?

  12. Chris Westin says:

    There are also intermediate alternatives that offer both scale and features subject to certain constraints. You need to consider exactly what it is that you need to scale. For example, for sites that have high read rates, but low write rates, it’s possible to use something like MySQL replication. Reads can be scaled arbitrarily high, and you don’t have to give up transactions or SQL for the small number of writes that you do require. If you need to scale up your write rate, then sharding becomes a necessity.

    This topic seems to have become rather polemical, with some folks insisting on complete adherence one way (must have SQL) or the other (must avoid SQL at all costs), when you’re quite right, one size does *not* fit all, and each solution needs to be examined in the light of its requirements.

    One peculiarity I find with the NoSQL camp is that they seem to be so dead-set against SQL as a language, believing that the scaling issues derive directly from it, and that therefore the language is a failure. SQL’s taken us a long way, and I wouldn’t throw it out so lightly. It is so useful that now users of some of the non-relational databases you mention are trying to get it back for non-real-time purposes through projects like Hive and Pig.
