What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:
Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].
· A transaction is allowed to read and write data in an entity group.
· The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key. So in effect, an entity group is a hierarchically-linked set of entities.
· A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.
· To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.
· They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.
· Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.
· Schemas are supported.
· They offer vertical partitioning to cluster columns that are frequently accessed together.
· They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.
· Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.
· Referential integrity between the components of an entity group is not supported.
· Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship. It sounded like a materialized view that is incrementally updated asynchronously.
· It has been in use by apps for about a year.
The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.
TITLE: Megastore – Scalable Data System for User-facing Apps
SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)
User-facing apps have TP-system-like requirements
Ø data updated by a few users
Ø largely reads, small updates
Ø lots of work scaling the data layer
Ø users expect consistency
Megastore – scale, RAD, consistency, performance
Ø start with Big Table for storage
Ø add db technologies that scale: indices, schemas
Ø offer transparent replication and failover between data centers
Ø support many, frequent reads
Ø writes may be more costly, because they’re less frequent
Ø with a correct schema, the app should scale naturally
Rapid App Development
Ø hide operational aspects of scalability from app code
Ø Flexible declarative schemas (MDL) – looks like SQL
o indices, rich data types, consistency and partitioning
Ø offer transactions
Entity Group consistency is supported
Ø it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog
Ø all entities in the group have the same high-order key component
Ø roll forward transaction log per entity group, no rollbacks
o pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.
Ø A transaction over an entity group is ACID , but not across entity groups
Ø optimistic concurrency control
Ø updates are available only after commit
Ø api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)
Ø non-blocking, consistent reads (pb: does a transaction see its previous writes?)
Ø cross-entity group operations have looser consistency
Ø schemas declare their physical data locality
Ø optimized to minimize seeks, RPCs, bandwidth, and storage
Ø several ways of declaring physical locality
o entity groups
o shared primary key prefixes (collocating tables in Big Table)
o locality groups – i.e., attribute partitioning
Ø simply cost-transparent API primitives imply
o only add scalable features
o cost of writes is linear to data/indices
o avoid scalability surprises
Ø hierarchical primary keys
Ø repeated fields (I guess just like Big Table)
Ø store hierarchical data in a column (it’s unclear to me whether this is the whole entity group or only part of it)
Ø Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..
Ø uses Paxos-based algorithm, per entity group
Ø it was more complicated than they expected
Ø writes need a majority of replicas to be up in order to commit
Ø most reads are local, consistency ensured
Ø replication is consistent and synchronous
Ø automatic query failover: individual table servers may fail
Ø has been used in production for a year
Ø used by several internal tools and projects
Ø several large and user-visible apps (they wouldn’t say which ones, except we know them)
Ø used for rapidly implementing new features in older projects
Ø many other projects are migrating to it
Ø declarative schemas are good
Ø cost-transparent APIs good (SQL is not cost-transparent)
Ø avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall
Ø avoid scalability surprises
Ø schema reviews helpful
Ø consistency is necessary
Ø need a mindset of scalability and performance
James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com