Sunday, January 09, 2011

Megastore is the data engine supporting the Google Application Engine. It’s a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.

 

I wrote up some notes on it back in 2008 Under the Covers of the App Engine Datastore and posted Phil Bernstein’s excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years until this year’s CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

 

My rough notes from the paper:

·         Megastore is built upon BigTable

·         Bigtable supports fault-tolerant storage within a single datacenter

·         Synchronous replication based upon Paxos and optimized for long distance inter-datacenter links

·         Partitioned into a vast space of small databases each with its own replicated log

·         Each log stored across a Paxos cluster

·         Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition

·         3 billion writes and 20 billion read transactions per day

·         Support for consistency unusual for a NoSQL database but driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but …)

·         Data Model:

·         The data model is declared in a strongly typed schema

·         There are potentially many tables per schema

·         There are potentially many entities per table

·         There are potentially many strongly typed properties per entity

·         Repeating properties are allowed

·         Tables can be arranged hierarchically where child tables point to root tables

·         Megastore tables are either entity group root tables or child tables

·         The root table and all child tables are stored in the same entity group

·         Secondary indexes are supported

·         Local secondary indexes index a specific entity group and are maintained consistently

·         Global secondary indexes index across entity groups are asynchronously updates and eventually consistent

·         Repeated indexes: supports indexing repeated values (e.g. photo tags)

·         Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column.

·         There are physical storage hints:

·         “IN TABLE” directs Megastore to store two tables in the same underlying BigTable

·         “SCATTER” attribute prepends a 2 byte hash to each key to cool hot spots on tables with monotonically increasing values like dates (e.g. a history table).

·         “STORING” clause on an index supports index-only-access by redundantly storing additional data in an index. This avoids the double access often required of doing a secondary index lookup to find the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn’t need to be accessed to obtain these properties.

·         3 levels of read consistency:

·         Current: Last committed value

·         Snapshot: Value as of start of the read transaction

·         Inconsistent reads: used for cross entity group reads

·         Update path:

·         Transaction writes its mutations to the entity groups write-ahead log and then apply the mutations to the data (write ahead logging).

·         Write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends to log which is maintained using paxos.

·         Update rates within a entity group are seriously limited by:

·         When there is log contention, one wins and the rest fail and must be retried.

·         Paxos only accepts a very limited update rate (order 10^2 updates per second).

·         Paper reports that “limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts”

·         Implication: programmers must shard aggressively to get even moderate update rates and consistent update across shards is only supported using two phase commit which is not recommended.

·         Cross entity group updates are supported by:

·         Two-phase commit with the fragility that it brings

·         Queueing and asynchronously applying the changes

·         Excellent support for backup and redundancy:

·         Synchronous replication to protect against media failure

·         Snapshots and incremental log backups

 

Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive although the limitation of not being able to update an entity group at more than a “few per second” is limiting.

 

The paper: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf.

 

Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, January 09, 2011 10:38:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
Monday, January 10, 2011 9:37:41 AM (Pacific Standard Time, UTC-08:00)
Thanks; this is interesting.

So if we assume:

1) CAP theorem is correct.
2) Different apps have different CAP requirements, and individual functions within apps may also have specific requirements (e.g., your example of sequential number generation).
3) It's easier to write, maintain, and administer an application that uses 2 or 3 data sources than 20 or 30.

Would it be fair to say that as scalable data sources mature they will increasingly and perhaps ubiquitously allow user-selectable CAP choices in much the same way that SQL stores almost always allow user-selectable transaction consistency?
Tuesday, January 11, 2011 7:11:03 AM (Pacific Standard Time, UTC-08:00)
Hi Craig.

The cap theorem is undeniably true but, like many engineering issues, it requires judgment. Consistency can be maintained over remarkably a large scope of data. The availability that CAP forces to be given up certainly exists but can be made quite small. Overall, I would say that CAP encourage many to give up hope way too quickly and build an eventually consistent system when better choices are available with more work. The paper under discussion here shows that a high-scale store can be fully consistent over substantial volumes of data and they wisely decided to not make the entire store globally consistent. Interpreting and applying CAP is a judgement call and we see one approach here. SimpleDB chose to do both (http://perspectives.mvdirona.com/2010/02/24/ILoveEventualConsistencyBut.aspx) which is along the lines you recommend in your conclusion: “as scalable data sources mature they will increasingly and perhaps ubiquitously allow user-selectable CAP choices in much the same way that SQL stores almost always allow user-selectable transaction consistency”. I think you are right.

--jrh
Tuesday, January 11, 2011 8:46:35 AM (Pacific Standard Time, UTC-08:00)
Thanks!
Comments are closed.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<January 2011>
SunMonTueWedThuFriSat
2627282930311
2345678
9101112131415
16171819202122
23242526272829
303112345

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton