If you work in the database world, you already know Phil Bernstein. He’s the author of Principles of Transaction Processing and has a long track record as a successful and prolific database researcher. Past readers of this blog may remember Phil’s guest blog posting on Google Megastore. Over the past few years, Phil has been working on an innovative NoSQL system based upon flash storage. I like the work because it pushes the limit of what can be done on a single server with transaction rates approaching 400,000, leverages the characteristics of flash storage in a thought provoking way, and employs interesting techniques such as log-only storage.
Phil presented Hyder at the Amazon ECS series a couple of weeks back (a past ECS presentation at: High Availability for Cloud Computing Database Systems.
In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time. Each transaction posts its after-image to the shared log. One core does OCC and rolls forward the log. The database is a binary search tree serialized into the log (A B-tree would work equally well in this application). Because the log is effectively a no-overwrite, log-only datastore, a changed node require that the parent must now point to this new node which forces the parent to be updated as well. Now its parent needs updating and this cascading set of changes proceeds to the root on each update.
The tree is maintained via copy-on-write semantics where updates are written to the front of the log with references to unchanged tree nodes pointing back to the appropriate locations in the log. Whenever a node changes, the changed node is written to the front of the log. Consequently all database changes result in changes to all nodes to the top of the search tree.
This has the downside of requiring many tree nodes to be updated on each database update but has the upside of the writes all being sequential at the front of the log. Since it is a no-overwrite store, when an update is made, the old nodes remain so transactional time travel is easy. The old search tree root still point to a complete tree that was current as of the point in time when that root was the current root of the search tree. As new nodes are written, some old nodes are no longer part of the current search tree and can be garbage collected over time.
Transactions are implemented by writing an intention log record to the front of the log with all changes required by this transaction and these tree nodes point either to other nodes within the intention record or to unchanged nodes further back in the log. This can be done quickly and all updates can proceed in parallel without need for locking or synchronization.
Before the transaction can be completed, it must now be checked for conflict using Optimistic Concurrency Control. If there are no conflicts, the root of the search tree is atomically moved to point to the new root and the transaction is acknowledged as successful. If the transaction is in conflict, it is failed and the tree root is not advanced and the intention record becomes garbage.
Most of the transactional update work can be done concurrently without locks but two issues come to mind quickly:
1) Garbage collection: because the systems is constantly rewriting large portions of the search tree, old versions of the tree a spread throughout the log and need to be recovered.
2) Transaction Rate: The transaction rate is limited by the rate at which conflicts can be checked and the tree root advanced.
The latter is the biggest concern and the rest of the presentation focuses on the rate with which this bottleneck can be processed. The presenter showed that rates in 400,000 transaction per second where obtained in performance testing so this is a hard limit but it is a fairly high hard limit. This design can go a long way before partitioning is required.
If you want to dig deeper, the Hyder presentation is at:
More detailed papers can be found at:
Philip A. Bernstein, Colin W. Reid, Sudipto Das: Hyder - A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20
Philip A. Bernstein, Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)
Colin W. Reid, Philip A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)
Mahesh Balakrishnan, Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement: Flash-Log - A High Throughput Log. DISC 2010: 401-403
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.