If you work in the database world, you already know Phil Bernstein. He’s the author of Principles of Transaction Processing and has a long track record as a successful and prolific database researcher. Past readers of this blog may remember Phil’s guest blog posting on Google Megastore. Over the past few years, Phil has been working on an innovative NoSQL system based upon flash storage. I like the work because it pushes the limit of what can be done on a single server with transaction rates approaching 400,000 per second, leverages the characteristics of flash storage in a thought-provoking way, and employs interesting techniques such as log-only storage.
Phil presented Hyder at the Amazon ECS series a couple of weeks back (a past ECS presentation is posted at High Availability for Cloud Computing Database Systems).
In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time. Each transaction posts its after-image to the shared log. One core does OCC and rolls forward the log. The database is a binary search tree serialized into the log (a B-tree would work equally well in this application). Because the log is effectively a no-overwrite, log-only datastore, a changed node requires that its parent be updated to point to the new copy, which in turn forces the grandparent to be updated, and this cascading set of changes proceeds all the way to the root on each update.
The tree is maintained via copy-on-write semantics where updates are written to the front of the log, with references to unchanged tree nodes pointing back to the appropriate locations in the log. Whenever a node changes, the changed node is written to the front of the log. Consequently, every database change results in new copies of all nodes on the path from the changed node up to the root of the search tree.
This has the downside of requiring many tree nodes to be rewritten on each database update but has the upside of the writes all being sequential at the front of the log. Since it is a no-overwrite store, when an update is made the old nodes remain, so transactional time travel is easy. An old search tree root still points to a complete tree that was current as of the point in time when that root was the current root of the search tree. As new nodes are written, some old nodes are no longer part of the current search tree and can be garbage collected over time.
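To make the copy-on-write mechanics concrete, here is a minimal sketch in Python with invented names (none of this is Hyder’s actual code or log format) of a binary search tree serialized into an append-only log. An update appends new copies of every node on the path from the change up to the root, while old roots continue to describe complete, consistent snapshots:

```python
# Sketch only: a copy-on-write binary search tree in an append-only log.
# Nodes are immutable once appended; an update copies every node on the
# path from the changed key up to the root, and each copy points back
# into the log for the unchanged subtrees.

log = []  # append-only list of nodes; an index into it is a "log address"

def append(key, value, left, right):
    """Append an immutable node and return its log address."""
    log.append({"key": key, "value": value, "left": left, "right": right})
    return len(log) - 1

def update(root, key, value):
    """Return the log address of a new root reflecting key := value."""
    if root is None:
        return append(key, value, None, None)
    node = log[root]
    if key < node["key"]:
        new_left = update(node["left"], key, value)
        return append(node["key"], node["value"], new_left, node["right"])
    elif key > node["key"]:
        new_right = update(node["right"], key, value)
        return append(node["key"], node["value"], node["left"], new_right)
    else:
        return append(key, value, node["left"], node["right"])

def lookup(root, key):
    while root is not None:
        node = log[root]
        if key == node["key"]:
            return node["value"]
        root = node["left"] if key < node["key"] else node["right"]
    return None

# Every update yields a new root; old roots still describe complete,
# consistent snapshots of the database (transactional time travel).
r1 = update(None, "b", 1)
r2 = update(r1, "a", 2)
r3 = update(r2, "a", 3)
assert lookup(r2, "a") == 2 and lookup(r3, "a") == 3
```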
Transactions are implemented by writing an intention record to the front of the log containing all the changes required by the transaction; these tree nodes point either to other nodes within the intention record or to unchanged nodes further back in the log. This can be done quickly, and all updates can proceed in parallel without locking or synchronization.
Before the transaction can be completed, it must be checked for conflicts using Optimistic Concurrency Control. If there are no conflicts, the root of the search tree is atomically moved to point to the new root and the transaction is acknowledged as successful. If the transaction is in conflict, it is failed, the tree root is not advanced, and the intention record becomes garbage.
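Here is an equally simplified sketch of that commit step, again in Python with invented names. It is not Hyder’s actual meld algorithm, which detects conflicts structurally against the tree rather than by comparing key sets, but it shows the shape of the idea: transactions build intentions in parallel, and a single committer checks each intention against what has committed since the transaction’s snapshot before atomically advancing the shared root:

```python
# Sketch only: an optimistic commit check. A transaction runs against a
# snapshot, builds its intention, and submits (snapshot_version, read_keys,
# written_keys, intention_root). The single committer aborts it if any
# later commit wrote a key it read; otherwise it advances the shared root.

current_root = None
current_version = 0
commit_history = []   # list of (version, keys written by that commit)

def try_commit(snapshot_version, read_keys, written_keys, intention_root):
    """Runs on the single commit/meld thread; returns True on success."""
    global current_root, current_version
    # OCC check: did anyone who committed after our snapshot write a key we read?
    for version, written in commit_history:
        if version > snapshot_version and written & read_keys:
            return False          # conflict: the intention record becomes garbage
    # No conflict: advancing the root is the only mutation of shared state.
    current_root = intention_root
    current_version += 1
    commit_history.append((current_version, set(written_keys)))
    return True

# Two transactions start from the same snapshot and both touch key "a";
# the intention roots here are stand-ins for roots built copy-on-write.
snap_ver = current_version
t1_root, t2_root = "intention-1", "intention-2"
assert try_commit(snap_ver, {"a"}, {"a"}, t1_root) is True    # first wins
assert try_commit(snap_ver, {"a"}, {"a"}, t2_root) is False   # second aborts
```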
Most of the transactional update work can be done concurrently without locks but two issues come to mind quickly:
1) Garbage collection: because the system is constantly rewriting large portions of the search tree, old versions of the tree are spread throughout the log and the space they occupy needs to be recovered.
2) Transaction Rate: The transaction rate is limited by the rate at which conflicts can be checked and the tree root advanced.
The latter is the biggest concern, and the rest of the presentation focuses on how quickly this bottleneck can be processed. The presenter showed that rates of roughly 400,000 transactions per second were obtained in performance testing, so while this is a hard limit, it is a fairly high one. This design can go a long way before partitioning is required.
If you want to dig deeper, the Hyder presentation is at:
http://mvdirona.com/jrh/TalksAndPapers/Hyder4Amazon5Dec2011.pdf
More detailed papers can be found at:
Philip A. Bernstein, Colin W. Reid, Sudipto Das: Hyder – A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20
http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf
Philip A. Bernstein, Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)
http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf
Colin W. Reid, Philip A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)
http://sites.computer.org/debull/A10dec/hyder.pdf
Mahesh Balakrishnan, Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement: Flash-Log – A High Throughput Log. DISC 2010: 401-403
http://www.springerlink.com/content/c732l27h3mrn3170/
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Zooko said, "So far as I’ve understood it, the same techniques would work pretty well on spinning disk as well as on SSD." Mostly true except for the technique of using log-structured storage. Log-structured stores tend to drive high random IO rates on read, so this approach isn’t ideal for hard disk drives. Basically, log-structured stores tend to sequentialize the writes at the cost of higher random read rates. This is a great trade-off on SSDs but not typically the best choice for spinning media.
Greg said, "be careful extrapolating too much from the benchmarks." Yeah, I’ve been around benchmarks for a few years now and know that one well. I used to joke about the TPC-C runs that we did as being guarantees to customers that they would not run faster :-).
–jrh
Very interesting! Thank you for the readable and substantive summary. So far as I’ve understood it, the same techniques would work pretty well on spinning disk as well as on SSD. It seems like a lot of the exciting new developments are about structured storage (databases). Me, I’m still stuck working on bulk storage, but it is going pretty well so I can’t complain.
Interesting report, but I’d be careful about extrapolating too much from the benchmarks in this research project over to real world workloads. In particular, it looks like their experimental results are all on a 2G data set that easily fits entirely in main memory on the box.
From the VLDB 2011 paper, "All of the transactions access an initial database table containing 128K key-value pairs where keys and values are both 8-byte strings … All experiments are performed on a 4-core, 8-thread Intel Xeon x5550 2.67GHz, with 4×256KB L2 cache and 8MB shared L3 cache, 12GB main memory … The entire database table and all the appended intention records are main-memory resident to avoid any I/O during meld."