Wednesday, December 14, 2011

If you work in the database world, you already know Phil Bernstein. He’s the author of Principles of Transaction Processing and has a long track record as a successful and prolific database researcher. Past readers of this blog may remember Phil’s guest blog posting on Google Megastore. Over the past few years, Phil has been working on an innovative NoSQL system based upon flash storage. I like the work because it pushes the limit of what can be done on a single server, with transaction rates approaching 400,000 per second, leverages the characteristics of flash storage in a thought-provoking way, and employs interesting techniques such as log-only storage.

 

Phil presented Hyder at the Amazon ECS series a couple of weeks back (a past ECS presentation: High Availability for Cloud Computing Database Systems).

 

In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time. Each transaction posts its after-image to the shared log. One core does OCC and rolls forward the log. The database is a binary search tree serialized into the log (a B-tree would work equally well in this application). Because the log is effectively a no-overwrite, log-only datastore, a changed node requires that its parent point to the new copy, which forces the parent to be rewritten as well. That parent’s parent then needs updating, and this cascading set of changes proceeds up to the root on each update.

 

The tree is maintained via copy-on-write semantics where updates are written to the front of the log with references to unchanged tree nodes pointing back to the appropriate locations in the log. Whenever a node changes, the changed node is written to the front of the log. Consequently, every database change results in rewriting all nodes on the path up to the root of the search tree.
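To make the copy-on-write mechanics concrete, here is a minimal sketch of a log-structured, copy-on-write binary search tree (my own illustration; the node layout, the Python list standing in for the log, and all names are assumptions, not Hyder’s actual on-flash format):

```python
# Illustrative sketch only: a log-structured, copy-on-write binary search tree.
# The list-as-log, node layout, and names are assumptions, not Hyder's format.
log = []                             # append-only log; a "pointer" is a log offset

def append(node):
    log.append(node)
    return len(log) - 1              # offset of the newly written node

def insert(root_ptr, key, value):
    """Write a new version of the tree; returns the offset of the new root."""
    if root_ptr is None:
        return append({"key": key, "value": value, "left": None, "right": None})
    node = dict(log[root_ptr])       # copy-on-write: never overwrite the old node
    if key < node["key"]:
        node["left"] = insert(node["left"], key, value)
    elif key > node["key"]:
        node["right"] = insert(node["right"], key, value)
    else:
        node["value"] = value
    return append(node)              # every node on the path to the root is rewritten

root = None
for k in (5, 2, 8):
    root = insert(root, k, "v%d" % k)    # old roots remain valid after each update
```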

 

This has the downside of requiring many tree nodes to be updated on each database update but has the upside of the writes all being sequential at the front of the log. Since it is a no-overwrite store, the old nodes remain after an update, so transactional time travel is easy. An old search tree root still points to a complete tree that was current as of the point in time when that root was the current root of the search tree. As new nodes are written, some old nodes are no longer part of the current search tree and can be garbage collected over time.

Transactions are implemented by writing an intention record to the front of the log containing all changes required by the transaction; these tree nodes point either to other nodes within the intention record or to unchanged nodes further back in the log. This can be done quickly, and all updates can proceed in parallel without the need for locking or synchronization.

 

Before the transaction can complete, it must be checked for conflicts using Optimistic Concurrency Control. If there are no conflicts, the root of the search tree is atomically advanced to point to the new root and the transaction is acknowledged as successful. If the transaction is in conflict, it is failed, the tree root is not advanced, and the intention record becomes garbage.
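A much-simplified sketch of this commit step might look like the following (my own simplification; Hyder’s actual meld algorithm, described in the PVLDB paper below, can merge non-conflicting intention records rather than simply failing them):

```python
# Simplified sketch of the optimistic commit check; not Hyder's actual meld
# algorithm, which can merge non-conflicting intention records instead of failing.
class Database:
    def __init__(self):
        self.current_root = 0
        self.committed_writes = []   # write-key sets of committed transactions, in log order

    def try_commit(self, snapshot_pos, read_keys, write_keys, new_root):
        # Conflict check: did anything committed after our snapshot write a key we read?
        for later in self.committed_writes[snapshot_pos:]:
            if later & read_keys:
                return False                     # conflict: intention record becomes garbage
        self.committed_writes.append(set(write_keys))
        self.current_root = new_root             # atomically advance the root pointer
        return True

db = Database()
print(db.try_commit(0, read_keys={"x"}, write_keys={"x"}, new_root=41))   # True
print(db.try_commit(0, read_keys={"x"}, write_keys={"y"}, new_root=42))   # False: "x" changed
```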

 

Most of the transactional update work can be done concurrently without locks but two issues come to mind quickly:

 

1)      Garbage collection: because the system is constantly rewriting large portions of the search tree, old versions of the tree are spread throughout the log and the space they occupy needs to be reclaimed.

2)      Transaction Rate: The transaction rate is limited by the rate at which conflicts can be checked and the tree root advanced.

 

The latter is the bigger concern, and much of the presentation focuses on how quickly this bottleneck can be processed. Phil showed that rates of roughly 400,000 transactions per second were obtained in performance testing, so this is a hard limit, but it is a fairly high one. This design can go a long way before partitioning is required.

 

If you want to dig deeper, the Hyder presentation is at:

http://mvdirona.com/jrh/TalksAndPapers/Hyder4Amazon5Dec2011.pdf

 

More detailed papers can be found at:

 

Philip A. Bernstein, Colin W. Reid, Sudipto Das: Hyder - A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20

http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf

 

Philip A. Bernstein, Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)

http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf

 

Colin W. Reid, Philip A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)

http://sites.computer.org/debull/A10dec/hyder.pdf

 

Mahesh Balakrishnan, Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement: Flash-Log - A High Throughput Log. DISC 2010: 401-403

http://www.springerlink.com/content/c732l27h3mrn3170/

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, December 14, 2011 9:43:25 AM (Pacific Standard Time, UTC-08:00)
Software
 Sunday, November 27, 2011

While at Microsoft I hosted a weekly talk series called the Enterprise Computing Series (ECS) where I mostly scheduled technical talks on server and high-scale service topics. I said “mostly” because the series occasionally roamed as far afield as having an ex-member of the Ferrari Formula 1 team present. Client-side topics also occasionally made the list, either because I particularly liked the work or technology behind them or thought the topic was broadly relevant.

 

The Enterprise Computing Series has an interesting history. It was started by Jim Gray at Tandem. Pat Helland picked up the mantle from Jim and ran it for years before moving to Andy Heller’s HAL Computer Systems. He continued the ECS at HAL and then brought it with him when he joined Microsoft, where he ran it for many more years. Pat eventually passed it to me, and I hosted the ECS series for 8 or 9 years myself before moving to Amazon Web Services. Ironically, when I arrived at Amazon, I found that Pat Helland had again created a series in the same vein as the ECS, the Principals of Amazon (PoA) series.

 

The PoA series is excellent but it doesn’t include external speakers and is hosted on a fixed day of the week so I occasionally come across a talk that I would like to host at Amazon that doesn’t fit the PoA. For those occasions, the Enterprise Computing Series lives on!

 

In this ECS talk Ashraf Aboulnaga of the University of Waterloo presented High Availability for Database Systems in Cloud Computing Environments. Ashraf presented two topics: 1) RemusDB: database high availability using virtualization, and 2) DBECS: database high availability using eventually consistent cloud storage. The first topic was based upon the VLDB 2011 Best Paper Award winner “RemusDB: Transparent High Availability for Database Systems” by Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Ken Salem, and Andrew Warfield. The second topic is work that is not yet published and not as fully developed.

 

Focusing on the first paper, they built an active/standby database system using Remus. Remus implements transparent high availability for Xen VMs. It does this by reflecting all writes to memory in the active virtual machine to the non-active, backup VM. Remus keeps the backup VM ready to take over with exactly the same memory state as the primary server; on failover, the backup comes up with the same memory contents, including an already-warm cache.

Remus is a simple and easy-to-understand approach to getting very fast takeover from a primary VM. The challenge is that memory write latencies are a fraction of network latencies, so any solution that turns memory write latencies into network write latencies simply will not perform adequately for most workloads. Remus tackles this problem using the expected solution: batching many changes into a single network transfer. By default, every 25msec Remus suspends the primary VM, copies all changed pages to a Dom0 (hypervisor) buffer, and then allows the VM to continue. The Dom0 buffer is used to minimize the length of time that the guest VM needs to be suspended but comes at the expense of requiring sufficient Dom0 memory for the largest group of changed pages in any 25msec interval.

 

Once the guest machine’s changed pages are copied to Dom0, the primary VM is released from the suspended state, and the changes just copied to Dom0 are then transferred to the secondary system and applied to the ready-to-run backup VM.
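A toy sketch of that checkpoint cycle (the in-memory dictionaries and function names are illustrative stand-ins, not the real Xen/Remus interfaces):

```python
import time

CHECKPOINT_INTERVAL = 0.025            # Remus default: roughly 25 ms between checkpoints

primary_pages = {0: b"a", 1: b"b"}     # toy stand-in for guest memory
backup_pages = dict(primary_pages)     # warm standby starts as an exact copy
dirty = set()

def guest_write(page, data):           # the guest runs freely between checkpoints
    primary_pages[page] = data
    dirty.add(page)

def checkpoint():
    # 1) suspend the guest, 2) copy changed pages into a Dom0 buffer,
    # 3) resume the guest, 4) ship the buffered delta and apply it on the backup
    buffered = {p: primary_pages[p] for p in dirty}
    dirty.clear()
    backup_pages.update(buffered)

guest_write(1, b"c")
time.sleep(CHECKPOINT_INTERVAL)
checkpoint()                           # backup_pages now matches primary_pages again
```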

 

The downsides to the Remus approach are: 1) a potentially large Dom0 buffer is required, 2) up to 25msec of forward progress can be lost on failover, and 3) the checkpoint work consumes considerable resources, including time. The time to copy the changed pages may be acceptable, but the other overheads are sufficiently high that it is very difficult to host demanding workloads like databases on Remus.

 

The authors tackle this problem by noting that Remus actually does more than is needed for database workloads. Or, worded differently, a Remus optimized for database workloads can dramatically reduce the implementation overhead. They introduced the following optimizations:

·         Asynchronous checkpoint compression: Maintain an LRU buffer of recent pages and only ship a delta of these pages. This optimization is based upon the assumption that DB systems modify some pages frequently and typically only change a small part of these pages between checkpoints.

·         Disk read tracking: don’t mark pages read from disk as dirty since they are already available to the backup server via an I/O operation

·         Memory deprotection: allows the DB to declare regions of memory that don’t need to be replicated. This turned out not to be as powerful an optimization as the others and had the further downside of requiring database engine changes.

·         Network optimization/Commit protection: Remus buffers every outgoing network packet to ensure clients never see the results of unsafe execution, but this increases latency by not allowing any response back to the client until the next Remus checkpoint. Because DBs can fail and transactions can be aborted, the DB optimization is to send all packets back to the client in real time except for commit, abort, or other transaction-state-changing operations (sketched below). On failover, any client in an unprotected network state (results have been sent since the last checkpoint) has its transaction failed. A correct client will re-run the transaction and proceed without issue.
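A small sketch of the commit protection rule (my own simplification; the packet types and function names are assumptions, not the RemusDB implementation):

```python
# Simplified illustration of commit protection; packet types and names are
# assumptions, not the RemusDB implementation.
STATE_CHANGING = {"COMMIT", "ABORT"}
held_back = []

def deliver(packet):
    print("to client:", packet["type"])

def send_to_client(packet, checkpointed):
    """Release ordinary result packets immediately; hold transaction-state changes."""
    if packet["type"] in STATE_CHANGING and not checkpointed:
        held_back.append(packet)          # wait for the next acknowledged checkpoint
    else:
        deliver(packet)

def on_checkpoint_ack():
    while held_back:
        deliver(held_back.pop(0))         # now safe: the backup has this state

send_to_client({"type": "ROW"}, checkpointed=False)      # delivered immediately
send_to_client({"type": "COMMIT"}, checkpointed=False)   # buffered
on_checkpoint_ack()                                      # COMMIT released after the checkpoint
```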

 

What was achieved is Remus-style fast-failover protection for database workloads with far lower replication overhead. The authors used the database transaction benchmark TPC-C to show that Remus with the DB optimizations has all the protection of stock Remus but with roughly 1/10th the overhead.

 

                Slides: http://mvdirona.com/jrh/TalksAndPapers/AshrafAboulnaga20111114.pdf

                VLDB Paper: http://www.cs.uwaterloo.ca/~ashraf/pubs/pvldb11remusdb.pdf

 

I'm not 100% convinced Remus is the best solution to the database high availability problem, but I like the approach, learned from the proposed optimizations, and enjoyed the talk. Thanks to Pradeep Madhavarapu, who leads part of the Amazon database kernel engineering team (and is hiring :-)), for organizing this talk and to Ashraf Aboulnaga for giving it.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, November 27, 2011 12:50:18 PM (Pacific Standard Time, UTC-08:00)
Software
 Sunday, October 23, 2011

From the Last Bastion of Mainframe Computing Perspectives post:

 

The networking equipment world looks just like the mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high-speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are single sourced and vertically integrated. Just as you couldn’t run IBM MVS on a Burroughs computer, you can’t run Cisco IOS on Juniper equipment. When networking gear is purchased, it’s packaged as a single-sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others. Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.

 

Last week the Open Networking Summit was hosted at Stanford University. The conference focused on Software Defined Networking (SDN) in general and OpenFlow specifically. Software defined networking separates the router control plane, responsible for deciding what is in the routing table, from the data plane that makes per-packet forwarding decisions on the basis of what is actually in the routing table. Historically, both operations have been implemented monolithically in each router. SDN separates these functions, allowing networking equipment vendors to compete on how efficiently they route packets on the basis of instructions from a separate SDN control plane.
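As a toy illustration of the split (this is just the idea, not the OpenFlow protocol or its message formats): the controller computes what should be in the flow table and pushes match-to-action entries down, while the switch’s data plane simply looks packets up against the installed table.

```python
# Toy illustration of the control/data plane split; not the OpenFlow protocol.
class Switch:                               # data plane: forwards based on installed rules
    def __init__(self):
        self.flow_table = {}                # match (dst prefix) -> action (output port)

    def forward(self, packet):
        return self.flow_table.get(packet["dst_prefix"], "send_to_controller")

class Controller:                           # control plane: decides what goes in the table
    def install_route(self, switch, dst_prefix, port):
        switch.flow_table[dst_prefix] = "out_port_%d" % port

ctrl, sw = Controller(), Switch()
ctrl.install_route(sw, "10.1.0.0/16", port=3)
print(sw.forward({"dst_prefix": "10.1.0.0/16"}))     # -> out_port_3
print(sw.forward({"dst_prefix": "192.168.0.0/16"}))  # -> send_to_controller (table miss)
```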

 

In the words of OpenFlow founder Nick McKeown, Software Defined Networks (SDN) will: 1) empower network owners/operators, 2) increase the pace of network innovation, 3) diversify the supply chain, and 4) build a robust foundation for future networking innovation.

 

This conference was a bit of a coming of age for software defined networking for a couple of reasons. First, an excellent measure of relevance is who showed up to speak at the conference. From academia, attendees included Scott Shenker (Berkeley), Nick McKeown (Stanford), and Jennifer Rexford (Princeton).  From industry most major networking companies were represented by senior attendees including Dave Ward (Juniper), Dave Meyer (Cisco), Ken Duda (Arista), Mallik Tatipamula (Ericsson), Geng Lin (Dell), Samrat Ganguly (NEC),  and Charles Clark (HP). And some of the speakers from major networking user companies included: Stephen Stuart (Google), Albert Greenberg (Microsoft), Stuart Elby (Verizon), Rainer Weidmann (Deutsche Telekom), and Igor Gashinsky (Yahoo!). The full speaker list is up at: http://opennetsummit.org/speakers.html.

 

The second data point in support of SDN really coming of age was Dave Meyer, Cisco Distinguished Engineer, saying during his talk that Cisco was “doing OpenFlow”. I’ve always joked that Cisco would rather go bankrupt than support OpenFlow, so this one definitely caught my interest. Since I wasn’t in attendance myself during Dave’s talk, I checked in with him personally. He clarified that it wasn’t a product announcement. They have OpenFlow running on Cisco gear but “no product plans have been announced at this time”. Still exciting progress, and hats off to Cisco for taking the first step. Good to see.

 

If you want a good summary of what Software Defined Networking is, perhaps the best description is the set of slides that Nick presented at the conference: http://mvdirona.com/jrh/TalksAndPapers/NickMckeown_ON%20Summit%20NickM%2010%202011.pdf.

 

If you are interested in what Cisco’s Dave Meyer presented at the summit, I’ve posted his slides here: http://mvdirona.com/jrh/TalksAndPapers/DavidMeyer_openflow_and_sdn_for_enterprises.pdf.

 

Other related postings I’ve made:

·         Datacenter Networks are in my Way

·         Stanford Clean Slate CTO Summit

·         Changes in Networking Systems

·         Software Load Balancing Using Software Defined Networking

 

Congratulations to the Stanford team for hosting a great conference and for helping to drive software defined networking from a great academic idea to what is rapidly becoming a supported option industry-wide.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 23, 2011 7:57:07 AM (Pacific Standard Time, UTC-08:00)
Hardware | Software
 Tuesday, August 16, 2011

I got a chance to chat with Eric Baldeschwieler while he was visiting Seattle a couple of weeks back and catch up on what’s happening in the Hadoop world at Yahoo! and beyond. Eric recently started Hortonworks, whose tag line is “architecting the future of big data.” I’ve known Eric for years, from when he led the Hadoop team at Yahoo!, most recently as VP of Hadoop Engineering. It was Eric’s team at Yahoo! that contributed much of the code in Hadoop, Pig, and ZooKeeper.

 

Many of that same group form the core of Hortonworks, whose mission is to revolutionize and commoditize the storage and processing of big data via open source. Hortonworks continues to supply Hadoop engineering to Yahoo!, and Yahoo! is a key investor in Hortonworks along with Benchmark Capital. Hortonworks intends to continue to leverage the large Yahoo! development, test, and operations team. Yahoo! has over 1,000 Hadoop users and is running Hadoop over many clusters, the largest of which was 4,000 nodes back in 2010. Hortonworks will be providing level 3 support for Yahoo! engineering.

 

From Eric’s slides at the 2011 Hadoop Summit, the Hortonworks objectives:

      Make Apache Hadoop projects easier to install, manage & use

        Regular sustaining releases

        Compiled code for each project (e.g. RPMs)

        Testing at scale

      Make Apache Hadoop more robust

        Performance gains

        High availability

        Administration & monitoring

      Make Apache Hadoop easier to integrate & extend

        Open APIs for extension & experimentation

 

Hortonworks Technology Roadmap:

·         Phase 1: Making Hadoop Accessible (2011)

o   Release the most stable Hadoop version ever

o   Release directly usable code via Apache (RPMs, debs,…)

o   Frequent sustaining releases off of the stable branches

·         Phase 2: Next Generation Apache Hadoop (2012)

o   Address key product gaps (Hbase support, HA, Management, …)

o   Enable community and partner innovation via modular architecture & open APIs

o   Work with community to define integrated stack

 

Next generation Apache Hadoop:

·         Core

o   HDFS Federation

o   Next Gen MapReduce

o   New Write Pipeline (HBase support)

o   HA (no SPOF) and Wire compatibility

·         Data - HCatalog 0.3

o   Pig, Hive, MapReduce and Streaming as clients

o   HDFS and HBase as storage systems

o   Performance and storage improvements

·         Management & Ease of use

o   All components fully tested and deployable as a stack

o   Stack installation and centralized config management

o    REST and GUI for user tasks

 

Eric’s presentation from Hadoop Summit 2011 where he gave the keynote: Hortonworks: Architecting the Future of Big Data

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, August 16, 2011 4:49:00 PM (Pacific Standard Time, UTC-08:00)
Software
 Monday, August 01, 2011

It’s a clear sign that the Cloud Computing market is growing fast and the number of cloud providers is expanding quickly when startups begin to target cloud providers as their primary market. It’s not unusual for enterprise software companies to target cloud providers as well as their conventional enterprise customers, but I’m now starting to see startups building products aimed exclusively at cloud providers. Years ago, when there were only a handful of cloud services, targeting this market made no sense. There just weren’t enough buyers to make it an interesting market. And many of the larger cloud providers are heavily biased toward internal development, further reducing the addressable market size.

 

A sign of the maturing of the cloud computing market is that there are now many companies interested in offering a cloud computing platform, not all of which have substantial systems software teams. There is now a much larger number of companies to sell to, and many are eager to purchase off-the-shelf products. Cloud providers have actually become a viable market to target in that there are many providers of all sizes, and the overall market continues to expand faster than any I’ve seen over the last 25 years.

 

An excellent example of this new trend of startups aiming to sell to the Cloud Computing market is SolidFire, which targets the high-performance block storage market with what can be loosely described as a distributed Storage Area Network. Enterprise SANs are typically expensive, single-box, proprietary hardware. Enterprise SANs are mostly uninteresting to cloud providers due to high cost and the hard scaling limits that come from scale-up solutions. SolidFire implements a virtual SAN over a cluster of up to 100 nodes. Each node is a commodity 1RU, 10-drive storage server. They are focused on the most demanding random IOPS workloads such as databases, and all 10 drives in the SolidFire node are solid state storage devices. The nodes are interconnected by 2x 1GigE and 2x 10GigE networking ports.

 

Each node can deliver a booming 50,000 IOPS, and the largest supported cluster, at 100 nodes, can support 5M IOPS in aggregate. The 100-node cluster may sound like a hard service scaling limit, but multiple storage clusters can be used to scale to any level. Needing multiple clusters has the disadvantage of possibly fragmenting the storage but the advantage of dividing the fleet up into sub-clusters with rigid fault containment between them, limiting the negative impact of software problems. Reducing the “blast radius” of a failure makes moderate-sized sub-clusters a very good design point.

 

Offering a distributed storage solution isn’t that rare – there are many out there. What caught my interest at SolidFire was 1) their exclusive use of SSDs, and 2) an unusually nice quality of service (QoS) approach. Going exclusively with SSDs makes sense for block storage systems aimed at high random IOPS workloads, but SSDs are not a great solution for storage-bound workloads. The storage for those workloads is normally more cost-effectively hosted on hard disk drives. For more detail on where SSDs are a win and where they are not:

 

·         When SSDs Make Sense in Server Applications

·         When SSDs Don’t Make Sense in Server Applications

·         When SSDs make sense in Client Applications (just about always)

 

The usual approach is to do both, but SolidFire wanted a single, SSD-optimized solution that would be cost effective across all workloads. For many cloud providers, especially the smaller ones, a single versatile solution has significant appeal.

 

The SolidFire approach is pretty cool. They exploit the fact that SSDs have abundant IOPS but are capacity constrained, and they trade off IOPS to get capacity. Dave Wright, the SolidFire CEO, describes the design goal as SSD performance at a spinning-media price point. The key tricks employed:

·         Multi-Level Cell Flash: They use MLC flash memory storage since it is far cheaper than Single Level Cell, the slightly lower IOPS rate supported by MLC is still more than all but a handful of workloads require, and they can address the accelerated wear issues of MLC in the software layers above

·         Compression: Aggregate workload dependent gains estimated to be 30 to 70%

·         Data Deduplication: Aggregate workload dependent gains estimated to be 30 to 70%

·         Thin Provisioning: Only allocate blocks to a logical volume as they are actually written to. Many logical volumes never get close to the actual allocated size.

·         Performance Virtualization: Spread all volumes over many servers. Spreading the workload at a sub-volume level allows more control of meeting the individual volume performance SLA with good utilization and without negatively impacting other users

 

The combination of the capacity gains from thin provisioning, deduplication, and compression brings the dollars per GB of the SolidFire solution very near to some hard-disk-based solutions at nearly 10x the IOPS performance.
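As a back-of-envelope illustration (every number below is my own assumption, not a SolidFire figure): if raw MLC flash runs roughly $1/GB and compression, deduplication, and thin provisioning each multiply effective capacity, the cost per provisioned GB drops quickly.

```python
# Illustrative arithmetic only; all numbers are assumptions, not SolidFire data.
raw_flash_cost_per_gb = 1.00          # MLC flash, rough 2011 $/GB
compression_gain = 1.5                # 30-70% savings maps to roughly 1.4x-3.3x
dedup_gain = 1.5
thin_provisioning_gain = 2.0          # volumes rarely fill their allocated size

capacity_multiplier = compression_gain * dedup_gain * thin_provisioning_gain
effective_cost_per_gb = raw_flash_cost_per_gb / capacity_multiplier
print("%.1fx effective capacity, $%.2f per provisioned GB" %
      (capacity_multiplier, effective_cost_per_gb))   # 4.5x, ~$0.22/GB
```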

 

The QoS solution is elegant in that a small number of settings allow multiple classes of storage to be sold. Each logical volume has two QoS dimensions: 1) bandwidth and 2) IOPS. Each has a min, max, and burst setting. The min sets a hard floor where capacity is reserved to ensure the resource is always available. The burst is the hard ceiling that prevents a single user from consuming excess resources. The max is essentially the target: if you run below the max, you build up credits that allow a certain amount of time above the max. The burst limits the potential negative impact on other users of excursions above the max.
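A rough sketch of how min/max/burst accounting might work (my own interpretation of the description above, not SolidFire’s actual scheduler):

```python
# Illustrative sketch of per-volume IOPS control with min/max/burst; not
# SolidFire's actual scheduler.
class VolumeQoS:
    def __init__(self, min_iops, max_iops, burst_iops):
        self.min_iops, self.max_iops, self.burst_iops = min_iops, max_iops, burst_iops
        self.credits = 0                       # earned by running under max

    def allowance(self, demanded_iops):
        """IOPS this volume may issue in the next one-second window."""
        ceiling = min(self.burst_iops, self.max_iops + self.credits)
        granted = min(demanded_iops, ceiling)
        # Running under max banks credits; bursting above max spends them.
        self.credits = max(0, self.credits + self.max_iops - granted)
        return max(granted, min(demanded_iops, self.min_iops))   # min is always reserved

vol = VolumeQoS(min_iops=500, max_iops=2000, burst_iops=5000)
print(vol.allowance(1000))   # 1000: under max, banks credits
print(vol.allowance(4000))   # 3000: bursts above max by spending banked credits
```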

 

Such a QoS scheme can support workloads that need dead-reliable, never-changing I/O rates. It can also support a dead-reliable average case with rare excursions above it (e.g., during a database checkpoint). It’s also easy to support workloads that soak up whatever resources are left over after satisfying the most demanding workloads, without impacting other users. Overall, a nice, simple, and very flexible solution to a very difficult problem.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, August 01, 2011 12:49:03 AM (Pacific Standard Time, UTC-08:00)
Software
 Thursday, June 23, 2011

Earlier this week, I was in Athens, Greece attending the annual conference of the ACM Special Interest Group on Management of Data. SIGMOD is one of the top two database events held each year, attracting academic researchers and leading practitioners from industry.

 

I kicked off the conference with the plenary keynote. In this talk I started with a short retrospective on the industry over the last 20 years. In my early days as a database developer, things were moving incredibly quickly. Customers were loving our products, the industry was growing fast, and yet the products really weren’t all that good. You know you are working on important technology when customers are buying like crazy and the products aren’t anywhere close to where they should be.

 

In my first release as lead architect on DB2 20 years ago, we completely rewrote the DB2 database engine process model, moving from a process-per-connected-user model to a single process where each connection consumes only a single thread, supporting many more concurrent connections. It was a fairly fundamental architectural change completed in a single release. And in that same release, we improved TPC-A performance by a booming factor of 10 and then did 4x more in the next release. It was a fun time and things were moving quickly.

 

From the mid-90s through to around 2005, the database world went through what I refer to as the dark ages. DBMS code bases had grown to the point where the smallest was more than 4 million lines of code, the commercial system engineering teams would no longer fit in a single building, and the number of database companies shrunk throughout the entire period down to only 3 major players. The pace of innovation was glacial and much of the research during the period was, in the words of Bruce Lindsay, “polishing the round ball”. The problem was that the products were actually passably good, customers didn’t have a lot of alternatives, and nothing slows innovation like large teams with huge code bases.

 

In the last 5 years, the database world has become exciting again. I’m seeing more opportunity in the database world now than at any other time in the last 20 years. It’s now easy to get venture funding to do database products, and the number and diversity of viable products are exploding. My talk focused on what changed, why it happened, and some of the technical backdrop influencing these changes.

 

A background thesis of the talk is that cloud computing removes two of the primary reasons why customers used to be stuck standardizing on a single database engine even though some of their workloads may have run poorly on it. The first is cost. Cloud computing reduces costs dramatically (some of the cloud economics argument: http://perspectives.mvdirona.com/2009/04/21/McKinseySpeculatesThatCloudComputingMayBeMoreExpensiveThanInternalIT.aspx) and charges by usage rather than via an annual enterprise license. One of the favorite lock-ins of the enterprise software world is the enterprise license. Once you’ve signed one, you are completely owned and it’s hard to afford to run another product. My fundamental rule of enterprise software is that any company that can afford to give you a 50% to 80% reduction from “list price” is pretty clearly not a low-margin operator. That is the way much of the enterprise computing world continues to work: start with a crazy price, negotiate down to a half-crazy price, and then feel like a hero while you contribute to incredibly high profit margins.

 

Cloud computing charges by use in small increments, and any of the major database or open source offerings can be used at low cost. That is certainly a relevant reason, but the really significant factor is the offloading of administrative complexity to the cloud provider. One of the primary reasons to standardize on a single database is that each is so complex to administer that it’s hard to have sufficient skill on staff to manage more than one. Cloud offerings like the AWS Relational Database Service transfer much of the administrative work to the cloud provider, making it easy to choose the database that best fits the application and to have many specialized engines in use across a given company.

 

As costs fall, more workloads become practical and existing workloads get larger. For example, if analyzing three months of customer usage data has value to the business and it becomes affordable to analyze two years instead, customers correctly want to do it. The plunging cost of computing is fueling database size growth at a super-Moore pace, requiring either partitioned (sharded) or parallel DB engines.

 

Customers now have larger and more complex data problems, they need the products always online, and they are now willing to use a wide variety of specialized solutions if needed. Data intensive workloads are growing quickly and never have there been so many opportunities and so many unsolved or incompletely solved problems. It’s a great time to be working on database systems.

 

·         The slides from the talk: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Sigmod2011Keynote.pdf

·         Proceedings extended abstract: http://www.sigmod2011.org/keynote_1.shtml

·         Video of talk: https://services.choruscall.eu/links/sigmod1106.html# (select June 14th to get to the video)

The talk video is available but, unfortunately, only to ACM digital library subscribers (thanks to Simon Leinen for pointing out the availability of the video link above).

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, June 23, 2011 3:21:11 PM (Pacific Standard Time, UTC-08:00)
Software
 Monday, May 23, 2011

Guido van Rossum was at Amazon a week back doing a talk. Guido presented 21 Years of Python: From Pet Project to Programming Language of the Year.

 

The slides are linked below and my rough notes follow:

·         Significant Python influencers:

o   Algol 60, Pascal, C

o   ABC

o   Modula-2+ and Modula-3

o   Lisp and Icon

·         ABC was the strongest language influencer of this set

·         ABC design goals:

o   Professionals but not professional programmers (lab personnel, scientists, etc.)

o   Easy to teach, easy to learn, easy to use

·         Parts of ABC most liked by Guido:

o   Design iterations based on user testing

§  E.g. colon before indented blocks

o   Simple design: IF, WHILE, FOR, …

o   Indentation for grouping (Knuth, occam)

o   Tuples, lists, dictionaries (though changed)

o   Immutable data types

o   No limits

o   The >>> prompt

·         Parts of ABC that most needed improvement:

o   Monolithic design – not extensible

§  E.g. no graphics, not easily added

o   Invented non-standard terminology

§  E.g. “how-to” instead of “procedure”

o   ALL-CAPS keywords

o   No integration with rest of system

§  No file-based I/O (persistent variables instead)

·         The beginnings of Python:

o   Amoeba project at CWI

§  Writing apps in C and sh and wanting something in between

·         Python design philosophy:

o   Borrow ideas whenever it makes sense

o   As simple as possible, no simpler (Einstein)

o   Do one thing well (UNIX)

o   Don’t fret about performance (fix it later)

o   Go with the flow (don’t fight environment)

o   Perfection is the enemy of the good

o   Cutting corners is okay (get back to it later)

·         User Centric Design Philosophy:

o   Avoid platform ties, but not religiously

o   Don’t bother the user with details

o   Discourage but allow coding to the platform

o   Offer multiple levels of extensibility

o   Errors should not be fatal, if possible

o   Errors should never pass silently

o   Don’t blame the user for bugs in Python

·         Core language stabilized quickly in the 1990 to 1991 timeframe

·         Early days of active Python community:

o   1990 – internal at CWI

§  More internal use than ABC ever had

§  Internal contributors

o   1991 – first release; python-list@cwi.nl

o   1994 – USENET group comp.lang.python

o   1994 – first workshop (NIST)

o   1995-1999 – from workshops to conferences

o   1995 – Python Software Association

o   1997 – www.python.org goes online

o   1999 – Python Consortium

§  Modeled after X Consortium

o   2001 – Python Software Foundation

§  Modeled after Apache Software Foundation

·         Present day Python community:

o   PSF runs largest annual Python conference

§  PyCon Atlanta in 2011: 1500 attendees

§  2012-2013: Toronto; 2014-2015: Bay area

§  Also sponsors regional PyCons world-wide

o   EuroPython since 2002

o   Many local events, user groups

o   python.org

o   docs.python.org, mail.python.org, bugs.python.org, hg.python.org, planet.python.org, wiki.python.org

o   Stackoverflow etc.

·         Python 2 vs Python 3

o   Fixing deep bugs intrinsic in the design

o   Avoid two extremes:

§  perpetual backwards compatibility (C++)

§  rewrite from scratch (Perl 6)

o   Our approach:

§  evolve the implementation gradually

§  some backwards incompatibilities

§  separate tools to help users cope

 

Thanks to Guido for doing the well received Python presentation.

 

Guido’s slides and blog URLS:

·         Slides: http://mvdirona.com/jrh/TalksAndPapers/GuidoVanRossum_21_years_of_python.pdf

·         Blog: http://python-history.blogspot.com

 

--jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, May 23, 2011 9:42:12 AM (Pacific Standard Time, UTC-08:00)
Software
 Friday, February 11, 2011

Ben Black always has interesting things on the go. He’s now down in San Francisco working on his startup Fastip, which he describes as “an incredible platform for operating, exploring, and optimizing data networks.” A couple of days ago Deepak Singh sent me a recent presentation of Ben’s that I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

The problem described in this talk was “Collect, index, and query trillions of high dimensionality records with seconds of latency for ingestion and response.” What Ben is doing is collecting per-flow networking data with TCP/IP 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources, and most companies haven’t really a clue about the details of what traffic is clogging them, with only weak tools to understand what traffic is flowing – the data sets are so huge that the only hope is to sample them with solutions like Cisco's NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast, and I love big data problems. Even on small networks, this form of flow tracking produces a monstrous data set very quickly. So, it’s an interesting problem in that it’s both hard and very useful to solve.

 

Ben presented three possible solutions and why they don’t work before offering his own approach. The failed approaches couldn’t cope with the high dimensionality and the sheer volume of the dataset:

1.       HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2.       Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3.       Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.

 

The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as “a business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically to eliminate expensive processing at query time, even at a cost of enormous storage consumption.” OLAP is essentially a mechanism to support fast queries over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.

 

Ben draws from two papers to implement an OLAP based solution at this scale:

·         High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

·         Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche

 

The end solution is:

·         Insert records into Cassandra.

·         Materialize lower-dimensional cuboids using bitsets and then join as needed (sketched below).

·         Perform all query steps directly in the database
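A toy sketch of the cuboid-plus-bitset idea (my illustration of the general technique from the papers above, not Fastip’s implementation; Python sets stand in for compressed, word-aligned bitmap indexes):

```python
# Toy illustration of minimal cubing with bitsets; not the production system.
from collections import defaultdict

records = [
    {"id": 0, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "octets": 1200},
    {"id": 1, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.7", "octets": 300},
    {"id": 2, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "octets": 800},
]

# One-dimensional "cuboids": dimension value -> bitset of matching record ids
cuboids = defaultdict(lambda: defaultdict(set))
for r in records:
    for dim in ("src_ip", "dst_ip"):
        cuboids[dim][r[dim]].add(r["id"])

def query(src_ip, dst_ip):
    """Join cuboids by intersecting bitsets, then aggregate the surviving records."""
    ids = cuboids["src_ip"][src_ip] & cuboids["dst_ip"][dst_ip]
    return sum(records[i]["octets"] for i in ids)

print(query("10.0.0.1", "10.0.0.9"))   # -> 1200
```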


Ben’s concluding advice:

·         Read the literature

·         Generic software is 90% wrong at scale, you just don’t know which 90%.

·         Iterate to discover and be prepared to start over

 

If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, February 11, 2011 5:41:36 AM (Pacific Standard Time, UTC-08:00)
Software
 Sunday, February 06, 2011

Last week, Sudipta Sengupta of Microsoft Research dropped by the Amazon Lake Union campus to give a talk on the flash memory work that he and the team at Microsoft Research have been doing over the past year. It’s super interesting work. You may recall Sudipta as one of the co-authors on the VL2 paper (VL2: A Scalable and Flexible Data Center Network) I mentioned last October.

 

Sudipta’s slides for the flash memory talk are posted at Speeding Up Cloud/Server Applications With Flash Memory and my rough notes follow:

·         Technology has been used in client devices for more than a decade

·         Server-side usage is more recent, and the differences between hard disk drive and flash characteristics bring some challenges that need to be managed in the on-device Flash Translation Layer (FTL) or in the operating system or application layers.

·         Server requirements are more aggressive across several dimensions including required random I/O rates and higher reliability and durability (data life) requirements.

·         Key flash characteristics:

·         10x more expensive than HDD

·         10x cheaper than RAM

·         Multi Level Cell (MLC): ~$1/GB

·         Single Level Cell (SLC): ~$3/GB

·         Laid out as a linear array of flash blocks where a block is often 128k and a page is 2k

·         Unfortunately, the unit of erasure is a full block while the unit of read or write is 2k, which makes the write-in-place technique used on disk drives unworkable.

·         Block erase is a fairly slow operation at 1500 usec whereas read or write is 10 to 100 usec.

·         Wear is an issue with SLC supporting O(100k) erases and MLC O(10k)     

·         The FTL is responsible for managing the mapping between logical pages and physical pages such that logical pages can be overwritten and hot page wear is spread relatively evenly over the device.

·         Roughly 1/3 the power consumption of a commodity disk and 1/6 the power of an enterprise disk

·         100x the ruggedness over disk drives when active

·         Research Project: FlashStore

·         Use flash memory as a cache between RAM and HDD

·         Essentially a flash-aware store where they implement a log-structured block store (essentially what the FTL does inside the device, but implemented at the application layer)

·         Changed pages are written through to flash sequentially and an in-memory index of pages is maintained so that pages can be found quickly on the flash device (a simplified sketch appears after this list)

·         On failure the index structure can be recovered by reading the flash device

·         Recently unused pages are destaged asynchronously to disk

·         A key contribution of this work is a very compact form for the index into the flash cache

·         Performance results are excellent; you can find them in the slides and the papers referenced below

·         Research Project: ChunkStash

·         A very high performance, high throughput key-value store

·         Tested on two production workloads:

·         Xbox Live Primetime online gaming

·         Storage deduplication

·         The storage deduplication test is a good one in that dedupe is most effective with a large universe of objects to run deduplication over. But a large universe requires a large index. The most interesting challenge of deduplication is to keep the index size small through aggressive compaction

·         The slides include a summary of how dedupe works and show the performance and compression ratios they have achieved with ChunkStash
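Below is a much-simplified sketch in the spirit of the FlashStore design described above (not the real system; in particular, the compact in-memory index that is the paper’s key contribution is replaced here by an ordinary dictionary):

```python
# Much-simplified sketch in the spirit of FlashStore; not the real system.
flash_log = []      # append-only: writes to flash are always sequential
index = {}          # in RAM: key -> offset of the latest version in flash_log

def put(key, value):
    flash_log.append((key, value))       # sequential write to the front of the log
    index[key] = len(flash_log) - 1      # remember where the newest version lives

def get(key):
    offset = index.get(key)
    return None if offset is None else flash_log[offset][1]

def recover():
    """After a crash, rebuild the RAM index by scanning the flash log."""
    index.clear()
    for offset, (key, _) in enumerate(flash_log):
        index[key] = offset              # later entries win: the newest version

put("page:17", b"v1")
put("page:17", b"v2")
recover()
print(get("page:17"))                    # -> b'v2'
```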

 

For those interested in digging deeper, the VLDB and USENIX papers are the right next stops:

·         http://research.microsoft.com/apps/pubs/default.aspx?id=141508 (FlashStore paper, VLDB 2010)

·         http://research.microsoft.com/apps/pubs/default.aspx?id=131571 (ChunkStash paper, USENIX ATC 2010)

·         Slides: http://mvdirona.com/jrh/talksandpapers/flash-amazon-sudipta-sengupta.pdf

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 06, 2011 1:48:10 PM (Pacific Standard Time, UTC-08:00)
Hardware | Software
 Thursday, January 13, 2011

If you have experience in database core engine development, whether professionally, on open source, or at university, send me your resume. When I joined the DB world 20 years ago, the industry was young and the improvements were coming ridiculously fast. In a single release we improved DB2 TPC-A performance by a factor of 10. Things were changing quickly industry-wide. These days, single-server DBs are respectably good. It’s a fairly well understood space. Each year more features are added and a few percent performance improvement may happen, but the code bases are monumentally large, many of the development teams are over 1,000 engineers, and things are happening anything but quickly.

 

If you are an excellent engineer, have done systems or DB work in the past, and are interested in working on the next decade’s problems in database, drop me a note (james@amazon.com).

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Thursday, January 13, 2011 8:12:54 AM (Pacific Standard Time, UTC-08:00)
Software

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
