Saturday, January 02, 2010

In this month’s Communications of the Association of Computing Machinery, a rematch of the MapReduce debate was staged.  In the original debate, Dave Dewitt and Michael Stonebraker, both giants of the database community, complained that:

 

1.    MapReduce is a step backwards in database access

2.    MapReduce is a poor implementation

3.    MapReduce is not novel

4.    MapReduce is missing features

5.    MapReduce is incompatible with the DBMS tools

 

Unfortunately, the original article appear to be no longer available but you will find the debate branching out from that original article by searching on the title Map Reduce: A Major Step Backwards. The debate was huge, occasionally entertaining, but not always factual.  My contribution was MapReduce a Minor Step forward.


Update: In comments, csliu offered updated URLs for the original blog post and a follow-on article:

·  MapReduce: A Major Step Backwards

· MapReduce II

 

I like MapReduce for a variety of reasons the most significant of which is that it allows non-systems programmers to write very high-scale, parallel programs with comparative ease.  There have been many attempts to allow mortals to write parallel programs but there really have only been two widely adopted solutions that allow modestly skilled programmers to write highly concurrent executions: SQL and MapReduce. Ironically the two communities participating in the debate,  Systems and Database, have each produced a great success by this measure. 

 

More than 15 years ago, back when I worked on IBM DB2, we had DB2 Parallel Edition running well over a 512 server cluster.  Even back then you could write a SQL Statement that would run over a ½ thousand servers.  Similarly, programmers without special skills can run MapReduce programs that run over thousands of serves. The last I checked Yahoo, was running MapReduce jobs over a 4,000 node cluster: Scaling Hadoop to 4,000 nodes at Yahoo!.

 

The update on the MapReduce debate is worth reading but, unfortunately, the ACM has marked the first article as “premium content” so you can only read it if you are a CACM subscriber:

·         MapReduce and Parallel DBMSs: Friend or Foe

·         MapReduce: A Flexible Data Processing Tool


Update: Moshe Vardi, Editor in Chief of the Communications of the Association of Computing Machinery has kindly decided to make both the of the above articles freely available for all whether or not CACM member. Thank you Moshe.

 

Even more important to me than the MapReduce debate is seeing this sort of content made widely available. I hate seeing it classified as premium content restricted to members only. You really all should be members but, with the plunging cost of web publishing, why can’t the above content be made freely available? But, while complaining about the ACM publishing policies, I should hasten to point out that the CACM has returned to greatness.  When I started in this industry, the CACM was an important read each month. Well, good news, the long boring hiatus is over.  It’s now important reading again and has been for the last couple of years. I just wish the CACM would follow the lead of ACM Queue and make the content more broadly available outside of the membership community.

 

Returning to the MapReduce discussion, in the second CACM article above, MapReduce: A Flexible Data Processing Tool, Jeff Dean and Sanjay Ghemawat, do a thoughtful job of working through some of the recent criticism of MapReduce.

 

If you are interested in MapReduce, I recommend reading the original Operating Systems Design and Implementation MapReduce paper: MapReduce: Simplied Data Processing on Large Clusters and the detailed MapReduce vs database comparison paper: A Comparison of Approaches to Large-Scale Data Analysis.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, January 02, 2010 8:15:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Software
 Saturday, November 14, 2009

HPTS has always been one of my favorite workshops over the years. Margo Seltzer was the program chair this year and she and the program committee brought together one of the best programs ever.  Earlier I posted my notes from Andy Bectolsheim’s session Andy Bechtolsheim at HPTS 2009 and his slides Technologies for Data Intensive Computing.

 

Two other sessions were particularly interesting and worth summarizing here. The first is a great talk on high-scale services lessons learned from Randy Shoup and a talk by John Ousterhout on RAMCloud a research project to completely eliminate the storage hierarchy and store everything in DRAM.

 

Randy Shoup

My notes from Randy’s talk follow and his slides are at: eBay’s Challenges and Lessons from Growing an eCommerce Platform to Planet Scale.

·         eBay Manages

1.       Over 89 million active users worldwide

2.       190 million items for sale in 50,000 categories

3.       Over 8 billion URL requests per day

4.       Roughly 10% of the items are listed or ended each day

5.       70B read/write operations/day

·         Architectural Lessons

1.       Partition Everything

2.       Asynchrony Everywhere

3.       Automate Everything

4.       Remember Everything Fails

5.       Embrace Inconsistency

6.       Expect Service Evolution

7.       Dependencies Matter

8.       Know which databases are Authoritativeand which are caches

9.       Never enough data (save everything)

10.   Invest in custom infrastructure

 

John Ousterhout

My notes from John’s talk follow and his slides are at: RAMCloud: Scalable Data Center Storage Entirely in DRAM. I really enjoyed this talk despite the fact that I saw the same talk presented at the Stanford Clean Slate CTO Summit. This talk is sufficiently thought provoking to be just as interesting the second time through. My notes from John’s talk:

·         Storage entirely in DRAM spread over 10s to 10s of thousands of servers

·         Focus of project:

o   Low latency and very large scale

·         ~64GB server each supporting:

o   1M ops/second

o   5 to 10 us RPC

·         Today commodity servers can stretch easily to 64GB. Expect to see 1TB in commodity servers out 5 to 10 years

·         Current cost is roughly $60/GB. Expect this to fall to $4/GB in 5 to 10 years

·         Motivation for RAMCloud project:

o   Databases don’t scale

·         Disk access rates not keeping up with capacity so disks must become archival:

o   See Jim Gray’s excellent Disk is Tape

·         Aggressive goal of achieving 5 to 10 u sec RPC

·         Points out that very low latency applications are not built upon relational databases and argues that very low data access latency removes the need for optimization of access plans and concludes the relational model will disappear.

o   I see value in low latency but don’t agree that the relational model will disappear. See One Size Does Not Fit All.

·         John makes an interesting observation “the cost of consistency increases with transaction over-lap”

o   Let:

§  0 = # overlapping transactions

§  R = arrival rate for new transactions

§  D = duration of each transaction

o   Then:

§  0 is proportional to R * D

§  R increases with system scale and, eventually, strong consistency becomes unaffordable

§  But, D decreases with lower latency

o   The interesting question: can we afford higher levels of consistency with lower latency?

·         John argues perhaps with very low latency, one size might fit all (a single data storage system could handle all workloads).

o   The counter argument to this one is that capital cost and power cost of an all memory solution appears prohibitively expensive for cold sequential workloads. Its perfect for OLTP but I don’t yet see the “one size can fit again” prediction.

 

If you are interested in digging deeper, the slides for all sessions are posted at: http://www.hpts.ws/agenda.html.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 14, 2009 8:24:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services | Software
 Saturday, November 07, 2009

Just about exactly one year ago, I posted a summary and the slides from an excellent Butler Lampson talk: The Uses of Computers: What’s Past is Merely Prologue. Its time for another installment. Butler was at SOSP 2009 a few weeks back and Marvin Theimer caught up with him for a wide ranging discussion on distributed systems.

 

With Butler's permission, what follows are Marvin’s notes from the discussion.

 

Contrast cars with airplanes: when the fancy electronic systems fail you (most-of-the-time) can pull a car over to the side of the road and safely get out whereas an airplane will crash-and-burn.

 

Systems that behave like cars vs. airplanes:

·         It’s like a car if you can reboot it

·         What’s the scope of the damage if you reboot it

 

Ensuring that critical sub-systems keep functioning:

·         Layered approach with lower layers being simpler and able to cut off the higher layers and still keep functioning

·         Bottom layers need to be simple enough to reason about or perhaps even formally verify

·         Be skeptical about designing systems that gracefully degrade/approach their “melting points”.  Nice in theory, but not likely to be feasible in practice in most cases.

·         Have “firewalls” that partition your system into independent modules so that damage is contained.

·         Firewalls have “blast doors” that automatically come down in case of alarms going off.  Under normal circumstances the blast doors are up and you have efficient, optimized interaction between modules.  When alarms go off the blast doors go down.  The system must be able to work in degraded mode with the blast doors down.

·         You need to continually test your system for how it behaves with the blast doors down to ensure that “critical functioning” is still achievable despite system evolution and environment evolution.  Problem is that testing is expensive, so there is a trade-off between extensive testing and cost.  Essentially you can’t test everything.  This is part of the reason why the lowest levels of the system need to be simple enough to formally reason about their behavior.

o   Dave Gifford’s story about bank that had diesel backup generators for when power utility failed.  They religiously tested firing up the backup generators.  However, when a prolonged power actually occurred they discovered that the generators failed after half an hour because their lubricating oil failed.  No one had thought to test running on backup power for more than a few minutes.

 

Low-level multicast is bad because you can’t reason about the consequences of its use.  Better to have application-level multicast where you can explicitly control what’s going on.

 

RPC conundrum:

·         People have moved back from RPC to async messages because of the performance issues of sync RPC.

·         By doing so they are reintroducing concurrency issues into their programs.

A possible solution:

·         Constrain your system (if you can) to enable the definition of a small number of interaction patterns that hide the concurrency and asynchrony.

·         Your experts employ async messages to implement those higher-level interaction patterns.

·         The app developers only use the simper, higher-level abstractions.

·         Be happy with 80% solution – which you might achieve – and don’t expect to be able to handle all interactions this way.

 

Partitioned, primary-key scale-out approach is essentially mimicking the OLTP model of transactional DBs.  You are giving up certain kinds of join operators in exchange for scale-out and the app developer is essentially still programming the simple ACID DB model.

·         Need appropriate patterns/framework for updating multiple objects in a non-transactional manner.

·         Standard approach: update one object and transactionally write a message in a message queue for the other object.  Transactional update to other object is done asynchronously to the first object update.  Need compensation code for when things go wrong.

·         An interesting approach for simplifying the compensation problem: try to turn it into a garbage collection problem.  Background tasks look for to-do messages that haven’t been executed and figure out how to bring the system back into “compliance”.  You need this code for your fsck case anyway.

 

WARNING: don’t over-engineer your system.  Lots of interesting ideas here; you’ll be tempted to over-generalize and make things too complicated.  “Ordinary designers get it wrong 99% of the time; really smart designers get it wrong 50% of the time.”

 

Thanks to Butler Lampson and Marvin Theimer for making this summary available.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 07, 2009 10:45:57 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Monday, September 21, 2009

Here’s another innovative application of commodity hardware and innovative software to the high-scale storage problem. MaxiScale focuses on 1) scalable storage, 2) distributed namespace, and 3) commodity hardware.

 

Today's announcement: http://www.maxiscale.com/news/newsrelease/092109.

 

They sell software designed to run on commodity servers with direct attached storage. They run N-way redundancy with a default of 3-way across storage servers to be able to survive disk and server failure. The storage can be accessed via HTTP or via Linux or Windows (2003 and XP) file system calls. The later approach requires a kernel installed device driver and uses a proprietary protocol to communicate back with the filer cluster but has the advantage of directly support local O/S read/write operations. MaxiScale architectural block diagram:

Overall I like the approach of using commodity systems with direct attached storage as the building block for very high scale storage clusters but that is hardly unique. Many companies have head down this path and, generally, it’s the right approach. What caught my interest when I spoke to the MaxiScale team last week was: 1) distributed metadata, 2) MapReduce support, and 3) small file support. Let’s look at each of these major features:

 

Distributed Metadata

File systems need to maintain a namespace. We need to maintain the directory hierarchy and we need to know where to find the storage blocks that make up the file.  In addition, other attributes and security may need to be stored depending upon the implemented file system semantics. This metadata is often stored in large key/value store. The metadata requires at least some synchronization since, for example, you don’t want to create two different objects of the same name at roughly the same time.  At high scale, storage servers will be joining and leaving the cluster all the time, so having a central metadata service is a easy approach to the problem. But, as easy as it is to implement a central metadata systems, they bring scaling limits. Eventually the metadata gets too hot and needs to be partitioned. In fairness, it’s amazing how far central metadata can be scaled but eventually hot spots develop and it needs to be partitioned. For example, Google GFS just went down this path: GFS: Evolution on Fast-forward. Partitioning metadata is a fairly well understood problem. What makes it a bit of a challenge is making the metadata system adaptive and able to re-partition when hot spots develop.

 

MaxiScale took an interesting approach to scaling the metadata. They distributed the metadata servers over the same servers that store the data rather than implement a cluster of dedicated metadata servers. They do this by hashing on the parent directory to find what they call a Peer Set and then, in that particular Peer Set, they look up the object name in the metadata store, find the file block location, and then apply the operation to the blocks in that same Peer Set.

 

Distributing the metadata over the same Peer Set as the stored data means that each peer set is independent and self-describing. The downside of having a fixed hash over the peer sets is that it’s difficult to cool down an excessively hot peer set by moving objects since the hash is known by all clients.

 

MapReduce Support

I love the MaxiScale approach to multi-server administration. They need to provide customers the ability to easily maintain multi-server clusters. They could have implemented a separate control plane to manage the all the servers that make up the cluster but, instead, they just use Hadoop and run MapReduce jobs.

 

All administrative operations are written as simple MapReduce jobs which certainly made the implementation task easier but it’s also a nice, extensible interface to allow customers to write custom administrative operations. And, since MapReduce is available over the cluster, its super easy to write data mining and data analysis jobs. Supporting MapReduce over the storage system is a nice extension of normal filer semantics.

 

Small File Support

The standard file system access model is to probe the metadata to get the block list and then access the blocks to perform the file operation. In the common case, this will require two I/Os which is fine for large files but the two I/Os can be expensive for small file access. And, since most filers use fixed size blocks and, for efficiency these block size tend to be larger than the average small file, some space is wasted. The common approach to this two problems is to pull small files “up” and rather than store the list of storage blocks in the file metadata, just store the small file. This works fine for small files and avoids both the block fragmentation and the multiple I/O problem on small files. This is what MaxiScale has done as well and they claim single I/O for any small file stored in the system.

 

More data on the MaxiScale filer: Small Files, Big Headaches: Ensuring Peak Performance

 

I love solutions based upon low-cost, commodity H/W and application maintained redundancy and what MaxiScale is doing has many of the features I would like to see in a remote filer.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Monday, September 21, 2009 6:41:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Software
 Sunday, September 13, 2009

AJAX applications are wonderful because they allow richer web applications with much of the data being brought down asynchronously. The rich and responsive user interfaces of applications like Google Maps and Google Docs are excellent but JavaScript developers need to walk a fine line. The more code they download, the richer the UI they can support and the less synchronous server interactions they need. But, the more code they download, the slower the application can be to start. This is particularly noticeable when the client cache is cold and in mobile applications with restricted bandwidth back to the server.

 

Years ago profile directed code reorganization (a sub-class of Basic Block Transforms) were implemented to solve what might appear to be an unrelated problem. The problem tackled by these profile directed basic block reorganizations is decreasing the number of last level cache misses in a server. They do this by organizing frequently accessed code segments together and moving rarely executed code segments. The biggest gain is that seldom executed error handling code can be moved away from frequently executed application code. I’ve seen reports of error handling code making up more than 40% of an application. Moving this code away from the commonly executed mainline code allows fewer processor cache lines to support program execution which demands fewer memory faults. Error handling code will execute more slowly but that is seldom an issue. Profile directed basic block transforms need to be trained on “typical” applications workloads and code that typically executes together will be placed together. Unfortunately, “typical” is often an important, industry standard benchmark like TPC-C so sometimes “typical” is replaced by “important” :-). Nonetheless, the tools are effective and greater than 20% improvement is common and we often see much more. All commercial database servers use or misuse profile directed basic block reorganizations.

 

The JavaScript download problem is actually very similar to the problem addressed by basic block transforms. Getting code from the server takes relatively long time just as getting code from memory takes a long time relative to executing code already in the processor cache.  Much of the application doesn’t execute in the common case so it makes little sense to download it all unless needed in this execution. Most of the code isn’t needed to start the application so it’s a big win to download the code, start the application, and then download what is needed in the background.

 

Last week Ben Livshits and Emre Kiciman of the Microsoft Research team released an interesting tool that does exactly this for JavaScript applications. Doloto analyses client JavaScript systems and breaks them up into a series of independent modules. The primary module is downloaded first and includes just stubs for the other modules. This primary module is smaller, downloads faster, and dramatically improves time to live application. In the Doloto team measurements, the size of the initial download was only between 20% and 60% of the size of the standard download. In the case of Google docs, the initial download was less than 20% of the original size.

Once the initial module is downloaded, the application is live and running and the rest of the modules are brought down asynchronously or faulted in as needed. Many applications due these optimizations manually but this is a nice automated approach to the problem

 

I’ve seen 80,000 line JavaScript programs and there are many out there far larger. Getting the application running fast dramatically improves the user experience and this is a nice approach to achieving that goal.  Doloto is available for download at: http://msdn.microsoft.com/en-us/devlabs/ee423534.aspx. And there is a more detailed Doloto paper at: http://research.microsoft.com/en-us/um/people/livshits/papers/pdf/fse08.pdf and summary information at: http://research.microsoft.com/en-us/projects/doloto/.   

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, September 13, 2009 9:03:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Saturday, July 25, 2009

MapReduce has created some excitement in the relational database community. Dave Dewitt and Michael Stonebraker’s MapReduce: A Major Step Backwards is perhaps the best example.  In that posting they argued that map reduce is a poor structured storage technology, the execution engine doesn’t include many of the advances found in modern, parallel RDBMS execution engines, it’s not novel, and its missing features.

 

In Mapreduce: A Minor Step Forward I argued that MapReduce is an execution model rather than storage engine. It is true that it is typically run over a file system like GFS or HDFS or simple structured storage system like BigTable or Hbase. But, it could be run over a full relational database.

 

Why would we want to run Hadoop over a full relational database?  Hadoop scales: Hadoop has been scaled to 4,000 nodes at Yahoo! Scaling Hadoop to 4000 nodes at Yahoo!.  Scaling a clustered RDBMS too 4k nodes is certainly possible but the high scale single system image cluster I’ve seen was 512 nodes (what was then called DB2 Parallel Edition). Getting to 4k is big.  Hadoop is simple: automatic parallelism has been an industry goal for decades but progress has been limited. There really hasn’t been success in allowing programmers of average skill to write massively parallel programs except for SQL and Hadoop. Programmers of bounded skill can easily write SQL that will be run in parallel over high scale clusters. Hadoop is the only other example I know where this is possible and happening regularily. 

 

Hadoop makes the application of 100s or even 1000s of nodes of commodity computers easy  so why not Hadoop over full RDBMS nodes?  Daniel Abadi and team from Yale and Brown have done exactly that.  In this case, Hadoop over PostgresSQL. From Daniel’s blog:

 

HadoopDB is:

1.       A hybrid of DBMS and MapReduce technologies targeting analytical query workloads

2.       Designed to run on a shared-nothing cluster of commodity machines, or in the cloud

3.       An attempt to fill the gap in the market for a free and open source parallel DBMS

4.       Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).

5.       As scalable as Hadoop, while achieving superior performance on structured data analysis workloads

See: http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html for more detail and http://sourceforge.net/projects/hadoopdb/ for source code for HadoopDB.

 

A more detailed paper has been accepted for publication at VLDB: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf.

 

The development work for HadoopDB was done using AWS Elastic Compute Cluster. Nice work Daniel.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, July 25, 2009 9:59:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Services | Software
 Saturday, June 13, 2009

Erasure coding provides redundancy for greater than single disk failure without 3x or higher redundancy. I still like full mirroring for hot data but the vast majority of the worlds data is cold and much of it never gets referenced after writing it: Measurement and Analysis of Large-Scale Network File System Workloads. For less-than-hot workloads, erasure coding is an excellent solution. Companies such as EMC, Data Domain, Maidsafe, Allmydata, Cleversafe, and Panasas are all building products based upon erasure coding.

 

At FAST 2009 in late February, A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage will be presented. This paper looks at 5 open source erasure coding systems and compares there relative performance. The open source erasure coding packages implement Read-Solomon, Cauchy Read-Solomon, Even-Odd, Row-Diagonal Parity (RDP), and Minimal Density RAID-6 codes.

 

The authors found:

·         The special-purpose RAID-6 codes vastly outperform their general-purpose counterparts. RDP performs the best of these by a narrow margin.

·         Cauchy Reed-Solomon coding outperforms classic Reed-Solomon coding significantly, as long as attention is paid to generating good encoding matrices.

·         An optimization called Code-Specific Hybrid Reconstruction  is necessary to achieve good decoding speeds in many of the codes.

·         Parameter selection can have a huge impact on how well an implementation performs. Not only must the number of computational operations be considered, but also how the code interacts with the memory hierarchy, especially the caches.

·         There is a need to achieve the levels of improvement that the RAID-6 codes show for higher numbers of failures.

 

The paper also provides a good introduction of how erasure coding works.  Recommended. I expect erasure codes to spring up in many more application in the near future.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, June 13, 2009 9:42:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Thursday, March 26, 2009

Over the last couple of years, I’ve been getting more interested in Erlang as an high-scale services implementation language originally designed at Ericcson.  Back in May of last year I posted: Erlang and High-Scale System Software. 

 

The Erlang model of spawning many lightweight threads that communicate via message passing is typically less efficient than the more common shared memory and locks approach but the lightweight processes with message passing model but it is much easier to get a correct implementation using this model.  Erlang also encourages a “fail fast” programming model.  Years ago I became convinced that this design pattern is one of the best ways to get high scale systems software correct (Designing and Deploying Internet-Scale Services).   

Chris Newcombe of Amazon recently presented an excellent talk on Erlang at the Berkeley RAD Lab.  The first part of Chris’ Berkeley talk on Erlang is posted here: Erlang: Productivity and Performance (ChrisNewcombe_ErlangProductivityPerformance.pdf (298.21 KB)). The second half of Chris’ talk is posted at: http://ulf.wiger.net/weblog/wp-content/uploads/2009/01/damp09-erlang-multicore.pdf (unfortunately this link is down at the time of this posting). Update: Ulf Wiger offers a live URL for his excellent slides: http://www.cse.unsw.edu.au/~pls/damp09/damp09-wiger-keynote.pdf.

In this talk Chris gives an overview of Erlang, talks about some of the advantages of the language, and then goes through some of the performance strengths and weaknesses of Erlang.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

Thursday, March 26, 2009 6:42:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Software
 Sunday, February 22, 2009

Richard Jones of Last.fm has compiled an excellent list of key-value stores in Anti-RDBMS: A list of key-value stores.

 

In this post, Richard looks at Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase and Hypertable. His conclusion for Last.fm use is that Project Voldemort has the most promise with Scalaris being a close second and Dynomite is also interesting.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, February 22, 2009 7:43:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Saturday, February 07, 2009

Last July, Facebook released Cassandra to open source under the Apache license: Facebook Releases Cassandra as Open Source.  Facebook uses Cassandra as email search system where, as of last summer, they had 25TB and over 100m mailboxes. This video gets into more detail on the architecture and design: http://www.new.facebook.com/video/video.php?v=540974400803#/video/video.php?v=540974400803. My notes are below if you don’t feel like watching the video.

·         Authors:

o   Prashant Malik

o   Karthnik Ranganathan

o   Avinash Lakshman

·         Structured storage system over P2p (keys are consistent hashed over servers)

·         Initially aimed at email inbox search problem

·         Design goals:

o   Cost Effective

o   Highly Available

o   Incrementally Scalable

o   Efficient data layout

o   Minimal administration

·         Why Cassandra

o   MySQL drives too many random I/Os

o   File-based solutions require far too many locks

·         What is Cassandra

o   Structured storage over a distributed cluster

o   Redundancy via replication

o   Supports append/insert without reads

o   Supports a caching layer

o   Supports Hadoop operations

·         Cassandra Architecture

o   Core Cassandra Services:

§  Messaging (async, non-blocking)

§  Failure detector

§  Cluster membership

§  Partitioning scheme

§  Replication strategy

o   Cassandra Middle Layer

§  Commit log

§  Mem-table

§  Compactions

§  Hinted handoff

§  Read repair

§  Bootstrap

o   Cassandra Top Layer

§  Key, block, & column indexes

§  Read consistency

§  Touch cache

§  Cassandra API

§  Admin API

§  Read Consistency

o   Above the top layer:

§  Tools

§  Hadoop integration

§  Search API and Routing

·         Cassandra Data Model

o   Key (uniquely specifies a “row”)

§  Any arbitrary string

o   Column families are declared or deleted in advance by administrative action

§  Columns can be added or deleted dynamically

§  Column families have attribute:

·         Name: arbitrary string

·         Type: simple,

o   Key can “contain” multiple column families

§  No requirement that two keys have any overlap in columns

o   Columns can be added or removed arbitrarily from column families

o   Columns:

§  Name: arbitrary string

§  Value: non-indexed blob

§  Timestamp (client provided)

o   Column families have sort orders

§  Time-based sort or name-based sort

o   Super-column families:

§  Big tables calls them locality groups

§  Super-column families have a sort order

§  Essentially a multi-column index

o   System column families

§  For internal use by Cassandra

o   Example from email application

§  Mail-list (sorted by name)

·         All mail that includes a given word

§  Thread-list (sorted by time)

·         All threads that include a given word

§  User-list (sorted by time)

·         All mail that includes a given word user

·         Cassandra API

o   Simple get/put model

·         Write model:

o   Quorum write or aysnc mode (used by email application)

o   Async: send request to any node

§  That node will push the data to appropriate nodes but return to client immediately

o   Quorum write:

§  Blocks until quorum is reached

o   If node down, then write to another node with a hint saying where it should be written two

§  Harvester every 15 min goes through and find hints and moves the data to the appropriate node

o   At write time, you first write to a commit log (sequential)

§  After write to log it is sent to the appropriate nodes

§  Each node receiving write first records it in a local log

·         Then makes update to appropriate memtables (1 for each column family)

§  Memtables are flushed to disk when:

·         Out of space

·         Too many keys (128 is default)

·         Time duration (client provided – no cluster clock)

§  When memtables written out two files go out:

·         Data File

·         Index File

o   Key, offset pairs (points into data file)

o   Bloom filter (all keys in data file)

§  When a commit log has had all its column families pushed to disk, it is deleted

·         Data files accumulate over time.  Periodically data files are merged sorted into a new file (and creates new index)

·         Write properties:

o   No locks in critical path

o   Sequential disk access only

o   Behaves like a write through cache

§  If you read from the same node, you see your own writes.  Doesn’t appear to provide any guarantee on read seeing latest change in failure case

o   Atomicity guaranteed for a key

o   Always writable

·         Read Path:

o   Connect to any node

o   That node will route to the closes data copy which services immediately

o   If high consistency required, don’t return from local immediately

§  First send digest request to all replicas

§  If delta is found, the updates are sent to the nodes that don’t have current data (read repair)

·         Replication supported via multiple consistent hash rings:

o   Servers are hashed over ring

o   Keys are hashed over ring

o   Redundancy via walking around the ring and placing on the next node (rack position unaware) or on the next node on a different rack (rack aware) or on a next system in a different data center (implication being that the ring can span data centers)

·         Cluster membership

o   Cluster membership and failure detection via gossip protocol

·         Accrual failure detector

o   Default sets PHI to 5 in Cassandra

o   Detection is 10 to 15 seconds with PHI=5

·         UDP control messages and TCP for data messages

·         Complies with Staged Event Driven Architecture (SEDA)

·         Email system:

o   100m users

o   4B threads

o   25TB with 3x replication

o   Uses and joins across 4 tables:

§  Mailbox (user_id to thread_id mapping)

§  Msg_threads (thread to subject mapping)

§  Msg_store (thread to message mapping)

§  Info (user_id to user name mapping)

·         Able to load using Hadoop at 1.5TB/hour

o   Can load 25TB at network bandwidth over Cassandra Cluster

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, February 07, 2009 11:16:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Software

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<March 2010>
SunMonTueWedThuFriSat
28123456
78910111213
14151617181920
21222324252627
28293031123
45678910

Categories
This Blog
Member Login
All Content © 2010, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton