Facebook Cassandra Architecture and Design

Last July, Facebook released Cassandra to open source under the Apache license: Facebook Releases Cassandra as Open Source. Facebook uses Cassandra as its email inbox search system which, as of last summer, held 25TB of data across more than 100m mailboxes. This video gets into more detail on the architecture and design: http://www.new.facebook.com/video/video.php?v=540974400803#/video/video.php?v=540974400803. My notes are below if you don’t feel like watching the video.

· Authors:

o Prashant Malik

o Karthik Ranganathan

o Avinash Lakshman

· Structured storage system over P2P (keys are consistent-hashed over servers)

· Initially aimed at email inbox search problem

· Design goals:

o Cost Effective

o Highly Available

o Incrementally Scalable

o Efficient data layout

o Minimal administration

· Why Cassandra

o MySQL drives too many random I/Os

o File-based solutions require far too many locks

· What is Cassandra

o Structured storage over a distributed cluster

o Redundancy via replication

o Supports append/insert without reads

o Supports a caching layer

o Supports Hadoop operations

· Cassandra Architecture

o Core Cassandra Services:

§ Messaging (async, non-blocking)

§ Failure detector

§ Cluster membership

§ Partitioning scheme

§ Replication strategy

o Cassandra Middle Layer

§ Commit log

§ Mem-table

§ Compactions

§ Hinted handoff

§ Read repair

§ Bootstrap

o Cassandra Top Layer

§ Key, block, & column indexes

§ Read consistency

§ Touch cache

§ Cassandra API

§ Admin API

o Above the top layer:

§ Tools

§ Hadoop integration

§ Search API and Routing

· Cassandra Data Model

o Key (uniquely specifies a “row”)

§ Any arbitrary string

o Column families are declared or deleted in advance by administrative action

§ Columns can be added or deleted dynamically

§ Column families have attributes:

· Name: arbitrary string

· Type: simple or super

o Key can “contain” multiple column families

§ No requirement that two keys have any overlap in columns

o Columns can be added or removed arbitrarily from column families

o Columns:

§ Name: arbitrary string

§ Value: non-indexed blob

§ Timestamp (client provided)

o Column families have sort orders

§ Time-based sort or name-based sort

o Super-column families:

§ Bigtable calls them locality groups

§ Super-column families have a sort order

§ Essentially a multi-column index

o System column families

§ For internal use by Cassandra

o Example from email application

§ Mail-list (sorted by name)

· All mail that includes a given word

§ Thread-list (sorted by time)

· All threads that include a given word

§ User-list (sorted by time)

· All mail that includes a given user
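
To make the model concrete, here’s a quick sketch of how I picture a row. This is purely illustrative Python: the column family names are borrowed from the email example above, and the nested-map layout is my assumption about how to think of the model, not Cassandra’s storage format.

```python
# A row is addressed by an arbitrary string key and holds one map per
# declared column family; columns inside a family are dynamic and carry
# (value, client-supplied timestamp).  Sort order (by column name or by
# timestamp) is a per-family property.
row_key = "user12345"   # any arbitrary string uniquely naming a "row"

row = {
    "MailList": {        # name-sorted family: word -> matching mail
        "cassandra": (b"mail-ids...", 1221868800),
        "hadoop":    (b"mail-ids...", 1221955200),
    },
    "ThreadList": {      # time-sorted family: word -> matching threads
        "cassandra": (b"thread-ids...", 1221955200),
    },
}

# Columns have a name (string), an unindexed blob value, and a timestamp
# used to pick a winner when writes conflict.
value, timestamp = row["MailList"]["cassandra"]
```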

· Cassandra API

o Simple get/put model
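
As a rough picture of that surface, here’s a toy in-memory sketch. The method names and signatures are my own illustration, not the actual Thrift interface Cassandra exposed; the point is that reads and writes address a (key, column family, column) coordinate and writes carry a client-supplied timestamp.

```python
from collections import defaultdict

class ToyStore:
    """Toy model of the get/put surface; newest timestamp wins."""

    def __init__(self):
        # key -> column family -> column -> (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, key, family, column, value, timestamp):
        # Last-write-wins: keep the column only if this write is newer.
        current = self._rows[key][family].get(column)
        if current is None or timestamp > current[0]:
            self._rows[key][family][column] = (timestamp, value)

    def get(self, key, family, column):
        return self._rows[key][family][column][1]

store = ToyStore()
store.put("user12345", "MailList", "cassandra", b"mail-ids", timestamp=1)
assert store.get("user12345", "MailList", "cassandra") == b"mail-ids"
```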

· Write model:

o Quorum write or async mode (the email application uses async); both modes are sketched after this list

o Async: send request to any node

§ That node pushes the data to the appropriate nodes but returns to the client immediately

o Quorum write:

§ Blocks until quorum is reached

o If a node is down, the write goes to another node with a hint saying where it should be written to

§ A harvester runs every 15 minutes, finds hints, and moves the data to the appropriate node

o At write time, you first write to a commit log (sequential)

§ After the write to the log, it is sent to the appropriate nodes

§ Each node receiving the write first records it in a local log

· Then updates the appropriate memtables (one per column family)

§ Memtables are flushed to disk when:

· Out of space

· Too many keys (128 is default)

· Time duration (client provided – no cluster clock)

§ When a memtable is written out, two files are produced:

· Data File

· Index File

o (Key, offset) pairs (pointing into the data file)

o Bloom filter (all keys in data file)

§ When a commit log has had all its column families pushed to disk, it is deleted
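
Here’s a minimal sketch of the two write modes plus hinted handoff. All the names (Node, coordinate_write, harvest_hints) are hypothetical and the real system differs in many details; this just captures the control flow described above.

```python
import random

class Node:
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive
        self.data = {}      # applied writes
        self.hints = []     # writes held on behalf of down replicas

    def apply(self, key, mutation):
        self.data[key] = mutation

def coordinate_write(key, mutation, replicas, cluster, mode="async"):
    """Any node accepts the write and pushes it to the key's replicas.
    Async mode returns once the pushes are issued; quorum mode fails
    unless a majority of replicas acknowledged.  A down replica's copy
    goes to a live stand-in along with a hint naming the intended node."""
    acks = 0
    for replica in replicas:
        if replica.alive:
            replica.apply(key, mutation)
            acks += 1
        else:
            stand_in = random.choice(
                [n for n in cluster if n.alive and n not in replicas])
            stand_in.hints.append((replica.name, key, mutation))
    if mode == "quorum" and acks < len(replicas) // 2 + 1:
        raise RuntimeError("quorum not reached")

def harvest_hints(cluster):
    """The periodic (~15 min) harvester: replay any hint whose
    intended node is back up, then drop the hint."""
    for node in cluster:
        for hint in list(node.hints):
            target = next(n for n in cluster if n.name == hint[0])
            if target.alive:
                target.apply(hint[1], hint[2])
                node.hints.remove(hint)
```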

· Data files accumulate over time; periodically they are merge-sorted into a new file (with a new index), as sketched below
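
A compact sketch of the flush and merge mechanics, in toy Python. The 128-key threshold follows the notes; the Bloom filter here is intentionally simplistic, just to show what gets written alongside the data and index files.

```python
MAX_KEYS = 128   # default key-count flush threshold from the talk

class Memtable:
    """In-memory table for one column family.  On flush it emits an
    immutable sorted data file, an index of (key, offset) pairs, and a
    Bloom filter over all keys in the data file."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value
        return self.flush() if len(self.rows) >= MAX_KEYS else None

    def flush(self):
        data = sorted(self.rows.items())                     # data file
        index = {k: pos for pos, (k, _) in enumerate(data)}  # (key, offset)
        bloom = {hash(k) % 1024 for k, _ in data}            # toy Bloom filter
        self.rows = {}
        return data, index, bloom

def compact(files):
    """Merge accumulated data files into one new sorted file; the newest
    value wins per key, and a fresh index is built the same way."""
    merged = {}
    for data, _, _ in files:          # files ordered oldest -> newest
        merged.update(dict(data))
    return sorted(merged.items())
```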

· Write properties:

o No locks in critical path

o Sequential disk access only

o Behaves like a write-through cache

§ If you read from the same node, you see your own writes; there doesn’t appear to be any guarantee that a read sees the latest change in the failure case

o Atomicity guaranteed for a key

o Always writable

· Read Path:

o Connect to any node

o That node routes the request to the closest data copy, which services it immediately

o If high consistency is required, the local result isn’t returned immediately:

§ First, a digest request is sent to all replicas

§ If a delta is found, updates are sent to the nodes that don’t have current data (read repair)
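
Here’s a sketch of that flow. The replica objects are hypothetical (assumed to expose get/put returning a value plus timestamp); comparing MD5 digests stands in for whatever digest the real system exchanges.

```python
import hashlib

def read(key, replicas, high_consistency=False):
    """Route to the closest replica and take its answer.  Under high
    consistency, compare digests from the other replicas; on a mismatch,
    push the newest version to whichever side is stale (read repair)."""
    closest, others = replicas[0], replicas[1:]
    value, ts = closest.get(key)
    if high_consistency:
        for replica in others:
            other_value, other_ts = replica.get(key)
            if hashlib.md5(other_value).digest() != hashlib.md5(value).digest():
                if other_ts > ts:              # we held the stale copy
                    closest.put(key, other_value, other_ts)
                    value, ts = other_value, other_ts
                else:                          # the replica is stale
                    replica.put(key, value, ts)
    return value
```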

· Replication supported via multiple consistent hash rings:

o Servers are hashed over ring

o Keys are hashed over ring

o Redundancy via walking around the ring and placing copies on the next node (rack-position unaware), on the next node on a different rack (rack aware), or on the next node in a different data center (the implication being that the ring can span data centers)
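
A minimal consistent-hashing sketch (illustrative names and a plain successor walk; a rack- or datacenter-aware strategy would keep walking until it reached nodes on other racks or in other data centers rather than taking immediate successors):

```python
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Servers and keys hash onto the same ring; a key belongs to the
    first server at or past its hash, and extra replicas come from
    continuing around the ring."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    def replicas(self, key, n=3):
        start = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(min(n, len(self._ring)))]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas("user12345"))   # three successive nodes on the ring
```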

· Cluster membership

o Cluster membership and failure detection via gossip protocol
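
A minimal sketch of one gossip exchange. The per-member heartbeat counter is the usual mechanism in gossip-based membership; the names here are my assumptions, not Cassandra’s internals.

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.view = {name: 0}   # member name -> latest heartbeat seen

def gossip_round(node, peers):
    """One round: bump our own heartbeat, pick a random peer, and merge
    views in both directions with the higher heartbeat winning.  A member
    whose heartbeat stops advancing is eventually suspected of failure."""
    node.view[node.name] += 1
    peer = random.choice(peers)
    for member in set(node.view) | set(peer.view):
        latest = max(node.view.get(member, -1), peer.view.get(member, -1))
        node.view[member] = peer.view[member] = latest
```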

· Accrual failure detector

o Default sets PHI to 5 in Cassandra

o Detection is 10 to 15 seconds with PHI=5
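
Here’s a minimal sketch of a PHI accrual detector, assuming exponentially distributed heartbeat inter-arrival times (a common simplification; the actual estimator may differ). The idea is to report a suspicion level rather than a binary up/down verdict.

```python
import math
import time

class AccrualFailureDetector:
    """PHI grows the longer a heartbeat is overdue relative to the
    history of inter-arrival times.  With exponential arrivals,
    PHI = -log10(P(heartbeat still coming)) = (elapsed / mean) * log10(e)."""

    def __init__(self, threshold=5.0):   # PHI = 5 is the Cassandra default
        self.threshold = threshold
        self.intervals = []
        self.last = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now=None):
        now = time.time() if now is None else now
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        return ((now - self.last) / mean) * math.log10(math.e)

    def suspect(self, now=None):
        return self.phi(now) >= self.threshold
```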

· UDP for control messages and TCP for data messages

· Complies with Staged Event Driven Architecture (SEDA)
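
SEDA structures a server as a chain of stages, each with its own bounded event queue and thread pool. A minimal sketch (illustrative only, not Cassandra’s actual stages):

```python
import queue
import threading
import time

class Stage:
    """One SEDA stage: a bounded event queue drained by a private pool
    of worker threads, with the handler free to enqueue onto the next
    stage.  Bounded queues give per-stage backpressure, so a slow stage
    (e.g. disk I/O) cannot silently overrun a fast one (e.g. parsing)."""

    def __init__(self, handler, workers=2, maxsize=1024):
        self.events = queue.Queue(maxsize)
        self._handler = handler
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self._handler(self.events.get())

# Two chained stages: parse a request, then persist it.
persist = Stage(lambda event: print("persisted:", event))
parse = Stage(lambda raw: persist.events.put(raw.upper()))
parse.events.put("write key=a value=b")
time.sleep(0.1)   # let the daemon workers drain the queues
```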

· Email system:

o 100m users

o 4B threads

o 25TB with 3x replication

o Uses joins across 4 tables (an application-level join over these is sketched after this list):

§ Mailbox (user_id to thread_id mapping)

§ Msg_threads (thread to subject mapping)

§ Msg_store (thread to message mapping)

§ Info (user_id to user name mapping)
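
Serving a mailbox then amounts to an application-level join chained across these four mappings. The sketch below is hypothetical in all its accessor names; it just shows the lookup chain implied by the table descriptions.

```python
def render_inbox(store, user_id):
    """Chain lookups across the four tables: mailbox gives the user's
    thread ids, msg_threads gives each thread's subject, msg_store gives
    its messages, and info resolves the user id to a display name."""
    display_name = store.get("info", user_id)
    inbox = []
    for thread_id in store.get("mailbox", user_id):
        inbox.append({
            "subject":  store.get("msg_threads", thread_id),
            "messages": store.get("msg_store", thread_id),
        })
    return display_name, inbox
```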

· Able to load using Hadoop at 1.5TB/hour

o Can load the full 25TB at network bandwidth over the Cassandra cluster (roughly 17 hours at 1.5TB/hour)

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com

H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com

2 comments on “Facebook Cassandra Architecture and Design”
  1. It is true that failure conditions and recovery are the hardest to get right, the most important to understand, and often the poorest documented.

    –jrh
    jrh@mvdirona.com

  2. Phil Wise says:

    Nice. I was evaluating Cassandra a while back and went through the API and documented the high level operations that are available:

    http://www.brightyellowcow.com/blog/Evaluating-the-API-of-Cassandra-BigTable-.html

    Certain obvious operations didn’t seem to be present. This is (I assume) because they cannot be implemented without a significant performance penalty. A more insidious problem is that the precise behaviour that can be experienced by a client during failover conditions is not documented. It would be nice if folk started using something like TLA+ to document exactly which outcomes are ‘correct’ and which are ‘bugs’.
