FriendFeed use of MySQL

I collect postings on high-scale service architectures, scaling war stories, and interesting implementation techniques. For past postings see Scaling Web Sites and Scaling LinkedIn.

Last week Bret Taylor posted an interesting description of the FriendFeed backend storage architecture: How FriendFeed uses MySQL to store Schema-less Data. Friendfeed faces a subset of what I’ve called the hairball problem. The short description of this issue is social networking sites need to be able to access per-user information both by user and also by searching for the data to find the users. For example, group membership. Sometimes we will want to find groups that X is a member of and other times we’ll want to find a given group and all users who are members of that group. If we partition by user, one access pattern is supported well. If we partition by group, then the other works well. The hairball problem shows up in many domains – I just focus on social networks as the problem is so common there — see Scaling LinkedIn.

Common design patterns to work around the hairball are: 1) application maintained, asynchronous materialized views, 2) distributed in-memory caching of alternate search paths, and 3) central in-memory caching. LinkedIn is a prototypical example of central in-memory caching. Facebook is the prototypical example of distributed in-memory caching using memcached. And, FriendFeed is a good example of the first pattern, application maintained, async materialized views.

In Bret’s How FreindFeed uses MySQL to store Schema-less Data he describes how Friendfeed manages the hairball problem. Data is stored in primary table sharded over the farm. The primary table can be efficiently accessed on whatever its key is. If you want access to the same data searching on a different dimension, they would have to search every shard individually. To avoid this, they create a secondary table with the appropriate search key where the “data” is just the primary key of the primary table. To find entities with some secondary property, they search first the secondary table to get the qualifying entity ID and then fetch the entities from the primary table.

Primary and secondary tables are not updated atomically – that would require two phase commit the protocol Pat Helland jokingly refers to as the anti-availability protocol. Since the primaries and secondary tables are not updated atomically, a secondary index may point to a primary that actually doesn’t qualify and some primaries that do quality may not be found if the secondary hasn’t yet been updated. The later is simply a reality of this technique and the application has to be tolerant of this short-time period data integrity anomaly. The former problem can be solved by reapplying the search predicate as a residual (a common RDBMS implementation technique).

The FriendFeed systems described in Bret Taylor’s post also addresses the schema change problem. Schema changes can disruptive and some RDBMS implement schema change incredibly inefficiently. This by the way, is completely unnecessary – the solution is well known – but bad implementations persist. The FriendFeed technique to deal with the schema change issue is arguably a bit heavy handed: they simply don’t show the schema to MySQL and, instead, use it as a key-value store where the values are either JSON objects or Python dictionaries.

–jrh

Thanks to Dave Quick for pointing me to the FriendFeed posting.

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com

H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com

4 comments on “FriendFeed use of MySQL
  1. I’ve worked for decades with RDBMSs and like them but I agree there is a place for non-relational stores. Working in Amazon Web Services, I work closely with the SimpleDB Team (http://aws.amazon.com/simpledb/). It’s easy to use, non-RDBMS hosted structured store.

    –jrh
    jrh@mvdirona.com

  2. MikeD says:

    I had not seen the FriendFeed post, so thanks for the tip!

    Mark – in using a master-slave setup to help with online schema changes, would you apply the schema changes to the slave, then flip the roles, then apply to the other server?

    We are moving away from MySQL to key-value approaches for some parts of our data (the simple key-value parts, of course) and I noticed that the Voldemort project is supporting some schema capabilities on JSON content. Do you have much experience with non-RDBMS oriented schema approaches?

  3. I agree that this problem is solvable many other ways. I’m a DB guy so recommending a schema-opaque solution wouldn’t be my first choice but that is the option that was taken in this case.

    Thanks for the comment Mark.

    –jrh
    jrh@mvdirona.com

  4. FriendFeed didn’t need to stop using a schema to perform online schema changes with MySQL. There is a well-known solution for this that doesn’t require the complexity of the traditional RDBMS solution — use master-master or master-slave replication and perform the change first on the server that isn’t the active master.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.