Geo-Replication at Facebook

Last Friday I arrived back from vacation (back from the Outside Passage in BC) to 3,600 email messages. I've been slogging through them over the weekend and I'm finally starting to catch up.

Yesterday Tom Kleinpeter pointed me to an excellent post from Jason Sobel of Facebook: Scaling Out. It describes the geo-replication support recently added at Facebook. Rather than operating a single west coast data center, they added an east coast center, both to be nearer to East Coast users and to provide redundancy for the west coast site.

It's a cool post for two reasons: 1) it's a fairly detailed description of how one large scale service implemented geo-redundancy, and 2) they are writing about it externally. Way too many of these techniques are never discussed outside the implementing company, so everyone keeps re-inventing them. Facebook continues to share how they engineer aspects of their service, in addition to contributing some of the code used to do it to open source (e.g. Facebook Releases Cassandra as Open Source). Good to see.

I've long been interested in geo-replication, geo-partitioning, and all other forms of cross data center operation because it's a problem that every high scale service needs to solve, and yet there really are no general, just-do-this recipes for operating over multiple data centers. A few common design patterns have emerged, but all solutions of reasonable complexity end up being application specific. I would love to see general infrastructure emerge to support geo-redundancy, but it's a hard problem and solutions invariably exploit knowledge of application semantics. Since most solutions today tend to be ad hoc and application specific, it's worth studying what they do while looking for common patterns. That was the other reason I enjoyed seeing this posting by Jason. He described how Facebook is currently addressing the issue, and it's always worth understanding as many existing solutions as possible before attempting a generalization.

The solution they have adopted is one where the California data center is primary and all writes are routed to it. The Virginia data center is secondary and serves read-only traffic. A load balancer does layer 7 packet inspection and routes URIs for pages that support writes to CA. If the page is read-only, as much of the Facebook traffic is, it gets routed to the east coast site (assuming you are “near” to it for some definition of near).
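The routing decision described above can be sketched in a few lines. This is a minimal illustration, not Facebook's load balancer: the URI prefixes and data center names are assumptions made up for the example, and a real layer-7 balancer would also consider the user's network proximity.

```python
# Hypothetical sketch of layer-7, URI-based routing: writes go to the
# primary data center; read-only pages can be served by the nearer site.
# The URI prefixes below are illustrative, not Facebook's actual list.

WRITE_URI_PREFIXES = (
    "/editprofile",      # example pages known to issue database writes
    "/updatestatus",
    "/inbox/compose",
)

def route_request(uri: str, near_east_coast: bool) -> str:
    """Return the data center that should serve this request."""
    if uri.startswith(WRITE_URI_PREFIXES):
        return "california"   # all writes are routed to the primary
    if near_east_coast:
        return "virginia"     # read-only traffic served from the local replica
    return "california"
```

The fragility the post goes on to discuss lives in `WRITE_URI_PREFIXES`: any write-capable page missing from that list is silently routed to a read-only replica.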

All writes are done through the California data center. It serves as the primary for the entire service. When a write is done in the Facebook architecture, the corresponding memcached entries are invalidated after the write completes. MySQL replication carries the changes to the remote data center. To ensure the remote memcached entries are invalidated only after the remote MySQL update has been applied, they modified MySQL: the replica clears the appropriate keys from its local memcached once each replicated write completes.
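The ordering constraint in that paragraph is the whole trick: invalidate the remote cache only after the replicated write has landed, or readers can re-populate the cache with stale data. The following toy model (my sketch, not Facebook's MySQL patch; the class and key names are made up) shows the invalidate-after-apply sequence on the replica side:

```python
# Toy model of the remote data center: a dict stands in for the MySQL
# replica and a tiny cache class stands in for memcached. The point is
# the ordering in apply_replicated_write: apply the row change first,
# then invalidate the local cache, so a subsequent read misses the cache
# and falls through to the freshly updated database.

class FakeMemcache:
    def __init__(self):
        self._store = {}
    def set(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)
    def delete(self, key):
        self._store.pop(key, None)

class RemoteReplica:
    def __init__(self):
        self.db = {}               # stands in for the MySQL replica
        self.cache = FakeMemcache()  # stands in for local memcached

    def apply_replicated_write(self, key, value):
        self.db[key] = value       # 1) apply the replicated change
        self.cache.delete(key)     # 2) only then clear the cache entry

    def read(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        value = self.db.get(key)   # cache miss: read the replica...
        self.cache.set(key, value) # ...and re-populate the cache
        return value
```

Reversing steps 1 and 2 would open a window where a read repopulates the cache from the not-yet-updated replica, which is exactly the race the MySQL modification avoids.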

It's a simple and fairly elegant approach, and there is no doubt that simple is good. I would prefer an approach that scales out both reads and writes, and there is a slight robustness risk: some engineer may one day add a page to the site that does a write and forget to add it to the this-page-writes URI list. If MySQL supports running a secondary replica in read-only mode, then that potential issue can be quickly and easily detected.

An approach that would allow multi-data center updates is to replicate the entire DB contents everywhere, as Jason described in his post, but rather than routing all writes through the California DB, partition the user base on userID and route each user's traffic to a fixed data center. Allow updates for that user at the data center they were routed to and replicate those changes to the other centers. Essentially it's the same solution that was described, except that rather than routing all writes to the single primary database in CA, the primary database is distributed over multiple data centers partitioned on userID. This approach would have many advantages, but it is much harder when the data isn't cleanly partitioned by userID. And, in social networks, it isn't. I refer to this as the "Hairball Problem" (Scaling LinkedIn).
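The per-user partitioning scheme above amounts to a deterministic mapping from userID to a "home" data center that owns that user's writes. A minimal sketch, with made-up data center names (a production system would use a configurable mapping rather than simple modulo, so centers can be added without reshuffling every user):

```python
# Illustrative userID partitioning: each user gets a fixed home data
# center that acts as the primary for that user's writes; all other
# centers hold read-only replicas of that user's data. Names assumed.

DATA_CENTERS = ["california", "virginia"]

def home_data_center(user_id: int) -> str:
    """Deterministically map a user to the data center owning their writes."""
    return DATA_CENTERS[user_id % len(DATA_CENTERS)]

def route_write(user_id: int) -> str:
    # Writes must go to the user's home center; reads can go anywhere.
    return home_data_center(user_id)
```

The hairball problem shows up as soon as one write touches two users with different home centers (say, a wall post from user 1 on user 2's profile): neither center cleanly owns the operation, which is why cross-user shared state makes this scheme so much harder for a social network.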

The alternative approach I describe above, partitioning the primary database by userID, would be reasonably easy in a service like email, where most state partitions cleanly by user. In a social network with lots of cross-user, shared state (the hairball problem), it's harder to do. Nonetheless, it's probably the right place for FB to end up, but the current solution is clean and works, and it's hard to argue with that.

If you come across other articles on geo-support or want to contribute one here, drop me a note jrh@mvdirona.com.

–jrh

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
