James Hamilton's Blog RSS 2.0
 Thursday, August 21, 2008

Last Friday I arrived back from vacation (Back from the Outside Passage in BC) to 3,600 email messages.  I’ve been slogging through them through the weekend to now and I’m actually starting to catch up. 

 

Yesterday Tom Kleinpeter pointed me to this excellent posting from Jason Sobel of Facebook: Scaling Out. This excellent post describes the geo-replication support recently added to Facebook.  Rather than having a single west coast data center they added an east coast center both to be near to East Coast users and to provide redundancy for the west coast site.

 

It’s a cool post for two reasons: 1) it’s a fairly detailed description of how one large scale service implemented geo-redundancy, and 2) they are writing about it externally.  Way too many of these techniques are never discussed outside the implementing company and so everyone needs to keep re-inventing.  Facebook continues to share both how they engineer aspects of their service in addition to contributing some of the code used to do it to open source (e.g. Facebook Releases Cassandra as Open Source).  Good to see.

 

I’ve long been interested in geo-replication, geo-partitioning, and all other forms of cross data center operation because it’s a problem that every high scale service needs to solve and yet there really are no general, just-do-this recipes to easily operate over multiple data centers. A few common design patters have emerged but all solutions of reasonable complexity end up being application specific.  I would love to see general infrastructure emerge to generally support geo-redundancy but it’s a hard problem and solutions invariably exploit knowledge of application semantics. Since most solutions tend to be ad hoc and application specific today, it’s worth studying what they do while looking for common patterns. That was the other reason I enjoyed seeing this posting by Jason. He described how Facebook is currently addressing the issue and its always worth understanding as many existing solutions as possible before attempting a generalization.

 

The solution they have adopted is one where the California data center is primary and all writes are routed to it. The Virginia data center is secondary and serves read-only traffic.  A load balancer does layer 7 packet inspection and routes URIs for pages that support writes to CA.  If the page is read-only, as much of the Facebook traffic is, it gets routed to the east coast site (assuming you are “near” to it for some definition of near).  

 

All writes are done through the California data center. Its serves as primary for the entire service. When a write is done in the Facebook architecture, the corresponding memcached layer is invalidated after the write is completed.  MySQL replication is used to replicate the changes to the remote data center and solved the problem of only invalidating the remote memcached entries after the remote MySQL update by modifying MySQL.  They changed MySQL to clear the local memcached of the appropriate keys once the replicated write is complete. 

 

It’s a simple and fairly elegant approach and there is no doubt that simple is good. I would prefer an approach that scales out both reads and writes and there is a slight robustness risk that some engineer in the future may sometime add a page to the site that does a write and forget to update the this-page-writes URI list.  If MySQL supports running a secondary replica in read-only mode, then that potential issue can be quickly and easily detected. 

 

An approach that would allow multi-data center updates is to replicate the entire DB contents everywhere as Jason described in this post but rather than routing all writes through the California DB, partition the user-base on userID and route their traffic to a fixed center. Allow updates for that user at the data center they were routed to and replicate these changes to the other centers.  Essentially it’s the same solution that was described except rather than routing all requests to the single primary database in CA, the primary database is distributed over multiple datacenters partitioned on userID.  This approach would have many advantages but it is much more difficult if not all the data is cleanly partitioned by userID.  And, in social networks, it isn’t.  I refer to this as the “Hairball Problem” (Scaling LinkedIn).

 

The alternative approach I describe above of partitioning the primary database by userID would be reasonably easy to do in a service like an email system with most state partitioned cleanly by user but in a social network with lots of cross-user, shared state (the hairball problem), it’s harder to do.  Nonetheless, it’s probably the right place for FB to end up but the current solution is clean and works and it’s hard to argue with that.

 

If you come across other articles on geo-support or want to contribute one here, drop me a note jrh@mvdirona.com.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, August 21, 2008 5:36:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, July 17, 2008

Going  boating: http://mvdirona.com/ so I’ll be taking a break from blogging until mid-august when I’m back and caught back up.  Enjoy,

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Thursday, July 17, 2008 4:53:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, July 13, 2008

I just got back from O’Reilly’s Foo Camp.   Foo is an interesting conference format in that there is no set agenda. It’s basically self organized as a open space-type event but that’s not what makes it special.  What makes Foo a very cool conference is the people.  Lots of conferences invite good people but few invite such a  diverse a set of attendees.  It was a lot of fun.

 

Here’s a picture from Saturday night of (right to left) Jesse Robbins (Seattle entrepreneur and co-chair of O’Reilly’s Velocity conference), Pat Helland (Microsoft Developer Division Architect) and myself.

 

  

From: http://flickr.com/photos/jesserobbins/2663653582/

Foo is an invitational event loosely organized around folks O’Reilly find interesting or want to get to know better.  Tim O’Reilly describes the invite process in this blog comment: http://gigaom.com/2005/08/16/foocampfighting/#comment-19709 .

 

The conference starts around 5pm on Friday with drinks followed by dinner. After dinner, the sign-up boards for Saturday and Sunday talk sessions are put up. When Tim O’Reilly announced the sign-up boards were up, attendees rose from their tables and casually started ambling towards the sign-up boards as though there really was no rush. We’ll just be doing some signing up over the course of the evening.  We might do it now or perhaps we’ll do it later. No big rush.  And then some folks break into a slight trot towards but still not really an overt run  Suddenly, it’s a full raging stampede.  We became a single flow than as a discrete set of individuals. The flow accelerated and then crashed up against the sign-up boards spilling to both sides with folks madly negotiating for pens, better locations, sharing sessions, a shift of 6” to one side or the other, or  requesting to join two sessions together. Welcome to foo camp.  It didn’t slow down much from that point for the next 44 hours.

 

Sessions are 1 hour long but it’s good form to share a session if you don’t need the full hour. Speakers use their judgment to sign up for a small, medium or large room.  Some rooms have AV equipment and some don’t.

 

I shared a 1 hour ssession with Jeff Hammerbacher who leads the Facebook data team.  Earlier last week Jeff announced that he will be leaving Facebok in September: http://valleywag.com/5024169/yet-another-hoodie+wearing-harvard-kid-drops-out-of-facebook.  Jeff and I got a medium sized room in a tent beside the main building without A/V. 

 

The title for my session was Where Does the Power go in Data Centers and How to get it Back?  I didn’t show slides but much of what we covered is posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_DCPowerSavingsFooCamp08.ppt.  In the session, we talked through how contemporary large data centers work first looking at power distribution. We tracked the power from the feed to the substation at 115,000 volts through numerous conversions before arriving at the CPU at 1.2 volts. We then talked about power saving server design techniques.  And then the mechanical systems used to get the heat back out.  In each section we discussed what could be done to improve the design and how much could be saved.

 

Our conclusion from the session was that power savings of nearly 4x where both possible and affordable using only current technology.  For those participated in the session, thanks for your contribution and  for your help. It was fun.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, July 13, 2008 5:58:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Saturday, July 12, 2008

Last week the Facebook Data team released Cassandra as open source. Cassandra is an structured store with write ahead logging and indexing. Jeff Hammerbacher, who leads the Facebook Data team described Cassandra as a BigTable data model running on a Dynamo-like infrastructure.

 

Google Code for Cassandra (Apache 2.0 License): http://code.google.com/p/the-cassandra-project/.

 

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented at SIGMOD 2008 this year: Cassandra: Structured Storage System over a P2P Network.  From the presentation:

 

Cassandra design goals:

·         High availability

·         Eventual consistency

·         Incremental scalability

·         Optimistic replication

·         Knobs to “tune” tradeoffs between consistency, durability, and latecy

·         Low cost of ownership

·         Minimal administration

 

Write operation: write to arbitrary node in Cassandra cluster, request sent to node owning the data, node writes to log first and then applied to in-memory copy. Properties of write: no locks in critical path, sequential disk accesses, behaves like a write through cache, atomicity guarantee for a key, and always writable.

 

Cluster membership is maintained via gossip protocol.

 

Lessons learned:

·         Add fancy features only when required

·         Many types of failures are possible

·         Big systems need proper systems-level monitoring

·         Value simple designs

 

Future work:

·         Atomicity guarantees across multiple keys

·         Distributed transactions (I’ll try to talk them out of this one)

·         Compression support

·         Fine grained security via ACLs

 

It looks like a well engineered system.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, July 12, 2008 3:17:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)

Ø  non-blocking, consistent reads (pb: does a transaction see its previous writes?)

Ø  cross-entity group operations have looser consistency

 

Performance

Ø  schemas declare their physical data locality

Ø  optimized to minimize seeks, RPCs, bandwidth, and storage

Ø  several ways of declaring physical locality

o   entity groups

o   shared primary key prefixes (collocating tables in Big Table)

o   locality groups – i.e., attribute partitioning

Ø  simply cost-transparent API primitives imply

o   only add scalable features

o   cost of writes is linear to data/indices

o   avoid scalability surprises

 

Avoiding joins

Ø  hierarchical primary keys

Ø  repeated fields (I guess just like Big Table)

Ø  store hierarchical data  in a column (it’s unclear to me whether this is the whole entity group or only part of it)

Ø  Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..

 

Replication HA

Ø  uses Paxos-based algorithm, per entity group

Ø  it was more complicated than they expected

Ø  writes need a majority of replicas to be up in order to commit

Ø  most reads are local, consistency ensured

Ø  replication is consistent and synchronous

Ø  automatic query failover: individual table servers may fail

 

Usage

Ø  has been used in production for a year

Ø  used by several internal tools and projects

Ø  several large and user-visible apps (they wouldn’t say which ones, except we know them)

Ø  used for rapidly implementing new features in older projects

Ø  many other projects are migrating to it

 

Technical lessons

Ø  declarative schemas are good

Ø  cost-transparent APIs good (SQL is not cost-transparent)

Ø  avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall

Ø  avoid scalability surprises

 

Social

Ø  schema reviews helpful

Ø  consistency is necessary

Ø  need a mindset of scalability and performance

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, July 10, 2008 1:47:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, July 08, 2008

Jim Gray proposed the original sort benchmark back in his famous Anon et al paper A Measure of Transaction Processing Power originally published in Datamation April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.

 

TeraSort is essentially a sequential I/O benchmark and the best way to get lots of I/O capacity is to have many servers.  The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model that is understandable to allow thousands of nodes to be harnessed.  MapReduce takes some heat for not being innovative and not having learned enough from the database community (MapReduce – A Major Step Backwards).  However, Google, Microsoft, and Yahoo run the model over thousands of nodes.  And all three have written higher level languages layers above MapReduce some of which look very SQL-like.

 

Owen O’Malley of the Yahoo Grid team took a moderate sized Hadoop cluster of 910 nodes and won the TeraSort benchmark.  Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result Owen.

 

Here’s the configuration that won:

  • 910 nodes
  • 4 dual core Xeons @ 2.0ghz per a node
  • 4 SATA disks per a node
  • 8G RAM per a node
  • 1 gigabit ethernet on each node
  • 40 nodes per a rack
  • 8 gigabit ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Yahoo bought expensive  4-socket servers for this configuration but, even then, this effort was won on less than ½ million in hardware.  Let’s assume that their fat nodes are $3.8k each.  They have 40 servers per rack so well need 23 racks. Let’s assume $4.2k per top of rack switch and $100k for core switching.  That’s 910*$3.8k+23*4.2k+$100k or $3,655k.  That means you can go out and spend roughly $3.5m and have the same resources that won the last sort benchmark. Amazing.  I love what’s happening in our industry.

Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).

 

The next thing I would like to see is this same test run on very low power servers.  Assuming the fairly beefy nodes used above are 350W each (they may well be more), the overall cluster ignoring networking will be 318kW and it ran for 209 seconds which is 18.490kW/hrs. Let’s focus on power and show what can be sorted for 1kW/hr. The kW/hr sort.

 

Congratulations to the Yahoo and Hadoop team for a great result.

 

                                --jrh

 

Wei Xiao of the Internet Search Research Center sent Owen’s result my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, July 08, 2008 4:03:17 AM (Pacific Standard Time, UTC-08:00)