Scaling LinkedIn

I’m interested in high-scale web sites, their architecture and their scaling problems. Last Thursday, Oren Hurvitz posted a great blog entry summarizing two presentations at Java One on the LinkedIn service architecture.

LinkedIn scale is respectable:

· 22M members

· 130M connections

· 2M email messages per day

· 250k invitations per day

No big surprises other than that LinkedIn is still using a big, expensive commercial RDBMS and a large central Sun SPARC server. LinkedIn calls this central server “The Cloud” and it’s described as the “backend server caching the entire LinkedIn Network”. Note that “server” is not plural: it appears the cloud is not partitioned, although it is replicated 40 times. Each instance of the cloud hosts the entire social graph in 12GB of memory.
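As a back-of-envelope check that the graph comfortably fits, assume each member’s connection list is stored as a flat array of 32-bit member IDs (my assumption, not a description of LinkedIn’s actual layout):

```java
public class GraphMemoryEstimate {
    public static void main(String[] args) {
        long members = 22_000_000L;
        long connections = 130_000_000L;      // undirected edges
        long edgeEntries = connections * 2;   // each edge stored in both members' lists
        long bytesPerEntry = 4;               // 32-bit member IDs
        long perMemberOverhead = 16;          // rough per-array object overhead
        long totalBytes = edgeEntries * bytesPerEntry + members * perMemberOverhead;
        // Prints roughly 1.4 GB -- well under 12 GB, leaving room for other
        // per-member state, indexes, and growth.
        System.out.printf("~%.1f GB for the raw adjacency data%n", totalBytes / 1e9);
    }
}
```

The published 12GB figure presumably includes considerably more per-member state than raw adjacency lists, but the point stands: the whole graph fits on one machine today.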

Social graphs are one of my favorite services problems in that they are notoriously difficult to effectively partition. I often refer to this as the “hairball problem”. There is no clean data partition that will support the workload with adequate performance. Typical approaches to this problem redundantly store several copies of the data with different lookup keys and potentially different partitions. For example, most social networks have some notion of a user group. The membership list is stored with each group.

But a common request is to find all groups of which a specific user is a member. Storing users with the group allows efficient group enumeration but doesn’t support the “find all groups of a given user” query. Storing the group membership with each user supports this query well but doesn’t allow efficient group membership enumeration. The most common solution is to store the data redundantly both ways. Typically one is the primary copy and the other is updated asynchronously after the primary is updated. This makes reads much more efficient at the cost of more work at update time. Sometimes the secondary copy isn’t persisted in the backend database and is only stored in an in-memory cache, often implemented using memcached.
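A minimal sketch of that redundant-storage pattern, using a simple in-process map for each copy (the class and method names are illustrative, not LinkedIn’s):

```java
import java.util.*;
import java.util.concurrent.*;

// Group membership stored redundantly in both directions: the group -> members
// list is the primary copy; the user -> groups index is a secondary copy
// updated asynchronously after the primary write.
public class GroupMembership {
    private final Map<String, Set<String>> membersByGroup = new ConcurrentHashMap<>();
    private final Map<String, Set<String>> groupsByUser  = new ConcurrentHashMap<>();
    private final ExecutorService indexUpdater = Executors.newSingleThreadExecutor();

    public void join(String userId, String groupId) {
        // 1. Update the primary copy synchronously.
        membersByGroup.computeIfAbsent(groupId, g -> ConcurrentHashMap.newKeySet()).add(userId);
        // 2. Update the secondary (reverse) index asynchronously; "groups for a
        //    user" reads may briefly lag the primary copy.
        indexUpdater.submit(() ->
            groupsByUser.computeIfAbsent(userId, u -> ConcurrentHashMap.newKeySet()).add(groupId));
    }

    public Set<String> membersOf(String groupId) {   // efficient group enumeration
        return membersByGroup.getOrDefault(groupId, Set.of());
    }

    public Set<String> groupsOf(String userId) {     // efficient "all groups of a user"
        return groupsByUser.getOrDefault(userId, Set.of());
    }
}
```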

The LinkedIn approach of using a central in-memory social graph avoids some of the redundancy of the partitioned model I described above at the expense of requiring a single server with memory large enough to store the entire un-partitioned social graph. Clearly this is more efficient by many measures, but requiring that the entire social graph fit into a single server’s memory means that more expensive servers are required as the service grows. And many of us can’t sleep at night with hard scaling limits, even if they are apparently fairly large.
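For contrast, here is a sketch of what an un-partitioned in-memory graph buys you: with every connection list in local memory, traversals such as a second-degree connection lookup never have to cross servers. Again, the representation below is my assumption, not LinkedIn’s actual implementation:

```java
import java.util.*;

// Un-partitioned in-memory social graph: each replica holds every member's
// connection list, so whole-graph traversals stay on one machine.
public class SocialGraph {
    private final Map<Integer, int[]> connections = new HashMap<>();  // memberId -> connection ids

    public void setConnections(int memberId, int[] connectionIds) {
        connections.put(memberId, connectionIds);
    }

    // All members within two hops of the given member -- the kind of traversal
    // that becomes painful once the graph is partitioned across machines.
    public Set<Integer> secondDegree(int memberId) {
        Set<Integer> result = new HashSet<>();
        for (int first : connections.getOrDefault(memberId, new int[0])) {
            result.add(first);
            for (int second : connections.getOrDefault(first, new int[0])) {
                result.add(second);
            }
        }
        result.remove(memberId);
        return result;
    }
}
```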

Other pointers on scaling web sites are posted at: http://perspectives.mvdirona.com/2007/11/12/ScalingWebSites.aspx.

Thanks to Dare Obasanjo and Kevin Merritt for sending this my way.

–jrh

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com

One comment on “Scaling LinkedIn”
  1. this is a very interesting post. it’s always fun to see how other popular properties are implementing their systems.

    it seems like linkedin will be able to tread water for a little while with this current design – their feature set and user community doesn’t seem to be as explosive as would be found on facebook/myspace, for example – and that’s probably fine. i agree, though, that it is a bit unsettling to ponder what would need to happen to rapidly scale out this service given how tied and hardened it is in its current approach. i do suspect if they made it this far, though, that they have enough smart folks who have thought of or are already thinking of what to do next.

    it’s also interesting to see the confabulation of so many little (but complex) packages that seems to be a trademark of java implementations. it’s interesting, for example, that they chose not to use hibernate, but are still bothering with spring. i’d think they’d also ditch spring to lose the extra deployment/mgmt overhead and complexity – but then again, i don’t know as much about how effective that would be. it also seems the gc and object size problems mentioned are a little troubling though i can’t imagine how they haven’t been solved yet – i’d guess a lot of it is just a matter of having to do a lot of testing and qualification for a good-sized matrix of packages all working together.
