Sunday, June 29, 2008

Title: Needle in a Haystack: Efficient Storage of Billions of Photos

Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)

Slides: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv

 

An excellent talk that I really enjoyed.  I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned.  Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk.  Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.

 

The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs).  The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier.  Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.

 

My rough notes below with some of what I found most interesting.

 

Overall Facebook facts:

·         #6 site on the internet

·         500 total employees

o   200 in engineering

o   25 in Infrastructure Engineering

·         One of the largest MySQL installations in the world

·         Big user and contributor to Memcached

·         More than 10k servers in production

·         6,000 logical databases in production

 

Photo Storage and Management at Facebook:

·         Photo facts:

o   6.5B photos in total

§  4 to 5 sizes of each picture is materialized (30B files)

§  475k images/second

·         Mostly served via CDN (Akamai & Limelight)

·         200k profile photos/second

§  100m uploads/week

o   Stored on netapp filers

·         First level caching via CDN (Akamai & Limelight)

o   99.8% hit rate for profiles

o   92% hit rate for remainder

·         Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)

o   Based upon a modified version of evhttp using memcached as a “backing” store

o   Since cachr is independent from memcachd, cachr failure doesn’t lose state

o   1 TB of cache over 40 servers

o   Delivers microsecond response

o   Redundancy so no loss of cache contents on server failure

·         Photo Servers

o   Non-profile requests go directly against the photo-servers

o   Only profile requests that miss the cachr cache.

·         File Handle Cache (FHC)

o   Based upon lighttpd and uses memcached as backing store

o   Reduces metadata workload on NetApp servers

o   Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh

§  They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.

§  The inode numbers are stored in the FHC

§  This technique offloads the NetApp servers dramatically.

§  Note that files are write only.  Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.

·         Issues with this architecture:

o   Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).

§  The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)

§  Tracking last access time, last modified etc. has no value to Facebook.  They really only need a blob store but they are using a filesystem at additional expense

o   Heavy reliance on CDNs and caches such that netapp is basically almost pure backup

§  92% of non-profile and 99.8% of profile pictures are stored in CDN

§  Many of the rest are almost all stored in caching layers

·         Solution: Haystacks

o   Haystacks are a user level abstraction where lots of data is stored in a single file

o   Store an independent index vastly more efficient than the file store

o   1M of metadata/1G of data

§  Order of magnitude better on average than standard NetApp metadata

o   1 disk seek for all reads with any workload

o   Most likely store in XFS

o   Expect each haystack to be about 10G (with an index)

o   Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN.  The actual implementation is via NFS onto NetApp as photos were previously stored

o   Net of what’s happening:

§  Haystack always hits on the metadata

o   Plan to replace NetApp

§  Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)

§  Want more control of the cache behavior

o   Each Haystack Format:

§  Version number,

§  Magic number,

§  Length,

§  Data,

§  Checksum

o   Index format

§  Version,

§  Photo key,

§  Photo size,

§  Start,

§  Length.

o   Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage.  Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.

o   Q:Why not store the index in a RDBMS?  Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).

·         They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).

·         They are considering using to SSDs in the future.

·         Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.

·         The Facebook default is to use databases.  Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.

·         Almost all data is cached twice: once in memcached and then again in the DBs.

·         Random bits:

o   Canada: 1 out of 3 Canadians use Facebook.

o   Q:What is the strategy in China?  A:“not to do what Google did” :-)

o   Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clearly a big win for photos).

o   90% Indians access internet via a mobile device.  Facebook very focused on mobile and international.

 

Overall, an excellent talk by Jason.

 

--jrh

 

Sent my way by Hitesh Kanwathirtha  of the Windows Live Experience team, Mitch Wyle of Engineering Excellence, and Dave Quick and Alex Mallet both in Windows Live Cloud Storage group. It was originally Slashdotted at: http://developers.slashdot.org/article.pl?no_d2=1&sid=08/06/25/148203. The presentation is posted at: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, June 29, 2008 8:04:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
Tracked by:
"Facebook: Herr der Bilder" (TechBanger.de) [Trackback]
Sunday, June 29, 2008 8:51:10 PM (Pacific Standard Time, UTC-08:00)
I wonder if "don't delete data, just index entries" poses privacy concerns. I understand that the data basically can't be recovered, and can't be matched to a user, without that index. But do users understand it? Do courts?
Ari Rabkin
Monday, June 30, 2008 1:40:28 AM (Pacific Standard Time, UTC-08:00)
Hello James,

Could you please mention the sources of the statistics that you have come up with, in the section "Random bits" in the above post?
Monday, June 30, 2008 4:12:04 AM (Pacific Standard Time, UTC-08:00)
Ari, The "don't delete" question came up in the questions and Jason did a good job with it. He explained that most filesystems just deallocate the data blocks and remove the metadata on delete (that’s the reason that undelete utilities can exist). Facebook is doing the exact same thing as the filesystems. He explained they won’t serve deleted content the moment that it’s deleted so, from a user perspective, it IS deleted.

Shayon, you were asking about the sources of the random bits. It was all from the referenced talk. Most from the preamble but some from later parts of the talk.

--jrh
Comments are closed.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<June 2008>
SunMonTueWedThuFriSat
25262728293031
1234567
891011121314
15161718192021
22232425262728
293012345

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton