Title: Needle in a Haystack: Efficient Storage of Billions of Photos
Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)
An excellent talk that I really enjoyed. I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned. Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk. Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.
The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs). The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier. Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.
My rough notes below with some of what I found most interesting.
Overall Facebook facts:
· #6 site on the internet
· 500 total employees
o 200 in engineering
o 25 in Infrastructure Engineering
· One of the largest MySQL installations in the world
· Big user and contributor to Memcached
· More than 10k servers in production
· 6,000 logical databases in production
Photo Storage and Management at Facebook:
· Photo facts:
o 6.5B photos in total
§ 4 to 5 sizes of each picture is materialized (30B files)
§ 475k images/second
· Mostly served via CDN (Akamai & Limelight)
· 200k profile photos/second
§ 100m uploads/week
o Stored on netapp filers
· First level caching via CDN (Akamai & Limelight)
o 99.8% hit rate for profiles
o 92% hit rate for remainder
· Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)
o Based upon a modified version of evhttp using memcached as a “backing” store
o Since cachr is independent from memcachd, cachr failure doesn’t lose state
o 1 TB of cache over 40 servers
o Delivers microsecond response
o Redundancy so no loss of cache contents on server failure
· Photo Servers
o Non-profile requests go directly against the photo-servers
o Only profile requests that miss the cachr cache.
· File Handle Cache (FHC)
o Based upon lighttpd and uses memcached as backing store
o Reduces metadata workload on NetApp servers
o Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh
§ They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.
§ The inode numbers are stored in the FHC
§ This technique offloads the NetApp servers dramatically.
§ Note that files are write only. Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.
· Issues with this architecture:
o Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).
§ The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)
§ Tracking last access time, last modified etc. has no value to Facebook. They really only need a blob store but they are using a filesystem at additional expense
o Heavy reliance on CDNs and caches such that netapp is basically almost pure backup
§ 92% of non-profile and 99.8% of profile pictures are stored in CDN
§ Many of the rest are almost all stored in caching layers
· Solution: Haystacks
o Haystacks are a user level abstraction where lots of data is stored in a single file
o Store an independent index vastly more efficient than the file store
o 1M of metadata/1G of data
§ Order of magnitude better on average than standard NetApp metadata
o 1 disk seek for all reads with any workload
o Most likely store in XFS
o Expect each haystack to be about 10G (with an index)
o Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN. The actual implementation is via NFS onto NetApp as photos were previously stored
o Net of what’s happening:
§ Haystack always hits on the metadata
o Plan to replace NetApp
§ Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)
§ Want more control of the cache behavior
o Each Haystack Format:
§ Version number,
§ Magic number,
o Index format
§ Photo key,
§ Photo size,
o Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage. Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.
o Q:Why not store the index in a RDBMS? Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).
· They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).
· They are considering using to SSDs in the future.
· Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.
· The Facebook default is to use databases. Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.
· Almost all data is cached twice: once in memcached and then again in the DBs.
· Random bits:
o Canada: 1 out of 3 Canadians use Facebook.
o Q:What is the strategy in China? A:“not to do what Google did” :-)
o Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clearly a big win for photos).
o 90% Indians access internet via a mobile device. Facebook very focused on mobile and international.
Overall, an excellent talk by Jason.
Sent my way by Hitesh Kanwathirtha of the Windows Live Experience team, Mitch Wyle of Engineering Excellence, and Dave Quick and Alex Mallet both in Windows Live Cloud Storage group. It was originally Slashdotted at: http://developers.slashdot.org/article.pl?no_d2=1&sid=08/06/25/148203. The presentation is posted at: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv.
James Hamilton, Data Center FuturesBldg 99/2428, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.