James Hamilton's Blog RSS 2.0
 Wednesday, July 02, 2008

Updated below with additional implementation details.

 

Last week Spansion made an interesting announcement: EcoRAM, a NOR Flash based storage part in a Dual In-line Memory Module (DIMM) package. 

 

NOR Flash technology growth has been fueled by the NOR support for Execute in Place (XIP).  Unlike the NAND Flash interface, where entire memory pages need to be shifted into memory to be operated upon, NOR flash is directly addressable.  And this direct addressability allows instructions to be read and executed directly from the memory.  There is no need to shift pages out one at a time. Byte addressability and support for XIP makes NOR ideal for boot loaders, ROMs, and the control program store for consumer devices. For example, the iPod Nano uses Silicon Storage Technology 39WF800A 8-Megabyte NOR boot flash (eeTimes).

 

Since NAND flash is not byte addressable providing only a block mode interface, it is typically attached to PCs and servers as an I/O device.  The NOR support for direct byte addressability makes it a candidate for attachment as a memory rather than as a block mode I/O device and when I first read the press release I thought this was what Spansion has done in partnership with Virident Systems. It’s clear they have NOR Flash memory in a memory (DIMM) package and they refer to it throughout the press release as “memory extension”. However, upon closer inspection, it appears to require that the NOR memory DIMM packages all be installed in a separate gateway server they refer as a “Green Gateway”.  It looks like the design has all the NOR flash in this separate server and a device driver on the host to virtualize memory on the NOR flash server. Essentially it still may be accessed via an I/O interface which is to say it’s not clear why you couldn’t do the same thing with NAND Flash.  And, it’s not immediately clear what protocol is used, what operating systems are supported, nor the exact performance but, overall, it still looks interesting.

 

Update: In conversations with Virident, it appears this part is potentially more interesting that I initially speculated. Rather than hosting the memory in an independent server as I speculated, it’s an in-server design but does require some BIOS engineering. From Virident: The current interconnect is the HTx bus for AMD servers.  Will be QPI for Intel.  We are doing AMD first.  You should be able to install on a standard two socket board.  DRAM sits behind the processor, and EcoRAM sits behind the controller.  Of course, the BIOS for the system must support the extended memory – we have HP systems up and running as a proof of concept, and Dell should work fairly soon. 

 

HTX and QPI open up big opportunities for hardware startups to innovate.  I know of many startups heading down this path. More innovation coming.

 

EcoRAM looks like it’s worth investigating in more detail.

 

                                --jrh

 

A (slightly) more detailed presentation is available at: http://www.spansion.com/about/news/events/Transforming_the_Internet_Data_Center.pdf.  Some interesting speeds and feeds from the press release and the presentation:

 

·         1/8th the power of DRAM at a given capacity,

·         Estimating that 8x power to storage capacity advantage over DRAM will grow to a full 16x by 2012

·         10x the reliability of DRAM,

·         smaller die area per bit,

·         much closer to DRAM access times (a bit vague on this one).

 

The Achilles heel of NOR Flash has been the poor write speed.  The press release claims 2x to 10x better than traditional NOR Memories but this is still considerably slower than DRAM.

 

We need a lot more technical data and repeatable performance measures but, with what has been published so far, it would appear that the sweet spot for this device are very high random IO rate, read-mostly workloads.  Potentially fairly interesting.

 

                                                --jrh

Thanks to Son VoBa of the Windows Virtualization team for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 02, 2008 5:14:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Sunday, June 29, 2008

Title: Needle in a Haystack: Efficient Storage of Billions of Photos

Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)

Slides: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv

 

An excellent talk that I really enjoyed.  I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned.  Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk.  Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.

 

The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs).  The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier.  Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.

 

My rough notes below with some of what I found most interesting.

 

Overall Facebook facts:

·         #6 site on the internet

·         500 total employees

o   200 in engineering

o   25 in Infrastructure Engineering

·         One of the largest MySQL installations in the world

·         Big user and contributor to Memcached

·         More than 10k servers in production

·         6,000 logical databases in production

 

Photo Storage and Management at Facebook:

·         Photo facts:

o   6.5B photos in total

§  4 to 5 sizes of each picture is materialized (30B files)

§  475k images/second

·         Mostly served via CDN (Akamai & Limelight)

·         200k profile photos/second

§  100m uploads/week

o   Stored on netapp filers

·         First level caching via CDN (Akamai & Limelight)

o   99.8% hit rate for profiles

o   92% hit rate for remainder

·         Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)

o   Based upon a modified version of evhttp using memcached as a “backing” store

o   Since cachr is independent from memcachd, cachr failure doesn’t lose state

o   1 TB of cache over 40 servers

o   Delivers microsecond response

o   Redundancy so no loss of cache contents on server failure

·         Photo Servers

o   Non-profile requests go directly against the photo-servers

o   Only profile requests that miss the cachr cache.

·         File Handle Cache (FHC)

o   Based upon lighttpd and uses memcached as backing store

o   Reduces metadata workload on NetApp servers

o   Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh

§  They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.

§  The inode numbers are stored in the FHC

§  This technique offloads the NetApp servers dramatically.

§  Note that files are write only.  Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.

·         Issues with this architecture:

o   Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).

§  The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)

§  Tracking last access time, last modified etc. has no value to Facebook.  They really only need a blob store but they are using a filesystem at additional expense

o   Heavy reliance on CDNs and caches such that netapp is basically almost pure backup

§  92% of non-profile and 99.8% of profile pictures are stored in CDN

§  Many of the rest are almost all stored in caching layers

·         Solution: Haystacks

o   Haystacks are a user level abstraction where lots of data is stored in a single file

o   Store an independent index vastly more efficient than the file store

o   1M of metadata/1G of data

§  Order of magnitude better on average than standard NetApp metadata

o   1 disk seek for all reads with any workload

o   Most likely store in XFS

o   Expect each haystack to be about 10G (with an index)

o   Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN.  The actual implementation is via NFS onto NetApp as photos were previously stored

o   Net of what’s happening:

§  Haystack always hits on the metadata

o   Plan to replace NetApp

§  Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)

§  Want more control of the cache behavior

o   Each Haystack Format:

§  Version number,

§  Magic number,

§  Length,

§  Data,

§  Checksum

o   Index format

§  Version,

§  Photo key,

§  Photo size,

§  Start,

§  Length.

o   Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage.  Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.

o   Q:Why not store the index in a RDBMS?  Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).

·         They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).

·         They are considering using to SSDs in the future.

·         Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.

·         The Facebook default is to use databases.  Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.

·         Almost all data is cached twice: once in memcached and then again in the DBs.

·         Random bits:

o   Canada: 1 out of 3 Canadians use Facebook.

o   Q:What is the strategy in China?  A:“not to do what Google did” :-)

o   Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clearly a big win for photos).

o   90% Indians access internet via a mobile device.  Facebook very focused on mobile and international.

 

Overall, an excellent talk by Jason.

 

--jrh

 

Sent my way by Hitesh Kanwathirtha  of the Windows Live Experience team, Mitch Wyle of Engineering Excellence, and Dave Quick and Alex Mallet both in Windows Live Cloud Storage group. It was originally Slashdotted at: http://developers.slashdot.org/article.pl?no_d2=1&sid=08/06/25/148203. The presentation is posted at: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, June 29, 2008 8:04:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Friday, June 27, 2008

Alex Mallet and Viraj Mody of the Windows Live Mesh team took great notes at the Structure ’08 (Put Cloud Computing to Work) conference (appended below).

 

Some pre-reading information was made available to all attendees as well: Refresh the Net: Why the Internet needs a Makeover?

 

Overall

-          Interesting mix of attendees from companies in all areas of “cloud computing”

-          The quality of the presentations and panels was somewhat uneven

-          Talks were not very technical

-          Amazon is the clear leader in mindshare; MS isn’t even on the board

-          Lots of speculation about how software-as-a-service, platform-as-a-service, everything-as-a-service is going to play out: who the users will be, how to make money, whether there will be cloud computing standards etc

 

5 min Nick Carr video [author of “The Big Switch”]

-          Drew symbolic link between BillG retiring and Structure ’08, the first “cloud computing” conference, being the same week, marking the shift of computing from the desktop to the datacenter

-          Generic pontificating on the coming “age of cloud computing”

 

“The Platform Revolution: a look into disruptive technologies”, Jonathan Yarmis, research analyst

-          Enterprises always lag behind consumers in adoption of new technology, and IT is powerless to stop users from adopting new technology

-          4 big tech trends: social networks, mobility, cloud computing, alternative business models [eg ad-supported]

-          Tech trends mutually reinforcing: mobility leads to more social networking applications, being able to access data/apps in the cloud leads to more mobility

-          Mobile is platform for next-gen and emerging markets: 1.4 billion devices per year, 20% device growth per year, average device lifetime 21 months; opens up market to new users and new uses

-          Claim: “single converged device will never exist”; cloud computing enables independence of device and location

-          Stream computing: rate of data creation is growing at 50-500% per year, and it’s becoming increasingly important to be able to quickly process the data, determine what’s interesting and discard the rest

-          “Economic value of peer relationships hasn’t been realized yet” – Facebook Beacon was a good idea, but poorly realized

 

“Virtualization and Cloud Computing”, with Mendel Rosenblum, VMWare founder 

-          Virtualization can/should be used to decouple software from hardware even in the datacenter

-          Virtualization is cloud-computing enabler: can decide whether to run your own machines, or use somebody else’s, without having to rewrite everything

-          Coming “age of multicore” makes virtualization even more important/useful

-          Smart software that figures out how to distribute VMs over physical hardware isn’t a commodity yet

-          VMWare is working on merging the various virtualization layers: machine, storage, networks [eg VLANs]

-          HW support for virtualization is mostly being done for server-class machines [?]

-          Rosenblum doesn’t think moving workloads from the datacenter to edge machines, to take advantage of spare cycles, will ever take off – it’s just too much of a pain to try to harness those spare cycles

-          Single-machine hypervisor is becoming commodity, so VMWare is moving to managing the whole datacenter, to stay ahead of the competition

 

Keynote, Werner Vogels, Amazon CTO:

-          Mostly a pitch for Amazon’s web services: EC2, S3, SQS, SimpleDB

-          Gave example of Animoto, company that merges music + photos to create a video, which has no servers whatsoever: had 25K users, launched a Facebook app, and went from 25K users total to adding 25K users/hour; were able to handle it by moving from 50 EC2 instances to 3000 EC2 instances in 2 days

-          Currently 370K registered AWS developers

-          Bandwidth consumed by AWS is bigger than bandwidth consumed by Amazon e-commerce services

-          Shift to service-oriented architecture occurred as result of being approached by Target in 2001/2002, asking whether Amazon could run their e-commerce for them. Realized that their current architecture wouldn’t scale/work, so they re-engineered it

-          Single Amazon page can depend on hundreds of other services

-          Big barrier between developing web app and operating it at scale: loadbalancing, hardware mgmt, routing, storage management etc. Called this the “undifferentiated heavy lifting” that needs to be done to even get in the game

-          Claim: typical company spends 70% effort/money on “undifferentiated heavy lifting” and 30% on differentiated value creation; AWS is intended to allow companies to focus much more on differentiated value creation

-          SmugMug has been at forefront of companies relying on AWS; currently store 600TB of photos in S3, and have launched an entirely new product, SmugVault, based purely on the existence of S3 => AWS not just replacement for existing infrastructure, but enabling new businesses

-          In 2 years, cloud computing will be evaluated along 5 axes: security, availability, scalability, performance, cost-effectiveness

-          Really plugged the pay-as-you-go model

 

“Working the Cloud: next-gen infrastructure for new entrepreneurs” panel

-          Q: is lock-in going to be a problem ie how easy will it be to move an app from one cloud computing platform to another ?

o   A: Strong desire for standards that will make it easy to port apps, but not there yet

o   A: To really use the cloud, you need to embed assumptions about it in your code; even bare-metal clouds require intelligence, like scripts to spin up new EC2 instances, so lock-in is a real concern

o   Side thread: Google person on panel claimed that using Google App Engine didn’t lock in developers, because the GAE APIs are well-documented, and he was promptly verbally mugged by just about everyone else on the panel, pointing out things like the use of BigTable in GAE making it difficult to extract data, or replace the underlying storage layer etc.

o   Prediction: there will be convergence to standards, and choice will come down to whether to use a generic cloud, or a more specialized/efficient cloud, eg one targeted at the medical information sector, with features for HiPPA compliance

-          Need new licensing models for cloud computing, to deal with the dynamic increase/decrease in number of application instances/virtual machines as load changes

-          Tidbit: Google has geographically distributed data centers, and geo-replicates [some] data

-          Q: will we be able to use our old “toys” [APIs, programming models etc] in the cloud ?

o   A: Yes, have to be able to, otherwise people won’t adopt it

o   A:  Yes, just have to be smart about replacing the plumbing underneath the various APIs

o   A: Yes, but current API frameworks are lacking some semantics that become important in cloud computing, like ways to specify how many times an object should be replicated, whether it’s ok to lazily replicate some data etc

 

Mini-note, “Optical networking”, Drew Perkins

-          Video is, by far, the largest consumer of bandwidth on the internet

-          Cost of content is disproportionate to size: 4MB song costs $1, 200MB TV show episode costs $2, 1.5GB movie costs $3-4.

-          Photonic integrated circuits that can be used to build 100GB/s are needed to meet future bandwidth requirements: less power,  need fewer network devices

 

“Race to the next database” panel

-          Quite poorly organized: panelists each got to give an [uninformative] infomercial for their company, and there was very little time for actual questions and discussion

-          Aster Data Systems is back-end data warehouse and analytics system for MySpace: 1 billion impressions/day, 1 TB of new data per day, new data is loaded into 100-node Aster cluster every hour and needs to be available for ad analytics engine to decide which ads to show

-          SQLStream is company that has built a data stream processing product that collapses the usual processing stages [data staging, cleaning, loading etc] into a pipeline that continuously produces results for “standing” queries; useful for real-time analytics

-          Web causes disruption to traditional DB model because of [10x] larger data volumes, need for high interactivity/turn-around, need to scale out instead of up. For example, GreenPlum is building a 20PB data warehouse for one customer.

-          Can’t rely on all the data being in a single store, so need to be able to do federated/distributed queries

 

“MS Datacenters”

-          Presentation centered on MS plans for datacenters-in-a-box

-          Datacenters-in-a-container are long-term strategy for MS, not just transient response to high demand and lack of space

-          Container blocks have lower reliability than traditional datacenters, so applications need to be geo-distributed and redundant to handle downtime

 

“End of boxed software”, Parker Harris, co-founder of Salesforce.com

-          Origins of salesforce.com: Modeled on consumer internet sites - amazon.com; ebay.com

-          Transition from client->server site to a platform (force.com): first instinct is to build a platform, but then you lose touch with why you're building it. As they started building their experience, they abstracted away components and started realizing it could become a platform. Revenue comes from site, platform is a bonus.

-          Initially scaled by buying bigger [Sun] boxes ie scaled up, not out, and ran into lots of complexity. Unclear whether that’s still the case or whether they’ve re-architected.

 

“Scaling to satiate demand” panel

-          Q: “When did you first realize your architecture was broken, and couldn’t scale ?”

o   A: When site started to get slow; Ebay: after massive site outages

-          Q: “How do you handle running code supplied by other people on your servers ?”

o   A: Compartmentalize ie isolate apps; have mgmt infrastructure and tooling to be able to monitor and control uploaded apps; provide developers with fixed APIs and tools so you can control what they do

-          Q: “How do Facebook and Slide [builds Facebook apps] figure out where the problems are if Slide starts failing ?”

o   A: Lots of real-time metrics; ops folks from both companies are in IM contact and do co-operative troubleshooting

-          Q: “How should you handle PR around outages ?”

o   A: Be transparent; communicate; set realistic timelines for when site will be back up; set expectations wrt “bakedness” of features

-          Beware of retrying failed operations too soon, since retries may cause an overloaded system to never be able to come up

-          Ebay: each app is instrumented with the same set of logging infrastructure and there’s a real-time OLAP engine that analyzes the logs and does correlation to try to find troubled spots

-          Facebook and Meebo both utilized their user base to translate their sites into multiple languages

-          Need to know which bits of the system you can turn off if you run into trouble

-          Biggest challenge is scaling features that involve many-many links between users; it’s easy to scale a single user + data

-          Keep monitoring: there are always scale issues to find, even without problems/outages

-          Slide: “Firefox 3 broke all of our apps”

-          Facebook has > 10K servers

 

Mini-note, “Creating fair bandwidth usage on the Internet”, Dr. Lawrence Roberts, leader of the original ARPANET team

-          P2P leads to unfair usage: people not using P2P get less; 5% of users (P2P users) receive 80% capacity

-          Deep packet inspection catches 75% of p2p traffic, but isn’t effective in creating fairness

-          Anagran has flow behavior mgmt: observe per user flow behavior & utilization and then equalize. Equalization is done in memory, on networking infrastructure [routers etc] and at the user level instead of the flow level

 

Mini-note, “Cloud computing and the mid-market”, Zach Nelson, CEO of NetSuite

-          Mid-market is the last great business applications opportunity

-          Cloud computing makes it economical to reach the “Fortune 5 million”

-          Cloud computing still doesn’t solve problem of application integration

-          Consulting services industry is next to be transformed by cloud computing

 

Mini-note, “Electricity use in datacenters”, Dr.Jonathan Koomey

-          Site Uptime Network, an organization of data center operators and designers did study of 19 datacenters from 1999-2006:

o   Floor area remained relatively constant

o   Power density went from 23 W/sq ft to 35 W/sq ft

-          In 2000, datacenters used 0.5% of world’s electricity; in 2005, used 1%.

-          Cooling and power distribution are largest electricity consumers; servers are second-largest; storage and networking equipment accounts for a small fraction

-          Asia-Pacific region’s use of power is increasing the fastest, over 25% growth per year

-          Lots of inefficiencies in facility design: wrong cost metrics [sq feet versus kW], different budgets and costs borne by different orgs [facilities vrs IT], multiple safety factors piled on top of each other

-          Designed Eco-Rack, which, with only a few months of work, reduces power consumption on normalized workload by 16-18%

-          Forecast: datacenter electricity consumption will grow by 76% by 2010, maybe a bit less with virtualization

 

“VC investment in cloud computing infrastructure” panel