James Hamilton's Blog RSS 2.0
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)

Ø  non-blocking, consistent reads (pb: does a transaction see its previous writes?)

Ø  cross-entity group operations have looser consistency

 

Performance

Ø  schemas declare their physical data locality

Ø  optimized to minimize seeks, RPCs, bandwidth, and storage

Ø  several ways of declaring physical locality

o   entity groups

o   shared primary key prefixes (collocating tables in Big Table)

o   locality groups – i.e., attribute partitioning

Ø  simply cost-transparent API primitives imply

o   only add scalable features

o   cost of writes is linear to data/indices

o   avoid scalability surprises

 

Avoiding joins

Ø  hierarchical primary keys

Ø  repeated fields (I guess just like Big Table)

Ø  store hierarchical data  in a column (it’s unclear to me whether this is the whole entity group or only part of it)

Ø  Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..

 

Replication HA

Ø  uses Paxos-based algorithm, per entity group

Ø  it was more complicated than they expected

Ø  writes need a majority of replicas to be up in order to commit

Ø  most reads are local, consistency ensured

Ø  replication is consistent and synchronous

Ø  automatic query failover: individual table servers may fail

 

Usage

Ø  has been used in production for a year

Ø  used by several internal tools and projects

Ø  several large and user-visible apps (they wouldn’t say which ones, except we know them)

Ø  used for rapidly implementing new features in older projects

Ø  many other projects are migrating to it

 

Technical lessons

Ø  declarative schemas are good

Ø  cost-transparent APIs good (SQL is not cost-transparent)

Ø  avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall

Ø  avoid scalability surprises

 

Social

Ø  schema reviews helpful

Ø  consistency is necessary

Ø  need a mindset of scalability and performance

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, July 10, 2008 1:47:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, July 08, 2008

Jim Gray proposed the original sort benchmark back in his famous Anon et al paper A Measure of Transaction Processing Power originally published in Datamation April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.

 

TeraSort is essentially a sequential I/O benchmark and the best way to get lots of I/O capacity is to have many servers.  The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model that is understandable to allow thousands of nodes to be harnessed.  MapReduce takes some heat for not being innovative and not having learned enough from the database community (MapReduce – A Major Step Backwards).  However, Google, Microsoft, and Yahoo run the model over thousands of nodes.  And all three have written higher level languages layers above MapReduce some of which look very SQL-like.

 

Owen O’Malley of the Yahoo Grid team took a moderate sized Hadoop cluster of 910 nodes and won the TeraSort benchmark.  Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result Owen.

 

Here’s the configuration that won:

  • 910 nodes
  • 4 dual core Xeons @ 2.0ghz per a node
  • 4 SATA disks per a node
  • 8G RAM per a node
  • 1 gigabit ethernet on each node
  • 40 nodes per a rack
  • 8 gigabit ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Yahoo bought expensive  4-socket servers for this configuration but, even then, this effort was won on less than ½ million in hardware.  Let’s assume that their fat nodes are $3.8k each.  They have 40 servers per rack so well need 23 racks. Let’s assume $4.2k per top of rack switch and $100k for core switching.  That’s 910*$3.8k+23*4.2k+$100k or $3,655k.  That means you can go out and spend roughly $3.5m and have the same resources that won the last sort benchmark. Amazing.  I love what’s happening in our industry.

Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).

 

The next thing I would like to see is this same test run on very low power servers.  Assuming the fairly beefy nodes used above are 350W each (they may well be more), the overall cluster ignoring networking will be 318kW and it ran for 209 seconds which is 18.490kW/hrs. Let’s focus on power and show what can be sorted for 1kW/hr. The kW/hr sort.

 

Congratulations to the Yahoo and Hadoop team for a great result.

 

                                --jrh

 

Wei Xiao of the Internet Search Research Center sent Owen’s result my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, July 08, 2008 4:03:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
Software
 Friday, July 04, 2008

Recently results from two academic researchers in Japan will be significant to the NAND Flash market: http://www.electronicsweekly.com/Articles/Article.aspx?liArticleID=44028&PrinterFriendly=true.  Clearly the trip from laboratory to volume production is often longer than the early estimates but these results look important. 

 

Back in 2006, Jim Gray argued in Tape is Dead, Disk is Tape, Flash is Disk, & Ram Locality is King that we need a new layer in the storage hierarchy between memory and disk and NAND Flash was an excellent candidate.  Early NAND Flash-based SSDs could sustain read rates well beyond 10x of disk random IO rates but the write rates were terrible. Some were as bad as 1/5 the rate of magnetic disk. Second generation devices are solving the random write problem as expected.  Costs continue to plunge, overall performance continue to improve, and many very high scale server workloads have been deploying flash devices over the past year. A success by most measures but two issues remain. The first issue is that we have one important metric heading in the wrong direction: endurance.  NAND flash can only support a limited number of erase cycles before failing.  The second issue is that many don’t expect the feature size to be reduced below 32 nm which, were that to happen, would slow the improvement rate dramatically.

 

When I first got interested in single level cell (SLC) NAND Flash most published endurance numbers were typically in the 10^6 cycle range. Most current devices are in the 10^5 range and many see as low as 10^4 cycles on the horizon.  A million cycles is fine and will not restrict the life of the device.  100,000 cycles is closer to the line but my back of envelope numbers suggest 100k will (barely) be acceptable.  10k cycles is a problem and will restrict longevity of the device.

 

In this research work Shigeki Sakai and Ken Takeuchi show how Feroelectric gate Field Effect Transistors can dramatically improve the durability, reduce required programming voltage, improve performance, and support further generational reductions in feature size.  The device prototype they demonstrated uses 6v to program rather than 20v which may reduce the cost or increase the speed of devices slightly.  What’s most important in the demonstrated results is estimated endurance in the 10^8 cycle range which is at least three orders of magnitude better than most current generation NAND parts.  That would take endurance completely off the NAND Flash worry list. 

 

Potential feature size reduction is the other improvement of interest in this result.  Feature size reduction is the engine of Moore’s law and drives the semi-conductor economics we’ve all become used to.  Many experts don’t expect to be able to reduce NAND flash features size below roughly 30nm.  The Fe-NAND result shows potential for two more generational feature size reductions down to the 10nm range. This is important in that it drives costs reductions and we all want them to continue.

 

Fe-NAND looks extremely interesting and, if the research can be confirmed and is manufacturability, we have a very significant technology that can address the two major concerns with current generation NAND flash: 1) rapidly falling endurance, and 2) expected inability to drop down below 32nm feature size.  Flash continues to build industry momentum.

 

                                --jrh

 

Thanks to Jack Creasey for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, July 04, 2008 5:55:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Wednesday, July 02, 2008

Updated below with additional implementation details.

 

Last week Spansion made an interesting announcement: EcoRAM, a NOR Flash based storage part in a Dual In-line Memory Module (DIMM) package. 

 

NOR Flash technology growth has been fueled by the NOR support for Execute in Place (XIP).  Unlike the NAND Flash interface, where entire memory pages need to be shifted into memory to be operated upon, NOR flash is directly addressable.  And this direct addressability allows instructions to be read and executed directly from the memory.  There is no need to shift pages out one at a time. Byte addressability and support for XIP makes NOR ideal for boot loaders, ROMs, and the control program store for consumer devices. For example, the iPod Nano uses Silicon Storage Technology 39WF800A 8-Megabyte NOR boot flash (eeTimes).

 

Since NAND flash is not byte addressable providing only a block mode interface, it is typically attached to PCs and servers as an I/O device.  The NOR support for direct byte addressability makes it a candidate for attachment as a memory rather than as a block mode I/O device and when I first read the press release I thought this was what Spansion has done in partnership with Virident Systems. It’s clear they have NOR Flash memory in a memory (DIMM) package and they refer to it throughout the press release as “memory extension”. However, upon closer inspection, it appears to require that the NOR memory DIMM packages all be installed in a separate gateway server they refer as a “Green Gateway”.  It looks like the design has all the NOR flash in this separate server and a device driver on the host to virtualize memory on the NOR flash server. Essentially it still may be accessed via an I/O interface which is to say it’s not clear why you couldn’t do the same thing with NAND Flash.  And, it’s not immediately clear what protocol is used, what operating systems are supported, nor the exact performance but, overall, it still looks interesting.

 

Update: In conversations with Virident, it appears this part is potentially more interesting that I initially speculated. Rather than hosting the memory in an independent server as I speculated, it’s an in-server design but does require some BIOS engineering. From Virident: The current interconnect is the HTx bus for AMD servers.  Will be QPI for Intel.  We are doing AMD first.  You should be able to install on a standard two socket board.  DRAM sits behind the processor, and EcoRAM sits behind the controller.  Of course, the BIOS for the system must support the extended memory – we have HP systems up and running as a proof of concept, and Dell should work fairly soon. 

 

HTX and QPI open up big opportunities for hardware startups to innovate.  I know of many startups heading down this path. More innovation coming.

 

EcoRAM looks like it’s worth investigating in more detail.

 

                                --jrh

 

A (slightly) more detailed presentation is available at: http://www.spansion.com/about/news/events/Transforming_the_Internet_Data_Center.pdf.  Some interesting speeds and feeds from the press release and the presentation:

 

·         1/8th the power of DRAM at a given capacity,

·         Estimating that 8x power to storage capacity advantage over DRAM will grow to a full 16x by 2012

·         10x the reliability of DRAM,

·         smaller die area per bit,

·         much closer to DRAM access times (a bit vague on this one).

 

The Achilles heel of NOR Flash has been the poor write speed.  The press release claims 2x to 10x better than traditional NOR Memories but this is still considerably slower than DRAM.

 

We need a lot more technical data and repeatable performance measures but, with what has been published so far, it would appear that the sweet spot for this device are very high random IO rate, read-mostly workloads.  Potentially fairly interesting.

 

                                                --jrh

Thanks to Son VoBa of the Windows Virtualization team for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 02, 2008 5:14:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Sunday, June 29, 2008

Title: Needle in a Haystack: Efficient Storage of Billions of Photos

Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)

Slides: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv

 

An excellent talk that I really enjoyed.  I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned.  Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk.  Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.

 

The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs).  The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier.  Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.

 

My rough notes below with some of what I found most interesting.

 

Overall Facebook facts:

·         #6 site on the internet

·         500 total employees

o   200 in engineering

o   25 in Infrastructure Engineering

·         One of the largest MySQL installations in the world

·         Big user and contributor to Memcached

·         More than 10k servers in production

·         6,000 logical databases in production

 

Photo Storage and Management at Facebook:

·         Photo facts:

o   6.5B photos in total

§  4 to 5 sizes of each picture is materialized (30B files)

§  475k images/second

·         Mostly served via CDN (Akamai & Limelight)

·         200k profile photos/second

§  100m uploads/week

o   Stored on netapp filers

·         First level caching via CDN (Akamai & Limelight)

o   99.8% hit rate for profiles

o   92% hit rate for remainder

·         Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)

o   Based upon a modified version of evhttp using memcached as a “backing” store

o   Since cachr is independent from memcachd, cachr failure doesn’t lose state

o   1 TB of cache over 40 servers

o   Delivers microsecond response

o   Redundancy so no loss of cache contents on server failure

·         Photo Servers

o   Non-profile requests go directly against the photo-servers

o   Only profile requests that miss the cachr cache.

·         File Handle Cache (FHC)

o   Based upon lighttpd and uses memcached as backing store

o   Reduces metadata workload on NetApp servers

o   Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh

§  They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.

§  The inode numbers are stored in the FHC

§  This technique offloads the NetApp servers dramatically.

§  Note that files are write only.  Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.

·         Issues with this architecture:

o   Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).

§  The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)

§  Tracking last access time, last modified etc. has no value to Facebook.  They really only need a blob store but they are using a filesystem at additional expense

o   Heavy reliance on CDNs and caches such that netapp is basically almost pure backup

§  92% of non-profile and 99.8% of profile pictures are stored in CDN

§  Many of the rest are almost all stored in caching layers

·         Solution: Haystacks

o   Haystacks are a user level abstraction where lots of data is stored in a single file

o   Store an independent index vastly more efficient than the file store

o   1M of metadata/1G of data

§  Order of magnitude better on average than standard NetApp metadata

o   1 disk seek for all reads with any workload

o   Most likely store in XFS

o   Expect each haystack to be about 10G (with an index)

o   Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN.  The actual implementation is via NFS onto NetApp as photos were previously stored

o   Net of what’s happening:

§  Haystack always hits on the metadata

o   Plan to replace NetApp

§  Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)

§  Want more control of the cache behavior

o   Each Haystack Format:

§  Version number,

§  Magic number,

§  Length,

§  Data,

§  Checksum

o   Index format

§  Version,

§  Photo key,

§  Photo size,

§  Start,

§  Length.

o   Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage.  Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.

o   Q:Why not store the index in a RDBMS?  Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).

·         They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).

·         They are considering using to SSDs in the future.

·         Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.

·         The Facebook default is to use databases.  Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.

·         Almost all data is cached twice: once in memcached and then again in the DBs.

·         Random bits:

o   Canada: 1 out of 3 Canadians use Facebook.

o   Q:What is the strategy in China?  A:“not to do what Google did” :-)

o   Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clea