James Hamilton's Blog RSS 2.0
 Thursday, July 17, 2008

Going  boating: http://mvdirona.com/ so I’ll be taking a break from blogging until mid-august when I’m back and caught back up.  Enjoy,

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Thursday, July 17, 2008 4:53:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, July 13, 2008

I just got back from O’Reilly’s Foo Camp.   Foo is an interesting conference format in that there is no set agenda. It’s basically self organized as a open space-type event but that’s not what makes it special.  What makes Foo a very cool conference is the people.  Lots of conferences invite good people but few invite such a  diverse a set of attendees.  It was a lot of fun.

 

Here’s a picture from Saturday night of (right to left) Jesse Robbins (Seattle entrepreneur and co-chair of O’Reilly’s Velocity conference), Pat Helland (Microsoft Developer Division Architect) and myself.

 

  

From: http://flickr.com/photos/jesserobbins/2663653582/

Foo is an invitational event loosely organized around folks O’Reilly find interesting or want to get to know better.  Tim O’Reilly describes the invite process in this blog comment: http://gigaom.com/2005/08/16/foocampfighting/#comment-19709 .

 

The conference starts around 5pm on Friday with drinks followed by dinner. After dinner, the sign-up boards for Saturday and Sunday talk sessions are put up. When Tim O’Reilly announced the sign-up boards were up, attendees rose from their tables and casually started ambling towards the sign-up boards as though there really was no rush. We’ll just be doing some signing up over the course of the evening.  We might do it now or perhaps we’ll do it later. No big rush.  And then some folks break into a slight trot towards but still not really an overt run  Suddenly, it’s a full raging stampede.  We became a single flow than as a discrete set of individuals. The flow accelerated and then crashed up against the sign-up boards spilling to both sides with folks madly negotiating for pens, better locations, sharing sessions, a shift of 6” to one side or the other, or  requesting to join two sessions together. Welcome to foo camp.  It didn’t slow down much from that point for the next 44 hours.

 

Sessions are 1 hour long but it’s good form to share a session if you don’t need the full hour. Speakers use their judgment to sign up for a small, medium or large room.  Some rooms have AV equipment and some don’t.

 

I shared a 1 hour ssession with Jeff Hammerbacher who leads the Facebook data team.  Earlier last week Jeff announced that he will be leaving Facebok in September: http://valleywag.com/5024169/yet-another-hoodie+wearing-harvard-kid-drops-out-of-facebook.  Jeff and I got a medium sized room in a tent beside the main building without A/V. 

 

The title for my session was Where Does the Power go in Data Centers and How to get it Back?  I didn’t show slides but much of what we covered is posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_DCPowerSavingsFooCamp08.ppt.  In the session, we talked through how contemporary large data centers work first looking at power distribution. We tracked the power from the feed to the substation at 115,000 volts through numerous conversions before arriving at the CPU at 1.2 volts. We then talked about power saving server design techniques.  And then the mechanical systems used to get the heat back out.  In each section we discussed what could be done to improve the design and how much could be saved.

 

Our conclusion from the session was that power savings of nearly 4x where both possible and affordable using only current technology.  For those participated in the session, thanks for your contribution and  for your help. It was fun.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, July 13, 2008 5:58:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Saturday, July 12, 2008

Last week the Facebook Data team released Cassandra as open source. Cassandra is an structured store with write ahead logging and indexing. Jeff Hammerbacher, who leads the Facebook Data team described Cassandra as a BigTable data model running on a Dynamo-like infrastructure.

 

Google Code for Cassandra (Apache 2.0 License): http://code.google.com/p/the-cassandra-project/.

 

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented at SIGMOD 2008 this year: Cassandra: Structured Storage System over a P2P Network.  From the presentation:

 

Cassandra design goals:

·         High availability

·         Eventual consistency

·         Incremental scalability

·         Optimistic replication

·         Knobs to “tune” tradeoffs between consistency, durability, and latecy

·         Low cost of ownership

·         Minimal administration

 

Write operation: write to arbitrary node in Cassandra cluster, request sent to node owning the data, node writes to log first and then applied to in-memory copy. Properties of write: no locks in critical path, sequential disk accesses, behaves like a write through cache, atomicity guarantee for a key, and always writable.

 

Cluster membership is maintained via gossip protocol.

 

Lessons learned:

·         Add fancy features only when required

·         Many types of failures are possible

·         Big systems need proper systems-level monitoring

·         Value simple designs

 

Future work:

·         Atomicity guarantees across multiple keys

·         Distributed transactions (I’ll try to talk them out of this one)

·         Compression support

·         Fine grained security via ACLs

 

It looks like a well engineered system.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, July 12, 2008 3:17:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)

Ø  non-blocking, consistent reads (pb: does a transaction see its previous writes?)

Ø  cross-entity group operations have looser consistency

 

Performance

Ø  schemas declare their physical data locality

Ø  optimized to minimize seeks, RPCs, bandwidth, and storage

Ø  several ways of declaring physical locality

o   entity groups

o   shared primary key prefixes (collocating tables in Big Table)

o   locality groups – i.e., attribute partitioning

Ø  simply cost-transparent API primitives imply

o   only add scalable features

o   cost of writes is linear to data/indices

o   avoid scalability surprises

 

Avoiding joins

Ø  hierarchical primary keys

Ø  repeated fields (I guess just like Big Table)

Ø  store hierarchical data  in a column (it’s unclear to me whether this is the whole entity group or only part of it)

Ø  Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..

 

Replication HA

Ø  uses Paxos-based algorithm, per entity group

Ø  it was more complicated than they expected

Ø  writes need a majority of replicas to be up in order to commit

Ø  most reads are local, consistency ensured

Ø  replication is consistent and synchronous

Ø  automatic query failover: individual table servers may fail

 

Usage

Ø  has been used in production for a year

Ø  used by several internal tools and projects

Ø  several large and user-visible apps (they wouldn’t say which ones, except we know them)

Ø  used for rapidly implementing new features in older projects

Ø  many other projects are migrating to it

 

Technical lessons

Ø  declarative schemas are good

Ø  cost-transparent APIs good (SQL is not cost-transparent)

Ø  avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall

Ø  avoid scalability surprises

 

Social

Ø  schema reviews helpful

Ø  consistency is necessary

Ø  need a mindset of scalability and performance

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, July 10, 2008 1:47:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, July 08, 2008

Jim Gray proposed the original sort benchmark back in his famous Anon et al paper A Measure of Transaction Processing Power originally published in Datamation April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.

 

TeraSort is essentially a sequential I/O benchmark and the best way to get lots of I/O capacity is to have many servers.  The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model that is understandable to allow thousands of nodes to be harnessed.  MapReduce takes some heat for not being innovative and not having learned enough from the database community (MapReduce – A Major Step Backwards).  However, Google, Microsoft, and Yahoo run the model over thousands of nodes.  And all three have written higher level languages layers above MapReduce some of which look very SQL-like.

 

Owen O’Malley of the Yahoo Grid team took a moderate sized Hadoop cluster of 910 nodes and won the TeraSort benchmark.  Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result Owen.

 

Here’s the configuration that won:

  • 910 nodes
  • 4 dual core Xeons @ 2.0ghz per a node
  • 4 SATA disks per a node
  • 8G RAM per a node
  • 1 gigabit ethernet on each node
  • 40 nodes per a rack
  • 8 gigabit ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Yahoo bought expensive  4-socket servers for this configuration but, even then, this effort was won on less than ½ million in hardware.  Let’s assume that their fat nodes are $3.8k each.  They have 40 servers per rack so well need 23 racks. Let’s assume $4.2k per top of rack switch and $100k for core switching.  That’s 910*$3.8k+23*4.2k+$100k or $3,655k.  That means you can go out and spend roughly $3.5m and have the same resources that won the last sort benchmark. Amazing.  I love what’s happening in our industry.

Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).

 

The next thing I would like to see is this same test run on very low power servers.  Assuming the fairly beefy nodes used above are 350W each (they may well be more), the overall cluster ignoring networking will be 318kW and it ran for 209 seconds which is 18.490kW/hrs. Let’s focus on power and show what can be sorted for 1kW/hr. The kW/hr sort.

 

Congratulations to the Yahoo and Hadoop team for a great result.

 

                                --jrh

 

Wei Xiao of the Internet Search Research Center sent Owen’s result my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, July 08, 2008 4:03:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Software
 Friday, July 04, 2008

Recently results from two academic researchers in Japan will be significant to the NAND Flash market: http://www.electronicsweekly.com/Articles/Article.aspx?liArticleID=44028&PrinterFriendly=true.  Clearly the trip from laboratory to volume production is often longer than the early estimates but these results look important. 

 

Back in 2006, Jim Gray argued in Tape is Dead, Disk is Tape, Flash is Disk, & Ram Locality is King that we need a new layer in the storage hierarchy between memory and disk and NAND Flash was an excellent candidate.  Early NAND Flash-based SSDs could sustain read rates well beyond 10x of disk random IO rates but the write rates were terrible. Some were as bad as 1/5 the rate of magnetic disk. Second generation devices are solving the random write problem as expected.  Costs continue to plunge, overall performance continue to improve, and many very high scale server workloads have been deploying flash devices over the past year. A success by most measures but two issues remain. The first issue is that we have one important metric heading in the wrong direction: endurance.  NAND flash can only support a limited number of erase cycles before failing.  The second issue is that many don’t expect the feature size to be reduced below 32 nm which, were that to happen, would slow the improvement rate dramatically.

 

When I first got interested in single level cell (SLC) NAND Flash most published endurance numbers were typically in the 10^6 cycle range. Most current devices are in the 10^5 range and many see as low as 10^4 cycles on the horizon.  A million cycles is fine and will not restrict the life of the device.  100,000 cycles is closer to the line but my back of envelope numbers suggest 100k will (barely) be acceptable.  10k cycles is a problem and will restrict longevity of the device.

 

In this research work Shigeki Sakai and Ken Takeuchi show how Feroelectric gate Field Effect Transistors can dramatically improve the durability, reduce required programming voltage, improve performance, and support further generational reductions in feature size.  The device prototype they demonstrated uses 6v to program rather than 20v which may reduce the cost or increase the speed of devices slightly.  What’s most important in the demonstrated results is estimated endurance in the 10^8 cycle range which is at least three orders of magnitude better than most current generation NAND parts.  That would take endurance completely off the NAND Flash worry list. 

 

Potential feature size reduction is the other improvement of interest in this result.  Feature size reduction is the engine of Moore’s law and drives the semi-conductor economics we’ve all become used to.  Many experts don’t expect to be able to reduce NAND flash features size below roughly 30nm.  The Fe-NAND result shows potential for two more generational feature size reductions down to the 10nm range. This is important in that it drives costs reductions and we all want them to continue.

 

Fe-NAND looks extremely interesting and, if the research can be confirmed and is manufacturability, we have a very significant technology that can address the two major concerns with current generation NAND flash: 1) rapidly falling endurance, and 2) expected inability to drop down below 32nm feature size.  Flash continues to build industry momentum.

 

                                --jrh

 

Thanks to Jack Creasey for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, July 04, 2008 5:55:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Wednesday, July 02, 2008

Updated below with additional implementation details.

 

Last week Spansion made an interesting announcement: EcoRAM, a NOR Flash based storage part in a Dual In-line Memory Module (DIMM) package. 

 

NOR Flash technology growth has been fueled by the NOR support for Execute in Place (XIP).  Unlike the NAND Flash interface, where entire memory pages need to be shifted into memory to be operated upon, NOR flash is directly addressable.  And this direct addressability allows instructions to be read and executed directly from the memory.  There is no need to shift pages out one at a time. Byte addressability and support for XIP makes NOR ideal for boot loaders, ROMs, and the control program store for consumer devices. For example, the iPod Nano uses Silicon Storage Technology 39WF800A 8-Megabyte NOR boot flash (eeTimes).

 

Since NAND flash is not byte addressable providing only a block mode interface, it is typically attached to PCs and servers as an I/O device.  The NOR support for direct byte addressability makes it a candidate for attachment as a memory rather than as a block mode I/O device and when I first read the press release I thought this was what Spansion has done in partnership with Virident Systems. It’s clear they have NOR Flash memory in a memory (DIMM) package and they refer to it throughout the press release as “memory extension”. However, upon closer inspe