Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, July 13, 2008

I just got back from O’Reilly’s Foo Camp.   Foo is an interesting conference format in that there is no set agenda. It’s basically self organized as a open space-type event but that’s not what makes it special.  What makes Foo a very cool conference is the people.  Lots of conferences invite good people but few invite such a  diverse a set of attendees.  It was a lot of fun.

 

Here’s a picture from Saturday night of (right to left) Jesse Robbins (Seattle entrepreneur and co-chair of O’Reilly’s Velocity conference), Pat Helland (Microsoft Developer Division Architect) and myself.

 

  

From: http://flickr.com/photos/jesserobbins/2663653582/

Foo is an invitational event loosely organized around folks O’Reilly find interesting or want to get to know better.  Tim O’Reilly describes the invite process in this blog comment: http://gigaom.com/2005/08/16/foocampfighting/#comment-19709 .

 

The conference starts around 5pm on Friday with drinks followed by dinner. After dinner, the sign-up boards for Saturday and Sunday talk sessions are put up. When Tim O’Reilly announced the sign-up boards were up, attendees rose from their tables and casually started ambling towards the sign-up boards as though there really was no rush. We’ll just be doing some signing up over the course of the evening.  We might do it now or perhaps we’ll do it later. No big rush.  And then some folks break into a slight trot towards but still not really an overt run  Suddenly, it’s a full raging stampede.  We became a single flow than as a discrete set of individuals. The flow accelerated and then crashed up against the sign-up boards spilling to both sides with folks madly negotiating for pens, better locations, sharing sessions, a shift of 6” to one side or the other, or  requesting to join two sessions together. Welcome to foo camp.  It didn’t slow down much from that point for the next 44 hours.

 

Sessions are 1 hour long but it’s good form to share a session if you don’t need the full hour. Speakers use their judgment to sign up for a small, medium or large room.  Some rooms have AV equipment and some don’t.

 

I shared a 1 hour ssession with Jeff Hammerbacher who leads the Facebook data team.  Earlier last week Jeff announced that he will be leaving Facebok in September: http://valleywag.com/5024169/yet-another-hoodie+wearing-harvard-kid-drops-out-of-facebook.  Jeff and I got a medium sized room in a tent beside the main building without A/V. 

 

The title for my session was Where Does the Power go in Data Centers and How to get it Back?  I didn’t show slides but much of what we covered is posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_DCPowerSavingsFooCamp08.ppt.  In the session, we talked through how contemporary large data centers work first looking at power distribution. We tracked the power from the feed to the substation at 115,000 volts through numerous conversions before arriving at the CPU at 1.2 volts. We then talked about power saving server design techniques.  And then the mechanical systems used to get the heat back out.  In each section we discussed what could be done to improve the design and how much could be saved.

 

Our conclusion from the session was that power savings of nearly 4x where both possible and affordable using only current technology.  For those participated in the session, thanks for your contribution and  for your help. It was fun.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, July 13, 2008 5:58:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Saturday, July 12, 2008

Last week the Facebook Data team released Cassandra as open source. Cassandra is an structured store with write ahead logging and indexing. Jeff Hammerbacher, who leads the Facebook Data team described Cassandra as a BigTable data model running on a Dynamo-like infrastructure.

 

Google Code for Cassandra (Apache 2.0 License): http://code.google.com/p/the-cassandra-project/.

 

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented at SIGMOD 2008 this year: Cassandra: Structured Storage System over a P2P Network.  From the presentation:

 

Cassandra design goals:

·         High availability

·         Eventual consistency

·         Incremental scalability

·         Optimistic replication

·         Knobs to “tune” tradeoffs between consistency, durability, and latecy

·         Low cost of ownership

·         Minimal administration

 

Write operation: write to arbitrary node in Cassandra cluster, request sent to node owning the data, node writes to log first and then applied to in-memory copy. Properties of write: no locks in critical path, sequential disk accesses, behaves like a write through cache, atomicity guarantee for a key, and always writable.

 

Cluster membership is maintained via gossip protocol.

 

Lessons learned:

·         Add fancy features only when required

·         Many types of failures are possible

·         Big systems need proper systems-level monitoring

·         Value simple designs

 

Future work:

·         Atomicity guarantees across multiple keys

·         Distributed transactions (I’ll try to talk them out of this one)

·         Compression support

·         Fine grained security via ACLs

 

It looks like a well engineered system.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, July 12, 2008 3:17:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)

Ø  non-blocking, consistent reads (pb: does a transaction see its previous writes?)

Ø  cross-entity group operations have looser consistency

 

Performance

Ø  schemas declare their physical data locality

Ø  optimized to minimize seeks, RPCs, bandwidth, and storage

Ø  several ways of declaring physical locality

o   entity groups

o   shared primary key prefixes (collocating tables in Big Table)

o   locality groups – i.e., attribute partitioning

Ø  simply cost-transparent API primitives imply

o   only add scalable features

o   cost of writes is linear to data/indices

o   avoid scalability surprises

 

Avoiding joins

Ø  hierarchical primary keys

Ø  repeated fields (I guess just like Big Table)

Ø  store hierarchical data  in a column (it’s unclear to me whether this is the whole entity group or only part of it)

Ø  Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..

 

Replication HA

Ø  uses Paxos-based algorithm, per entity group

Ø  it was more complicated than they expected

Ø  writes need a majority of replicas to be up in order to commit

Ø  most reads are local, consistency ensured

Ø  replication is consistent and synchronous

Ø  automatic query failover: individual table servers may fail

 

Usage

Ø  has been used in production for a year

Ø  used by several internal tools and projects

Ø  several large and user-visible apps (they wouldn’t say which ones, except we know them)

Ø  used for rapidly implementing new features in older projects

Ø  many other projects are migrating to it

 

Technical lessons

Ø  declarative schemas are good

Ø  cost-transparent APIs good (SQL is not cost-transparent)

Ø  avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall

Ø  avoid scalability surprises

 

Social

Ø  schema reviews helpful

Ø  consistency is necessary

Ø  need a mindset of scalability and performance

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, July 10, 2008 1:47:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, July 08, 2008

Jim Gray proposed the original sort benchmark back in his famous Anon et al paper A Measure of Transaction Processing Power originally published in Datamation April 1, 1985. TeraSort is one of the benchmarks that Jim evolved from this original proposal.

 

TeraSort is essentially a sequential I/O benchmark and the best way to get lots of I/O capacity is to have many servers.  The mainframe engineer-a-bigger-bus technique has produced some nice I/O rates but it doesn’t scale. There have been some very good designs but, in the end, commodity parts in volume always win. The trick is coming up with a programming model that is understandable to allow thousands of nodes to be harnessed.  MapReduce takes some heat for not being innovative and not having learned enough from the database community (MapReduce – A Major Step Backwards).  However, Google, Microsoft, and Yahoo run the model over thousands of nodes.  And all three have written higher level languages layers above MapReduce some of which look very SQL-like.

 

Owen O’Malley of the Yahoo Grid team took a moderate sized Hadoop cluster of 910 nodes and won the TeraSort benchmark.  Owen blogged the result: Apache Hadoop Wins Terabyte Sort Benchmark and provided more details in a short paper: TeraByte Sort on Apache Hadoop. Great result Owen.

 

Here’s the configuration that won:

  • 910 nodes
  • 4 dual core Xeons @ 2.0ghz per a node
  • 4 SATA disks per a node
  • 8G RAM per a node
  • 1 gigabit ethernet on each node
  • 40 nodes per a rack
  • 8 gigabit ethernet uplinks from each rack to the core
  • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
  • Sun Java JDK 1.6.0_05-b13

Yahoo bought expensive  4-socket servers for this configuration but, even then, this effort was won on less than ½ million in hardware.  Let’s assume that their fat nodes are $3.8k each.  They have 40 servers per rack so well need 23 racks. Let’s assume $4.2k per top of rack switch and $100k for core switching.  That’s 910*$3.8k+23*4.2k+$100k or $3,655k.  That means you can go out and spend roughly $3.5m and have the same resources that won the last sort benchmark. Amazing.  I love what’s happening in our industry.

Update: Math glitch in original posting fixed above (thanks to Ari Rabkin & Nathan Shrenk).

 

The next thing I would like to see is this same test run on very low power servers.  Assuming the fairly beefy nodes used above are 350W each (they may well be more), the overall cluster ignoring networking will be 318kW and it ran for 209 seconds which is 18.490kW/hrs. Let’s focus on power and show what can be sorted for 1kW/hr. The kW/hr sort.

 

Congratulations to the Yahoo and Hadoop team for a great result.

 

                                --jrh

 

Wei Xiao of the Internet Search Research Center sent Owen’s result my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, July 08, 2008 4:03:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Software
 Friday, July 04, 2008

Recently results from two academic researchers in Japan will be significant to the NAND Flash market: http://www.electronicsweekly.com/Articles/Article.aspx?liArticleID=44028&PrinterFriendly=true.  Clearly the trip from laboratory to volume production is often longer than the early estimates but these results look important. 

 

Back in 2006, Jim Gray argued in Tape is Dead, Disk is Tape, Flash is Disk, & Ram Locality is King that we need a new layer in the storage hierarchy between memory and disk and NAND Flash was an excellent candidate.  Early NAND Flash-based SSDs could sustain read rates well beyond 10x of disk random IO rates but the write rates were terrible. Some were as bad as 1/5 the rate of magnetic disk. Second generation devices are solving the random write problem as expected.  Costs continue to plunge, overall performance continue to improve, and many very high scale server workloads have been deploying flash devices over the past year. A success by most measures but two issues remain. The first issue is that we have one important metric heading in the wrong direction: endurance.  NAND flash can only support a limited number of erase cycles before failing.  The second issue is that many don’t expect the feature size to be reduced below 32 nm which, were that to happen, would slow the improvement rate dramatically.

 

When I first got interested in single level cell (SLC) NAND Flash most published endurance numbers were typically in the 10^6 cycle range. Most current devices are in the 10^5 range and many see as low as 10^4 cycles on the horizon.  A million cycles is fine and will not restrict the life of the device.  100,000 cycles is closer to the line but my back of envelope numbers suggest 100k will (barely) be acceptable.  10k cycles is a problem and will restrict longevity of the device.

 

In this research work Shigeki Sakai and Ken Takeuchi show how Feroelectric gate Field Effect Transistors can dramatically improve the durability, reduce required programming voltage, improve performance, and support further generational reductions in feature size.  The device prototype they demonstrated uses 6v to program rather than 20v which may reduce the cost or increase the speed of devices slightly.  What’s most important in the demonstrated results is estimated endurance in the 10^8 cycle range which is at least three orders of magnitude better than most current generation NAND parts.  That would take endurance completely off the NAND Flash worry list. 

 

Potential feature size reduction is the other improvement of interest in this result.  Feature size reduction is the engine of Moore’s law and drives the semi-conductor economics we’ve all become used to.  Many experts don’t expect to be able to reduce NAND flash features size below roughly 30nm.  The Fe-NAND result shows potential for two more generational feature size reductions down to the 10nm range. This is important in that it drives costs reductions and we all want them to continue.

 

Fe-NAND looks extremely interesting and, if the research can be confirmed and is manufacturability, we have a very significant technology that can address the two major concerns with current generation NAND flash: 1) rapidly falling endurance, and 2) expected inability to drop down below 32nm feature size.  Flash continues to build industry momentum.

 

                                --jrh

 

Thanks to Jack Creasey for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, July 04, 2008 5:55:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Wednesday, July 02, 2008

Updated below with additional implementation details.

 

Last week Spansion made an interesting announcement: EcoRAM, a NOR Flash based storage part in a Dual In-line Memory Module (DIMM) package. 

 

NOR Flash technology growth has been fueled by the NOR support for Execute in Place (XIP).  Unlike the NAND Flash interface, where entire memory pages need to be shifted into memory to be operated upon, NOR flash is directly addressable.  And this direct addressability allows instructions to be read and executed directly from the memory.  There is no need to shift pages out one at a time. Byte addressability and support for XIP makes NOR ideal for boot loaders, ROMs, and the control program store for consumer devices. For example, the iPod Nano uses Silicon Storage Technology 39WF800A 8-Megabyte NOR boot flash (eeTimes).

 

Since NAND flash is not byte addressable providing only a block mode interface, it is typically attached to PCs and servers as an I/O device.  The NOR support for direct byte addressability makes it a candidate for attachment as a memory rather than as a block mode I/O device and when I first read the press release I thought this was what Spansion has done in partnership with Virident Systems. It’s clear they have NOR Flash memory in a memory (DIMM) package and they refer to it throughout the press release as “memory extension”. However, upon closer inspection, it appears to require that the NOR memory DIMM packages all be installed in a separate gateway server they refer as a “Green Gateway”.  It looks like the design has all the NOR flash in this separate server and a device driver on the host to virtualize memory on the NOR flash server. Essentially it still may be accessed via an I/O interface which is to say it’s not clear why you couldn’t do the same thing with NAND Flash.  And, it’s not immediately clear what protocol is used, what operating systems are supported, nor the exact performance but, overall, it still looks interesting.

 

Update: In conversations with Virident, it appears this part is potentially more interesting that I initially speculated. Rather than hosting the memory in an independent server as I speculated, it’s an in-server design but does require some BIOS engineering. From Virident: The current interconnect is the HTx bus for AMD servers.  Will be QPI for Intel.  We are doing AMD first.  You should be able to install on a standard two socket board.  DRAM sits behind the processor, and EcoRAM sits behind the controller.  Of course, the BIOS for the system must support the extended memory – we have HP systems up and running as a proof of concept, and Dell should work fairly soon. 

 

HTX and QPI open up big opportunities for hardware startups to innovate.  I know of many startups heading down this path. More innovation coming.

 

EcoRAM looks like it’s worth investigating in more detail.

 

                                --jrh

 

A (slightly) more detailed presentation is available at: http://www.spansion.com/about/news/events/Transforming_the_Internet_Data_Center.pdf.  Some interesting speeds and feeds from the press release and the presentation:

 

·         1/8th the power of DRAM at a given capacity,

·         Estimating that 8x power to storage capacity advantage over DRAM will grow to a full 16x by 2012

·         10x the reliability of DRAM,

·         smaller die area per bit,

·         much closer to DRAM access times (a bit vague on this one).

 

The Achilles heel of NOR Flash has been the poor write speed.  The press release claims 2x to 10x better than traditional NOR Memories but this is still considerably slower than DRAM.

 

We need a lot more technical data and repeatable performance measures but, with what has been published so far, it would appear that the sweet spot for this device are very high random IO rate, read-mostly workloads.  Potentially fairly interesting.

 

                                                --jrh

Thanks to Son VoBa of the Windows Virtualization team for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 02, 2008 5:14:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Sunday, June 29, 2008

Title: Needle in a Haystack: Efficient Storage of Billions of Photos

Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)

Slides: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv

 

An excellent talk that I really enjoyed.  I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned.  Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk.  Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.

 

The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs).  The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier.  Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.

 

My rough notes below with some of what I found most interesting.

 

Overall Facebook facts:

·         #6 site on the internet

·         500 total employees

o   200 in engineering

o   25 in Infrastructure Engineering

·         One of the largest MySQL installations in the world

·         Big user and contributor to Memcached

·         More than 10k servers in production

·         6,000 logical databases in production

 

Photo Storage and Management at Facebook:

·         Photo facts:

o   6.5B photos in total

§  4 to 5 sizes of each picture is materialized (30B files)

§  475k images/second

·         Mostly served via CDN (Akamai & Limelight)

·         200k profile photos/second

§  100m uploads/week

o   Stored on netapp filers

·         First level caching via CDN (Akamai & Limelight)

o   99.8% hit rate for profiles

o   92% hit rate for remainder

·         Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)

o   Based upon a modified version of evhttp using memcached as a “backing” store

o   Since cachr is independent from memcachd, cachr failure doesn’t lose state

o   1 TB of cache over 40 servers

o   Delivers microsecond response

o   Redundancy so no loss of cache contents on server failure

·         Photo Servers

o   Non-profile requests go directly against the photo-servers

o   Only profile requests that miss the cachr cache.

·         File Handle Cache (FHC)

o   Based upon lighttpd and uses memcached as backing store

o   Reduces metadata workload on NetApp servers

o   Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh

§  They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.

§  The inode numbers are stored in the FHC

§  This technique offloads the NetApp servers dramatically.

§  Note that files are write only.  Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.

·         Issues with this architecture:

o   Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).

§  The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)

§  Tracking last access time, last modified etc. has no value to Facebook.  They really only need a blob store but they are using a filesystem at additional expense

o   Heavy reliance on CDNs and caches such that netapp is basically almost pure backup

§  92% of non-profile and 99.8% of profile pictures are stored in CDN

§  Many of the rest are almost all stored in caching layers

·         Solution: Haystacks

o   Haystacks are a user level abstraction where lots of data is stored in a single file

o   Store an independent index vastly more efficient than the file store

o   1M of metadata/1G of data

§  Order of magnitude better on average than standard NetApp metadata

o   1 disk seek for all reads with any workload

o   Most likely store in XFS

o   Expect each haystack to be about 10G (with an index)

o   Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN.  The actual implementation is via NFS onto NetApp as photos were previously stored

o   Net of what’s happening:

§  Haystack always hits on the metadata

o   Plan to replace NetApp

§  Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)

§  Want more control of the cache behavior

o   Each Haystack Format:

§  Version number,

§  Magic number,

§  Length,

§  Data,

§  Checksum

o   Index format

§  Version,

§  Photo key,

§  Photo size,

§  Start,

§  Length.

o   Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage.  Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.

o   Q:Why not store the index in a RDBMS?  Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).

·         They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).

·         They are considering using to SSDs in the future.

·         Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.

·         The Facebook default is to use databases.  Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.

·         Almost all data is cached twice: once in memcached and then again in the DBs.

·         Random bits:

o   Canada: 1 out of 3 Canadians use Facebook.

o   Q:What is the strategy in China?  A:“not to do what Google did” :-)

o   Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clearly a big win for photos).

o   90% Indians access internet via a mobile device.  Facebook very focused on mobile and international.

 

Overall, an excellent talk by Jason.

 

--jrh

 

Sent my way by Hitesh Kanwathirtha  of the Windows Live Experience team, Mitch Wyle of Engineering Excellence, and Dave Quick and Alex Mallet both in Windows Live Cloud Storage group. It was originally Slashdotted at: http://developers.slashdot.org/article.pl?no_d2=1&sid=08/06/25/148203. The presentation is posted at: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, June 29, 2008 8:04:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Friday, June 27, 2008

Alex Mallet and Viraj Mody of the Windows Live Mesh team took great notes at the Structure ’08 (Put Cloud Computing to Work) conference (appended below).

 

Some pre-reading information was made available to all attendees as well: Refresh the Net: Why the Internet needs a Makeover?

 

Overall

-          Interesting mix of attendees from companies in all areas of “cloud computing”

-          The quality of the presentations and panels was somewhat uneven

-          Talks were not very technical

-          Amazon is the clear leader in mindshare; MS isn’t even on the board

-          Lots of speculation about how software-as-a-service, platform-as-a-service, everything-as-a-service is going to play out: who the users will be, how to make money, whether there will be cloud computing standards etc

 

5 min Nick Carr video [author of “The Big Switch”]

-          Drew symbolic link between BillG retiring and Structure ’08, the first “cloud computing” conference, being the same week, marking the shift of computing from the desktop to the datacenter

-          Generic pontificating on the coming “age of cloud computing”

 

“The Platform Revolution: a look into disruptive technologies”, Jonathan Yarmis, research analyst

-          Enterprises always lag behind consumers in adoption of new technology, and IT is powerless to stop users from adopting new technology

-          4 big tech trends: social networks, mobility, cloud computing, alternative business models [eg ad-supported]

-          Tech trends mutually reinforcing: mobility leads to more social networking applications, being able to access data/apps in the cloud leads to more mobility

-          Mobile is platform for next-gen and emerging markets: 1.4 billion devices per year, 20% device growth per year, average device lifetime 21 months; opens up market to new users and new uses

-          Claim: “single converged device will never exist”; cloud computing enables independence of device and location

-          Stream computing: rate of data creation is growing at 50-500% per year, and it’s becoming increasingly important to be able to quickly process the data, determine what’s interesting and discard the rest

-          “Economic value of peer relationships hasn’t been realized yet” – Facebook Beacon was a good idea, but poorly realized

 

“Virtualization and Cloud Computing”, with Mendel Rosenblum, VMWare founder 

-          Virtualization can/should be used to decouple software from hardware even in the datacenter

-          Virtualization is cloud-computing enabler: can decide whether to run your own machines, or use somebody else’s, without having to rewrite everything

-          Coming “age of multicore” makes virtualization even more important/useful

-          Smart software that figures out how to distribute VMs over physical hardware isn’t a commodity yet

-          VMWare is working on merging the various virtualization layers: machine, storage, networks [eg VLANs]

-          HW support for virtualization is mostly being done for server-class machines [?]

-          Rosenblum doesn’t think moving workloads from the datacenter to edge machines, to take advantage of spare cycles, will ever take off – it’s just too much of a pain to try to harness those spare cycles

-          Single-machine hypervisor is becoming commodity, so VMWare is moving to managing the whole datacenter, to stay ahead of the competition

 

Keynote, Werner Vogels, Amazon CTO:

-          Mostly a pitch for Amazon’s web services: EC2, S3, SQS, SimpleDB

-          Gave example of Animoto, company that merges music + photos to create a video, which has no servers whatsoever: had 25K users, launched a Facebook app, and went from 25K users total to adding 25K users/hour; were able to handle it by moving from 50 EC2 instances to 3000 EC2 instances in 2 days

-          Currently 370K registered AWS developers

-          Bandwidth consumed by AWS is bigger than bandwidth consumed by Amazon e-commerce services

-          Shift to service-oriented architecture occurred as result of being approached by Target in 2001/2002, asking whether Amazon could run their e-commerce for them. Realized that their current architecture wouldn’t scale/work, so they re-engineered it

-          Single Amazon page can depend on hundreds of other services

-          Big barrier between developing web app and operating it at scale: loadbalancing, hardware mgmt, routing, storage management etc. Called this the “undifferentiated heavy lifting” that needs to be done to even get in the game

-          Claim: typical company spends 70% effort/money on “undifferentiated heavy lifting” and 30% on differentiated value creation; AWS is intended to allow companies to focus much more on differentiated value creation

-          SmugMug has been at forefront of companies relying on AWS; currently store 600TB of photos in S3, and have launched an entirely new product, SmugVault, based purely on the existence of S3 => AWS not just replacement for existing infrastructure, but enabling new businesses

-          In 2 years, cloud computing will be evaluated along 5 axes: security, availability, scalability, performance, cost-effectiveness

-          Really plugged the pay-as-you-go model

 

“Working the Cloud: next-gen infrastructure for new entrepreneurs” panel

-          Q: is lock-in going to be a problem ie how easy will it be to move an app from one cloud computing platform to another ?

o   A: Strong desire for standards that will make it easy to port apps, but not there yet

o   A: To really use the cloud, you need to embed assumptions about it in your code; even bare-metal clouds require intelligence, like scripts to spin up new EC2 instances, so lock-in is a real concern

o   Side thread: Google person on panel claimed that using Google App Engine didn’t lock in developers, because the GAE APIs are well-documented, and he was promptly verbally mugged by just about everyone else on the panel, pointing out things like the use of BigTable in GAE making it difficult to extract data, or replace the underlying storage layer etc.

o   Prediction: there will be convergence to standards, and choice will come down to whether to use a generic cloud, or a more specialized/efficient cloud, eg one targeted at the medical information sector, with features for HiPPA compliance

-          Need new licensing models for cloud computing, to deal with the dynamic increase/decrease in number of application instances/virtual machines as load changes

-          Tidbit: Google has geographically distributed data centers, and geo-replicates [some] data

-          Q: will we be able to use our old “toys” [APIs, programming models etc] in the cloud ?

o   A: Yes, have to be able to, otherwise people won’t adopt it

o   A:  Yes, just have to be smart about replacing the plumbing underneath the various APIs

o   A: Yes, but current API frameworks are lacking some semantics that become important in cloud computing, like ways to specify how many times an object should be replicated, whether it’s ok to lazily replicate some data etc

 

Mini-note, “Optical networking”, Drew Perkins

-          Video is, by far, the largest consumer of bandwidth on the internet

-          Cost of content is disproportionate to size: 4MB song costs $1, 200MB TV show episode costs $2, 1.5GB movie costs $3-4.

-          Photonic integrated circuits that can be used to build 100GB/s are needed to meet future bandwidth requirements: less power,  need fewer network devices

 

“Race to the next database” panel

-          Quite poorly organized: panelists each got to give an [uninformative] infomercial for their company, and there was very little time for actual questions and discussion

-          Aster Data Systems is back-end data warehouse and analytics system for MySpace: 1 billion impressions/day, 1 TB of new data per day, new data is loaded into 100-node Aster cluster every hour and needs to be available for ad analytics engine to decide which ads to show

-          SQLStream is company that has built a data stream processing product that collapses the usual processing stages [data staging, cleaning, loading etc] into a pipeline that continuously produces results for “standing” queries; useful for real-time analytics

-          Web causes disruption to traditional DB model because of [10x] larger data volumes, need for high interactivity/turn-around, need to scale out instead of up. For example, GreenPlum is building a 20PB data warehouse for one customer.

-          Can’t rely on all the data being in a single store, so need to be able to do federated/distributed queries

 

“MS Datacenters”

-          Presentation centered on MS plans for datacenters-in-a-box

-          Datacenters-in-a-container are long-term strategy for MS, not just transient response to high demand and lack of space

-          Container blocks have lower reliability than traditional datacenters, so applications need to be geo-distributed and redundant to handle downtime

 

“End of boxed software”, Parker Harris, co-founder of Salesforce.com

-          Origins of salesforce.com: Modeled on consumer internet sites - amazon.com; ebay.com

-          Transition from client->server site to a platform (force.com): first instinct is to build a platform, but then you lose touch with why you're building it. As they started building their experience, they abstracted away components and started realizing it could become a platform. Revenue comes from site, platform is a bonus.

-          Initially scaled by buying bigger [Sun] boxes ie scaled up, not out, and ran into lots of complexity. Unclear whether that’s still the case or whether they’ve re-architected.

 

“Scaling to satiate demand” panel

-          Q: “When did you first realize your architecture was broken, and couldn’t scale ?”

o   A: When site started to get slow; Ebay: after massive site outages

-          Q: “How do you handle running code supplied by other people on your servers ?”

o   A: Compartmentalize ie isolate apps; have mgmt infrastructure and tooling to be able to monitor and control uploaded apps; provide developers with fixed APIs and tools so you can control what they do

-          Q: “How do Facebook and Slide [builds Facebook apps] figure out where the problems are if Slide starts failing ?”

o   A: Lots of real-time metrics; ops folks from both companies are in IM contact and do co-operative troubleshooting

-          Q: “How should you handle PR around outages ?”

o   A: Be transparent; communicate; set realistic timelines for when site will be back up; set expectations wrt “bakedness” of features

-          Beware of retrying failed operations too soon, since retries may cause an overloaded system to never be able to come up

-          Ebay: each app is instrumented with the same set of logging infrastructure and there’s a real-time OLAP engine that analyzes the logs and does correlation to try to find troubled spots

-          Facebook and Meebo both utilized their user base to translate their sites into multiple languages

-          Need to know which bits of the system you can turn off if you run into trouble

-          Biggest challenge is scaling features that involve many-many links between users; it’s easy to scale a single user + data

-          Keep monitoring: there are always scale issues to find, even without problems/outages

-          Slide: “Firefox 3 broke all of our apps”

-          Facebook has > 10K servers

 

Mini-note, “Creating fair bandwidth usage on the Internet”, Dr. Lawrence Roberts, leader of the original ARPANET team

-          P2P leads to unfair usage: people not using P2P get less; 5% of users (P2P users) receive 80% capacity

-          Deep packet inspection catches 75% of p2p traffic, but isn’t effective in creating fairness

-          Anagran has flow behavior mgmt: observe per user flow behavior & utilization and then equalize. Equalization is done in memory, on networking infrastructure [routers etc] and at the user level instead of the flow level

 

Mini-note, “Cloud computing and the mid-market”, Zach Nelson, CEO of NetSuite

-          Mid-market is the last great business applications opportunity

-          Cloud computing makes it economical to reach the “Fortune 5 million”

-          Cloud computing still doesn’t solve problem of application integration

-          Consulting services industry is next to be transformed by cloud computing

 

Mini-note, “Electricity use in datacenters”, Dr.Jonathan Koomey

-          Site Uptime Network, an organization of data center operators and designers did study of 19 datacenters from 1999-2006:

o   Floor area remained relatively constant

o   Power density went from 23 W/sq ft to 35 W/sq ft

-          In 2000, datacenters used 0.5% of world’s electricity; in 2005, used 1%.

-          Cooling and power distribution are largest electricity consumers; servers are second-largest; storage and networking equipment accounts for a small fraction

-          Asia-Pacific region’s use of power is increasing the fastest, over 25% growth per year

-          Lots of inefficiencies in facility design: wrong cost metrics [sq feet versus kW], different budgets and costs borne by different orgs [facilities vrs IT], multiple safety factors piled on top of each other

-          Designed Eco-Rack, which, with only a few months of work, reduces power consumption on normalized workload by 16-18%

-          Forecast: datacenter electricity consumption will grow by 76% by 2010, maybe a bit less with virtualization

 

“VC investment in cloud computing infrastructure” panel

-          Overall thesis of panel was that VCs are not investing in infrastructure

-          VCs disagreed with panel theme, and said that it depended on the definition of infrastructure; said they are investing in infrastructure, but it’s moving higher in the stack, like Heroku [?]

-          HW infrastructure requires serious investment, large teams, and long time-frame – not a good fit with VC investment model

-          Any companies that want to build datacenters or commodity storage and compute services are not  a good investment – there are established, large competitors, and it’s very expensive to compete in that space

-          Infrastructure needed for really large scale [like a 400 Gbit/sec switch] has a pretty small market, which makes it hard to justify the investment. If there’s a small market, the buyers all know they’re the only buyers and exert large downward pressure on price, which makes it hard for company to stay in business

-          Quote: “any company that’s doing something worthwhile, and building something defensible, will take at least 24 months to develop”

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, June 27, 2008 5:45:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Wednesday, June 25, 2008

John Breslin did an excellent job of writing up Kai-Fu Lee’s Keynote at WWW2008.  John’s post: Dr. Kai-Fu Lee (Google) – “Cloud Computing”. 

 

There are 235m internet users in China and Kai-Fu believes they want:

1.       Accessibility

2.       Support for sharing

3.       Access data from wherever they are

4.       Simplicity

5.       Security

 

He argues that Cloud Computing is the best answer for these requirements.  He defined the key components of what he is referring to as the cloud to be: 1) data stored centrally without need for the user to understand where it actually is, 2) software and services also delivered from the central location and delivered via browser, 3) built on open standards and protocols (Linux, AJAX, LAMP, etc.) to avoid control by one company, and 4) accessible from any device especially cell phones.  I don’t follow how the use of Linux in the cloud will improve or in any way change the degree of openness and the ease with which a user could move to a different provider.  The technology base used in the cloud is mostly irrelevant. I agree that open and standard protocols are both helpful and a good thing.

 

Kai-Fu then argues that what he has defined as the cloud has been technically possible for decades but three main factors make it practical today:

1.       Falling cost of storage

2.       Ubiquitous broadband

3.       Good development tools available cost effectively to all

 

He enumerated six properties that make this area exciting: 1) user centric, 2) task centric, 3) powerful, 4) accessible, 5) intelligent, and 6) programmable.  He went through each in detail (see Dan’s posting).  In my read I just spent time on the data provided on GFS and Bigtable scaling, hardware selection, and failure rates that were sprinkled throughout the remainder of the talk:

·         Scale-out: he argues that when comparing a $42,000 high-end servers to the same amount spent on $2,500 servers, the commodity scale-out solution is 33x more efficient.  That seems like a reasonable number but I would be amazed if Google spent anywhere near $2,500 for a server.  I’m betting on $700 to perhaps as low as $500. See Jeff Dean on Google Hardware Infrastructure for a picture of what Jeff Dean reported to be the  current Google internally designed server design.

·         Failure management. Kai-Fu stated that a farm of 20,000 servers will have 110 failures per day.  This is a super interesting data point from Google in that failure rates are almost never published by major players. However, 110 per day on a population of 20k servers is ½% a day which seems impossibly high.  That says, on average, the entire farm is turned over in 181 days.  No servers are anywhere close to that unreliable so this failure data must be of all types of failures whether software or hardware. When including all types of issues, the ½% number is perfectly credible.  Assuming there current server population is roughly one million, they are managing 5,500 failures per day requiring some form of intervention.  It’s pretty clear why auto-management systems are needed at anything even hinting at this scale. It would be super interesting to understand how many of these are recoverable software errors, recoverable hardware errors (memory faults etc.), and unrecoverable hardware errors requiring service or replacement. 

·         He reports there are “around 200 Google File System (GFS) clusters in operation. Some have over 5 PB of disk space over 200 machines.”   That ratio is about 10TB per machine.  Assuming they are buying 750GB disks that just over 13 disks.  I’ve argued in the past that a good service design point is to build everything on two hardware SKUs: 1) data light, and 2) data heavy.  Web servers and mid-tier boxes run the former and data stores run the later.  One design I like uses the same server board for both SKUs with 12 SATA disks in SAS attached disk modules.  Data light is just the server board.  Data heavy is the server board coupled with 1 or optionally more disk modules to get 12 , 24, or even 36 disks for each server. Cheap cold storage needs high disk to server ratios.

·         The largest Big table cells are 700TBs over 2,000 servers.”  I’m surprised to see two thousand reported by Kai-Fu as the largest BigTable cell – in the past I’ve seen references to over 5k. Let’s look at the storage to server ratio since he offered both.  700TB storage spread over 2k servers is only 350 GB per node. Given that they are using SATA disks, that would be only a single disk and a fairly small one at that.  That seems VERY light on storage. BigTable is a semi-structured storage layer over GFS.  I can’t imagine a GFS cluster with only 1 disk/server so I suspect the 2,000 node BigTable cluster that Kai-Fu described didn’t include the GFS cluster that it’s running over.  That helps but the number still are somewhat hard to make work.  These data don’t line up well with what’s been published in the past nor do they appear to be the most economic configurations.

 

Thanks to Zac Leow to sending this pointer my way.

 

                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, June 25, 2008 8:06:53 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Tuesday, June 24, 2008

Earlier today Nokia announced it will acquire the remaining 52% share of the Symbian Limited to take over controlling interest of the mobile operating system provider with 91% of the outstanding shares.  This alone is interesting but what is fascinating is they also announced their intention to open source Symbian to create “the most attractive platform for mobile innovation and drive the development of new and compelling web-enabled applications”.  The press release reports the acquisition will be completed at 3.647 EUR/share at a total cost of 264m EUR. The new open source project responsible for the Symbian operating systems will be managed by the newly set up Symbian Foundation with support announced by Nokia, AT&T, Broadcom, Digia, NTT docomo, EA Mobile, Freescale, Fujitsu, LG, Motorola, Orange, Plusmo, Samsung, Sony Ericcson, ST, Symbian, Teleca, Texas Instruments, T Mobile, Vodaphone, and Wipro.

 

Other articles on the acquisition:

·         http://www.techcraver.com/2008/06/23/huge-news-nokia-acquires-symbian/

·         http://www.readwriteweb.com/archives/nokia_acquires_symbian.php

 

This substantially changes the mobile operating system world with all major offerings other than Windows Mobile, iPhone, and RIM now available in open source form.  The timing of this acquisition strongly suggests that it’s a response to a perceived threat from Google Android ensuring that, even if Android never gets substantial market traction, it’s already had a lasting impact on the market.

 

                                                                --jrh

 

Sent my way by Eric Schoonover

Update: Added iPhone to prooprietary mobile O/S list (thanks Dare Obasanjo).

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Tuesday, June 24, 2008 4:01:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, June 18, 2008

Lars Bak leads the Google Aarhus Denmark lab. He’s one of the original developers of Sun HotSpot Java VM. the Self Programming Language, and the sun Connected Limited Device Configuration VM for mobile phone.  He’s schedule to do a talk at JAOO Aarhaus, Denmark (Sept. 30, 2008).  Unconfirmed rumors report he will be announcing “Google Secret Project” during his JAOO keynote.

 

It’s hard to know for sure what is coming but the popular speculation is that Google will be announcing a dynamic language runtime with support for Python, JavaScript, and Java. A language runtime running on both server-side and client-side with support for a broad range of client devices including mobile phones would be pretty interesting.

 

                                --jrh

 

John Lam pointed me to: https://secure.trifork.com/speaker/Lars+Bak.

 

James Hamilton, Data Center Futures Team
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Wednesday, June 18, 2008 6:54:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, June 14, 2008

Earlier today Google hosted the second Seattle Conference on Scalability. The talk on Chapel was a good description of a parallel language for high performance computing being implemented done at Cray.  The GIGA+ talk described a highly scalable filesystem metadata system implemented in Garth Gibson’s lab at CMU. The Google presentation described how they implemented maps.google.com on various mobile devices. It was full of gems on managing device proliferation and scaling the user interface down to small screen sizes. 

 

The Wikipedia talk showed an interesting transactional DHT implementation using Erlang.  And, the last talk of the day was a well presented talk by Vijay Menon on transactional memory. My rough notes from the 1 day conference follow.

 

                                    --jrh

 

 

Kick-off by Brian Bershad, Director of Google Seattle Lab

 

Communicating like Nemo: Scalability from a fish-eye View

·         Speaker: Jennifer Wong (Masters Student from University of Victoria)

o   Research area: faul tolerance in Mobile collaborative systems

·         Research aimed to bring easy communications at cost and beyond two-people for SCUBA divers.

·         Fairly large market with requirement to communicate

o   Note that PADI has 130k members (there are other many other recreational diving groups and, of course, there are commercial groups as well)

·         Proposal: use acoustic for underwater to surface stations. Wireless between surface stations doing relay.

·         Acoustic unit to be integrated into dive computer.

 

Maidsafe: Exploring Infinity

·         Speaker: David Irvine, Mindsafe.net Ltd.

o   http://www.maidsafe.net/ 

·         Problems with existing client systems

o   Data gets lost

o   Encryption is difficult

o   Remot4e access diff

o   Syncing between devices hard

·         Proposal: chunk files, hash the chunks, xor and then encrypt the chunks and distribute over a DHT over many systems

o   It wasn’t sufficiently clear where the “random data” that was xored in came from or where it was stored.

o   It wasn’t sufficiently clear where the encryption key that was used in the AES encryption came from or was stored. 

·         Minimum server count for a reliable cloud is 2,500.

·         It is a quid pro quo network. You need to store data for others to be able store data yourself.

·         Lots of technology in play. Uses SHA, XORing, AES, DHTs PKI (without central authority) and something the speaker referred to as self encryption. The talk was given in whiteboard format which didn’t get to the point where I could fully understand enough of the details.  There were several question from the audience on the security of the system.  Generally an interesting system that distributes data over a large number of non-cooperating client systems.

·         Seems similar in design goals to Oceanstore and Farsite.  Available for download at http://www.maidsafe.net/. 

UPDATE: The speaker, David Irvine, sent me a paper providing more detail on Maidsafe: 0316_maidsafe_Rev_004(Maidsafe_Scalability_Response)-Letter1 (2).pdf (72.33 KB).

 

Chapel: Productive Parallel Programming at Scale

·         Speaker: Bradford Chamberlain, Cray Inc.

·         Three main limits to HPC scalability:

o   Single Program, Multiple Data (SPMD) programming model (vector machines)

o   Exposure of low-level implementation details

o   Lack of programmability

·         Chapel Themes:

o   General parallel programming

o   Reduce gap between mainstream and parallel languages

o   Global-view abstractions vs fragmented

§  Fragmented model is thinking through the partitioned or fragmented solution over many processors (more complex than global-view)

§  MPI programmers program a fragmented view

o   Control of locality

o   Multi-resolution design

·         Chapel is work done at Cray further developing the ZPL work that Brad did at the University of Washington.

·         Approaches to HPC parallelism:

o   Communication Libraries

§  MPI, MPI-2, SHMEM, ARMCI, GASNet

o   S/hared Memory Models:

§  OpenMP, Pthrads

o   PGAS Languages (Partitioned Global Address Space)

·         Chapel is a block-structured, imperative programming language.  Selects the best features from ZPL, HPF, LCU, Java, C#, C++, C, Modula, Ada,…

·         Overall design is to allow efficient code to be written at a high level but still allow low level control of data and computation placement

 

CARMEN: A Scalable Science Cloud

·         Speaker: Paul Watson, CSC Dept., Newcastle University, UK

·         Essentially a generalized, service-based e-Science platform for bio-informatics

o   Vertical services implemented over general cloud-storage and computation

o   Currently neuroscience focused but also getting used by chemistry and a few other domains

·         Services include workflow, data, security, metatdata, and service (code) repository

·         e-Science Requirements Summary:

o   share both data and codes

o   Capacity: 100’s TB

o   Cloud architecture (facilitates sharing and economies of scale)

·         Many code deployment techniques:

o   .war files, .net, JAR, VMWare virtual machines, etc.

·         Summary: system supports upload data, metadata describing providence of the data (data type specific), code in many forms, and security policy.  Work requests are received and appropriate code is shipped from the code repository to the data (if not already there) and the job is scheduled and run returning results.  Execution graph is described as a workflow.  Workflows are constructed using graphical UI in a browser. Currently running on a proprietary resource cloud but using a commercial cloud such as AWS is becoming interesting to them. (pay as you go model is very interesting).

o   To use AWS or similar service, they need to 1) have the data there, and 2) someone has to maintain the vertical service.

o   Considering http://www.inkspotscience.com/ , a company focused upon “on-demnd e-science”. 

 

GIGA+: Scalable Directories for Shared File Systems

·         Speaker: Swapnil Patil, CSC Dept., Carnegie Mellon University

·         Work done with Garth Gibson

·         Large FS don’t scale the metadata service will.

·         The goals of this work is to scale millions and trillions of objects per directory

·         Supports UNIX VFS interface

·         As clusters scale, we need to scale filesystems and as filesystems grow, we need to scale them to millions of objects per directory

·         Unsynchronized, parallel growth without central coordination

·         Current state of the art:

·         Hash-tables: Linux EX2/Ext3

·         B-trees: XFS

·         Hash table issues are they don’t incrementally grow. 

·         Can use extensible hashing (Fagin 79) to solve the problem as

·         Lustre (Sun) uses a single central metadata server (limits scalability)

·         PVFS2 distribute the metadata over many servers using directory partitioning.  Helps distribute the load but a very large or very active directory is still on one metadata server.

·         IBM GPFS implements distributed locking (scalability problems) and shared storage as the communications mechanism (if node 1 has page X needed by node 2, it has to write it to disk and node 2 has to read it (ping ponging).

·         GPFS does well with lookup intensive workloads but doesn’t do well with high metadata update scenarios

·         What’s new in GIGA+:

·         Eliminate serialization

·         No synchronization or consistency bottlenecks

·         Weak consistency

·         Clients work with potentially stale data but it mostly works. On failure, the updated metadata is returned to the client.  Good optimistic approach.

·         When a hash bucket is split, the client will briefly be operating on stale data but, on failure, the correct location is returned updating the client cache.  Could use more aggressive techniques to update caches (e.g. gossip) but this is a future optimization.

·         Partition presence or absence is indicated by a bit-map on each server.

·         Very compact

·         Unlike extensible hashing, only those partitions that need to be split are split (supports non-symetric growth).

·         Appears to have similarities to consistent hashing.

 

Scaling Google Maps from the Big Screen Down to Mobile Phones

·         Speaker: Jerry Morrison, Google Inc.

·         Maps on the go.

·         Three main topics covered:

·         Scaling the UX

·         Coping with the mobile network

·         Note that in many countries the majority of network access is through mobile devices rather than PCs

·         Key problem of supporting mobile devices well is mostly about scaling the user experience (240x320 with very little keyboard support)

·         Basic approach is to replace the AJAX application with a device optimized client/server application.

·         Coping with the mobile network:

·         Low bandwidth and high latency (Peter Denning notes that bandwidth increases at the square of the latency improvement)

·         4.25 seconds average latency for an HTTP request

·         The initial connection takes additional seconds

·         Note: RIM handles all world-wide traffic through an encrypted VPN back to Waterloo, CA.

·         Google traffic requests

·         Load balancer

·         GMM which pulls concurrently from:

1.       Map search, Local search, & Directions

2.       Maps & sat tiles

3.       Highway traffic

4.       Location

·         Tiles are PNG at 64x64 pixels.  Larger tiles compress more but sends more than needed. Small can respond quicker but you need more tiles to complete the screen.

·         They believe that 22x22 is optimal for 130x180 screen (they use 64x64 looking to the future)

·         Tiles need mobile adaption:

·         More repetition of street names to get at least one on a screen

·         Want more compression so use less colors and some other image simplifications

·         JPEG has 600 bytes of header

·         Showed a picture of the Google Mobile Testing Lab

·         100s of phones

·         Grows 10% a quarter

·         Three code bases:

·         Java, C++, ObjectiveC (iPhone)

·         Language translation: fetch language translation dynamically as needed (can’t download 20+ on each request)

·         Interesting challenges from local juristictions:

·         Can’t export lat & long from China

·         Different representations of disputed borders

 

Scalable Wikipedia with Erlang

·         Speaker: Thorsten Schuett, Zuse Institute Berlin

·         Wikipedia #7 website (actually #6 in that MSN and Windows Live was shown as independent)

·         50,000 requests per second

·         Standard scaling story:

·         Started with a single server

·         Then split DB from mid-tier

·         Then partitioned DB

·         Then partitioned DB again into clusters (with directory to find which one)

·         Then put memcache in front of it

·         Their approach: use P2P overlay in the data center

·         Start with chord# (DHT) which supports insert, delete, and update

·         Add transactions, load balancing, ….

·         Transactions:

·         Implemented by electing a subset as transaction managers elected with Paxos protection

·         Quorum of >N/2 nodes

·         Supports Load Balancing

·         And supports multi-data center policy based geo-distribution for locality (low latency) and robustness (remote copies).

·         They have implemented distributed Erlang

·         Security problems

·         Scalability problems

·         Ended up having to implement their own transport layer

·         Summary:

·         DHT + Transactions = scalable, reliable, efficient key/value store

·         Overall system is message based and fail-fast and that made Erlang a good choice.

 

Scalable Multiprocessor Programming via Transactional Programming

·         Speaker: Vijay Menon, Google Inc.

·         Scalability no longer just about large scale distributed systems.  Modern processors are multi-core and trending towards many-core

·         Scalability is no longer restricted to a server problem. Clients and even some mobile devices today.

·         Conventional programming model is threads and locks.

·         Transactional memory: replaced locked regions with transactions

·         Optimistic concurrency control

·         Declarative safety in the language

·         Transactional Memory examples:

·         Azul: Large scale Java server with more than 500 cores

·         Transactional memory used in implementation but not exposed

·         Sun Rock (expected 2009)

·         Conflict resolution:

·         Eager: Assume potential conflict and prevent

·         Lazy: assume no conflict (record state and validate)

·         Most hardware TMs are lazy.  S/W implementations vary.

·         Example techniques:

·         Write in place in memory with old values written to log to support rollback)

·         Buffer changes and only persist on success

·         Showed some bugs in transaction management and showed how programming language support for transactions could help.:

·         Implementation challenges

·         Large transactions expensive

·         Implementation overhead

·         Semantic challenges including Allowing I/O and other operations that can’t be rolled back

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, June 14, 2008 4:15:52 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Software
 Wednesday, June 11, 2008

Jeff Dean did a great talk at Google IO this year. Some key points from Steve Garrity (msft pm) and some note from the excellent write-up at Google spotlights data center inner workings:

·         many unreliable servers to fewer high cost servers

·         Single search query touches 700 to up to 1k machines in < 0.25sec

·         36 data centers containing > 800K servers

o   40 servers/rack

·         Typical H/W failures: Install 1000 machines and in 1 year you’ll see: 1000+ HD failures, 20 mini switch failures, 5 full switch failures, 1 PDU failure

·         There are more than 200 Google File System clusters

·         The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines

·          MapReduce is increasing used within Google.

o   29,000 jobs in August 2004 and 2.2 million in September 2007

o   Average time to complete a job has dropped from 634 seconds to 395 seconds

o   Output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes

·         Typical day will run about 100,000 MapReduce jobs

o   each occupies about 400 servers

o   takes about 5 to 10 minutes to finish

 

More detail on the typical failures during the first year of a cluster from Jeff:

·         ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

·         ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

·         ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

·         ~1 network rewiring (rolling ~5% of machines down over 2-day span)

·         ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

·         ~5 racks go wonky (40-80 machines see 50% packetloss)

·         ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

·         ~12 router reloads (takes out DNS and external vips for a couple minutes)

·         ~3 router failures (have to immediately pull traffic for an hour)

·         ~dozens of minor 30-second blips for dns

·         ~1000 individual machine failures

·         ~thousands of hard drive failures

 

A pictorial history of Google hardware through the years starting with the current generation server hardware and working backwards from Jeff’s talk at the 2007 Seattle Scalability Conference:

Current Generation Google Servers

 

Google Servers 2001

 

Google Servers 2000

 

Google Servers 1999

 

Google Servers 1997

My general rule on hardware is that, if you have a viewing window into the data center, you are probably spending too much on servers. The Google model of cheap servers with software redundancy is the only economic solution at scale.

 

Other notes from Google IO:

·         http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

·         http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx

·         http://perspectives.mvdirona.com/2008/05/30/IO2008RoughNotesFromSelectedSessionsAtGoogleIODay2.aspx

 

All pictures above courtesy of Jeff Dean.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Wednesday, June 11, 2008 4:57:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Sunday, June 08, 2008

I’m interested in high-scale web sites, their architecture and their scaling problems.  Last Thursday, Oren Hurvitz posted a great blog entry summarizing two presentations at Java One on the LinkedIn service architecture.

 

LinkedIn scale is respectable:

·         22M members

·         130M connections

·         2M email messages per day

·         250k invitations per day

 

No big surprises other than LinkedIn are still using a big, expensive commercial RDBMS and a large central Sun SPARC Server. LinkedIn call this central server “The Cloud” and its described as the “backend server caching the entire LinkedIn Network”.  Note that “server” is not plural – it appears that the cloud is not partitioned  but it is replicated 40 times. However, each instance of the cloud hosts the entire social graph in 12GB of memory. 

 

Social graphs are one of my favorite services problems in that they are notoriously difficult to effectively partition. I often refer to this as the “hairball problem”.  There is no clean data partition that will support the workload with adequate performance. Typical approaches to this problem redundantly store a several copies of the data with different lookup keys and potentially different partitions. For example, most social networks have some notion of a user group.  The membership list is stored with each group.  

 

But, a common request is to find all groups for which a specific user is a member. Storing users with the group allows efficient group enumeration but doesn’t support the “find all groups of a given user” query.  Storing the group membership with each user supports this query well but doesn’t allow efficient group membership enumeration. The most common solutions are to store the data redundantly both ways. Typically one is the primary copy and the other is updated asynchronously after the primary is updated. This effectively makes read much more efficient at the cost of more work at update time.  Sometimes, the secondary copy isn’t persisted in the backend database and is only stored in an in memory cache often implemented using memcached.

 

The LinkedIn approach of using a central in-memory, social graph avoids some of the redundancy of the partitioned model I described above at the expensive of requiring a single-server memory large enough to store the entire un-partitioned social graph.  Clearly this is more efficient by many measures but requiring that the entire social graph fit into a single servers memory means that more expensive servers are required as the service grows. And, many of us can’t sleep at night with hard scaling limits even if they are apparently fairly large.

 

Other scaling web sites pointers are posted at: http://perspectives.mvdirona.com/2007/11/12/ScalingWebSites.aspx.

 

Thanks to Dare Obasanjo and Kevin Merritt for sending my way.

 

                                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Sunday, June 08, 2008 11:14:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Friday, June 06, 2008

There was an interesting talk earlier today at Microsoft Research by Jason Cong of the UCLA Computer Science Department on compiling design specifications in C/C++/SystemC and user constraints into ASIC and FPGA design. The advantage of compiler based approaches include, more productivity working at a higher level, automating verification, allows optimization, and allows rapid experimentation at different frequencies and different optimization goals (performance vs power, for example). As design complexity increases higher level language and optimization have excellent potential.  Essentially the same thing is happening in hardware design as happened 30 years ago in operating systems implementation languages. High level implementation languages replace lower level as complexity climbs.  For example, the Intel Core 2 Duo processor is a 1B transistor implementation whereas the 386 was 1/1000th of that complexity at 1M.

 

Also super interesting was the example from a financial institution that is taking a software based stock analysis system where they take the hottest parts of the system and compile these to FPGA implementations. 30x faster at 1/10th the power. Very cool.  

 

Now that AMD supports hyper transport it is possible to implement custom processors with excellent overall system performance.  Intel has opened up the FSB and is also expected to offer a non-compatible hyper transport-like implementation in the future.

 

My rough notes from the talk follow:

 

·         Speaker: Jason Cong, UCLA Computer Science (cong@cs.ucla.edu)

·         Working on on-chip interconnects & communications

o   3D IC design

o   RF-interconnects

§  Note that power restrictions restrict processors to ~5GH

·         But, communications lines can scale to 100s of GH

·         Dividing communications link into 10 or more “channels” that operate at different frequencies

·         This talk focused on  ESL SystemC to FPGA compiler

·         Why?

o   700,000 lines of RTL for a 10M gate design is too much

o   Allows executable specification

o   Verification requires executable design

o   Accelerated computing or reconfigurable computing also need C/C++ based compilation/synthesis to FPGAs

§  CPUs coupled with FPGA to support common functions at high performance and lower power

·         Note that performance limited by communications (getting data to the CPU)

o   Long wires that have to be traversed in a single clock are the limiting factor

o   This research focuses on supporting multi-cycle communications

·         xPilot: Behavior-to-RTL (Register Transfer Level design) synthesis flow

o   takes behavior spec in C/SystemC to front-end compiler to SSDM

o   SSDM is optimized using standard compiler optimization (loop unrolling, strength reduction, scheduling, etc.)

o   SSDM is compiled to:

§  Verilog/VHDL/SystemC

§  FPGAs: Altera, Xilinx

§  ASICs: Magma, Synopsys

o   UPS: Uniform Power Specification

o   During final compilation optimize for power and shut off compute units that are not being used and shut off those that are being during idle periods (a busy disk controller is frequently waiting for mechanicals and not need to execute instructions)

§  Can’t shut FPGAs but can with ASICs

§  The only solution to dynamic power leakage only solution is shut the component of

o   Allows faster experimentation than hand coding.  You can try different frequencies and different power optimizations (too complex for most humans)

o   Scheduling (allocation of operations to compute logic and specific clock cycles) is NP-complete and automated techniques can exceed quality of expert designs

·         Example:

o   Schedule the behavior to RTL using the following characterization, cycle time, constraints, and objectives:

§  Platform characterization: adder (2ns) & multiplier (5ns)

§  Target cycle time (10ns)

§  Resource constraint: only one multiplier is available

§  Objective: high performance or lower power as examples

·         Note as optimizing and reducing component counts, less space is required, which can allow faster clocking

·         Investigating compilation for Reconfigurable Accelerated Computing

o   Take GCC 3DES implementation and synthesis FPGS RTL description

o   Example took 3DES from C level implementation to a FPGA (Xilinx Virtex-5)

·         Investment bank is using this tool to compile financial optimizations from S/W implementation to FPGA accelerators (Black-Scholes S/w kernel)

o   30x speed-up over software implementation and 1/10 the power (6 vs 68W)

 

A related presentation: http://cadlab.cs.ucla.edu/~cong/slides/fpt05_xpilot_final.pdf.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Friday, June 06, 2008 10:52:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, June 03, 2008

 

Last week at Google IO, pricing was announced for Google Application Engine. Actually it was blogged the night before at: http://googleappengine.blogspot.com/2008/05/announcing-open-signups-expected.html.

 

The prices are close to identical with Amazon AWS although GAE differs substantially from the AWS offerings.  The former offers a easy to use Python  execution environment whereas Amazon offers the infinitely flexible run-this-virtual-machine model. Clearly the Amazon model costs more to provide so, by that measure, AWS pricing is somewhat better:

 

Google Application Engine Pricing:

·    $0.10 - $0.12 per CPU core-hour

·    $0.15 - $0.18 per GB-month of storage

·    $0.11 - $0.13 per GB outgoing bandwidth

·    $0.09 - $0.11 per GB incoming bandwidth

·    From: http://googleappengine.blogspot.com/2008/05/announcing-open-signups-expected.html

Compared with AWS Pricing:

·         $0.10 - $0.80 per VM hour (depending upon resources allocated)

·         $0.15 per GB-month of storage

·         $0.100 - $0.170 per GB outgoing bandwidth

·         $0.100 per GB incoming bandwidth

·         From: http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA and http://www.amazon.com/EC2-AWS-Service-Pricing/b/ref=sc_fe_l_2?ie=UTF8&node=201590011&no=3435361&me=A36L942TSJ2AJA.

There are some important differences that make the pricing comparison somewhat biased in a couple of ways. Two important differences: 1) as mentioned above, Amazon gives an entire virtual machine so EC2 is much more flexible than GAE both in that it can run arbitrary applications in arbitrary languages and that it supports all execution models whereas GAE only supports HTTP request/response.  Another key difference is the storage subsystem.  In the numbers above, we’re comparing the Amazon blob store (S3) with the more structured storage model offered by GAE.  The more comparable AWS SimpleDB pricing is considerably higher than the GAE storage pricing. SimpleDB charges $1.50 GB/month in addition to machine usage and network transmission costs.  GAE is offering much more affordable semi-structured storage and the GAE storage model actually supports data types rather than having to force everything to character format.

 

GAE is still free to start with under 5M page views/month and up to ½ GB storage for free.  Obviously this helps developers get started without strings and that’s a good thing. But, more importantly, it avoids Google from having to go to the expense of billing very small values.  In a weird sort of way, I’m more impressed with AWS billing $0.04 on some accounts in that it shows there billing system is incredibly lean. Scaling down billing is hard, hard, hard.

 

In addition to announcing prices, GAE went from a controlled admission beta to a fully open beta where all comers are welcome.  I’m impressed how quickly they have gone from making the service initially available to a full open beta.  Impressive.

 

Also announced last week was a new GAE Memcached API which appears to have been a 20% project of Brad Fitzpatrick (Sriram Krishnan sent my way). And a set of image manipulation APIs supporting image scaling, rotating, etc. will now be part of the GAE API set.

 

My notes from Google IO:

·         http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

·         http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx

·         http://perspectives.mvdirona.com/2008/05/30/IO2008RoughNotesFromSelectedSessionsAtGoogleIODay2.aspx

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Tuesday, June 03, 2008 8:05:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, June 01, 2008

Yesterday the Tribute to Honor Jim Gray was held at the University of California at Berkeley. We all miss Jim deeply so it really is a tough topic.  But it was great to get together with literally 100s of Jim’s friends and share stories and talk about some of his accomplishments, his contributions to the field, and his contributions to each of us.  Jim is amazing across all three dimensions but what is most remarkable is the profound way he helped others achieve more throughout the industry.  We’re all better engineers, researchers, and human beings for having been lucky enough to have known and worked with Jim.

 

Also announced yesterday was the creation of the Jim Gray Chair at Cal Berkeley.  Bill Gates Eric Schmidt, Marc Benioff, and Mike Stonebraker each donated $250,000 which were matched by a $1,000,000 from the Hewlett Foundation.

 

Seattle PI coverage: Gathering in Berkeley, Calif., today to honor legendary scientist, Microsoft researcher Jim Gray.

 

The morning, general session agenda:

             Welcome - Shankar Sastry

             Opening Remarks - Joseph Hellerstein

             A Tribute, Not a Memorial: Understanding Ambiguous Loss - Pauline Boss

             The Amateur Search - Michael Olson

             Jim Gray at Berkeley - Michael Harrison

             Knowledge and Wisdom - Pat Helland

             Why Did Jim Gray Win the Turing Award? - Michael Stonebraker

             Jim Gray Chair - Stuart Russell

             500 Special Relationships: Jim as a Mentor to Faculty and Students - Ed Lazowska

             Jim Gray: His Contributions to Industry - David Vaskevitch

             A "Gap Bridger" - Richard Rashid

             Thanks to the U.S. Coast Guard - Paula Hawthorn

 

The afternoon, technical session agenda:

·         Welcome - Shankar Sastry

·         Opening Remarks - Joseph Hellerstein

·         A Tribute, Not a Memorial: Understanding Ambiguous Loss - Pauline Boss

·         The Amateur Search - Michael Olson

·         Jim Gray at Berkeley - Michael Harrison

·         Knowledge and Wisdom - Pat Helland

·         Why Did Jim Gray Win the Turing Award? - Michael Stonebraker

·         Jim Gray Chair - Stuart Russell

·         500 Special Relationships: Jim as a Mentor to Faculty and Students - Ed Lazowska

·         Jim Gray: His Contributions to Industry - David Vaskevitch 

·         A "Gap Bridger" - Richard Rashid

·         Thanks to the U.S. Coast Guard - Paula Hawthorn

 

The event was video recorded and streamed via http://webcast.berkeley.edu/.

 

Update: the video will be at:  Tribute to Honor Jim Gray - General Session (thanks to George Spix for sending my way).

Second Update: A good article by John Markoff of the NY Times: A Tribute to Jim Gray: Sometimes Nice Guys Do Finish First.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Sunday, June 01, 2008 10:08:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Thursday, May 29, 2008

Google IO notes continued from earlier in the day: http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx and yesterday: http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

 

Google Web Toolkit and Client-Server Communications

·         Speaker: Miquel Mendez

·         GWT client/server communication options:

o   Frames

o   Form Panel

o   XHR: RequestBuilder (be careful don’t to start too many—many browsers have limits)

o   XML RPC

·         XML Encoding/Decoding: com.google.gwt.xml defines XML related classes

·         JSON Encoding/Decoding: com.google.gwt.json.JSON defines JSON related classes

·         GWT RPC: Generator that generates code and makes use of RequestBuilder

 

Reusing Google APIs with Google Web Toolkit

·         Speaker: Miquel Mendez

·         GALGWT: Google API Library for GWT.  It’s an open source project lead by Miquel (Javascript bindings to GWT).

o   It’s a collection of easy to use GWT bindings for existing Google JavaScript APIs

o   It’s a Google code open source project

·         Reminder: GWT is a java to Javascript compiler.

·         GWT now has a gadget class.  Google Gadget creation using GWT by extending the Gadget class and implementing the NeedsXXX intrerfaces.

·         Gears support:

o   Exposes database, LocalServer, and WorkerPool JS modules

o   Provides an offline module that automates the process fo going offline (creates the necessary manifests automatically)

·         Google Maps support as well

 

Engaging User Experiences with Google App Engine

·         Speakers: John Skidgel (designer) & & Lindsay Simon (developer).

·         Showed a guest book application written using Djanjo Form.  It’s been modified to run under App Engine (didn’t say how).

·         App engine development environment makes it easy to work with a designer as it’s easy to install and runs well on a Mac.

·         Walked through what they called 3D (Design, Develop, & Deploy) and how they handle it.

·         Authentication options:

·         Do your own

·         Use GAE (any authenticated user or you can  narrow the population to your domain only – all supported out of the box).

·         Don’t make auth a gating factor or you will lose users – auth at the last possible moment

·         Use the App Engine Datastore for sessions

·         Decreasing Latency:

·         Create build rules to concatenate and minify (Yahoo! Minifier) CSS and JS

·         File fingerprinting

·         Set expires headers for a very long time but add a version ID.  He showed how to handle the version number on the server side.  The recommendation was 10 year expiration with version numbers.

·         Recommends “Progressive Enhancement” or “Defensive Enhancement”.  You should still be able to render without JS. JS should give a better experience but you may not have JS (crappy mobile browsers for example).  Another test, shut off CSS and it should still work.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Thursday, May 29, 2008 5:55:26 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

Continued from Yesterday (day 1): Rough notes from Selected Sessions at Google IO Day 1.

 

Marissa Mayer Keynote: A Glimpse Under the Hood at Google

·         Showed iGoogle and talked about how Google Gadgets are a great way to get broad distribution and are a form of advertising.

·         Search is number 2 most used application (after email)

·         The ordinary and the everyday

·         Why is search page so simple?

·         Variation of Occam’s Razor: “the simple design is probably right”

·         Sergey did it and it was because “there was nobody else to do it and he doesn’t do HTML”

·         Described process of answering a query (700 to 1,000 machines in .16 seconds):

·         This time of day we’re busy so the query will likely go to one data center and likely get bounced to another (must be a simplification of what really happens – load ballancing)

·         Mixer

·         Google Web Server

·         Ads + Websearch (300 to 400 systems)

·         Back to mixer

·         Back to Web server

·         Back to load balancer

·         Split A/B Testing:

·         We given a subset of users a different user experience. Web services allow very detailed views and to iterate very quickly and evolve rapidly.

·         Example: amount of white space under Google logo on results page?

·         This test showed convincingly that less white space rather than more (produces more usage and more revenue)

·         Example: yellow or blue as background for paid adds

·         Yellow produced both more satisfaction and more revenue.

·         “If you don’t listen to your customers, someone else will” – Sam Walton

·         But you need to test rather than ask since they often don’t know.

·         Example: would you like 10, 20, or 30 results. Users unanimously wanted 30.

·         But 10 did way better in A/B testing (30 was 20% worse) due to lower latency of 10 results

·         30 is about twice the latency of 10 (I would have expected the other overheads to dominate.  Suggests there is another solution waiting to be found here).

·         Example: Maps was 120k for launch page.  We took 30 to 40k out.  Got a proportional increase in usage.

·         Example: Google Video uploads used to be 1 day to watch while YouTube offered “Watch it now”.  Much more compelling.

·         Urgent can drown out important

·         Users go from unskilled to skilled searchers very fast (under 1 month).  Consequently it’s better to optimize for expert since most are and novices get there fast due to fast feedback loop.

·         The lesson is to think longer term at all levels in design.

·         Think beyond the current development horizon.  10 years for major products and services.

·         Example: Universal search vs vertical search.  Users want verticals now but what they really want is universal search.  They just want to find the answer they are searching for.

·         Goog-411: don’t know if we can make money off this but it helps us develop voice recognition. Applications of voice recognition are monetizable so, even if Goog-411 doesn’t yield revenue, other applications will.

·         International content:

·         50% of the web is English but only aobut 1% of the web is Arabic

·         Conclusion: take an Arabic search, translate find relevant pages, then translate the result.  Opens up MUCH more content and dramatically improves the results for an Arabic user.

·         Larry Paige: ”A Healthy Disrespect for the Impossible” opens up many possibilities.

·         Showed examples of how search is not generally “solvable” but getting to 90 to 95% has HUGE benefit. Search is a hard and unconstrained problem.  Same with health records.

·         Recommendation: Be Scrappy & revel in constraints

·         Google operates in 140 countries and 110 languages.  Described the complexity of pulling out text strings from a web site, sending out to translation, dealing with multiple string versions, etc.

·         Betters solution: let the users help with the translated content.  If you don’t see your language, help us do it.  There are now ¼ million users helping with translation from all over the world.

·         Interesting little Easter egg:  one of the languages on the Google home page is “Bork! Bork! Bork!” – it’s the Swedish chef from the Muppets

·         Interesting little example: they took 11k Googler’s to Indiana Jones last week

·         Marissa went through a bunch of examples of taking on the impossible and brainstorming possible solutions and showing that some just exercised their thinking and others produced cool products/solutions.  Explained that 20% time is just another way of exercising the brain (“Imagination as a muscle”).  And Orkut, Google News, and during one period 50% of their new products, were from 20% time.

·         Random note: What you last searched for is the best context signal for the current search.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Thursday, May 29, 2008 8:59:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<July 2008>
SunMonTueWedThuFriSat
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789

Categories
This Blog
Member Login
All Content © 2013, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton