Tuesday, August 26, 2008

Facebooks F8 conference was held last month in San Francisco. During his mid-day keynote Mark Zuckerberg reported that the Facebook platform now has 400,000 developers and 90 million users of which 32% are from the United States.  The platforms US user population grew 2.4x last year while the international population grew at an astounding 5.1x.

Vladimir Fedorov (Windows Live Mesh) attended F8 and brought together this excellent sent of notes on the conference.

                                                --jrh

Summary:

I spent the day on Wednesday at Facebook (F8) conference and talked to some of the companies building facebook applications today. Overall I was pleasantly surprised by overall sense of direction/messaging and organization of the conference itself.

There were only 12 talks divided into 3 tracks - Technical/User Experience/Business, so I was able to attend a third of all talks.

The was focus throughout the day was on making it easier for applications that increase the value of the Facebook ecosystem and stopping abusing applications that detract value. The event itself was organized through a Facebook application. Here are the main changes in the Facebook application platform:

  1. Improve visibility of applications and allow users to observe functionality offered by an application without user taking an explicit install action
  2. Lower the barrier to using the application i.e. remove the necessity of a dialog granting rights to the application prior to any functionality being available
  3. Make the rights granting to application more granular i.e. remove the necessity of granting the application an extensive set of rights prior to using it. Grant specific permission at the time the application performs an operation.
  4. Allow external websites to act as applications on the Facebook platform by using Facebook as an identity provider, using social graph from Facebook and submitting data to Facebook news feed
  5. Allow the internalization method used for Facebook itself (translation by users) to be used by applications

The statistics given at keynote were 400k developers, 90 million users (32 % US / 68 % International) as compared to 24 million last year (50% US / 50% international), 200 million in venture capital given to facebook applications. Note that while the number of international users increased by 5.1x (by 49.2 million), the number of US users only increased by 2.4x (16.8 million).  

I went through the booths and talked to a number of Facebook application companies. I was primarily focusing on what they do and how they plan to make money. The business models are:

  1. Transaction fees - charging a small percentage per transaction for organizing events or coordinating travel
  2. Software as service - sell packages to organizations such as donation drive or car pooling applications
  3. Indirect advertising - large companies want to drive brand awareness through the social graph, but don't know how. There were different methods here - branded gifts i.e. Gunness beer, full featured brand campaigns, games which incorporate brand info in them, etc
  4. Direct advertising - trip planning, activity planning, wedding planning, reviews, etc

There were a number of companies that didn't have a real business model, but are still adding value to the ecosystem especially when combined with an offering from a different company.

The major features released are new application authorization model, new news feed (with new backend),  Facebook Connect, new look to the site and opening of internalization support used for Facebook itself to applications.

News Feed/News backend

They decided to do fan out on read in order to minimize storage costs and maximize the ability to tinker with the algorithm that decides which news events are shown to the user. The backend is made of two classes of machines – transient storage machines and aggregator machines. The users are assigned to buckets using a hashing algorithm and the buckets are assigned to transient storage machines using a DB table. For each user they store 30 days of events generated by the user in the transient storage machines. There are two replicas of the data in transient storage machines (replicas are on different racks). Each transient storage machine has 40GB of RAM and they use 40 machines for 90 million users. They also use 40 aggregator machines which actually construct the news feed that is shown on the website (in <50ms) by reading the events for each friend of the user from the transient storage machines and aggregating them. There are two racks each with twenty aggregators and twenty transient storage machines, where each rack has a complete copy of the data. The aggregators have affinity to transient storage machines in the same rack, but will go to the other  rack when local machine fails. There is no affinity between users and aggregators. They report they have 8x to 10x extra capacity in this solution. Facebook doesn’t have any geo-partitioning, which interesting given that majority of users are international. [JRH: they now have some geo-redundancy to serve read only queries nearer to users and to backup the primary site: Geo-Replication at Facebook]

The transient data is updated by another process called the “tailor” which reads the tail of a file on a network file system which actually contains the persisted copy of the data. The “tailor” periodically updates each user in the transient storage system via a system of dirty flags. Any of the transient machines can be restarted and reloaded from the persisted store in 10-20 minutes. This is different from the normal MySQL solution they use for the rest of their metadata.

They now allow comments on the news feed items. They also formalized 3 formats for the entry – one liner, summary, and picture plus text. The developers can register templates for each format (i.e. “author” has listened to “track” on MyMusicFoo) and then post just the data together with the template id instead of the whole message. The coalescing is black box – the system requires the developers to register multiple templates and will choose between them depending on the event volume. The event volume is throttles but the throttles change dynamically on the basis of user behavior i.e. if your applications event is marked as spam by some percentage of users the throttle is lowered.

Facebook Connect

In order to merge other websites into the ecosystem, Facebook is providing identity services to third party registered websites. A good example is integration with CitySearch. If you are logged into Facebook, you are automatically logged into CitySearch if you “CitySearch” enabled your Facebook account. Whenever you do a CitySearch review you have an option of spamming your friends news feed with it. You can also view reviews by your friends, who have “CitySearch” enabled their  Facebook account. Through a system of exchanging hashes for email addresses, there is a UI to invite your friends who are already on CitySearch to “CitySearch” enabled their  Facebook account. The end result is that in addition to providing identity services they also provide social network services and drive extra traffic to your site, making it more desirable for third party web sites to offer integration. In exchange the Facebook pages become more content rich and third party websites start acting almost like Facebook application.

 

International Support

 

Facebook is translated by the users themselves via voting system, where a user suggests a translation and the rest of the users vote on it. They opened this system up to applications, where application strings can be translated in the same way. While Facebook itself has had success with this model (complete translation to a  new language in <24 hours) it is less clear that application with smaller user bases will be translated quickly.

 

 

Tuesday, August 26, 2008 5:09:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, August 23, 2008

Kevin Clark, Director of IT Operations at Lucasfilm was interviewed by On-Demand Enterprise in We’ve Come a Long Way Since Star Wars.  His organization owns IT for LucasArts, Lucasfilm, and Industrial Light and Magic.

 

Lucasfilm runs a 4,500 server dedicated rendering farm and they expand this farm with workstations when  they are not in use to 5,500 servers in total.  The servers are dual socket, dual core Opterons with 32GB of memory.  Nothing unusual except the memory configuration is a bit larger than the current average.  They have 400TB of storage and produce 10 to 20TB of new and changed data each day.

 

Clark expects the big investment next year is making their datacenter more efficient. Partly for environmental reasons and partly because, like all businesses, they are power and cooling rather than floor space constrained. This is becoming the number one issue industry-wide and I’m glad to see. Current data center designs leave a lot of room for improvement.  At this year’s Foo Camp, I lead a short session on large scale data center power consumption: Where Does the Power Go and What Can We Do About It?

 

This cluster is medium sized but the data change rate is unusually high at 10 to 20TB a day.  It’s mostly batch work with each job being quite large.  It would be interesting to see more detail on the workload scheduler they have written to manage this workload.  It’s a bit ironic that IBM MVS (now called Z/OS) had a great scheduler 40 years ago. In the 10 years I worked for IBM, they constantly were requesting that a high quality batch scheduler be added to AIX.  And in the 11 years I’ve worked at Microsoft, there has been great interest in improving batch scheduling to the MVS-like levels.  More recently, Apache Hadoop has been used to run mega-jobs and, guess what?  It too needs a high-quality, prioritized, multi-job scheduler.  At the Hadoop Summit, Yahoo said they are working on one.  They typically contribute their Hadoop work to open source so Hadoop may have a better scheduler coming.

 

Thanks to Jeff Hammerbacher for pointing me to the note on Lucasfilm.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, August 23, 2008 6:50:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, August 21, 2008

Last Friday I arrived back from vacation (Back from the Outside Passage in BC) to 3,600 email messages.  I’ve been slogging through them through the weekend to now and I’m actually starting to catch up. 

 

Yesterday Tom Kleinpeter pointed me to this excellent posting from Jason Sobel of Facebook: Scaling Out. This excellent post describes the geo-replication support recently added to Facebook.  Rather than having a single west coast data center they added an east coast center both to be near to East Coast users and to provide redundancy for the west coast site.

 

It’s a cool post for two reasons: 1) it’s a fairly detailed description of how one large scale service implemented geo-redundancy, and 2) they are writing about it externally.  Way too many of these techniques are never discussed outside the implementing company and so everyone needs to keep re-inventing.  Facebook continues to share both how they engineer aspects of their service in addition to contributing some of the code used to do it to open source (e.g. Facebook Releases Cassandra as Open Source).  Good to see.

 

I’ve long been interested in geo-replication, geo-partitioning, and all other forms of cross data center operation because it’s a problem that every high scale service needs to solve and yet there really are no general, just-do-this recipes to easily operate over multiple data centers. A few common design patters have emerged but all solutions of reasonable complexity end up being application specific.  I would love to see general infrastructure emerge to generally support geo-redundancy but it’s a hard problem and solutions invariably exploit knowledge of application semantics. Since most solutions tend to be ad hoc and application specific today, it’s worth studying what they do while looking for common patterns. That was the other reason I enjoyed seeing this posting by Jason. He described how Facebook is currently addressing the issue and its always worth understanding as many existing solutions as possible before attempting a generalization.

 

The solution they have adopted is one where the California data center is primary and all writes are routed to it. The Virginia data center is secondary and serves read-only traffic.  A load balancer does layer 7 packet inspection and routes URIs for pages that support writes to CA.  If the page is read-only, as much of the Facebook traffic is, it gets routed to the east coast site (assuming you are “near” to it for some definition of near).  

 

All writes are done through the California data center. Its serves as primary for the entire service. When a write is done in the Facebook architecture, the corresponding memcached layer is invalidated after the write is completed.  MySQL replication is used to replicate the changes to the remote data center and solved the problem of only invalidating the remote memcached entries after the remote MySQL update by modifying MySQL.  They changed MySQL to clear the local memcached of the appropriate keys once the replicated write is complete. 

 

It’s a simple and fairly elegant approach and there is no doubt that simple is good. I would prefer an approach that scales out both reads and writes and there is a slight robustness risk that some engineer in the future may sometime add a page to the site that does a write and forget to update the this-page-writes URI list.  If MySQL supports running a secondary replica in read-only mode, then that potential issue can be quickly and easily detected. 

 

An approach that would allow multi-data center updates is to replicate the entire DB contents everywhere as Jason described in this post but rather than routing all writes through the California DB, partition the user-base on userID and route their traffic to a fixed center. Allow updates for that user at the data center they were routed to and replicate these changes to the other centers.  Essentially it’s the same solution that was described except rather than routing all requests to the single primary database in CA, the primary database is distributed over multiple datacenters partitioned on userID.  This approach would have many advantages but it is much more difficult if not all the data is cleanly partitioned by userID.  And, in social networks, it isn’t.  I refer to this as the “Hairball Problem” (Scaling LinkedIn).

 

The alternative approach I describe above of partitioning the primary database by userID would be reasonably easy to do in a service like an email system with most state partitioned cleanly by user but in a social network with lots of cross-user, shared state (the hairball problem), it’s harder to do.  Nonetheless, it’s probably the right place for FB to end up but the current solution is clean and works and it’s hard to argue with that.

 

If you come across other articles on geo-support or want to contribute one here, drop me a note jrh@mvdirona.com.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, August 21, 2008 5:36:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, July 17, 2008

Going  boating: http://mvdirona.com/ so I’ll be taking a break from blogging until mid-august when I’m back and caught back up.  Enjoy,

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Thursday, July 17, 2008 4:53:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, July 13, 2008

I just got back from O’Reilly’s Foo Camp.   Foo is an interesting conference format in that there is no set agenda. It’s basically self organized as a open space-type event but that’s not what makes it special.  What makes Foo a very cool conference is the people.  Lots of conferences invite good people but few invite such a  diverse a set of attendees.  It was a lot of fun.

 

Here’s a picture from Saturday night of (right to left) Jesse Robbins (Seattle entrepreneur and co-chair of O’Reilly’s Velocity conference), Pat Helland (Microsoft Developer Division Architect) and myself.

 

  

From: http://flickr.com/photos/jesserobbins/2663653582/

Foo is an invitational event loosely organized around folks O’Reilly find interesting or want to get to know better.  Tim O’Reilly describes the invite process in this blog comment: http://gigaom.com/2005/08/16/foocampfighting/#comment-19709 .

 

The conference starts around 5pm on Friday with drinks followed by dinner. After dinner, the sign-up boards for Saturday and Sunday talk sessions are put up. When Tim O’Reilly announced the sign-up boards were up, attendees rose from their tables and casually started ambling towards the sign-up boards as though there really was no rush. We’ll just be doing some signing up over the course of the evening.  We might do it now or perhaps we’ll do it later. No big rush.  And then some folks break into a slight trot towards but still not really an overt run  Suddenly, it’s a full raging stampede.  We became a single flow than as a discrete set of individuals. The flow accelerated and then crashed up against the sign-up boards spilling to both sides with folks madly negotiating for pens, better locations, sharing sessions, a shift of 6” to one side or the other, or  requesting to join two sessions together. Welcome to foo camp.  It didn’t slow down much from that point for the next 44 hours.

 

Sessions are 1 hour long but it’s good form to share a session if you don’t need the full hour. Speakers use their judgment to sign up for a small, medium or large room.  Some rooms have AV equipment and some don’t.

 

I shared a 1 hour ssession with Jeff Hammerbacher who leads the Facebook data team.  Earlier last week Jeff announced that he will be leaving Facebok in September: http://valleywag.com/5024169/yet-another-hoodie+wearing-harvard-kid-drops-out-of-facebook.  Jeff and I got a medium sized room in a tent beside the main building without A/V. 

 

The title for my session was Where Does the Power go in Data Centers and How to get it Back?  I didn’t show slides but much of what we covered is posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_DCPowerSavingsFooCamp08.ppt.  In the session, we talked through how contemporary large data centers work first looking at power distribution. We tracked the power from the feed to the substation at 115,000 volts through numerous conversions before arriving at the CPU at 1.2 volts. We then talked about power saving server design techniques.  And then the mechanical systems used to get the heat back out.  In each section we discussed what could be done to improve the design and how much could be saved.

 

Our conclusion from the session was that power savings of nearly 4x where both possible and affordable using only current technology.  For those participated in the session, thanks for your contribution and  for your help. It was fun.

 

                                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Sunday, July 13, 2008 5:58:43 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Saturday, July 12, 2008

Last week the Facebook Data team released Cassandra as open source. Cassandra is an structured store with write ahead logging and indexing. Jeff Hammerbacher, who leads the Facebook Data team described Cassandra as a BigTable data model running on a Dynamo-like infrastructure.

 

Google Code for Cassandra (Apache 2.0 License): http://code.google.com/p/the-cassandra-project/.

 

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented at SIGMOD 2008 this year: Cassandra: Structured Storage System over a P2P Network.  From the presentation:

 

Cassandra design goals:

·         High availability

·         Eventual consistency

·         Incremental scalability

·         Optimistic replication

·         Knobs to “tune” tradeoffs between consistency, durability, and latecy

·         Low cost of ownership

·         Minimal administration

 

Write operation: write to arbitrary node in Cassandra cluster, request sent to node owning the data, node writes to log first and then applied to in-memory copy. Properties of write: no locks in critical path, sequential disk accesses, behaves like a write through cache, atomicity guarantee for a key, and always writable.

 

Cluster membership is maintained via gossip protocol.

 

Lessons learned:

·         Add fancy features only when required

·         Many types of failures are possible

·         Big systems need proper systems-level monitoring

·         Value simple designs

 

Future work:

·         Atomicity guarantees across multiple keys

·         Distributed transactions (I’ll try to talk them out of this one)

·         Compression support

·         Fine grained security via ACLs

 

It looks like a well engineered system.

 

                                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, July 12, 2008 3:17:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)