James Hamilton's Blog RSS 2.0
 Saturday, June 14, 2008

Earlier today Google hosted the second Seattle Conference on Scalability. The talk on Chapel was a good description of a parallel language for high performance computing being implemented done at Cray.  The GIGA+ talk described a highly scalable filesystem metadata system implemented in Garth Gibson’s lab at CMU. The Google presentation described how they implemented maps.google.com on various mobile devices. It was full of gems on managing device proliferation and scaling the user interface down to small screen sizes. 

 

The Wikipedia talk showed an interesting transactional DHT implementation using Erlang.  And, the last talk of the day was a well presented talk by Vijay Menon on transactional memory. My rough notes from the 1 day conference follow.

 

                                    --jrh

 

 

Kick-off by Brian Bershad, Director of Google Seattle Lab

 

Communicating like Nemo: Scalability from a fish-eye View

·         Speaker: Jennifer Wong (Masters Student from University of Victoria)

o   Research area: faul tolerance in Mobile collaborative systems

·         Research aimed to bring easy communications at cost and beyond two-people for SCUBA divers.

·         Fairly large market with requirement to communicate

o   Note that PADI has 130k members (there are other many other recreational diving groups and, of course, there are commercial groups as well)

·         Proposal: use acoustic for underwater to surface stations. Wireless between surface stations doing relay.

·         Acoustic unit to be integrated into dive computer.

 

Maidsafe: Exploring Infinity

·         Speaker: David Irvine, Mindsafe.net Ltd.

o   http://www.maidsafe.net/ 

·         Problems with existing client systems

o   Data gets lost

o   Encryption is difficult

o   Remot4e access diff

o   Syncing between devices hard

·         Proposal: chunk files, hash the chunks, xor and then encrypt the chunks and distribute over a DHT over many systems

o   It wasn’t sufficiently clear where the “random data” that was xored in came from or where it was stored.

o   It wasn’t sufficiently clear where the encryption key that was used in the AES encryption came from or was stored. 

·         Minimum server count for a reliable cloud is 2,500.

·         It is a quid pro quo network. You need to store data for others to be able store data yourself.

·         Lots of technology in play. Uses SHA, XORing, AES, DHTs PKI (without central authority) and something the speaker referred to as self encryption. The talk was given in whiteboard format which didn’t get to the point where I could fully understand enough of the details.  There were several question from the audience on the security of the system.  Generally an interesting system that distributes data over a large number of non-cooperating client systems.

·         Seems similar in design goals to Oceanstore and Farsite.  Available for download at http://www.maidsafe.net/. 

UPDATE: The speaker, David Irvine, sent me a paper providing more detail on Maidsafe: 0316_maidsafe_Rev_004(Maidsafe_Scalability_Response)-Letter1 (2).pdf (72.33 KB).

 

Chapel: Productive Parallel Programming at Scale

·         Speaker: Bradford Chamberlain, Cray Inc.

·         Three main limits to HPC scalability:

o   Single Program, Multiple Data (SPMD) programming model (vector machines)

o   Exposure of low-level implementation details

o   Lack of programmability

·         Chapel Themes:

o   General parallel programming

o   Reduce gap between mainstream and parallel languages

o   Global-view abstractions vs fragmented

§  Fragmented model is thinking through the partitioned or fragmented solution over many processors (more complex than global-view)

§  MPI programmers program a fragmented view

o   Control of locality

o   Multi-resolution design

·         Chapel is work done at Cray further developing the ZPL work that Brad did at the University of Washington.

·         Approaches to HPC parallelism:

o   Communication Libraries

§  MPI, MPI-2, SHMEM, ARMCI, GASNet

o   S/hared Memory Models:

§  OpenMP, Pthrads

o   PGAS Languages (Partitioned Global Address Space)

·         Chapel is a block-structured, imperative programming language.  Selects the best features from ZPL, HPF, LCU, Java, C#, C++, C, Modula, Ada,…

·         Overall design is to allow efficient code to be written at a high level but still allow low level control of data and computation placement

 

CARMEN: A Scalable Science Cloud

·         Speaker: Paul Watson, CSC Dept., Newcastle University, UK

·         Essentially a generalized, service-based e-Science platform for bio-informatics

o   Vertical services implemented over general cloud-storage and computation

o   Currently neuroscience focused but also getting used by chemistry and a few other domains

·         Services include workflow, data, security, metatdata, and service (code) repository

·         e-Science Requirements Summary:

o   share both data and codes

o   Capacity: 100’s TB

o   Cloud architecture (facilitates sharing and economies of scale)

·         Many code deployment techniques:

o   .war files, .net, JAR, VMWare virtual machines, etc.

·         Summary: system supports upload data, metadata describing providence of the data (data type specific), code in many forms, and security policy.  Work requests are received and appropriate code is shipped from the code repository to the data (if not already there) and the job is scheduled and run returning results.  Execution graph is described as a workflow.  Workflows are constructed using graphical UI in a browser. Currently running on a proprietary resource cloud but using a commercial cloud such as AWS is becoming interesting to them. (pay as you go model is very interesting).

o   To use AWS or similar service, they need to 1) have the data there, and 2) someone has to maintain the vertical service.

o   Considering http://www.inkspotscience.com/ , a company focused upon “on-demnd e-science”. 

 

GIGA+: Scalable Directories for Shared File Systems

·         Speaker: Swapnil Patil, CSC Dept., Carnegie Mellon University

·         Work done with Garth Gibson

·         Large FS don’t scale the metadata service will.

·         The goals of this work is to scale millions and trillions of objects per directory

·         Supports UNIX VFS interface

·         As clusters scale, we need to scale filesystems and as filesystems grow, we need to scale them to millions of objects per directory

·         Unsynchronized, parallel growth without central coordination

·         Current state of the art:

·         Hash-tables: Linux EX2/Ext3

·         B-trees: XFS

·         Hash table issues are they don’t incrementally grow. 

·         Can use extensible hashing (Fagin 79) to solve the problem as

·         Lustre (Sun) uses a single central metadata server (limits scalability)

·         PVFS2 distribute the metadata over many servers using directory partitioning.  Helps distribute the load but a very large or very active directory is still on one metadata server.

·         IBM GPFS implements distributed locking (scalability problems) and shared storage as the communications mechanism (if node 1 has page X needed by node 2, it has to write it to disk and node 2 has to read it (ping ponging).

·         GPFS does well with lookup intensive workloads but doesn’t do well with high metadata update scenarios

·         What’s new in GIGA+:

·         Eliminate serialization

·         No synchronization or consistency bottlenecks

·         Weak consistency

·         Clients work with potentially stale data but it mostly works. On failure, the updated metadata is returned to the client.  Good optimistic approach.

·         When a hash bucket is split, the client will briefly be operating on stale data but, on failure, the correct location is returned updating the client cache.  Could use more aggressive techniques to update caches (e.g. gossip) but this is a future optimization.

·         Partition presence or absence is indicated by a bit-map on each server.

·         Very compact

·         Unlike extensible hashing, only those partitions that need to be split are split (supports non-symetric growth).

·         Appears to have similarities to consistent hashing.

 

Scaling Google Maps from the Big Screen Down to Mobile Phones

·         Speaker: Jerry Morrison, Google Inc.

·         Maps on the go.

·         Three main topics covered:

·         Scaling the UX

·         Coping with the mobile network

·         Note that in many countries the majority of network access is through mobile devices rather than PCs

·         Key problem of supporting mobile devices well is mostly about scaling the user experience (240x320 with very little keyboard support)

·         Basic approach is to replace the AJAX application with a device optimized client/server application.

·         Coping with the mobile network:

·         Low bandwidth and high latency (Peter Denning notes that bandwidth increases at the square of the latency improvement)

·         4.25 seconds average latency for an HTTP request

·         The initial connection takes additional seconds

·         Note: RIM handles all world-wide traffic through an encrypted VPN back to Waterloo, CA.

·         Google traffic requests

·         Load balancer

·         GMM which pulls concurrently from:

1.       Map search, Local search, & Directions

2.       Maps & sat tiles

3.       Highway traffic

4.       Location

·         Tiles are PNG at 64x64 pixels.  Larger tiles compress more but sends more than needed. Small can respond quicker but you need more tiles to complete the screen.

·         They believe that 22x22 is optimal for 130x180 screen (they use 64x64 looking to the future)

·         Tiles need mobile adaption:

·         More repetition of street names to get at least one on a screen

·         Want more compression so use less colors and some other image simplifications

·         JPEG has 600 bytes of header

·         Showed a picture of the Google Mobile Testing Lab

·         100s of phones

·         Grows 10% a quarter

·         Three code bases:

·         Java, C++, ObjectiveC (iPhone)

·         Language translation: fetch language translation dynamically as needed (can’t download 20+ on each request)

·         Interesting challenges from local juristictions:

·         Can’t export lat & long from China

·         Different representations of disputed borders

 

Scalable Wikipedia with Erlang

·         Speaker: Thorsten Schuett, Zuse Institute Berlin

·         Wikipedia #7 website (actually #6 in that MSN and Windows Live was shown as independent)

·         50,000 requests per second

·         Standard scaling story:

·         Started with a single server

·         Then split DB from mid-tier

·         Then partitioned DB

·         Then partitioned DB again into clusters (with directory to find which one)

·         Then put memcache in front of it

·         Their approach: use P2P overlay in the data center

·         Start with chord# (DHT) which supports insert, delete, and update

·         Add transactions, load balancing, ….

·         Transactions:

·         Implemented by electing a subset as transaction managers elected with Paxos protection

·         Quorum of >N/2 nodes

·         Supports Load Balancing

·         And supports multi-data center policy based geo-distribution for locality (low latency) and robustness (remote copies).

·         They have implemented distributed Erlang

·         Security problems

·         Scalability problems

·         Ended up having to implement their own transport layer

·         Summary:

·         DHT + Transactions = scalable, reliable, efficient key/value store

·         Overall system is message based and fail-fast and that made Erlang a good choice.

 

Scalable Multiprocessor Programming via Transactional Programming

·         Speaker: Vijay Menon, Google Inc.

·         Scalability no longer just about large scale distributed systems.  Modern processors are multi-core and trending towards many-core

·         Scalability is no longer restricted to a server problem. Clients and even some mobile devices today.

·         Conventional programming model is threads and locks.

·         Transactional memory: replaced locked regions with transactions

·         Optimistic concurrency control

·         Declarative safety in the language

·         Transactional Memory examples:

·         Azul: Large scale Java server with more than 500 cores

·         Transactional memory used in implementation but not exposed

·         Sun Rock (expected 2009)

·         Conflict resolution:

·         Eager: Assume potential conflict and prevent

·         Lazy: assume no conflict (record state and validate)

·         Most hardware TMs are lazy.  S/W implementations vary.

·         Example techniques:

·         Write in place in memory with old values written to log to support rollback)

·         Buffer changes and only persist on success

·         Showed some bugs in transaction management and showed how programming language support for transactions could help.:

·         Implementation challenges

·         Large transactions expensive

·         Implementation overhead

·         Semantic challenges including Allowing I/O and other operations that can’t be rolled back

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, June 14, 2008 4:15:52 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Software
 Friday, May 23, 2008

Wednesday Yahoo announced they have a built a petascale, distributed relational database.  In Yahoo Claims Record With Petabyte Database, the details are thin but they built on the PostgreSQL relational database system. In Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest, the system is described as an over 2 petabyte repository of user click stream and context data with an update rate for 24 billion events per day.  Waqar Hasan, VP of Engineering at Yahoo! Data group, describes the system as updated in real time and live – essentially a real time data warehouse where changes go in as they are made and queries always run against the most current data. I strongly suspect they are bulk parsing logs and the data is being pushed into the system in large bulk units but, even near real time at this update rate, is impressive.

 

The original work was done at a Seattle startup called Mahat Technologies acquired by Yahoo! in November 2005.

 

The approach appears to be similar to what we did with IBM DB2 Parallel Edition.  13 years ago we had it running on a cluster of 512 RS/6000s at the Maui Super Computer Center and 256 nodes at the Cornel Theory Center.  It’s a shared nothing design which means that each server in the cluster have independent disk and don’t share memory. The upside of this approach is it scales incredibly well. It looks like Yahoo! has done something similar using PostgreSQL as the base technology.  Each node in the cluster runs a full copy of the storage engine.  The query execution engine is replaced with one modified to run over a cluster and use a communications fabric to interconnect the nodes in the cluster.  The parallel query plans are run over the entire cluster with the plan nodes interconnected by the communication fabric.  The PostgreSQL client, communications protocol and server side components with some big exceptions run mostly unchanged.  The query optimizer is either replaced completely with a cluster parallel aware implementation that models the data layout and cluster topology in making optimization decisions.  Or the original, non-cluster parallel optimizer is used and the resultant single node plans are then optimized for the cluster in a post optimization phase. The former will yield provably better plans but it’s also more complex. I’m fearful of complexity around optimizers and, as a consequence, I actually prefer the slightly less optimal, post-optimization phase.  Many other problems have to be addressed including having the cluster metadata available on each node to support SQL query compilation but what I’ve sketched here covers the major points required to get such a design running.

 

The result is a modified version of PostgreSQL runs on each node.  A client can connect to any of the nodes in the cluster (or a policy restricted subset).  A query flows from the client to the server it chose to connect with. The SQL compiler on that node compiles and optimizes the query on that single node (no parallelism). The query optimizer is either cluster-aware or uses a post-optimization cluster-aware component.  The resultant query plan when ready for execution is divided up into sub-plans (plan fragments) that run on each node connected over the communication fabric.  Some execution engines initiate top-down and some bottom up. I don’t recall what PostgreSQL uses but bottom-up is easier in this case.  However, either can be made to work.  The plan fragments are distributed to the appropriate nodes in the cluster.  Each runs on local data and pipes results to other nodes which run plan fragments and forward the results yet again toward the root of the plan. The root of the plan runs on the node that started the compilation and the final results end up there to be returned to the client.

 

It’s a nice approach and as evidenced by Yahoo’s experience it scales, scales, scales.  I also like the approach in that most tools and applications can continue to work with little change.  Most clusters of this design have some restrictions such unique ID generation is either not supported or slow as is referential integrity.  Nonetheless, a large class of software can be run without change.

 

If you are interested in digging deeper into Relational Database technology and how the major commercial systems are written, see Architecture of a Database System.

 

Yahoo has a long history of contributing to Open Source and they are the largest contributor to the Apache Hadoop project. It’ll be interesting to see if Yahoo! Data ends up open source or held as an internal only asset.

 

Kevin Merritt pointed me to the Yahoo! Data work.

 

                                                -jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Friday, May 23, 2008 6:22:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Software
 Saturday, May 17, 2008

I’ve been involved with high scale systems software projects, mostly database engines, for the last 20 years and I’ve watched the transition from low level and proprietary languages to C. Then C to C++. Recently I’ve been thinking a bit about what’s next.

 

Back in the very early 90’s when I was Lead Architect on IBM DB2, I was dead against C++ usage in the Storage Engine and wouldn’t allow exceptions to be used anywhere in the system. At the time, the quality of C++ compilers was variable with some being real compilers that were actually fairly well done (I lead the IBM RS/6000 C++ team in the late 80s) while others were Cfront-based and pretty weak.  At the time no compiler, including the one I worked on, did a good job implementing exceptions.  Times change.  SQL Server, for example, is 100% C++ and it makes excellent use of exception to clean up resources on failure. 

 

The productivity benefits of new programming languages and tools eventually wins out.  When they get broad use, implementations improve reducing the performance tax and, eventually, even very performance sensitive system software make the transition.

 

I got interested in Java in the mid-90’s and more recently I’ve been using C# quite a bit partly due to where I work and partly because I actually find the language and surrounding tools impressively good.  JITed languages typically don’t perform as well as statically compiled languages but the advantages completely swamp the minor performance costs.  And, as managed language (Java, C#, etc.) implementations improve, the performance tax continues to fall. There is no question in my mind that managed languages will end up being broadly used in even the most performance critical software systems such as database engines.

 

Recently, I’ve gotten interested in Erlang as an systems software implementation language.  By most measures, it looks to be an unlikely choice for high scale system software in that its interpreted, has a functional subset at its core, and uses message passing rather than shared memory and locks. Basically, it’s just about the opposite of everything you would find in a modern commercial database kernel.  So what makes it interesting? The short answer is all the things that make it an unlikely choice also make it interesting.  Servers are becoming increasingly unbalanced with CPU speeds continuing to outpace memory and network bandwidth.  More and more operations are going to be memory and network bound rather than CPU if they aren’t already.  Trading some CPU resources to get a more robust implementation that is easier to understand and maintain is a good choice.  In addition, CPU speed increases are now coming more from multiple cores than from frequency scaling a single core. Consequently a language that produces an abundance of parallelism is a an asset rather than a problem. Finally, large systems software projects like database management systems, operating systems, web servers, IM servers, email systems, etc. are incredibly large and complex. The Erlang model of spawning many lightweight threads that communicate via message passing is going to be less efficient than the more common shared memory and locks solution but it’s much easier to get correct.  Erlang also encourages a “fail fast” programming model.  I’ve long argued that this is the only way to get high scale systems software correct (Designing and Deploying Internet-Scale Services).