Wednesday, June 18, 2008

Lars Bak leads the Google Aarhus Denmark lab. He’s one of the original developers of Sun HotSpot Java VM. the Self Programming Language, and the sun Connected Limited Device Configuration VM for mobile phone.  He’s schedule to do a talk at JAOO Aarhaus, Denmark (Sept. 30, 2008).  Unconfirmed rumors report he will be announcing “Google Secret Project” during his JAOO keynote.

 

It’s hard to know for sure what is coming but the popular speculation is that Google will be announcing a dynamic language runtime with support for Python, JavaScript, and Java. A language runtime running on both server-side and client-side with support for a broad range of client devices including mobile phones would be pretty interesting.

 

                                --jrh

 

John Lam pointed me to: https://secure.trifork.com/speaker/Lars+Bak.

 

James Hamilton, Data Center Futures Team
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Wednesday, June 18, 2008 6:54:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, June 14, 2008

Earlier today Google hosted the second Seattle Conference on Scalability. The talk on Chapel was a good description of a parallel language for high performance computing being implemented done at Cray.  The GIGA+ talk described a highly scalable filesystem metadata system implemented in Garth Gibson’s lab at CMU. The Google presentation described how they implemented maps.google.com on various mobile devices. It was full of gems on managing device proliferation and scaling the user interface down to small screen sizes. 

 

The Wikipedia talk showed an interesting transactional DHT implementation using Erlang.  And, the last talk of the day was a well presented talk by Vijay Menon on transactional memory. My rough notes from the 1 day conference follow.

 

                                    --jrh

 

 

Kick-off by Brian Bershad, Director of Google Seattle Lab

 

Communicating like Nemo: Scalability from a fish-eye View

·         Speaker: Jennifer Wong (Masters Student from University of Victoria)

o   Research area: faul tolerance in Mobile collaborative systems

·         Research aimed to bring easy communications at cost and beyond two-people for SCUBA divers.

·         Fairly large market with requirement to communicate

o   Note that PADI has 130k members (there are other many other recreational diving groups and, of course, there are commercial groups as well)

·         Proposal: use acoustic for underwater to surface stations. Wireless between surface stations doing relay.

·         Acoustic unit to be integrated into dive computer.

 

Maidsafe: Exploring Infinity

·         Speaker: David Irvine, Mindsafe.net Ltd.

o   http://www.maidsafe.net/ 

·         Problems with existing client systems

o   Data gets lost

o   Encryption is difficult

o   Remot4e access diff

o   Syncing between devices hard

·         Proposal: chunk files, hash the chunks, xor and then encrypt the chunks and distribute over a DHT over many systems

o   It wasn’t sufficiently clear where the “random data” that was xored in came from or where it was stored.

o   It wasn’t sufficiently clear where the encryption key that was used in the AES encryption came from or was stored. 

·         Minimum server count for a reliable cloud is 2,500.

·         It is a quid pro quo network. You need to store data for others to be able store data yourself.

·         Lots of technology in play. Uses SHA, XORing, AES, DHTs PKI (without central authority) and something the speaker referred to as self encryption. The talk was given in whiteboard format which didn’t get to the point where I could fully understand enough of the details.  There were several question from the audience on the security of the system.  Generally an interesting system that distributes data over a large number of non-cooperating client systems.

·         Seems similar in design goals to Oceanstore and Farsite.  Available for download at http://www.maidsafe.net/. 

UPDATE: The speaker, David Irvine, sent me a paper providing more detail on Maidsafe: 0316_maidsafe_Rev_004(Maidsafe_Scalability_Response)-Letter1 (2).pdf (72.33 KB).

 

Chapel: Productive Parallel Programming at Scale

·         Speaker: Bradford Chamberlain, Cray Inc.

·         Three main limits to HPC scalability:

o   Single Program, Multiple Data (SPMD) programming model (vector machines)

o   Exposure of low-level implementation details

o   Lack of programmability

·         Chapel Themes:

o   General parallel programming

o   Reduce gap between mainstream and parallel languages

o   Global-view abstractions vs fragmented

§  Fragmented model is thinking through the partitioned or fragmented solution over many processors (more complex than global-view)

§  MPI programmers program a fragmented view

o   Control of locality

o   Multi-resolution design

·         Chapel is work done at Cray further developing the ZPL work that Brad did at the University of Washington.

·         Approaches to HPC parallelism:

o   Communication Libraries

§  MPI, MPI-2, SHMEM, ARMCI, GASNet

o   S/hared Memory Models:

§  OpenMP, Pthrads

o   PGAS Languages (Partitioned Global Address Space)

·         Chapel is a block-structured, imperative programming language.  Selects the best features from ZPL, HPF, LCU, Java, C#, C++, C, Modula, Ada,…

·         Overall design is to allow efficient code to be written at a high level but still allow low level control of data and computation placement

 

CARMEN: A Scalable Science Cloud

·         Speaker: Paul Watson, CSC Dept., Newcastle University, UK

·         Essentially a generalized, service-based e-Science platform for bio-informatics

o   Vertical services implemented over general cloud-storage and computation

o   Currently neuroscience focused but also getting used by chemistry and a few other domains

·         Services include workflow, data, security, metatdata, and service (code) repository

·         e-Science Requirements Summary:

o   share both data and codes

o   Capacity: 100’s TB

o   Cloud architecture (facilitates sharing and economies of scale)

·         Many code deployment techniques:

o   .war files, .net, JAR, VMWare virtual machines, etc.

·         Summary: system supports upload data, metadata describing providence of the data (data type specific), code in many forms, and security policy.  Work requests are received and appropriate code is shipped from the code repository to the data (if not already there) and the job is scheduled and run returning results.  Execution graph is described as a workflow.  Workflows are constructed using graphical UI in a browser. Currently running on a proprietary resource cloud but using a commercial cloud such as AWS is becoming interesting to them. (pay as you go model is very interesting).

o   To use AWS or similar service, they need to 1) have the data there, and 2) someone has to maintain the vertical service.

o   Considering http://www.inkspotscience.com/ , a company focused upon “on-demnd e-science”. 

 

GIGA+: Scalable Directories for Shared File Systems

·         Speaker: Swapnil Patil, CSC Dept., Carnegie Mellon University

·         Work done with Garth Gibson

·         Large FS don’t scale the metadata service will.

·         The goals of this work is to scale millions and trillions of objects per directory

·         Supports UNIX VFS interface

·         As clusters scale, we need to scale filesystems and as filesystems grow, we need to scale them to millions of objects per directory

·         Unsynchronized, parallel growth without central coordination

·         Current state of the art:

·         Hash-tables: Linux EX2/Ext3

·         B-trees: XFS

·         Hash table issues are they don’t incrementally grow. 

·         Can use extensible hashing (Fagin 79) to solve the problem as

·         Lustre (Sun) uses a single central metadata server (limits scalability)

·         PVFS2 distribute the metadata over many servers using directory partitioning.  Helps distribute the load but a very large or very active directory is still on one metadata server.

·         IBM GPFS implements distributed locking (scalability problems) and shared storage as the communications mechanism (if node 1 has page X needed by node 2, it has to write it to disk and node 2 has to read it (ping ponging).

·         GPFS does well with lookup intensive workloads but doesn’t do well with high metadata update scenarios

·         What’s new in GIGA+:

·         Eliminate serialization

·         No synchronization or consistency bottlenecks

·         Weak consistency

·         Clients work with potentially stale data but it mostly works. On failure, the updated metadata is returned to the client.  Good optimistic approach.

·         When a hash bucket is split, the client will briefly be operating on stale data but, on failure, the correct location is returned updating the client cache.  Could use more aggressive techniques to update caches (e.g. gossip) but this is a future optimization.

·         Partition presence or absence is indicated by a bit-map on each server.

·         Very compact

·         Unlike extensible hashing, only those partitions that need to be split are split (supports non-symetric growth).

·         Appears to have similarities to consistent hashing.

 

Scaling Google Maps from the Big Screen Down to Mobile Phones

·         Speaker: Jerry Morrison, Google Inc.

·         Maps on the go.

·         Three main topics covered:

·         Scaling the UX

·         Coping with the mobile network

·         Note that in many countries the majority of network access is through mobile devices rather than PCs

·         Key problem of supporting mobile devices well is mostly about scaling the user experience (240x320 with very little keyboard support)

·         Basic approach is to replace the AJAX application with a device optimized client/server application.

·         Coping with the mobile network:

·         Low bandwidth and high latency (Peter Denning notes that bandwidth increases at the square of the latency improvement)

·         4.25 seconds average latency for an HTTP request

·         The initial connection takes additional seconds

·         Note: RIM handles all world-wide traffic through an encrypted VPN back to Waterloo, CA.

·         Google traffic requests

·         Load balancer

·         GMM which pulls concurrently from:

1.       Map search, Local search, & Directions

2.       Maps & sat tiles

3.       Highway traffic

4.       Location

·         Tiles are PNG at 64x64 pixels.  Larger tiles compress more but sends more than needed. Small can respond quicker but you need more tiles to complete the screen.

·         They believe that 22x22 is optimal for 130x180 screen (they use 64x64 looking to the future)

·         Tiles need mobile adaption:

·         More repetition of street names to get at least one on a screen

·         Want more compression so use less colors and some other image simplifications

·         JPEG has 600 bytes of header

·         Showed a picture of the Google Mobile Testing Lab

·         100s of phones

·         Grows 10% a quarter

·         Three code bases:

·         Java, C++, ObjectiveC (iPhone)

·         Language translation: fetch language translation dynamically as needed (can’t download 20+ on each request)

·         Interesting challenges from local juristictions:

·         Can’t export lat & long from China

·         Different representations of disputed borders

 

Scalable Wikipedia with Erlang

·         Speaker: Thorsten Schuett, Zuse Institute Berlin

·         Wikipedia #7 website (actually #6 in that MSN and Windows Live was shown as independent)

·         50,000 requests per second

·         Standard scaling story:

·         Started with a single server

·         Then split DB from mid-tier

·         Then partitioned DB

·         Then partitioned DB again into clusters (with directory to find which one)

·         Then put memcache in front of it

·         Their approach: use P2P overlay in the data center

·         Start with chord# (DHT) which supports insert, delete, and update

·         Add transactions, load balancing, ….

·         Transactions:

·         Implemented by electing a subset as transaction managers elected with Paxos protection

·         Quorum of >N/2 nodes

·         Supports Load Balancing

·         And supports multi-data center policy based geo-distribution for locality (low latency) and robustness (remote copies).

·         They have implemented distributed Erlang

·         Security problems

·         Scalability problems

·         Ended up having to implement their own transport layer

·         Summary:

·         DHT + Transactions = scalable, reliable, efficient key/value store

·         Overall system is message based and fail-fast and that made Erlang a good choice.

 

Scalable Multiprocessor Programming via Transactional Programming

·         Speaker: Vijay Menon, Google Inc.

·         Scalability no longer just about large scale distributed systems.  Modern processors are multi-core and trending towards many-core

·         Scalability is no longer restricted to a server problem. Clients and even some mobile devices today.

·         Conventional programming model is threads and locks.

·         Transactional memory: replaced locked regions with transactions

·         Optimistic concurrency control

·         Declarative safety in the language

·         Transactional Memory examples:

·         Azul: Large scale Java server with more than 500 cores

·         Transactional memory used in implementation but not exposed

·         Sun Rock (expected 2009)

·         Conflict resolution:

·         Eager: Assume potential conflict and prevent

·         Lazy: assume no conflict (record state and validate)

·         Most hardware TMs are lazy.  S/W implementations vary.

·         Example techniques:

·         Write in place in memory with old values written to log to support rollback)

·         Buffer changes and only persist on success

·         Showed some bugs in transaction management and showed how programming language support for transactions could help.:

·         Implementation challenges

·         Large transactions expensive

·         Implementation overhead

·         Semantic challenges including Allowing I/O and other operations that can’t be rolled back

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Saturday, June 14, 2008 4:15:52 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Software
 Wednesday, June 11, 2008

Jeff Dean did a great talk at Google IO this year. Some key points from Steve Garrity (msft pm) and some note from the excellent write-up at Google spotlights data center inner workings:

·         many unreliable servers to fewer high cost servers

·         Single search query touches 700 to up to 1k machines in < 0.25sec

·         36 data centers containing > 800K servers

o   40 servers/rack

·         Typical H/W failures: Install 1000 machines and in 1 year you’ll see: 1000+ HD failures, 20 mini switch failures, 5 full switch failures, 1 PDU failure

·         There are more than 200 Google File System clusters

·         The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines

·          MapReduce is increasing used within Google.

o   29,000 jobs in August 2004 and 2.2 million in September 2007

o   Average time to complete a job has dropped from 634 seconds to 395 seconds

o   Output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes

·         Typical day will run about 100,000 MapReduce jobs

o   each occupies about 400 servers

o   takes about 5 to 10 minutes to finish

 

More detail on the typical failures during the first year of a cluster from Jeff:

·         ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

·         ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

·         ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

·         ~1 network rewiring (rolling ~5% of machines down over 2-day span)

·         ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

·         ~5 racks go wonky (40-80 machines see 50% packetloss)

·         ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

·         ~12 router reloads (takes out DNS and external vips for a couple minutes)

·         ~3 router failures (have to immediately pull traffic for an hour)

·         ~dozens of minor 30-second blips for dns

·         ~1000 individual machine failures

·         ~thousands of hard drive failures

 

A pictorial history of Google hardware through the years starting with the current generation server hardware and working b