Earlier today Google hosted the second Seattle Conference on Scalability. The talk on Chapel was a good description of a parallel language for high-performance computing being implemented at Cray. The GIGA+ talk described a highly scalable filesystem metadata system implemented in Garth Gibson’s lab at CMU. The Google presentation described how they implemented maps.google.com on various mobile devices; it was full of gems on managing device proliferation and scaling the user interface down to small screen sizes.
The Wikipedia talk showed an interesting transactional DHT implementation using Erlang. And the last talk of the day was a well-presented talk by Vijay Menon on transactional memory. My rough notes from the one-day conference follow.
–jrh
Kick-off by Brian Bershad, Director of Google Seattle Lab
Communicating like Nemo: Scalability from a fish-eye View
· Speaker: Jennifer Wong (Masters Student from University of Victoria)
o Research area: fault tolerance in mobile collaborative systems
· The research aims to bring easy, affordable communication beyond two people to SCUBA divers.
· Fairly large market with requirement to communicate
o Note that PADI has 130k members (there are many other recreational diving groups and, of course, commercial groups as well)
· Proposal: use acoustic communication from the diver underwater to surface stations, with wireless relay between the surface stations.
· Acoustic unit to be integrated into dive computer.
Maidsafe: Exploring Infinity
· Speaker: David Irvine, maidsafe.net Ltd.
· Problems with existing client systems
o Data gets lost
o Encryption is difficult
o Remote access is difficult
o Syncing between devices hard
· Proposal: chunk files, hash the chunks, XOR and then encrypt the chunks, and distribute them over a DHT spanning many systems (a rough sketch follows these notes)
o It wasn’t sufficiently clear where the “random data” that was xored in came from or where it was stored.
o It wasn’t sufficiently clear where the encryption key that was used in the AES encryption came from or was stored.
· Minimum server count for a reliable cloud is 2,500.
· It is a quid pro quo network: you need to store data for others to be able to store data yourself.
· Lots of technology in play: SHA hashing, XORing, AES, DHTs, PKI (without a central authority), and something the speaker referred to as self-encryption. The talk was given in whiteboard format and never quite got to the point where I could fully understand the details. There were several questions from the audience on the security of the system. Generally an interesting system that distributes data over a large number of non-cooperating client systems.
· Seems similar in design goals to Oceanstore and Farsite. Available for download at http://www.maidsafe.net/.
UPDATE: The speaker, David Irvine, sent me a paper providing more detail on Maidsafe: 0316_maidsafe_Rev_004(Maidsafe_Scalability_Response)-Letter1 (2).pdf (72.33 KB).
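To make the proposal above a bit more concrete, here is a rough Python sketch of self-encryption as I understood it from the talk and from David's comment at the end of this post. The chunk size and key-derivation details are my assumptions, and I've used a hash-derived XOR keystream as a stand-in for the AES step, so treat it as illustrative only:

```python
import hashlib
import itertools

CHUNK_SIZE = 256 * 1024               # assumed chunk size; the real system picks its own

def sha(data: bytes) -> bytes:
    return hashlib.sha512(data).digest()

def xor(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, itertools.cycle(pad)))

def self_encrypt(blob: bytes):
    """Chunk, obfuscate, 'encrypt', and name chunks by content hash."""
    chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
    pre_hashes = [sha(c) for c in chunks]      # hashes of the plaintext chunks
    obfuscation_pad = sha(blob)                # the XORed-in data: repeated hash of the file
    datamap, dht = [], {}                      # the dict stands in for the DHT
    for i, chunk in enumerate(chunks):
        obfuscated = xor(chunk, obfuscation_pad)
        # chunk i is keyed off the pre-encryption hash of its neighbour (i+1, wrapping);
        # the real system uses AES here -- this XOR keystream is only a stand-in
        key = pre_hashes[(i + 1) % len(chunks)]
        ciphertext = xor(obfuscated, key)
        name = sha(ciphertext)                 # chunks are renamed by post-encryption hash
        dht[name] = ciphertext                 # would be a DHT put in the real system
        datamap.append((name, pre_hashes[i]))  # kept by the owner to reverse the process
    return datamap, dht
```

Reversal just walks the data map the other way: fetch each chunk by name from the DHT, undo the keystream, then undo the obfuscation pad.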
Chapel: Productive Parallel Programming at Scale
· Speaker: Bradford Chamberlain, Cray Inc.
· Three main limits to HPC scalability:
o Single Program, Multiple Data (SPMD) programming model (vector machines)
o Exposure of low-level implementation details
o Lack of programmability
· Chapel Themes:
o General parallel programming
o Reduce gap between mainstream and parallel languages
o Global-view abstractions vs fragmented
§ The fragmented model means thinking through the partitioned solution across many processors, which is more complex than the global view (a rough sketch contrasting the two styles follows these notes)
§ MPI programmers program a fragmented view
o Control of locality
o Multi-resolution design
· Chapel is work done at Cray further developing the ZPL work that Brad did at the University of Washington.
· Approaches to HPC parallelism:
o Communication Libraries
§ MPI, MPI-2, SHMEM, ARMCI, GASNet
o Shared Memory Models:
§ OpenMP, Pthreads
o PGAS Languages (Partitioned Global Address Space)
· Chapel is a block-structured, imperative programming language. It selects the best features from ZPL, HPF, CLU, Java, C#, C++, C, Modula, Ada,…
· Overall design is to allow efficient code to be written at a high level but still allow low level control of data and computation placement
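To illustrate the global-view vs. fragmented distinction in a language-neutral way, here is my own analogy in Python (not Chapel code; Chapel expresses the global view with distributed domains and forall loops):

```python
# Global view: state what should happen over the whole index space once.
def increment_global(a):
    return [x + 1 for x in a]            # the compiler/runtime decides the partitioning

# Fragmented (SPMD / MPI-style): every process re-derives its own piece.
def increment_fragmented(a, rank, nprocs):
    n = len(a)
    lo = rank * n // nprocs              # per-rank index arithmetic the programmer
    hi = (rank + 1) * n // nprocs        # must get right, on every process
    return [x + 1 for x in a[lo:hi]]     # plus explicit communication (not shown)
```

The fragmented version isn't much harder for an element-wise increment, but once boundary exchanges and uneven distributions enter the picture the bookkeeping grows quickly, and that's the gap Chapel is trying to close.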
CARMEN: A Scalable Science Cloud
· Speaker: Paul Watson, CSC Dept., Newcastle University, UK
· Essentially a generalized, service-based e-Science platform for bio-informatics
o Vertical services implemented over general cloud-storage and computation
o Currently neuroscience focused but also getting used by chemistry and a few other domains
· Services include workflow, data, security, metadata, and a service (code) repository
· e-Science Requirements Summary:
o share both data and codes
o Capacity: 100’s TB
o Cloud architecture (facilitates sharing and economies of scale)
· Many code deployment techniques:
o .war files, .net, JAR, VMWare virtual machines, etc.
· Summary: the system supports uploading data, metadata describing the provenance of the data (data-type specific), code in many forms, and security policy. Work requests are received, the appropriate code is shipped from the code repository to the data (if it isn't already there), and the job is scheduled and run, returning results (a rough sketch of that placement decision follows these notes). The execution graph is described as a workflow; workflows are constructed using a graphical UI in a browser. Currently running on a proprietary resource cloud, but using a commercial cloud such as AWS is becoming interesting to them (the pay-as-you-go model is very attractive).
o To use AWS or similar service, they need to 1) have the data there, and 2) someone has to maintain the vertical service.
o Considering http://www.inkspotscience.com/, a company focused upon “on-demand e-science”.
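Here is a rough sketch of the “ship the code to the data” decision described in the summary above. Every name below is my own invention for illustration rather than CARMEN's actual API:

```python
from dataclasses import dataclass

@dataclass
class WorkRequest:
    code_id: str        # an entry in the service/code repository
    dataset_id: str     # a dataset already uploaded to the cloud

# Hypothetical catalogs; the real metadata and repository services are far richer.
DATA_LOCATION = {"spike-trains-042": "node-7"}
CODE_DEPLOYED = {"node-7": {"spike-sorter-v2"}}

def schedule(req: WorkRequest) -> str:
    """Run the job where the data lives, shipping the code only if needed."""
    node = DATA_LOCATION[req.dataset_id]
    if req.code_id not in CODE_DEPLOYED.setdefault(node, set()):
        # here the real system would fetch the .war/JAR/VM image and deploy it
        CODE_DEPLOYED[node].add(req.code_id)
    return f"{req.code_id} scheduled on {node}"

print(schedule(WorkRequest("spike-sorter-v2", "spike-trains-042")))
```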
GIGA+: Scalable Directories for Shared File Systems
· Speaker: Swapnil Patil, CSC Dept., Carnegie Mellon University
· Work done with Garth Gibson
· Large file systems don’t scale the metadata service well.
· The goal of this work is to scale to millions and even trillions of objects per directory
· Supports UNIX VFS interface
· As clusters scale, we need to scale filesystems and as filesystems grow, we need to scale them to millions of objects per directory
· Unsynchronized, parallel growth without central coordination
· Current state of the art:
· Hash-tables: Linux Ext2/Ext3
· B-trees: XFS
· The problem with hash tables is that they don’t grow incrementally.
· Extensible hashing (Fagin ’79) can be used to address this.
· Lustre (Sun) uses a single central metadata server (limits scalability)
· PVFS2 distributes the metadata over many servers using directory partitioning. This helps distribute the load, but a very large or very active directory still lives on a single metadata server.
· IBM GPFS implements distributed locking (scalability problems) and uses shared storage as the communications mechanism: if node 1 has a page needed by node 2, node 1 has to write it to disk and node 2 has to read it back (ping-ponging).
· GPFS does well with lookup intensive workloads but doesn’t do well with high metadata update scenarios
· What’s new in GIGA+:
· Eliminate serialization
· No synchronization or consistency bottlenecks
· Weak consistency
· Clients work with potentially stale data, but it mostly works: when a lookup misses, the updated metadata is returned to the client. A good optimistic approach.
· When a hash bucket is split, the client will briefly operate on stale data but, on a miss, the correct location is returned, updating the client cache (a toy sketch of this protocol follows these notes). More aggressive techniques (e.g., gossip) could be used to update caches, but that is a future optimization.
· Partition presence or absence is indicated by a bit-map on each server.
· Very compact
· Unlike extensible hashing, only those partitions that need to be split are split (supports non-symmetric growth).
· Appears to have similarities to consistent hashing.
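Here is a toy reconstruction of the indexing idea as I understood it (my own code, not GIGA+’s): the client hashes the name, walks its possibly stale bitmap to pick a partition, and when it guesses wrong the server replies with its newer bitmap so the client can retry.

```python
import hashlib

def name_hash(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")

def partition_for(key_hash: int, bitmap: set) -> int:
    """Walk the split tree encoded by the bitmap.

    Partition p at depth d has been split iff its child p + 2**d is present,
    so only the partitions that actually overflow ever split (growth can be
    lopsided, unlike classic extensible hashing's doubling directory).
    """
    p, depth = 0, 0
    while (p + (1 << depth)) in bitmap:
        if key_hash & (1 << depth):        # bit d decides which half the key follows
            p += 1 << depth
        depth += 1
    return p

class Server:
    def __init__(self):
        self.bitmap = {0}                  # partition 0 always exists

    def split(self, p: int, depth: int):
        self.bitmap.add(p + (1 << depth))  # servers split without global coordination

    def lookup(self, key_hash: int, client_partition: int):
        correct = partition_for(key_hash, self.bitmap)
        if correct != client_partition:    # the client used a stale bitmap
            return "retry", set(self.bitmap)
        return "ok", correct

class Client:
    def __init__(self, server: Server):
        self.server = server
        self.bitmap = {0}                  # possibly stale local copy

    def lookup(self, name: str) -> int:
        k = name_hash(name)
        status, result = self.server.lookup(k, partition_for(k, self.bitmap))
        if status == "retry":
            self.bitmap = result           # adopt the fresher bitmap and try again
            status, result = self.server.lookup(k, partition_for(k, self.bitmap))
        return result

srv = Server()
cli = Client(srv)
srv.split(0, 0)                            # partition 0 overflowed and split into {0, 1}
print(cli.lookup("some-huge-directory-entry"))
```

The bitmap is tiny (a bit per potential partition), which is presumably what makes shipping a fresh copy back to a stale client cheap.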
Scaling Google Maps from the Big Screen Down to Mobile Phones
· Speaker: Jerry Morrison, Google Inc.
· Maps on the go.
· Three main topics covered:
· Scaling the UX
· Coping with the mobile network
· Note that in many countries the majority of network access is through mobile devices rather than PCs
· Key problem of supporting mobile devices well is mostly about scaling the user experience (240×320 with very little keyboard support)
· Basic approach is to replace the AJAX application with a device optimized client/server application.
· Coping with the mobile network:
· Low bandwidth and high latency (Peter Denning notes that bandwidth increases at the square of the latency improvement)
· 4.25 seconds average latency for an HTTP request
· The initial connection takes additional seconds
· Note: RIM handles all worldwide traffic through an encrypted VPN back to Waterloo, Canada.
· Google traffic requests
· Load balancer
· GMM which pulls concurrently from:
1. Map search, Local search, & Directions
2. Maps & sat tiles
3. Highway traffic
4. Location
· Tiles are PNGs at 64×64 pixels. Larger tiles compress better but send more data than needed; smaller tiles can be returned more quickly but more of them are needed to fill the screen (a back-of-envelope calculation follows these notes).
· They believe 22×22 is optimal for a 130×180 screen (they use 64×64 looking to the future)
· Tiles need mobile adaptation:
· More repetition of street names, to get at least one on each screen
· They want more compression, so they use fewer colors and some other image simplifications
· JPEG has roughly 600 bytes of header overhead
· Showed a picture of the Google Mobile Testing Lab
· 100s of phones
· Grows 10% a quarter
· Three code bases:
· Java, C++, Objective-C (iPhone)
· Language translation: fetch language translation dynamically as needed (can’t download 20+ on each request)
· Interesting challenges from local jurisdictions:
· Can’t export lat & long from China
· Different representations of disputed borders
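A back-of-envelope look at the tile-size tradeoff mentioned above, using the 240×320 screen size and the ~600-byte JPEG header figure from the talk (PNG overhead differs, so the absolute numbers are only illustrative):

```python
import math

def tile_cost(screen_w, screen_h, tile, header_bytes=600):
    """Tiles needed to cover a viewport that may straddle tile boundaries."""
    across = math.ceil(screen_w / tile) + 1              # +1 for partial tiles at the edges
    down = math.ceil(screen_h / tile) + 1
    tiles = across * down
    wasted = tiles * tile * tile - screen_w * screen_h   # pixels fetched but off-screen
    return tiles, wasted, tiles * header_bytes

for tile in (22, 64, 128):
    tiles, wasted, overhead = tile_cost(240, 320, tile)
    print(f"{tile:>3}px tiles: {tiles:3d} tiles, {wasted:7d} wasted pixels, "
          f"{overhead:6d} header bytes")
```

Smaller tiles waste fewer off-screen pixels but multiply the per-tile header and request overhead; larger tiles do the opposite, which is roughly the tension behind the 22×22 vs. 64×64 discussion.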
Scalable Wikipedia with Erlang
· Speaker: Thorsten Schuett, Zuse Institute Berlin
· Wikipedia is the #7 website (actually #6, given that MSN and Windows Live were shown as independent sites)
· 50,000 requests per second
· Standard scaling story:
· Started with a single server
· Then split DB from mid-tier
· Then partitioned DB
· Then partitioned DB again into clusters (with directory to find which one)
· Then put memcache in front of it
· Their approach: use P2P overlay in the data center
· Start with Chord# (a DHT), which supports insert, delete, and update
· Add transactions, load balancing, ….
· Transactions:
· Implemented by electing a subset of nodes as transaction managers, with the election protected by Paxos (a toy illustration of the quorum rule follows these notes)
· Quorum of >N/2 nodes
· Supports Load Balancing
· And supports multi-data center policy based geo-distribution for locality (low latency) and robustness (remote copies).
· They used distributed Erlang but ran into issues:
· Security problems
· Scalability problems
· Ended up having to implement their own transport layer
· Summary:
· DHT + Transactions = scalable, reliable, efficient key/value store
· Overall system is message based and fail-fast and that made Erlang a good choice.
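Here is a toy illustration of the majority-quorum rule mentioned above. The real system elects transaction managers with Paxos and does considerably more; everything below is my own simplification:

```python
import random

N = 5                                    # replicas per key
QUORUM = N // 2 + 1                      # > N/2 nodes must participate

replicas = [dict() for _ in range(N)]    # each dict stands in for a storage node

def reachable():
    """Pretend some replicas may be down; a real DHT would route around them."""
    return [r for r in replicas if random.random() > 0.1]

def read(key):
    """Collect versioned values from a majority and keep the newest."""
    nodes = reachable()
    if len(nodes) < QUORUM:
        raise RuntimeError("no read quorum")
    votes = [n[key] for n in nodes[:QUORUM] if key in n]
    return max(votes) if votes else (0, None)

def write(key, value):
    """A write only commits if a majority accepts the new version."""
    nodes = reachable()
    if len(nodes) < QUORUM:
        raise RuntimeError("no write quorum")
    version = read(key)[0] + 1
    for n in nodes:
        n[key] = (version, value)

write("article:Scalability", "first revision")
print(read("article:Scalability"))       # typically (1, 'first revision')
```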
Scalable Multiprocessor Programming via Transactional Programming
· Speaker: Vijay Menon, Google Inc.
· Scalability no longer just about large scale distributed systems. Modern processors are multi-core and trending towards many-core
· Scalability is no longer restricted to servers; it is a problem for clients and even some mobile devices today.
· Conventional programming model is threads and locks.
· Transactional memory: replaces locked regions with transactions
· Optimistic concurrency control
· Declarative safety in the language
· Transactional Memory examples:
· Azul: Large scale Java server with more than 500 cores
· Transactional memory used in implementation but not exposed
· Sun Rock (expected 2009)
· Conflict resolution:
· Eager: Assume potential conflict and prevent
· Lazy: assume no conflict (record state and validate)
· Most hardware TMs are lazy. S/W implementations vary.
· Example techniques:
· Write in place in memory, with old values written to a log to support rollback
· Buffer changes and only commit them on success (a toy sketch of this lazy approach follows these notes)
· Showed some bugs in transaction management and how programming language support for transactions could help:
· Implementation challenges
· Large transactions expensive
· Implementation overhead
· Semantic challenges, including allowing I/O and other operations that can’t be rolled back
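To make the lazy (buffer-and-validate) flavor concrete, here is a minimal software-TM-style sketch in Python. It illustrates optimistic concurrency control only; it is not how Azul, Rock, or any production STM is implemented:

```python
import threading

class TVar:
    """A transactional cell with a version number for conflict detection."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self.lock = threading.Lock()

def atomic(fn):
    """Run fn(read, write): buffer writes, validate read versions, then commit."""
    while True:
        reads, writes = {}, {}

        def read(tv):
            if tv in writes:                      # read-your-own-writes
                return writes[tv]
            reads.setdefault(tv, tv.version)      # remember the version we saw
            return tv.value

        def write(tv, value):
            writes[tv] = value                    # lazily buffered, not yet visible

        result = fn(read, write)
        locks = sorted(writes, key=id)            # a total order avoids deadlock
        for tv in locks:
            tv.lock.acquire()
        try:
            if any(tv.version != v for tv, v in reads.items()):
                continue                          # someone committed under us: retry
            for tv, value in writes.items():
                tv.value, tv.version = value, tv.version + 1
            return result
        finally:
            for tv in locks:
                tv.lock.release()

# the classic two-account transfer: no lock ordering for the caller to get wrong
a, b = TVar(100), TVar(0)

def transfer(read, write):
    write(a, read(a) - 10)
    write(b, read(b) + 10)

atomic(transfer)
print(a.value, b.value)                           # 90 10
```

Acquire ownership of each location at first access instead of at commit time and you get, roughly, the eager flavor.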
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Thanks for the additional detail above David. Very useful. My read of the man-in-the-middle attack was that you were likely vulnerable, but your explanation above is a reasonable mitigation.
If you can send more detailed slides and/or a paper my way, I would love to dig deeper and perhaps post it. Thanks,
–jrh
Hi James
Thanks for the precis, very good. The maidsafe presentation normally takes around 4 hours, and it takes developers around 6 months to catch up in the office. It was cut pretty last minute from 1 hour to 30 mins. This was to allow everything on one track, which actually makes sense, although it was difficult.
To answer your questions though:
Self encryption
The hash of the initial chunk (pre-encryption) is used to encrypt the corresponding XORed chunk + 1 (i.e. the C2 hash is used to encrypt C1, C3 encrypts C2). The post-encryption chunks are renamed with the new hash of their content and stored as key-value pairs on the DHT (actually it’s pointers to where the maid layer places the data based on rank).
The XORed data is created from the hash of the file itself, repeated to create a chunk of the same size as the rest of the chunks. There are many ways to do this and we do some other tricks, but this is a straightforward way of describing it.
Security of the system is based on PKI encryption and signing, with AES-encrypted obfuscated data. I was unsure about the man-in-the-middle attack issue one questioner had, but after speaking to him it was thought that a partner could listen to another partner to get the answer to a random question about chunk integrity. This is actually prevented as there are several thousand partners for each chunk and multiple clients requesting data, so an attacker would need to decrypt several thousand encrypted and signed messages in a very short space of time to fool the system into believing he had the data. The clients would also pick this up pretty quickly as they asked for and never got the data.
Unfortunately, due to time constraints, I could not describe the node activity, which includes ranking alerts and false-accusation management. This would take 45 mins minimum but does allow the audience to appreciate how the security of the system works.
Thanks for a great summation of all the talks James.
David (do not hesitate to contact me for any further info david dot irvine (at) maidsafe dot net)