Mesh Services Architecture and Concepts

Abolade Gbadegesin, Windows Live Mesh Architect, gave a great talk at the Microsoft Professional Developers Conference on Windows Live Mesh (talk video, talk slides). Live Mesh is a service that supports peer-to-peer file sharing among your devices, file storage in the cloud, remote access to all your devices (through firewalls and NATs), and web access to the files you choose to store in the cloud. Live Mesh is a good service and worth investigating in its own right, but what makes this talk particularly interesting is that Abolade gets into the architecture of how the system is written and, in many cases, why it is designed that way.

I’ve been advocating redundant, partitioned, fail-fast service designs based upon Recovery Oriented Computing for years; see, for example, Designing and Deploying Internet Scale Services (paper, slides). Live Mesh is a great example of such a service. It’s designed with enough redundancy and monitoring that service anomalies are detected and, when detected, it auto-recovers by first restarting, then rebooting, and finally re-imaging the failing system.
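To make that escalation concrete, here’s a minimal sketch of such a recovery policy. The talk doesn’t show code, so the names and the one-hour quiet period are my assumptions; the point is only the restart-then-reboot-then-re-image ladder, with the ladder resetting after a sustained period of health.

```python
import time

# Hypothetical escalating recovery policy: names and thresholds are
# illustrative, not from the talk. Each repeated health-check failure
# escalates to the next, more expensive recovery action.
RECOVERY_ACTIONS = ["restart_service", "reboot_machine", "reimage_machine"]

class RecoveryEscalator:
    def __init__(self, quiet_period_s=3600):
        self.failure_count = 0
        self.last_failure = 0.0
        self.quiet_period_s = quiet_period_s  # reset escalation after stability

    def on_health_check_failed(self) -> str:
        now = time.time()
        # If the node has been healthy for a while, start over at restart.
        if now - self.last_failure > self.quiet_period_s:
            self.failure_count = 0
        self.last_failure = now
        action = RECOVERY_ACTIONS[min(self.failure_count, len(RECOVERY_ACTIONS) - 1)]
        self.failure_count += 1
        return action  # caller dispatches: restart, reboot, or re-image
```

The design point is that each rung is cheap relative to the next: most faults clear with a process restart, so the expensive re-image is reserved for machines that keep failing.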

It’s partitioned across multiple data centers and, within each data center, across many symmetric commodity servers, each a 2-core, 4-disk, 8 GB system. The general design principles are:

· Commodity hardware

· Partitioning for scaling out, redundancy for availability (see the sketch after this list)

· Loose coupling across roles

· Xcopy deployment and configuration

· Fail-fast, recovery-oriented error handling

· Self-monitoring and self-healing
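As a rough illustration of the partitioning and redundancy principles, the sketch below hash-partitions users across a fixed partition space and places each partition on several replica servers. The partition count, replica count, and placement scheme are assumptions for illustration, not details from the talk.

```python
import hashlib

NUM_PARTITIONS = 1024        # assumed value, not from the talk
REPLICAS_PER_PARTITION = 3   # assumed value, not from the talk

def partition_for(user_id: str) -> int:
    # Stable hash so a given user always maps to the same partition.
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def replica_servers(partition: int, servers: list[str]) -> list[str]:
    # Spread each partition across several commodity servers so the
    # loss of any one machine leaves the partition available.
    return [servers[(partition + i) % len(servers)]
            for i in range(REPLICAS_PER_PARTITION)]
```

With three replicas on small commodity boxes, any single machine failure still leaves two copies of every partition it hosted, which is what lets the fail-fast recovery above be so aggressive.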

The scale out strategy is to:

· Partition by user, device, and Mesh Object

· Use soft state to minimize I/O load

· Leverage HTTP 1.1 semantics for caching, change notification, and incremental state transfer (see the sketch after this list)

· Leverage client-side resources for holding state

· Leverage peer connectivity for content replication
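The HTTP 1.1 item deserves a concrete example. Below is a hedged sketch of a conditional GET using ETag and If-None-Match: the server answers 304 Not Modified when the client’s cached copy is still current, so polling for changes costs almost nothing. The feed URL and function names are hypothetical; the talk describes the approach, not this code.

```python
import urllib.request
import urllib.error

# Hypothetical feed URL; Live Mesh exposed resources over HTTP, but this
# endpoint and the surrounding names are illustrative only.
FEED_URL = "https://example.com/mesh/objects/feed"

def fetch_if_changed(last_etag: str | None) -> tuple[bytes | None, str | None]:
    """Conditional GET: 304 Not Modified means the cached copy is current,
    so unchanged state costs no body transfer."""
    req = urllib.request.Request(FEED_URL)
    if last_etag:
        req.add_header("If-None-Match", last_etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, last_etag  # unchanged; keep using cached state
        raise
```

Because the heavy lifting rides on standard HTTP caching semantics, intermediate caches and client-side state both reduce load on the service tier, which is exactly the soft-state point above.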

Experiences and lessons learned on availability:

· Design for loosely coupled dependence on building blocks

· Diligently validate client/cloud upgrade scenarios

· Invest in pre-production stress and functional coverage in environments that look like production

· Design for throttling based on both dynamic thresholds and static bounds (see the sketch after this list)
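Here’s a rough sketch of that throttling idea: a static bound caps in-flight work absolutely, while a dynamic threshold derived from observed latency sheds load earlier when the system is under stress. The class name and all the numbers are illustrative assumptions, not values from the talk.

```python
import collections
import time

class Throttle:
    """Admission control combining a static bound (hard cap on in-flight
    requests) with a dynamic threshold (shrinks as observed latency rises).
    Names and numbers are illustrative, not from the talk."""

    STATIC_MAX_INFLIGHT = 200   # absolute ceiling, set at deploy time
    TARGET_LATENCY_S = 0.5      # dynamic signal: back off above this

    def __init__(self):
        self.inflight = 0
        self.recent_latencies = collections.deque(maxlen=100)

    def dynamic_limit(self) -> int:
        if not self.recent_latencies:
            return self.STATIC_MAX_INFLIGHT
        avg = sum(self.recent_latencies) / len(self.recent_latencies)
        # Halve the admission limit when average latency exceeds the target.
        if avg > self.TARGET_LATENCY_S:
            return max(1, self.STATIC_MAX_INFLIGHT // 2)
        return self.STATIC_MAX_INFLIGHT

    def try_admit(self) -> bool:
        if self.inflight >= min(self.STATIC_MAX_INFLIGHT, self.dynamic_limit()):
            return False  # shed load rather than queue unboundedly
        self.inflight += 1
        return True

    def complete(self, started_at: float):
        self.inflight -= 1
        self.recent_latencies.append(time.time() - started_at)
```

A request handler would call try_admit() before doing work and complete() afterward; rejected requests get a fast failure rather than joining an unbounded queue, which is what keeps an overloaded partition from collapsing entirely.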

Experiences and lessons learned on monitoring:

· Continuously refine performance counters, logs, and log processing tools

· Monitor end-user-visible operations (Keynote)

· Build end-to-end tracing across tiers

· Self-healing is hard: Invest in tuning watchdogs and thresholds (see the sketch after this list)
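On the watchdog point, the sketch below shows one common tuning approach: only trigger self-healing after several consecutive probe failures inside a time window, so a transient blip doesn’t cause a needless restart (or worse, escalate toward a re-image). The thresholds are assumptions, not values from the talk.

```python
import time

class Watchdog:
    """Tunable watchdog sketch: require several probe failures within a
    window before triggering self-healing. Thresholds are illustrative."""

    def __init__(self, failures_to_trip=3, window_s=60.0):
        self.failures_to_trip = failures_to_trip
        self.window_s = window_s
        self.failures = []

    def record(self, probe_ok: bool) -> bool:
        now = time.time()
        if probe_ok:
            self.failures.clear()  # any success resets the streak
            return False
        # Keep only failures inside the window, then test the trip count.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        return len(self.failures) >= self.failures_to_trip  # True => heal
```

Tuning these two knobs is the hard part the talk alludes to: too sensitive and the watchdog causes more outages than it cures; too lax and faults linger undetected.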

Experiences and lessons learned on deployment:

· Deployments every other week, client upgrades every month

· Major functionality roughly each quarter

· Took advantage of a gradual ramp to learn lessons early

–jrh

Thanks to Andrew Enfield for sending this one my way.

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
