Microsoft COSMOS at HPTS

Notes from a talk on COSMOS, Microsoft’s internal MapReduce system, given at HPTS 2011. This is the service Microsoft uses internally to run MapReduce jobs. Interestingly, Microsoft plans to use Hadoop in the external Azure service even though COSMOS looks quite good: Microsoft Announces Open Source Based Cloud Service. Rough notes below:

Talk: COSMOS: Big Data and Big Challenges

Speaker: Ed Harris

· Petabyte storage and computation systems

· Used primarily by search and advertising inside Microsoft

· Operated as a service with just over 4 9s of availability

· Massively parallel processing based upon Dryad

o Dryad is very similar to MapReduce

· Use SCOPE (Structured Computations Optimized for Parallel Execution) over Dryad

o A SQL-like language with an optimizer, implemented over Dryad

· They run hundreds of virtual clusters. In this model, internal Microsoft teams buy servers and give them to COSMOS, and in return are assured at least those resources

o Average 85% CPU over the cluster

· Ingest 1 to 2 PB/day

· Roughly 30% of the Search fleet is running COSMOS

· Architecture:

o Store Layer

§ Many extent nodes store and compress streams

§ Streams are sequences of extents

§ CSM: Cosmos Store Layer handles names, streams, and replication

· First-level compression is light. Data kept longer than a week is then recompressed more aggressively, on the assumption that data that has lived a week will likely live longer

o Execution Layer:

§ Jobs queue up on virtual clusters and are then executed

o SCOPE Layer

§ Compiler and optimizer for SCOPE

§ Ed said that the optimizer is a branch of the SQL Server optimizer

· They have 60+ PhD internships each year and hire ~30 a year
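To make the store layer's two-stage compression policy from the notes above concrete, here is a minimal sketch. It is not how COSMOS implements it: the talk did not name the codecs, so zlib levels 1 and 9 stand in for "light" and "aggressive", and the Extent class and function names are hypothetical.

```python
import zlib
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 3600  # the one-week threshold from the talk notes

# Assumption: zlib levels stand in for the unnamed light/aggressive codecs.
LIGHT_LEVEL = 1
AGGRESSIVE_LEVEL = 9


@dataclass
class Extent:
    """Hypothetical stored extent: compressed payload plus bookkeeping."""
    payload: bytes
    created_at: float  # seconds since some epoch
    level: int         # compression level currently applied


def ingest(raw: bytes, now: float) -> Extent:
    """Compress lightly on first write to keep ingest cheap."""
    return Extent(zlib.compress(raw, LIGHT_LEVEL), now, LIGHT_LEVEL)


def recompress_if_old(extent: Extent, now: float) -> Extent:
    """Recompress aggressively once an extent has survived a week."""
    age = now - extent.created_at
    if extent.level == AGGRESSIVE_LEVEL or age < WEEK_SECONDS:
        return extent  # too young, or already recompressed
    raw = zlib.decompress(extent.payload)
    return Extent(zlib.compress(raw, AGGRESSIVE_LEVEL),
                  extent.created_at, AGGRESSIVE_LEVEL)


def read(extent: Extent) -> bytes:
    """Return the original bytes regardless of the level in use."""
    return zlib.decompress(extent.payload)
```

The appeal of the policy is that the expensive compression pass is only paid for data that has already shown it is not short-lived, while readers see the same bytes either way.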

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
