This morning I came across an article written by Sid Anand, an architect at Netflix that is super interesting. I liked it for two reasons: 1) it talks about the move of substantial portions of a high-scale web site to the cloud, some of how it was done, and why it was done, and 2) its gives best practices on AWS SimpleDB usage.
I love articles about how high scale systems work. Some past postings:
- FriendFeed use of MySQL
- Facebook Cassandra Architecture and Design
- Wikipedia Architecture
- MySpace Architecture and .Net
- Flickr DB Architecture
- Geo-Replication at Facebook
- Scaling at LucasFilms
- Facebook: Needle in a Haystack: Efficient Storage of Billions of Photos
- Scaling LinkedIn
- Scaling at MySpace
The article starts off by explaining why Netflix decided to move their infrastructure to the cloud:
Circa late 2008, Netflix had a single data center. This single data center raised a few concerns. As a single-point-of-failure, it represented a liability – data center outages meant interruptions to service and negative customer impact. Additionally, with growth in both streaming adoption and subscription levels, Netflix would soon outgrow this data center – we foresaw an imminent need for more power, better cooling, more space, and more hardware.
Our option was to build more data centers. Aside from high upfront costs, this endeavor would likely tie up key engineering resources in data center scale out activities, making them unavailable for new product initiatives. Additionally, we recognized the management of multiple data centers to be a complex task. Building out and managing multiple data centers seemed a risky distraction.
Rather than embarking on this path, we chose a more radical one. We decided to leverage one of the leading IAAS offerings at the time, Amazon Web Services. With multiple, large data centers already in operation and multiple levels of redundancy in various web services (e.g. S3 and SimpleDB), AWS promised better availability and scalability in a relatively short amount of time.
By migrating various network and back-end operations to a 3rd party cloud provider, Netflix chose to focus on its core competency: to deliver movies and TV shows.
I’ve read considerable speculation over the years on the difficulty of moving to cloud services. Some I agree with – these migrations do take engineering investment – while other reports seem to less well thought through focusing mostly on repeating concerns speculated upon by others. Often, the information content is light.
I know the move to the cloud can be done and is done frequently because, where I work, I’m lucky enough to see it happen every day. But the Netflix example is particularly interesting in that 1) Netflix is a fairly big enterprise with a market capitalization of $7.83B – moving this infrastructure is substantial and represents considerable complexity. It is a great example of what can be done; 2) Netflix is profitable and has no economic need to make the change – they made the decision to avoid distraction and stay focused on the innovation that made the company as successful as it is; and 3) they are willing to contribute their experiences back to the community. Thanks to Sid and Netflix for the later.
For more detail, check out the more detailed document that Sid Anand has posted: Netflix’s Transition to High-Availability Storage Systems.
For those SimpleDB readers, here’s a set of SimpleDB best practices from Sid’s write-up:
· Partial or no SQL support. Loosely-speaking, SimpleDB supports a subset of SQL
o Do GROUP BY and JOIN operations in the application layer
o One way to avoid the need for JOINs is to denormalize multiple Oracle tables into a single logical SimpleDB domain.
· No relations between domains
o Compose relations in the application layer
· No transactions
o Use SimpleDB’s Optimistic Concurrency Control API: ConditionalPut and ConditionalDelete
· No Triggers
o Do without
· No PL/SQL
o Do without
· No schema – This is non-obvious. A query for a misspelled attribute name will not fail with an error
o Implement a schema validator in a common data access layer
· No sequences
o Sequences are often used as primary keys
§ In this case, use a naturally occurring unique key. For example, in a Customer Contacts domain, use the customer’s mobile phone number as the item key
§ If no naturally occurring unique key exists, use a UUID
o Sequences are also often used for ordering
§ Use a distributed sequence generator
· No clock operations
o Do without
· No constraints. Specifically, no uniqueness, no foreign key, no referential & no integrity constraints
o Application can check constraints on read and repair data as a side effect. This is known as read-repair. Read repair should use the CondtionalPut or ConditionalDelete API to make atomic changes
If you are interested high sale web sites or cloud computing in general, this one is worth a read: Netflix’s Transition to High-Availability Storage Systems.
Update: Netflix architects Adrian Cockcroft and Sid Anand are both presenting at QconSF between November 1st and 5th in San Francisco: http://qconsf.com/sf2010.