As many of you know, I collect scaling war stories from high-scale services, and I’ve appended many of them below. Last week Ars Technica published a detailed article on scaling Second Life: What Second Life can Teach your Datacenter About Scaling Web Apps. The article is by Ian Wilkes, who worked at Second Life from 2001 to 2009, where he was director of operations. My rough notes follow:
· Understand the scale required (see the back-of-envelope sketch after these notes):
o Billing system serving the US and EU, where each user interacts once a year and the system has 10% penetration: 2 to 3 events/second
o Chat system serving the US and EU, where each user sends 10 messages/day during the workday: 20k messages/second
· Understand whether the system has to be available 24×7 and what the impact of downtime would be (beware of over-investing in less important dimensions at the expense of more important ones)
· Understand the resource impact of features. Be especially cautious around relational database systems and object-relational mapping frameworks. If nobody knows the resource requirements, expect trouble in the near future.
· Database pain: “Almost all online systems use an SQL-based RDBMS, or something very much like one, to store some or all of their data, and this is often the first and biggest bottleneck. Depending on your choice of vendor, scaling a single database to higher capacity can range from very expensive to outright impossible. Linden’s experience with uber-popular MySQL is illustrative: we used it for storing all manner of structured data, ultimately totaling hundreds of tables and billions of rows, and we ran into a variety of limitations which were not expected.”
· MySQL-specific issues:
o Lacks online ALTER TABLE, so schema changes on large tables lock the table (effectively downtime)
o Write-heavy workloads can cause heavy CPU spikes due to internal lock contention
o Lack of effective per-user governors means a single application can bring the system to its knees (a simple application-level throttle is sketched after these notes)
· Interchangeable parts: “A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It’s not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they’ll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more heavily interchangeable the parts are, the more quickly the team can respond to failures or new demands.”
· Instrument, propagate, isolate errors:
o It is important not to overlook transient errors in favor of large-scale failures; keeping good data about errors and dealing with them in an organized way is essential to managing system reliability.
o Second Life has a large number of highly asynchronous, heavily interdependent back-end systems. Unfortunately, under the right load conditions localized hotspots could develop, where individual nodes would fall behind and eventually begin silently dropping requests, leading to lost data (a minimal drop-counter sketch also follows these notes).
· Batch jobs, the silent killer: batch jobs bring two challenges: 1) sudden workload spikes and 2) the risk of not completing within the batch window (a chunk-and-pace sketch follows these notes as well).
· Keep alerts under control: “I can’t count the number of system operations people I’ve talked to (usually in job interviews as they sought a new position) who, at a growing firm, suffered from catastrophic over-paging.”
· Beware of the “grand rewrite”
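To make the scale-estimation bullet concrete, here is a quick back-of-envelope sketch of that arithmetic in Python. The population, penetration, and workday figures are my own assumptions (roughly 750M people across the US and EU, 10% penetration, an 8-hour workday), not numbers from the article, but they land in the same ballpark as the estimates above:

# Back-of-envelope capacity estimates. All inputs are assumptions, not figures from the article.
POPULATION = 750_000_000          # rough combined US + EU population (assumption)
PENETRATION = 0.10                # 10% of the population are users
users = POPULATION * PENETRATION

# Billing: each user generates one billing event per year.
SECONDS_PER_YEAR = 365 * 24 * 3600
print(f"billing: {users / SECONDS_PER_YEAR:.1f} events/sec")                     # ~2.4 events/sec

# Chat: each user sends 10 messages/day, concentrated in an 8-hour workday.
MESSAGES_PER_DAY = 10
WORKDAY_SECONDS = 8 * 3600
print(f"chat: {users * MESSAGES_PER_DAY / WORKDAY_SECONDS:,.0f} messages/sec")   # ~26,000 messages/sec

The interesting part isn’t the exact numbers; it’s the four-orders-of-magnitude gap between the two systems, which is why they need very different designs.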
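On the per-user governor point: if the database can’t protect itself from one badly behaved application, the protection has to live in the application tier. What follows is not Linden’s implementation, just a minimal sketch of one common approach, a token-bucket limiter in front of the database client; the application names and limits are invented for illustration:

import time

class TokenBucket:
    """Per-application token bucket: allow at most `rate` queries/sec with
    bursts up to `burst`. Illustrative only; not from the article."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per calling application, so a single runaway caller sheds its own
# load instead of dragging the shared database down for everyone.
limits = {"billing": TokenBucket(rate=50, burst=100),
          "chat":    TokenBucket(rate=500, burst=1000)}

def run_query(app: str, do_query):
    if not limits[app].allow():
        raise RuntimeError(f"{app} is over its database budget; backing off")
    return do_query()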
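The “silently dropping requests” failure mode is worth a sketch too. A minimal version of “don’t drop work silently” is to count and log every drop so a localized hotspot shows up on a graph before it shows up as lost data. The queue-depth threshold and metric names below are made up; the article doesn’t describe Linden’s pipeline at this level:

import logging
from collections import Counter

log = logging.getLogger("backend")
metrics = Counter()          # stand-in for a real metrics client (assumption)
MAX_QUEUE_DEPTH = 10_000     # illustrative threshold, not from the article

def enqueue(queue, request):
    """Accept work if there is room; otherwise shed load *visibly*."""
    if len(queue) >= MAX_QUEUE_DEPTH:
        # The failure still happens, but it is now counted and logged
        # instead of disappearing silently.
        metrics["requests_dropped"] += 1
        log.warning("queue full, dropping request: %r", request)
        return False
    queue.append(request)
    metrics["requests_enqueued"] += 1
    return True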
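And for the batch-job bullet, the usual defense is to break the job into small chunks and pace them, so the job presents a steady load instead of a spike and fails loudly if it is going to miss its window. Again, this is a sketch under my own assumptions (the chunk size, rate, and function names are hypothetical), not how Linden ran its batches:

import time

def run_batch(ids, process_chunk, chunk_size=500, max_chunks_per_sec=10, deadline=None):
    """Process `ids` in fixed-size chunks, pacing the work so the batch presents
    a steady load, and bail out loudly if the batch window (`deadline`, a
    time.monotonic() value) will be missed."""
    min_interval = 1.0 / max_chunks_per_sec
    for start in range(0, len(ids), chunk_size):
        began = time.monotonic()
        if deadline is not None and began > deadline:
            raise RuntimeError(f"batch window exceeded with {len(ids) - start} items left")
        process_chunk(ids[start:start + chunk_size])
        # Sleep off any remaining time so we never exceed the target chunk rate.
        elapsed = time.monotonic() - began
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)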
If you are interested in reading more from Ian on Second Life, see: Interview with Ian Wilkes From Linden Lab.
More from the Scaling-X series:
· Scaling Second Life: http://perspectives.mvdirona.com/2010/02/07/ScalingSecondLife.aspx
· Scaling Google: http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx
· Scaling LinkedIn: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx
· Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
· Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html
· Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
· Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html
· Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html
· Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html
· Scaling Myspace: http://perspectives.mvdirona.com/2008/12/27/MySpaceArchitectureAndNet.aspx
· Scaling Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more
A very comprehensive list from Royans: Scaling Web Architectures
Some time back, for USENIX LISA, I brought together a set of best practices for high-scale services:
· Designing and Deploying Internet-Scale Services
If you come across other scaling war stories, send them my way: jrh@mvdirona.com.
–jrh
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Exactly Greg. The Friendster story has been one of my favorites for (at least) three reasons: 1) SANs don’t belong in high-scale services, 2) full rewrites seldom achieve their goals, and, my favorite, 3) never promise anything you don’t know how to do with reasonable performance; some cool features can kill you.
Thanks for the comment Greg.
jrh@mvdirona.com
Thanks Royans. I added a reference to your list to the blog entry. Thanks,
jrh@mvdirona.com
I particularly like this nugget in the article talking about how Friendster succumbed to the common temptation to do a total rewrite, which ended up being a significant factor in the decline of the company:
"The VP of Ops drew up a bold plan for a complete re-imagining of the ops environment, that had us running on 64-bit Opterons (brand-new at the time), Force-10 networking, and a SAN for storage. On the other end, Engineering began a complete rewrite of the software. The changeover of the ops environment produced a number of unexpected surprises… Too much changed all at once. A process that was supposed to take 3 months took 8. And the whole time, the traffic kept growing! We were chasing a moving target."
In the meantime, Friendster became nearly unusable for large chunks of the day. Eventually, the company did finish its new system, and was able to return to functionality, but the months of load problems did enormous damage to the site’s reputation and surely contributed to the defection of users in many markets to MySpace."
A similar thing happened at Netscape too.
Hi James,
Here is my list of slides/talks on scalability [ http://www.royans.net/arch/library/ ]
Added the ones you mentioned to my list as well.
Royans
That’s true Indigo. When a company has a single product, that brand is often used to refer to the company.
jrh@mvdirona.com
James,
Second Life is a brand name. The name of the company where Ian Wilkes worked is Linden Lab.