As many of you know I collect high-scale scaling war stories. I’ve appended many of them below. Last week Ars Technica published a detailed article on Scaling Second Life: What Second Life can Teach your Datacenter About Scaling Web Apps. This article by Ian Wilkes who worked at Second Life from 2001 to 2009 where he was director of operations. My rough notes follow:
· Understand scale required:
o Billing system serving US and EU where each user interacts annually and the system has 10% penetration: 2 to 3 events/second
o Chat system serving UE and EU where each user sends 10 message/day during workday: 20k messages/second
· Does the system have to be available 24x7 and understand the impact of downtime (beware of over-investing in less important dimensions at the expense of those more important)
· Understand the resource impact of features. Be especially cautious around relational database systems and object relational mapping frameworks. If nobody knows the resource requirements, expect trouble in the near future.
· Database pain: “Almost all online systems use an SQL-based RDBMS, or something very much like one, to store some or all of their data, and this is often the first and biggest bottleneck. Depending on your choice of vendor, scaling a single database to higher capacity can range from very expensive to outright impossible. Linden's experience with uber-popular MySQL is illustrative: we used it for storing all manner of structured data, ultimately totaling hundreds of tables and billions of rows, and we ran into a variety of limitations which were not expected.”
· MySQL specific issues:
o Lacks alter table statement
o Write heavy workload can run heavy CPU spikes due to internal lock conflicts
o Lack of effective per-user governors means a single application can bring the system to its knees
· Interchangeable parts :” A common behavior of small teams on a tight budget is to tightly fit the building blocks of their system to the task at hand. It's not uncommon to use different hardware configurations for the webservers, load balancers (more bandwidth), batch jobs (more memory), databases (more of everything), development machines (cheaper hardware), and so on. If more batch machines are suddenly needed, they'll probably have to be purchased new, which takes time. Keeping lots of extra hardware on site for a large number of machine configurations becomes very expensive very quickly. This is fine for a small system with fixed needs, but the needs of a growing system will change unpredictably. When a system is changing, the more heavily interchangeable the parts are, the more quickly the team can respond to failures or new demands.”
· Instrument, propagate, isolate errors:
o It is important not to overlook transient, temporary errors in favor of large-scale failures; keeping good data about errors and dealing with them in an organized way is essential to managing system reliability.
o Second Life has a large number of highly asynchronous back-end systems, which are heavily interdependent. Unfortunately, it had the property that under the right load conditions, localized hotspots could develop, where individual nodes could fall behind and eventually begin silently dropping requests, leading to lost data.
· Batch jobs, the silent killer: Batch jobs bring two challenges: 1) sudden workload spikes and 2) inability to complete the job within the batch window.
· Keep alerts under control: “I can't count the number of system operations people I've talked to (usually in job interviews as they sought a new position) who, at a growing firm, suffered from catastrophic over-paging.”
· Beware of the “grand rewrite”
If you are interested in reading more from Ian at Second Life: Interview with Ian Wilkes From Linden Lab.
More from the Scaling-X series:
· Scaling Second Life: http://perspectives.mvdirona.com/2010/02/07/ScalingSecondLife.aspx
· Scaling Google: http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx
· Scaling LinkedIn: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx
· Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
· Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html
· Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
· Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html
· Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html
· Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html
· Scaling Myspace: http://perspectives.mvdirona.com/2008/12/27/MySpaceArchitectureAndNet.aspx
· Scaling Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more
A very comprehensive list from Royans: Scaling Web Architectures
Some time back for USENIX LISA, I brought together a set of high-scale services best practices:
· Designing and Deploying Internet-Scale Services
If you come across other scaling war stories, send them my way: jrh@mvdirona.com.
--jrh
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.