James Hamilton's Blog
 Tuesday, January 22, 2008

In Designing and Deploying Internet Scale Services (http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf) I’ve argued that all services should expect to be overloaded and all services should expect to have to manage mass failures.

 

Degraded operations mode is a means of dealing with the excess load that will happen at some point in the life of your service. Sooner or later, you’ll get an unpredicted number of new customers or more concurrent users, or you’ll have part of the server fleet down and get hit with unexpected load. Sooner or later you’ll have more customer requests than you have resources to satisfy. When this occurs, many services just run slower and slower and eventually start failing with timeouts. Basically, every user in the system gets a very bad experience. A more serious example is a login storm. For most services, steady-state service is much less resource intensive than user login. So, in the event of a global or broad service failure, millions of users will arrive back at once attempting to log in. The service will fail again under the load and the cycle repeats. It’s not a good place to be. A more drastic approach to avoiding this problem is admission control: only allow users into the service when you have resources left to serve them. Essentially, you give a few customers a bad experience by not letting them onto the service in order to avoid giving all customers a bad experience.
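To make the admission control idea concrete, here is a minimal sketch in Python. It assumes capacity can be expressed as a fixed number of concurrent sessions; the class name, the `max_sessions` parameter, and the method names are all hypothetical, and a real service would derive its limit from measured resource headroom rather than a constant.

```python
class AdmissionController:
    """Admit new sessions only while capacity remains (illustrative sketch)."""

    def __init__(self, max_sessions):
        # Assumed capacity limit, chosen by the operator for this example.
        self.max_sessions = max_sessions
        self.active = set()

    def try_admit(self, user_id):
        """Return True if the user is admitted, False if turned away."""
        if user_id in self.active:
            return True  # already on the service, nothing more to reserve
        if len(self.active) >= self.max_sessions:
            # Turn this user away up front rather than slow everyone down.
            return False
        self.active.add(user_id)
        return True

    def release(self, user_id):
        """Free the slot when a session ends."""
        self.active.discard(user_id)
```

With `max_sessions=2`, the third distinct user is refused until one of the first two is released. The users already admitted keep getting normal service, which is the whole point of the trade.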

 

There is much that can be done between the first option, service failure under high load, and the other end of the spectrum, admission control. I call this middle ground degraded operations mode. In the limit, all services need to have admission control to avoid complete and repeating service failure under extreme loads, but you hope that admission control is never used. Degraded operations mode allows a service to continue to take on new load after it reaches capacity by shedding unnecessary tasks. Most services have batch jobs that need to be done but that no customer is actually waiting on: for example, reporting, backup, index creation, system maintenance, and copying data to warehouse servers. In most services a substantial amount of this work can be deferred without negatively impacting the service. Clearly these operations need to be run eventually, and how long each can be delayed is task and service specific. Temporarily shedding these batch jobs allows more customers to be served. The next level of degraded operations mode is to restrict the quality of service in some way. If some operations are far more expensive than others, you may only allow users to access a subset of the full service functionality. For example, you may allow transactions but not reporting, if that makes sense for your service. Finding these degraded modes of operation is difficult and very application specific, but they are always there and it’s always worth finding them. There WILL be a time when you have more users than resources.
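The levels described above can be sketched as a small mode table. This is only an illustration: the mode names, the utilization thresholds, and the classes of work are assumptions made up for the example, and the right values are service specific.

```python
from enum import IntEnum


class Mode(IntEnum):
    NORMAL = 0       # everything runs
    SHED_BATCH = 1   # batch jobs are deferred
    REDUCED = 2      # expensive features are also turned off
    ADMISSION = 3    # new users are also turned away


# Hypothetical thresholds on utilization (fraction of capacity in use),
# checked from the most severe mode down.
THRESHOLDS = [
    (0.95, Mode.ADMISSION),
    (0.85, Mode.REDUCED),
    (0.70, Mode.SHED_BATCH),
]

# The most severe mode in which each class of work is still allowed.
ALLOWED_UP_TO = {
    "batch": Mode.NORMAL,
    "reporting": Mode.SHED_BATCH,
    "new_login": Mode.REDUCED,
    "transaction": Mode.ADMISSION,
}


def mode_for(utilization):
    """Pick the operating mode for the current utilization."""
    for threshold, mode in THRESHOLDS:
        if utilization >= threshold:
            return mode
    return Mode.NORMAL


def is_allowed(work_class, mode):
    """True if this class of work may run in the given mode."""
    return mode <= ALLOWED_UP_TO[work_class]
```

Here `mode_for(0.9)` returns `Mode.REDUCED`, in which transactions and new logins still run but batch jobs and reporting do not. Keeping the policy in a table like this makes each level something you can switch on deliberately and test ahead of time.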

 

Fifteen years ago I worked on an Ada language compiler, and one of the target hardware platforms for this compiler was a Navy fire control system. This embedded system had a large red switch tagged as “Battle Ready Mode”. This switch would disable all automatic shutdowns and put the server into a mode where it would continue to run when the room was on fire or water was beginning to rise up the base of the computer. In this mode, it runs until it dies. In the services world, this isn’t exactly what we’re after but it’s closely related. We want all systems to be able to drop back to a degraded operations mode that will allow them to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures. We need to design and, most important, we need to test these degraded modes of operation in at least limited production or they won’t work when we really need them. Unfortunately, all services but the very least successful will need these degraded operations modes at least once.

 

Degraded operation modes are service specific and, for many services, the initial developer gut reaction is that everything is mission critical and that no meaningful degraded modes exist for their specific service. But they are always there if you take it seriously and look hard. The first level is to stop all batch processing and periodic jobs. Almost all services have some batch jobs that are not time critical. Run them later. That one is fairly easy, but most are hard to come up with. It’s hard to produce a lower quality customer experience that is still useful, but I’ve yet to find an example where none was available. As an example, consider Exchange Hosted Services (an email anti-malware and archiving service). In that service, the mail must get delivered. What is the degraded operation mode? Degraded modes can be found there as well. Here are some examples: turn up the aggressiveness of email edge blocks, defer processing of mail classified as spam until later, process mail from users of the service ahead of mail from unknown users, and prioritize platinum customers ahead of others. There actually are quite a few options. The important point is to think through what they are and ensure they are developed and tested before the operations team needs them in the middle of the night.
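Two of the mail examples above, prioritizing some classes of mail and deferring spam, can be sketched with a priority queue. This is not how Exchange Hosted Services is implemented; the class names, the priority ordering, and the `degraded` flag are assumptions chosen only to illustrate the idea.

```python
import heapq
import itertools

# Lower number = processed sooner. The ordering is assumed for illustration.
PRIORITY = {"platinum": 0, "known_user": 1, "unknown_sender": 2, "spam": 3}


class MailQueue:
    def __init__(self):
        self._heap = []
        # The sequence number keeps first-in, first-out order within a class.
        self._seq = itertools.count()

    def put(self, message_id, mail_class):
        entry = (PRIORITY[mail_class], next(self._seq), message_id)
        heapq.heappush(self._heap, entry)

    def get(self, degraded=False):
        """Return the next message id, or None if nothing should be processed.

        In degraded mode, mail classified as spam stays queued for later.
        """
        if not self._heap:
            return None
        if degraded and self._heap[0][0] == PRIORITY["spam"]:
            # Spam has the lowest priority, so only deferred spam remains.
            return None
        return heapq.heappop(self._heap)[2]
```

While `degraded=True`, the queue drains platinum mail first, then known users, then unknown senders, and leaves spam in place. Once load drops, calling `get()` without the flag works through the deferred spam.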

 

A few months back, Skype had a problem where the entire service went down or mostly down for more than a day. What they report happened was that Windows Update forced many reboots and this led to a flood of Skype login requests “that when combined with lack of peer to peer resources had a critical impact” (http://heartbeat.skype.com/2007/08/what_happened_on_august_16.html). There are at least two interesting factors here, one generic to all services and one Skype specific. Generically, it’s very common for login operations to be MUCH more expensive than steady-state operation, so all services need to engineer for login storms after service interruption. The Windows Live Messenger team has given this considerable thought and has considerable experience with this issue. They know there needs to be an easy way to throttle login requests such that you can control the rate at which they are accepted (a fine-grained admission control for login). All services need this or something like it, but it’s surprising how few have actually implemented this protection and tested it to ensure it works in production. The Skype-specific situation is not widely documented but is hinted at by the “lack of peer to peer resources” note in the quote above. In Skype’s implementation, the lack of an available supernode will cause the client to report login failure (http://www1.cs.columbia.edu/~salman/publications/skype1_4.pdf, sent to me by Sharma Kunapalli). This means that nodes can’t log in unless they can find a supernode. This has a nasty side effect: the fewer clients that can successfully log in, the more likely it is that other clients won’t find a supernode. If they can’t find a supernode, they won’t be able to log in either. Basically, the entire network can become unstable due to the dependence on finding a supernode to successfully log a client into the network.
For Skype, a great “degraded operation” mode would be to allow login even when a supernode can’t be found. Let the client get on and perhaps establish peer connectivity later.
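The login throttle mentioned above, a way to control the rate at which logins are accepted, is commonly built as a token bucket. The sketch below is one possible shape, not a description of how Messenger or Skype does it; the class name, the `rate` and `burst` parameters, and the injectable `clock` are assumptions for the example.

```python
import time


class LoginThrottle:
    """Token bucket: accepts about `rate` logins per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the throttle can be tested
        self.tokens = float(burst)
        self.last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last call.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def set_rate(self, rate):
        """Let operations turn the login rate up or down at runtime."""
        self._refill()  # settle tokens earned at the old rate first
        self.rate = rate

    def allow_login(self):
        """Return True if this login may proceed now, False if it should retry later."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

After a broad outage, operations could start with a low `rate` and raise it with `set_rate` as the service recovers, so that returning users are let back in at a pace the login path can sustain. Clients that are refused should retry with some delay rather than immediately.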

 

Why wait for failure and the next post-mortem to design in and production test degraded operations for your services?  Make it part of your next release.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh
