In Designing and Deploying Internet Scale Services I’ve argued that all services should expect to be overloaded and all services should expect mass failures. Very few do and I see related down-time in the news every month or so.
The Windows Genuine Advantage failure (WGA Meltdown…) from a year ago is a good example in that the degraded operations modes possible for that service are unusually simple and the problem and causes were well documented. The obvious degraded operations model for WGA is to allow users to continue as “WGA Authorized” when the service isn’t healthy enough to fully check their O/S authenticity. In the case of WGA, this is in fact the intended behavior and the service is designed to do exactly this. This should have worked, but services rarely have the good sense to fail outright. They normally just run very, very slowly or otherwise misbehave.
The actual cause of the WGA issues is presented in detail here: So What Happened?. This excellent post even includes some of the degraded operation modes that the WGA team has implemented. This is close to the right answer. However, the implemented approach has two problems: 1) it doesn’t detect unacceptable rises in latency or failure rate via deep monitoring and automatically fall back to degraded mode, and 2) it doesn’t allow the service to be repaired and retested in production selectively with different numbers of users (slow restart). It’s either on or off in this design. A better model is one where 100% of the load can be directed to a backup service that just says “yes”. Then the real service that actually does the full check can be brought back live incrementally by switching more and more load from the “yes” service to the real, deep check service. Here again, deep real-time monitoring is needed to measure whether the service is performing properly. Implementing and production testing a degraded operation mode is hard, but I’ve never talked to a service team that had invested in this work and later regretted it.
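The “yes” service with incremental ramp-up described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how WGA is implemented: the class name `DegradedModeRouter`, the `real_check` callable, the thresholds, and the window size are all hypothetical.

```python
import random
import time
from collections import deque


class DegradedModeRouter:
    """Sends a fraction of requests to the real check; the rest get an automatic "yes".

    All names and default thresholds here are illustrative assumptions.
    """

    def __init__(self, real_check, max_latency_s=0.5, max_bad_rate=0.05,
                 window=100, rng=None, clock=time.monotonic):
        self.real_check = real_check        # hypothetical deep-check callable
        self.max_latency_s = max_latency_s  # slower than this counts as unhealthy
        self.max_bad_rate = max_bad_rate    # tolerated share of slow or failed calls
        self.real_fraction = 1.0            # share of load sent to the real service
        self.samples = deque(maxlen=window) # recent (latency_s, ok) observations
        self.rng = rng or random.Random()
        self.clock = clock

    def set_real_fraction(self, fraction):
        """Operator control for slow restart: 0.0 is all "yes", 1.0 is all real."""
        self.real_fraction = min(1.0, max(0.0, fraction))

    def handle(self, request):
        if self.rng.random() >= self.real_fraction:
            return True  # degraded path: the "yes" service
        start = self.clock()
        try:
            result = self.real_check(request)
            ok = True
        except Exception:
            result = True  # fail open, as the degraded mode intends
            ok = False
        self.samples.append((self.clock() - start, ok))
        self._maybe_fall_back()
        return result

    def _maybe_fall_back(self):
        # Only judge health once a full window of observations is available.
        if len(self.samples) < self.samples.maxlen:
            return
        bad = sum(1 for latency, ok in self.samples
                  if not ok or latency > self.max_latency_s)
        if bad / len(self.samples) > self.max_bad_rate:
            self.real_fraction = 0.0  # automatic fallback to the "yes" service
            self.samples.clear()
```

After a fallback, an operator would repair the service and then call `set_real_fraction(0.01)`, then `0.1`, and so on. If the real service is still unhealthy at any step, the monitor trips again once a full window of samples has accumulated.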
15 years ago I worked on a language compiler which, amongst others, targeted a Navy fire control system. This embedded system had a large red switch tagged as “Battle Ready”. This switch would disable all emergency shutdowns and put the server into a mode where it would continue to run even when the room was on fire or water was beginning to rise up the base of the computer. In this state, the computer runs until it dies. In the services world, this isn’t exactly what we’re after but it’s closely related. We want every system to be able to drop back to a degraded operation mode that will allow it to continue to provide at least a subset of service even when under extreme load or suffering from cascading sub-system failures. We need to design and, most importantly, we need to test these degraded modes of operation in at least limited production or they won’t work when we really need them. Unfortunately, all but the least successful services will need these degraded operations modes at least once.
Degraded operation modes are service specific and, for many services, the initial gut reaction is that everything is mission critical and there exist no meaningful degraded modes. But they are always there if you take the question seriously and look hard. The first level is to stop all batch processing and periodic jobs. Almost all services have some batch jobs that are not time critical. Run them later. That one is fairly easy, but most are hard to come up with. It’s hard to produce a lower quality customer experience that is still useful, but I’ve yet to find an example where none was available. As an example, consider Exchange Hosted Services. In that service, the mail must get through. What is the degraded operation mode? Degraded modes can be found even in mission critical applications such as EHS. Here are some examples: turn up the aggressiveness of edge blocks, defer processing of mail classified as Spam until later, process mail from users of the service ahead of non-known users, prioritize premium customers ahead of others. There actually are quite a few options. The important point is to think through what they should be ahead of time and ensure they are developed and tested prior to Operations needing them in the middle of the night.
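One way to make the options above concrete is a single degradation level that sheds the least important class of work first. The sketch below is an assumption-laden illustration, not a description of how EHS works: the tier names, their ordering, and the `DegradationPolicy` class are hypothetical and only loosely follow the examples in the paragraph above.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Illustrative tiers only; lower value means more important."""
    PREMIUM_CUSTOMER = 0
    KNOWN_USER = 1
    UNKNOWN_SENDER = 2
    SUSPECTED_SPAM = 3
    BATCH_JOB = 4


class DegradationPolicy:
    """Each degradation level defers one more tier, starting with batch jobs."""

    def __init__(self):
        self.level = 0      # 0 means normal operation, nothing is deferred
        self.deferred = []  # (priority, item) pairs held back for later

    def set_level(self, level):
        # Clamp so that premium traffic is always admitted.
        self.level = min(int(max(Priority)), max(0, level))

    def admit(self, item, priority):
        """Return True to process now; otherwise queue the item for later."""
        cutoff = int(max(Priority)) - self.level
        if priority <= cutoff:
            return True
        self.deferred.append((priority, item))
        return False

    def drain_deferred(self):
        """Hand back deferred work, most important first, once load allows."""
        ordered = sorted(self.deferred, key=lambda pair: pair[0])
        self.deferred = []
        return [item for _, item in ordered]
```

At level 1 only batch jobs wait; at level 2 suspected spam waits as well, and so on. The value of a single dial like this is that Operations can exercise it in production ahead of time rather than inventing it in the middle of the night.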
Some time back Skype had a closely related problem where the entire service went down or mostly down for more than a day. What they report happened was that Windows Update forced many reboots and this led to a flood of Skype login requests as the clients were coming back up and “that when combined with lack of peer to peer resources had a critical impact” (What Happened on August 16th?). There are at least two interesting factors here, one generic to all services and one Skype specific. Generically, it’s very common for login operations to be MUCH more expensive than steady state operation, so all services need to engineer for login storms after service interruption. The WinLive Messenger team has given this considerable thought and has considerable experience with this issue. They know there needs to be an easy way to throttle login requests such that you can control the rate at which they are accepted (a fine grained admission control for login). All services need this or something like it, but it’s surprising how few have actually implemented this protection and tested it to ensure it works in production. The Skype-specific situation is not widely documented but is hinted at by the “lack of peer to peer resources” note in the quote above. In Skype’s implementation, the lack of an available supernode will cause the client to report login failure (this is documented in An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol which was sent to me by Sharma Kunapalli of IW Services Marketing team). This means that nodes can’t login unless they can find a supernode. This has a nasty side effect in that the fewer clients that can successfully login, the more likely it is that other clients won’t successfully find a supernode, since a supernode is just a well connected client. If they can’t find a supernode, they won’t be able to login either.
Basically, the entire network is unstable due to the dependence on finding a supernode to successfully log a client into the network. For Skype, a great “degraded operation” mode would be to allow login even when a supernode can’t be found. Let the client get on and perhaps establish peer connectivity later.
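The fine grained admission control for login mentioned above is often built as a token bucket whose rate can be changed at runtime. The following is a minimal sketch under that assumption; it is not the Messenger or Skype implementation, and the `LoginThrottle` name and its parameters are hypothetical.

```python
import time


class LoginThrottle:
    """Token bucket: admits at most `rate_per_s` logins per second on average,
    with short bursts of up to `burst` logins. Names are illustrative."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock          # injectable so the throttle can be tested
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.rate_per_s
        self.tokens = min(self.burst, self.tokens + earned)
        self.last = now

    def set_rate(self, rate_per_s):
        """Operator control: turn the login rate up or down during a storm."""
        self._refill()  # credit tokens earned at the old rate first
        self.rate_per_s = rate_per_s

    def try_admit(self):
        """Return True if this login may proceed now, False if it must retry."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A front end would call `try_admit()` before doing any expensive login work and tell rejected clients to retry later, ideally with randomized backoff so the retries don’t arrive as a second storm. After an outage, Operations can start with a low rate and raise it with `set_rate()` as the back end proves healthy.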
Why wait for failure and the next post-mortem to design in AND production test degraded operations for your services?