Monitoring at Scale

Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn about a problem your service is experiencing from a customer. How could they possibly know first when there is a service outage or issue? And yet it happens frequently. The reason is that most sites don’t have anything close to an adequate level of instrumentation. Without this instrumentation, you are flying blind.

Systems monitoring data can be used to drive alerts, compute SLAs, drive capacity planning, find latency problems, and understand customer access patterns. Some sites also use it to drive billing, although the latter is probably a mistake.

In the rare cases where I’ve come across high-quality monitoring systems that actually do fine-grained data collection, the data is often underutilized or not looked at at all. It turns out that fully using and exploiting very large amounts of monitoring data isn’t much easier than collecting it.

Returning to the challenge of efficiently collecting fine-grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday by making Scribe available as an open source project: Facebook’s Scribe technology now open source. Scribe is used at Facebook to monitor their more than 10,000 servers across multiple data centers. Scribe is a SourceForge project at: http://sourceforge.net/projects/scribeserver/.
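
For those who haven’t looked at Scribe, the client interface is a simple Thrift service: you send categorized log entries to a local Scribe daemon, which handles buffering and forwarding up the aggregation tree. Here’s a minimal Python sketch of what a client looks like, assuming bindings generated by the Thrift compiler from scribe.thrift and the default Scribe port of 1463; the category and message shown are made up for illustration:

    # Minimal Scribe client sketch (assumes Python classes generated by the
    # Thrift compiler from scribe.thrift; host, port, and category are illustrative).
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe   # Thrift-generated module

    socket = TSocket.TSocket(host="localhost", port=1463)   # local scribed
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(trans=transport,
                                               strictRead=False,
                                               strictWrite=False)
    client = scribe.Client(iprot=protocol, oprot=protocol)

    transport.open()
    entry = scribe.LogEntry(category="web_latency",
                            message="frontend14 GET /home 200 87ms")
    result = client.Log(messages=[entry])   # ResultCode.OK or TRY_LATER
    transport.close()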

Facebook continues to develop interesting and broadly useful software and often contributes it to the community as open source. For example, Facebook Releases Cassandra as Open Source.

Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important:

Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).
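
As an aside, both of those metrics are easy to compute if you keep alert and ticket history. A quick sketch of the arithmetic follows; the record shapes and field names here are hypothetical, not from any particular alerting or ticketing system:

    # Sketch: compute the two alert-quality metrics described above.
    # 'alerts', 'tickets', and 'issues' are hypothetical records with
    # illustrative field names.

    def alerts_to_ticket_ratio(alerts, tickets):
        """Alerts-to-trouble-ticket ratio; the goal is a value near 1.0."""
        ticketed = {t["alert_id"] for t in tickets if t.get("alert_id") is not None}
        if not ticketed:
            return float("inf")   # alerts fire but none lead to real work
        return len(alerts) / len(ticketed)

    def unalerted_health_issues(issues, alerts):
        """Count health issues with no corresponding alert; the goal is near zero."""
        alerted = {a["issue_id"] for a in alerts if a.get("issue_id") is not None}
        return sum(1 for issue in issues if issue["id"] not in alerted)

    # Example: 120 alerts last week tied to 30 distinct tickets gives a ratio of 4.0,
    # i.e. three out of four alerts were noise that teaches operators to ignore alerting.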

· Instrument everything. Measure every customer interaction or transaction that flows through the system and report anomalies (the sketches after this list show one way to do this). There is a place for “runners” (synthetic workloads that simulate user interactions with a service in production), but they aren’t close to sufficient. Using runners alone, we’ve seen it take days to even notice a serious problem, since the standard runner workload continued to be processed well, and then days more to know why.

· Data is the most valuable asset. If the normal operating behavior isn’t well-understood, it’s hard to respond to what isn’t. Lots of data on what is happening in the system needs to be gathered to know it really is working well. Many services have gone through catastrophic failures and only learned of the failure when the phones started ringing.

· Have a customer view of service. Perform end-to-end testing. Runners are not enough, but they are needed to ensure the service is fully working. Make sure complex and important paths such as logging in a new user are tested by the runners (a toy login runner is sketched after this list). Avoid false positives. If a runner failure isn’t considered important, change the test to one that is. Again, once people become accustomed to ignoring data, breakages won’t get immediate attention.

· Instrumentation required for production testing. In order to safely test in production, complete monitoring and alerting is needed. If a component is failing, it needs to be detected quickly.

· Latencies are the toughest problem. Examples are slow I/O and components that aren’t quite failing but are processing slowly. These are hard to find, so instrument carefully to ensure they are detected (the transaction-monitor sketch after this list watches tail latency for exactly this case).

· Have sufficient production data. In order to find problems, data has to be available. Build fine-grained monitoring in early or it becomes expensive to retrofit later. The most important data that we’ve relied upon includes: …
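
To make the “instrument everything” and latency points above concrete, here is a minimal sketch of per-request instrumentation that flags the slow-but-not-failing case by watching recent tail latency. The class, window size, and thresholds are illustrative assumptions, not recommendations from the paper:

    import time
    from collections import deque

    # Sketch: wrap every customer-facing operation, record its latency, and flag
    # anomalies -- including the "not quite failing, just slow" case -- by watching
    # the recent tail latency. Window size and budget are illustrative only.

    class TransactionMonitor:
        def __init__(self, name, window=1000, p99_budget_ms=250.0):
            self.name = name
            self.latencies_ms = deque(maxlen=window)   # rolling window of recent latencies
            self.errors = 0
            self.count = 0
            self.p99_budget_ms = p99_budget_ms

        def record(self, latency_ms, ok):
            self.count += 1
            self.latencies_ms.append(latency_ms)
            if not ok:
                self.errors += 1
            p99 = self._percentile(99)
            if p99 is not None and p99 > self.p99_budget_ms:
                # The service still "works", but the tail is slow -- exactly the
                # kind of latency problem that is hard to spot without instrumentation.
                print(f"ANOMALY {self.name}: p99={p99:.0f}ms exceeds "
                      f"{self.p99_budget_ms:.0f}ms budget")

        def _percentile(self, pct):
            if not self.latencies_ms:
                return None
            ordered = sorted(self.latencies_ms)
            idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
            return ordered[idx]

    monitor = TransactionMonitor("login")

    def timed(monitor, fn, *args):
        """Measure one customer interaction and feed the result to the monitor."""
        start = time.monotonic()
        ok = True
        try:
            return fn(*args)
        except Exception:
            ok = False
            raise
        finally:
            monitor.record((time.monotonic() - start) * 1000.0, ok)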
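
And on the runner point: a toy end-to-end runner that exercises the login path from outside the service and alerts on failure or slowness. The endpoint, credentials, and thresholds are assumptions for illustration; nothing here comes from the paper or from Scribe:

    import time
    import urllib.request
    import urllib.error

    # Sketch: a "runner" that probes an important end-to-end path (logging in a
    # synthetic test user) and alerts on failure or slowness.
    LOGIN_URL = "https://example.com/api/login"      # hypothetical endpoint
    LATENCY_BUDGET_S = 2.0

    def page_operator(message):
        # Stand-in for a real alerting hook (pager, ticket system, etc.).
        print("ALERT:", message)

    def run_login_check():
        start = time.monotonic()
        try:
            req = urllib.request.Request(
                LOGIN_URL,
                data=b'{"user": "synthetic-test-user", "password": "..."}',
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=10) as resp:
                ok = resp.status == 200
        except (urllib.error.URLError, OSError):
            ok = False
        elapsed = time.monotonic() - start
        if not ok:
            page_operator(f"login runner FAILED after {elapsed:.1f}s")
        elif elapsed > LATENCY_BUDGET_S:
            page_operator(f"login runner slow: {elapsed:.1f}s > {LATENCY_BUDGET_S}s budget")

    if __name__ == "__main__":
        while True:
            run_login_check()
            time.sleep(60)   # probe once a minute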

Thanks to Sriram Krishnan for pointing me to the release of Scribe.

–jrh

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
