Last week I attended and presented at the USENIX LISA conference (http://www.usenix.org/event/lisa07/). I presented Designing and Deploying Internet-Scale Applications; the slides are here: PowerPoint slides.
I particularly enjoyed Andrew Hume's (AT&T) talk on the storage sub-systems used at AT&T Research, the data error rates he has seen over the last several decades, and what he does about them. His experience exactly parallels mine, with more solid evidence, and can be summarized as: all layers in the storage hierarchy produce errors. The only way to store data for the long haul is redundancy coupled with end-to-end error detection.

I enjoyed the presentations of Shane Knapp and Avleen Vig of Google in that they provided a small window into how Google takes care of its ~10^6 servers with a team of 30 or 40 hardware engineers world-wide, the software tools they use to manage the world's biggest fleet, and the release processes used to manage those tools.

Guido Trotter, also of Google, talked about how Google IT (not the production systems) uses Xen and DRBD (http://www.drbd.org/download.html) to build highly reliable IT systems. DRBD does asynchronous, block-level replication between a primary and a secondary. The workload runs in a Xen virtual machine and, on failure, is restarted on the secondary.

Ken Brill, Executive Director of the Uptime Institute, made a presentation focused on power being the problem. Ignore floor-space cost; system density is not the issue, it's a power problem. He's right, and it's becoming clearer each year.
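The DRBD setup Trotter described can be sketched as a minimal resource configuration. The hostnames, devices, and addresses below are illustrative placeholders, not details from the talk; `protocol A` is DRBD's asynchronous replication mode, matching the primary/secondary arrangement he described.

```
# Minimal DRBD resource definition (hostnames, disks, and IPs are
# hypothetical, chosen only for illustration).
resource r0 {
  protocol A;                  # asynchronous replication
  on primary-host {
    device    /dev/drbd0;
    disk      /dev/sda1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on secondary-host {
    device    /dev/drbd0;
    disk      /dev/sda1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

With the Xen guest's virtual disk backed by /dev/drbd0, a failure of the primary leaves a near-current block image on the secondary, where the virtual machine can be restarted.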
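Hume's prescription of redundancy plus end-to-end error detection can be sketched in a few lines of Python. This is a toy illustration, not anything from his talk: two in-memory dictionaries stand in for independent storage devices, and SHA-256 plays the role of the end-to-end checksum computed above the storage stack.

```python
import hashlib

def store(blob: bytes, replicas: list) -> str:
    """Write the same blob to every replica; return its end-to-end checksum."""
    digest = hashlib.sha256(blob).hexdigest()
    for replica in replicas:
        replica[digest] = blob
    return digest

def retrieve(digest: str, replicas: list) -> bytes:
    """Read from each replica in turn; accept only data whose checksum matches."""
    for replica in replicas:
        blob = replica.get(digest)
        if blob is not None and hashlib.sha256(blob).hexdigest() == digest:
            return blob
    raise IOError("all replicas returned missing or corrupt data")

# Two in-memory "replicas" standing in for independent storage devices.
a, b = {}, {}
key = store(b"payload", [a, b])
a[key] = b"silently corrupted"  # simulate an undetected storage-layer error
assert retrieve(key, [a, b]) == b"payload"  # the second replica rescues the read
```

The point of the checksum living at the top of the stack is that it catches corruption introduced by any layer below it; redundancy alone doesn't help if you can't tell which copy is the bad one.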
My rough notes from the sessions I attended are at: JamesRH_Notes_USENIXLISA2007x.docx (21.03 KB).