Google had an outage this week. Google Docs, which I use at home, was down for about an hour. They wrote a post about the outage on their blog.
I think this post is a great lesson in effective IT Operations. Our IT organization is working on improving all of these areas, but we have more work to do to get to Google’s level of kung fu:
- Effective, transparent communications: They are very open about the incident. They want their customers to know that such events are unacceptable, that they take them seriously, and that they are taking measures to improve service.
- Change Management: They understand that these problems are almost always caused by unsuccessful changes. Because they look for the failed change first, their troubleshooting is very quick. They resolved the problem within 30 minutes by rolling back the change that caused it.
- Monitoring: Their monitoring tools uncovered the problem within 30 minutes.
- Downtime Status: They point to the Apps dashboard, a tool customers can use to see the status of their services.
- Root Cause Analysis: They quickly completed a Root Cause Analysis and are moving to implement process-based changes to minimize the likelihood of a repeat occurrence.
The fact of the matter is: outages happen. The most successful IT organizations don’t kid themselves about eliminating outages through redundancy or other means. They use the practices above to minimize the customer impact.