Google had an outage this week. Google Docs, which I use at home, was down for about an hour. They wrote a post about the outage on their blog.
I think this post is a great lesson in effective IT Operations. Our IT organization is working on improving all of these areas, but we have more work to do to get to Google’s level of kung fu:
- Effective, transparent communications: They are very transparent about the incident. They want their customers to know that such events are unacceptable; they take it seriously; and they are taking measures to improve service.
- Change Management: They understand that these problems are almost always caused by unsuccessful changes. By looking for the failed change, their troubleshooting is very quick. Within 30 minutes, they resolved the problem by rolling back the change that caused it.
- Monitoring: Their monitoring tools uncovered the problem within 30 minutes.
- Downtime Status: They talk about the Apps dashboard, which is a tool for customers to see the status of their services.
- Root Cause Analysis: They quickly completed a Root Cause Analysis and are moving to implement process-based changes to minimize the likelihood of a repeat occurrence.
The fact of the matter is: outages happen. The most successful IT organizations don’t kid themselves about eliminating outages through redundancy or other means. They use the means above to minimize the customer impact.
The excellent point of your piece was that customer impact is the important metric. Percentage uptime or downtime hours are less relevant (unless they are excessive) than the impact of those numbers on your customers.
Google is great at letting their customers know what’s going on when there’s a problem with their software. I think they set the standard for consistent customer-business interaction when it comes to communicating and resolving outages.
I agree wholeheartedly with this article. I think a lot of people working in IT (myself included) need to assess Google’s ability to not only deal with such occurrences, but also reassure their user base about what happened.
If we were all to take a page out of Google’s book, then the healthcare business could become far more effective. We also need to put people in roles where they can do such things, though. We can’t just pretend that the head of a department is going to come out and know what he’s talking about or how to fix it.
How do we convince senior IT management that transparent communication and change management help deal with such problems? I have found that a lot of IT managers in Cambodia do not understand that. They usually hide the outage details and have no idea how important change management is, even when their subordinates point out that it is important for the business.