Google’s Outage Is A Lesson In Excellent IT Operations

Google had an outage this week. Google Docs, which I use at home, was down for about an hour. They wrote a post about the outage on their blog.

I think this post is a great lesson in effective IT Operations. Our IT organization is working on improving all of these areas, but we have more work to do to get to Google’s level of kung fu:

  • Effective, transparent communications: They are very transparent about the incident. They want their customers to know that such events are unacceptable, that they take them seriously, and that they are taking measures to improve service.
  • Change Management: They understand that these problems are almost always caused by unsuccessful changes. Because they look for the failed change first, their troubleshooting is very quick. Within 30 minutes, they resolved the problem by rolling back the change that caused it.
  • Monitoring: Their monitoring tools uncovered the problem within 30 minutes.
  • Downtime Status: They talk about the Apps dashboard, a tool that lets customers see the status of their services.
  • Root Cause Analysis: They quickly completed a Root Cause Analysis and are moving to implement process-based changes to minimize the likelihood of a repeat occurrence.

The fact of the matter is: outages happen. The most successful IT organizations don’t kid themselves about eliminating outages through redundancy or other means. They use the means above to minimize the customer impact.

4 thoughts on “Google’s Outage Is A Lesson In Excellent IT Operations”

  1. Google is great at letting their customers know what’s going on when there’s a problem with their software. I think they set the standard for consistent customer-business interaction when it comes to communicating and resolving outages.

  2. I agree wholeheartedly with this article. I think a lot of people working in IT (myself included) need to assess Google’s ability not only to deal with such occurrences, but also to reassure their user base about what happened.
    If we were all to take a page out of Google’s book, then the healthcare business could become far more effective. We also need to put people in roles where they can do such things, though. We can’t just pretend that the head of a department is going to come out and know what he’s talking about or how to fix it.

  3. How do we convince senior IT management that transparent communication and change management help in dealing with such problems? I have found that a lot of IT managers in Cambodia do not understand that. They usually hide the outage details and have no idea how important change management is, even when their subordinates point out that it is important for the business.
