Posts filed under ‘IT Operations’
We are users of a cool product from Aternity for application monitoring. The Aternity app runs in the background on our PCs and monitors the end user experience. When a customer says “the system is running slow” we have a wealth of data that can confirm whether that is the case, quantify how slow, and correlate the slowness with a myriad of other factors (PC model, network segment, OS patches installed, etc.). There are other solutions in this space, including one from Compuware. But we have been happy with our Aternity investment (a six-figure expense).
A fair amount of setup is required to teach Aternity how to monitor each application, so it is not completely magic.
Until now, we had used this tool to respond to customer complaints and to review the monthly performance of our more troublesome applications. Recently, however, our Field Services group (they support the technologies that our customers touch) took our use of application monitoring to the next level.
That group is using Aternity as a proactive monitoring tool. They are monitoring three parameters that serve as indicators of PC health:
- Boot time
- Blue Screen of Death (BSOD)
- Application launch
When a device encounters a Blue Screen or exceeds a threshold for boot time or application launch, it appears on a list. Then the Field Services team from the appropriate region replaces the device and takes the old one back for troubleshooting and re-imaging or disposal.
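The flagging rule described above can be sketched in a few lines of Python. This is only an illustration and not Aternity's API or our actual configuration: the `DeviceReading` fields, the function names, and both threshold values are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical thresholds; the real values are not stated in this post.
BOOT_SECONDS_MAX = 120.0
APP_LAUNCH_SECONDS_MAX = 15.0


@dataclass
class DeviceReading:
    """One device's latest health measurements (illustrative fields only)."""
    hostname: str
    region: str
    boot_seconds: float
    app_launch_seconds: float
    bsod_count: int


def reasons_to_replace(reading):
    """Return the list of health checks this device fails (empty if healthy)."""
    reasons = []
    if reading.bsod_count > 0:
        reasons.append("BSOD")
    if reading.boot_seconds > BOOT_SECONDS_MAX:
        reasons.append("slow boot")
    if reading.app_launch_seconds > APP_LAUNCH_SECONDS_MAX:
        reasons.append("slow application launch")
    return reasons


def replacement_list(readings):
    """Group flagged devices by region so each regional team gets its own list."""
    by_region = defaultdict(list)
    for reading in readings:
        reasons = reasons_to_replace(reading)
        if reasons:
            by_region[reading.region].append((reading.hostname, reasons))
    return dict(by_region)
```

Calling `replacement_list` on a batch of readings returns a dictionary keyed by region, where each entry lists the flagged hostnames together with the checks they failed. Healthy devices do not appear at all.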
By being proactive and showing up to replace a device before an incident is reported, our Field Services team is creating a “wow” experience for our customers.
Google had an outage this week. Google Docs, which I use at home, was down for about an hour. They wrote a post about the outage on their blog.
I think this post is a great lesson in effective IT Operations. Our IT organization is working on improving all of these areas, but we have more work to do to get to Google’s level of kung fu:
- Effective, transparent communications: They are very open about the incident. They want their customers to know that such events are unacceptable, that they take them seriously, and that they are taking measures to improve service.
- Change Management: They understand that these problems are almost always caused by unsuccessful changes. By looking for the failed change, their troubleshooting is very quick. They resolved the problem within 30 minutes by rolling back the change that caused it.
- Monitoring: Their monitoring tools uncovered the problem within 30 minutes.
- Downtime Status: They talk about the Apps dashboard, which is a tool for customers to see the status of their services.
- Root Cause Analysis: They quickly completed a Root Cause Analysis and are moving to implement process-based changes to minimize the likelihood of a repeat occurrence.
The fact of the matter is: outages happen. The most successful IT organizations don’t kid themselves about eliminating outages through redundancy or other means. They use the means above to minimize the customer impact.