This morning Gmail was down for about 100 minutes. The Google blog has a well-written explanation of the outage: More on Today's Gmail Issue.
This is a good example of how to communicate with your user community about downtime. I don't think I have to tell any system administrator how important it is to communicate with users both during and after a downtime incident.
The fact is that no matter how well you set up failover and high availability, there may be times when systems fail. A good IT team needs monitoring in place to know when a failure occurs, the troubleshooting skills to determine the cause of the problem, the ability to think on their feet under high stress, and the ability to communicate status to their customers.
When I communicate downtime reports, I like to harken back to one of the first jobs I held after college. I used to process warranty claims for an Acura car dealership. On the back of the repair orders the technicians had to report "the three C's" or "C.C.C.", which stood for "Complaint, Cause, and Correction." Like the 5 W questions of journalism (Who, What, Why, When, Where), these 3 C's serve as a good mnemonic device to help you make sure you cover the details.
An example of using this format on this morning's Gmail problem might look like this:
Incident Report:
Key detail: What systems/users/functions were affected. Note start time of first system down report, and end time when all systems were back to normal.
Complaint: Gmail down for 100 minutes.
Cause: Failover request routers became overloaded during routine maintenance. There were not enough request routers to handle failover.
Correction: Brought up additional request router servers.
Of course you will flesh this information out in different ways depending upon the audience, but the three Cs of repair technicians can be a good way to remember to report key details.
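If you file these reports often, the format above is easy to capture in a small template. Here is a minimal sketch in Python, assuming a made-up `IncidentReport` class whose field names simply mirror the report above. The description of who was affected and the start and end times are placeholders I picked for illustration (chosen only so the duration works out to 100 minutes); they are not the actual details of the Gmail outage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentReport:
    """Hypothetical container for a "three C's" incident report."""
    affected: str       # key detail: systems/users/functions affected
    start: datetime     # time of the first system-down report
    end: datetime       # time all systems were back to normal
    complaint: str
    cause: str
    correction: str

    def duration_minutes(self) -> int:
        """Whole minutes between the first report and full recovery."""
        return int((self.end - self.start).total_seconds() // 60)

    def render(self) -> str:
        """Format the report as plain text, one labeled line per item."""
        return "\n".join([
            "Incident Report:",
            f"Key detail: {self.affected}, "
            f"{self.start:%H:%M} to {self.end:%H:%M} "
            f"({self.duration_minutes()} minutes)",
            f"Complaint: {self.complaint}",
            f"Cause: {self.cause}",
            f"Correction: {self.correction}",
        ])


# Placeholder description and times, for illustration only.
report = IncidentReport(
    affected="Gmail users (placeholder description)",
    start=datetime(2000, 1, 1, 9, 0),
    end=datetime(2000, 1, 1, 10, 40),
    complaint="Gmail down for 100 minutes.",
    cause="Request routers became overloaded during maintenance.",
    correction="Brought up additional request router servers.",
)
text = report.render()
print(text)
```

Because every field is required, the template will not let you send a report that skips the cause or the correction, which is the whole point of the mnemonic.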
It is also good to remember there are two types of incident reports you may need to make: the first is a communication during the downtime, the second is the full incident report after systems are back to normal. In the middle of the crisis there is a temptation to forgo communication and just work hard to fix the problem. It's good to have a member of management handle the communication at this point, as the technical team is busy solving the problem. Always tell your boss(es) immediately when a problem occurs that is going to be on their radar. They hate being surprised.
During system downtime you may need to communicate via an alternative method. (Can't email them that email is down, right?) Voicemail, the intranet, a company blog, or even old-fashioned overhead speakers can be used. Update the help desk teams, and task them with updating the help desk voicemail message for your callers.
After the problem is fixed, you want to put together your "post mortem" incident report. Use the 5 W's of journalism and the three C's of automotive service to help you remember to record all the key details. And remember, the most important thing after a downtime experience is to tell your customers and management what you are doing to ensure that this does not occur again.
Next time it will be something else entirely.