Office 365 outages – the incomplete post-mortem
Microsoft misses the main problem as it apologizes for and explains the Lync and Exchange outages.
The break in Exchange Server is down to “an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests” that happened in Microsoft’s North American datacenters only.
No numbers were given to indicate how many customers were affected. We’re assured that it was only a ‘small’ number of customers. However, ‘small’ is what Microsoft always says, no matter how big or small the number really is.
They also talk about a ‘brief’ outage, but that doesn’t match customer complaints of problems lasting between 4 and 10 hours.
There’s little acknowledgement that communication with customers at the time was bad, beyond saying that the “Service Health Dashboard (SHD)” wasn’t updating properly.
We’re told that “we will learn from this experience and continue improving our proactive monitoring, prevention, recovery and defense in depth systems” which sounds great but misses the point. It seems that Microsoft hasn’t learnt the main lesson from this and other cloud outages. It’s not enough to apologize days later and trot out the usual platitudes.
Microsoft has to improve its communication at the time a problem arises. Fixing the Service Health Dashboard is one thing, but the blog post makes no mention of, or apology for, the single strange Twitter message about the problem.
It’s basic customer relations, to say nothing of good manners, to keep customers informed. Often customers will be patient if they know the company is aware of a problem and working to fix it. The Lync/Exchange outage was made worse for customers because it appeared Microsoft either wasn’t aware of the problem or was hiding the truth.
Maybe Microsoft will understand if we say it in public relations speak: “Stay ahead of the story”.