Microsoft 365 cloud outages, what happened and what you can do

It’s been another bad week for Microsoft customers who rely on Microsoft 365 hosting and other cloud technologies.  Global outages lasted more than half a day and hit Teams, Azure, Xbox Live and many other Microsoft cloud services.

Everything should be restored by the time you read this, but there are still questions about what happened and what customers can do when these breakdowns occur.

What happened?

According to the usually well-informed Mary Jo Foley, the problem affected people trying to log in to their Microsoft services.  Anyone already connected (and authenticated) could continue working.

That matches the many reports we’ve seen and heard.  For example, some people could continue to use Teams because they’d logged in before Microsoft’s servers went screwy.  But anyone trying to log in to the same Teams group could not.

The problem was the encryption keys used by Azure. These keys are changed on a regular basis with older keys retired in favor of newly generated ones.  It’s a normal part of computer security which happens on everyone’s devices and browsers automatically.

An unusual need meant one key was retained longer than usual.  That exposed a bug where the metadata for that key was removed.  As a result, the vital key wasn’t accepted by Microsoft’s systems.

You can read the technical summary on the Azure Status History page.
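
To make that failure easier to picture, here is a toy sketch of key rotation.  It is not Microsoft’s code.  The names KeyStore, KeyMetadata, the retain flag and the honor_retain switch are all our own assumptions, made up to illustrate how a cleanup job that ignores a "keep this key" marker can leave a still-needed key unusable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class KeyMetadata:
    key_id: str
    expires: datetime
    retain: bool = False  # hypothetical marker: keep this key past its normal lifetime


class KeyStore:
    """Toy key store: a key is accepted only while its metadata is published."""

    def __init__(self) -> None:
        self.published: Dict[str, KeyMetadata] = {}

    def add(self, meta: KeyMetadata) -> None:
        self.published[meta.key_id] = meta

    def cleanup(self, now: datetime, honor_retain: bool) -> None:
        """Remove metadata for expired keys.

        honor_retain=False models the kind of bug described in the article:
        a key marked for retention loses its metadata anyway.
        """
        for key_id, meta in list(self.published.items()):
            expired = meta.expires <= now
            keep = honor_retain and meta.retain
            if expired and not keep:
                del self.published[key_id]

    def accepts(self, key_id: str) -> bool:
        return key_id in self.published


def simulate(honor_retain: bool) -> bool:
    """Return True if the retained old key is still accepted after cleanup."""
    now = datetime(2000, 1, 1)  # arbitrary date, for illustration only
    store = KeyStore()
    store.add(KeyMetadata("old-key", expires=now - timedelta(days=1), retain=True))
    store.add(KeyMetadata("new-key", expires=now + timedelta(days=30)))
    store.cleanup(now, honor_retain=honor_retain)
    return store.accepts("old-key")


if __name__ == "__main__":
    print("cleanup respects retain flag:", simulate(honor_retain=True))
    print("cleanup ignores retain flag: ", simulate(honor_retain=False))
```

In the second run the old key is still in use, but nothing can verify it because its metadata is gone.  That is the general shape of the problem, under our simplified assumptions.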

These things happen to any large system.  They’ll happen to Microsoft again, and to other cloud-based companies too.

What can customers do?

The big question is – what can customers do when outages occur?

There are a few preventative measures available.

Have a backup service

Make plans for an alternative way to contact your customers, staff, class or other group.

“Place not all thine eggs in one basket”

Have a different email address, hosted by a different cloud service.  If your email is hosted by Microsoft 365/Outlook.com, have a standby address on another system like Gmail.  The same goes the other way: if you use Google Workspace/Gmail, have an alternate non-Google option.

Same for video calls.  The Microsoft outage brought many classes and meetings to a halt.  Organizations and schools should have an alternative option ready.  If you rely on Teams, have everyone involved install Zoom as a backup.  Or vice-versa.

Plan ahead

Think about what you’d do if your current system doesn’t work.  Now that we’re more reliant on remote communication, it’s time to think about fallback options for when the main system isn’t behaving.

  • Do you have an alternative mailbox and contacts for everyone?
  • Is there a simple web page you can use to update everyone quickly?  A page that’s accessible without login?
  • Suggest all involved install the alternative software, create accounts etc. It’s likely many people have alternatives like Zoom, Skype etc already installed.
  • Run some live tests using the different calling system.  Think of it like a fire drill; a nuisance but necessary.

Individuals should think about what happens if their main Wi-Fi stops or even if power to the house goes out (which happened to Peter Deegan this week).  Keep laptops charged up even if you think it’s not needed.  If there’s a power or internet outage, switch to a tethered/personal hotspot via a smartphone.  That’s not as fast but should be enough for you to keep working.

Stay logged in

A pattern in many of the Microsoft cloud outages is the source of the problem – it’s the authentication or login process.

The core systems (Exchange Server, OneDrive, Azure etc) are working OK, but customers can’t get past the login process to reach them.

Anyone who is logged into Microsoft’s systems before the authentication breakdown can continue to use the service. That’s because they’ve already passed the ‘border’ checks before those checks stopped working.
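
Here is a rough sketch of that ‘border’ idea.  Again, it is only an illustration with made-up names (AuthService, Service, try_sign_in), not how Microsoft’s systems are actually built.  It assumes the core service only checks a session it has already issued, while a new sign-in has to go through the authentication system first.

```python
from typing import Optional, Set


class AuthService:
    """Stand-in for the login system; set healthy=False to simulate an outage."""

    def __init__(self) -> None:
        self.healthy = True
        self._counter = 0

    def login(self, user: str) -> str:
        if not self.healthy:
            raise ConnectionError("authentication is unavailable")
        self._counter += 1
        return f"session-{user}-{self._counter}"


class Service:
    """Stand-in for a core service such as chat or file storage."""

    def __init__(self, auth: AuthService) -> None:
        self.auth = auth
        self.sessions: Set[str] = set()

    def sign_in(self, user: str) -> str:
        # New sign-ins must pass the 'border' check first.
        token = self.auth.login(user)
        self.sessions.add(token)
        return token

    def use(self, token: str) -> bool:
        # Existing sessions are checked locally, without calling AuthService.
        return token in self.sessions


def try_sign_in(service: Service, user: str) -> Optional[str]:
    """Return a session token, or None if the login system is down."""
    try:
        return service.sign_in(user)
    except ConnectionError:
        return None


if __name__ == "__main__":
    auth = AuthService()
    teams = Service(auth)
    early_bird = teams.sign_in("alice")   # logged in before the outage
    auth.healthy = False                  # the login system breaks
    print("alice can keep working:", teams.use(early_bird))
    print("bob can sign in:", try_sign_in(teams, "bob") is not None)
```

One caveat: real sessions usually expire and need refreshing, so staying logged in helps only for as long as the existing session remains valid.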

It’s best, where possible, to stay logged into services.  If you’re using Teams or Microsoft 365, log in in the morning and stay online for the day, instead of dropping out and then signing in again when needed.

That simple habit might help you ride out future cloud outages while others flounder.