In this post, we dissect the key events of the worldwide Microsoft 365 services outage that spanned over 9 hours on March 15th, 2021.
On March 15th, 2021 at 8:40 pm UTC, Microsoft reported an issue that was preventing users from accessing Microsoft 365 services. Shortly after, they confirmed that the issue could be affecting users worldwide.
Many users took to Twitter to express their frustration with the major outage as well as their inability to check the Service Health Dashboard for updates.
As you can see below, many users were able to log in to the Admin Center, but no information on the current outage was available.
Further details could be found on https://status.office.com. According to Microsoft, any service that leverages Azure AD may be affected, including but not limited to Microsoft Teams, Forms, Exchange, Intune and Yammer.
Microsoft reported that they had identified an issue with a recent change to an authentication system and they were rolling back the update to mitigate impact.
Shortly after, Microsoft reported that the process to roll back the change was taking longer than expected and that they would provide an ETA as soon as one became available.
At ~10:15 pm UTC, Microsoft reported that they were rolling out a mitigation worldwide and that customers should begin to see recovery. They anticipated full remediation within the hour.
At ~11:00 pm, Microsoft confirmed that the update had finished deployment to all impacted regions and that Microsoft 365 services were showing a decreasing error rate in telemetry.
Roughly two hours later, Microsoft reported that service health had improved across multiple Microsoft 365 services. However, they were still taking steps to resolve isolated residual impact for some services.
At ~2:30 am UTC, Microsoft reported that they had received confirmation that most services had recovered. They would continue to monitor the remaining impacted services until fully mitigated and provide updates via status.office.com.
At ~5:30 am, Microsoft confirmed that impact had been largely mitigated and that they would continue to provide service-specific updates.
Microsoft posted another update at 11:30 am confirming again that the majority of services impacted had recovered with the exception of Intune and Microsoft Managed Desktop.
Microsoft Root Cause Analysis (Tracking ID LN01-P8Z)
According to Microsoft, the root cause of the outage was as follows: "The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key."
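To make the failure mode in Microsoft's description concrete, here is a minimal Python sketch of a time-based key cleanup. It is not Microsoft's code: the `SigningKey` class, the `retain` field, the `keys_to_remove` function, and the 30-day idle threshold are all hypothetical names and values chosen for illustration. The `honor_retain` switch only exists to contrast the intended behavior with the kind of bug the quote describes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    last_used: datetime
    retain: bool = False  # hypothetical flag: keep this key even if it looks unused


def keys_to_remove(keys, now, max_idle=timedelta(days=30), honor_retain=True):
    """Return the IDs of keys a scheduled cleanup would delete.

    A key becomes a removal candidate once it has been idle longer than
    max_idle (an assumed threshold). With honor_retain=False the retain
    flag is ignored, which mimics the class of bug described in the quote.
    """
    removable = []
    for key in keys:
        idle_too_long = now - key.last_used > max_idle
        protected = honor_retain and key.retain
        if idle_too_long and not protected:
            removable.append(key.key_id)
    return removable


now = datetime(2021, 3, 15, tzinfo=timezone.utc)
keys = [
    SigningKey("active-key", last_used=now - timedelta(days=1)),
    SigningKey("stale-key", last_used=now - timedelta(days=90)),
    SigningKey("migration-key", last_used=now - timedelta(days=90), retain=True),
]

# Buggy behavior: the retain flag is ignored, so the migration key is deleted too.
print(keys_to_remove(keys, now, honor_retain=False))  # ['stale-key', 'migration-key']

# Intended behavior: the retained key survives the cleanup.
print(keys_to_remove(keys, now))  # ['stale-key']
```

In this sketch the only difference between a routine cleanup and an outage-sized mistake is whether one boolean is consulted, which is why exception states like "retain" deserve explicit test coverage in any automation that deletes credentials.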
You can find more information on the Azure status history page.
Office 365 Monitoring: For less than a cup of coffee
Yesterday's outage was yet another reminder that organizations are at the mercy of cloud providers like Microsoft; however, IT's reputation is still on the hook. The faster the IT team can determine whether an outage is caused by Microsoft or by their own infrastructure, the better the chance IT Pros have of protecting workplace productivity.
ENow's OneLook dashboard monitors all of Office 365 from a single pane of glass. When an issue does occur, IT Pros are easily able to identify the services affected and drill down to the root cause. This enables IT to confidently inform upper management of issues and recommend alternative solutions until service is restored.
The Importance of Office 365 Monitoring
In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of their environment and user experience. It is crucial to have greater visibility into business impacts during a service outage the moment it happens.
ENow’s Office 365 Monitoring and Reporting solution enables IT Pros to pinpoint the exact services affected and the root cause of the issues an organization is experiencing during a service outage by providing:
- The ability to monitor entire environments in one place with ENow’s OneLook dashboard, which makes identifying a problem fast and easy without having to scramble through Twitter and the Service Health Dashboard looking for answers.
- A full picture of all services and subsets of services affected during an outage with ENow’s remote probes, which cover several Office 365 apps and other cloud-based collaboration services.