On April 1, 2021, at ~10:43 pm UTC, Microsoft reported an issue that was preventing users from accessing Microsoft's Azure cloud services and Microsoft 365 services. And no, unfortunately, this was not an April Fools' joke.
Shortly after its initial report, Microsoft tweeted that it was investigating a potential DNS issue and evaluating mitigation options. Soon after, the company reported that it had rerouted traffic to its resilient DNS capabilities and was seeing improvement in service availability.
Many users took to Twitter to share their frustration with the recent stream of outages, including the major Microsoft outage on March 15th.
Microsoft reported that, as services continued to recover, information would be available in the admin center. The company also reported that it was managing multiple workstreams to validate recovery and apply the mitigation steps needed to ensure full network recovery.
Finally, at ~3:30 am UTC, Microsoft confirmed that it had fully resolved the issue and that all Microsoft 365 services had returned to a healthy state.
This outage took a large majority of Microsoft's internal services offline, including Azure cloud services, Office 365, Teams, OneDrive, Bing, and Xbox Live.
It appears that the outage stemmed from a Domain Name System (DNS) error due to "a recent change to an authentication system." Although Microsoft handled the outage rather quickly, users were not happy, because a similar outage just two weeks earlier had left them in the dark and offline for hours. DNS has been the root cause of multiple Microsoft outages of late, which means it may be time for Microsoft to make some big changes to how it manages its network.
The Importance of Office 365 Monitoring
In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of its own environment and user experience. It is crucial to have visibility into the business impact of a service outage the moment it happens.
ENow’s Office 365 Monitoring and Reporting solution enables IT pros to pinpoint the exact services affected, and the root cause of the issues an organization is experiencing, during a service outage by providing:
- The ability to monitor entire environments in one place with ENow’s OneLook dashboard, which makes identifying a problem fast and easy, without having to scramble through Twitter and the Service Health Dashboard looking for answers.
- A full picture of all services and subsets of services affected during an outage with ENow’s remote probes, which cover several Office 365 apps and other cloud-based collaboration services.
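
To make the idea of a probe concrete, here is a minimal sketch in Python of a DNS resolution check, the kind of failure at the center of this outage. It is only an illustration and not how ENow's probes are implemented: the function name `probe_dns`, its parameters, and the result format are all assumptions made for this example.

```python
import socket
import time


def probe_dns(hostnames, resolver=socket.getaddrinfo, port=443):
    """Try to resolve each hostname and record whether DNS answered.

    `resolver` is injectable so the check can be exercised without a
    network; by default it uses the standard library's getaddrinfo.
    """
    results = {}
    for host in hostnames:
        started = time.monotonic()
        try:
            resolver(host, port)
            status = "ok"
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            status = f"dns_failure: {exc}"
        elapsed_ms = (time.monotonic() - started) * 1000
        results[host] = {"status": status, "elapsed_ms": round(elapsed_ms, 1)}
    return results
```

Calling `probe_dns(["outlook.office365.com", "portal.azure.com"])` on a schedule (the hostnames are picked here purely for illustration) would show whether name resolution for those endpoints is failing from your own network. A check like this only covers DNS, so it complements rather than replaces monitoring of the services themselves.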