Microsoft's latest outage highlights several points of attention
This past Thursday, May 2nd, 2019, Microsoft suffered another outage affecting parts of its cloud services. It follows a series of outages earlier this year that affected a variety of online services, including Azure, SharePoint Online, OneDrive, Intune, and Microsoft Teams.
According to the recently published post-incident report, the root of the issue was a faulty DNS update, leaving thousands of users unable to connect to said services for a period of roughly two hours:
“A configuration issue occurred during planned maintenance activity related to a name server delegation change within Azure Domain Name Services (DNS). Specifically, an issue in the update to one of the name servers for DNS zones caused server records to point to a DNS server that contained blank zone data. As a result, the affected DNS infrastructure returned negative responses and users encountered connectivity issues when attempting to access Microsoft services.”
If anything, the outage shows there are several areas for improvement for Microsoft. For example, the lack of accurate communication left a lot of customers wondering what was wrong. This is an issue that keeps reappearing across various outages.
During the early stages of the outage, Microsoft’s various health dashboards showed no issues, forcing customers to turn to Twitter to find out more information about the issue itself:
“This issue was noticed and reported by customers before our monitoring detected the issue.”
All things considered, I don’t blame Microsoft for not immediately catching on to the issue. After all, an external element was disrupting connections to its services; nothing was wrong with the services themselves. It does, however, show that a holistic approach to monitoring is crucial.
That is why ENow's Office 365 Monitoring solution doesn’t rely on a single point of monitoring; it also leverages external probes to gain visibility into these types of external disturbances.
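To make the idea of an external probe concrete, here is a minimal sketch in Python. It is not ENow's implementation; the function name `probe`, its parameters, and the three result labels are assumptions made for illustration. It separates "the name does not resolve" from "the name resolves but the service does not answer", which is the distinction that mattered in this outage, where negative DNS responses would typically surface as a resolution error.

```python
import socket


def probe(hostname, port=443, resolve=socket.getaddrinfo,
          connect=socket.create_connection, timeout=5.0):
    """Classify an endpoint as 'ok', 'dns_failure', or 'connect_failure'.

    Hypothetical helper for illustration only. The resolve and connect
    callables are injectable so the logic can be exercised without a network.
    """
    # Step 1: name resolution. A failed or empty lookup is reported
    # separately, because the service behind the name may be healthy.
    try:
        results = resolve(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    if not results:
        return "dns_failure"

    # Step 2: TCP reachability. Try each resolved address until one accepts.
    for _family, _socktype, _proto, _canonname, sockaddr in results:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue
        conn.close()
        return "ok"
    return "connect_failure"
```

Run from a machine outside your own network against a tenant hostname (for example a placeholder such as `contoso.sharepoint.com`), a `dns_failure` result while internal checks still look healthy would point to the kind of external disturbance described above.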
Luckily, not all customers seem to have suffered from this outage. I, for one, did not notice the outage at all. This was probably because the relevant DNS entries were cached (long enough) along the way. But in hindsight, the impact could have been much worse.
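The caching effect can be illustrated with a small sketch. The class below is a simplified, hypothetical model of a time-to-live (TTL) cache, not how any particular resolver is implemented; real resolvers are considerably more involved. It shows why a client that looked up a name shortly before the outage could keep connecting until its cached entry expired.

```python
import time


class TtlDnsCache:
    """Simplified model of a resolver cache that honors record TTLs."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be demonstrated without waiting.
        self._clock = clock
        self._entries = {}

    def store(self, name, address, ttl_seconds):
        # Remember the answer together with the moment it stops being valid.
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def lookup(self, name):
        # Return the cached address, or None if it is missing or expired.
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]
            return None
        return address
```

With a one-hour TTL, a lookup half an hour after the entry was stored still returns the address even if the authoritative servers are returning negative responses by then; once the hour is up, the client has to ask again and runs into the failure.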
Despite Microsoft’s efforts to prevent failures like this from happening, there is no such thing as a 100% fail-safe strategy. Issues can (and will) happen, and this outage is the perfect example. And while issues of this size are typically few and far between, detecting them early on really makes a difference to how you, as an organization, can deal with them. You won’t be able to solve the issues for Microsoft, but you can get ahead of the curve and communicate more clearly to your users about them. Depending on the type of outage, you might even be able to proactively provide a workaround before the issue spins out of control.
ENow Software's Office 365 monitoring provides visibility
ENow Software is the leading provider of Office 365 management solutions that help you save money and increase end-user productivity.
Let’s take a look at how ENow’s Office 365 monitoring solution quickly surfaces problems in real time and allows our customers to successfully diagnose and troubleshoot tricky outages like the SharePoint Online and OneDrive problems.
Shortly after the DNS problem began taking effect, we received some visual indications on the OneLook Dashboard that pointed us in the direction of the problem. The screenshot below shows that there are critical issues for Office 365 Network connectivity as well as a problem with Teams and SharePoint Online. This helps us understand immediately that there is a problem with the Office 365 service.
During the May 2nd outage, ENow customers saw failed status notifications for OneDrive, SharePoint Online, and Teams.
Drilling down into the SharePoint Online indicator shows that we are not able to connect to the SharePoint Online service.
Additionally, ENow’s Office 365 monitoring solution performs synthetic transactions that test the functionality specific to your tenant. We can see from the image below that because we are not able to connect to the SharePoint Online service, our upload/download test fails.
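As a rough illustration of what a synthetic upload/download transaction does, consider the sketch below. It is an assumption-laden outline, not ENow's code: `run_roundtrip_test` and its `upload` and `download` parameters are hypothetical names, and the callables stand in for whatever client you use to talk to your tenant. The test uploads a random payload, reads it back, and reports a failure if either step raises or the content differs.

```python
import os
import time


def run_roundtrip_test(upload, download, payload_size=1024):
    """Upload random bytes, download them again, and report the outcome.

    upload(data) must return a handle that download(handle) accepts.
    Both callables are placeholders supplied by the caller.
    """
    payload = os.urandom(payload_size)
    started = time.monotonic()
    try:
        handle = upload(payload)
        returned = download(handle)
    except Exception as exc:
        # Any failure (DNS, connection, authentication) fails the test;
        # the exception type and message are kept for diagnosis.
        return {
            "passed": False,
            "reason": f"{type(exc).__name__}: {exc}",
            "seconds": time.monotonic() - started,
        }
    elapsed = time.monotonic() - started
    if returned != payload:
        return {"passed": False, "reason": "content mismatch", "seconds": elapsed}
    return {"passed": True, "reason": "", "seconds": elapsed}
```

During an outage like this one, the upload step would fail before any data is transferred, and the recorded reason would show that the service could not be reached rather than that the file operation itself was rejected.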
Users who rely on the Microsoft Service Health dashboard didn’t get a concrete update for several hours. This frustration can be avoided by utilizing ENow’s OneLook Dashboard to save precious time when there is an outage.
ENow saves the day again!
ENow customers like Barclays, Facebook and VMware were able to quickly identify and drill down to the root cause of the problem as it was happening.
Watch the video below to see how this took place in real time!
Did you have the controls in place to visually spot the outage in real time? Gain visibility today with ENow's End User Experience Monitoring!