Welcome to part two of Addressing the Office 365 Monitoring Gaps. In part 1, Michael Van Horenbeeck discussed the differences in monitoring cloud-based systems vs traditional on-premises deployments.
In this post we will cover IT pros' least favorite thing: Microsoft service outages. In particular, we will discuss Office 365 service outages as well as the components IT pros are responsible for.
As mentioned in part 1 of this blog series, given the scope of this topic and ENow's dedication to high quality posts, we decided to break this up into 4 parts.
While service outages don’t happen frequently, they do happen from time to time. Every outage teaches us something new: it highlights areas of improvement for Microsoft, but it also exposes Office 365 monitoring gaps and points to mitigation steps that can prevent or reduce the impact of similar events in the future.
The graphic below depicts the largest and most widespread outages over the last 18 months [through February 4, 2021].
Let's take a look at some of the outages depicted above. What happened? What was the impact to end-users? What could admins do before/during/after the outage? What did the outage teach us?
The MFA outages in 2018 painfully demonstrated that MFA can be the Achilles’ heel of your deployment. It is by far the most effective (and necessary!) countermeasure against password spray attacks, but when MFA itself has an issue, it also makes it impossible for your users to access the Office 365 services. Many organizations that followed Microsoft’s recommendation to enable MFA for everyone, including their administrative accounts, soon found themselves locked out of their tenant. That prevented them from making changes, such as temporarily disabling MFA to allow users to access their services. Organizations affected by the outage quickly learned that having a break-glass account is a must.
Another example is the May 2019 outage Microsoft suffered after a faulty DNS update. According to the post-incident report, Microsoft’s own monitoring solutions did not pick up on the issue until after customers started reporting it. While the service itself was likely performing just fine, a faulty DNS entry left customers unable to connect.
If anything, the outage shows there are several areas of improvement for Microsoft. For example, the lack of (correct) communications left a lot of customers wondering what was wrong. This is an issue that keeps reappearing through various outages.
During the early stages of the outage, Microsoft’s various health dashboards showed no issues, forcing customers to turn to Twitter to find out more information about the issue itself.
On November 19, Office 365 experienced another outage. Some outages are localized and have only a minor impact on end users, but this one was global and affected a multitude of Office 365 services. End users flooded help desks, and many IT pros were left with little information to report back to them.
During this outage, admins were unable to access the admin center, so IT pros without an Office 365 monitoring solution in place had no visibility into the scope of the problem. Many turned to Twitter or waited for end-user complaints. On Twitter, the Microsoft 365 Status account reported: "We've determined that users may experience intermittent access issues with the Microsoft 365 IT Pro Center, Exchange Online, SharePoint Online, Microsoft Teams, Skype for Business, and Yammer at this time. We're continuing our investigation into the root cause and we'll provide more information when we’ve isolated the root cause."
It was once reported that around 52% of support calls into Microsoft end up being issues on the client side. Hybrid configurations prove particularly problematic in terms of monitoring. Just as you don’t have any visibility into the infrastructure of a service provider, the same is true the other way around: Microsoft has no idea if or when there are problems affecting your on-premises servers. However, in some scenarios on-premises components are crucial to the operation of the cloud-based service.
There are many scenarios in which on-premises components become an important, if not critical, part of the end-to-end user experience of Office 365. This is the case when you use components such as Active Directory Federation Services (AD FS), Pass-Through Authentication (PTA), or Exchange Hybrid. When an organization decides to use these services, additional components are deployed on-premises and become the customer’s responsibility. In some cases, these components may be new to the environment and not managed to the same standard, or with the same skill, as the existing mature on-premises systems. If one of these services fails, the consequences can be substantial. For example, when the AD FS infrastructure becomes unavailable, no federated users will be able to log on to any Office 365 service.
The same is true for certificates: they are crucial to many solutions, such as Exchange or AD FS. Although renewing a certificate is a trivial task, expired certificates have caused many issues in the past, even for cloud solution providers like Microsoft. Being able to monitor the status of your certificates and their expiration dates, preferably all in one overview, will help you avoid expiration and the issues that result from it.
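To make the certificate point concrete, here is a minimal sketch of an expiry check in Python, using only the standard library. The function names and the 30-day and 7-day thresholds are our own illustrative choices, not a description of how any particular monitoring product works, and the host you pass in would be your own AD FS or Exchange endpoint.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Return the expiry time (UTC) of the certificate presented by host:port.

    Note: the default context validates the certificate, so an already
    expired certificate makes the handshake fail with an ssl.SSLError
    instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter is a string such as "Jun  1 12:00:00 2030 GMT"
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days between now and the expiry time (negative once expired)."""
    return (not_after - now).days


def classify(days: int, warn_days: int = 30, critical_days: int = 7) -> str:
    """Map days remaining to a status; the thresholds are illustrative."""
    if days < 0:
        return "expired"
    if days <= critical_days:
        return "critical"
    if days <= warn_days:
        return "warning"
    return "ok"
```

In practice you would call fetch_not_after for each endpoint on a schedule, pass the result to days_remaining together with datetime.now(timezone.utc), and alert on anything that classify does not report as "ok". Running it against a list of hosts gives you the "all in one overview" described above.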
These are just a few examples of issues that fall solely within the responsibility of the customer to detect and solve. But it doesn’t stop there. Oftentimes the responsibility stretches over many other components, some more obvious than others: basic network connectivity, routing, DNS, and hybrid Exchange functionality such as mail flow and free/busy.
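As a rough illustration of checking two of those components, the Python sketch below separates a DNS failure from a plain connectivity failure for a single endpoint. The probe function and its result strings are hypothetical names chosen for this example. The resolve and connect parameters exist only so the check can be exercised without touching the network.

```python
import socket
from typing import Callable


def probe(host: str, port: int = 443, timeout: float = 5.0,
          resolve: Callable = socket.getaddrinfo,
          connect: Callable = socket.create_connection) -> str:
    """Classify reachability of host:port.

    Returns "dns_failure" if the name does not resolve,
    "connect_failure" if it resolves but the TCP connection fails,
    and "ok" otherwise.
    """
    try:
        resolve(host, port)
    except socket.gaierror:
        # The name could not be resolved: a DNS problem, not a service problem.
        return "dns_failure"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return "connect_failure"
    conn.close()
    return "ok"
```

A "dns_failure" result for an endpoint that was reachable a few minutes ago points to the kind of situation described for the May 2019 outage: the service may be healthy, but clients cannot find it. A successful TCP connection does not prove that the application behind it is working, so treat this as a first-line check only.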
In this post we reviewed a handful of Office 365 outages that have happened over the past 18 months (through February 4, 2021). It's important to note that Microsoft does have native tools like the admin center. In part 3 of this blog series, we will cover what's available to IT pros, walk through the key features, and highlight areas of improvement.
On-premises components, such as AD FS, PTA, and Exchange Hybrid, are critical to the Office 365 end-user experience. In addition, something as trivial as an expiring Exchange or AD FS certificate can lead to unexpected outages. By proactively monitoring hybrid components, ENow gives you early warnings when those components are reaching a critical state or when a certificate is about to expire. Knowing immediately when a problem happens, where the fault lies, and why the issue has occurred ensures that any outages are detected and solved as quickly as possible.