Welcome to part 3 of Addressing the Office 365 Monitoring Gaps. In part 1, Michael Van Horenbeeck discussed the differences between monitoring cloud-based systems and traditional on-premises deployments. In part 2, we discussed admins' least favorite thing: outages.
In this post we will cover the native Office 365 monitoring tools, their key features, and their limitations.
As mentioned in part 1, given the scope of this topic and ENow's dedication to high-quality posts, we decided to break this up into four parts:
- New Challenges in a Cloud World
- Office 365 Outages (Spoiler: it's not all on Microsoft)
- Office 365 Native Tools
- Filling the Office 365 Monitoring Gaps
Microsoft’s Native Monitoring Tools
Microsoft’s internal monitoring team performs numerous Office 365 monitoring tests on the service. However, not all of the tools they use internally are made available to administrators. In this section we will provide an overview of the two main tools available to administrators: the Service Health Dashboard and AAD Connect Health.
Service Health Dashboard
Today, Microsoft exposes health information about Office 365 services through the Service Health Dashboard (SHD). The information that is provided through the dashboard is only of limited use, as it focuses primarily on the overall service health instead of tenant-specific or user-specific problems. One of the limitations of the SHD is that it only gives you part of the end-to-end service view; it provides information on components that Microsoft is responsible for, but fails to monitor and report on outages caused by other components, such as your local network, Internet connection or hybrid infrastructure such as Directory Synchronization, federation health, mail flow or AD FS.
Due to the massive scale of Office 365, the dashboard almost always reports some type of issue in one of its services because, logically, there is always a problem going on somewhere. A warning in the dashboard does not necessarily mean that your tenant is affected or that some of your users are experiencing problems. Service issues often come with vague descriptions of who might be affected, leaving customers wondering whether an issue impacts them. This creates a new challenge for administrators, who must work out whether an issue is relevant and, if it is, to what extent.
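One way to cut down the noise is to filter the reported issues against the services your organization actually uses. The following is a minimal sketch, assuming the issues have already been collected by some means and are held as plain dictionaries. The field names (`service`, `status`, `title`) and the status values are hypothetical and chosen for illustration; they are not a documented Microsoft schema. A filter like this only narrows the list. It cannot tell you whether your tenant is actually affected.

```python
# Hypothetical issue records: field names and status values are assumptions
# made for this example, not a documented Microsoft schema.
SERVICES_IN_USE = {"Exchange Online", "SharePoint Online"}
RESOLVED_STATUSES = {"resolved", "false positive"}


def relevant_issues(issues, services_in_use=SERVICES_IN_USE):
    """Return open issues that touch a service this organization relies on."""
    return [
        issue
        for issue in issues
        if issue["service"] in services_in_use
        and issue["status"].lower() not in RESOLVED_STATUSES
    ]


if __name__ == "__main__":
    sample = [
        {"service": "Exchange Online", "status": "Investigating",
         "title": "Delayed mail delivery"},
        {"service": "Yammer", "status": "Investigating",
         "title": "Feed not loading"},
        {"service": "SharePoint Online", "status": "Resolved",
         "title": "Search latency"},
    ]
    for issue in relevant_issues(sample):
        print(issue["service"] + ": " + issue["title"])
```

With the sample data, only the Exchange Online issue is kept: the Yammer issue concerns a service that is not in use, and the SharePoint Online issue is already resolved.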
The SHD also does not send automated alert notifications. One must purposefully log on to the SHD in order to view the latest health information. As a matter of fact, Microsoft uses the number of users reading an SHD alert to help determine the scope of impact for that issue. In the past, some issues in Office 365 were directly related to an outage in Azure Active Directory, which prevented access to the SHD. This also creates a Catch-22 situation for the Service Health Dashboard: the inability to authenticate to Azure AD prevents users from getting up-to-date Service Health Information. What good is a health dashboard if you risk being locked out of it?
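Because the dashboard has to be checked by hand, an administrator could script the check instead. The sketch below shows only the comparison step between two polls. It is written under assumptions: `fetch_issues` is a hypothetical callable standing in for whatever retrieves the current issue list, and the `id` field name is invented for the example. A failed poll is reported rather than ignored, since the Catch-22 described above means that being unable to reach the health information can itself be a symptom of an outage.

```python
def diff_poll(previous_ids, fetch_issues):
    """Compare one poll against the last one.

    fetch_issues is a hypothetical callable that returns a list of dicts
    with an "id" key, or raises an exception if the feed cannot be reached.
    """
    try:
        current = fetch_issues()
    except Exception as exc:
        # A failed poll is worth alerting on in its own right.
        return {
            "reachable": False,
            "new": [],
            "seen": set(previous_ids),
            "error": str(exc),
        }
    current_ids = {issue["id"] for issue in current}
    return {
        "reachable": True,
        "new": sorted(current_ids - set(previous_ids)),
        "seen": current_ids,
        "error": None,
    }


if __name__ == "__main__":
    outcome = diff_poll({"EX1"}, lambda: [{"id": "EX1"}, {"id": "SP2"}])
    print(outcome["new"])
```

The `seen` set from one call is meant to be passed as `previous_ids` to the next, so each issue triggers a notification only once. How the notification is delivered is left out of the sketch.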
Microsoft has become better and faster in terms of posting outages to the SHD, unlike during a 2015 outage, when it took them nearly eight hours to acknowledge the issue in the SHD. However, it remains extremely hard for Microsoft to close the gap with external monitoring solutions, as they cannot simply post messages to the SHD before assessing the issue and making sure that 1) customers are affected, and 2) the appropriate message is sent, so as not to create unnecessary confusion. The time between the outage happening and Microsoft being able to confidently assess the issue and craft an appropriate message for the SHD creates a void for many customers. During this time, they are left to wonder whether there was an outage and, if so, whether it was affecting them. One of the most common complaints we see on Twitter is, “All my users are affected. Why isn’t this in the Service Health Dashboard?”
Azure AD Connect Health
Another useful tool Microsoft provides to help you monitor your hybrid Office 365 deployment is Azure AD Connect Health. This tool is available if you have an Azure AD Premium subscription. It is an agent-based monitoring solution that helps you gain visibility into Azure AD Connect synchronization, AD FS, and on-premises Active Directory.
It supports AD FS 2.0 on Windows Server 2008 R2, Windows Server 2012, Windows Server 2012 R2 and Windows Server 2016. It also supports monitoring the AD FS proxy or web application proxy servers.
The main benefits are the usage reports and the notifications it sends when the directory sync engine stops working or users are unable to authenticate to AD FS. The information is presented in the Azure AD Connect Health portal. The installation is simple and does not take much time.
Although Azure AD Connect Health covers some of the components that are not included in the SHD, there are limitations that will leave gaps in your monitoring strategy. First, it is not integrated with the SHD, so an admin needs permissions to access the Azure Portal to view the monitoring results. This also means you do not have a single location to look at all the areas that can affect your users. It also creates a dependency on Azure AD which, in case of an outage, can render access to the portal impossible. Second, while it does provide the ability to alert on events, it does not offer any capabilities to integrate into third-party monitoring systems, which is a typical requirement for enterprise companies.
Azure AD Connect Health provides only limited insight into the components it covers. While it does perform synthetic transactions, it doesn’t monitor the network, the Office 365 service, hybrid server health, or connectivity and functionality from remote locations. This makes it very difficult for an administrator to figure out what caused an outage and whether it is something they can fix. Imagine assuming the issue is on Microsoft’s side and waiting for a fix that will never come.
The lack of visibility hinders an organization’s ability to respond appropriately to reported issues. The ability to detect where an outage stems from is crucial, as it ultimately allows you to drive down the Mean-Time-To-Resolution (MTTR) of reported issues. For instance, if a network issue prevented a remote location from accessing the AD FS servers, you would likely hear from users complaining they could not access the Office 365 service, but what would you do next? The SHD would not show a service outage, and AAD Connect Health would not have visibility into the users at the remote location. The ultimate resolution would be to resolve the network issue, which would be your responsibility.
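The reasoning in this example can be written down as a small decision rule. The sketch below is illustrative only. The probe types (`network`, `adfs`, `office365`), the location names, and the idea of running the same probes from every site are assumptions made for the example; they are not features of the SHD or of Azure AD Connect Health.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    location: str   # site the probe ran from (hypothetical names)
    component: str  # "network", "adfs" or "office365" (assumed probe types)
    ok: bool


def likely_fault_domain(results):
    """Guess where a fault lies from per-location probe results."""
    locations = {r.location for r in results}
    failed = {(r.location, r.component) for r in results if not r.ok}
    if not failed:
        return "no fault detected"

    def failing_everywhere(component):
        return all((loc, component) in failed for loc in locations)

    # Checked in dependency order: network, then federation, then the service.
    if any(component == "network" for _, component in failed):
        if failing_everywhere("network"):
            return "your network (all locations)"
        sites = sorted(loc for loc, component in failed if component == "network")
        return "your network (" + ", ".join(sites) + ")"
    if any(component == "adfs" for _, component in failed):
        return "your hybrid infrastructure (AD FS)"
    if failing_everywhere("office365"):
        return "Office 365 service (Microsoft)"
    return "inconclusive"


if __name__ == "__main__":
    probes = [
        ProbeResult("HQ", "network", True),
        ProbeResult("HQ", "adfs", True),
        ProbeResult("HQ", "office365", True),
        ProbeResult("Branch", "network", False),
        ProbeResult("Branch", "adfs", False),
        ProbeResult("Branch", "office365", False),
    ]
    print(likely_fault_domain(probes))
```

In the sample data, everything fails from the branch office while headquarters is healthy, so the rule points at the branch network rather than at AD FS or Microsoft. The components are checked in dependency order because a network failure at a site also makes every probe behind it fail from that site.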
The way you handle an outage is obviously different when the issue occurs within Microsoft’s data centers. In these cases, the challenge shifts to quickly and conclusively establishing that the issue is on Microsoft’s side. Doing this as soon as possible is crucial, because you do not want your management or users to be the ones asking you whether there is an issue. Finally, you can open a ticket with Microsoft and keep your users up to date on when the issue will be resolved.
In this post we covered the native Office 365 monitoring tools: their benefits, how they work, and their limitations. In the next and final post of the series, we will showcase how you can close the Office 365 monitoring gaps.
Monitor Your Hybrid - Office 365 Environment with ENow
ENow’s solution is like your own personal outage detector that pertains solely to your environment. It monitors all crucial components, including your hybrid servers, the network, and Office 365, from a single pane of glass. Knowing immediately when a problem happens, where the fault lies, and why the issue has occurred ensures that any outages are detected and solved as quickly as possible.