If you are experiencing problems with an Office 365 service, the native option is to check Microsoft’s Service Health Dashboard (SHD) in the Microsoft 365 admin center before you call support or spend valuable time troubleshooting. The SHD tells you whether the problem is a known issue with a resolution in progress. However, the information it provides is of limited use, because it focuses on overall service health rather than on tenant-specific or user-specific problems.
In this article, we explain SHD’s advisory and incident summaries, status definitions, message post types, and monitoring limitations to give you a better understanding of the extent of visibility it provides you when an outage or incident occurs.
Understanding Advisory and Incident Summaries
The advisory or incident summary provides the following information:
- Title - A summary of the problem.
- ID - A numeric identifier for the problem.
- Service - The name of the affected service.
- Last updated - The last time that the service health message was updated.
- Estimated start time - The estimated time when the issue started.
- Status - How this problem affects the service.
- User impact - A brief description of the impact this issue has on the end user.
- All updates - Microsoft posts frequent messages to report the progress it is making in applying a solution.
Incidents and advisories
- If a service has an advisory shown, Microsoft is aware of a problem that is affecting some users, but the service is still available. An advisory often comes with a workaround, and the problem may be intermittent or limited in scope and user impact.
- If a service has an active incident shown, it is a critical issue and the service or a major function of the service is unavailable. For example, users may be unable to send and receive email or unable to sign in. Incidents have a noticeable impact on users. While an incident is in progress, Microsoft posts updates on the investigation, mitigation efforts, and confirmation of resolution in the Service health dashboard.
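The summary fields and the advisory/incident distinction above can be modeled as a small record, which helps if you copy SHD entries into your own tracking. This is a minimal sketch, not Microsoft's data model: the class name `IssueSummary`, the `kind` field, and the `triage` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class IssueSummary:
    """One SHD entry, using the summary fields listed above."""
    title: str
    issue_id: str
    service: str
    last_updated: datetime
    estimated_start_time: datetime
    status: str
    user_impact: str
    kind: str  # "advisory" or "incident" (hypothetical field name)
    updates: List[str] = field(default_factory=list)


def triage(issues: List[IssueSummary]) -> List[IssueSummary]:
    """Order issues so incidents come first, newest update first within each kind."""
    # Python's sort is stable, so sorting by recency and then by kind
    # preserves the recency order inside each group.
    by_recency = sorted(issues, key=lambda i: i.last_updated, reverse=True)
    return sorted(by_recency, key=lambda i: i.kind != "incident")
```

Sorting incidents ahead of advisories reflects the article's point that an incident means a service or major function is unavailable, while an advisory leaves the service usable.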
Understanding Status Definitions
- Investigating - Microsoft is aware of a potential issue and is gathering more information about what is going on and the scope of impact.
- Service degradation - Microsoft has confirmed that there is an issue that may affect use of a service or feature. You might see this status if, for example, a service is performing more slowly than usual, there are intermittent interruptions, or a feature isn't working.
- Service interruption - You'll see this status if Microsoft determines that an issue affects users' ability to access the service. In this case, the issue is significant and can be reproduced consistently.
- Restoring service - The cause of the issue has been identified, Microsoft knows what corrective action to take, and is in the process of bringing the service back to a healthy state.
- Extended recovery - This status indicates that corrective action is in progress to restore service to most users but will take some time to reach all the affected systems. You might also see this status if Microsoft has made a temporary fix to reduce impact while a permanent fix is waiting to be applied.
- Investigation suspended - You'll see this status if Microsoft's detailed investigation of a potential issue results in a request for additional information from customers before it can investigate further. If you need to act, Microsoft tells you what data or logs it needs.
- Service restored - Microsoft has confirmed that corrective action has resolved the underlying problem and the service has been restored to a healthy state. To find out what went wrong, view the issue details.
- False positive - After a detailed investigation, Microsoft has confirmed the service is healthy and operating as designed. No impact to the service was observed or the cause of the incident originated outside of the service. Incidents and advisories with this status appear in the history view until they expire (after the period of time stated in the final post for that event).
- Post-incident report published - Microsoft has published a Post Incident Report for a specific issue that includes root cause information and next steps to ensure a similar issue doesn't recur.
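If you track SHD entries in your own tooling, the status definitions above can be captured as an enumeration. The sketch below uses the display labels exactly as listed in this article. The split into "active" and "closed" statuses is an assumption made for illustration, not a Microsoft definition, and `parse_status` and `is_active` are hypothetical helper names.

```python
from enum import Enum


class Status(Enum):
    INVESTIGATING = "Investigating"
    SERVICE_DEGRADATION = "Service degradation"
    SERVICE_INTERRUPTION = "Service interruption"
    RESTORING_SERVICE = "Restoring service"
    EXTENDED_RECOVERY = "Extended recovery"
    INVESTIGATION_SUSPENDED = "Investigation suspended"
    SERVICE_RESTORED = "Service restored"
    FALSE_POSITIVE = "False positive"
    POST_INCIDENT_REPORT_PUBLISHED = "Post-incident report published"


# Assumption: these statuses mean no further user impact is expected.
# "Investigation suspended" is treated as still open because Microsoft
# is waiting on customer input.
CLOSED = frozenset({
    Status.SERVICE_RESTORED,
    Status.FALSE_POSITIVE,
    Status.POST_INCIDENT_REPORT_PUBLISHED,
})


def parse_status(label: str) -> Status:
    """Match a display label, ignoring case and extra whitespace."""
    normalized = " ".join(label.split()).casefold()
    for status in Status:
        if status.value.casefold() == normalized:
            return status
    raise ValueError(f"Unknown status label: {label!r}")


def is_active(status: Status) -> bool:
    """Return True while the issue may still be affecting users."""
    return status not in CLOSED
```

Raising on an unknown label, rather than guessing, keeps a new or renamed status from being silently treated as healthy.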
Understanding Message Post Types
- Quick Update - Short and frequent incremental updates for broadly impacting incidents, available to all customers.
- Additional Details - Additional posts that provide richer technical and resolution details to offer deeper visibility into the handling of incidents. These posts are available for tenants that meet the same requirements outlined for Exchange Online monitoring.
Understanding Monitoring Limitations
- Only people who are assigned the global admin or service support admin role can view service health.
- Most of the time, services appear as healthy with no further information. When a service is having a problem, the issue is identified as either an advisory or an incident and shows a current status. However, the root cause is not communicated in this view, which limits your understanding of the full scope of impact across your organization and end users.
- If there is an active incident or advisory for a service, it is listed directly under the service name in a nested table. It is up to you to filter, open, and collapse this data, as there is no quick, simple dashboard view that pinpoints where there may be an outage or incident.
- The SHD is a trailing indicator of problems and gives you only part of the end-to-end service view.
- The History tab shows all incidents and advisories that have been resolved within the last 7 or 30 days. Unfortunately, there is no way to see or report on historical data beyond this timeframe.
- The SHD provides information on components that Microsoft is responsible for, but does not monitor or report on outages caused by other components, such as your local network, Internet connection, or hybrid infrastructure such as Directory Synchronization, federation health, mail flow, or AD FS.
- There is a delay between the start of an outage and the point at which Microsoft can confidently assess the issue and post an appropriate response to the SHD. During that gap, many customers have no information at all.
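One way to work around the 30-day history limit noted above is to keep your own archive of issues. The following is a minimal sketch under stated assumptions: how you obtain each snapshot is out of scope, every issue is a dictionary with `id` and `last_updated` keys (hypothetical field names), and `last_updated` is an ISO 8601 UTC string in a consistent format so that string comparison matches time order.

```python
import json
from pathlib import Path


def merge_into_archive(archive: dict, snapshot: list) -> dict:
    """Return a new archive with each snapshot issue added or refreshed.

    Issues already in the archive are kept even after they drop out of
    the dashboard's history window.
    """
    merged = dict(archive)
    for issue in snapshot:
        existing = merged.get(issue["id"])
        # Keep whichever copy was updated most recently.
        if existing is None or issue["last_updated"] >= existing["last_updated"]:
            merged[issue["id"]] = issue
    return merged


def load_archive(path: str) -> dict:
    """Read the archive file, or start empty if it does not exist yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_archive(path: str, archive: dict) -> None:
    """Write the archive as sorted, indented JSON."""
    Path(path).write_text(
        json.dumps(archive, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Run on a schedule shorter than the history window, `save_archive(path, merge_into_archive(load_archive(path), snapshot))` accumulates a record you can report on later.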
Exchange Hybrid and Office 365 Monitoring and Reporting
On-premises components such as AD FS, Pass-through Authentication (PTA), and Exchange Hybrid are critical to the Office 365 end-user experience. In addition, something as simple as an expired Exchange or AD FS certificate can lead to an unexpected outage. By proactively monitoring hybrid components, ENow gives you early warning when a hybrid component is reaching a critical state or a certificate is about to expire. Knowing immediately when a problem happens, where the fault lies, and why the issue has occurred ensures that any outages are detected and solved as quickly as possible.
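To illustrate the certificate point, here is a generic sketch of an expiry check using only the Python standard library. It is not how ENow implements its monitoring. The function names and the 30-day warning threshold are assumptions chosen for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field from the certificate a TLS endpoint presents.

    An already-expired certificate fails verification here and raises,
    which is itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after(host), datetime.now(timezone.utc))` for each published endpoint, such as your federation service name, and raise an alert when it returns True.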