On February 7, 2023, at ~6:38 PM EST, Microsoft announced on Twitter (@MSFT365Status) that it was investigating an issue affecting access to Microsoft Teams services and functionality for users in the Asia-Pacific region. Initially, Microsoft provided a single service incident number, TM512596, for system admins to reference in the Microsoft 365 admin center.
This is the second Microsoft services incident in less than 24 hours. Earlier in the day, there was a significant outage in which many Microsoft users in North America experienced several hours of access and service issues for Outlook.
We’re investigating an issue affecting user access to Teams services and functionality for customers hosted within the Asia-Pacific region. Please refer to TM512596 in the Microsoft 365 admin center for further details.— Microsoft 365 Status (@MSFT365Status) February 7, 2023
A subsequent message from Microsoft indicated that rerouting work was underway and that some Microsoft Teams functionality was returning to those users impacted by the issue.
We've completed re-routing traffic and our monitoring indicates that Teams start up and sign in have mostly recovered. We'll continue working until the service has fully recovered. Additional details are available through TM512596 in the admin center.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
However, at this time the Microsoft Service Health Status and Microsoft Azure Status sites were disclosing more detailed information. According to Microsoft, on the evening of February 7, a utility power surge in a datacenter in the Southeast Asia region caused certain cooling units in the datacenter to shut down. Because the cooling units could not be brought back online fast enough, Microsoft proactively powered down certain compute and storage units in the impacted datacenter. This, in turn, impacted multiple downstream services.
Microsoft's messaging on Twitter (@MSFT365status) continued, now with a second service incident number, MO512648. Microsoft stated that service disruptions to Microsoft Teams included messaging delays and call failures, and that new reports of impacts to certain Microsoft 365 services were likely due to the same root cause as the Teams issues.
We're continuing to address the failing Teams scenarios that include but are not limited to messaging delays, call failures, and chat presence. Refer to TM512596 in the admin center for more detailed information.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
We've determined that various additional Microsoft 365 services are affected by the same root cause as TM512596. Updates for these services are being provided in the admin center under MO512648.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
At approximately 3:41 AM EST on February 8, Microsoft announced that the Microsoft Teams issue, service incident TM512596, was completely resolved. However, impacts to certain Microsoft 365 services continued.
TM512596 is now resolved. Please refer to the Service Health Dashboard in the admin center for more. We're continuing to monitor service health for M365 services impacted by this event. For more details on which services are affected, please refer to MO512648 in the admin center.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
Since 3:41 AM EST, there have been no additional tweet updates from Microsoft @MSFT365status as to service incident MO512648.
However, Microsoft has provided additional updates and information elsewhere. An initial disclosure on the Azure Status page about the datacenter cooling event indicated only that restoration efforts were continuing for the affected cooling units. Microsoft stated further that once datacenter temperatures had stabilized, it would begin restoring the compute and storage equipment that had been shut down. Microsoft also gave the following recommendation: "For workloads protected by Azure Site Recovery or Azure Backup, we recommend initiating a failover to the recovery region or to use Cross Region Restore. Newly deployed resources will by design bypass the impacted scale units."
At approximately 2:00 PM EST, Microsoft indicated, via its Azure Status page, that the datacenter cooling systems had been restored and that datacenter temperatures remained within normal thresholds. Several storage resources were back online, and the structured power-up process was continuing.
And as of 4:00 PM EST, Microsoft was indicating that ALL storage resources had been restored and that post-recovery checks were being performed. Also at this time, 70% of compute resources were fully recovered. More importantly, Microsoft stated that it was seeing a significant recovery of its downstream services which were impacted earlier.
Some 13 hours after its previous tweet, Microsoft posted an update via @MSFT365Status on service incident MO512648 and the impacts to Microsoft 365 services, stating that it was still working to resolve a hardware issue in an APAC region data center.
We've identified and are working to resolve a data center hardware issue that impacted multiple Microsoft 365 services within the APAC region. For more details on which services are affected and their recovery status, please refer to MO512648 in the admin center.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
Finally, approximately 24 hours after first reporting the issue, Microsoft confirmed that the data center issue and its downstream impacts to various Microsoft services were resolved. Another check of the Azure Service Health status page confirmed that Microsoft was reporting no ongoing issues at this time.
We've resolved the data center issue, and most users in the APAC region will no longer see impact to Microsoft 365 services; though, full restoration may take several more hours. We'll provide details on this process via MO512648 in the admin center.— Microsoft 365 Status (@MSFT365Status) February 8, 2023
In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of its environment and user experience. It is crucial to have visibility into business impacts the moment a service outage happens.
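As a rough illustration of what that visibility can look like in practice, the sketch below filters a list of service health issues down to the ones that are still open. It assumes issue records shaped like those we understand Microsoft Graph's service announcement endpoint (`/admin/serviceAnnouncement/issues`) to return; the field names `id`, `service`, and `isResolved` should be checked against the current Graph documentation. The sample data is hypothetical: it simply mirrors the state described above as of 3:41 AM EST, and the service labels are illustrative. The helper names `open_issues` and `summarize` are our own.

```python
# Hypothetical sample, shaped like items in the "value" array of a
# Microsoft Graph service health issues response (an assumption; verify
# field names against the Graph documentation before relying on them).
SAMPLE_ISSUES = [
    {"id": "TM512596", "service": "Microsoft Teams", "isResolved": True},
    {"id": "MO512648", "service": "Microsoft 365 suite", "isResolved": False},
]


def open_issues(issues):
    """Return only the issues that are not yet marked resolved."""
    # A missing "isResolved" field is treated as unresolved, so nothing
    # is silently dropped from the list an admin needs to watch.
    return [issue for issue in issues if not issue.get("isResolved", False)]


def summarize(issue):
    """Format one issue as a short line for a dashboard or alert."""
    return f"{issue['id']} ({issue['service']})"


if __name__ == "__main__":
    for issue in open_issues(SAMPLE_ISSUES):
        print("Still open:", summarize(issue))
```

In a real deployment the list would come from an authenticated call to the service health API rather than a hard-coded sample, and the output would feed an alerting channel instead of `print`.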
ENow’s Teams Monitoring and Reporting solution enables IT Pros to pinpoint the exact services affected and the root cause of the issues an organization is experiencing during a service outage by providing: