Office 365 Monitoring - Service Outages Blog

Microsoft Datacenter Cooling Event Causes Service Disruptions in Asia

Written by ENow Software | Feb 8, 2023 6:44:49 PM

On February 7, 2023, at ~6:38 PM EST, Microsoft announced on Twitter (@MSFT365status) that it was investigating an issue affecting access to Microsoft Teams and service functionality for users in the Asia-Pacific region. Initially, Microsoft provided a single service incident number, TM512596, for system admins to reference in the Microsoft 365 admin center.

This was the second Microsoft services incident in less than 24 hours. Earlier in the day, a significant outage left many Microsoft users in North America with several hours of access and service issues for Outlook.

A subsequent message from Microsoft indicated that rerouting work was underway and that some Microsoft Teams functionality was returning to those users impacted by the issue.

By this time, however, Microsoft's Service Health Status and Azure Status sites were disclosing more detailed information. According to Microsoft, on the evening of February 7, a utility power surge in a datacenter in the Southeast Asia region caused certain cooling units in the datacenter to shut down. Because the cooling units could not be brought back online fast enough, Microsoft proactively powered down certain compute and storage units in the impacted datacenter. This, in turn, impacted multiple downstream services.

Microsoft's messaging on Twitter (@MSFT365status) continued, now with a second service incident number, MO512648. Microsoft stated that service disruptions to Microsoft Teams included messaging delays and call failures, and that new reports of impacts to certain Microsoft 365 services were likely due to the same root cause as the Teams issues.

At approximately 3:41 AM EST on February 8, Microsoft announced that the Microsoft Teams issue, service incident TM512596, was completely resolved. However, impacts to certain Microsoft 365 services continued.

After 3:41 AM EST, Microsoft posted no further tweets from @MSFT365status about service incident MO512648 for many hours.

However, Microsoft provided additional updates and information elsewhere. An initial disclosure on its Azure Status page about the datacenter cooling event indicated only that restoration efforts continued for the affected cooling units. Microsoft further stated that once datacenter temperatures had stabilized, it would begin restoring the compute and storage equipment that had been shut down. Microsoft also gave the following recommendation: "For workloads protected by Azure Site Recovery or Azure Backup, we recommend initiating a failover to the recovery region or to use Cross Region Restore. Newly deployed resources will by design bypass the impacted scale units."

At approximately 2:00 PM EST, Microsoft indicated, via its Azure Status page, that the datacenter cooling systems had been restored and that datacenter temperatures remained within normal thresholds. Microsoft added that several storage resources were back online and that its structured power-up process was continuing.

As of 4:00 PM EST, Microsoft indicated that all storage resources had been restored and that post-recovery checks were being performed. At that time, 70% of compute resources were fully recovered. More importantly, Microsoft stated that it was seeing a significant recovery of the downstream services that had been impacted earlier.

Some 13 hours after its previous tweet, Microsoft communicated via @MSFT365status that, with respect to service incident MO512648 and the impacts to Microsoft 365 services, it was still working to resolve a hardware issue in an APAC-region datacenter.

Finally, approximately 24 hours after first reporting the issue, Microsoft confirmed that the entire datacenter issue and its downstream impacts to various Microsoft services were completely resolved. Another check of the Azure Service Health status page confirmed that Microsoft was reporting no ongoing issues at that time.

The Importance of Teams Monitoring

In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of its environment and user experience. It is crucial to have visibility into the business impact of a service outage the moment it happens.

ENow’s Teams Monitoring and Reporting solution enables IT pros to pinpoint the exact services affected and the root cause of the issues an organization is experiencing during a service outage by providing:

  • The ability to monitor networks and entire environments in one place with ENow’s OneLook dashboard, which makes identifying a problem fast and easy without having to scramble through Twitter and the Service Health Dashboard looking for answers.
  • A full picture of all services and subsets of services affected during an outage with ENow’s remote probes, which cover several Office 365 apps and other cloud-based collaboration services.

Identify the scope of Teams outage impacts and restore workplace productivity with ENow’s Teams Monitoring and Reporting solution. Access your free 14-day trial today!