
The Annual Summer Office 365 Outage, 2015 Edition: What to Take Away From It

It seems as if every summer something seemingly innocuous happens in a Microsoft datacenter halfway around the world and spreads through the service like wildfire, taking down access for vast numbers of customers. It happened at the end of June 2014, when Exchange Online and Lync Online were down for hours, and it has just happened again this week.


The Outage

Here is how it started: at 5:28 PM Eastern time, Microsoft officially opens an incident report:

Current Status:  Engineers are investigating an issue in which some customers may be  experiencing problems accessing or using Exchange Online services or features.  This event is actively being investigated. More information will be provided  shortly.

Twenty-nine minutes later, at 5:57 PM Eastern, the company acknowledges that the outage is reasonably widespread:

User Experience:  Affected users are unable to connect to the Exchange Online service when using  multiple protocols including Outlook, Outlook Web App (OWA), Exchange  ActiveSync (EAS), and Exchange Web Services (EWS). 

Customer Impact: A higher than average number of customers are reporting this  issue. Analysis indicates that customers will likely have some users  experiencing this issue.

Thirty-three minutes later, at 6:30 PM Eastern, more explanation arrives (keep in mind that access has been down or very intermittent for a lot of users for over an hour at this point):

The investigation  determined that a portion of infrastructure which facilitates authentication to  the service is experiencing higher-than-normal resource usage. Engineers are  analyzing service telemetry to determine what is causing the high resource  usage.

Thirty minutes later, we get an admission that, whoops, all of us are beta testers for Microsoft (didn’t you know that?) and that a programmed update has knocked out e-mail for millions of people for an hour and a half. At 7 PM Eastern (92 minutes into this outage):

Engineers have  determined that this issue may be related to a recent update to the service and  are currently working to revert the update.

Another 37 minutes go by, and at 7:37 PM Eastern time, two hours and nine minutes into the outage, we finally get some progress on fixing the issue:

Engineers are making  progress in reverting the update which is believed to be causing this issue.  The update has been reverted in the Latin American region and users hosted from  that region should begin to see relief. Reversion of the update in the North  American region is underway; as it progresses users there should also  experience service restoration.

By 8:35 PM Eastern time, 58 minutes later and three hours and seven minutes into the outage, the fix is complete and Microsoft expects everyone to see usable e-mail again:

Engineers have  reverted the update in all regions and affected customers should be  experiencing service restoration. Service teams are validating that the  configuration change has been applied correctly and that service health is  recovering as expected.

Of course, it takes another one hour and four minutes for the mail queues that have built up during this outage to work their way through the filter and get delivered, so now, four hours and 11 minutes into the outage, things finally seem to be looking up:

Engineers have  validated that the configuration change has been applied to the affected  infrastructure and are continuing to monitor service health. Affected customers  should experience service restoration as mail queues continue to drain.

And finally, at 10:17 PM Eastern, nearly five hours after the incident was first reported, Microsoft says the service is restored and healthy:

After validating that  the configuration change was successfully applied to the affected  infrastructure and mail queues have drained, engineers confirmed that service  is restored.
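If you want to check the arithmetic in that timeline, or measure how long you went between updates, a few lines of Python will do it. This is only a sketch: the timestamps are the ones quoted above, except 9:39 PM, which is not stated outright and is my assumption, derived from the "one hour and four minutes" after 8:35 PM. The names `UPDATES`, `to_minutes`, and `timeline` are made up for illustration.

```python
# Eastern-time timestamps of the status updates quoted in the timeline above.
# "9:39 PM" is an assumption: it is derived from 8:35 PM plus 1 hour 4 minutes.
UPDATES = ["5:28 PM", "5:57 PM", "6:30 PM", "7:00 PM",
           "7:37 PM", "8:35 PM", "9:39 PM", "10:17 PM"]


def to_minutes(clock):
    """Convert a string like '5:28 PM' to minutes since midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = map(int, time_part.split(":"))
    if meridiem.upper() == "PM" and hours != 12:
        hours += 12
    if meridiem.upper() == "AM" and hours == 12:
        hours = 0
    return hours * 60 + minutes


def timeline(updates):
    """Return (timestamp, minutes since previous update, minutes since first)."""
    minutes = [to_minutes(u) for u in updates]
    start = minutes[0]
    rows = []
    for label, prev, cur in zip(updates[1:], minutes, minutes[1:]):
        rows.append((label, cur - prev, cur - start))
    return rows


if __name__ == "__main__":
    for label, gap, elapsed in timeline(UPDATES):
        hours, mins = divmod(elapsed, 60)
        print(f"{label:>8}  +{gap:>2} min  ({hours}h {mins:02d}m into the outage)")
```

Run against these timestamps, the longest gap between updates is 64 minutes and the last update lands 4 hours 49 minutes after the first.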

Five Hours of Radio Silence for You

Did your users appreciate not being able to send or receive e-mail from within Outlook? Were they complaining to you? Did you have any idea what was going on? Was your first instinct to check the Office 365 portal for more information, or did you spin your wheels and waste time troubleshooting non-existent issues on your network before finally concluding that you were up and it was Microsoft that was down?

During last year’s outage, one of the main complaints was that Microsoft was slow to update the Service Health Dashboard about the issue and slow to provide regular updates on progress toward resolution and service recovery. We can say that during this outage, updates were provided: it looks like no more than an hour or so passed between them. But an hour is a long time, especially in an extended multihour outage like this one, and the Service Health Dashboard is not easy to find.

Gaining Visibility into the Issue

When you have hundreds of users shouting that their e-mail isn’t working, and you’re just the guy in the middle looking for information to pass on to them, you need all the context and data you can get. That’s where Mailscape365 from ENow Software comes in. The premise is simple but powerful: when you can notify your users early about an incident in progress, and when your own reporting helps you keep them up to date on progress toward restoration of service, you have created an environment that is a cut above the rest. Customers using Mailscape365 were instantly in the know through the product’s easy-to-use dashboard with green, yellow, and red indicators. No guessing, no trying to decipher Service Health Dashboard messages.

While an outage is never pleasant and never welcome, operating in the dark is worse. Don’t let that happen to you. As I mentioned, this certainly isn’t the first Exchange Online outage, and you know it won’t be the last.
