Microsoft has had a bad run lately… First there was Solorigate, a major supply-chain attack (hack), through which attackers seem to have been able to access and copy (parts of) source code for various Microsoft (cloud) solutions, including Exchange and Intune. Next, at the beginning of March, there was ‘Hafnium’, a severe vulnerability in Exchange Server which left all on-premises installations of Exchange Server vulnerable to exploitation by a web shell, enabling attackers to fairly easily establish foothold into the on-premises environment. And then this week, Office 365 suffered from a severe outage which lasted many hours and left countless organizations without access to their data and applications… And then, today, another outage has left many organizations in EMEA without the ability to access specific features and services in Microsoft 365.
Despite this streak of bad luck, I’m still inclined to say that Microsoft is doing a hell of a job running the world’s largest productivity platform. Halas, with great power comes great responsibility, something which became very clear during recent outages. Looking back at the Azure AD outage, which eventually led to the downfall of all dependent services as well, it looks like a simple bug wreaked havoc through the environment. Without going into too much detail about the cause, according to Microsoft, a script responsible for cleaning up unused keys in the authentication process erroneously removed a key that was supposed to be retained. As a result, applications and services no longer trusted tokens signed by the key, resulting to the inability to logon to services. The result was a multi-hour outage, in which organizations world-wide couldn’t access their data or services in Microsoft’s cloud.
It isn’t the first time that Azure AD proves to be the proverbial Achilles heel of Microsoft 365. The previous outage, dated September 2020, was fairly similar in terms of impact. In the preliminary report of the outage, Microsoft takes blame and also mentions they are working to ensure this doesn’t happen again. A new Safe Deployment Process is being put in place and is designed to prevent these widespread outages from happening again. The decision to rollout this new safety feature was made after the previous outages but isn’t scheduled to be finished until mid-2021. We can only hope (and trust) that it’s the right call and that between now and then no more outages of this kind happen!
If anything, outages like this remind us that even the best teams aren’t infallible. I’m sure that, amidst the storm of the outage, many administrators were faced with tough (and many) questions about the outage, and what they were doing to solve the problem... Even though there wasn’t much you could do, it is IT’s responsibility to have a plan. It is true that a (large) part of ‘moving to the cloud’ is giving up control. In the case of Office 365 and – by extension Microsoft’s online services – you trust Microsoft to keep services up and running smoothly. However, what you do not give up is the responsibility to have a backup plan ready, nor are you giving up the responsibility to be in the know of what’s happening.
Over the years, Microsoft has become better at communicating about outages. Yet, it still takes time before an outage report is available and, in the case of an outage in Azure AD, you can’t really access the Service Health dashboard in your tenant. When these things happen, you need to have alternative ways to stay on top of what’s happening. One way of doing so, is to ensure you’re following the right profiles on Twitter, like @MSFT365Status, through which Microsoft regularly communicates about ongoing issues and outages.
Then there’s also the element of monitoring. For a long time, I have gone back and forth on the topic: should you monitor Microsoft’s services when you are paying them to do it? The answer lies somewhere in the middle. I wouldn’t (try to) monitor their servers and resource usage. You wouldn’t even be able to. But I would recommend monitoring your user experience so that you are aware of what’s happening in your environment. By proactively testing your user experience - for example, through synthetic transactions – you will quickly realize when issues arise and be able to proactively communicate with your users, raise a ticket with Microsoft and inform management. Keep in mind: it’s all about controlling the narrative…! It’s the oldest trick in the Marketing book, but an important one – also in IT.
What is also important, perhaps even more important, is having a plan ready for when things take a turn for the worse. Although outages like these are fairly uncommon, they do happen. It’s not the first time this has happened, and we know for sure it won’t be the last time. Perhaps it won’t be Azure AD causing trouble next time, but outages will continue to happen. Don’t get me wrong, I’m not betting on Microsoft making mistakes or delivering poor quality; it’s just the way it is. Whenever code is written (by humans), errors will be made, bugs will exist, and software will fail. Hardware will fail too. Calamities will happen. It’s Murphy’s law.
Going back to the point I was trying to make; you need to plan for outages like these so that you have a playbook ready that you can enact upon. Call it a Business Continuity plan of some sorts. After all: how many businesses can afford to run an entire day without their productivity platform? Amongst other things, elements that your plan must cover include:• How do you communicate with your users and management when e-mail is down?
• How do you stay up to date with the latest updates around the issue?
• When do you make the decision to (temporarily) switch to other services so that you can continue to run your business?
• If you make that decision: who will carry out what tasks? How can you keep the switchover period as short as possible?
• How and when will you switch back to ‘regular operations’?
Some might argue that an outage window of 9 hours is still too short to invoke a full-blown business continuity plan. Whilst there is some truth to that, not all outages require the same response. Let’s take an example: imagine you are using Teams with voice (PBX) integration. When Teams is unavailable, you wouldn’t be able to place or receive phone calls. If you’re in a line of business for which calling is really important, you can appreciate that a 9-hour outage window is not acceptable. In such case, you could include in your plan what alternative you could enable during an outage, and when you will make the decision to switch over to that alternative. This could be anything, from telling your users to use their mobile phones, asking your carrier to temporarily route incoming calls to cell phones rather than the SIP trunk, etc. You could envision something similar for email. For example, you could consider switching your MX records to an alternate system so that people can continue to send and receive email.
Whatever alternative solution you come up with, neither are a permanent replacement for Office 365, but they can alleviate the pressure when an outage cripples your ability to run your business.
Don't wait for the next outage. Contact us today to start monitoring Microsoft Office 365 to ensure you're prepared for the next one.