Microsoft customers often worry about the threat of a large, widespread outage. What they often don’t realize is that they are being worn down by an accumulation of smaller outages that are less damaging but still disruptive. A couple of deeper issues warrant a closer look to understand what the real risk is and what you can do about it.
Back in April 2021, Microsoft proudly updated its financially backed SLA for Office 365 to 99.99%. Microsoft has committed to the resilience and availability of Azure AD, with more built-in redundancy and availability features than any on-premises customer can hope to provide to its users. The devil is in the details, though: that 99.99% SLA is an aggregate across all users of the service worldwide. With such a large subscriber base, an Office 365 outage that affects 3.5 million users for 12 hours would reduce the quarterly SLA by only a tiny amount. Of course, those 3.5 million users will not be comforted by this fact if they cannot work because of the outage. Microsoft’s own SLA is therefore a poor yardstick, because even a widespread or long-lasting outage might not make a real dent in the SLA number.
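To see how small that dent can be, here is a rough back-of-the-envelope calculation. The 300 million subscriber count and the 90-day quarter are hypothetical figures chosen only for illustration, and the formula (user-hours delivered divided by user-hours possible) is an assumption about how an aggregate number could be computed, not Microsoft’s published method.

```python
def aggregate_availability(total_users, period_hours, affected_users, outage_hours):
    """Return availability (%) measured across all user-hours in the period."""
    possible_user_hours = total_users * period_hours
    lost_user_hours = affected_users * outage_hours
    return 100.0 * (possible_user_hours - lost_user_hours) / possible_user_hours


TOTAL_USERS = 300_000_000   # hypothetical subscriber base, for illustration only
QUARTER_HOURS = 90 * 24     # assumes a 90-day quarter

# The outage described above: 3.5 million users down for 12 hours.
pct = aggregate_availability(TOTAL_USERS, QUARTER_HOURS, 3_500_000, 12)
print(f"Aggregate quarterly availability: {pct:.4f}%")
```

Under these assumptions the aggregate figure stays above 99.99%, even though 3.5 million people lost half a day of work.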
Back in the days of BPOS, Exchange Online was the only real usable cloud workload in Microsoft’s portfolio. Those days are long gone; Exchange has been joined by a trove of collaboration applications (SharePoint, OneDrive, and Teams). Adding these services is Microsoft’s way of trying to make Office 365 more sticky, and to help organizations justify the cost and effort of adopting Office 365 by giving them more service power per dollar.
The problem occurs when people start using these additional services. An Office 365 MVP makes this point well: getting a few dollars off your Power BI licenses because of an outage won’t make up for the cost of not being able to use Power BI to make business decisions (or, worse, the cost imposed by making a bad decision).
It is almost always the case that newly added services lag behind the “big 2”, Exchange and SharePoint, in reliability and continuity, merely because the workloads are less mature (and also, possibly, because the operational and support processes for running them are less mature). The takeaway here is that when your business relies on one of these secondary workloads, you may not be able to depend on either the same service quality or SLA that the big 2 offer.
A sampling of recent outages includes these:
There might have been other outages or service incidents, too. An exact count is not critical, because it does not change the main point here: a small outage can be just as bad for you as a large one.
Mailscape 365 helps you monitor Office 365 and protect against and respond to small and large outages, disruptions, and interruptions. Because we do not depend on any single test to tell you whether your Office 365 service is healthy or not, outages that only affect one of the workloads (or part of a single workload) still trigger alerts to tell you what is going on. For example, our user experience monitoring probes independently test MAPI and Exchange Web Services access to Office 365 from your user locations—so if Outlook Anywhere breaks but EWS is up, or vice versa, you will know, and you can plan accordingly to help affected clients. Our synthetic monitoring tests do the same thing that users do, using the same operations that Microsoft’s own client applications do.
Because we allow you to monitor user experience from all your locations, we give you full visibility into outages no matter how large or small they are. Because we monitor multiple workloads (and parts of those workloads), you get rapid, actionable insight into exactly which services are broken, what the impact is to your users, and what you need to do to let them keep working.
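The idea of independent probes per workload and per location can be sketched in a few lines. This is a generic illustration, not Mailscape 365’s implementation: the `ProbeResult` type, the `run_probes` and `failures` helpers, the location names, and the stubbed checks are all hypothetical, and a real probe would perform an actual MAPI or EWS operation instead of returning a fixed value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ProbeResult:
    location: str
    workload: str
    healthy: bool


def run_probes(probes: Dict[Tuple[str, str], Callable[[], bool]]) -> List[ProbeResult]:
    """Run every probe independently so one failure cannot mask another."""
    results = []
    for (location, workload), check in probes.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False  # a probe that raises is treated as a failure
        results.append(ProbeResult(location, workload, healthy))
    return results


def failures(results: List[ProbeResult]) -> List[ProbeResult]:
    """Return only the location/workload pairs that are currently failing."""
    return [r for r in results if not r.healthy]


# Stubbed checks (hypothetical): MAPI is broken from one site, EWS is fine everywhere.
probes = {
    ("Chicago", "MAPI"): lambda: False,
    ("Chicago", "EWS"): lambda: True,
    ("London", "MAPI"): lambda: True,
    ("London", "EWS"): lambda: True,
}

for r in failures(run_probes(probes)):
    print(f"ALERT: {r.workload} is failing from {r.location}")
```

Because each location and workload pair is evaluated on its own, a partial outage produces an alert that names exactly what is broken and where, rather than a single up-or-down verdict.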
Of course, many Office 365 problems do not originate on Microsoft’s end. Problems with your own hybrid servers (including dirsync and AD FS) or with network connectivity can also keep your users from being productive. We monitor and report on those components too, so whether an outage has its roots in Microsoft’s data center or yours, you can take the necessary action.
The major reason that organizations adopt Office 365 is to take advantage of Microsoft’s promise of better service quality at a lower cost than organizations can provide for themselves. Microsoft overall does an excellent job of delivering on this promise, but with such a complex infrastructure underlying their global Office 365 deployment, you need to have an independent view of whether the service is available and ready for your users.
In a cloud world, outages are bound to happen. ENow’s Office 365 SLA Monitoring and Reporting solution enables you to pinpoint the exact services affected and the root cause of the issues you experience during a service outage. With a consolidated SLA monitoring view and precise reporting, you can track Office 365 performance against Microsoft’s financially backed SLA.
Office 365 administrators can quickly see the uptime for the month and confirm whether Microsoft met the published monthly SLA availability for each workload. This data allows Office 365 administrators to submit for SLA credits if the SLA was violated due to a Microsoft problem.
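As a minimal sketch of that monthly check, the snippet below computes an uptime percentage per workload and flags any that fall below a target. The downtime figures are invented sample data, the 30-day month is a simplification, and the 99.9% target is an assumed value; substitute the threshold and measurement rules from your own agreement with Microsoft.

```python
def monthly_uptime_pct(total_minutes, downtime_minutes):
    """Uptime as a percentage of the minutes in the month."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def workloads_below_target(downtime_by_workload, total_minutes, target_pct):
    """Return {workload: uptime %} for every workload that missed the target."""
    missed = {}
    for workload, downtime in downtime_by_workload.items():
        pct = monthly_uptime_pct(total_minutes, downtime)
        if pct < target_pct:
            missed[workload] = pct
    return missed


MINUTES_IN_MONTH = 30 * 24 * 60  # assumes a 30-day month
TARGET_PCT = 99.9                # assumed target; use the figure from your own SLA

# Invented sample data: minutes of downtime observed per workload this month.
downtime = {"Exchange Online": 10, "SharePoint Online": 0, "Teams": 90}

below = workloads_below_target(downtime, MINUTES_IN_MONTH, TARGET_PCT)
for workload, pct in below.items():
    print(f"{workload}: {pct:.3f}% uptime is below the {TARGET_PCT}% target")
```

With this sample data only Teams falls short, which is the kind of per-workload evidence an administrator would need before filing a credit claim.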
Identify the scope of Office 365 service outage impacts and restore workplace productivity with ENow’s Office 365 Monitoring and Reporting solution. Access your free 14-day trial today or request a demo of our Office 365 SLA Reporting!