ENow Blog | M365 - Exchange Online Center

The Office 365 Domino Effect

Written by Jeff Guillet (MVP, MCSM) & Justin Harris (MVP, MCSM, MCM) | Sep 6, 2018 5:27:23 PM

Over the last several years Microsoft has made tremendous headway in showcasing the value proposition of the Office 365 platform and suite of collaboration tools. In fact, the argument can be made that the Office 365 suite of tools has helped fuel the team-based way of working.


A recent report shows that 80% of your time during work hours is spent collaborating, for example in conference calls or meetings.1 McKinsey sheds additional light on this increase in collaboration, stating that 45% of workers use social technologies to accomplish daily tasks.2

The culture of work has certainly changed and we are all dependent minute-by-minute on our mobile devices, email, document storage, and team-based chat services.

What happens when these cloud services we are so dependent on no longer work? Think about what a day of collaboration during meetings would look like for your end users if email, document storage, and team-based chat services were not available.

For many in North America, this exact scenario played out starting on Tuesday, September 4th.

What Happened?

A lightning strike that affected power to a Microsoft San Antonio datacenter resulted in a series of service outages that at first did not seem to be interconnected.

Office 365 users and admins around the globe were unable to sign in to Office 365 cloud services on Tuesday, September 4th, 2018. The incident was reported on the Office 365 Service Health Dashboard as a Power BI access issue at 9:09 AM UTC, but other services started reporting access and functionality issues throughout the day.

Once the first domino fell on Tuesday, September 4th, the rest were right behind in a series of chain reactions.

It’s been quite an eventful 48 hours for customers consuming Azure and Office 365 services.

The outage stemmed from a cooling problem in the San Antonio, TX datacenter, caused by a voltage increase from lightning strikes during severe weather. This led to a temperature spike that triggered automated datacenter procedures to power down equipment to protect data and hardware. At 5:41 PM UTC, Microsoft reported it was working to re-route traffic from the affected environment to healthy infrastructure.

The infrastructure involved was the Azure AD service, which is the underpinning of all Office 365 services.

What was the impact?

Without access to Azure AD, no one can authenticate to use workloads like email, Teams, or even something like Power BI. Services reporting access issues included the Office 365 Portal, Microsoft Intune, Microsoft Teams, and Skype for Business.
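The chain reaction described above can be sketched as a walk over a dependency map. The Python sketch below is illustrative only: the map is a simplification built from the services named in this post, not Microsoft's actual architecture, and the names `DEPENDS_ON` and `impacted` are hypothetical.

```python
from typing import Dict, List, Set

# Simplified, assumed dependency map: each service lists what it relies on.
# Built only from the services named in this post; the real topology is
# far more complex and is not published in this form.
DEPENDS_ON: Dict[str, List[str]] = {
    "Azure AD": ["San Antonio datacenter"],
    "Exchange Online": ["Azure AD"],
    "Microsoft Teams": ["Azure AD"],
    "Power BI": ["Azure AD"],
    "Office 365 Portal": ["Azure AD"],
    "Microsoft Intune": ["Azure AD"],
    "Skype for Business": ["Azure AD"],
}


def impacted(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    hit: Set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in hit:
                hit.add(service)          # this domino falls
                frontier.append(service)  # and may knock over others
    return hit


# One failure at the bottom of the map reaches every service above it.
for name in sorted(impacted("San Antonio datacenter", DEPENDS_ON)):
    print(name)
```

In this toy model a single failure at the datacenter level reaches all seven services above it, while a failure in a leaf service such as Power BI affects nothing else.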

The @Office365Status Twitter account posted:

“We’re investigating an issue where users may be unable to access Office 365 Online services. Further information can be found under MO147606 in the Admin center or on https://status.office.com.”

Affected users quickly discovered the problem with pointing everyone to the Admin center for updates on the issue: they could not sign in to it.

Alex Simons, Corporate Vice President PM, Microsoft Identity Division, posted on Twitter at 5:35 PM UTC:

“Many of you are asking - here's a quick update. Azure AD Service has been experiencing load issues in North America with several high volume tenant doing massive auth retries. Availability in some NA tenants has dropped to 70%. Tenants in Europe and Asia are not effected.”

In other words, at this point in the outage roughly 30% of Azure AD requests for some North American tenants were not being serviced.

A cause for concern here is that Azure AD underpins all the Office 365 collaboration services we rely on like email, document storage, and team-based chat services.

To say that tenants in Europe and Asia were not affected was not, in fact, correct. Quite a number of European customers were affected, mostly because all SMS messages used by MFA are serviced through affected North American data centers.

At 5:49 PM UTC he posted that, “Azure AD availability has now returned to 99.99% worldwide. Again, we are deeply sorry for any negative impacts.”

Even 24 hours later, Exchange Online and Skype are reporting throttling issues related to the fixes introduced for the original problem.

Logging into an Office 365 mailbox via Outlook on September 5th greeted many with a white box containing the word “Throttled”.

The lack of transparency, and the inability to obtain a complete, real-time picture of the impact on your end users during an outage, can lead to frustration. There is little doubt that Microsoft delivers better uptime and resiliency with Office 365 than typical on-premises IT organizations. No matter how well you plan, outages will still occur.

It is interesting to note that this week’s outage comes on the heels of the August 30th Office 365 Admin Portal outage... 

That outage again exposed an issue with how Microsoft communicates about outages. Without the ability to log into the Admin Portal to view the outage information, savvy administrators had to scour social media for updates. Remember that?

What is the path forward?

The path forward dictates that administrators must be able to determine what caused an outage or service slowdown so that they can respond appropriately to issues that come up and minimize the time required to resolve an issue.

Customer-centered monitoring that leverages end-user experience probes along with real-time synthetic tests is critical in determining where the problem lies. In the absence of modern monitoring capabilities, quickly understanding where problems are occurring and who is affected may not be possible.

Did you have the controls in place to visually spot the Office 365 outage in real-time? Gain visibility today.

Mailscape 365 from ENow Software is the answer

ENow Software is the leading provider of Office 365 management solutions that help you save money and increase end-user productivity.

Let’s quickly walk through how Mailscape 365 is helping our customers navigate the outages of the past week and achieve SLA transparency.

Once the September 4th outage started to affect the ability to authenticate via Azure AD to Office 365 systems, the Mailscape 365 OneLook dashboard turned red as a visual indicator for the NOC. You can see in the screenshot below that the Directory & Authentication, Exchange Online and Configuration services are showing red.

The visual cue of the red indicators quickly shows that there are issues with the Office 365 service.

The administrator or NOC agent can quickly follow the digital breadcrumb trail and select the Configuration section that is blinking red. Further selecting the blinking red Admin Portal section results in the screenshot shown below. Yes, there is a problem logging into the Admin Portal as the test results show as failed.

Mailscape 365 continues to provide further value to the administrator here in that the Service Health Dashboard is available within the same platform. We can see in the screenshot below the 90-day history of advisories and, most importantly, details about the current Office 365 portal outage.


Since the Azure AD outage affected many services, the OneLook dashboard continues to drive down mean time to detect (MTTD) by showing all failed services. For instance, the Outlook functionality tests shown below for MAPI/HTTP and Autodiscover provide additional color on the full breadth of the outage from an end-user perspective.



In a cloud-based system like Office 365, visibility is limited, and the relationships between key system components are largely unknown to the customer.

The Mailscape 365 product also provides customer-centered monitoring that leverages end-user experience probes along with real-time synthetic tests to determine where the problems and outages lie. This approach injects probes into the locations you specify to carry out typical end-user tasks and report back on performance. These end-user experience probes provide the necessary data and resulting analytics to ensure complete visibility into performance and service quality at each individual location. Monitoring the experience that end-users have through synthetic tests when using the Office 365 service is critical to identifying and localizing problems. After all, the ultimate measure of any cloud-based service is whether the service is available for end-user consumption.
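To make the idea of a synthetic end-user probe concrete, here is a minimal Python sketch. It is not how Mailscape 365 is implemented; the names `ProbeResult`, `run_probe`, and `failing_by_location`, the location names, and the stand-in actions are all hypothetical, and a real probe would perform an actual sign-in or mailbox operation instead of calling a stub.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    location: str
    task: str
    ok: bool
    latency_s: float
    error: str = ""


def run_probe(location: str, task: str, action: Callable[[], None]) -> ProbeResult:
    """Run one synthetic end-user task and record success and latency."""
    start = time.monotonic()
    try:
        action()  # e.g. sign in, open a mailbox, send a test message
        return ProbeResult(location, task, True, time.monotonic() - start)
    except Exception as exc:  # any failure counts as a failed user experience
        return ProbeResult(location, task, False, time.monotonic() - start, str(exc))


def failing_by_location(results: List[ProbeResult]) -> Dict[str, List[str]]:
    """Group failed tasks by location to help localize a problem."""
    failures: Dict[str, List[str]] = {}
    for r in results:
        if not r.ok:
            failures.setdefault(r.location, []).append(r.task)
    return failures


# Hypothetical stand-ins for real end-user actions.
def action_ok() -> None:
    pass


def action_auth_down() -> None:
    raise ConnectionError("authentication unavailable")


results = [
    run_probe("London", "sign-in", action_ok),
    run_probe("Dallas", "sign-in", action_auth_down),
    run_probe("Dallas", "open-mailbox", action_auth_down),
]
print(failing_by_location(results))
```

Grouping failures by location is the design choice that matters here: it separates a problem at one site from a service-wide outage, which is the localization the paragraph above describes.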

Companies like Barclays, Experian, and VMware are using Mailscape 365 from ENow Software to obtain visibility, event correlation, and SLA transparency into Microsoft’s cloud platform. Using Mailscape 365 would have allowed you to navigate the outages of the past week with real-time visibility into your collaboration suite.


References:

1 How much workplace collaboration is too much?

2 McKinsey Global Institute Survey, Advanced social technologies and the future of collaboration, 2017