It seems as if every summer something innocuous happens in a Microsoft datacenter halfway around the world and spreads through the service like wildfire, taking down access for vast numbers of customers. It happened at the end of June last year, in 2014, when Exchange Online and Lync Online were down for hours, and it has just happened again this week.
Here is how it started: at 5:28 PM Eastern time, Microsoft officially opens an incident report:
Current Status: Engineers are investigating an issue in which some customers may be experiencing problems accessing or using Exchange Online services or features. This event is actively being investigated. More information will be provided shortly.
29 minutes later, at 5:57 PM Eastern, the company acknowledges that this outage is a reasonably widespread issue:
User Experience: Affected users are unable to connect to the Exchange Online service when using multiple protocols including Outlook, Outlook Web App (OWA), Exchange ActiveSync (EAS), and Exchange Web Services (EWS).
Customer Impact: A higher than average number of customers are reporting this issue. Analysis indicates that customers will likely have some users experiencing this issue.
33 minutes later, at 6:30 PM Eastern, more explanation is given (keep in mind that access has been down or very intermittent for many users for over an hour at this point):
The investigation determined that a portion of infrastructure which facilitates authentication to the service is experiencing higher-than-normal resource usage. Engineers are analyzing service telemetry to determine what is causing the high resource usage.
30 minutes later, we get an admission that, whoops, all of us are beta testers for Microsoft (didn't you know?), and that a scheduled update has knocked out e-mail for millions of people for an hour and a half. At 7 PM Eastern, 92 minutes into this outage:
Engineers have determined that this issue may be related to a recent update to the service and are currently working to revert the update.
Another 37 minutes go by and at 7:37 Eastern time, two hours and nine minutes into the outage, we finally get some progress on fixing the issue:
Engineers are making progress in reverting the update which is believed to be causing this issue. The update has been reverted in the Latin American region and users hosted from that region should begin to see relief. Reversion of the update in the North American region is underway; as it progresses users there should also experience service restoration.
By 8:35 PM Eastern time, 58 minutes later and three hours and seven minutes into the outage, the fix is complete and Microsoft expects everyone to see usable e-mail again:
Engineers have reverted the update in all regions and affected customers should be experiencing service restoration. Service teams are validating that the configuration change has been applied correctly and that service health is recovering as expected.
Of course, it takes another hour and four minutes for the mail queues that built up during the outage to work their way through the filters and get delivered. So now, four hours and 11 minutes into the outage, things finally look to be turning around:
Engineers have validated that the configuration change has been applied to the affected infrastructure and are continuing to monitor service health. Affected customers should experience service restoration as mail queues continue to drain.
And finally, at 10:17 PM, essentially five hours later, Microsoft says the service is restored and healthy:
After validating that the configuration change was successfully applied to the affected infrastructure and mail queues have drained, engineers confirmed that service is restored.
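The elapsed times quoted throughout the timeline above (29 minutes between the first two updates, 92 minutes to the admission of a bad update, 289 minutes to full restoration) are easy to sanity-check. Here is a minimal Python sketch that recomputes them from the dashboard timestamps; the labels are my own shorthand for each update, not Microsoft's wording:

```python
from datetime import datetime

# Timestamps (Eastern time) of each Service Health Dashboard update,
# as quoted in the timeline above.
updates = [
    ("incident report opened", "5:28 PM"),
    ("widespread issue acknowledged", "5:57 PM"),
    ("authentication infrastructure identified", "6:30 PM"),
    ("recent update suspected, rollback begun", "7:00 PM"),
    ("rollback progressing by region", "7:37 PM"),
    ("rollback complete in all regions", "8:35 PM"),
    ("service confirmed restored", "10:17 PM"),
]

def minutes_since_start(timestamp: str, start: str = "5:28 PM") -> int:
    """Minutes elapsed from the start of the incident to `timestamp`."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(timestamp, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

for label, ts in updates:
    print(f"{minutes_since_start(ts):4d} min  {label}")
```

Running this confirms the figures in the narrative: 92 minutes at the 7 PM update, 187 minutes (three hours and seven minutes) at 8:35 PM, and 289 minutes, essentially five hours, at the final all-clear.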
Did your users appreciate not being able to send or receive e-mail from within Outlook? Were they complaining to you? Did you have any idea what was going on? Was your first instinct to check the Office 365 portal for more information, or did you spin your wheels and waste time troubleshooting non-existent issues on your network before finally concluding that you were up and it was Microsoft that was down?
During last year’s outage, one of the main complaints was that Microsoft was slow to update the Service Health Dashboard about the issue and to provide regular updates on progress toward resolution and service recovery. To Microsoft's credit, updates were provided during this outage; it looks like no more than an hour or so transpired between them. But an hour is a long time, especially in a multihour extended outage like this one, and the Service Health Dashboard is not easy to find or to interpret.
When you have hundreds of users shouting that their e-mail isn’t working, and you’re just the person in the middle looking for information to pass on to those users, you need all the context and data you can get. That’s where Mailscape365 from ENow Software comes in. The premise is simple but powerful: when you can notify your users early on about an incident in progress, and when your own reporting helps you keep those users up to date on progress toward restoration of service, you have created an environment that is a cut above the rest. Customers using Mailscape365 were instantly in the know through the product’s easy-to-use dashboard with green, yellow, and red indicators. No guessing, no trying to decipher Service Health Dashboard messages.
While an outage is never pleasant and never welcome, operating in the dark is worse. Don’t let that happen to you. As I mentioned, this certainly isn’t the first Exchange Online outage, and you know it won’t be the last.