Global Microsoft Outage caused by CrowdStrike Update Underscores the Benefits of Microsoft 365 Monitoring

Written by ENow Software | Jul 19, 2024 5:46:46 PM

A global IT outage significantly impacted Microsoft 365 and Microsoft Azure services and countless businesses, starting on July 18, 2024. The issue was traced back to an update to the CrowdStrike Falcon agent, which caused widespread disruptions.

Azure Virtual Machines & Microsoft 365 Services Impacted Across the World in Massive M365 Outage

According to Microsoft’s Azure Status Update, the update affected Windows Client and Windows Server virtual machines running on Azure, leading to major service interruptions and many end users experiencing the ‘Blue Screen of Death.’ The outage not only impacted general business operations but also led to severe delays at airports globally, disrupting flights and causing chaos for travelers.

While the specific details of the root cause are unknown at this point, it appears that these updates were not tested before being applied to production servers, even though testing updates first is a recommended practice.

Figure 1: Updates from the Microsoft 365 Status X Account

Of course, the internet is having a field day, as many joke about having the day off and send well-wishes to the IT teams that are inevitably being inundated with support tickets and dealing with troubleshooting, workarounds, and recovery of their systems.

Figure 2: User replies to the Microsoft 365 Status X Updates

CrowdStrike acknowledged the problem in a statement, attributing it to a content update for their Falcon agent. They worked closely with Microsoft to identify the issue and implement a fix. Both companies have since been investigating the root cause to prevent future occurrences and enhance their systems' resilience.

Additionally, Central US region customers may have experienced Azure service issues due to another incident related to Azure Storage. The Azure Status History site provides more information on that outage.

This raises the question: what can companies do when these massive Microsoft 365 outages hit?

Well, if you’re an ENow Microsoft 365 Monitoring customer, you get the signals and early warning signs often hours before Microsoft even starts reporting the problem, as we take a multi-pronged approach to sharing relevant information and alerting our customers. Here's an overview of how a Microsoft 365 Monitoring Tool can help in an outage situation. We also outline the importance of communication to users in an outage.

Monitor Microsoft Outages with Remote Probes

Those with ENow M365 Monitoring & Remote Probes would see impacted geographies light up red on our OneLook Dashboard. When our customers see multiple geographies showing network and workload issues, it’s a clear indicator of a global issue or outage rather than a more isolated event.

Figure 3: ENow’s M365 Monitoring OneLook Dashboard shows Remote Probes lighting up with service issues.

Figure 4: ENow Monitoring drill down into Remote Probe example.
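The "multiple geographies" signal described above can be illustrated with a small sketch. This is not ENow's implementation; the function name, the region labels, and the threshold of three regions are all assumptions chosen for illustration.

```python
from typing import Mapping


def classify_outage_scope(probe_healthy: Mapping[str, bool],
                          global_min_regions: int = 3) -> str:
    """Classify an outage from per-region probe health.

    probe_healthy maps a region name to True (probe healthy) or False
    (probe failing). global_min_regions is a hypothetical threshold:
    at or above this many failing regions, treat the outage as global.
    """
    failing = [region for region, healthy in probe_healthy.items() if not healthy]
    if not failing:
        return "healthy"
    if len(failing) >= global_min_regions:
        return "global"
    return "isolated"


if __name__ == "__main__":
    # Hypothetical probe readings; region names are made up.
    sample = {"us-east": False, "emea": False, "apac": False, "latam": True}
    print(classify_outage_scope(sample))
```

In practice the threshold would depend on how many probes are deployed and how they are distributed across sites.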

The ENow Remote Probes can give you visibility into how Windows machines in your primary location, as well as all the remote locations, are performing. You can also get alerts wherever and whenever performance falls below desired conditions. In the case of the CrowdStrike outage, our customers reported Microsoft 365 outages across many of their site locations around the world about midday on July 18, 2024, much sooner than Microsoft was reporting it. The critical state of such locations is highlighted by the dashboard of a customer who deployed remote probes to all of them. The dashboard also captures the outage's progression, with some sites becoming impacted later in the day. A quick view of the Service Health Dashboard did not reveal any related incidents, but the replies to the Microsoft 365 X account did show noteworthy noise initially. A few hours later, Microsoft posted an incident covering the impacted Microsoft 365 apps.

One customer we spoke with conducted additional research and determined that patches were applied the previous day and a restart of the server took place. The ENow Server OS Version and Patches report was instrumental in guiding the customer in determining part of the root cause.

ENow Microsoft 365 Monitoring Platform Integration with X

Our M365 Monitoring platform has an X (Twitter) integration, which significantly enhances IT teams’ ability to receive real-time information and updates, aiding in troubleshooting issues efficiently. The @MSFT365Status account posts updates once known, and users often share additional real-time issues and solutions. IT teams can leverage this collective knowledge to stay informed about widespread problems or emerging threats. In addition, X can provide geolocation data on where issues are being reported, helping IT teams further identify if a problem is localized or widespread. This is particularly useful for addressing regional outages or service disruptions.

ENow’s Microsoft 365 Monitoring Platform Integration with Microsoft’s Service Health Dashboard

With the Microsoft Service Health Dashboard Integration, our users can also see Microsoft’s service status updates alongside our monitoring analytics. Combining this information in one place can help organizations quickly determine where the issue is – whether it’s an ‘us’ or ‘them’ problem.

Figure 5: ENow’s Monitoring for Microsoft 365 integrates with Service Dashboard Status and X.
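As a rough illustration of the "us or them" decision, the sketch below combines local probe results with a vendor incident flag. It is a hypothetical heuristic, not ENow's or Microsoft's logic; the function name, the returned labels, and the 50% ratio are assumptions.

```python
def triage_us_or_them(failing_sites: int,
                      total_sites: int,
                      vendor_incident_open: bool,
                      widespread_ratio: float = 0.5) -> str:
    """Guess whether an issue is local ("us") or on the provider side ("them").

    widespread_ratio is a hypothetical cut-off: if at least this share of
    monitored sites is failing, the problem is assumed to be upstream even
    when the vendor has not yet posted an incident.
    """
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    if failing_sites == 0:
        return "no local impact"
    if vendor_incident_open:
        return "them (vendor-confirmed)"
    if failing_sites / total_sites >= widespread_ratio:
        return "likely them (not yet reported)"
    return "likely us"


if __name__ == "__main__":
    # Hypothetical situation: most sites failing, no vendor incident posted yet.
    print(triage_us_or_them(failing_sites=8, total_sites=10, vendor_incident_open=False))
```

The third branch reflects the situation described earlier in this post, where probes showed widespread failures before an incident appeared on the Service Health Dashboard.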

IT Outage User Communications Impact Perception

In IT outage situations, especially global ones of this scale that directly impact end users, tensions can run high between IT departments and users without good communication. Users expect transparency and timely updates, and any delay or lack of information can damage the trust and reputation of the IT department.

Identifying the Source of the Microsoft 365 Outage

Every minute of a head start you can get on identifying the source of an outage is crucial. The faster you identify the problem, the sooner you can start working on a solution. For example, if your monitoring systems alert you to a server failure within the first few minutes, you can immediately begin rerouting traffic or initiating failover protocols. This swift action can prevent the issue from escalating and affecting more users.

Alerting Your User Base

Early identification allows you to alert your user base promptly. Users appreciate transparency and timely updates, even if the news is bad. For example, if a Microsoft workload goes down, or some users experience the Blue Screen of Death like they did yesterday and today, sending out an immediate notification through email, social media, and in-app messages can inform users that you are aware of the problem and are working on it. This proactive communication can prevent a flood of support tickets and social media backlash.

Identifying the Right Course of Action for Recovery

Once the source of the outage is identified, quickly determining the right course of action for recovery is essential. This may involve deploying patches, rerouting traffic, or initiating backup systems. For instance, if a data center experiences a power failure, the IT team can quickly switch to a secondary data center, minimizing downtime and data loss. Rapid action not only shortens the outage duration but also shows users that the IT department is competent and prepared for emergencies.

The Importance of Quick Action During an M365 Outage

Acting quickly in IT outage situations is vital for maintaining a positive perception of the IT department. Swift identification of issues, prompt communication with users, and decisive action for recovery all contribute to demonstrating competence and reliability. Effective communication can turn a potential crisis into an opportunity to strengthen user trust and confidence in the IT department.

Effective communication and outage handling are only possible if you know what’s going on. ENow’s proactive M365 monitoring provides enhanced visibility, resulting in a faster mean time to resolution, thereby reducing the business impact and building confidence in your IT team.

Learn more about ENow’s Microsoft 365 Monitoring and Reporting Platform or contact us for a Microsoft 365 Monitoring Demo.

The Importance of Microsoft 365 Monitoring

In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of its environment and user experience. It is crucial to have greater visibility into business impacts during a service outage the moment it happens.

ENow’s Microsoft 365 Monitoring and Reporting solution enables IT Pros to pinpoint the exact services affected and root cause of the issues an organization is experiencing during a service outage by providing:

The ability to monitor entire environments in one place with ENow’s OneLook dashboard, which makes identifying a problem fast and easy without having to scramble through Twitter and the Service Health Dashboard looking for answers.
A full picture of all services and subsets of services affected during an outage with ENow’s remote probes, which cover several Microsoft 365 apps and other cloud-based collaboration services.

Identify the scope of Microsoft 365 service outage impacts and restore workplace productivity with ENow’s Microsoft 365 Monitoring and Reporting solution. Access your free 14-day trial today!
