A global IT outage that began on July 18, 2024 significantly impacted Microsoft 365 and Microsoft Azure services, along with countless businesses. The issue was traced back to an update to the CrowdStrike Falcon agent, which caused widespread disruptions.
According to Microsoft’s Azure Status Update, the update affected Windows Client and Windows Server virtual machines running on Azure, leading to major service interruptions and many end users experiencing the ‘Blue Screen of Death.’ The outage not only impacted general business operations but also led to severe delays at airports globally, disrupting flights and causing chaos for travelers.
While the specific details of the root cause are unknown at this point, it appears that these updates were not tested before being applied to production servers, even though testing updates first is a recommended practice.
Figure 1: Updates from the Microsoft 365 Status X Account
Of course, the internet is having a field day, as many joke about having the day off and send well-wishes to the IT teams that are inevitably being inundated with support tickets and dealing with troubleshooting, workarounds, and recovery of their systems.
Figure 2: User replies to the Microsoft 365 Status X Updates
CrowdStrike acknowledged the problem in a statement, attributing it to a content update for their Falcon agent. They worked closely with Microsoft to identify the issue and implement a fix. Both companies have since been investigating the root cause to prevent future occurrences and enhance their systems' resilience.
Additionally, Central US region customers may have experienced Azure service issues due to another incident related to Azure Storage. The Azure Status History site provides more information on that outage.
If you’re an ENow Microsoft 365 Monitoring customer, you get the signals and early warning signs, often hours before Microsoft even starts reporting the problem, because we take a multi-pronged approach to sharing relevant information and alerting our customers. Here's an overview of how a Microsoft 365 monitoring tool can help in an outage situation. We also outline the importance of communicating with users during an outage.
Those with ENow M365 Monitoring & Remote Probes would see impacted geographies light up red on our OneLook Dashboard. When our customers see multiple geographies showing network and workload issues, it’s a clear indicator of a global issue or outage rather than a more isolated event.
Figure 3: ENow’s M365 Monitoring OneLook Dashboard shows Remote Probes lighting up with service issues.
Figure 4: ENow Monitoring drill down into Remote Probe example.
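The "multiple geographies" rule of thumb described above can be sketched in a few lines of Python. This is only an illustration, not ENow's implementation: the function name `classify_outage_scope`, the probe names, the region labels, and the two-region threshold are all assumptions made for the example.

```python
def classify_outage_scope(probe_status, min_regions=2):
    """Classify outage scope from remote probe results.

    probe_status maps a (hypothetical) probe name to a tuple of
    (region, healthy). min_regions is an assumed threshold: failures in
    at least this many distinct regions are treated as a global issue.
    """
    # Collect the distinct regions that have at least one failing probe.
    failing_regions = {
        region for region, healthy in probe_status.values() if not healthy
    }
    if not failing_regions:
        return "healthy"
    if len(failing_regions) >= min_regions:
        return "global"
    return "isolated"


# Hypothetical probe results: two regions failing, one healthy.
probes = {
    "london-office": ("EMEA", False),
    "chicago-hq": ("NA", False),
    "sydney-branch": ("APAC", True),
}
print(classify_outage_scope(probes))  # prints "global"
```

A real deployment would likely weigh the number of probes per region and how long each failure has lasted, but the core idea is the same: failures spread across regions point away from a single-site problem.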
The ENow Remote Probes can give you visibility into how Windows machines in your primary location, as well as all the remote locations, are performing. You can also get alerts wherever and whenever performance falls below desired conditions. In the case of the CrowdStrike outage, our customers reported Microsoft 365 outages across many of their site locations around the world at about midday on July 18, 2024, much sooner than Microsoft was reporting it. The dashboard of a customer who deployed remote probes to all of their locations highlights the critical state of those sites. It also captures how the outage unfolded, with some sites becoming impacted later in the day. A quick view of the Service Health Dashboard did not reveal any related incidents, but the replies to the Microsoft 365 X account did show noteworthy noise early on. A few hours later, Microsoft would post an incident covering all impacted Microsoft 365 apps.
One customer we spoke with conducted additional research and determined that patches were applied the previous day and a restart of the server took place. The ENow Server OS Version and Patches report was instrumental in guiding the customer in determining part of the root cause.
Our M365 Monitoring platform has an X (Twitter) integration, which significantly enhances IT teams’ ability to receive real-time information and updates, aiding in troubleshooting issues efficiently. The @MSFT365Status account posts updates as issues become known, and users often share additional real-time issues and solutions. IT teams can leverage this collective knowledge to stay informed about widespread problems or emerging threats. In addition, X can provide geolocation data on where issues are being reported, helping IT teams further identify whether a problem is localized or widespread. This is particularly useful for addressing regional outages or service disruptions.
With the Microsoft Service Health Dashboard Integration, our users can also see Microsoft’s service status updates alongside our monitoring analytics. Combining this information in one place can help organizations quickly determine where the issue is – whether it’s an ‘us’ or ‘them’ problem.
Figure 5: ENow’s Monitoring for Microsoft 365 integrates with Service Dashboard Status and X.
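The "us or them" question can be made concrete with a small decision sketch that combines the provider's reported status with your own monitoring results. Everything here is an assumption for illustration: the function name `locate_fault`, its inputs, and the labels it returns are hypothetical and do not reflect how ENow or Microsoft classify incidents.

```python
def locate_fault(provider_incident_open, failing_sites, total_sites):
    """Give a rough first guess at where a fault lies.

    provider_incident_open: True if the provider's service health feed
        currently lists a related incident (assumed input).
    failing_sites / total_sites: counts taken from your own monitoring.
    """
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    if not 0 <= failing_sites <= total_sites:
        raise ValueError("failing_sites must be between 0 and total_sites")

    if failing_sites == 0:
        return "no impact observed"
    if provider_incident_open:
        return "likely provider-side"
    # Several sites failing with no posted incident may mean the provider
    # has not reported the problem yet, so do not assume it is local.
    if failing_sites > 1:
        return "possibly provider-side (not yet reported)"
    return "likely local"


print(locate_fault(False, 5, 8))  # prints "possibly provider-side (not yet reported)"
```

The middle branch reflects the situation described earlier, where customers saw widespread probe failures before any incident appeared on the Service Health Dashboard.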
In IT outage situations, especially global ones of this scale that directly impact end users, tensions can run high between IT departments and users without good communication. Users expect transparency and timely updates, and any delay or lack of information can damage the trust and reputation of the IT department.
Every minute of a head start you can get on identifying the source of an outage is crucial. The faster you identify the problem, the sooner you can start working on a solution. For example, if your monitoring systems alert you to a server failure within the first few minutes, you can immediately begin rerouting traffic or initiating failover protocols. This swift action can prevent the issue from escalating and affecting more users.
Early identification allows you to alert your user base promptly. Users appreciate transparency and timely updates, even if the news is bad. For example, if a Microsoft workload goes down, or some users experience the Blue Screen of Death like they did yesterday and today, sending out an immediate notification through email, social media, and in-app messages can inform users that you are aware of the problem and are working on it. This proactive communication can prevent a flood of support tickets and social media backlash.
Once the source of the outage is identified, quickly determining the right course of action for recovery is essential. This may involve deploying patches, rerouting traffic, or initiating backup systems. For instance, if a data center experiences a power failure, the IT team can quickly switch to a secondary data center, minimizing downtime and data loss. Rapid action not only shortens the outage duration but also shows users that the IT department is competent and prepared for emergencies.
Acting quickly in IT outage situations is vital for maintaining a positive perception of the IT department. Swift identification of issues, prompt communication with users, and decisive action for recovery all contribute to demonstrating competence and reliability. Effective communication can turn a potential crisis into an opportunity to strengthen user trust and confidence in the IT department.
Effective communication and outage handling are only possible if you know what’s going on. The enhanced visibility from ENow’s proactive M365 monitoring results in a faster mean time to resolution, thereby reducing the business impact and building confidence in your IT team.
Learn more about ENow’s Microsoft 365 Monitoring and Reporting Platform or contact us for a Microsoft 365 Monitoring Demo.
In a cloud world, outages are bound to happen. While Microsoft is responsible for restoring service during outages, IT needs to take ownership of its environment and user experience. It is crucial to have greater visibility into business impacts during a service outage the moment it happens.
ENow’s Microsoft 365 Monitoring and Reporting solution enables IT Pros to pinpoint the exact services affected and root cause of the issues an organization is experiencing during a service outage by providing:
Identify the scope of Microsoft 365 service outage impacts and restore workplace productivity with ENow’s Microsoft 365 Monitoring and Reporting solution. Access your free 14-day trial today!