Exchange Monitoring: An Introduction to Managed Availability (Part 1)

Written by Dominik Hoefling MVP | Dec 15, 2020 8:00:00 AM

An Exchange Administrator's Task?

With Exchange 2013, Microsoft introduced a built-in monitoring system called Managed Availability, which automatically takes recovery actions for unhealthy services within the Exchange organization.

Microsoft has been operating a cloud version of Exchange since 2007 and has put all of that operational knowledge into Managed Availability. Managed Availability is a cloud-trained system that is built around the end user’s experience and the principles of recovery-oriented computing.

Managed Availability doesn’t mean you no longer have to monitor your on-premises or hybrid Exchange environment; in fact, it’s just the opposite. However, the long and complex Exchange monitoring PowerShell cmdlets (which we will look at in more detail later) are not the most effective way to do so.

Exchange 2013, or more precisely the Exchange Diagnostics Service (EDS), collects a lot of performance data by default: over 3,000 performance counters are collected over seven days. On a regular basis, the Microsoft Exchange Diagnostics service picks up the data from the folder %Exchange Install Path%\Logging\Diagnostics\PerformanceLogsToBeProcessed and merges it into the daily performance log. You can find the daily logs, which are PerfMon .blg files, under %Exchange Install Path%\Logging\Diagnostics\DailyPerformanceLogs. Managed Availability uses these files, among others, to track the health of system components. By default, the performance data is kept for 7 days or until 5 GB of data is reached. You can change these settings in the file Microsoft.Exchange.Diagnostics.Service.exe.config, located in the bin directory of your Exchange installation path:

<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />
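To make the retention values easier to read, here is a minimal Python sketch that parses the configuration line shown above. It assumes that MaxSize is expressed in megabytes (which matches the 5 GB default mentioned in the text) and that MaxAge and CheckInterval use the d.hh:mm:ss time span notation; the helper names parse_timespan and describe_retention are hypothetical and only serve this illustration.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

# The configuration line exactly as quoted in the article.
CONFIG_LINE = (
    '<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" '
    'MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />'
)


def parse_timespan(value: str) -> timedelta:
    """Parse 'd.hh:mm:ss' or 'hh:mm:ss' (no fractional seconds)."""
    days = 0
    if "." in value.split(":")[0]:
        # A leading day component is separated from the hours by a dot.
        day_part, value = value.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def describe_retention(xml_line: str) -> dict:
    """Return the retention settings in human-friendly units."""
    element = ET.fromstring(xml_line)
    return {
        "name": element.get("Name"),
        # Assumption: MaxSize is given in megabytes.
        "max_size_gb": int(element.get("MaxSize")) / 1024,
        "max_age": parse_timespan(element.get("MaxAge")),
        "check_interval": parse_timespan(element.get("CheckInterval")),
    }


retention = describe_retention(CONFIG_LINE)
print(retention)
```

Read this way, the default line keeps the daily performance logs for seven days or up to 5 GB, and the limits are checked every eight hours.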

Managed Availability has multiple HealthSet models that are responsible for different services, such as:

Client Protocols: OWA/ECP, ActiveSync, IMAP/POP, UM, Outlook, Compliance
Storage: DataProtection, Clustering, PublicFolders, SiteMailbox, Store
Mail Flow: FrontEndTransport, HubTransport, MailboxTransport, Deployment
Migration: MigrationMonitor, MRS
Fabric: Diskspace, MailboxSpace, ActiveDirectory, UserThrottling

The main components of Managed Availability are Probes, Monitors, and Responders.

Probes

Probes run every few minutes against different services, check their health, and collect data from the server. These results flow into the monitoring component of Managed Availability. An Exchange 2013 multi-role server runs hundreds of Probes, and in most cases these Probes are not directly discoverable: most of them are defined within the Exchange program code and cannot be changed. For example, customers reported that the AutoDiscoverSelfTestProbe failed when the ExternalUrl for the EWS virtual directory wasn’t set, and there was no way to change the probe settings; Microsoft resolved this issue in Cumulative Update 6. Probes write an informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult crimson channel with one of the following result types:

1 = Timeout
2 = Poisoned
3 = Succeeded
4 = Failed
5 = Quarantined
6 = Rejected
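When you export probe results and only see the numeric codes, a small lookup helps. The following Python sketch maps the result types listed above to their names and counts them; the class name ProbeResultType, the function summarize_results, and the sample codes are hypothetical and not part of Exchange.

```python
from collections import Counter
from enum import IntEnum


class ProbeResultType(IntEnum):
    """Result type codes as listed in the article."""
    TIMEOUT = 1
    POISONED = 2
    SUCCEEDED = 3
    FAILED = 4
    QUARANTINED = 5
    REJECTED = 6


def summarize_results(result_codes):
    """Count how often each result type occurs in a list of numeric codes."""
    return dict(Counter(ProbeResultType(code).name for code in result_codes))


# Hypothetical sample: codes as they might appear in exported probe events.
sample_codes = [3, 3, 4, 1, 3]
summary = summarize_results(sample_codes)
print(summary)
```

An unknown code raises a ValueError, which is deliberate here: it flags values that the list above does not cover instead of silently ignoring them.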

Probes are divided into three categories:

Recurring Probes: system-performed tests of the end-to-end user experience, such as OWA connectivity.
Notifications: components that perform their own monitoring outside the health manager framework and write probe results directly. For example, the MSExchangeDAGMgmt service logs a probe result without Managed Availability.
Checks: collect data from performance counters and log events if the defined thresholds are exceeded or not met.

Monitors

Monitors are the central part of Managed Availability. All collected server data is examined to determine if action needs to be taken based on a predefined rule set within the Monitors. Nearly all Monitors collect three types of data:

Direct notifications: Monitors become Unhealthy if a direct notification, for example from a service, changes the Monitor state
Probe results: Monitors become Unhealthy if a Probe fails
Performance counters: Monitors become Unhealthy if a performance counter is higher or lower than the defined threshold

Depending on the issue, a Monitor can either initiate a Responder or escalate the issue via an event log entry. Monitors can be in one of the following states:

Healthy: all collected Probe data is within a normal range
Unhealthy: an issue was detected; either a recovery process has started or the issue has been escalated
Degraded: the Monitor has been in an unhealthy state for less than 60 seconds
Disabled: the Monitor was manually disabled by an administrator
Unavailable: the Microsoft Exchange Health Manager service doesn’t get a query response from the Monitor
Repairing: informs Managed Availability (or monitoring software) that corrective actions are in progress
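The relationship between Healthy, Degraded, and Unhealthy can be sketched in a few lines of Python. This is only an illustration of the 60-second rule from the list above, not how Exchange implements it; the function classify_monitor and its parameters are hypothetical, and the Unavailable and Repairing states are left out for brevity.

```python
from typing import Optional

# From the state list: a Monitor counts as Degraded while it has been
# unhealthy for less than 60 seconds.
DEGRADED_WINDOW_SECONDS = 60


def classify_monitor(seconds_unhealthy: Optional[float], disabled: bool = False) -> str:
    """Return a state name for a Monitor.

    seconds_unhealthy is None while no issue is detected; otherwise it is
    the time in seconds since the Monitor first detected the issue.
    """
    if disabled:
        return "Disabled"
    if seconds_unhealthy is None:
        return "Healthy"
    if seconds_unhealthy < DEGRADED_WINDOW_SECONDS:
        return "Degraded"
    return "Unhealthy"


for elapsed in (None, 30, 90):
    print(elapsed, classify_monitor(elapsed))
```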

Many Monitors require multiple probe failures before becoming Unhealthy; these high thresholds prevent Managed Availability and its Responders from taking unnecessary recovery actions. For problems that require manual intervention, take a look at the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel.
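The idea of a failure threshold can be illustrated with a short Python sketch. It assumes a simple "N consecutive failures" rule; the real Monitors use their own rule sets defined inside Exchange, so the class ThresholdMonitor and the threshold of 3 are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThresholdMonitor:
    """Toy Monitor that turns Unhealthy after N consecutive probe failures."""
    failure_threshold: int  # hypothetical value, not an Exchange default
    consecutive_failures: int = 0

    def observe(self, probe_succeeded: bool) -> str:
        """Record one probe result and return the resulting state."""
        if probe_succeeded:
            # A single success resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "Unhealthy"
        return "Healthy"


monitor = ThresholdMonitor(failure_threshold=3)
probe_outcomes = [False, False, True, False, False, False]
states = [monitor.observe(outcome) for outcome in probe_outcomes]
print(states)
```

In this sketch, two failures followed by a success do not trigger anything; only the third failure in a row flips the state. That is the behavior that keeps a single transient probe failure from starting a recovery action.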

Responders

Responders are triggered by an Unhealthy Monitor and perform recovery actions, such as resetting an IIS application pool, initiating a database failover, or restarting a server. Managed Availability uses the following Responder types:

Restart Responder: terminates and restarts a service
Reset AppPool Responder: recycles an IIS AppPool
Failover Responder: performs a mailbox database or server failover
Bugcheck Responder: initiates a bug check of the server (forcing a reboot)
Offline Responder: takes a protocol on a server, such as MAPI/HTTP, out of service so that it rejects client requests
Online Responder: brings a protocol on a server, such as MAPI/HTTP, back into production so that it accepts client requests
Escalate Responder: writes an event log entry to inform an administrator
Specialized Component Responder: specialized Responders that are unique to their component

If you would like to review all recovery actions taken by the Managed Availability Responders, view the Microsoft.Exchange.ManagedAvailability\RecoveryActionResults crimson channel.

Conclusion

This concludes part one of this article. In the second part, we will take a more practical approach to Managed Availability. By using PowerShell we will show you how you can retrieve useful information from the massive amounts of data that Managed Availability collects about your environment.

Part 2 goes over how to check, protect, and maintain Exchange Server, and in Part 3 we dive into local monitoring and overrides.

Exchange Monitoring with ENow

Watch all aspects of your Exchange environment from a single pane of glass: client access, mailbox, and Edge servers; DAGs and databases; network, DNS, and Active Directory connectivity; Outlook, ActiveSync, and EWS client access.
