Face it, most Exchange administrators look forward to their weekly patching projects about as much as you and I look forward to our next trip to the dentist. Throw in the extra complications of switching from a non-clustered environment to a clustered one and the words "root canal" come to mind. With non-clustered servers you can usually just use WSUS or another patching product, which simply installs the patches and then reboots the Windows box. If you do the same thing in a clustered Exchange environment, however, the process could end in disaster.
What, then, can the Exchange admin do to make this routine process simple? The answer lies in how you use the nodes: move resources off the node you're about to patch, and then apply the patches.
This doesn't mean there are no drawbacks. This is a manual process and it takes time. The positive side is that administrators have complete visibility of the process and can see if anything doesn't work as expected. Plus, you can take action on a failed node while the other nodes of the cluster maintain service to users. This approach also lets the administrator work regular office hours instead of patching late at night or on weekends.
Best of all, Exchange 2010 comes equipped with scripts that help administrators manage this manual process.
In the scripts directory, there are three relevant scripts: StartDagServerMaintenance.ps1, StopDagServerMaintenance.ps1, and RedistributeActiveDatabases.ps1.
Start on the first node by running StartDagServerMaintenance.ps1 -serverName <node name>
This moves the active databases from the first node to another node in the DAG. If needed, it also moves the Cluster Group to another node, which maintains quorum and keeps the Primary Active Manager online. The script does not just move things off the first node; it also reconfigures some parameters so that databases cannot move back if a failure occurs on another node during maintenance. The same applies to the "Cluster Group" resource.
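As a rough sketch, putting the first node into maintenance from the Exchange Management Shell might look like the following. The server name MBX1 is a placeholder, and the two checks at the end are simply one way to confirm the result, not a required part of the procedure.

```powershell
# Assumption: MBX1 is a placeholder for the DAG member you are about to patch.
$node = "MBX1"

# $exscripts is predefined in the Exchange Management Shell and points to the scripts directory.
cd $exscripts
.\StartDagServerMaintenance.ps1 -serverName $node

# Check that no database copy on the node is still mounted (active).
Get-MailboxDatabaseCopyStatus -Server $node | Format-Table Name, Status -AutoSize

# Check the activation policy that keeps databases from moving back during maintenance.
Get-MailboxServer $node | Format-List Name, DatabaseCopyAutoActivationPolicy
```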
Now you can patch and reboot the first node without it taking any active databases back.
Once you are ready to bring the node back into service, use StopDagServerMaintenance.ps1 -serverName <node name>
This removes the configuration that stops databases from moving to the server. From here on, the server is free to host active databases again or to run the Primary Active Manager. The script will not move any databases back to the server; it just makes the server a possible owner of databases again.
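A minimal sketch of taking the node out of maintenance, again with MBX1 as a placeholder name. The follow-up checks are an assumption about what you would want to confirm before moving on to the next node.

```powershell
# Assumption: MBX1 is a placeholder for the DAG member that has just been patched.
$node = "MBX1"

cd $exscripts
.\StopDagServerMaintenance.ps1 -serverName $node

# The activation policy should no longer block the node from hosting active databases.
Get-MailboxServer $node | Format-List Name, DatabaseCopyAutoActivationPolicy

# The copies stay passive (the script moves nothing back), but they should be healthy
# and their queues should be draining.
Get-MailboxDatabaseCopyStatus -Server $node |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```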
Now you just need to repeat these steps for every other node in your DAG. If you have a large DAG, it is possible to patch multiple servers at the same time. Just be careful to maintain quorum; if the cluster loses quorum, the databases in the DAG dismount.
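To make the quorum warning concrete, here is a small arithmetic sketch. The function name is hypothetical. It assumes the usual DAG arrangement in which a DAG with an even number of members gets one extra vote from its witness server, and it assumes that the witness stays online.

```powershell
# Hypothetical helper: how many DAG members can be down at once before quorum is lost.
function Get-MaxMembersOffline {
    param([int]$MemberCount)

    # Assumption: an even number of members gets one extra vote from the witness server.
    $votes = $MemberCount
    if ($MemberCount % 2 -eq 0) { $votes = $MemberCount + 1 }

    # Quorum needs more than half of all votes.
    $majority = [math]::Floor($votes / 2) + 1

    # Votes that can disappear while the rest still form a majority.
    return $votes - $majority
}

Get-MaxMembersOffline -MemberCount 4
```

For a four-member DAG this returns 2. Treat the number as an upper bound only: every database also needs a healthy copy on a server that stays up.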
The last step, once all servers are back online in a normal state, is to run the following script: RedistributeActiveDatabases.ps1
This script has several parameters, but the one you want to use now is -BalanceDbsByActivationPreference
When it runs correctly, each database is moved to the mailbox server that holds the copy with the lowest activation preference number for that database. Hopefully the activation preferences are already set in a way that suits your environment.
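A sketch of this final step, assuming a DAG named DAG1 (a placeholder). Listing the activation preferences first is optional, but it shows where each database will end up.

```powershell
# Assumption: DAG1 is a placeholder for the name of your database availability group.
$dag = "DAG1"

# Review the activation preference configured for each database copy.
Get-MailboxDatabase | Format-Table Name, Server, ActivationPreference -AutoSize

# Move each database to its most preferred copy without prompting for every move.
cd $exscripts
.\RedistributeActiveDatabases.ps1 -DagName $dag -BalanceDbsByActivationPreference -Confirm:$false
```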
All these steps take time. Between each step you should also verify that replication is working and that no backup is running at the same time. Replication can be verified with the Get-MailboxDatabaseCopyStatus cmdlet.
But what if you simply apply patches to a server that is a member of a DAG and reboot it afterwards?
In theory, nothing should stop your DAG from serving users, as long as you don't reboot multiple servers at the same time and lose quorum. In practice, replication and content indexing sometimes break, leaving some tasks for admins to clean up later.
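The checks between steps could look something like this; MBX1 is a placeholder for the node you are working on.

```powershell
# Assumption: MBX1 is a placeholder for the DAG member you are working on.
$node = "MBX1"

# Copy status, queue lengths and content index state for every copy on the node.
Get-MailboxDatabaseCopyStatus -Server $node |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

# Broader replication and cluster health checks for the same node.
Test-ReplicationHealth -Identity $node

# Make sure no backup is currently running against databases with a copy on this node.
Get-MailboxDatabase -Server $node -Status | Format-Table Name, BackupInProgress -AutoSize
```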
Smaller organizations often don't have the time to babysit servers while patching them. They just want to approve patches in WSUS and have them applied automatically during the night.
This is fine, but you must first configure the Windows Update client on each server so that the DAG members do not apply patches at the same time. Next, I have created a script that you must schedule to run a little ahead of the time when Windows Update does its work. Let's say you configure Windows Update to apply patches at 1 am. In this instance, schedule the script to run 30 to 60 minutes before.
For the next node in your DAG, if you scheduled Windows Update to apply patches at 3 am, you would schedule the script to run sometime between 2 and 2:30 am.
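One way to set up such a schedule is with schtasks.exe, sketched below. The task name, the script path C:\Scripts\Prepare-DagNode.ps1, the service account, and the choice of Sunday are all hypothetical. The scheduled script would also need to load the Exchange cmdlets itself, for example with Add-PSSnapin Microsoft.Exchange.Management.PowerShell.E2010.

```powershell
# Assumptions: Windows Update installs patches on this node on Sundays at 01:00,
# so the preparation script runs 45 minutes earlier. Path and account are placeholders.
# The account needs Exchange permissions to move databases; /RP * prompts for its password.
schtasks.exe /Create /TN "Prepare DAG node for patching" `
    /TR "powershell.exe -NoProfile -File C:\Scripts\Prepare-DagNode.ps1" `
    /SC WEEKLY /D SUN /ST 00:15 `
    /RU "CONTOSO\svc-patching" /RP *
```

On the next node, the same task would be created with a later /ST value, such as 02:15 for a 3 am patch window.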
The script does about the same thing as the StartDagServerMaintenance and StopDagServerMaintenance scripts, but in a slightly different way. It does not block databases and servers from activation. It checks whether a backup is running and will not move any databases while one is. It also tries to fix some errors, such as a content index or replication being in a failed state.
The script loops, checking the database and replication health state each time. If something is wrong, it tries to correct it and also tries to move the databases to another node in your DAG. Between loops there is a one-minute pause to let replication and other things catch up and reach a steady state.
There is a loop limit of 10. I figured that if the actions could not be performed in 10 tries, it is not worth trying anymore. At that point you probably have something worse to handle than a little glitch in your DAG.
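To illustrate the loop described above, here is a simplified, hypothetical sketch. It is not the script this article refers to; it only shows the general shape (backup check, repair attempts, move, pause, retry limit) using standard Exchange 2010 cmdlets. The parameter names and the specific repair actions are assumptions.

```powershell
# Hypothetical sketch of the retry loop; not the actual script discussed in this article.
param(
    [string]$Server = $env:COMPUTERNAME,  # node that is about to be patched
    [int]$MaxLoops = 10,                  # give up after this many attempts
    [int]$SleepSeconds = 60               # pause between attempts
)

for ($i = 1; $i -le $MaxLoops; $i++) {
    # Do not touch anything while a backup is running; wait and try again.
    $inBackup = Get-MailboxDatabase -Server $Server -Status |
        Where-Object { $_.BackupInProgress }

    if (-not $inBackup) {
        $copies = @(Get-MailboxDatabaseCopyStatus -Server $Server)

        foreach ($copy in $copies) {
            $isActive = "$($copy.Status)" -eq "Mounted"

            # Try to reseed a failed content index on a passive copy.
            if (-not $isActive -and "$($copy.ContentIndexState)" -eq "Failed") {
                Update-MailboxDatabaseCopy -Identity $copy.Name -CatalogOnly -Confirm:$false
            }
            # Try to resume a copy whose replication has failed.
            if ("$($copy.Status)" -like "Failed*") {
                Resume-MailboxDatabaseCopy -Identity $copy.Name -Confirm:$false
            }
        }

        # Stop once the node hosts no active databases; otherwise try to move them off.
        $active = @($copies | Where-Object { "$($_.Status)" -eq "Mounted" })
        if ($active.Count -eq 0) { break }
        Move-ActiveMailboxDatabase -Server $Server -Confirm:$false
    }

    # Let replication catch up before checking again.
    Start-Sleep -Seconds $SleepSeconds
}
```

Unlike StartDagServerMaintenance.ps1, nothing here blocks activation, so databases can still fail over to the node if another member has a problem during the night.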
The script includes parameters so you can customize it to suit your environment. These include the sleep time on line 120, the number of loops on line 124 and finally the copy and replay queue length thresholds on line 110.
There are a lot of IF statements between lines 23 and 110 that try to handle different situations. Look through these and see if they are suitable for your environment.
The script doesn't have any special error handling or output. If it is scheduled at night there isn't much need for output anyway. Logging can be added for admins to look at later.
Could you schedule the StartDagServerMaintenance and StopDagServerMaintenance scripts instead? You could, but the problem is that they prevent servers and databases from becoming active automatically, and you want everything set back to automatic once each server is patched. Remember, you don't know how long each server takes to patch, so picking the correct schedule could be tricky. If your environment has only a small DAG with 2 servers, you want them to be available as much as possible. My script also tries to correct some issues before moving databases to another node.
While patching can be a source of pain for Exchange admins, hopefully this article has proven to be helpful by showing how you can operate in clustered environments more efficiently. Time, or a lack thereof, is usually the main pain point when dealing with the scenarios mentioned above. If you follow these steps you’ll find that you look at patching work differently from now on.