The Challenges of Reporting and Monitoring Exchange in the Cloud

Written by ENow Software | Mar 5, 2015 1:45:00 AM

This is a story about one company’s trials and tribulations in migrating to and maintaining a hybrid cloud environment, specifically Microsoft Exchange.

Once upon a time Acme Industries decided to move to the cloud. The decision was based on a sales pitch from a cloud vendor that described the pretty blue skies of the cloudy world. “You’ll have fewer servers on-premises so it’ll be less complex and overall you’ll be very happy.” The company agreed that reducing on-premises services, simplifying infrastructure and gaining the ability to reprioritize their IT staff to higher-value tasks made a great deal of sense. And while they all agreed this was a wise decision, they also recognized that, like any deployment, there were going to be new processes to learn and expectations to manage. However, not everything would be moving to the cloud; some accounts would remain on-premises. Therefore, theirs would be a hybrid deployment.

The company then moved their mailboxes and some data into Office 365. They started using Lync and SharePoint too. Slowly, the IT team noticed that the move to the cloud wasn’t always what they thought it would be. They uncovered a variety of challenges. A cloud solution, by itself, might be a very straightforward thing and Outlook is fairly simple. However, in a hybrid deployment, there’s a great deal going on behind the scenes. Acme quickly realized that their hybrid environment was no longer about Exchange, SharePoint or Lync individually. Rather, it was the combining or federating of these pieces that created a series of “unknowns”. For instance, DirSync now needed consideration. The team now had an entirely new set of factors (messaging, AD, networking, security, etc.) to review and plan for.

These factors translated into a new paradigm of complexity. Prior to the migration, when Acme’s users experienced an outage, it was easy to pinpoint the cause because everything was contained on-premises. In their hybrid environment, they encountered more moving parts, including the question of whether the source of the problem was the service provider or internal. Based on the symptoms, the problem could have stemmed from a variety of causes: internet connectivity, a problem with the hybrid Exchange server, scheduled maintenance, conflicting technologies, firewall patching, or a hundred other internal and external flash points. The challenge became how to efficiently and effectively determine the cause. Beyond understanding what the problem was, they needed answers to plenty of other questions. Why is it happening? Where is the problem? Is it an internal system or a third-party data system? Is it a single affected user who’s having internet connection issues from home or in a hotel? The answer to one question typically opened the door to several others.

The issue came to a head when Acme users couldn’t access their Exchange Online mailboxes. All signs indicated a connectivity issue. But the question remained: was it an issue on Acme’s end, a general connectivity issue or an issue with Office 365? Maybe connectivity was a misdiagnosis, and the ADFS infrastructure had failed. But did it fail because of a connectivity issue? Was it the firewall? Was it the ADFS servers? Or possibly a third-party data center was experiencing issues. To further complicate the matter, because some applications and services could be managed through Azure, maybe it was an Azure-only issue? As you can see, many questions had to be asked, and a lot of factors required closer inspection before a root cause could be properly identified. This doesn’t cast blame on the method of distribution (cloud or hybrid); it simply illustrates the complexity of the new model and the new processes needed for its support. The Acme IT team now understood that maintaining visibility was a central requirement in this hybrid evolution. To achieve this, they needed to improve monitoring and reporting.

With all the moving parts, troubleshooting labyrinths and related uncertainty, Acme recognized what was lacking in their deployment: measurement. There’s a well-regarded business axiom: measuring means knowing. At first Acme thought there was no need to measure elements in the cloud because they didn’t run them. After a few such incidents and chases down the troubleshooting rabbit hole, they realized that having enhanced visibility, especially in the cloud, increased their agility in diagnosing and preventing operational problems. The ability to make operational decisions based on measurement against a baseline is the cornerstone of optimized performance. Monitoring and reporting are the keys to that competence.

The first level they monitored was connectivity. They measured whether or not there was a successful connection, then quantified its performance over time. By first establishing a baseline of normal behavior, they could now tell if an account was downloading or exporting an unusual amount of data. This had the potential to uncover latency issues or possible data leaks. The monitoring process allowed Acme to collect data from across their enterprise, which in turn let them suspect certain root causes and rule out others much more quickly. In such instances, they noticed that latency would be high on one server but low on another. This ruled out Office 365 as the culprit and allowed the team to quickly investigate other layers and levels of control.
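The baseline comparison described above can be illustrated with a small Python sketch. Everything in it is an assumption for illustration: the server names, the latency figures, the three-standard-deviation threshold and the helper names are hypothetical, not part of Acme's actual tooling. It flags servers whose current latency departs from their own baseline, then applies the same reasoning as the story: if only some servers look abnormal, the shared service is an unlikely culprit.

```python
from statistics import mean, stdev


def is_anomalous(baseline, sample, threshold=3.0):
    """True if sample is more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return sample != mu
    return abs(sample - mu) / sigma > threshold


def servers_out_of_baseline(baselines, current, threshold=3.0):
    """Return the sorted names of servers whose current value is anomalous."""
    return sorted(
        name for name, value in current.items()
        if is_anomalous(baselines[name], value, threshold)
    )


def likely_scope(flagged, all_servers):
    """Rough heuristic: 'none', 'local' (some servers) or 'shared' (every server)."""
    if not flagged:
        return "none"
    return "shared" if len(flagged) == len(all_servers) else "local"


# Hypothetical latency samples in milliseconds (made-up numbers).
baselines = {
    "hybrid-01": [40, 42, 38, 41, 39],
    "hybrid-02": [45, 44, 46, 43, 47],
}
current = {"hybrid-01": 250, "hybrid-02": 45}

flagged = servers_out_of_baseline(baselines, current)
scope = likely_scope(flagged, list(current))
print(flagged, scope)
```

A real deployment would use far more samples and probably a more robust statistic than a mean and standard deviation, but the principle is the same: measure against a baseline, then compare across servers before blaming the provider.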

They conducted various tests. Performing synthetic transactions verified the various functions of the hybrid deployment, including whether Outlook was properly authenticating. But the main reason Acme stepped up their monitoring and reporting capabilities was not necessarily to improve troubleshooting, but to significantly speed up resolution. By the time users complained about a problem, it was too late. Much like when you hear a car knocking and pinging, the problem has already begun to cause damage. Now time, resources and money are needed to address and repair the issue. Obviously there are costs to the monitoring and reporting processes too, but the overarching benefits, including centralized visibility and a longer overall infrastructure life cycle, offset most of those expenses.
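A synthetic transaction is simply a scripted action that mimics what a user would do, run on a schedule and timed. The sketch below shows the general shape in Python, under loud assumptions: the probe functions are stand-ins that only simulate an authentication attempt, and the names and the five-second budget are hypothetical. In practice the probe would perform a real sign-in or mailbox operation against a test account.

```python
import time


def run_synthetic_transaction(name, probe, budget_s=5.0):
    """Run `probe`, time it, and report whether it succeeded within the time budget."""
    start = time.perf_counter()
    try:
        probe()
        succeeded, error = True, None
    except Exception as exc:  # any failure of the probe counts as a failed check
        succeeded, error = False, str(exc)
    elapsed = time.perf_counter() - start
    return {
        "name": name,
        "ok": succeeded and elapsed <= budget_s,
        "elapsed_s": elapsed,
        "error": error,
    }


# Stand-in probes (hypothetical): replace with a real scripted user action.
def fake_auth_success():
    return "token"


def fake_auth_failure():
    raise RuntimeError("authentication endpoint unreachable")


result_ok = run_synthetic_transaction("outlook-auth", fake_auth_success)
result_fail = run_synthetic_transaction("outlook-auth", fake_auth_failure)
print(result_ok["ok"], result_fail["ok"], result_fail["error"])
```

Because the check runs continuously rather than waiting for a complaint, a failed or slow result can raise an alert before users notice the problem.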

Acme quickly found that the enhanced visibility gained from better monitoring across the enterprise provided the necessary insight to get ahead of some issues. However, they recognized another challenge. In terms of reporting and speed-related issues, Acme could only see half of the equation. In a hybrid deployment, the Microsoft infrastructure side is completely out of sight and control. As they were a large tenant, Acme experienced throttling issues. Throttling is a common Microsoft safeguard that limits a given user to a specific number of tasks in a specific amount of time. This was affecting Acme’s end-user experience. Simply running a report (like mailbox statistics or an ActiveSync report) once a day for 10,000 licenses (translating to possibly 50,000 devices) meant a tremendous amount of bi-directional querying. In a shared infrastructure like Office 365, this causes problems for Microsoft, which is why they impose throttling defaults. Results that should have taken minutes now took hours. This slowdown impacted Acme’s proactive visibility. Beyond asking Microsoft for higher default limits, they found an alternate working solution. Because throttling is applied on a per-tenant/per-account basis, they created multiple accounts. They used one to get mailbox statistics and another for ActiveSync statistics. However, because of their size, they still ran the risk of throttling.
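Some back-of-the-envelope arithmetic shows why splitting reports across accounts helps but does not make the problem go away. In the Python sketch below, the 10,000 mailboxes and 50,000 devices come from the story, but the rate of 100 queries per minute per account is a made-up figure chosen only for illustration; it is not Microsoft's actual throttling limit, and the one-query-per-item model is a simplification.

```python
import math

# ASSUMPTION: illustrative per-account rate, not a real Office 365 limit.
ASSUMED_QUERIES_PER_MINUTE = 100


def report_minutes(queries, rate_per_minute=ASSUMED_QUERIES_PER_MINUTE):
    """Minutes needed to issue `queries` requests at a fixed per-account rate."""
    return math.ceil(queries / rate_per_minute)


# One query per item is assumed for simplicity.
reports = {
    "mailbox_statistics": 10_000,   # one per license
    "activesync_devices": 50_000,   # possibly five devices per license
}

# One account: the reports share a single throttling budget and run back to back.
one_account_minutes = sum(report_minutes(q) for q in reports.values())

# One account per report: each has its own budget, so they run side by side
# and the wall-clock time is set by the slowest report.
two_account_minutes = max(report_minutes(q) for q in reports.values())

print(one_account_minutes / 60, "hours with one account")
print(two_account_minutes / 60, "hours with one account per report")
```

Under these assumed numbers the run drops from 10 hours to a little over 8, which is an improvement but still hours rather than minutes. That matches the story: separate accounts eased the pressure, yet the largest report alone was enough to keep Acme at risk of throttling.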

To properly control their hybrid Office 365 environment, Acme adopted a new reporting best practice. This would directly address the volume issue and focus their workflow on impactful KPIs. They identified which reports were key and for whom. Instead of running a daily report for some KPIs, they understood that they could get the same level of statistics by running a report on a weekly basis (“spreading the load”). They also began targeting subsets of users. They carefully evaluated which groups of users and data brought value, met compliance requirements and maintained optimized operational performance. Additionally, they enriched their data by using the built-in reports provided in Office 365 and in their monitoring/reporting solution, relied on PowerShell scripts and found a few helpful APIs.
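One way to picture “spreading the load” is to assign every user to a fixed day of the week, so that each daily run queries roughly one seventh of the tenant and the whole population is still covered once a week. The Python sketch below is only an illustration of that idea; the bucketing scheme, function names and example addresses are assumptions, not a description of how Acme or any particular product does it.

```python
import hashlib


def bucket_for(user_id, buckets=7):
    """Map a user to a stable bucket (0..buckets-1) using a hash of the ID.

    hashlib is used instead of the built-in hash(), which is randomized
    between Python processes for strings and so would reshuffle users each run.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def users_for_day(users, day_index, buckets=7):
    """Return the subset of users whose report is due on the given day."""
    return [u for u in users if bucket_for(u, buckets) == day_index]


# Hypothetical user list for demonstration.
users = [f"user{i}@acme.example" for i in range(1000)]

weekly_total = 0
for day in range(7):
    todays_users = users_for_day(users, day)
    weekly_total += len(todays_users)
    print(day, len(todays_users))
```

Each user lands in exactly one bucket, so the seven daily subsets add up to the full list while each daily run stays well below the volume of a full report.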

The Acme story has a happy ending. Despite the bumps in the road while adjusting to a hybrid environment, by incorporating a third-party monitoring and reporting solution like Mailscape365 and adopting recognized best practices, they found a healthy balance of services offered from the cloud and on-premises. Their learning curve took them from blind guesswork to continuous performance integrity across a diverse enterprise. Not only were speed, access and uptime improved, which makes end users happier, but the proactive collection of intelligence also allowed the Acme IT team to anticipate and resolve issues faster and more accurately. This, in turn, continues to extend the life cycle of devices, servers and infrastructure assets.
