Tuesday 27 November 2012

Exchange 2010 2 member DAG quorum problem after FSW reboot

Most DAG discussions revolve around DAGs with umpteen servers spanning multiple datacentres, but I am sure that many DAGs are much simpler and are subject to different issues, like the one discussed below.
The setup is an Exchange 2010 two-member DAG with a domain controller (DC) as the file share witness (FSW). While the DC was rebooting, the DAG showed the typical, expected error state:
image
image
When both Exchange DAG members are online there is no apparent problem and nothing appears to be amiss. But if you innocently restart one of the DAG members while the FSW is in this state, you will hit major problems: the databases on the remaining active member will dismount because there is no longer quorum (only one of the three votes is present), and Active Manager will stop the databases being remounted manually until the other DAG member is back up and everything is back in sync. This behaviour, while logical and ultimately to be expected, is disconcerting to say the least!
If you are using one server as a primary and the other as a failover (i.e. one server holds the active copy of every database), then you have lost all service even though the primary server is 100% up. This scenario could give you more downtime than you might have been led to believe, especially during patching, or if your design has one of the DAG members and the FSW running as VMs on the same host and that host is rebooted…
However, after the DC had restarted, the cluster still showed an error on the file share witness, despite the file share being available.
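The vote counting behind this can be sketched in a few lines of PowerShell. This is only an illustration of simple majority arithmetic, assuming the two DAG members and the FSW each hold one vote; the function name Test-MajorityQuorum is made up for this example and is not a cluster cmdlet.

```powershell
# Hypothetical helper: simple majority check, not part of any cluster module.
function Test-MajorityQuorum {
    param(
        [int]$TotalVoters,   # all configured votes (2 DAG members + FSW = 3)
        [int]$VotersOnline   # votes currently reachable
    )
    # A majority is more than half of the configured votes.
    $needed = [int][math]::Floor($TotalVoters / 2) + 1
    return ($VotersOnline -ge $needed)
}

# FSW down, both DAG members up: 2 of 3 votes, quorum holds.
Test-MajorityQuorum -TotalVoters 3 -VotersOnline 2   # True

# FSW down and one DAG member restarting: 1 of 3 votes, quorum lost.
Test-MajorityQuorum -TotalVoters 3 -VotersOnline 1   # False
```

The second call is the situation described above: the surviving member is perfectly healthy, but with one vote out of three it cannot keep the databases mounted.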
image
This was because the cluster only checks the file share resource once per hour; I reduced this to 15 minutes. Looking around the internet, I found this article, which appears to summarise the problem for Exchange 2007 with clusters.
http://blogs.technet.com/b/timmcmic/archive/2009/01/22/exchange-2007-sp1-ccr-windows-2008-clusters-file-share-witness-fsw-failures.aspx
So now the exposure to a loss of quorum was down from 1 hour to 15 minutes after the DC/FSW reboot.
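For anyone wanting to do the same from PowerShell, here is a rough sketch. It rests on two assumptions of mine: that the hourly check corresponds to the RetryPeriodOnFailure common property of the cluster resource (held in milliseconds), and that the cluster has exactly one resource of type "File Share Witness". Check both on your own cluster before changing anything.

```powershell
Import-Module FailoverClusters

# Assumption: exactly one File Share Witness resource exists in this cluster.
$fsw = Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'File Share Witness' } |
    Select-Object -First 1

# Show the current state and retry period (milliseconds) before changing it.
$fsw | Format-List Name, State, RetryPeriodOnFailure

# Assumption: RetryPeriodOnFailure is the setting behind the hourly check.
# 15 minutes expressed in milliseconds = 900000.
$fsw.RetryPeriodOnFailure = 15 * 60 * 1000
```

Run it on one of the DAG members in an elevated PowerShell session; the change applies to the clustered resource, so it only needs doing once.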
To reduce this further, the answer was to bring the file share witness resource online manually after a reboot. I wrote a ps1 script and set it up as a scheduled task on one of the DAG members to start the cluster resource if it was both offline AND the FSW file share could be accessed, which was the norm after a reboot.
i.e. IF the cluster FSW resource is offline AND the FSW share can be accessed, THEN try to start the cluster resource.
image
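In text form, that logic might look something like the sketch below. It is not necessarily the exact script I scheduled; it assumes a single resource of type "File Share Witness", treats both the Offline and Failed states as "offline", and reads the share location from the resource's SharePath parameter.

```powershell
Import-Module FailoverClusters

# Assumption: exactly one File Share Witness resource exists in this cluster.
$fsw = Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'File Share Witness' } |
    Select-Object -First 1

if (-not $fsw) {
    Write-Output 'No File Share Witness resource found; nothing to do.'
    return
}

# UNC path of the witness share, as stored on the cluster resource.
$sharePath = ($fsw | Get-ClusterParameter -Name SharePath).Value

# Assumption: a Failed resource counts as "offline" for this check.
$state     = $fsw.State.ToString()
$isOffline = @('Offline', 'Failed') -contains $state

# Only try to start the resource when the share is reachable again.
if ($isOffline -and (Test-Path -Path $sharePath)) {
    Write-Output "FSW resource is $state and $sharePath is reachable; starting it."
    $fsw | Start-ClusterResource
}
else {
    Write-Output "No action: FSW state is $state."
}
```

The scheduled task needs to run under an account that can both manage the cluster and read the witness share, otherwise the Test-Path check will fail even when the share is up.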
Why the cluster cannot check the FSW resource more frequently is a mystery to me, but I presume there are good reasons.
The moral of the story is to make sure two of the three quorum members are available at all times…
