Tuesday 27 November 2012

Exchange 2010 2 member DAG quorum problem after FSW reboot

Most DAG discussions revolve around DAGs with umpteen servers spanning multiple datacentres, but I am sure that many DAGs are much simpler and are subject to different issues, like the one discussed below.
The setup is an Exchange 2010 two-member DAG with a domain controller (DC) as the file share witness (FSW). While the DC was being rebooted, the DAG showed the typical, expected error state:
image
image
While both Exchange DAG members are online, nothing appears to be amiss. But if you then innocently restart one of the DAG members in this state, you will hit major problems: the databases on the remaining active member will dismount because there is no longer quorum, and Active Manager will prevent them from being remounted manually until the other DAG member is back up and everything is back in sync. This behaviour, while logical and ultimately to be expected, is disconcerting to say the least! If you are using one server as a primary and the other as a failover (i.e. one server holds the active copy of every database), then you have lost all service even though the primary server is 100% up. This scenario could give you more downtime than you might have been led to believe, especially during patching, or if your design has one of the DAG members and the FSW running as VMs on the same host and that host is rebooted…
However, after the DC had restarted, the cluster still showed an error on the file share witness, despite the file share being available.
image
This was because the cluster only checks the file share witness resource once per hour; I reduced this to 15 minutes. Looking around the internet, this article appears to summarise the same problem for Exchange 2007 on Windows 2008 clusters:
http://blogs.technet.com/b/timmcmic/archive/2009/01/22/exchange-2007-sp1-ccr-windows-2008-clusters-file-share-witness-fsw-failures.aspx
So now the exposure to a loss of quorum was down from 1 hour to 15 minutes after the DC/FSW reboot.
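For reference, the one-hour interval is most likely the resource's RetryPeriodOnFailure common property (in milliseconds), which is an assumption here rather than something confirmed above. A minimal sketch of changing it, assuming the FailoverClusters PowerShell module is available on a DAG member and the cluster has a single File Share Witness resource:

```powershell
# Sketch only: assumes the FailoverClusters module (Windows Server 2008 R2 or later)
# and that the one-hour interval is the RetryPeriodOnFailure common property.
Import-Module FailoverClusters

# Locate the witness by resource type, since its name normally embeds the share path
$fsw = Get-ClusterResource | Where-Object { $_.ResourceType.Name -eq "File Share Witness" }

# Show the current value in milliseconds (3600000 = 1 hour)
$fsw | Select-Object Name, State, RetryPeriodOnFailure

# Set the retry period to 15 minutes
$fsw.RetryPeriodOnFailure = 15 * 60 * 1000
```

The same setting is exposed in Failover Cluster Manager on the resource's Policies tab, so the script is only a convenience.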
To reduce this further, the answer was to bring the file share witness resource online manually after a reboot. I wrote a .ps1 script and ran it as a scheduled task on one of the DAG members. It starts the cluster resource if the resource is offline AND the witness file share can be accessed, which was the norm after a reboot.
i.e. IF the cluster FSW resource is offline AND the FSW share can be accessed, THEN try to start the cluster resource.
image
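The logic described above could be sketched roughly as follows. This is not the original script: the share path is a hypothetical placeholder, and it assumes the FailoverClusters module is available and that "Failed" should be treated the same as "Offline".

```powershell
# Sketch only: $witnessShare is a hypothetical placeholder for the real witness share
Import-Module FailoverClusters

$witnessShare = "\\dc01.example.local\DAG1-FSW"

# Find the witness by resource type, since its name normally embeds the share path
$fsw = Get-ClusterResource | Where-Object { $_.ResourceType.Name -eq "File Share Witness" }

# Only act when the resource is down AND the share is reachable again
if ($fsw -and (@("Offline", "Failed") -contains $fsw.State.ToString()) -and (Test-Path -Path $witnessShare))
{
    $fsw | Start-ClusterResource
}
```

Run every few minutes from a scheduled task, this shortens the window without touching the cluster's own retry settings. The account running the task needs rights on the cluster and read access to the share.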
Why the cluster cannot check the FSW resource more frequently is a mystery to me, but I presume there are good reasons.
The moral of the story is to make sure two out of the three quorum members are available at all times…

Saturday 24 November 2012

Windows 8 File History bandwidth stealer!

I recently turned on Windows 8 File History on my laptop and pointed the ‘backup’ at my main server using a UNC share. I wondered at the time whether the system had any logic for running over low-bandwidth or different-subnet connections. I didn't need to wonder for long: the next day at home, with only an ADSL connection to the same server instead of the gigabit LAN, all remote connections were running very slowly. The router showed the uplink bandwidth maxed out. I immediately suspected File History, turned the service off, and the bandwidth was released. Despite this problem I did like the feature, so I knocked up a quick and dirty solution so that it only ran on the same subnet as the LAN.
First of all I removed the triggers from the File History Service so it couldn't start on any triggers. The triggers are Group Policy triggers, but I just removed them anyway and set the service to manual. In an elevated command window, type:
sc triggerinfo "fhsvc" delete
image
You should not remove these triggers without discussing it with your sysadmins first!
So now the service was set to manual and would never autostart (or so I thought...).
I then created a very simple and crude PowerShell .ps1 file that gets triggered at computer startup and sleeps for 5 minutes to allow everything else to start up. My laptop has a DHCP reservation so it always gets the same address, but you could do something much more sophisticated than the code below. I always prefer to pipe PowerShell output to strings, as strings are easier to work with. My wired connection has been renamed to Ethernet, so don’t copy this code directly.
The powershell looks something like this:
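# Wait 5 minutes so the network and other services have time to start
Start-Sleep -Seconds 300
# Collect the IPv4 details of the wired adapter as a single string
$IP_Info = Get-NetIPAddress | Where-Object { $_.InterfaceAlias -eq "Ethernet" -and $_.AddressFamily -eq "IPv4" } | Out-String
# The dots are escaped so -match treats them literally rather than as regex wildcards
if ($IP_Info -match "200\.1\.1\.250")
{
    # "fhsvc" is the service name of the File History Service
    Start-Service "fhsvc"
}
else
{
    Stop-Service "fhsvc"
}
# Reset the startup type to manual after the service has been started
Set-Service "fhsvc" -StartupType Manual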

The Stop-Service call is not strictly required, as the service should theoretically never be running, but it does no harm. Setting the startup type to manual is required, as I noticed the service always sets itself back to autostart whenever it is started!
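A slightly less crude variant of the fixed-address test would check whether the adapter's address falls inside the LAN subnet. A minimal sketch, where the function name Test-SameSubnet and the 200.1.1.0/24 network are assumptions for illustration only:

```powershell
# Hypothetical helper: returns $true when $Address lies inside $Network/$PrefixLength (IPv4 only)
function Test-SameSubnet
{
    param([string]$Address, [string]$Network, [int]$PrefixLength)

    # Turn each dotted IPv4 string into a single number
    $values = foreach ($ip in $Address, $Network)
    {
        $b = [System.Net.IPAddress]::Parse($ip).GetAddressBytes()
        ([long]$b[0] * 16777216) + ([long]$b[1] * 65536) + ([long]$b[2] * 256) + [long]$b[3]
    }

    # Two addresses share a subnet when they fall in the same block of 2^(32 - prefix) addresses
    $blockSize = [math]::Pow(2, 32 - $PrefixLength)
    [math]::Floor($values[0] / $blockSize) -eq [math]::Floor($values[1] / $blockSize)
}

# Example call with literal values; in the script the address would come from Get-NetIPAddress
Test-SameSubnet -Address "200.1.1.250" -Network "200.1.1.0" -PrefixLength 24
```

The result could replace the -match test in the if statement, which would remove the dependency on a DHCP reservation.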
This is saved in a .ps1 file and run as a scheduled task with highest privileges at computer startup. After 5 minutes it starts the service if the laptop is on the same subnet as the server. I no longer need to think about switching it on and off manually.
I think there should be built-in options to limit the bandwidth, stop the service, or use different backup targets depending on the subnet or connection type. Maybe Windows 8.5…
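The scheduled task can be created through the Task Scheduler GUI, or registered from an elevated PowerShell prompt. A minimal sketch using the ScheduledTasks module that ships with Windows 8; the task name and script path are hypothetical, and running as SYSTEM is an assumption:

```powershell
# Sketch only: the task name and script path below are hypothetical placeholders
$action    = New-ScheduledTaskAction -Execute "powershell.exe" -Argument '-NoProfile -ExecutionPolicy Bypass -File "C:\Scripts\FileHistoryCheck.ps1"'

# Fire once at computer startup; the script itself handles the 5 minute wait
$trigger   = New-ScheduledTaskTrigger -AtStartup

# Run as SYSTEM with highest privileges so the service can be started and reconfigured
$principal = New-ScheduledTaskPrincipal -UserId "NT AUTHORITY\SYSTEM" -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName "File History subnet check" -Action $action -Trigger $trigger -Principal $principal
```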