Tuesday 27 November 2012

Exchange 2010 2 member DAG quorum problem after FSW reboot

Most DAG discussions circle around DAGs with umpteen servers spanning multiple datacentres and so on, but I am sure that many DAGs are much simpler, and are subject to different issues like the one discussed below.
The setup is Exchange 2010 with a two-member DAG and a domain controller (DC) as the file share witness (FSW). While the DC was being rebooted, the DAG showed the typical, expected error state:
[screenshots: DAG showing the expected error state while the FSW is offline]
When both Exchange DAG members are online there is no apparent problem and nothing appears amiss. But if you then innocently restart the other DAG member while in this state, you will hit major problems: the databases on the remaining ACTIVE member will dismount because there is no longer quorum, and Active Manager will prevent the databases from being remounted manually until the other DAG member is back up and everything is back in sync. This behaviour, while logical and ultimately to be expected, is disconcerting to say the least! If you are using one server as a primary and the other as a failover (ie one server holds all the active copies of every database), then you have lost all service even though the primary server is 100% up. This scenario could give you more downtime than you might have been led to believe, especially during patching, or if your design has one of the DAG members and the FSW running as VMs on the same host and the host is rebooted…
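Before rebooting a DAG member in a two-member setup like this, it is worth a quick check that both nodes and the witness are healthy. A minimal sketch, assuming the FailoverClusters PowerShell module is present on the DAG member (the witness resource name pattern is an assumption – check yours with Get-ClusterResource):
# Run on one of the DAG members.
Import-Module FailoverClusters
# Both cluster nodes should report "Up".
Get-ClusterNode
# The file share witness resource should report "Online".
Get-ClusterResource | Where-Object { $_.Name -like "File Share Witness*" }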
However, after the DC had restarted, the cluster still showed an error on the file share witness, despite the file share being available.
[screenshot: cluster still showing the file share witness resource in a failed state]
This was because the cluster only checks the file share witness resource once per hour; I reduced this interval to 15 minutes. Looking around the internet, this article appears to summarise the same problem with Exchange 2007 SP1 CCR on Windows 2008 clusters:
http://blogs.technet.com/b/timmcmic/archive/2009/01/22/exchange-2007-sp1-ccr-windows-2008-clusters-file-share-witness-fsw-failures.aspx
So now the exposure to a loss of quorum was down from 1 hour to 15 minutes after the DC/FSW reboot.
To reduce this further, the answer was to manually bring the file share witness resource online after a reboot. I scripted a ps1 file and set it up as a scheduled task on one of the DAG members to start the cluster resource if it was both offline AND the file share witness share could be accessed, which is the normal state after a reboot.
ie IF the cluster FSW resource is offline AND the FSW share can be accessed, THEN try to start the cluster resource.
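A minimal sketch of that script, assuming the FailoverClusters module is available on the DAG member; the witness resource name and share path below are placeholders – substitute your own:
# Placeholders – substitute your own witness resource name and share path.
$fswResourceName = "File Share Witness (\\DC1\DAG1.FSW)"
$fswShare        = "\\DC1\DAG1.FSW"

Import-Module FailoverClusters
$fsw = Get-ClusterResource -Name $fswResourceName

# Only intervene when the resource is offline but the share itself is reachable,
# which is the normal state just after the DC/FSW has finished rebooting.
if ($fsw.State -eq "Offline" -and (Test-Path $fswShare))
{
    Start-ClusterResource -Name $fswResourceName
}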
Why the cluster cannot check the FSW resource more frequently is a mystery to me, but I presume there are good reasons.
The moral of the story is to make sure two of the three quorum members are available at all times…

Saturday 24 November 2012

Windows 8 File History bandwidth stealer!

I recently turned on Windows 8 File History on my laptop and pointed the ‘backup’ at my main server using a UNC share. I wondered at the time whether the system had any logic about running over low-bandwidth/different-subnet connections. I didn't need to wonder for long: the next day at home, with only an ADSL connection to the same server instead of the gigabit LAN, all remote connections were running very slowly and the router showed the uplink bandwidth maxed out. I immediately suspected File History, turned the service off, and the bandwidth was released. Despite this problem I did like the feature, so I knocked up a quick and dirty solution so it only runs on the same subnet as the LAN.
First of all I removed the triggers from the File History Service so it could not be trigger-started. The triggers are Group Policy triggers, but I just removed them anyway and set the service to manual. In an elevated command window type:
sc triggerinfo "fhsvc" delete
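To confirm the triggers really have gone, you can re-run the query form of the command afterwards; it should no longer list any trigger entries for the service:
sc qtriggerinfo "fhsvc"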
You should not remove these triggers without discussing it with your sysadmins first!
So now the service was set to manual and would never autostart (or so I thought...).
I then created a very simple and crude PowerShell ps1 file that is triggered at computer startup and sleeps for 5 minutes to allow everything else to start up. My laptop has a DHCP reservation so it always gets the same address, but you could do something much more sophisticated than the below. I always prefer to pipe PowerShell output to strings as you can work with strings more easily. My wired connection has been renamed to Ethernet, so don’t copy this code directly.
The PowerShell looks something like this:
# Wait 5 minutes for the rest of startup to settle before checking the network.
Start-Sleep -Seconds 300

# Get the IPv4 details of the wired adapter (renamed "Ethernet" on this laptop)
# and pipe them to a string so a simple -match can be used.
$IP_Info = Get-NetIPAddress |
    Where-Object { $_.InterfaceAlias -eq "Ethernet" -and $_.AddressFamily -eq "IPv4" } |
    Out-String

if ($IP_Info -match "200.1.1.250")
{
    # On the home LAN (DHCP reservation), so File History is allowed to run.
    Start-Service "fhsvc"
}
else
{
    # Not on the LAN, so make sure File History stays stopped.
    Stop-Service "fhsvc"
}
# File History sets itself back to automatic start whenever it runs, so force it back to manual.
Set-Service "fhsvc" -StartupType Manual

The Stop-Service call is not strictly required, as in theory the service never starts, but it does no harm. Setting the startup type back to manual is required, as I noticed the service always sets itself back to automatic start whenever it is started!
This is saved in a .ps1 file and run as a high-permissions scheduled task at computer startup. After 5 minutes it starts the service if the laptop is on the same subnet as the server. I no longer need to think about switching the service on and off manually.
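For reference, the registration looks something like this (a sketch using the Windows 8 ScheduledTasks cmdlets; the task name and script path are made up – use your own):
# Hypothetical script path – point at wherever you saved the ps1 file.
$action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\Scripts\FileHistoryCheck.ps1"
$trigger = New-ScheduledTaskTrigger -AtStartup
# Run as SYSTEM with highest privileges so the script can start/stop the service.
Register-ScheduledTask -TaskName "File History subnet check" -Action $action -Trigger $trigger -User "SYSTEM" -RunLevel Highest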
I think there should be an option to limit the bandwidth, stop the service, or use multiple backup targets depending on the subnet in certain connection scenarios. Maybe Windows 8.5…

Monday 8 October 2012

Exchange 2010 DAGs about to be cheaper under Windows Server 2012!

Warning: Exchange 2010 did not run on Windows Server 2012 when this article was first written, but will be engineered to do so.

Running Exchange 2010 DAGs has always meant installing Windows Server 2008/R2 ENTERPRISE Edition, as DAGs require the failover clustering components that are not found in Windows Server 2008/R2 Standard Edition. Enterprise Edition is considerably more expensive than Standard Edition. It also allows much more RAM, as Windows Server 2008 R2 Standard Edition is limited to 32 GB.

However, Windows Server 2012 Standard Edition removes the 32 GB memory cap and includes failover clustering and the other components previously reserved for Enterprise Edition. That means a DAG running on Windows Server 2012 will be significantly cheaper than a DAG running on Server 2008/R2.

Friday 17 August 2012

An end to the scalable networking fiasco?

If you run a lot of Windows servers, chances are you will have come across the disgraceful debacle of 'Scalable Networking', where the network card takes (read: corrupts) some of the load from the OS, thereby improving (read: destroying) performance.

This has been a shambles since it first appeared in Server 2003 (service pack 2 as I remember). Instead of realising performance increases you will infrequently but invariably come across some unfathomable glitch that makes a mockery of the whole technology.

When you first come across this 'feature' many hours will be spent trying to track down bizarre problems that just don't add up.
I have come across this on several occasions when I was naive to the problem, most notably when Exchange 2007 OWA clients (running on W2K3 SP2) would not load properly. MS support eventually suggested I turn off scalable networking and everything worked perfectly in an instant.
Other problems include SQL servers running at a fraction of their expected network throughput, with sporadic delays. Whenever we replace servers nowadays, one of the items to check is the scalable networking settings, since you can never tell what will happen with network card A on OS B until you have tried changing all the settings in your own environment. Performance can vary dramatically, and some settings make your system practically unusable. If in doubt, turn it all off, both in Windows and on the network card itself. This problem remains in Server 2008 R2 running on the latest Dell servers, where you still need to check which settings give the best performance.
There is no logical reason for the sometimes massive performance variations. As I said, disgraceful. This story from MS itself perfectly sums up this shameful, time-wasting technology...
http://blogs.msdn.com/psssql/archive/2010/02/21/tcp-offloading-again.aspx

Here are my own notes for this area when I install 2008 R2 on our latest Dell servers, found by trial and error for our environment - mainly SQL and file traffic (we standardise on a server model/network hardware as much as possible so we are not having to do too much retesting in this area):
install broadcom package
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled

****Broadcom - defaults (ie scalable network settings on the NIC are ON - how bizarre is that)
****Check file copy speed etc
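To check what the OS currently has in force after making changes, the global TCP settings can be queried:
netsh int tcp show global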
For reasons unknown I have always seen more problems in this area with Broadcom - whether this is statistically valid or not I don't know, but I now specify Intel network cards as they have always been my favourite since NT4 days, and disable the Broadcom network cards.
If you run into problems I recommend you change the settings on both the NIC and Windows itself as there is no correct setting - trial and error only I'm afraid.
I know the next time we buy another batch of servers (the NIC etc will be different model by then) we will need to try the settings again to see what works best.

I would personally rank this shambles as the most disastrous Windows Server technology of the last decade. It should either be sorted out once and for all or simply be dropped. I have never worked out whether it is a bug in Windows itself or whether the network card manufacturers are at fault.
I have seen the potential performance gains that this technology can give, but reliability has to take precedence over potential performance gains.

Addendum
I am hoping Server 2012 will solve this problem once and for all as it has NIC teaming baked into the OS and it is likely a lot of work has been done fixing any glitches in this area. We will see...

Saturday 28 July 2012

Reduce SQL Server 2012 install time on Server 2008R2 by pre-installing .Net4 / patches

If you are running SQL Server without clustering or other failover technologies, you will want your SQL Server upgrade to run as quickly as possible so the users can get back onto their applications quickly. I always close all client connections, run a SQL backup of both user and system databases, stop the SQL services and set them to manual, and 'manually' copy the SQL data/log directories etc. to a safe place before starting the upgrade. For me the downtime starts as soon as all the users have to come off so I can take a clean SQL backup.
When upgrading SQL 2008 R2 SP1 on Windows 2008 R2 SP1 to SQL 2012, a large amount of my downtime was SQL Server setup installing .NET 4 and some subsequent .NET 4 fixes before the ‘real’ SQL 2012 upgrade started.
On subsequent installations I was able to reduce my downtime significantly by pre-installing .NET 4 and the appropriate patches ahead of time.
Look in the redist\dotnetframeworks directory of your SQL Server installation media; it should look like the image below.
[screenshot: contents of the redist\dotnetframeworks directory]
Run the dotNetFx40_Full_x86_x64 file first (or the SC variant for a Server Core install).
Once complete, run NDP40-KB2468871-v2, then NDP40-KB2544514, then switch to the appropriate x64 or x86 parent directory and run the file in there, eg NDP40-KB2533523. Now run the SQL Server install and it should be much quicker. If you have many servers in this position it would be better to script it.
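If you do want to script it, something along these lines should do the trick (a sketch only - the .exe names/paths and the /q /norestart silent switches are assumptions, so check each installer's /? output and point $redist at your own copy of the media):
# Hypothetical local copy of the SQL Server media.
$redist = "D:\SQL2012Media\redist\dotnetframeworks"

# Order matters: full framework first, then the patches, then the architecture-specific fix.
$installers = @(
    "dotNetFx40_Full_x86_x64.exe",
    "NDP40-KB2468871-v2.exe",
    "NDP40-KB2544514.exe",
    "x64\NDP40-KB2533523.exe"
)

foreach ($exe in $installers)
{
    # /q /norestart = silent install, no automatic reboot (verify per installer).
    Start-Process -FilePath (Join-Path $redist $exe) -ArgumentList "/q /norestart" -Wait
}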
On Windows Server 2012, .NET 4 should already be ‘pre-installed’ and this should not be necessary.
Happy upgrading!
Bizarre occurrence!
During one early upgrade I had copied the SQL installation media to the local server and then run the SQL installer from that server itself, and SQL setup crashed while installing .NET 4. On investigation of the (many!) setup logs, the actual dotNetFx40 file had not been copied correctly from the master installation media and was corrupted during the copy. Fortunately the crash had no serious consequences and I installed .NET 4 manually as above after copying fresh files over, but there was some concern seeing the SQL install crash, as a test install of the same server build in the lab had installed perfectly! This highlighted a further advantage of pre-installing .NET 4. When you get a crash in the middle of an install it can be a major problem, as various components are left at different levels, and fixing the mess can be very time consuming.

Tuesday 10 July 2012

SQL Server 2012 Standard Lock pages in memory behaviour change from SQL 2008

If you have ever seen this message in the SQL log you know you are in big trouble…
A significant part of sql server process memory has been paged out. This may result in a performance degradation.
This is well documented and can be overcome in more recent builds of SQL Server Standard using trace flag 845.
http://support.microsoft.com/kb/970070
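For reference, on SQL Server 2008/2008 R2 Standard the flag goes in as a startup parameter via SQL Server Configuration Manager; the semicolon-separated startup parameters for the instance end up looking something like this (paths abbreviated), with -T845 appended:
-dC:\...\master.mdf;-eC:\...\ERRORLOG;-lC:\...\mastlog.ldf;-T845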
I (and many others) had complained about the lack of this feature
http://www.sqlservercentral.com/articles/SQL+Server+2008/64421/
and MS listened. If you grant the service account the Lock Pages in Memory right and set the trace flag, you will be up and running. Make sure you configure max server memory appropriately! However, having an ‘extra’ switch was something I had become used to, and looking on my laptop at SQL Server 2012 RTM, which does not have trace flag 845 set but does have Lock Pages in Memory configured, I saw this message:
[screenshot: SQL Server 2012 error log entry indicating locked pages are in use]

The behaviour in SQL 2012 has changed: it now always locks pages in memory once the account has been granted the right.
I looked around and indeed this is now documented in a KB article, although I don’t remember reading about the behaviour change in any ‘what’s new’ articles.
http://support.microsoft.com/kb/2659143/en-us?sd=rss&spid=16139

Here is an extract from the KB article describing the change:
Starting with SQL Server 2012, the memory manager simplifies the usage of "locked pages" across supported editions and processor architectures. In SQL Server 2012, you can enable "locked pages" by granting the "lock pages in memory" user right for the startup account for the instance of SQL Server in Windows. This is applicable for Standard, Business Intelligence, and Enterprise editions of SQL Server 2012 running on both 32-bit and 64-bit processor architectures.
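As a quick sanity check that an instance really is using locked pages, one option is to query sys.dm_os_process_memory. A sketch, assuming the sqlps module that ships with SQL Server 2012 is installed and you are running it on the server itself; a non-zero locked_page_allocations_kb means locked pages are in use:
Import-Module sqlps -DisableNameChecking
# A value greater than zero indicates the memory manager is using locked pages.
Invoke-Sqlcmd -ServerInstance "localhost" -Query "SELECT locked_page_allocations_kb FROM sys.dm_os_process_memory;"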

I think this is a step forward, but I also think there should be a configuration option on the memory page to select or deselect the behaviour. SQL Server could also check that the service account has the correct privilege, and that maximum memory is set to a ‘sensible’ value (a whole story in itself). The memory configuration screen would then look something like this…
[mock-up: memory configuration page with a ‘lock pages in memory’ option added]
This would remove any ambiguity and make it easier for new users to use this feature. You could argue that misuse of the feature could do more damage, but I think users should be made more aware of it and know how and when to set it.

Thursday 24 May 2012

New features in Windows Server 2012 for medium size enterprises


Windows Server 2012 online backup

(Note: this commentary is based on beta software and is subject to change)

I have been looking round for some time for a new backup system on a simple file server/print server type server that meets the following criteria:
  • Reliability – something that just works every day without intervention, and lets you know when there is a problem. Many expensive backup systems are overly complex and require intervention.
  • Offsite storage – preferably pre-compressed and encrypted before departure, to reduce bandwidth and guarantee data safety once the data has left the premises.
  • Multiple backup versions – so you can restore files from specific days.
  • Integrity checking of backed-up data – to guarantee it can be restored.
  • Incremental backups – not in the true sense of the term, but a system that only backs up changes at block level.
  • Cheap to buy – dedicated backup software is far more expensive than it should be.
  • Scriptable – for automation.
I have tested various lower-cost software solutions that back up to the cloud, frequently using Amazon as the storage medium. None of them did exactly what I was looking for, or were robust enough, in my opinion, to employ as an enterprise solution. I was particularly concerned about the quality of the software dealing with ‘below the waterline’ activities such as the VSS module.

After a while I wished that Microsoft would modify the excellent Windows Server Backup utility, wbadmin (which I have used very successfully since Server 2008 on tens of thousands of backups), to use Azure or SkyDrive storage, or something similar, as the backup target. My prayers were answered when I read the article below saying Microsoft were going to do just that. Maybe this has been a common request and I wasn’t alone!
http://blogs.technet.com/b/windowsserver/archive/2012/03/28/microsoft-online-backup-service.aspx
I put my name down on Microsoft Connect for the free beta trial and subsequently got a key. I created a W8 beta VM, installed the software and set up the Microsoft storage account in about 10 minutes. I am not going to tell you how to set up the software, as the link above tells you everything you need to know to install it (check the official documentation after RTM) and it is very simple. You will also find this beta documentation very helpful, especially for scripting information (again, this may change after RTM).
http://www.microsoft.com/en-us/download/details.aspx?id=29005

The only problem was the 10 GB cloud storage limit, which is hardly enough to trial a real file server, but this was soon lifted to 100 GB, which makes the test adequately realistic.

I loaded some test data and… it just worked. I left it for a few days, recovered some data and… it just worked. I had chosen some highly compressible data to back up, and it was being compressed to about ¼ of its original size.

As a version 1 piece of software this is a great effort. Obvious omissions are bare metal recovery and support for Exchange / SQL Server. I can live without direct bare metal recovery on a simple file server because, if you follow good practice and separate the OS and data onto different volumes, you can effectively perform a bare metal recovery – it just takes longer to recover.

My suggestion would be to back up the C (= OS) drive separately using normal backup (wbadmin), store this OS backup on the D (= Data) drive, and then back up the whole D drive using online backup:
Backup 1 – using wbadmin – back up the OS volume only, to (say) the D (= Data) drive:
C:\Users\Administrator>wbadmin start backup -backuptarget:d: -include:c: -allcritical -vssfull
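If you want backup 1 to run unattended every night, one way is a scheduled task (a sketch using the Server 2012 ScheduledTasks cmdlets; adjust the time and drive letters to suit – note the added -quiet so wbadmin does not prompt):
$action  = New-ScheduledTaskAction -Execute "wbadmin.exe" -Argument "start backup -backupTarget:d: -include:c: -allCritical -vssFull -quiet"
$trigger = New-ScheduledTaskTrigger -Daily -At 1am
# SYSTEM with highest privileges so the backup runs without a logged-on user.
Register-ScheduledTask -TaskName "Nightly OS volume backup" -Action $action -Trigger $trigger -User "SYSTEM" -RunLevel Highest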


Backup 2 – online backup of the D (= Data) drive (which now also includes the OS backup) to the Microsoft cloud.

Bare metal recovery should be simple.

1) From the online backup console, recover the WindowsImageBackup directory (the backup of C = OS) to a file share, VHD or USB drive – whichever is your preference

2) Put a Windows Server 2012 DVD in a server and take the recovery option using the downloaded WindowsImageBackup folder as the recovery media target

3) Restore the OS (a command-line alternative for this step is sketched after the list)

4) From the newly restored machine, restore the D (= Data) drives etc. from cloud storage
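For step 3, instead of clicking through the wizard you can also drive the restore from the WinRE command prompt. A sketch, assuming the recovered WindowsImageBackup folder ended up on a share called \\server\backups (list the available versions first, then feed the chosen version identifier to sysrecovery):
wbadmin get versions -backupTarget:\\server\backups
wbadmin start sysrecovery -version:<version-identifier-from-above> -backupTarget:\\server\backups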

It is not clear whether this new backup feature will ever support ‘transaction log based’ applications such as SQL Server and Exchange. The Exchange restore process, for example, is very specific about what it will restore. This new online backup feature looks excellent, and hopefully the storage pricing will be reasonable.