Friday 17 August 2012

An end to the scalable networking fiasco?

If you run a lot of Windows Servers, chances are you will have come across the disgraceful debacle of 'Scalable networking', where the network card is supposed to take some of the load from the OS, thereby improving performance. In practice it corrupts some of the load and destroys performance.

This has been a shambles since it first appeared in Server 2003 (service pack 2 as I remember). Instead of realising performance increases you will infrequently but invariably come across some unfathomable glitch that makes a mockery of the whole technology.

When you first come across this 'feature' many hours will be spent trying to track down bizarre problems that just don't add up.
I have come across this on several occasions when I was naive to the problem, most notably when Exchange 2007 OWA clients (running on W2K3 SP2) would not load properly. MS support eventually suggested I turn off scalable networking, and everything worked perfectly in an instant.
Other problems include SQL servers with a fraction of the expected network throughput and sporadic delays. Whenever we replace servers nowadays, one of the items to check is the scalable network settings, since you can never tell what will happen with network card A on OS B until you have tried changing all the settings in your own environment. Performance can vary dramatically, and some settings make your system practically unusable. If in doubt, turn it all off, both in Windows and on the network card itself. This problem remains in Server 2008R2 running on the latest Dell servers, where you still need to check which settings give the best performance.
There is no logical reason for the sometimes massive performance variations. As I said, disgraceful. This story from MS itself perfectly sums up this shameful, time-wasting technology...
http://blogs.msdn.com/psssql/archive/2010/02/21/tcp-offloading-again.aspx

Here are my own notes on this area for when I install 2008R2 on our latest Dell servers, found by trial and error for our environment - mainly SQL and file traffic (we standardise on a model/network hardware as much as possible so we are not having to do too much retesting in this area):
install broadcom package
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
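
After running the two netsh commands above it is worth confirming that they actually took effect. The following is only a minimal Python sketch of one way to do that, not part of my original notes: it parses the 'Name : value' lines printed by `netsh int tcp show global` and lists any chimney or RSS setting that is not reported as disabled. The function names are made up for illustration, and the label text it searches for ('chimney', 'receive-side scaling') is an assumption about what your Windows version prints, so check it against your own output.

```python
import subprocess


def parse_tcp_globals(text):
    """Turn 'Name : value' lines into a dict; lines without a colon are skipped."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def offloads_still_enabled(settings, keywords=("chimney", "receive-side scaling")):
    """Return the names of matching settings whose value is not 'disabled'.

    The keywords are an assumption about the labels netsh prints.
    """
    return [
        name
        for name, value in settings.items()
        if any(k in name.lower() for k in keywords) and value.lower() != "disabled"
    ]


def read_tcp_globals():
    """Run netsh on the local (Windows) machine and parse its output."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tcp_globals(result.stdout)
```

On the server itself, `offloads_still_enabled(read_tcp_globals())` should come back as an empty list once both settings are off. This only covers the Windows side; the NIC driver settings still have to be checked separately.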

****Broadcom - defaults (ie scalable network settings on the NIC are ON - how bizarre is that)
****Check file copy speed etc
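
For the file copy check, something along these lines would put a number on it so that different settings can be compared. This is just a rough sketch under my own assumptions (the function name and the idea of timing a single large file are illustrative, not a proper benchmark):

```python
import os
import shutil
import time


def copy_throughput_mb_s(src, dst):
    """Copy src to dst and return the rate in MB/s (1 MB = 1024 * 1024 bytes)."""
    size_bytes = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        # Copy finished faster than the timer could measure.
        return float("inf")
    return size_bytes / (1024 * 1024) / elapsed
```

Use a file of several GB and a destination on another server (for example a UNC path), and repeat the copy a few times per setting, since caching and other traffic can skew a single run.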
For reasons unknown I have always seen more problems in this area with Broadcom - whether this is statistically valid or not I don't know, but I now specify Intel network cards as they have always been my favourite since NT4 days, and disable the Broadcom network cards.
If you run into problems I recommend you change the settings on both the NIC and Windows itself as there is no correct setting - trial and error only I'm afraid.
I know the next time we buy another batch of servers (the NIC etc will be a different model by then) we will need to try the settings again to see what works best.

I personally would put this shambles as the most disastrous Windows Server technology in the last decade. It should either be sorted out once and for all or simply be dropped. I have never worked out whether it is a bug in Windows itself or whether the network card manufacturers are at fault.
I have seen the potential performance gains that this technology can give, but reliability has to take precedence over potential performance gains.

Addendum
I am hoping Server 2012 will solve this problem once and for all as it has NIC teaming baked into the OS and it is likely a lot of work has been done fixing any glitches in this area. We will see...