Environment:

- Multi-instance cluster running on 2 VMware guests, ESX 5.1. Guest memory does not exceed the memory per NUMA node on the hypervisor. The hypervisors are not over-allocated and not all resources are in use.
- Windows Server 2008 R2 SP1, 2 vCPUs, 48 GB of RAM.
- SQL Server 2012 Standard Edition SP1, max server memory set to 38 GB.

Symptoms:

- Two queries running on node 2/CLUS2 take significantly longer than the same queries running on node 1/CLUS1 with the same databases and the same data. The application is also much slower executing against node 2/CLUS2 than against node 1/CLUS1.
- The same two queries, with the same databases and data, on a standalone box run in roughly the same time as on node 1/CLUS1.
- Execution plans for the queries are the same across instances.
- A reboot of node 2 last week "seemed" to solve the issue for 2 days. The customer would not allow a failover of CLUS2 to node 1 and CLUS1 to node 2.
- While performance was degraded on node 2/CLUS2, sp_WhoIsActive showed wait types of sp_server_diagnostics sleep and "crypto" waits (they are using column-level encryption), with about 20 active connections.
- The queries are not missing indexes. Querying the plan cache for high-CPU consumers returned only queries we were already aware of.
- When the 2 primary applications were forced to access node 1/CLUS1, performance returned to normal.
- The odd thing: with almost no load on node 2/CLUS2, the SQL Server process still sat at ~50% CPU according to Task Manager (the big liar). sp_WhoIsActive showed only 1 or 2 sessions, one of them being sp_server_diagnostics. This went on for 30 to 45 minutes after load was forced off.
- Top 2 waits were SOS_SCHEDULER_YIELD and DIRTY_PAGE_POLL. There was no I/O over 25 ms on node 2/CLUS2 and no evidence of memory pressure. Resource waits were greater than signal waits (65% vs. 35%). No obvious errors in the errorlog.

It appears that some process, configuration setting, driver, or patch may be causing the poor SQL Server response time on node 2/CLUS2.
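For reference, a resource-vs-signal split like the 65/35 above can be computed from `sys.dm_os_wait_stats` along these lines (a sketch; the wait-type exclusion list is illustrative, not exhaustive):

```sql
-- Sketch: overall signal vs. resource wait percentages since the last restart
-- (or since the stats were cleared). Filters out common benign waits, including
-- the DIRTY_PAGE_POLL and sp_server_diagnostics sleeps seen on this instance.
SELECT
    CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
         AS DECIMAL(5, 2)) AS signal_wait_pct,   -- time runnable, waiting for CPU
    CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms)
         AS DECIMAL(5, 2)) AS resource_wait_pct  -- time waiting on a resource
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'DIRTY_PAGE_POLL',
                        N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT',
                        N'SP_SERVER_DIAGNOSTICS_SLEEP', N'BROKER_TO_FLUSH');
```

Note that these counters are cumulative, so a single snapshot blends in history; sampling twice and diffing gives a truer picture of the current load.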
Besides the failover test any other pointers to help figure out what may be causing this?
Since you state the resource-to-signal ratio is 65/35, have you investigated the host layer? That ratio suggests processor contention. I believe the guideline is that anything above 10% signal wait time is a warning sign, so at 35% this certainly warrants investigation. Given the virtual environment this gets more complex, as you must also look at metrics at the host layer. Look at the blog post by Jonathan Kehayias on CPU Ready time in VMware, as it will assist you in your investigation. Also check whether CPU limits are set on the guest, as this can cause issues as well. Some questions that need to be answered by you are below:

- What sort of CPU load is occurring here (guest/host problem)?
- How many VMware hosts are involved?
- Are the guests on different hosts?
- Is the cluster node that is experiencing issues CPU-bound on the host?
- If live migration is an option, does live migrating the problem guest to a host with a lower CPU load resolve the issue?
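Inside the guest, one quick way to corroborate CPU pressure (consistent with the heavy SOS_SCHEDULER_YIELD waits) is to sample the runnable queue per scheduler. A sketch, not a definitive test; sustained non-zero `runnable_tasks_count` across samples points at threads waiting for CPU, which on a VM can just as easily mean the host is stealing cycles as the guest being genuinely busy:

```sql
-- Sketch: check the runnable queue on the visible user schedulers.
-- Run this several times a few seconds apart while the problem is occurring.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,  -- tasks ready to run but waiting for a quantum
       work_queue_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
  AND scheduler_id < 255;     -- exclude hidden/internal schedulers
```

If the runnable counts stay low while the host-level CPU Ready counters are high, the contention is above the guest, which would fit your "50% CPU with almost no load" observation.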