performance degredation at remote node but isolated
I am trying to troubleshoot a challenging performance issue with SQL Server. Last week we failed over to a remote location (seperate node on a multi-cluster environment) where the server has identical setup as the other nodes but the database is slowing down to a halt over the period of 12 hours or so. - The users are accessing the database through applications located at this remote site as well. - The issue we are experiencing is isolated to just this one server out of several others which are working perfectly fine. Which eliminated my original theory that it had to do with the remote location. - The SSMS error log is completely silent at this time so it tells me nothing. The windows event log is of no help as well (no related SQL Server errors). I compared server performamce and the utlization rate is significantly higher when we aren't experiencing performance issues and the node we are on is located nearby. (90% utilization vs 40% utilization). can anyone shed light on why this might be happening? Perhaps this is a configuration issue at the remote site? Thanks.
If the servers are the same (models, specs with cpu/ram etc) then check your sys.configurations settings and ensure that they are lined up between machines. In particular pay attention to your max degree of parallelism and confirm that's the same. Confirm with your storage administrator that you have similar i/o bandwidth available on both nodes. Also looks to see what external processes may be running on the server and taking up your CPU. There could be AV running without filters for example.
I would certainly check configurations of the "good" and "bad" nodes to see where there is a difference. Also consider any extra workloads on the "bad" node; additional services, tools, programs that may be running that are not on the "good" node. What hardware differences are there (if any)? (RAM/CPU differences) Is there a second/third instance of SQL Server on the "bad" node that you don't know about? Are the power settings on the "bad" node optimally set? ([Link to Brent Ozar's article on how to find and fix this issue]) <-- This may be affecting other boxes in your environment and could apply here. Is the "bad" node fully patched (O/S, SQL Server, Hardware ROMs, BIOS, Drivers)? :
Can you give us more detail on the point that you make when you say "but the database is slowing down to a halt over the period of 12 hours or so." please? What happens after the 12 hours, do you restart the server or does the problem go away by itself? Is your SQL server configured the same way with regard to RAM/CPU? If it has a lower RAM setting then you may be seeing certain queries start to cause paging that then causes problems of its own. Have you checked CPU, HDD and RAM usage over the 12 hours to see if any of them change significantly?
Check the ANSI connection settings too. They can cause different execution plans to be generated. I suspect thought it's nothing that subtle. It's probably a difference in CPU, memory, disk alignment, something like that as the others have suggested. Have you looked at the wait stats on the server? What are the queries waiting on? Is that different than what queries are waiting on in your other servers?
Well about an hour before the database quit responding to requests from the users and it seemed to lock up completely. The only was to fix was to restart the services (Agent). The logs are silent and of course no errors. An hour before the forced restart of services I was able to connect to the database through Management Studio and open up Activity Monitor. There are many active connections with long wait times but I could not see nor access any drop down features under the instance name/tree in the Object Explorer pane. The instance was still green however.... just no options under it (database, security, server objects, replication, management, sql server agent). The server is mirror image in terms of RAM/CPU. The CPU/RAM usage is actually higher when we are on the local node and not experiencing any issues.