Our client has hardware NUMA on their SQL Server box. There are 2 NUMA nodes and each contains 8 CPUs. At the beginning, things will run fine. But after a few days, we start to notice the CPUs belongine to one NUMA node (parent_id 1) are under more pressure (higher runnable_tasks_count). And the performance of our system will start to suffer.
What may cause this? What type of data / stats should we collect to better understand the situation?
I've just spent the day in the SQLOS session @ SQLBits, and I can pretty much tell you why this is now.
Internally, SQL Server has it's own schedulers, which work out when worker threads can be context switched by the OS. It does this by setting the threads it doesn't want scheduled into a suspended state, thereby removing the need for the O/S to do pre-emptive context switches.
Now, these schedulers exist per CPU - but can only schedule worker threads on the NUMA node that they are associated with - so you will have 16 + 3 schedulers in an environment with two nodes, and you can validate the number of NUMA nodes by querying the
Now, when connections are started, they are assigned to a scheduler on a round-robin basis. So, the scheduler will then own that connection, and the worker threads that the scheduler uses will all run on that NUMA node.
So, until connections are dropped, they stay on that NUMA node. Connection pooling will keep connections open and re-use them, which means that they will still stay on that NUMA node even if the application has closed it's connection. Once one node, for whatever reason, is busier, then you can see that the connections which are running on that node will take longer to run the tasks that they need to run, while the other NUMA node will clear out operations faster.
What you may well find is that your client load happens to do something like open up two connections to the SQL Server - and keep one open, and close the other. This, in combination with connection pooling, could cause the scenario you are seeing.
Things to look for:
Quick possible fix:
Hope that helps!
answered Sep 30 '10 at 03:53 PM
Matt Whitfield ♦♦
First of all, thanks for the answers.
It is basically an OLTP system. It has very high traffic during certain hours in a day. What we noticed was that things never really went wrong in the middle of those busy hours. Instead, the system would all of a sudden go into problems during quiet hours. It then occurred to us we had some jobs scheduled during the quiet hours to move data around or delete obsolete data.
So our guess was that this problem could be caused by those jobs filling up the buffer pool or cache of one of the NUMA nodes (we have MAXDOP set o 1). That CPU then became slower when handling normal tasks because it has to do extra memory processing.
To confirm our guess, we have been flushing buffer and cache after all the jobs are complete in the past 2 weeks. So far the system seems to function properly.
The way we flush the buffer and cache is like:
We are still not very sure what exactly is happening here. We ran DBCC MEMORYSTATUS on the system and noticed a few things.
Memory node Id = 0 KB
MEMORYCLERK_SQLBUFFERPOOL (node 0) KB
We don't have much experience in NUMA. Can someone explain whether this looks normal?
answered Oct 13 '10 at 04:09 PM
Can you reproduce this problem on a similarly configured non-production server? If so, there's a simple solution you can test there: apply startup trace flags 8015 and 8048 through the configuration manager, stop sql server and restart.
Where to begin with the explanation :) Matt Whitfield is mostly right. I'll try to add a few details and I hope I'm mostly right, too. :)
SQL Server by default recognizes the memory boundaries of NUMA nodes as well as the CPU boundaries. There isn't one bpool by default on a NUMA server, there is one per NUMA node and each is managed independently. Task and connection scheduling pay attention to the CPU boundaries of the NUMA nodes.
Incoming connections are round-robin distributed among NUMA nodes unless a port has been affinitized. On the NUMA node, the connections are divvied up among the schedulers. Tasks for the connection will be slightly biased to the scheduler hosting the connection, the preferred scheduler. When the preferred scheduler has a count of tasks 20% or more greater than other schedulers in the NUMA node, that defeats the bias and the task will go to a scheduler other than the preferred scheduler. But the task will likely go to a scheduler in the same NUMA node - unless its part of a parallel plan.
The resource utilization of the NUMA nodes is tracked for the purposes of parallel query task allocation; its updated every two seconds. If the "home" node for the connection seems able to take on all tasks for the parallel query it will. Otherwise, if another NUMA node can satisfy the needs of the parallel query, it will. If no single NUMA node can meet the needs, tasks will be distributed across NUMA node boundaries.
Imagine that the connection distribution happens to be wonky - the first and third connections are for massive queries and the 2nd and 4th connections have very little work to do. It could very easily be the case that the first NUMA node's schedulers are sweating hard long after the second NUMA node's schedulers are all done. SQLCAT saw this very thing during a parallel data load scenario. They put together a connection wrapper to terminate if the connection was on a busy scheduler while other schedulers were idle. The client app had to respond to the termination by reconnecting, and pretty soon the incoming connection for the task would land on a less busy NUMA node to balance the number of tasks. By doing this they achieved 25% greater data load throughput.
I kinda like that solution because its clever, but trace flag 8015 + 8048 might be even more beneficial and it doesn't reauire a wrapper and changing client behavior.
Now to the fun part - the easy fix (provided you can verify on a nonproduction system): trace flag 8015 and 8048. Startup trace flag 8015 tells sql server to ignore NUMA. One bpool. One cpu group. More even task distribution. More even memory management, although there won't be any attention paid to making memory references NUMA-node local. (Linchi Shea did some great testing to show that might not matter, anyway.) SQL Server won't have to care about "away buffer lists" which could lead to inflated query memory grant requests which could further limit the size of bpool database cache and maybe even cause pended query memory grants. (Inflated query memory grants are also bad because they increase exposure to a wait and spinlock issue for synchronous query starts within a NUMA node.) There are other possible benefits but I may have already talked your ear off. So let me then say this: only put trace flag 8015 in place with trace flag 8048 also, unless you've verified that there won't be trouble. By default, query memory grant requests and some other memory allocations are serialized through a NUMA node level resource. Add more schedulers to the NUMA node (with trace flag 8015 the server is basically as one huge NUMA node) and if there are enough simultaneous query starts or other simultaneous memory allocations at the NUMA node level, spinlock contention and CMEMTHREAD waits will skyrocket. But, put trace flag 8048 in place and the memory allocations which would otherwise serialize at NUMA node level will serialize at the core level. Voila! One last thing to consider if trace flag 8015 + 8048 looks promising: trace flag 8015 results in decreasing the count of lazy writers from one per hardware NUMA node to one for the SQL Server instance. Make sure that'll be ok for your system.
**** OK, here's a roundup of the best resources I know regarding NUMA and SQL Server, at least until the SQL Server 2012 Internals book comes out. But I have a strong feeling that the issue I described won't be touched in that book.
*First, make sure you don't get in trouble with the Enterprise CAL license limit of 20 CPUs, which can also result in NUMA node imbalance due to not using all CPUs. Jonathan does a great job of explaining that here. http://www.sqlperformance.com/2012/11/system-configuration/2012-cal-problems
*SQLCAT works around NUMA node imbalance to increase data load throughput http://sqlcat.com/sqlcat/b/technicalnotes/archive/2008/04/09/resolving-scheduler-contention-for-concurrent-bulk-insert.aspx
*NUMA away lists and query memory inflation. Exposure to memory allocation spinlocks and waits increases as the allocations increase in size, so query memory inflation hurts CPU as well as memory resource utilization. http://blogs.msdn.com/b/psssql/archive/2012/12/13/how-it-works-sql-server-numa-local-foreign-and-away-memory-blocks.aspx
*Maybe it doesn't make sense for your workloads chase to the lower latency of local memory access anyway? http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx
*Consider the consequences of trace flag 8015
*Use trace flag 8048 to eliminate contention around the per-NUMA node memory allocation resource by promoting it to a per-core resource. Especially if you use trace flag 8015 to ignore NUMA, adding to the number of concurrent queries which might experience CMEMTHREAD waits and spinlock contention.
*Rohit Nayak's great post about CMEMTHREAD waits, some of which can be eliminated with trace flag 8048 http://blogs.msdn.com/b/psssql/archive/2012/12/20/how-it-works-cmemthread-and-debugging-them.aspx
*Additional information about SQL Server treatment of NUMA
http://blogs.msdn.com/b/psssql/archive/2008/01/24/how-it-works-sql-server-2005-numa-basics.aspx http://blogs.msdn.com/b/psssql/archive/2010/02/23/how-it-works-sql-server-2008-numa-and-foreign-pages.aspx http://blogs.msdn.com/b/psssql/archive/2011/11/11/sql-server-clarifying-the-numa-configuration-information.aspx
*Some very nice diagrams
*And some older stuff that is valuable
answered Feb 13 at 10:10 PM