CPU pressure and uneven NUMA load

Our client has hardware NUMA on their SQL Server box. There are 2 NUMA nodes and each contains 8 CPUs. At first, things run fine. But after a few days, we start to notice that the CPUs belonging to one NUMA node (parent_node_id 1) are under more pressure (higher runnable_tasks_count), and the performance of our system starts to suffer.

What may cause this? What type of data / stats should we collect to better understand the situation?
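
For reference, the per-node pressure we're seeing comes from a query roughly like the following - a sketch against sys.dm_os_schedulers, not necessarily the exact statement we run:

-- Sum the scheduler counters per NUMA node (parent_node_id) to compare load
SELECT
    parent_node_id,
    COUNT(*)                  AS visible_schedulers,
    SUM(current_tasks_count)  AS current_tasks,
    SUM(runnable_tasks_count) AS runnable_tasks,
    SUM(work_queue_count)     AS queued_work
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;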

Thanks
asked Sep 19 '10 at 09:38 PM by baoqinye

This is a SQL Server 2005 64 bit environment.
Sep 19 '10 at 09:41 PM baoqinye
Can you tell us what edition of sql server this is (enterprise, standard) and what the CPU settings are for the instance that is seeing the problems? Are you using affinity masking, or using all available CPUs for the instance? Is anything else running on this box?
Sep 20 '10 at 12:24 AM WilliamD

It is our client's environment. They have not been very co-operative with providing details. I will try my best to share what I know.

1) This is SQL Server 2005 Enterprise edition 64 bit. I think it's running on SP2.

2) From the SQL side, we can see there are 16 CPUs. SQL server has detected 2 NUMA nodes upon startup. According to our client, they use hardware NUMA.

3) No affinity mask has been set, so CPU assignment is left to the dynamic (automatic) setting.

4) According to our client, this is a dedicated SQL server box.

Thanks
Sep 20 '10 at 12:48 PM baoqinye

I would suggest taking a look at the wait-stats on the machine to find out if there is anything showing up as a bottleneck.
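
Something along these lines gives a first cut (just a sketch - the exclusion list of harmless background wait types is not exhaustive):

-- Cumulative waits since the last restart (or since the stats were cleared),
-- with a few common idle/background wait types filtered out
SELECT TOP (20)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'REQUEST_FOR_DEADLOCK_SEARCH', 'LOGMGR_QUEUE',
                        'CHECKPOINT_QUEUE', 'SLEEP_TASK', 'WAITFOR',
                        'BROKER_TASK_STOP', 'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;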

The major problem you will face though is troubleshooting whilst half-blind. If your client is experiencing problems, but not letting you take a closer look, then you are not able to help properly. That is like saying you are not feeling well, but not letting the doctor examine you!

Fix the communication problem, then fix the system. This may be out of your reach, but maybe management can do something about that.
Sep 21 '10 at 12:07 AM WilliamD

4 answers

OK - I've just spent the day in the SQLOS session at SQLBits, and I can pretty much tell you why this is happening.

Internally, SQL Server has its own schedulers, which work out when worker threads can be context switched by the OS. It does this by setting the threads it doesn't want scheduled into a suspended state, thereby removing the need for the OS to do pre-emptive context switches.

Now, these schedulers exist per CPU, but can only schedule worker threads on the NUMA node that they are associated with. So you will have 16 + 3 schedulers in your two-node environment; you can validate the number of NUMA nodes by querying the [sys].[dm_os_nodes] DMV, and the number of schedulers using [sys].[dm_os_schedulers]. The +3 schedulers are the DAC scheduler, the Resource Manager scheduler and the Backup / Restore scheduler.
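
Something like this (a rough sketch; the hidden internal schedulers show up with a different status, so grouping on status separates them from the ones running user work):

-- NUMA nodes as SQL Server sees them
SELECT node_id, node_state_desc
FROM sys.dm_os_nodes;

-- Scheduler count per NUMA node and status ('VISIBLE ONLINE' = user work)
SELECT parent_node_id, status, COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
GROUP BY parent_node_id, status
ORDER BY parent_node_id, status;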

Now, when connections are started, they are assigned to a scheduler on a round-robin basis. So, the scheduler will then own that connection, and the worker threads that the scheduler uses will all run on that NUMA node.

So, until connections are dropped, they stay on that NUMA node. Connection pooling will keep connections open and re-use them, which means that they will still stay on that NUMA node even if the application has closed its connection. Once one node is, for whatever reason, busier, the connections running on that node will take longer to complete their tasks, while the other NUMA node clears out its operations faster.

What you may well find is that your client load happens to do something like open up two connections to the SQL Server - and keep one open, and close the other. This, in combination with connection pooling, could cause the scenario you are seeing.

Things to look for:

  • Worker threads with a lot of waits ([sys].[dm_os_workers])
  • Connection pooling being used on the client, with connections not sharing an equal workload
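
Both of those can be checked with something along these lines (a sketch - node_affinity on [sys].[dm_exec_connections] shows which node a connection landed on):

-- How current connections are spread across the NUMA nodes
SELECT node_affinity, COUNT(*) AS connection_count
FROM sys.dm_exec_connections
GROUP BY node_affinity;

-- Workers per node, with their state and last wait type
SELECT s.parent_node_id,
       w.state,
       w.last_wait_type,
       COUNT(*) AS worker_count
FROM sys.dm_os_workers AS w
JOIN sys.dm_os_schedulers AS s
    ON w.scheduler_address = s.scheduler_address
GROUP BY s.parent_node_id, w.state, w.last_wait_type
ORDER BY s.parent_node_id, worker_count DESC;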

Quick possible fix:

  • Disable connection pooling on the client
Hope that helps!
answered Sep 30 '10 at 03:53 PM by Matt Whitfield ♦♦

+1 - brilliant explanation Matt. SQLBits already paying for itself!
Sep 30 '10 at 11:53 PM WilliamD
@WilliamD - thanks - it's been hugely beneficial so far. I can't begin to explain how much I've learned...
Oct 02 '10 at 12:53 AM Matt Whitfield ♦♦
Damn, I wish I could have been at SQLBits for the whole thing, not just the free day!
Oct 14 '10 at 03:19 AM ThomasRushton ♦
@ThomasRushton - I wish you and Kev had been there for longer - I didn't feel I got the chance to meet either of you very much! At least Kev got to come out on the Saturday night... Next time, eh?
Oct 14 '10 at 03:27 AM Matt Whitfield ♦♦

First of all, thanks for the answers.

It is basically an OLTP system with very high traffic during certain hours of the day. What we noticed was that things never really went wrong in the middle of those busy hours. Instead, the system would suddenly run into problems during quiet hours. It then occurred to us that we had some jobs scheduled during the quiet hours to move data around or delete obsolete data.

So our guess was that this problem could be caused by those jobs filling up the buffer pool or cache of one of the NUMA nodes (we have MAXDOP set to 1). That node then became slower at handling normal tasks because it had to do extra memory work.

To test this, for the past 2 weeks we have been flushing the buffer and cache after all the jobs complete. So far the system seems to function properly.

The way we flush the buffer and cache is:

DBCC FREEPROCCACHE;      -- clear the plan cache
CHECKPOINT;              -- write dirty pages to disk so their buffers become clean
DBCC DROPCLEANBUFFERS;   -- remove clean pages from the buffer pool

We are still not very sure what exactly is happening here. We ran DBCC MEMORYSTATUS on the system and noticed a few things.

Memory node Id = 0 KB
VM Reserved 43072
VM Committed 42824
AWE Allocated 22004936
MultiPage Allocator 9864
SinglePage Allocator 5136984

Memory node Id = 1 KB
VM Reserved 67385712
VM Committed 390976
AWE Allocated 37306424
MultiPage Allocator 25856
SinglePage Allocator 5136984

MEMORYCLERK_SQLBUFFERPOOL (node 0) KB
VM Reserved 67059712
VM Committed 65536
AWE Allocated 37306424
SM Reserved 0
SM Commited 0
SinglePage Allocator 0
MultiPage Allocator 1544

MEMORYCLERK_SQLBUFFERPOOL (node 1) KB
VM Reserved 0
VM Committed 0
AWE Allocated 22004936
SM Reserved 0
SM Commited 0
SinglePage Allocator 0
MultiPage Allocator 0

We don't have much experience with NUMA. Can someone explain whether this looks normal?

Thanks
answered Oct 13 '10 at 04:09 PM by baoqinye

It sounds to me like your SQL Agent jobs are all running under one connection, because they are picking up the same connection they've just used via connection pooling - and yes, that would likely lead to the imbalance you're describing. In either case, however, the amount of memory committed (i.e. memory actually being used) isn't huge, so those stats don't look awful to me. An imbalance in reserved memory doesn't matter too much, AFAIK, because reserved memory is just memory that has been ear-marked, rather than necessarily used.
Oct 14 '10 at 01:25 AM Matt Whitfield ♦♦

Can you reproduce this problem on a similarly configured non-production server? If so, there's a simple solution you can test there: apply startup trace flags 8015 and 8048 through SQL Server Configuration Manager, then stop and restart SQL Server.
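
As a sketch of that step: the flags go into the instance's startup parameters in Configuration Manager, and after the restart you can confirm they are active globally with DBCC TRACESTATUS:

-- Assumes ";-T8015;-T8048" has been appended to the startup parameters in
-- SQL Server Configuration Manager and the service has been restarted.
-- The -1 asks for the global status of the listed trace flags.
DBCC TRACESTATUS (8015, 8048, -1);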

Where to begin with the explanation :) Matt Whitfield is mostly right. I'll try to add a few details and I hope I'm mostly right, too. :)

SQL Server by default recognizes the memory boundaries of NUMA nodes as well as the CPU boundaries. There isn't one bpool by default on a NUMA server; there is one per NUMA node, and each is managed independently. Task and connection scheduling pay attention to the CPU boundaries of the NUMA nodes.

Incoming connections are round-robin distributed among NUMA nodes unless a port has been affinitized. On the NUMA node, the connections are divvied up among the schedulers. Tasks for the connection will be slightly biased to the scheduler hosting the connection, the preferred scheduler. When the preferred scheduler's task count is 20% or more greater than that of the other schedulers in the NUMA node, the bias is defeated and the task goes to a scheduler other than the preferred scheduler - but still likely one in the same NUMA node, unless it's part of a parallel plan.
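
To see whether that bias is playing out, a per-scheduler breakdown is more useful than per-node totals (again just a sketch against sys.dm_os_schedulers):

-- Task load per scheduler within each NUMA node; one scheduler sitting well
-- above its neighbours is the kind of skew the 20% rule reacts to
SELECT parent_node_id,
       scheduler_id,
       cpu_id,
       current_tasks_count,
       runnable_tasks_count,
       load_factor
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
ORDER BY parent_node_id, scheduler_id;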

The resource utilization of the NUMA nodes is tracked for the purposes of parallel query task allocation; it's updated every two seconds. If the "home" node for the connection seems able to take on all tasks for the parallel query, it will. Otherwise, if another NUMA node can satisfy the needs of the parallel query, it will. If no single NUMA node can meet the needs, tasks will be distributed across NUMA node boundaries.

Imagine that the connection distribution happens to be wonky - the first and third connections are for massive queries and the 2nd and 4th connections have very little work to do. It could very easily be the case that the first NUMA node's schedulers are sweating hard long after the second NUMA node's schedulers are all done. SQLCAT saw this very thing during a parallel data load scenario. They put together a connection wrapper to terminate if the connection was on a busy scheduler while other schedulers were idle. The client app had to respond to the termination by reconnecting, and pretty soon the incoming connection for the task would land on a less busy NUMA node to balance the number of tasks. By doing this they achieved 25% greater data load throughput.

I kinda like that solution because it's clever, but trace flags 8015 + 8048 might be even more beneficial, and they don't require a wrapper or changes to client behavior.

On a NUMA server, a database block inserted into the bpool goes into the bpool managed by its NUMA node. So if one NUMA node has busier CPUs than another because of SQL Server activity, its bpool will probably be more active, too. You've seen one way to check that - DBCC MEMORYSTATUS. The perfmon counters for the buffer nodes/memory nodes can help, too. In such a case you'd expect to see a big difference in page life expectancy among the NUMA nodes, or, if things are really bad, all NUMA node PLEs will be extremely low.
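
The same per-node PLE counters are also exposed through a DMV, so a sketch like this shows them without firing up perfmon (the Buffer Node object reports one instance per NUMA node):

-- Page life expectancy per buffer node; instance_name is the node id.
-- A persistent large gap between nodes points at the imbalance above.
SELECT instance_name AS buffer_node,
       cntr_value    AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Node%'
  AND counter_name LIKE 'Page life expectancy%';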

Now to the fun part - the easy fix (provided you can verify it on a non-production system): trace flags 8015 and 8048. Startup trace flag 8015 tells SQL Server to ignore NUMA. One bpool. One CPU group. More even task distribution. More even memory management, although no attention will be paid to making memory references NUMA-node local. (Linchi Shea did some great testing to show that might not matter anyway.) SQL Server also won't have to care about "away buffer lists", which can lead to inflated query memory grant requests, which can further limit the size of the bpool database cache and maybe even cause pended query memory grants. (Inflated query memory grants are also bad because they increase exposure to a wait and spinlock issue for simultaneous query starts within a NUMA node.)

There are other possible benefits, but I may have already talked your ear off. So let me say this: only put trace flag 8015 in place together with trace flag 8048, unless you've verified that there won't be trouble. By default, query memory grant requests and some other memory allocations are serialized through a NUMA-node-level resource. Add more schedulers to the NUMA node (with trace flag 8015 the server behaves as one huge NUMA node), and if there are enough simultaneous query starts or other simultaneous memory allocations at the NUMA node level, spinlock contention and CMEMTHREAD waits will skyrocket. But put trace flag 8048 in place and the memory allocations which would otherwise serialize at the NUMA node level will serialize at the core level instead. Voila!

One last thing to consider if trace flag 8015 + 8048 looks promising: trace flag 8015 decreases the count of lazy writers from one per hardware NUMA node to one for the whole SQL Server instance. Make sure that'll be OK for your system.
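
If you want a feel for whether that allocation bottleneck is already biting, the cumulative CMEMTHREAD numbers are easy to check (a sketch; the raw totals only mean much when tracked over time or compared against overall waits):

-- Cumulative CMEMTHREAD waits since the last restart; rapid growth under
-- concurrent query starts suggests the serialization that trace flag 8048 targets
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'CMEMTHREAD';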

OK, here's a roundup of the best resources I know regarding NUMA and SQL Server, at least until the SQL Server 2012 Internals book comes out. But I have a strong feeling that the issue I described won't be touched in that book.

*First, make sure you don't get in trouble with the Enterprise CAL license limit of 20 CPUs, which can also result in NUMA node imbalance due to not using all CPUs. Jonathan does a great job of explaining that here. http://www.sqlperformance.com/2012/11/system-configuration/2012-cal-problems

*SQLCAT works around NUMA node imbalance to increase data load throughput http://sqlcat.com/sqlcat/b/technicalnotes/archive/2008/04/09/resolving-scheduler-contention-for-concurrent-bulk-insert.aspx

*NUMA away lists and query memory inflation. Exposure to memory allocation spinlocks and waits increases as the allocations increase in size, so query memory inflation hurts CPU as well as memory resource utilization. http://blogs.msdn.com/b/psssql/archive/2012/12/13/how-it-works-sql-server-numa-local-foreign-and-away-memory-blocks.aspx

*Maybe it doesn't make sense for your workloads to chase the lower latency of local memory access anyway? http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx

*Consider the consequences of trace flag 8015

http://blogs.msdn.com/b/psssql/archive/2010/04/02/how-it-works-soft-numa-i-o-completion-thread-lazy-writer-workers-and-memory-nodes.aspx

*Use trace flag 8048 to eliminate contention around the per-NUMA-node memory allocation resource by promoting it to a per-core resource. This matters especially if you use trace flag 8015 to ignore NUMA, since that increases the number of concurrent queries sharing the resource and so raises the exposure to CMEMTHREAD waits and spinlock contention.

http://blogs.msdn.com/b/psssql/archive/2011/09/01/sql-server-2008-2008-r2-on-newer-machines-with-more-than-8-cpus-presented-per-numa-node-may-need-trace-flag-8048.aspx

*Rohit Nayak's great post about CMEMTHREAD waits, some of which can be eliminated with trace flag 8048 http://blogs.msdn.com/b/psssql/archive/2012/12/20/how-it-works-cmemthread-and-debugging-them.aspx

*Additional information about SQL Server treatment of NUMA

http://blogs.msdn.com/b/psssql/archive/2008/01/24/how-it-works-sql-server-2005-numa-basics.aspx http://blogs.msdn.com/b/psssql/archive/2010/02/23/how-it-works-sql-server-2008-numa-and-foreign-pages.aspx http://blogs.msdn.com/b/psssql/archive/2011/11/11/sql-server-clarifying-the-numa-configuration-information.aspx

*Some very nice diagrams

http://gavinpayneuk.files.wordpress.com/2012/09/numa-internals-of-sql-server-2012.pdf

*And some older stuff that is valuable

http://blogs.msdn.com/b/slavao/archive/2005/08/02/446648.aspx

http://blogs.msdn.com/b/slavao/archive/2005/11/20/495093.aspx

http://blogs.msdn.com/b/slavao/archive/2007/01/02/q-and-a-ratio-between-awe-allocated-and-vm-committed-reserved-in-dbcc-memorystatus-output.aspx
answered Feb 13 '13 at 10:10 PM by sql_handle
