question

specterunseen avatar image
specterunseen asked

Highest priority areas to evaluate for extreme disk latency

I've been evaluating a server that is reporting massive latency issues. **Symptoms/Problems** - The write latency on both tempdb files is 2500-2800 query the DMV io stats (see glenn berry). - The read latency is impacted on query a few databases with the highest being 159, and another 20 between 20-100ms. - SQL Error log shows thousands of events for the IO requests taking longer than 15 seconds. The vast majority of these are specifically on tempdb (no surprise considering the above!) **Symptom Details** - The write latency isn't significant for many other databases (a few 30's here and there, but most are good). - Machine has 128GB of ram, 24 CPUS **Problem's I've noted in reviewing** - Trace Flag 272 running. - VLF count is very high for around 10 databases. The primary database of 500 GB has 7865 VLF count, with the remaining following from 4276 down to 1057. - Disk configuration shows all databases and logs are on the same drives. However, in looking at the underlying configuration, it appears that it is on NetApp, so I'm having trouble getting a clear picture on if this is a SAN disk configuration issue (the magic black box that gets hard to diagnose), or if something else is causing this. For me to go down the SAN route I need a solid indication its disk io subsystem, as there will be pushback on this unfortunately. - Instant file initialization off - 2 TEMPDB files for entire instance, with 24 logical cpus. - Log grow amounts are significantly under what it should be, showing around 100MB for many databases, and the 500GB database has 450MB growth amount. Evaluating the autogrowth events in the last 22 days didn't show a significant pattern of growth events though. This was surprising. Maybe the fragmentation happened before the restart 22 days ago, as I'm not showing hardly any growth events since the reboot. - Waits Evaluated. Max dop set to 4 (24 logical cpus hyperthreaded i believe), Degree of Parallelism set to 5 (I've advised to consider 8 maxdop and 50 DOP with caution to test impact - 25% of waits CXPacket - 16% writelog - 12% LCK_M_U - 11% PAGEIOLATCH_SH ---------- **My current recommendations** 1. Add 6 more tempdb files to bring total to 8 to help eliminate any contention. 2. Evaluate the SAN configuration for confirmation that the setup is optimal for SQL server, ie if using a large aggregate, ensure the number of disk shelves is plenty to prevent contention between log, tempdb, and data files. Identify the "performance pool" and try to separate tempdb into a separate LUN with it's own dedicated disks,ideally SSD, to optimize tempdb performance and eliminate contention with any other SAN traffic. 3. Adjust autogrowth settings of log and database to better prevent autogrowth events, as well as prevent excessive VLF counts from affecting performance. 4. Implement Ola Hallengren (respect!) scripts for daily maintenance, as I'm also noticing over 100 indexes with >10% row change since last statistics update (with some anomalies show a few tables with extremely out of date stats). 5. Trace Flag 272 currently running. Make sure they understand it could impact identity column insert performance and consider removing unless they need it. I've spent hours and will continue to spend hours trying to advise on this particular area, but I'm hoping I can get clarification/confirmation/correction on any specific areas I'm off on, as there will be pushback on any direction I provide at this point. Please let me know of any more helpful detail I could provide. Add 2 tempdb files
sql-server-2012dmvdisk-io
4 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

DenisT avatar image DenisT commented ·
#3 -- http://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/ -- unless you've already knew about this blog post.
1 Like 1 ·
JohnM avatar image JohnM commented ·
I'd also recommend turning IFI ON. You could also examine the block size of the disk (I believe 64K is the recommended values these days) and if the disk is coming any particular RAID pool. Is there anything else that's hitting the NetApp that might be causing performance issue?
0 Likes 0 ·
specterunseen avatar image specterunseen commented ·
thanks! I've definitely read that article. The information I have right now is what I have to work with. I know IFI should be on and will recommended it for sure.
0 Likes 0 ·
Grant Fritchey avatar image Grant Fritchey ♦♦ commented ·
ASK SQL Server Central runs by you voting. For all helpful answers below, please indicate this by clicking on the thumbs up next to those answers. If any one answer lead to a solution, please indicate this by clicking on the check mark next that answer.
0 Likes 0 ·
Tim avatar image
Tim answered
Reading over this, with tempdb being your highest latency database and you only having two files, increasing that to 8 is a good move, however what is the current initial size of the tempdb files and how big are they now after a workload? You will want to make sure when you add the additional files that they are all equal sized, trace flag 1117 enabled for equal growth, and make sure the bloody things are large enough that they don't have to grow. If normal tempdb size is 8 GB, then at a minimum you want 8 1GB data files and a proper size auto growth value. With Instant File Initialization on this could be any ole value as long as you have space to support it but 256mb to 1GB would be adequate. It really all depends on the workload. The biggest thing to know here, if tempdb is starting out small and having to grow a lot, that is impacting the entire instance and causing a lot of pain. More files can help but that is typically a move to help with latch contention on PFS, GAM, and SGAM pages, not necessarily read latency. For your SAN, if your Netapp is configured as one big array, well it doesn't really matter that LDF and MDF files are shared on the same lun since the underlying disk would be the same regardless of drive letter. If the Netapp is configured with different spindles for different arrays, well then, you could take the SAN admin out for some beers and talk shop to see if there is a hot spot and split some of your database files to the other array. It too many shops, the SAN is a black box. While having those beers with your SAN admin, have them check the network between your box and the storage, is multipathing in place, is it working, are you seeing peeks in utilization while you are seeing performance degradation. Since your write latency doesn't appear to be an issue, and your reads are high, this could lead to an indexing strategy. Have you ran other Glenn Berry scripts to see where your high cost queries and stored procedures are. Maybe you need a few more indexes or adding a few more columns to existing indexes to cover those queries. Do you have Optimize for Adhoc Workloads enabled to make sure you aren't bloating the plan cache with single use queries, even though you have 128GB of ram, it is a valuable resource. What is your Page Life Expectancy. This is more of symptom of reading to much data, but if single use plans are consuming a lot of memory, dumping those out couldn't hurt. Biggest thing, find what is reading so much data in those high latency databases, tune the code or index accordingly, and fix tempdb. Hope this helps.
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Tim avatar image Tim commented ·
Taking indexing out of the equation, are your statistics up to date so the Query Optimizer is creating good plans, is index fragmentation being cared for. If latency is still high, then I would look heavily at the storage infrastructure. The disk, network, drivers, firmware. Is the memory fully being optimized as well.
0 Likes 0 ·
specterunseen avatar image specterunseen commented ·
There is some serious fragmentation on some tables. I've recommended them to implement regular maintenance with Ola Hallengren scripts. Am I correct in walking through this after these issues the storage infrastructure would be the next area of focus?
0 Likes 0 ·
Tim avatar image Tim commented ·
It wouldn't hurt to have the storage infrastructure folks to go ahead and do a health assessment as well. Be up front with them on what all you are doing to troubleshoot and just ask them if they can review as well. Also a heads up that we are about to reorganize and rebuild a lot of indexes so they will be seeing a spike. That would actually help them see if you are hitting a bottleneck on bandwidth.
0 Likes 0 ·
mikebaldwin avatar image
mikebaldwin answered
With multiple tempdb datafiles consider trace flag 1117 so that all the files grow at the same time. @JohnM also recommends instant file initialization which is a must to prevent waits while the files are growing.
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

specterunseen avatar image specterunseen commented ·
thank you. I'll look at that trace flag. Not marking as the answer though, because the overall performance issue I'm working on is not likely to be just the IFI or tempdb growth. Any insights on the SAN, disk IO situation?
0 Likes 0 ·
raadee avatar image raadee commented ·
But do not forget that 1117 makes all databases in the instance behave that way, not just tempdb. Might or it might not be an issue for you.
0 Likes 0 ·
raadee avatar image
raadee answered
I've had customers with massive disk latency going through all possible areas of the server/storage and so on. In the end the problem was solved with indexing. So if you feel that you have above mentioned problems covered, try sp_blitzindex from Brent Ozar. [SP_blitzindex][1] It will give you a report on your indexing status in your databases. If you have massive heaps and a lot of high value missing indexes (now be careful before adding - watch the tutorial) your disks will suffer a slow death.. [1]: http://www.brentozar.com/blitzindex/
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

specterunseen avatar image specterunseen commented ·
Taking indexing out of the equation, what would be your next step? I'll say that the database in question is vendor provided, and almost exclusively it is correctly utilizing index seeks. Yes, it can be optimized, but there is a huge discrepancy between the performance of this one vs the normal impact to a server this vendor app provides.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.