Highest priority areas to evaluate for extreme disk latency
I've been evaluating a server that is reporting massive latency issues.

**Symptoms/Problems**

- The write latency on both tempdb files is 2,500-2,800 ms when I query the file IO stats DMV (per Glenn Berry's diagnostic queries).
- Read latency is elevated on a few databases, the highest at 159 ms, with another 20 databases in the 20-100 ms range.
- The SQL error log shows thousands of "I/O requests taking longer than 15 seconds" events. The vast majority of these are specifically on tempdb (no surprise considering the above!).

**Symptom Details**

- Write latency isn't significant for most other databases (a few in the 30s here and there, but most are good).
- The machine has 128 GB of RAM and 24 logical CPUs.

**Problems I've noted in reviewing**

- Trace flag 272 is running.
- VLF count is very high for around 10 databases. The primary 500 GB database has 7,865 VLFs, with the rest ranging from 4,276 down to 1,057.
- All databases and logs are on the same drives. However, the underlying storage is NetApp, so I'm having trouble getting a clear picture of whether this is a SAN disk configuration issue (the magic black box that gets hard to diagnose) or something else. Before I go down the SAN route I need a solid indication that it's the disk I/O subsystem, as there will be pushback on that, unfortunately.
- Instant file initialization is off.
- Only 2 tempdb data files for the entire instance, against 24 logical CPUs.
- Log autogrowth amounts are significantly under what they should be: around 100 MB for many databases, and 450 MB for the 500 GB database. Evaluating the autogrowth events over the last 22 days didn't show a significant pattern of growth events, though, which surprised me. Maybe the fragmentation happened before the restart 22 days ago, as I'm showing hardly any growth events since the reboot.
- Waits evaluated (see below).
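For reference, the per-file latency numbers above come from the virtual file stats DMV (the same source Glenn Berry's diagnostic queries use). A minimal version of that query looks like this; note the averages are cumulative since the last restart:

```sql
-- Per-file average read/write latency from sys.dm_io_virtual_file_stats.
-- Figures are cumulative since instance startup, not a point-in-time sample.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END  AS avg_read_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
     ON  vfs.database_id = mf.database_id
     AND vfs.file_id     = mf.file_id
ORDER BY avg_write_ms DESC;
```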
MAXDOP is set to 4 (24 logical CPUs, hyperthreaded I believe) and cost threshold for parallelism is set to 5. I've advised considering MAXDOP 8 and cost threshold 50, with caution to test the impact. Top waits:

- 25% CXPACKET
- 16% WRITELOG
- 12% LCK_M_U
- 11% PAGEIOLATCH_SH

----------

**My current recommendations**

1. Add 6 more tempdb data files to bring the total to 8 and help eliminate any contention.
2. Evaluate the SAN configuration to confirm the setup is optimal for SQL Server, i.e. if using a large aggregate, ensure there are enough disk shelves to prevent contention between log, tempdb, and data files. Identify the "performance pool" and try to separate tempdb onto its own LUN with its own dedicated disks, ideally SSD, to optimize tempdb performance and eliminate contention with other SAN traffic.
3. Adjust autogrowth settings for logs and data files to better prevent autogrowth events, and to keep excessive VLF counts from affecting performance.
4. Implement the Ola Hallengren (respect!) scripts for daily maintenance, as I'm also noticing over 100 indexes with >10% row change since the last statistics update (with a few tables showing extremely out-of-date stats).
5. Trace flag 272 is currently running. Make sure they understand it can impact identity-column insert performance, and consider removing it unless they need it.

I've spent hours, and will continue to spend hours, trying to advise on this, but I'm hoping I can get clarification/confirmation/correction on any areas where I'm off, as there will be pushback on any direction I provide at this point. Please let me know of any more helpful detail I could provide.
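The wait percentages above can be reproduced with a query along these lines against the wait stats DMV (the excluded benign wait types are illustrative, not exhaustive):

```sql
-- Top waits as a percentage of total wait time since the last restart,
-- with a few common benign/background waits filtered out.
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0 AS wait_time_sec,
       CAST(100.0 * wait_time_ms / SUM(wait_time_ms) OVER ()
            AS DECIMAL(5, 2)) AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'XE_TIMER_EVENT', N'CHECKPOINT_QUEUE',
                        N'REQUEST_FOR_DEADLOCK_SEARCH',
                        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
                        N'DIRTY_PAGE_POLL', N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;
```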
Reading over this: with tempdb being your highest-latency database and you only having two files, increasing that to 8 is a good move. However, what is the current initial size of the tempdb files, and how big are they now after a workload? You will want to make sure that when you add the additional files they are all equally sized, trace flag 1117 is enabled for equal growth, and the bloody things are large enough that they don't have to grow. If normal tempdb size is 8 GB, then at a minimum you want 8 x 1 GB data files and a properly sized autogrowth value. With instant file initialization on, the growth increment could be any old value as long as you have the space to support it, but 256 MB to 1 GB would be adequate. It really all depends on the workload. The biggest thing to know here: if tempdb is starting out small and having to grow a lot, that is impacting the entire instance and causing a lot of pain. More files can help, but that is typically a move to address latch contention on PFS, GAM, and SGAM pages, not necessarily read latency.

For your SAN: if your NetApp is configured as one big array, it doesn't really matter that LDF and MDF files share the same LUN, since the underlying disks would be the same regardless of drive letter. If the NetApp is configured with different spindles for different arrays, then you could take the SAN admin out for some beers and talk shop to see if there is a hot spot, and split some of your database files onto the other array. In too many shops, the SAN is a black box. While having those beers with your SAN admin, have them check the network between your box and the storage: is multipathing in place, is it working, and are you seeing peaks in utilization while you are seeing performance degradation?

Since your write latency doesn't appear to be an issue outside tempdb, and your reads are high, this could point to an indexing strategy. Have you run the other Glenn Berry scripts to see where your high-cost queries and stored procedures are?
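To make the tempdb advice concrete, a sketch of the kind of script involved is below. The file names, drive letter, and sizes are purely hypothetical; size the files to match your actual tempdb workload, and note that on SQL Server 2016+ tempdb files grow together by default, making TF 1117 relevant only on older versions:

```sql
-- Illustrative only: add equally sized tempdb data files with a fixed
-- MB growth increment. Paths and sizes are placeholders.
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev3, FILENAME = N'T:\tempdb\tempdev3.ndf',
              SIZE = 1024MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev4, FILENAME = N'T:\tempdb\tempdev4.ndf',
              SIZE = 1024MB, FILEGROWTH = 512MB);
-- Repeat for the remaining files, keeping every file the same size.

-- Pre-2016 only: grow all files in a filegroup together.
DBCC TRACEON (1117, -1);
```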
Maybe you need a few more indexes, or a few more columns added to existing indexes to cover those queries. Do you have Optimize for Ad Hoc Workloads enabled, to make sure you aren't bloating the plan cache with single-use queries? Even though you have 128 GB of RAM, it is a valuable resource. What is your Page Life Expectancy? That is more a symptom of reading too much data, but if single-use plans are consuming a lot of memory, dumping those out couldn't hurt. Biggest thing: find what is reading so much data in those high-latency databases, tune the code or indexes accordingly, and fix tempdb. Hope this helps.
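Both of those checks are quick. Something along these lines enables the ad hoc setting and reads the current Page Life Expectancy from the performance counters DMV:

```sql
-- Store a small compiled-plan stub for first-time ad hoc queries
-- instead of caching a full plan.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Current Page Life Expectancy, in seconds (one row per buffer node
-- on NUMA hardware, plus the instance-wide Buffer Manager value).
SELECT [object_name], counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy';
```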
With multiple tempdb data files, consider trace flag 1117 so that all the files grow at the same time. @JohnM also recommends instant file initialization, which is a must to prevent waits while the files are growing.
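On recent builds (SQL Server 2012 SP4 and 2016 SP1 onward, if I recall correctly) you can confirm from within SQL Server whether instant file initialization is actually in effect for the engine service account:

```sql
-- 'Y' means the service account holds Perform Volume Maintenance Tasks
-- and data files can be grown without zero-initialization.
SELECT servicename, service_account, instant_file_initialization_enabled
FROM sys.dm_server_services;
```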
I've had customers with massive disk latency go through all possible areas of the server/storage and so on, and in the end the problem was solved with indexing. So if you feel that you have the above-mentioned problems covered, try sp_BlitzIndex from Brent Ozar. It will give you a report on the indexing status of your databases. If you have massive heaps and a lot of high-value missing indexes (now be careful before adding them - watch the tutorial), your disks will suffer a slow death.
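For anyone who hasn't used it: once the First Responder Kit is installed, usage is roughly as below (the database name is a placeholder, and check the current documentation for the mode numbers before relying on them):

```sql
-- Full index health report for one database:
EXEC dbo.sp_BlitzIndex @DatabaseName = N'YourDatabase';

-- Missing-index requests only (treat these as suggestions, not commands):
EXEC dbo.sp_BlitzIndex @DatabaseName = N'YourDatabase', @Mode = 3;
```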