question

rgilbert avatar image
rgilbert asked

2000ms disk response time suddenly when reorganizing indexes

We have a vendor database with vendor configured maintenance plans but they use the built in rebuild index, check integrity, reorganize etc. It is configured to rebuild indexes weekly, reorganize on weekdays. This has been running fine. I have had my eye on the server because the response times on the SAN disks that the databases are on have consistently over 40ms response times. Well suddenly yesterday the response times spiked to over 2000ms, the user databases were pretty much unusable, and the only messages in the SQL server log were regarding IO taking longer than 15 seconds to complete. The reorganize index task was running. When the reorganize indexes task finished the issue was resolved... but it had taken 12 hours when it usually takes less than 2. The next night it happened again. The Avg Disk Read/Write Sec spiked between 50 and 100 in perfmon. Any Ideas how a reorganize index operation could suddenly cause this behavior? An integrity check hasnt been done since the first time the issue occured. I was thinking possibly a corrupt index? SQL 2008 R2
sql server 2008 r2disk-ioiosan
8 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

rgilbert avatar image rgilbert commented ·
Thanks so much for the quick response. All of the wait statistics point to IO issues - I captured these stats at the time. The SAN guys said nothing had changed. I know the volume size hadn't changed. I was convinced it was SAN related but the only pattern for the major issue was when the reorganize job runs. wait_type wait_time_s pct running_pct BROKER_EVENTHANDLER 4560841.31 36.98 36.98 BACKUPIO 2163639.22 17.54 54.53 ASYNC_IO_COMPLETION 1825311.81 14.80 69.33 BACKUPBUFFER 1797675.60 14.58 83.90 PAGEIOLATCH_SH 641799.13 5.20 89.11 PAGEIOLATCH_EX 366157.80 2.97 92.08 WRITELOG 269198.87 2.18 94.26 CXPACKET 237224.14 1.92 96.18 LOGBUFFER 147928.99 1.20 97.38 LCK_M_U 85228.92 0.69 98.07 LCK_M_SCH_M 66052.84 0.54 98.61 SLEEP_BPOOL_FLUSH 59402.97 0.48 99.09
0 Likes 0 ·
rgilbert avatar image rgilbert commented ·
Pauls script since the reboot this morning... BACKUPIO 5711.06 5709.85 1.22 874304 24.51 0.0065 0.0065 0.0000 BACKUPBUFFER 5310.83 5296.64 14.19 2573834 22.79 0.0021 0.0021 0.0000 ASYNC_IO_COMPLETION 5309.90 5309.90 0.00 23 22.79 230.8653 230.8653 0.0000 PAGEIOLATCH_SH 3489.23 3488.79 0.44 122405 14.97 0.0285 0.0285 0.0000 PAGEIOLATCH_EX 1640.45 1640.24 0.22 74475 7.04 0.0220 0.0220 0.0000 WRITELOG 940.19 937.93 2.26 169910 4.03 0.0055 0.0055 0.000
0 Likes 0 ·
JohnM avatar image JohnM commented ·
Has the issue occurred since the reboot? If not, then the waitstats won't reflect the issue as the DMV's are cleared upon a restart.
0 Likes 0 ·
rgilbert avatar image rgilbert commented ·
Yeah, I know :/ i have the wait stats from before the reboot though. The storage guys insisted we reboot. That stopped the reorganize job too. They swear it isnt something with the SAN. They arent seeing that level of latency on their end. I sent pefmon counters, wait stats, SQL error messages, but because there is a coorelation with the reorganize, i am getting push back. Maybe i will kick off the job outside of the normal window and see what happens. I didnt know if there was something else i could look at
0 Likes 0 ·
Tim avatar image Tim commented ·
Nothing with the SAN, ok then, well how about the network between your SQL Server and their storage, who supports that. If it isn't the actually spinning disk then it is the highway that transports all those 1's and 0's between the two devices. SQL clearly stated it took 2000ms to read and write. So you rebooted, that was today, any issues since? Any new filter drivers installed? Antivirus running on the database server?
0 Likes 0 ·
Show more comments
JohnM avatar image
JohnM answered
Has the data volume suddenly increased by a large magnitude? Was the SAN doing something at the same time as the reorg was? Have you check on the wait stats to see what SQL Server might be waiting on specifically? I recommend Paul Randal's "Tell Me where It Hurts" blog as a reference. http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/ If you want to post back what the results of Paul's script is, that might give us a better clue as to where the problem is.
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Tim avatar image
Tim answered
I am thinking this is more of a disk issue than anything SQL would be doing. 2000ms to read/write to the disk is the issue. Much longer latency from the disk would cause the index job to take longer as well. I would be getting your storage folks involved. Are they doing any storage migrations, recently migrated you to another storage array or pool of disk, things like that. Run something like Crystal Benchmark or SQLIO to stress the disk off hours to show it is having latency issues.
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

JohnM avatar image JohnM commented ·
I would agree with @Tim. My first reaction is that it's a SAN issue with the level of latency that you are seeing.
0 Likes 0 ·
Wilfred van Dijk avatar image
Wilfred van Dijk answered
is your write cache on the SAN disabled? faulty controller? dead battery?
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.