question

buckeyeSQL avatar image
buckeyeSQL asked

Distributing TempDB to address slow I/O performance

We are having difficulty redistributing our 8 TempDB files currently on 1 solid state device to 4 solid state devices. We decided to distribute the load due to our I/O writes taking 500 msecs on average and reads in the 50 msec range. Upon destributing the 8 TempDB files across 4 solid state devices (2 TempDB's per device) we experienced I/O writes of 10 seconds and greater with no activity other than some searches and queries we were running. We received the following errors in the log: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host. [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure [sqsrvres] OnlineThread: QP is not online. Also in the event log we found these messages at around 2:44 am SQL Server has encountered 51 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [M:\tempdb07.ndf] in database [tempdb] (2). The OS file handle is 0x0000098C. The offset of the latest long I/O is: 0x00000004a10000 SQL Server has encountered 37 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\tempdb05.ndf] in database [tempdb] (2). The OS file handle is 0x00000980. The offset of the latest long I/O is: 0x00000004a50000 SQL Server has encountered 52 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [N:\tempdb.mdf] in database [tempdb] (2). The OS file handle is 0x000008D0. The offset of the latest long I/O is: 0x00000009490000 Then again at 2:46am [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed [sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]TCP Provider: The specified network name is no longer available. [sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]Communication link failure [sqsrvres] OnlineThread: QP is not online. I would appreciate any insight into this issue. We've attempted redistributing our TempDB files twice and have had to back out each time.
tempdb
3 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

buckeyeSQL avatar image buckeyeSQL commented ·
We have 8 CPU's and we have considered using IOSIM to do a benchmark. Checking on RAID level...
0 Likes 0 ·
buckeyeSQL avatar image buckeyeSQL commented ·
Also, we had completed all maintenance prior to making the change.
0 Likes 0 ·
buckeyeSQL avatar image buckeyeSQL buckeyeSQL commented ·
8 CPU's in play. We have considered benchmarking using IOSIM. I'll get back to you on the RAID level.
0 Likes 0 ·
Tim avatar image
Tim answered
Have you benchmarked the drives using a tool such as IOSIM to ensure the drives can handle the load? I have heard of several folks having issues with the drives under serious load. During the times you mentioned above was there I/O intensive maintenance tasks going on?
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

buckeyeSQL avatar image buckeyeSQL commented ·
This was being considered for an upcoming weekend. The problem is that this is production and finding a window large enought to allow for a good benchmark is difficult.
0 Likes 0 ·
David Wimbush avatar image
David Wimbush answered
How many CPUs do you have. Microsoft recommends one file per core, and warns that too many files will actually harm performance. Also, what configuration are the disks (ie. what RAID level etc.)?
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Tim avatar image Tim commented ·
The ratio is 1 file per CPU to a certain extent. Once you get into 8,16, and 32 cores the ration drops to 1 file per 2 or 1 per 4. On my 16 core boxes I typically go with 4 to 8 files depending on the I/O.
0 Likes 0 ·
buckeyeSQL avatar image buckeyeSQL commented ·
This is a clustered node. The server has 8 CPU's. We are using RAID 10 for the solid state devices on the SAN. We actually reduced the number of tempdb files down from 8 to 4 resulting in minimal improvement.
0 Likes 0 ·
WilliamD avatar image
WilliamD answered
What OS are you using? Have you made sure that the OS and the SSDs support the TRIM command to speed up writes? Are you using the latest patch levels for your OS, SSD Drivers (if applicable), SSD Controller (some manufacturers update these quite often)? Doing throughput tests is also highly advisable as David and Tim suggested. I have not used IOSim before, but SQLIO will give you a good idea of your IO capabilities. Berent Ozar produced a brilliant [tutorial for SQLIO at SqlServerPedia][1], it talks about SANs but applies to any storage really. [1]: http://sqlserverpedia.com/wiki/SAN_Performance_Tuning_with_SQLIO
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

buckeyeSQL avatar image buckeyeSQL commented ·
We are using Windows 2003. I will check for Trim support. I do believe our patch levels are up to date. We have looked at SQLIO as well. Thanks for the link.
0 Likes 0 ·
Håkan Winther avatar image
Håkan Winther answered
What type of SSD disks do you use? And are they all equal? Look at this link [ http://ssd-reviews.com/][1], as you can see there is a big difference in performance. You can also have a look at [RAM SAN][2] or [Violin][3] The fastest solution for TempDB is to use a large RAM disc (in the server), but it is hard to implement in a clustered environment. [1]: http://ssd-reviews.com/ [2]: http://www.ramsan.com/products/products.asp [3]: http://www.violin-memory.com/products/
2 comments
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

sp_lock avatar image sp_lock commented ·
+1 for the Violin... Been looking at one of these for our business.
0 Likes 0 ·
Oleg avatar image Oleg commented ·
@Jonlee Lockwood You did not actually click on the upvote button :)
0 Likes 0 ·
buckeyeSQL avatar image
buckeyeSQL answered
We have isolated the issue to SQL Response or the server it resides on (W2K3). By shutting down the server our I/O performace for TempDB went from an average write of 500 msecs to less than 10 msecs. Does anyone have any experience with monitoring tools like SQL Response and why it would cause a huge I/O performance hit?
1 comment
10 |1200

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

WilliamD avatar image WilliamD commented ·
That really suprises me. I have known SQL Response to be lightweight, are you monitoring lots of servers with this? It may be an idea to talk with red-gate directly. They are very quick to respond and are friendly too.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.