Mirroring/Recovery Problem - Need Help Please

I have a pair of identical, fairly active SQL 2005 Servers (MAIN and ALT) that collectively host 10 different production databases. 3 of the 10 databases are fairly active, reasonably large OLTP databases, but not overwhelmingly so compared to what many of you talk about here in terms of performance or load. The most critical databases are hosted normally on MAIN and mirrored (high safety with witness) to ALT. Less critical databases are hosted normally on ALT and mirrored (high safety with witness) to MAIN. We use mirroring primarily as a DR solution as we need to be up (in theory) 24/7/365.

Although I really DO NOT believe my issue here is hardware related, I will say that the server hardware is pretty good but not stellar, dual 2.x XEONs with 16GB RAM. Each machine has 24 SATA-II disks that run various RAID-10 configurations for different databases, each of the 3 OLTP databases has its own array for data, all less critical databases share an array for data, and all databases share a dedicated array for transaction logs. Under normal operation, the entire thing operates very nicely, reliable, excellent, and predictable performance, well within our expectations.

So here's the issue: For regular maintenance (Windows updates, etc.), we do a manual PARTNER FAILOVER on each database and allow one server to carry the load for all databases until the other is ready to go again. Recently, one of the OLTP databases began taking an extremely long time to recover during this process. The other 2 databases fail over and recover typically within about 1 minute, give or take a few seconds, and all is well. But the problem database typically requires 6-7 minutes to failover, and then (incidentally) requires nearly 15 additional minutes to catch up and become fully synchronized.

This extremely long recovery happens regardless of which direction we fail this database. Both servers have their recovery interval set to 0 (I know because I always personally check this just to make sure while the phone rings with upset clients and I am screaming at the new principal server to finish recovering this database following a failover). Issuing a manual CHECKPOINT just prior to starting the manual failover does not seem to make any important impact.

The affected database is not particularly busy by SQL Server standards, it has a lot of users but transactions are VERY short. The affected database has no more users or activity than either of the other two OLTP databases; in fact it's probably the least busy of the three. It can’t really be data size either. The MDF for the problem database is just over 18GB (and growing), while the two OLTP databases that fail over and recover as expected are 40GB and 24GB in size. The affected database normally performs admirably as long as it’s left alone on its current principal server.

I normally revel in solving these problems myself, typically with the help of wise posts from other SSC pros. However, I'm at my wit’s end here and am open to any ideas anyone might offer.

more ▼

asked Jan 02, 2012 at 01:00 PM in Default

avatar image

Morgan Leppink
135 2 3 6

(comments are locked)
10|1200 characters needed characters left

1 answer: sort voted first

Have you checked how many VLFs that database has? Run DBCC LOGINFO for the problem database to find out. Having too many or having some that are too large can cause recovery to take a long time.

Kimberly Tripp's article has some good information for deciding if you have too many or too few VLFs.

more ▼

answered Jan 02, 2012 at 01:12 PM

avatar image

329 1 4 8

Gee, I wonder - is 46,657 VLFs too many for one database? The other two OLTP databases I mentioned only have 18K and 8K VLFs respectively. And here I thought I knew what I was doing. LOL!

OK I knew somebody could point me in the right direction. I'll post back and let the group know how it goes after I've fixed this issue. I'm sure this is it.

Thanks so much!

Jan 02, 2012 at 01:38 PM Morgan Leppink

No problem. (If it is the right answer can you accept it please?)

This http://blogs.msdn.com/b/psssql/archive/2009/05/21/how-a-log-file-structure-can-affect-database-recovery-time.aspx MS article, which is aimed more at transactional replication, recommends less than 1000 VLFs, so I think 46000 is probably the problem.

Jan 02, 2012 at 02:07 PM pipthegeek

Followed the advice in Kim's article, reduced the VLFs to less than 300 for all the OLTP databases (64GB log files). All is well, the problem database now fails over in 7 seconds and is fully synchronized in 30 seconds. Hooray, thanks PipTheGeek!

Jan 02, 2012 at 04:17 PM Morgan Leppink
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here



Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.



asked: Jan 02, 2012 at 01:00 PM

Seen: 4304 times

Last Updated: Jan 02, 2012 at 01:00 PM

Copyright 2018 Redgate Software. Privacy Policy