I have a pair of identical, fairly active SQL 2005 Servers (MAIN and ALT) that collectively host 10 different production databases. 3 of the 10 databases are fairly active, reasonably large OLTP databases, but not overwhelmingly so compared to what many of you talk about here in terms of performance or load. The most critical databases are hosted normally on MAIN and mirrored (high safety with witness) to ALT. Less critical databases are hosted normally on ALT and mirrored (high safety with witness) to MAIN. We use mirroring primarily as a DR solution as we need to be up (in theory) 24/7/365.
Although I really DO NOT believe my issue here is hardware related, I will say that the server hardware is pretty good but not stellar, dual 2.x XEONs with 16GB RAM. Each machine has 24 SATA-II disks that run various RAID-10 configurations for different databases, each of the 3 OLTP databases has its own array for data, all less critical databases share an array for data, and all databases share a dedicated array for transaction logs. Under normal operation, the entire thing operates very nicely, reliable, excellent, and predictable performance, well within our expectations.
So here's the issue: For regular maintenance (Windows updates, etc.), we do a manual PARTNER FAILOVER on each database and allow one server to carry the load for all databases until the other is ready to go again. Recently, one of the OLTP databases began taking an extremely long time to recover during this process. The other 2 databases fail over and recover typically within about 1 minute, give or take a few seconds, and all is well. But the problem database typically requires 6-7 minutes to failover, and then (incidentally) requires nearly 15 additional minutes to catch up and become fully synchronized.
This extremely long recovery happens regardless of which direction we fail this database. Both servers have their recovery interval set to 0 (I know because I always personally check this just to make sure while the phone rings with upset clients and I am screaming at the new principal server to finish recovering this database following a failover). Issuing a manual CHECKPOINT just prior to starting the manual failover does not seem to make any important impact.
The affected database is not particularly busy by SQL Server standards, it has a lot of users but transactions are VERY short. The affected database has no more users or activity than either of the other two OLTP databases; in fact it's probably the least busy of the three. It can’t really be data size either. The MDF for the problem database is just over 18GB (and growing), while the two OLTP databases that fail over and recover as expected are 40GB and 24GB in size. The affected database normally performs admirably as long as it’s left alone on its current principal server.
I normally revel in solving these problems myself, typically with the help of wise posts from other SSC pros. However, I'm at my wit’s end here and am open to any ideas anyone might offer.
asked Jan 02, 2012 at 01:00 PM in Default
Have you checked how many VLFs that database has? Run DBCC LOGINFO for the problem database to find out. Having too many or having some that are too large can cause recovery to take a long time.
Kimberly Tripp's article has some good information for deciding if you have too many or too few VLFs.
answered Jan 02, 2012 at 01:12 PM