
Morgan Leppink asked

Need to prevent our mirroring DISASTER from happening again. Need ideas!

Hi all -

We are trying to set up database mirroring for all of our databases in SQL 2005 SP3 Enterprise, fully patched. We have 6 small databases, one pretty gigantic archive database (168GB), and then 3 mission critical production databases. The production databases are quite large: 68GB, 21GB, and 5GB. These databases are growing daily as we run a very busy eCommerce enterprise that simply CANNOT ever be down.

Our main DB machine handles our load almost effortlessly, but we are looking for a disaster recovery model that will keep us up if it ever flat-out fails, AND so that our data center guys can apply updates etc. during waking hours instead of in the middle of the night. Besides, even being offline in the middle of the night is unacceptable to our enterprise.

We have chosen to use SQL 2005 database mirroring, in high safety (synchronous) mode with a witness server for automatic failover. In order to facilitate the setup, we bought a machine identical to our primary server in every way: CPU, memory, disk configurations, motherboard, everything. The witness is a far less capable machine, but it is still very solid (it was our main DB machine at one time). These three machines sit right in the same rack and are connected using a Gigabit LAN.

Anyway, we have been setting up the first 7 (less critical) databases for mirroring and testing manual failover. Everything has worked perfectly, without a hitch, even for the 168GB database. So, with great confidence, we started mirroring our first mission critical database. This led to a disastrous situation (we were totally offline for almost a full hour) that we have now recovered from using SET PARTNER OFF and RESTORE DATABASE WITH RECOVERY, but I am very interested in hearing expert speculation on how this situation could have happened, and especially how to prevent it from happening again.
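
For reference, this is roughly the recovery sequence we ended up running on the original principal (the database name here is a placeholder, not our real one):

    -- Break the mirroring session completely so the database is no longer a mirroring partner
    ALTER DATABASE OurCriticalDB SET PARTNER OFF;

    -- Bring the database out of the restoring state so it will accept connections again
    RESTORE DATABASE OurCriticalDB WITH RECOVERY;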

The 68GB database was set up for mirroring and mirrored successfully (I've sketched the approximate setup commands just after this list). This process was slightly different from the first 7 DBs, in that:

1.) The initial data and log restore on the mirror machine took almost 2 hours, so when the mirror was set up the databases were in "Synchronizing" state for about 15 mins while the mirror caught up with the principal. This database was very busy on the principal while it was being restored on the mirror. However, this went fine, eventually both DBs said "Synchronized."

2.) We were unable to add the witness server to the quorum until after the mirror caught up. This seemed strange, but when we added it everything was fine.

3.) The principal was undergoing its nightly full backups and log backups every 15 mins for almost 24 hours, during which I'm sure that the transaction log was truncated on the principal. But it stayed synchronized with the mirror, and I'm pretty sure that's how mirroring is supposed to work (correct me if I'm wrong).
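
For completeness, the setup we followed was approximately this (the database name, backup share paths, and endpoint addresses are all placeholders):

    -- On the principal: full backup plus a log backup of the (busy) database
    BACKUP DATABASE OurCriticalDB TO DISK = N'\\backupshare\OurCriticalDB_full.bak';
    BACKUP LOG OurCriticalDB TO DISK = N'\\backupshare\OurCriticalDB_log.trn';

    -- On the mirror: restore both WITH NORECOVERY so the database stays in a restoring state
    RESTORE DATABASE OurCriticalDB FROM DISK = N'\\backupshare\OurCriticalDB_full.bak' WITH NORECOVERY;
    RESTORE LOG OurCriticalDB FROM DISK = N'\\backupshare\OurCriticalDB_log.trn' WITH NORECOVERY;

    -- Point the partners at each other (mirror first, then principal)
    ALTER DATABASE OurCriticalDB SET PARTNER = N'TCP://principal.ourdomain.local:5022';  -- run on the mirror
    ALTER DATABASE OurCriticalDB SET PARTNER = N'TCP://mirror.ourdomain.local:5022';     -- run on the principal

    -- Once the pair showed "Synchronized", add the witness (run on the principal)
    ALTER DATABASE OurCriticalDB SET WITNESS = N'TCP://witness.ourdomain.local:5022';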

This morning, during a fairly slow period transaction-wise, we tried to test a manual failover of that 68GB mission critical database with SET PARTNER FAILOVER, exactly like we had done effortlessly with the first 7 DBs (including the huge 168GB one), and all hell broke loose. Both DBs said (at various times) "In Recovery" and "Restoring", and then SELECT * FROM sys.database_mirroring (on the principal) showed the partner as DISCONNECTED while the same query (on the mirror) showed all NULLs for that DB. Then they BOTH showed all NULLs; it was a constant state of insanity.
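
To be precise about what we ran: the failover command and the catalog query looked like this (database name is a placeholder; the real query was just SELECT *, trimmed here to the columns we were watching):

    -- Manual failover, issued on the principal
    ALTER DATABASE OurCriticalDB SET PARTNER FAILOVER;

    -- Mirroring state query we kept running on both servers
    SELECT DB_NAME(database_id)        AS database_name,
           mirroring_state_desc,
           mirroring_role_desc,
           mirroring_partner_name,
           mirroring_witness_state_desc
    FROM   sys.database_mirroring
    WHERE  mirroring_guid IS NOT NULL;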

Attempts to FORCE_SERVICE_ALLOW_DATA_LOSS on the principal were answered with a message stating that the "mirror is busy" and that we should try the command again later. All this while our tech support guys were answering calls as quickly as they could. It was, as you can imagine, extremely frustrating.
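
The forced-service statement we were attempting was the standard one (database name is again a placeholder); as I understand it, forced service is normally issued on the server currently acting as the mirror, which may itself be part of the confusion:

    -- Force service with possible data loss (normally run on the mirror server)
    ALTER DATABASE OurCriticalDB SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;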

Nearly an hour later, we were successful in issuing RESTORE WITH RECOVERY on the original principal database, and we were back in business at last.

This must never happen to us again. We NEED database mirroring, but we have to figure out what happened here so we can prevent it. I would greatly appreciate any speculation or ideas on things to check from you experts.

I am happy to answer any questions or requests for additional information.

Thanks in advance for your input!

Sincerely,

Morgan Leppink

sql-server-2005 database-mirroring

Blackhawk-17 answered

Size is not the determining factor in this. Even the 168 GB DB is a relatively inconsequential switch if it is not active. It is the level of synchronization and the active transactions that cause issues. Put simply, mirroring cannot start a transaction on the Principal and have it complete on the Mirror.

What went wrong is hard to say.

  • Was there network saturation or an outage?
  • Can the drives handle the throughput?
  • What separation is there between the data, log, O/S, tempdb, page file, etc. on the physical drives?
  • Are there any O/S errors at the same time?
  • What were the memory, paging file and CPU levels?
  • Are the log or data files of the DB, or of tempdb, capped in size?
  • Was a backup (data or log) occurring at the same time?

There are many factors that could be involved so, despite your very complete description, there is more to be known.

I would suggest that, in the future, you go down to single-user mode prior to failover, i.e. no in-flight transactions, at a minimum.
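
Even before that, a quick check for open transactions will tell you whether anything is in flight when you are about to pull the trigger. A rough sketch (substitute your own database name):

    -- Report the oldest active transaction in the database, if any (run on the principal)
    DBCC OPENTRAN ('OurCriticalDB');

    -- Or list every session that currently has an open transaction
    SELECT s.session_id, s.login_name, s.host_name, t.transaction_id
    FROM   sys.dm_tran_session_transactions AS t
    JOIN   sys.dm_exec_sessions AS s
           ON s.session_id = t.session_id;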

1 comment

Arrgh, you can't place a mirrored database into SINGLE_USER mode! Msg 1468, Level 16, State 5, Line 1: "The operation cannot be performed on database "TestMirror" because it is involved in a database mirroring session." So this is not really the answer. Any other suggestions?
Morgan Leppink answered

Excellent information BlackHawk-17, thanks for that. We had really hoped that we could make these failovers happen without any noticeable loss of service at all, so resorting to single user mode for every failover, although possible, would reduce the glamour of the solution quite a bit.

IMPORTANT question though: IF we took that approach (single user), would the mirror database then ALSO end up in single user mode after the manual failover, so that we'd have to go back to multi user mode on the mirror machine manually to get the system back up and running? Or would the application of single user mode affect ONLY the principal, in the moments just prior to the failover occurring, so that the mirror would come online in multi-user mode, ready to go?

And here's another question: What will happen if the principal simply FAILS, say in a massive hardware failure scenario? Will the mirroring solution handle that better because the mirror is involved and the principal machine is NOT part of the failover decision (thus it would have nothing to say about it)?

Are you of the opinion that what likely happened here is that the PRINCIPAL was holding things up, waiting for any in-flight transactions to complete (we use a lot of transactions so there most likely WERE some in-flight), and that while it waited, perhaps new transactions kept starting, and so the result was that the failover simply NEVER happened successfully?

Question about single-user mode: Does this mode disconnect everybody instantly, or are in-flight transactions allowed to complete, but then NEW transactions (and connections) are disallowed?

Answers to YOUR questions, if it helps:

1.) This was a reasonably slow time period, so NO, there was no network saturation, I/O ("disk throughput") issues, or CPU or memory over-usage. These machines loaf along at 2% CPU most of the time.

2.) The O/S and page file are on the same disk array, but the logs have their own disk array and data has its own disk array (on both machines - they are identical).

3.) There are no O/S errors at all.

4.) There is no limit placed on growth of the log or data files - each has a disk array all to itself that is currently using less than 5% of available space.

5.) I do not know if a log backup was occurring at the same time, that IS possible. Can you comment on what effect that would have had on the failover?
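
If it helps to confirm, I can check the backup history in msdb after the fact; something along these lines should show whether a log backup overlapped the failover window (the database name and the time window are placeholders):

    DECLARE @window_start datetime, @window_end datetime;
    SET @window_start = '20100101 07:00';   -- placeholder: start of the failover window
    SET @window_end   = '20100101 09:00';   -- placeholder: end of the failover window

    -- Backups of the mirrored database that overlapped the window
    SELECT database_name,
           type,                 -- D = full, L = log, I = differential
           backup_start_date,
           backup_finish_date
    FROM   msdb.dbo.backupset
    WHERE  database_name = N'OurCriticalDB'
      AND  backup_start_date  <= @window_end
      AND  backup_finish_date >= @window_start
    ORDER BY backup_start_date;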

Again, thanks for your help, I am certainly open to any further input.
