Missing data on Subscriber. No latency or errors. How can I troubleshoot this?
This morning after performing a routine database deployment of two new tables I noticed that transactional replication had stopped for more than 3 hours. When replication resumed it picked up 3+ hours later, as if nothing happened, leaving a 3+ hour gap in the data in the subscriber. For example, replication stopped at 3:30am and then resumed at 6:30am, but no data was replicated between 3:30am and 6:30am. No replication errors could be found, no database/server errors, and replication monitor showed zero latency. The transaction log was quite large (61% capacity), but otherwise there were no errors to be found. During the 3-hour period where nothing was being replicated sp_replmonitorsubscriptionpendingcmds showed there were no pending transactions. I was able to determine that data stopped flowing because we have a table that continually logs application events. The only job running during this period was a index rebuild of a large table ~1 billion rows. The job kicked off 30 minutes before replication stalled, and lasted 40 minutes in total duration. Our environment consists of two large databases configured for peer-to-peer transactional replication. We do not utilize the environment as one would expect. Instead of reading and writing to both DBs, one database is our primary customer facing DB and the secondary is for real-time reports. We originally considered peer-to-peer as a cheap DR solution if the primary DB became corrupt. To date we have not used it as such, and in hindsight should have used a simple transactional replication instead. Once replication began to flow again between the publisher and subscriber databases Replication monitor showed the some latency, and sp_replmonitorsubscriptionpendingcmds showed pending transactions. After about 90 minutes there was zero latency and everything was back to normal, but there was a 3+ hour gap in the subscriber’s data. I don’t understand how this happen without there being any errors? I would imagine if replication stopped suddenly it would happened because of an exception. I did some googling and found some information on Replication Agent Stalls, but it was inconclusive. I have no idea what happened and not much to go on. Does anyone know the best way to troubleshoot this? Has someone experienced this in the past? Can someone help point me in the right direction? Any help would be greatly appreciated.
Ensure that there is no flush between these period. If somebody (or an automated process)did a cleanup, replication will never complains about it because they are not pending, neither they are failed but that they were applied and later cleaned so Replication will thing here as subscriber might not require this data and resume to the data post last processing. Perhaps this is why "sp_replmonitorsubscriptionpendingcmds" is blank. Guess that you did an LSN comparasion pre and post this time frame (when you lost the data at subscriber end). for a quicker solution you can try removing the data till 3:30 am (having said that since this time data is missing) at subscriber and then re-initialize the subscription. Or try to pause the replication and apply a fresh backup of the publishers to the subscribers. (but do it carefully as you might not be replication everything but some data/tables. Also if you have multiple subscribers this might be a complex job for you) .