I have come across an issue that doesn't appear to be well documented, well for my particular case anyway. We have our production AG on SAN storage with 512 byte physical sectors. When restoring to our Azure VMs (4k physical sectors) we are experiencing really bad delays and I've found messages in the log saying:
"There have been X misaligned log IOs which required falling back to synchronous IO. The current IO is on file xxx."
I had a database restore to 100% after 50 minutes and then take a further 90 minutes to complete and it seems to be down to the rewriting of the logs to the new sector size. All research points to this occurring between primary and secondary AG servers where the sector size differs but my thinking is that the logic would still apply when you restore a database to another server?
Another thing I have noticed is that the delay in the restore is worse if 2 databases are restoring at the same time. It can literally go from an ETA of 2 hours to 48 hours! (Note that we only recently moved our production to the SAN so the order and scheduling of the restores worked fine prior to this). When it says its going into synchronous mode, does that mean its doing it for that restore.... or for all restores occurring at that point in time?
Another test I done that seems to prove my theory is that I restored a backup to a 4096 server and as expected the misaligned log IO errors were shown. I then took a backup of it and restored again and this time no misaligned log IO errors.
Re-configuring the drives on either end is not an option so enabling Trace Flag 1800 seems to be the solution but I have no test environment of 512 and I am nervous enabling it on our production cluster. Has anyone had any experience with enabling this? My fear is that it will start rewriting all the logs to the new sector size and take ages to come back up. These are 50 24/7 critical databases in an AG some of which are over 1TB, so I would really love to hear from anyone that has had experience with enabling it or even better of someone can confirm that my theory is actually correct.
All servers on SQL Server 2017. Production on CU 18 and Azure servers vary from RTM to CU20.
Thanks in advance