We are seeing a strange issue, strange in the sense that I would not expect it given our configuration.
Our setup is a distributed Always On availability group between two data centers.
DC1 and DC2 each have two SQL Server nodes in a synchronous Always On availability group.
Between the two data centers, replication is asynchronous.
DC1 <==> DC2 distributed, async
SQL1 <==> SQL2 sync (read-only secondary)
SQL3 <==> SQL4 sync (read-only secondary)
The issue happens when we patch or restart SQL4, the secondary SQL Server in the secondary data center. When the server comes back up, it causes very high I/O on all four servers, which leads to latency and timeouts in our application.
The question is: why would restarting a secondary server on the asynchronous side cause high I/O not only on the primary node but on all nodes in the cluster?
All I can see in our monitoring application is that all servers had high I/O, with the primary reporting high checkpoint writes.
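To get more detail than the monitoring tool gives, the send and redo queues could be sampled on each node while SQL4 is catching up. The following is only a sketch, assuming the standard sys.dm_hadr_database_replica_states and sys.availability_replicas views. Note that a secondary replica reports only its own databases, so it would need to be run on every node.

```sql
-- Sketch: per-database send/redo queue state for the local instance.
-- Run on each of the four nodes; the primary also returns rows for its secondaries.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)       AS database_name,
    drs.is_primary_replica,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,       -- KB of log not yet sent to the secondary
    drs.log_send_rate,             -- KB/sec
    drs.redo_queue_size,           -- KB of log received but not yet redone
    drs.redo_rate                  -- KB/sec
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```

Comparing a sample taken before the restart with one taken right after SQL4 rejoins should show whether the queues grow only on SQL4 or on the other replicas as well.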
PARALLEL_REDO_TRAN_TURN goes up from baseline and PARALLEL_REDO_WORKER_WAIT_WORK goes down from baseline, but I don't see any other significant wait types.
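For anyone who wants to compare numbers, the two redo waits can be read directly from sys.dm_os_wait_stats. This is a minimal sketch, not necessarily how our monitoring tool collects them. The counters are cumulative since the instance started (or since the stats were last cleared), so two samples are needed to see the change around the restart.

```sql
-- Sketch: cumulative counters for the two parallel redo waits mentioned above.
-- Take one sample before the restart and one after, then subtract.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'PARALLEL_REDO_TRAN_TURN',
                    N'PARALLEL_REDO_WORKER_WAIT_WORK')
ORDER BY wait_time_ms DESC;
```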
This has happened multiple times so we know it's not coincidental. Any thoughts?