We have a problem with Always On availability groups in Microsoft SQL Server 2016 (SP2). When we try to fail over to the secondary node manually, the failover fails with this error:
Failed to bring availability group 'per-ag1' online. The operation timed out. Verify that the local Windows Server Failover Clustering (WSFC) node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again. (.Net SqlClient Data Provider)
The databases then go into the Not Synchronizing state and the availability group goes into Resolving mode, so we have to reset the secondary node until the availability group returns to the primary node.
We checked the Failover Cluster Manager events and found these errors:
Error1: Network Name resource 'per-ag1_per-lis3' (with associated network name 'PER-LIS3') has Kerberos Authentication support enabled. Failed to add required credentials to the LSA - the associated error code is '-2146893802'. Cluster resource 'per-ag1_per-lis3' of type 'Network Name' in clustered role 'per-ag1' failed.
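The error code in Error1 is printed as a signed 32-bit decimal, which is hard to search for. A minimal Python sketch (the helper name `to_hresult_hex` is only an illustration) converts it to the unsigned hex form that Windows error references normally use. As far as I can tell, 0x80090016 appears in the Windows headers as NTE_BAD_KEYSET ("Keyset does not exist"), but whether that is the actual cause here is an assumption that still needs to be confirmed.

```python
def to_hresult_hex(code: int) -> str:
    """Render a signed 32-bit status code as an unsigned 8-digit hex string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{code & 0xFFFFFFFF:08X}"


error_code = -2146893802  # value quoted in Error1
print(to_hresult_hex(error_code))  # prints 0x80090016
```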
Error2: Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Error3: The Cluster service failed to bring clustered role 'per-ag1' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
And the last one is a timeout:
Error4: Clustered role 'per-ag1' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Error2 mentions a command (Get-ClusterResource), so I ran it in Windows PowerShell. I also checked the SQL Server logs, hoping to find more detail. The result of the command after the failed failover is:
Name                     State    OwnerGroup     ResourceType
Cluster IP Address       Online   Cluster Group  IP Address
Cluster Name             Online   Cluster Group  Network Name
File Share Witness       Online   Cluster Group  File Share Witness
per-ag1                  Offline  per-ag1        SQL Server Availability Group
per-ag1_[my ip address]  Online   per-ag1        IP Address
per-ag1_FSShare          Offline  per-ag1        SQL Server FILESTREAM Share
per-ag1_per-lis3         Failed   per-ag1        Network Name
In the normal situation (after I reset the secondary node and the availability group returns to the primary), everything is back online:
Name                     State    OwnerGroup     ResourceType
Cluster IP Address       Online   Cluster Group  IP Address
Cluster Name             Online   Cluster Group  Network Name
File Share Witness       Online   Cluster Group  File Share Witness
per-ag1                  Online   per-ag1        SQL Server Availability Group
per-ag1_172.16.0.230     Online   per-ag1        IP Address
per-ag1_FSShare          Online   per-ag1        SQL Server FILESTREAM Share
per-ag1_per-lis3         Online   per-ag1        Network Name
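To make the difference between the two listings easier to see, here is a small Python sketch that filters the failed-failover listing down to the resources that are not online. The rows are transcribed by hand from the first Get-ClusterResource output above, and the names `failed_listing` and `not_online` are only illustrative.

```python
# Rows transcribed from the Get-ClusterResource output after the failed failover:
# (Name, State, OwnerGroup, ResourceType)
failed_listing = [
    ("Cluster IP Address", "Online", "Cluster Group", "IP Address"),
    ("Cluster Name", "Online", "Cluster Group", "Network Name"),
    ("File Share Witness", "Online", "Cluster Group", "File Share Witness"),
    ("per-ag1", "Offline", "per-ag1", "SQL Server Availability Group"),
    ("per-ag1_[my ip address]", "Online", "per-ag1", "IP Address"),
    ("per-ag1_FSShare", "Offline", "per-ag1", "SQL Server FILESTREAM Share"),
    ("per-ag1_per-lis3", "Failed", "per-ag1", "Network Name"),
]


def not_online(rows):
    """Return (name, state, resource_type) for every resource that is not Online."""
    return [
        (name, state, resource_type)
        for name, state, _owner_group, resource_type in rows
        if state.lower() != "online"
    ]


for name, state, resource_type in not_online(failed_listing):
    print(f"{name}: {state} ({resource_type})")
```

Only resources in the 'per-ag1' role show up, and the one in the Failed state is the Network Name resource named in Error1.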
I also checked the “Show Dashboard” report, which shows this critical error: The availability group is offline, and is unavailable. This issue can be caused by a failure in the server instance that hosts the primary replica or by the WSFC availability group resource going offline.
I also have this error in the error log:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server myserverName$. The target name used was HTTP/prs-clsrv1.domain. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (domain name ) is different from the client domain (domain name), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
Do you have any suggestions about this error? Any help would be appreciated. I look forward to hearing suggestions from DBAs.