Why is Secondary Replica Showing in Resolving state after AOAG Automatic failover?

Sometimes AOAG configuration stuck in resolving state and its secondary Replica does not come online. When you check the Alwayson Availability Group configuration, you find it showing in resolving state after Automatic failover of AOAG Configuration. Let us get the reason behind secondary replica stuck in resolving state and solution to overcome such issues in future.

SQL Server AlwaysOn Availability Group

SQL Server AlwaysOn Availability Group is an advanced version of database mirroring and introduced in SQL Server 2012. AOAG is HA & DR solution for SQL Server databases. We can group a bunch of databases as one Availability Group and failover/failback as one entity whereas it was not possible in database mirroring. Database Mirroring can be configured only for single database at a time. You can visit the attached link to learn more about AlwaysOn Basics.

I have also an AOAG configuration in my environment which is set for Automatic failover between both replicas. Last week, we faced an issue during automatic failover. Secondary replica was not transitioning as primary replica during an automatic failover and it was stuck in “Resolving” state. It was full down time for application because primary replica was lost and availability databases were not coming online on secondary replica.

I will explain step by step method to make you understand the issue and the fix.

Automatic Failover Not Working and Secondary Replica Stuck in Resolving State

As secondary replica is not transitioning as primary replica after failover because secondary replica hung in resolving state so here, we will check the main root cause and its fix so that secondary replica comes online after every failover. Let’s check the current configuration of this AOAG. Run below command to get the AOAG details. We can see the configuration is running into automatic failover mode.

SELECT replica_server_name, availability_mode, availability_mode_desc, failover_mode, failover_mode_desc
FROM sys.availability_replicas

Check AoAG details

Now connect to secondary replica in SQL Server Management Studio. Expand the AlwaysOn High Availability folder followed by Availability Group folder. Here, you can see the current state of this availability group which is showing in Resolving state.
secondary replica status in resolving state post failover

I checked the SQL Server error log file but i did not get enough information about this issue to proceed. I searched about this issue and got few suggestions on web that i should check the cluster log. I checked failover cluster manager for critical events. I got below event in failover cluster manager.

Clustered role 'AG_***' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.
No additional attempts will be made to bring the role online or fail it over to another node in the cluster.
Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

check cluster events

Fix Resolving state and Make Secondary Replica Primary

Above issue clearly indicates about some cluster configuration related to role created for AOAG. As we can see, there is one role is showing under roles tab of this cluster group in below image. We can see the current owner of this role is still showing as node 1.
failover cluster manager status

Now right click on this role and choose properties as per below image.
Role properties in FCM

You will get a properties window as shown in below screenshot. Now click on “Failover” tab of this screen to check the failover threshold details.
FCM Role Properties

We can see the failover threshold is set to only 1 in the 6 hours. It means if any failover happen more than once in the 6 hrs time period it will remain in failed state and will not try to come online. Now increase this value as per your need. I have changed it to 10. Click on apply and Ok button.
FCM Role Properties

Now, secondary replica of availability databases will not come online automatically. We need to bring them online. We have multiple options to bring them online from resolving state. You can directly run the failover by right click on availability group in SSMS and then proceed with failover.

Another option is to bring them online from failover cluster manager. You can either right click on the AOAG role and click on “Bring Online…” option or you can failover it to best possible node. As primary replica is down and not accessible so it will be failed over to the secondary replica only because this node is only available this time. Let’s right click on AOAG role in failover cluster manager, choose move and then click on “Best Possible Node”
FCM failover post increasing failover threshold

It will process and come online within few seconds and this time you can see this role is online from node 2 as current owner is showing as node 2 now.
Bring role online

We can also validate this change while making a database connection to the secondary replica in SSMS. Connect to secondary replica and expand the AlwaysOn Availability Group folders. You can see now availability group is showing as primary. That means your database is online from secondary replica to accept the user connections.
resolving state changed to primary

You can also test this exercise to make sure automatic failover is working fine or not to prevent this kind of future outages. Once your earlier primary replica comes online, you shutdown SQL Server services on secondary replica/current primary replica. Once it will go down, your current secondary replica becomes primary automatically. Before shutdown SQL Server services, make sure databases should be fully synchronized.

Hee, we have fixed Alwayson issue where secondary replica is not coming online after failover and hung in resolving state. I hope you like this article. Please follow us on our facebook page and on Twitter handle to get latest updates.

Read More:

Follow me:

Manvendra Deo Singh

I am working as a Technical Architect in one of the top IT consulting firm. I have expertise on all versions of SQL Server since SQL Server 2000. I have lead multiple SQL Server projects like consolidation, upgrades, migrations, HA & DR. I love to share my knowledge. You can contact me on my social accounts for any consulting work.
Manvendra Deo Singh
Follow me:
Summary
Why is AOAG stuck in Resolving state after Automatic failover?
Article Name
Why is AOAG stuck in Resolving state after Automatic failover?
Description
Sometimes AOAG configuration stuck in resolving state and its secondary Replica does not come online. When you check the Alwayson Availability Group configuration, you find it showing in resolving state after Automatic failover of AOAG Configuration? Let us get the reason behind this issue and solution to overcome such issues in future.
Author
Publisher Name
www.techyaz.com

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *