My Availability Group Database Isn’t Synchronizing

I currently manage 5 AlwaysOn Availability Groups. Two are on SQL Server 2014 and overdue for an upgrade, while three of them are on SQL Server 2017.  From time to time I have run into a couple of different situations that I needed to troubleshoot and I want to tell you where to look and what to check on in these scenarios. I can’t possibly tell you about everything that could go wrong, but I can tell you about my experience with AlwaysOn Availability Groups and let you decide if that experience helps you or not.

A Single Database Is Not Synchronizing

If a single Availability Group database is in the “Not Synchronizing” state, what would you do?

There are several things to check. First you could look in SQL Server Management Studio to see if data movement has been suspended. If it has then you will see an icon similar to two pipes || over the database icon.  You will also see messages in the Error Log about data movement being suspended. If this is the case, try to resume data movement. This can be done in SSMS by right clicking the database in the Availability Databases folder and choose “Resume Data Movement”. Alternatively, you could execute the below T-SQL.

ALTER DATABASE [YourDatabase] SET HADR RESUME

Beyond that you can also use PowerShell to resume data movement for the database.

<#
gcm is the alias for Get-Command. Below I'm using wildcards to find any commands with the wor "resume" in it
and specifying that I want PowerShell to look in the DBATools module.
#>
gcm *resume* -Module DBATools

<#
Once I have found a command I think might help me, then I run the alias Help for Get-Help
to find out how to use the command

#>

help Resume-DbaAgDbDataMovement -Detailed

#Resumes data movement on db1 and db2 to ag1 on sql2017a. Prompts for confirmation.

Resume-DbaAgDbDataMovement -SqlInstance sql2017a -AvailabilityGroup ag1 -Database db1, db2
    

 

Second, check the SQL Server Error Log for errors related to the AG and check the  AlwaysOn_health xEvent that is running by default for AGs. To check the default health session for AlwaysOn Availability Groups, open SSMS and navigate to Management > Extended Events > Sessions > AlwaysOn_health. Then, click the plus sign next to it in order to drill down to see the file. Right click the file and choose “View Target Data”. After doing that, you can filter and search the file for relevant entries.

Extended Events AlwaysOn Health

If none of these things gives you enough information to resolve the problem, you could also restart the SQL Server service on the secondary that is showing the database that isn’t synchronizing. Again, there are many ways to do this, and I’m going to encourage you to look at the DBATools module in PowerShell to find an automated way to handle this. As a last resort, you could remove the database from the AG and then re-add it to the AG.

None of the Availability Group Databases Are Synchronizing

What if all of the databases are in the “not synchronizing” state, what would you do?

This is a little different because in this case every database on at least one secondary is in the “Not Synchronizing” state. The prior scenario was only about a single database with a synchronization issue.

The primary thing I check in this scenario is whether the cluster service is running on the secondary. If it isn’t, then none of the databases will synchronize. I have seen this happen several times, particularly after secondary servers have been rebooted. You could also have a network disconnect between this replica and the primary. Again though, the SQL Server Error log or the AlwaysOn_health extended events session will show you this. Additionally, you can look in the Cluster Events section of the Failover Cluster Manager for one of the nodes in the AG. In there you may find that you have lost quorum for some reason.

If for some reason you need to get down into the fine details of what is happening with the failover cluster, then check out the following path in the Windows Event Viewer.

Event Viewer > Applications and Services Log > Microsoft > Windows > FailoverClustering. In that location are three logs called Diagnostic, DiagnosticVerbose and Operational.

Next Steps To Take

  1. Go to YouTube and find videos on Availability Groups.
  2. Consider taking a class from the fine folks at SQLSkills about Failover Clustering and Availability Groups.
  3. Reach out to somebody if you have questions. If you would like help with anything in this post, or with something else related to SQL Server, reach out to me here, or on Twitter, and I’ll be glad to offer assistance.

 

Azure SQL Boot Camp Day 3 Takeaways

Hello reader!  Here we are with the much anticipated post for day 3 of the recent Azure SQL Boot Camp. Here are the previous posts for days one and two .

If you want to get started with Azure  SQL, the boot camp is a great way to do it.  You’ll get Azure experts like Anna Hoffman and Bob Ward, who is also a long time SQL Server expert, teaching you how to leverage the cloud for SQL Server. On day three, Anna does a couple of demos on the subject of HA that demonstrate the capabilities of the Azure platform.  Other days have demos of other concepts as well.

Not only will you learn specifically about SQL Server in Azure, but this series is going to teach you other foundational concepts about Azure that are applicable no matter what you’re trying to do in Microsoft’s cloud. Things like cloud networking and the security aspect that needs to go into that networking, as well as how to give people access to resources in a secure way, are relevant for every part of Azure. With that, let’s dig into what day 3 had to offer.

Day 3 of the Azure SQL Boot Camp was all about availability.  When done right, availability solutions keep your environment up and running when things don’t go right. Here are the key takeaways for Day 3 regarding high availability:

  • Azure has you covered in terms of backup/restore.
  1. A full backup is taken weekly.
  2. A differential backup is taken ever 12 hours.
  3. Log Backups are taken every 5 to 10 minutes.
  4. Backups are stored on Read Access Geo-Redundant Storage, which means there are multiple copies of backups in differing locations.  This protects you from losing the backups.
  5. Point in time restore is available.
  6. Backup integrity checks are performed and there is even automated, random testing of those backups via automatically restoring some of them from time to time.
  7. Backup retention is by default at 7 days.  You can go up to 35 days or even use the Long Term Retention service and keep backups for as long as 10 years.
  8. With Managed Instance a copy only backup option is available.
  9. On premises backups can be used to restore to a Managed Instance as part of a migration strategy.

Let’s be honest, how many of you have a backup strategy that is this solid? Some of you have databases that are in full recovery model with no log backups happening. Yikes! Don’t believe me? Check into your backup history information in msdb with a little T-SQL and you can verify for yourself whether your environment is up to snuff in this regard.

How many of you are writing backups to the same location as the data? And, the big question is this: How many of you are testing whether you can use your backups to perform a restore? Remember, just because you have backups doesn’t mean you can restore.

If you don’t have these sorts of things in place, you might be doing a lot of mopping up after a disastrous situation. “Clean up on aisle 4!”

  • Azure has you covered when it comes to High Availability
  1. In the general purpose tier, Azure customers get a service that behaves like a SQL Server Failover Cluster Instance.
  2. Auto failover is possible and is handled by SQL Server and the Azure Service fabric.
  3. There is local SSD and local Tempdb with data files in premium storage.
  4. Backup files are geo-redundant.
  5. In the Business critical tier, Azure customers get Availability Groups behind the scenes – set up and configured for you automatically. With this service, you get 3 secondaries, only one of which has to harden transactions in order for the primary to commit transactions, and one replica for read only routing.
  6. In the hyperscale tier, you get snapshot backups taken by Azure, up to 4 secondaries and they can all be readable, as well as redundant data and log files through the log service and long term storage on Azure Standard Storage.
  7. Services like Zone Redundant Configuration, or Availability Zones, provide additional HA services for Azure SQL DB, but not in all regions.  This method of HA is currently not available for Azure Managed Instance.
  8. Geo-replication and auto-failover groups are a central piece to HA in Azure. Geo-replication has multiple replicas and supports read scale out. With this service offering, you do have to update connection strings after a manual failover is initiated.  It is asynchronous but you can use a system stored procedure to force synchronous commit prior to a failover. Geo-replication does not allow for failing over multiple databases together, and is available only for Azure SQL DB. Auto-failover groups can failover multiple databases at once with no changes to the application connection string.  It supports read scale out and is available only in Managed Instance.

Again, let’s do a gut check. How many of you are running important or business critical SQL Servers on stand alone servers and your only real option for recovery if the machine has an issue is to create a new SQL Server and restore all your backups? In this scenario, that important or business critical SQL Server doesn’t have a second copy via log shipping or some sort of replication, mirroring or an Availability Group.

Go on…. think about it.  List your important and critical business servers in a Word doc. Now examine the list and ask yourself what would happen if any of those servers died and couldn’t be recovered.  Would some part of your business stop while you tried to build a new SQL Server and restore? If so, I think you just found an urgent action item to put at the top of your to do list.  As you determine your solution, think about whether some form of SQL Server in Azure can give you what you need.

This is an important feature that is on by default in Azure.  It is available in SQL Server 2019 on-premises, but it is not on by default. It uses a persisted version store to house old data rows.  Those old data rows, and the most recent checkpoint, are used for a faster rollback.  This addresses the problem of a long rollback from a long running transaction that was killed.

Well, that is the summary for Day 3 of the Azure SQL Boot Camp. There is more that was discussed, but remember, I am only trying to give you a summary of key takeaways. Enjoy as you dig into SQL Server in the Azure cloud!