Microsoft SQL Availability Groups

WHAT IS IT

Microsoft SQL Server AlwaysOn Availability Groups (AG) is a high-availability and disaster-recovery feature that provides an enterprise-level alternative to database mirroring. It was introduced in SQL Server 2012 to allow for a failover environment for a discrete set of user databases, known as availability databases.

An availability group supports a set of read-write primary databases and one to eight sets of corresponding secondary databases. These sets of databases are referred to as availability replicas. Secondary databases are not directly accessible, but they are kept up-to-date automatically from the primary database. In the case of a failure or downtime, one of these secondary databases can be made primary, minimizing downtime and data loss.

Here are some features of SQL AG:

1. Multiple Secondaries: AlwaysOn supports multiple secondary replicas, providing more options for failover and offloading read operations or backup operations to secondary replicas.

2. Readable Secondary Replicas: Secondaries can be used to offload reporting. This can be done with live data, reducing the load on the primary database and making better use of hardware resources.

3. Automatic and Manual Failover: The automatic failover feature minimizes downtime in the event of a failure. Manual failover can also be performed for planned maintenance or upgrades.

4. Data Compression and Encryption: AlwaysOn supports data compression and encryption, improving security and reducing storage needs.

5. Integrated Management Experience: AlwaysOn groups can be configured and managed using familiar SQL Server tools such as SQL Server Management Studio (SSMS).

6. Automatic Page Repair: This feature helps to maintain data integrity by automatically repairing corrupted database pages.

In a nutshell, AlwaysOn Availability Groups provide high availability, disaster recovery, and load-balancing solutions, helping to ensure that your SQL Server databases are always accessible, secure, and performing optimally.

 

DR SCENARIO

In a disaster recovery (DR) scenario, your primary goal is to ensure data continuity and minimal downtime. An example of a typical design for AlwaysOn Availability Groups (AG) might include multiple replicas of your data: a primary replica at your main data center, a synchronous secondary replica at a nearby data center (for high availability), and an asynchronous secondary replica at a remote data center (for disaster recovery).

Here is how different components play their role:

1. Windows Active Directory (AD): AD is used for authentication and authorization of the SQL Server instances in an AlwaysOn AG. Each of the SQL Server instances participating in the AG must be part of the same Windows Server Failover Clustering (WSFC) cluster, and the WSFC cluster itself must be a part of the same AD domain.

2. Windows AD DNS: In the context of SQL Server AlwaysOn AG, DNS plays a critical role in routing to the currently active (primary) replica. When a failover event happens, the DNS needs to be updated to reflect the IP address of the new primary replica. The client applications connect to the databases using a Virtual Network Name (VNN), which is registered in DNS. The VNN abstracts the failover process so that clients can maintain connections through failovers without any changes in connection strings.

To speed up failover and ensure automatic failover, consider the following:

1. Synchronous-Commit Mode: Using synchronous-commit mode between your primary replica and your high-availability secondary replica can help speed up failover. In this mode, transactions are not considered committed until they are written to both the primary and the secondary replica.

2. Automatic Failover: Up to three replicas (including the primary) can be configured for automatic failover. When the primary goes down, automatic failover will switch to the secondary, which is synchronized with the primary, without data loss.

3. Proper Hardware and Network Configuration: A good hardware and network configuration is crucial. You should ensure you have a stable and fast network connection between your primary and secondary sites, and your server hardware should be up to the task of handling your workloads.

4. Quorum Configuration: Proper configuration of the WSFC quorum mode is critical. This can improve the speed and reliability of failovers.

In theory, you can set up your environment to automatically fail back to the original primary once it comes online, but this is not typically recommended. Automatic failback can lead to unnecessary failovers if your primary is not stable. It’s usually better to monitor your primary to ensure it’s stable, then manually fail back when you’re ready.

 

QUORUM

The quorum in a Windows Server Failover Clustering (WSFC) is a concept that helps prevent split-brain scenarios, which is when network partitioning causes two or more parts of the cluster to operate independently and unaware of each other, possibly leading to conflicting actions.

The quorum model ensures that only a majority (quorum) of the cluster nodes can be operational at any given time. It determines the number of failures that a cluster can sustain while still remaining online. If more failures occur than the quorum configuration can handle, the cluster will cease to function.

Here are the quorum configuration models:

1. Node Majority: Each node that is available and in communication can vote. This is best suited to clusters with an odd number of nodes.

2. Node and Disk Majority: Each node plus a designated disk in the cluster (known as the “disk witness“) can vote. This is typically used in clusters with an even number of nodes, where the disk witness can provide the “tiebreaker” vote.

3. Node and File Share Majority: This model works similarly to Node and Disk Majority but uses a file share as the witness instead of a disk. This is particularly useful for multi-site clusters where shared storage is not available.

4. No Majority: Disk Only: This legacy model is available but not recommended for most clusters. It relies solely on a disk witness and doesn’t count node votes.

In the context of SQL Server AlwaysOn Availability Groups, the quorum plays a significant role in automatic failover. Automatic failover will only occur if the failover can maintain quorum; if quorum is lost, the cluster and any associated AGs will go offline.

By understanding and appropriately configuring the quorum in a WSFC, you can make the cluster and the AGs more resilient and available, which is crucial in high-availability and disaster recovery scenarios.

 

BEST PRACTICES AND RECOMMENDATION FOR QUORUM IN DR A/P

In a typical disaster recovery (DR) scenario where you have an active-passive setup, you would have your primary, active data center and a secondary, passive data center. The primary center hosts the production workload, while the secondary data center is used in case of a disaster at the primary center.

Here are some best practices and recommendations for setting up the quorum in such a scenario:

1. Witness Configuration: For an even number of nodes (two-node active-passive setup), it’s recommended to add a witness to avoid a “split-brain” scenario and ensure there is always a majority to establish quorum. A witness can be a disk or file share (cloud witness from Windows Server 2016 and onwards) hosted on a separate, third location. This witness does not participate in the voting process until it needs to break a tie.

2. Quorum Voting Adjustment: In a multi-site cluster where not all nodes are intended for automatic failover, consider adjusting the quorum vote assignment. For example, you might remove the quorum vote from the DR site nodes so they won’t affect the quorum majority in the production site in case of network partitioning.

3. Node Weighting: In some scenarios, you may want to adjust the node votes. For instance, you can assign more votes to the nodes at the primary site than the secondary site to maintain availability during a network outage between sites.

4. Dynamic Quorum: Enable dynamic quorum, which allows the cluster to dynamically adjust the vote count based on the state of the currently active nodes. This feature, available in Windows Server 2012 and later, can increase the overall availability of the cluster.

5. Regular Testing: Regularly test your failover process to ensure that everything works as expected. This includes confirming that your applications handle failover correctly and that you meet your recovery time objective (RTO) and recovery point objective (RPO).

6. Quorum Monitoring: Monitor the state of the quorum. This can help you understand if you’re at risk of losing quorum and the cluster going offline. Tools like the Failover Cluster Manager can provide useful insights.

Remember, the quorum configuration that works best for you will depend on your specific infrastructure and business requirements. It’s essential to plan and test your configuration to ensure it meets your needs.

 

MULTISUBNET FAILOVER PARAMETERS

These three parameters are vital when dealing with multi-subnet failover in SQL Server AlwaysOn Availability Groups, especially when you have an Active/Passive setup for disaster recovery across geographically separated locations.

1. RegisterAllProvidersIP: This is a cluster resource parameter that controls how client applications connect to the availability group listener. When this parameter is set to 1 (true), the Domain Name System (DNS) will register IP addresses of all replicas, and it’s the client application’s responsibility to sort out which one to use based on the `MultiSubnetFailover` parameter in their connection string. When it’s set to 0 (false), only the IP address of the current primary replica is registered with DNS.

For rapid failover in a multi-subnet environment, it is recommended to set this parameter to 0 (false). This is because DNS propagation delays can prevent applications from quickly connecting to the new primary replica after a failover if all IP addresses are registered.

2. HostRecordTTL: This stands for Time-To-Live, and it’s a parameter that dictates how long the DNS record for the availability group listener is cached by clients and DNS servers. It’s measured in seconds. The default value is 1200 seconds (20 minutes), but for faster failover in a multi-subnet environment, it’s recommended to lower this value. A common recommendation is to set it to 60 seconds or even lower. This allows client applications to detect the new primary replica quicker after a failover.

3. MultiSubnetFailover: This is a parameter that you set in the connection string of client applications connecting to the SQL Server. When set to true, it enables faster detection of and connection to the new primary replica after a failover in a multi-subnet environment. It does this by trying to connect to all IPs registered to the AG listener simultaneously, rather than one at a time. This parameter should be set to true in a multi-subnet environment.

In an Active/Passive disaster recovery setup spanning multiple subnets, these parameters are crucial. Set `RegisterAllProvidersIP` to 0, `HostRecordTTL` to a lower value like 60, and ensure `MultiSubnetFailover=true` in your client applications’ connection strings. This setup will help facilitate a quick failover and reconnection of clients in case of a disaster.

Please remember that any change to these settings should be tested thoroughly in a non-production environment before implementing them in a production environment.

 

WINDOWS AD

Active Directory (AD) replication is a critical process within a Windows Server infrastructure. Replication ensures that changes made to objects in Active Directory, such as creating new users, assigning rights, or changing passwords, are automatically synchronized across all Domain Controllers (DCs) in the network.

In the context of a multi-subnet environment for AlwaysOn Availability Groups, AD replication becomes particularly important, as it ensures the consistency of DNS records across all sites. When a failover occurs, the new IP address of the AlwaysOn Availability Group Listener (which is responsible for directing client applications to the appropriate database replica) is registered in DNS. If the DNS changes aren’t replicated swiftly and accurately across all DCs, clients may fail to connect to the correct database replica.

The challenge arises with AD Intersite Replication, which is replication between Domain Controllers in different physical locations or subnets. By default, Intersite Replication occurs every 180 minutes (3 hours), which may not be adequate in a high-availability scenario.

There are two strategies to address this:

1. Change the Intersite Replication Interval: In environments where changes are frequent, you might want to reduce this interval. This can ensure DNS changes propagate faster to all sites. However, reducing the replication interval increases the bandwidth consumption between sites, which could be a concern if you have limited bandwidth.

2. Force Replication: In urgent scenarios, such as after a manual failover, you could force AD replication to occur immediately. This would ensure that all Domain Controllers are aware of the DNS changes related to the AlwaysOn Availability Group Listener.

In an Active-Passive DR setup with AlwaysOn Availability Groups, properly configured AD replication ensures that all changes to the availability group, especially DNS changes, are propagated swiftly across all sites. This enables clients to connect to the correct SQL Server instance after a failover, ensuring high availability and business continuity.

 

DNS REPLICATION

DNS (Domain Name System) is a critical service that resolves friendly hostnames to IP addresses, making it easier to access resources within a network. When you have Active Directory (AD) in place, DNS plays a vital role in enabling domain controller location, replication, user authentication, and other crucial operations.

In an AD-integrated DNS scenario, DNS data is stored in Active Directory and replicates according to the AD replication topology and schedule. This setup provides inherent fault tolerance and reduced management overhead.

However, with multiple AD sites, you have what is called “intersite replication.” Intersite replication is the replication that occurs between different physical locations or sites within an AD forest.

Here’s how it works:

1. Replication within a site (intrasite): Within a single site, the replication is typically near-instantaneous. Changes made to an AD-integrated DNS zone on one domain controller are quickly replicated to all other domain controllers within the same site.

2. Replication between sites (intersite): For AD replication between sites, a different approach is used. By default, replication occurs at intervals of 180 minutes (3 hours). The idea behind this delay is to conserve bandwidth over potentially slower intersite network connections.

In the context of SQL Server AlwaysOn Availability Groups in a multi-subnet scenario, you must consider how this replication delay could impact operations. If a failover event occurs, the new primary replica’s IP address needs to be updated in DNS so that client connections are correctly routed. With the default intersite replication interval, this update could take up to 3 hours to propagate to other sites.

If this delay is unacceptable for your disaster recovery objectives, you might want to consider the following options:

1. Decrease the Replication Interval: You could decrease the default replication interval to speed up the propagation of DNS changes. However, be aware that this can increase bandwidth usage between sites.

2. Force Replication: If you need an immediate update after a failover or other significant changes, you can manually force replication between sites. This can be done using the Active Directory Sites and Services tool.

3. Change DNS TTL: The DNS TTL (Time-To-Live) value determines how long DNS resolvers cache a DNS response. Lowering the TTL can help ensure clients check back with the DNS server more frequently for updated information.

As always, any changes to your AD and DNS configuration should be carefully planned and tested to ensure they meet your high-availability and disaster recovery requirements without negatively impacting your network.

 

DNS NOT UPDATING

If the DNS record is not getting updated from the SQL Server side, and you are encountering the “DNS operation refused” error, it usually points to a permission issue where the computer account (SQL Server) or Cluster Name Object (CNO) does not have the necessary rights to update the DNS record. This is often due to Active Directory and DNS security settings.

Here are a few options to resolve this issue:

1. Check Permissions: Verify that the SQL Server service account (or the computer account if the SQL Server service is running under a built-in account such as Network Service) and the CNO have the necessary permissions to update DNS records. You can use tools like the DNS Management console and Active Directory Users and Computers to review and modify these permissions.

2. Enable Secure Dynamic Updates: In your DNS zone properties, ensure that “Secure only” dynamic updates are enabled. This allows only authenticated users and computers to make changes to DNS records.

3. Delegate Permissions: In large environments, sometimes it’s better to delegate the DNS update permissions to a specific security group. You can then add the SQL Server service accounts and the CNO to this security group. This way, you have better control and visibility over who can update DNS records.

4. Update DNS Records Manually: As a last resort, and assuming your environment permits it, you could update the DNS records manually. However, this is not recommended as a long-term solution, since it is prone to human error and does not scale well.

5. Review DNS Aging and Scavenging Settings: Ensure that your DNS aging and scavenging settings are configured correctly. Incorrect settings might lead to premature deletion of records.

6. Repair AD Replication: If you’re dealing with a multi-subnet environment, it could be that AD replication is not working properly, causing DNS changes not to propagate to all sites. Use tools like “repadmin” to check the replication status and diagnose any issues.

Remember, any changes to Active Directory and DNS settings can have a significant impact on your network, so always plan and test changes carefully before implementing them in a production environment. It might also be beneficial to involve your network and security teams when investigating this type of issue.

 

PERMISSIONS

When dealing with DNS updates in relation to SQL Server AlwaysOn Availability Groups, the relevant accounts (the service accounts for SQL Server and the Cluster Name Object, or CNO) need specific permissions to update the DNS entries associated with the AlwaysOn Availability Group Listener.

For the SQL Server service accounts and the CNO to be able to update DNS, they need the following permissions on their corresponding DNS records:

1. Read: This permission allows the account to read the properties and contents of the DNS record.

2. Write: This permission allows the account to modify the DNS record.

3. Delete: This permission allows the account to delete the DNS record.

These permissions are necessary for the operation of an AlwaysOn Availability Group, as when a failover event occurs, the DNS entries for the Availability Group Listener need to be updated to point to the new primary replica.

To check and modify these permissions, you can use the DNS Manager console:

1. Open DNS Manager.

2. Expand the DNS Server object and navigate to the Forward Lookup Zone that contains the DNS record.

3. Right-click on the DNS record and select Properties.

4. In the Properties dialog box, go to the Security tab.

5. If the account (SQL Server service account or CNO) is not listed, you can add it by clicking on the “Add” button. If the account is already listed, select it and ensure it has Read, Write, and Delete permissions.

Always ensure that you thoroughly test any changes to permissions in a non-production environment before implementing them in a production environment. If you’re unsure, it’s always a good idea to consult with your network and security teams or a qualified consultant.