Disaster recovery overview

This topic provides an overview of the best practices for preparing a disaster recovery strategy for your Consul cluster.

Introduction

Disaster recovery is an important part of business continuity planning. Your strategy depends on the following considerations:

Recovery point objective (RPO) - The maximum amount of data loss that can be incurred from a disaster, failure, or comparable event. RPO is measured as a unit of time and there is usually a 1-to-1 correlation between RPO and backup frequency.
Recovery time objective (RTO) - The amount of time that passes between application failure and full availability which includes how much time it takes to recover from the disaster or failure. RTO could be relatively short for a customer that already has another datacenter location available for disaster recovery purposes and replication of services and data occurs on a regular basis. You can also leverage leverage automation technologies to recover more quickly from a disaster.

When a Consul cluster loses quorum, or if you lose the server agents completely, restoration requires you to build a new Consul cluster from the latest snapshot. You may also need to reinstall and configure Consul client agents. Automation technologies can greatly reduce the amount of time that it will take to perform these steps. Keep these facts in in mind when deciding between RPO and RTO.

Snapshot recommendations

Our recommended method for backing up Consul state uses the built-in Consul snapshot feature, which is available through the HTTP API or CLI. The tutorial Backup Consul Data and State covers this in further detail.

Warning

Consul snapshots contain extremely sensitive data, such as credentials in recoverable form. Store snapshots on an encrypted medium with sufficiently strict access controls in place.

You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage. Avoid local or ephemeral storage.

You can automate this process using the Consul Enterprise Snapshot agent.

We recommend that you regularly test and validate the restore process for critical systems to ensure that everything works as expected. This testing process is typically defined in a Disaster recovery plan (DRP), which is a formal document created by an organization that contains the processes used to recover access to systems and data after a catastrophic event. DRPs typically also include a set of processes for testing and validating disaster recovery procedures and establish a regular cadence for these events.

ACL considerations

When you restore a snapshot to a new Consul cluster, note the following behavior regarding tokens and the ACL system.

Token persistence enabled	ACL token provided in Consul client config	Consul client requires a re-configuration

Yes	No	No
No	Yes	No
Yes	Yes	No
No	No	Yes

If token persistence was enabled on client agents when the snapshot was captured, then the client agents will resume function after you restore the cluster's server agents.
If the ACL token is specified directly in the client agent configuration when when the snapshot was captured, then the client agents resume function after you restore the cluster's server agents.
If client agents were not configured in a way that persists access to a token, then client agents will not resume function after the restore because they not have permissions to register with the new Consul cluster. This situation applies when the ACL token was set using the API or CLI, or if the ACL token was set in an environment variable. Use the consul acl set-agent-token command or the CONSUL_HTTP_TOKEN variable to update the token on client agents before you restore a cluster with a snapshot.

Service failure recommendations

To architect against outages caused by disasters that impact services registered with Consul, use cluster peering failover with sameness groups. With this setup, Consul can transparently failover requests to an unhealthy service to the same service in a different region and datacenter.

Region failure recommendations

In the event of a total region failure, Consul and your services are likely down. To architect against this situation, deploy Consul and your services in multiple regions with a global failover policy so that Consul reroutes network traffic to the alternate region during a disaster. Deploying identical Consul servers and services across multiple cloud regions satisfies datacenter latency requirements and limits the blast radius during large-scale disasters.

Multi-cluster disaster recovery considerations

The disaster recovery considerations for single Consul cluster deployments apply to Consul multi-cluster deployments as well. However, there a few additional considerations that are specific to Consul multi-cluster deployments using WAN Federation.

When you design and architect your WAN-federated Consul environment, it is important to consider the critical role of the primary datacenter in the multi-cluster deployment. The primary Consul datacenter serves as the source of truth for the following data.

Certificate Authority management, if you use the built-in Consul CA. The root CA resides in the primary Consul datacenter and must sign the certificates for the additional Consul datacenters.
ACLs
Intentions

It is important to consider both placement of the primary Consul datacenter as well as the steps required to recover from a disaster. The recommended approach is reviewed in detail below.

Clientless primary Consul datacenter

Once you establish and federate a primary Consul datacenter; you cannot migrate, change, or move it. An effective pattern for large Consul multi-cluster deployments is to have a dedicated primary Consul datacenter with the sole purpose of serving as a primary. You would only include Consul servers in this primary datacenter and not connect any client nodes or services. This primary Consul datacenter can then be federated normally with other Consul datacenters, which will each contain both servers and clients.

This approach provides two distinct advantages.

It becomes easier to move the primary Consul datacenter. For example, you may want to migrate it from an on premises datacenter to a cloud environment. Typically, this would entail performing a backup and restore of the primary Consul datacenter to the alternate location. Review the Disaster Recovery for the Primary Datacenter tutorial for guidance on restoring a Consul cluster.
In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online. See the table below for more details.

Primary Consul datacenter outage behaviors

The table below assumes that the primary Consul datacenter is offline. It is implied that when referencing 'any Consul datacenter' that the primary Consul datacenter is not included.

Consul Cluster Functionality	Within local Consul datacenter	Within any Consul datacenter	Comments
Read ACLs	✔	✔	Assumes that the default setting of ‘extend cache’ is used for the ACL down policy
Create/Update/Delete ACLs	✖	✖
Read Intentions	✔	✔	Assumes that Intentions were created when primary datacenter was online
Create/Update/Delete Intentions	✖	✖
Create/Read/Update/Delete KV Store items	✔	✔
Create/Read/Update/Delete Services	✔	✔
Certificate Generation & Renewal	✖	✖	Certificates must be signed by the primary Consul datacenter

Additional guidance

For more information on disaster recovery, including detailed instructions on how to backup and restore Consul datacenters, refer to the following resources: