Disaster recovery overview

This page provides an overview of the resources available for preparing a disaster recovery strategy for your Consul cluster and for recovering a Consul datacenter from an outage.

Overview

Disaster recovery is an important part of business continuity planning. When operating a Consul environment, you need to prepare a strategy that covers a range of outage scenarios. You should also take into account the unique outage scenarios that can occur when your network is composed of multiple Consul datacenters.

You should prepare a disaster recovery process for the most severe cases, like a complete outage of one of your physical datacenters, or a cloud provider outage that might make one of the components of your environment temporarily or permanently unavailable. Each outage possibility depends on your configurations, but there are strategies that can help you mitigate the impact when disaster occurs.

To prepare for an outage in your Consul deployment, you should learn how to do the following:

Back up and restore a Consul datacenter
Plan a disaster preparation strategy specific to your network and application requirements
Restore a primary datacenter
Restore a federated datacenter, if you connect multiple Consul datacenters in a single environment

Backup and restore

We recommend backing up Consul's state using the built-in Consul snapshot feature, which is available through the HTTP API's /v1/snapshot endpoint, or the CLI's consul snapshot command.

For step-by-step instructions, refer to Backup and restore a Consul datacenter.
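As a rough illustration of the HTTP route, the following sketch streams a snapshot from the `/v1/snapshot` endpoint to a file using only the Python standard library. The agent address `http://127.0.0.1:8500`, the helper names `build_snapshot_request` and `save_snapshot`, and the injectable `opener` parameter are assumptions made for this example; they are not part of Consul. It is a minimal sketch, not a complete backup tool.

```python
import urllib.request


def build_snapshot_request(addr="http://127.0.0.1:8500", token=None):
    """Build a GET request for the /v1/snapshot endpoint (address is an assumption)."""
    req = urllib.request.Request(addr.rstrip("/") + "/v1/snapshot", method="GET")
    if token:
        # On ACL-enabled datacenters the request must carry a token.
        req.add_header("X-Consul-Token", token)
    return req


def save_snapshot(dest_path, addr="http://127.0.0.1:8500", token=None,
                  opener=urllib.request.urlopen):
    """Stream the snapshot body to dest_path and return the number of bytes written."""
    req = build_snapshot_request(addr, token)
    written = 0
    with opener(req) as resp, open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(64 * 1024)  # copy in chunks to avoid buffering the archive
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Calling `save_snapshot("backup.snap")` against a reachable agent corresponds roughly to running `consul snapshot save backup.snap` from the CLI.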

We also recommend that you take snapshots of Consul clusters on a regular basis and store them on mounted or external storage rather than on local or ephemeral storage. We also suggest object storage instead of block or file-based storage.
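One possible way to take snapshots on a schedule is a small script, run from cron or a similar scheduler, that writes timestamped files and prunes old ones. The sketch below assumes `backup_dir` is a mounted external volume; the file naming scheme, the retention count, and the function names are all hypothetical choices for this example. Uploading to object storage is left out because it depends on your provider.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_path(backup_dir, now=None):
    """Return a timestamped snapshot path; names sort chronologically."""
    now = now or datetime.now(timezone.utc)
    return Path(backup_dir) / f"consul-{now:%Y%m%dT%H%M%SZ}.snap"


def take_snapshot(backup_dir, run=subprocess.run):
    """Save a snapshot with the Consul CLI into backup_dir and return its path."""
    path = snapshot_path(backup_dir)
    run(["consul", "snapshot", "save", str(path)], check=True)
    return path


def prune_snapshots(backup_dir, keep=7):
    """Delete all but the newest `keep` snapshots and return the removed paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snaps = sorted(Path(backup_dir).glob("consul-*.snap"))
    old = snaps[:-keep]  # empty when there are `keep` or fewer files
    for path in old:
        path.unlink()
    return old
```

A scheduled job could call `take_snapshot("/mnt/consul-backups")` followed by `prune_snapshots("/mnt/consul-backups")`, where the mount point is a placeholder for your own storage.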

Enterprise customers can automate the backup process by using the Automated Backups functionality.

Disaster preparation strategy

We recommend that you regularly test and validate the restore process for critical systems to ensure that everything works as expected.

This testing process is typically defined in a disaster recovery plan (DRP), a formal document created by an organization that describes the processes used to recover access to systems and data after a catastrophic event. DRPs typically also include a set of processes for testing and validating disaster recovery procedures, and they establish a defined process for handling these events.

DRPs should also include specific steps that are necessary for the environment at hand. For example, consider disaster recovery for a Kubernetes environment with a large number of CustomResourceDefinitions (CRDs). In those environments, saving and restoring Consul snapshots may take longer because of the large number of CRDs being synced by the snapshot. If testing shows that the backup and restore process takes too long and you want an alternative that shortens the procedure, you can choose to first restore basic Consul functionality without CRDs. To do so, add a section to your DRP that includes the following steps:

Manually back up the CRDs
Remove CRDs from the Consul cluster
Back up Consul
Restore Consul
Manually restore the CRDs
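The ordering of these steps can be captured in a small drill script so that a test run always executes them in the same sequence and stops at the first failure. In the sketch below, only the `consul snapshot save` and `consul snapshot restore` commands come from the Consul CLI. The three CRD callables are placeholders that you would supply for your own cluster, and the function names are hypothetical.

```python
import subprocess


def crd_aware_restore_plan(snapshot_file, backup_crds, remove_crds, restore_crds,
                           run=subprocess.run):
    """Return the five steps as ordered (label, action) pairs.

    backup_crds, remove_crds, and restore_crds are user-supplied callables
    (placeholders); how CRDs are exported and removed is environment-specific.
    """
    return [
        ("back up CRDs", backup_crds),
        ("remove CRDs", remove_crds),
        ("back up Consul",
         lambda: run(["consul", "snapshot", "save", snapshot_file], check=True)),
        ("restore Consul",
         lambda: run(["consul", "snapshot", "restore", snapshot_file], check=True)),
        ("restore CRDs", restore_crds),
    ]


def execute(plan, log=print):
    """Run each step in order; an exception from any step aborts the rest."""
    done = []
    for label, action in plan:
        log(f"step: {label}")
        action()
        done.append(label)
    return done
```

Keeping the CRD steps as injected callables is a deliberate choice for this sketch: it avoids guessing at cluster-specific commands while still making the required order explicit and testable.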

For more information about best practices and considerations for your own deployments, refer to Disaster preparation strategy.

Restore a primary datacenter

When an outage happens in the primary datacenter, many of the Consul cluster's functions become unavailable. To learn how to restore functionality to a single Consul datacenter, refer to Restore primary datacenter. You should adapt these instructions to your environment as you build your internal operations manual for disaster recovery.

We also provide tutorials on disaster recovery to help you test the commands in a sandbox environment.

Restore a federated (secondary) datacenter

When a secondary datacenter experiences an outage, it can still impact your environment even though the primary datacenter never stopped operations. To learn how to restore a federated Consul datacenter's functionality, refer to Restore federated datacenter. You should adapt these instructions to your environment as you build your internal operations manual for disaster recovery.