Consul
Disaster recovery overview
This page provides an overview of the resources available for preparing a disaster recovery strategy for your Consul cluster and for recovering a Consul datacenter from an outage.
Overview
Disaster recovery is an important part of business continuity planning. When operating a Consul environment, you need a strategy that accounts for many possible outages. You should also take into account the outages that are unique to networks composed of multiple Consul datacenters.
You should prepare a disaster recovery process for the most severe cases, such as a complete outage of one of your physical datacenters or a cloud provider outage that makes one of the components of your environment temporarily or permanently unavailable. The effect of each outage depends on your configuration, but some strategies can help you mitigate the impact when a disaster occurs.
To prepare for an outage in your Consul deployment, you should learn how to do the following:
- Back up and restore a Consul datacenter
- Plan a disaster preparation strategy specific to your network and application requirements
- Restore a primary datacenter
- Restore a federated datacenter, if you connect multiple Consul datacenters in a single environment
Backup and restore
We recommend backing up Consul's state using the built-in Consul snapshot feature, which is available through the HTTP API's /v1/snapshot endpoint or the CLI's consul snapshot command.
For step-by-step instructions, refer to Backup and restore a Consul datacenter.
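As a rough illustration of the HTTP route, the following Python sketch downloads a snapshot from the /v1/snapshot endpoint using only the standard library. The function names, the default agent address (http://127.0.0.1:8500), and the destination path are assumptions made for this example, not part of Consul. The X-Consul-Token header and the dc query parameter are only needed when ACLs are enabled or when you target a specific datacenter.

```python
import urllib.parse
import urllib.request


def build_snapshot_request(addr="http://127.0.0.1:8500", token=None, datacenter=None):
    """Build a GET request for the /v1/snapshot endpoint.

    addr, token, and datacenter are illustrative parameters; adjust them
    to match your own agent address and ACL setup.
    """
    url = f"{addr}/v1/snapshot"
    if datacenter:
        # Optionally target a specific datacenter through the dc parameter.
        url += "?" + urllib.parse.urlencode({"dc": datacenter})
    request = urllib.request.Request(url, method="GET")
    if token:
        # ACL-enabled datacenters need a token that is allowed to take snapshots.
        request.add_header("X-Consul-Token", token)
    return request


def save_snapshot(dest_path, addr="http://127.0.0.1:8500", token=None,
                  datacenter=None, timeout=60):
    """Stream the snapshot to dest_path and return the number of bytes written."""
    request = build_snapshot_request(addr, token, datacenter)
    written = 0
    # urlopen raises urllib.error.HTTPError on a non-2xx response.
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        while True:
            chunk = response.read(64 * 1024)  # avoid holding the snapshot in memory
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

With a reachable agent, calling save_snapshot("backup.snap") would write the snapshot to the current directory. The consul snapshot command remains the simpler choice for interactive use.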
We also recommend that you take snapshots of Consul clusters on a regular basis and store them on mounted or external storage rather than on local or ephemeral storage. Where possible, we suggest object storage instead of block-based or file-based storage.
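A regular snapshot routine usually needs predictable file names and a retention rule. The sketch below shows one possible approach, assuming snapshots are written to a mounted directory. The naming scheme, the keep count, and the function names are hypothetical choices for this example. An object store would need the provider's own client instead of the file operations shown here.

```python
import time
from pathlib import Path


def snapshot_name(prefix="consul", now=None):
    """Return a timestamped name whose alphabetical order matches its age.

    now is seconds since the epoch; None means the current time (UTC).
    """
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    return f"{prefix}-{stamp}.snap"


def prune_snapshots(directory, keep=7, prefix="consul"):
    """Delete all but the newest `keep` snapshots and return the removed names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Fixed-width UTC timestamps make a plain name sort chronological.
    snapshots = sorted(Path(directory).glob(f"{prefix}-*.snap"))
    stale = snapshots[:-keep]
    for path in stale:
        path.unlink()
    return [path.name for path in stale]
```

A scheduler such as cron could save a snapshot under snapshot_name() and then call prune_snapshots() on the same directory, so that storage use stays bounded.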
Enterprise customers can automate the backup process by using the Automated Backups functionality.
Disaster preparation strategy
We recommend that you regularly test and validate the restore process for critical systems to ensure that everything works as expected.
This testing process is typically defined in a disaster recovery plan (DRP), a formal document in which an organization describes the processes it uses to recover access to systems and data after a catastrophic event. A DRP typically also includes processes for testing and validating the disaster recovery procedures, so that the organization has a defined way to handle these events.
For more information about best practices and considerations for your own deployments, refer to Disaster preparation strategy.
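One way to exercise the restore step during a drill is to push a saved snapshot into a disposable test datacenter through the same /v1/snapshot endpoint, which accepts a PUT request for restores. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the SHA-256 digest is one you recorded yourself when the backup was taken, and the target address must point at a sandbox agent, never at production.

```python
import hashlib
import urllib.request


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_restore_request(snapshot_path, expected_sha256,
                          addr="http://127.0.0.1:8500", token=None):
    """Verify the file against its recorded digest, then build the PUT request."""
    actual = sha256_of(snapshot_path)
    if actual != expected_sha256:
        raise ValueError(f"digest mismatch: expected {expected_sha256}, got {actual}")
    with open(snapshot_path, "rb") as handle:
        payload = handle.read()  # read fully for simplicity in this sketch
    request = urllib.request.Request(f"{addr}/v1/snapshot", data=payload, method="PUT")
    request.add_header("Content-Type", "application/octet-stream")
    if token:
        request.add_header("X-Consul-Token", token)
    return request


def restore_snapshot(snapshot_path, expected_sha256,
                     addr="http://127.0.0.1:8500", token=None, timeout=120):
    """Send the snapshot to the agent at addr and return the HTTP status code."""
    request = build_restore_request(snapshot_path, expected_sha256, addr, token)
    # urlopen raises urllib.error.HTTPError if the agent rejects the restore.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Checking the digest before sending the file catches a truncated or corrupted copy early, which is one of the failure modes a restore drill is meant to surface.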
Restore a primary datacenter
When an outage happens in the primary datacenter, many of the Consul cluster's functions become unavailable. To learn how to restore functionality to a single Consul datacenter, refer to Restore primary datacenter. You should adapt these instructions to your environment as you build your internal operations manual for disaster recovery.
We also provide tutorials on disaster recovery to help you test the commands in a sandbox environment.
Restore a federated (secondary) datacenter
When a secondary datacenter experiences an outage, the outage can still impact your environment even if the primary datacenter remains operational. To learn how to restore a federated Consul datacenter's functionality, refer to Restore federated datacenter. You should adapt these instructions to your environment as you build your internal operations manual for disaster recovery.