Disaster recovery
This page assumes you are familiar with an existing Terraform Enterprise deployment and architecture.
It provides general guidance on preparing Terraform Enterprise for disaster recovery (DR). It does not prescribe specific steps for any single cloud or private datacenter, but the recommendations apply to any officially supported platform.
Terminology
- Recovery Point Objective (RPO) – The maximum acceptable amount of data loss measured in time. For example, an RPO of 10 minutes means that, in the worst case, you could lose up to 10 minutes of data.
- Recovery Time Objective (RTO) – The maximum acceptable amount of time to restore normal operations after a failure. For example, an RTO of 15 minutes means the system must be fully operational within 15 minutes of a failure.
- Standard Operating Procedure (SOP) – The step-by-step instructions that describe how to perform a disaster recovery procedure.
- Regions – A geographically separated location such as a cloud region or datacenter. For example, you might run Terraform Enterprise in us-east4 (Virginia) with a failover in us-west1 (Oregon).
- Environments – Refers to a Terraform Enterprise deployment in a given region. You can use this interchangeably with 'region' (for example, primary or failover).
Best practices
When building disaster recovery processes for Terraform Enterprise, keep the following in mind:
- Plan with RPO and RTO in mind – Follow your organization’s policies or industry-specific requirements. Some workloads may require RPO/RTO as low as 15 minutes.
- Mirror environments – Production and failover deployments must be nearly identical, including network, IAM, and supporting services.
- Automate provisioning – Use HashiCorp Validated Designs (HVD) or Terraform modules to provision both production and DR environments consistently.
- Separate state files – Keep DR state files separate from production Terraform Enterprise so you can apply them during a failover (see the sketch below).
- Independent environments – Each environment must have its own DNS, storage, and supporting services. Avoid configuring failover environments to point back to production DNS or services.
- Fault injection testing – Use cloud-provider features or third-party tools to simulate availability zone or regional outages for realistic failover testing.
- Replicate critical data – Enable replication for Terraform Enterprise dependencies such as Postgres, object storage, and configuration. Monitor replication health and lag.
- Test on a recurring basis – Run DR exercises to validate processes and train staff.
- Document and automate standard operating procedures (SOPs) – Maintain clear, up-to-date failover documentation and automate recovery steps with tools like Ansible or AWS Step Functions.
- Readily available infrastructure – Pre-provision critical resources in the failover region (for example, Redis). Even if replication isn’t needed, having instances ready ensures faster recovery.
Recommended: Public cloud services often provide the simplest and fastest way to replicate storage and databases across regions for DR.
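For example, the following is a minimal sketch of a failover environment's root configuration that uses a shared module and a separate state backend. It assumes an AWS S3 backend and an internal module; the module path, bucket, key, and regions are placeholders for your own values.

```hcl
# environments/failover/main.tf -- hypothetical layout; the module path,
# bucket, key, and regions are placeholders for your own values.

terraform {
  # Each environment keeps its own state backend so the DR configuration can
  # be planned and applied even when the primary region is unavailable.
  backend "s3" {
    bucket = "tfe-dr-state-failover" # placeholder
    key    = "tfe/failover/terraform.tfstate"
    region = "us-west-2"
  }
}

provider "aws" {
  region = "us-west-2" # failover region (placeholder)
}

# The same module provisions both the primary and failover environments so
# that they stay mirrored; only the inputs differ.
module "tfe" {
  source      = "../../modules/tfe" # hypothetical shared module
  environment = "failover"
}
```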
Operational modes and disaster recovery
Terraform Enterprise supports three operational modes. This guide focuses on Active/Active mode, as it covers all necessary components for DR.
Component-specific guidance
S3-compatible storage
- Create buckets in both primary and failover regions.
- Enable replication from the primary bucket to the failover bucket (see the sketch after this list).
- Optionally, configure bidirectional replication to preserve writes made during failover.
- Monitor replication status, latency, and errors.
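As one hedged example on AWS, the sketch below enables primary-to-failover replication with the aws_s3_bucket_replication_configuration resource. It assumes the two buckets, their versioning configuration, and the replication IAM role are defined elsewhere; the names and ARN are placeholders.

```hcl
# Hedged sketch: the buckets, their versioning configuration, and the
# replication IAM role are assumed to be defined elsewhere. Versioning must be
# enabled on both buckets before replication can be configured.
resource "aws_s3_bucket_replication_configuration" "primary_to_failover" {
  bucket = aws_s3_bucket.tfe_primary.id                        # assumed resource
  role   = "arn:aws:iam::123456789012:role/tfe-s3-replication" # placeholder ARN

  rule {
    id     = "tfe-objects"
    status = "Enabled"

    # An empty filter replicates every object in the bucket.
    filter {}

    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket        = aws_s3_bucket.tfe_failover.arn # assumed resource in the failover region
      storage_class = "STANDARD"
    }
  }
}
```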
Postgres
- Replicate Postgres data from primary to failover region.
- For self-hosted Postgres, configure streaming replication to minimize RPO.
- For managed Postgres (Aurora, Cloud SQL, and so on), use built-in replication options (see the sketch after this list).
- Monitor replication lag to ensure fast failover.
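As a hedged example for managed Postgres on AWS, the sketch below creates a cross-region read replica with Amazon RDS. It assumes the primary instance and a provider alias for the failover region are defined elsewhere; KMS, networking, and parameter settings are omitted, and the identifiers are placeholders.

```hcl
# Hedged sketch: the primary instance (aws_db_instance.tfe_primary) and a
# provider alias for the failover region (aws.failover) are assumed to be
# defined elsewhere; KMS, networking, and parameter settings are omitted.
resource "aws_db_instance" "tfe_postgres_failover" {
  provider            = aws.failover
  identifier          = "tfe-postgres-failover"         # placeholder
  instance_class      = "db.r6g.xlarge"                 # placeholder sizing
  replicate_source_db = aws_db_instance.tfe_primary.arn # cross-region replicas reference the source ARN
  skip_final_snapshot = true

  # During failover, promote this replica to a standalone primary (for
  # example, with the RDS promote-read-replica operation).
}
```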
Redis
- Redis does not require replication between regions.
- Deploy Redis in the failover region and ensure it is ready before starting Terraform Enterprise (see the sketch after this list).
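The sketch below shows one way to keep a standby Redis provisioned, using Amazon ElastiCache as an example. The provider alias for the failover region and the subnet and security group wiring are assumed to be defined elsewhere; names and sizing are placeholders.

```hcl
# Hedged sketch: a provider alias for the failover region (aws.failover) and
# subnet/security group wiring are assumed to be defined elsewhere; names and
# sizing are placeholders.
resource "aws_elasticache_replication_group" "tfe_failover" {
  provider             = aws.failover
  replication_group_id = "tfe-redis-failover"
  description          = "Standby Redis for Terraform Enterprise disaster recovery"
  node_type            = "cache.m6g.large"
  num_cache_clusters   = 1
  port                 = 6379

  # No cross-region replication: Terraform Enterprise only needs an empty,
  # healthy Redis to be available before it starts in the failover region.
}
```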
Secrets, keys, licensing, and configuration
Ensure the following are securely available in both environments (the exact items vary depending on your configuration):
- Secrets to connect to Postgres, Redis, and object storage (see the sketch after this list).
- Registry credentials for pulling container images.
- Encryption keys for Terraform Enterprise data.
- Terraform Enterprise license key.
- TLS keys and certificates.
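As one hedged option on AWS, the sketch below replicates Secrets Manager secrets into the failover region so the same credentials resolve in both environments. Secret names and the region are placeholders, and secret values are assumed to be managed outside of Terraform.

```hcl
# Hedged sketch: secret names and the failover region are placeholders, and
# secret values are assumed to be managed outside of Terraform.
resource "aws_secretsmanager_secret" "tfe_postgres" {
  name = "tfe/postgres" # placeholder

  replica {
    region = "us-west-2" # failover region (placeholder)
  }
}

resource "aws_secretsmanager_secret" "tfe_license" {
  name = "tfe/license" # placeholder

  replica {
    region = "us-west-2"
  }
}
```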
Compute
- Ensure VM images and container images are version-controlled and available in the failover region.
- Terraform Enterprise must not be running in the failover region during normal operations.
- For container orchestration (Nomad, Kubernetes, OpenShift), keep cluster infrastructure deployed but scaled down until failover (see the sketch after this list).
- For Docker and Podman deployments, keep a single host ready to run Terraform Enterprise.
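The following sketch shows one way to keep failover compute provisioned but scaled to zero, using an AWS Auto Scaling group as an example. The launch template, subnets, and provider alias for the failover region are assumed to exist elsewhere, and all names are placeholders.

```hcl
# Hedged sketch: the launch template, subnets, and a provider alias for the
# failover region (aws.failover) are assumed to be defined elsewhere.
resource "aws_autoscaling_group" "tfe_failover" {
  provider            = aws.failover
  name                = "tfe-failover"
  vpc_zone_identifier = ["subnet-0123456789abcdef0"] # placeholder subnet IDs

  # Zero instances during normal operations; raise desired_capacity as part of
  # the failover SOP to start Terraform Enterprise compute.
  min_size         = 0
  max_size         = 3
  desired_capacity = 0

  launch_template {
    id      = aws_launch_template.tfe_failover.id # assumed resource
    version = "$Latest"
  }
}
```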
Note: All external services must be healthy before starting Terraform Enterprise. Begin with a single Terraform Enterprise instance, confirm stability, then scale out.
Standard operating procedure (SOP) for failover
Automate failover as much as possible to ensure you execute steps in the correct order.
Recommended failover steps
- Declare outage – Notify end users and stakeholders.
- Start failover – Depending on outage scope:
- If primary compute is accessible, drain Terraform Enterprise workloads and scale down to zero.
- If compute is inaccessible (for example, regional outage), skip this step.
- Failover actions (can run in parallel):
- DNS – Update global DNS (for example, Route 53) to point to the failover load balancer (see the sketch after these steps).
- Postgres – Promote the failover Postgres from standby to primary.
- Terraform Enterprise startup – Launch a single Terraform Enterprise instance in the failover region (Docker, Podman, Nomad, Kubernetes, OpenShift).
- Validate health – Ensure Terraform Enterprise is operational and accessible.
- Scale out – Once stable, add additional Terraform Enterprise instances as needed.
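If you automate the DNS step, one hedged option on AWS is Route 53 failover routing with a health check against the Terraform Enterprise health endpoint, as sketched below. The hosted zone ID, hostnames, and load balancer DNS names are placeholders; if you prefer manual control during an outage, keep a simple record and update it as part of the SOP instead.

```hcl
# Hedged sketch: the hosted zone ID, hostnames, and load balancer DNS names
# are placeholders. Route 53 answers with the failover record only when the
# primary health check fails.
resource "aws_route53_health_check" "tfe_primary" {
  fqdn              = "tfe-primary-lb.example.com" # placeholder
  port              = 443
  type              = "HTTPS"
  resource_path     = "/_health_check"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "tfe_primary" {
  zone_id         = "Z0123456789ABCDEFGHIJ" # placeholder hosted zone
  name            = "tfe.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["tfe-primary-lb.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.tfe_primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "tfe_failover" {
  zone_id        = "Z0123456789ABCDEFGHIJ"
  name           = "tfe.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["tfe-failover-lb.example.com"]
  set_identifier = "failover"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```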
Standard operating procedure (SOP) after failover
When the primary region recovers from its outage:
- S3-Compatible Storage – Ensure the failover region's bucket is replicating data back to the primary region.
- Postgres – If self-hosted, you must reinitialize the primary region's Postgres instance so it receives replicated data from the failover region.
- This may not apply if you use a managed cloud service.
Standard operating procedure (SOP) for failback
When the primary region is ready to return to normal operations:
- Schedule Failback – Notify teams of a maintenance window for failback.
- Drain Workloads – Scale down Terraform Enterprise workloads to zero in the failover region.
- Failback Infrastructure:
- S3-Compatible Storage – Ensure replication is complete from failover to primary and that no additional objects need to be replicated.
- Postgres – Ensure replication is complete from failover to primary, then promote the primary region's Postgres from standby back to primary.
- DNS – Update DNS to point back to the primary region.
- Launch Terraform Enterprise in Primary Region – Start a single instance and verify health.
- Scale Out – Once stable, add additional Terraform Enterprise instances as needed.
- Post-Failback Verification –
- If applicable, reinitialize the failover region's Postgres so it is ready for future replication; the sketch after this list shows one way to monitor its replication lag.
- Confirm S3-compatible storage replication from primary to failover is operational.
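For managed Postgres, one hedged way to verify that the reinitialized failover replica is receiving data is a CloudWatch alarm on its replica lag, sketched below. The instance identifier and notification topic are placeholders.

```hcl
# Hedged sketch: the DB instance identifier and the SNS topic are placeholders.
resource "aws_cloudwatch_metric_alarm" "tfe_postgres_replica_lag" {
  alarm_name          = "tfe-postgres-replica-lag"
  alarm_description   = "Failover Postgres replica lag exceeds 5 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 300 # seconds
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    DBInstanceIdentifier = "tfe-postgres-failover" # placeholder
  }

  alarm_actions = [aws_sns_topic.tfe_dr_alerts.arn] # assumed notification topic
}
```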