Restore Terraform Enterprise
All business verticals require business continuity management (BCM) for production services, and the rise in cyber crime has intensified these requirements. To ensure business continuity, you must implement and test a recovery and restoration plan for your Terraform Enterprise deployment prior to go-live. This plan must include all aspects of data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's recovery time objective (RTO) and to their recovery point objective (RPO).
This guide discusses the best practices, options, and considerations to recover and restore Terraform Enterprise. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring backups.
This guide assumes you are familiar with Terraform Enterprise backup processes and procedures. If not, first refer to the Terraform Enterprise Backup page to create backups for your Terraform Enterprise instance.
Most of this guide is only relevant to single-region, multi-availability zone, external or active-active operational mode deployments, except where otherwise stated. Deployments made using the Replicated Native Scheduler are not covered. A multi-region deployment does not preclude the need to restore data in one of the regions, because any automated data replication service used for active/passive architectural models replicates corruption events just as faithfully as valid data.
We recommend you automate the recommendations listed below to reduce the recovery time.
Definitions and best practices
Business continuity (BC) is a corporate capability: an organization achieves it when it can continue to deliver its products and services at acceptable, predefined levels whenever disruptive incidents occur.
Two factors heavily determine your organization's ability to achieve BC:
- Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
- Recovery Point Objective (RPO) is the maximum tolerable period of data loss after an incident. For example, if an organization has an RPO of one hour, they can tolerate the loss of a maximum of one hour's data before the service disruption.
Based on these definitions, we recommend establishing a valid RTO and RPO for your Terraform Enterprise instance and approaching BC accordingly. These factors determine your backup frequency and other considerations discussed below.
In this guide:
- Point-in-time recovery (PITR) is the ability to restore data to its state at a specific earlier moment in time.
- We express the outage time as `T0`. The number after `T` represents the time relative to the outage time in minutes. For example, `T-240` is equivalent to four hours before the established outage event start time.
- Source refers to the Terraform Enterprise instance that you need to recover. Destination refers to its replacement.
- A public cloud availability zone (AZ) is equivalent to a single VMware-based data center.
- A public cloud multi-availability zone is equivalent to a multi-data center VMware deployment.
- The main AZ is the primary. Any other AZs in the same region are the secondary.
Data management
Select the tab below for high level best practices for your scenario.
Use the Terraform: Solution Design Guide when designing your Terraform Enterprise deployment. This allows you to integrate CSP native tooling, where recommended, to support automated single-region recovery capabilities.
For single-region, multi-availability zone deployments:
- Your public cloud database deployment automatically replicates the database to at least one other availability zone.
- Your object storage service remains online during an availability zone failure.
Record event data
In an outage situation, for all scenarios, we recommend that you record event data. This enables you to:
- Perform root cause analysis.
- Work with HashiCorp Support to reduce losses incurred.
- Identify tasks to prevent similar outages from occurring in the future.
For incident management, record the following information as soon as possible.
- The date and time when the change that led to the outage occurred.
- The date and time of the most recent, available runs and state files.
- The date and time of the most recent, available PostgreSQL database snapshot or backup.
- Whether the Terraform Enterprise application configuration values are safe.
Relationship between database and object store
Consider the database and object storage as one conceptual data layer for Terraform Enterprise even though they are technically separate. The database stores links to objects in the object storage and the application uses these links to manipulate the objects stored. For example, this means that restoring the database to an earlier version would then mean lost links to objects created since, even though those objects would still be present in the object store.
It is important to establish when the incident began. For example, if a workspace run created a VM at 11 AM and the state file was corrupted afterwards, rewinding to a 10 AM state version means the state file no longer reflects the running VM.
The steps to remediate any one issue depend on the details of that issue, so this document provides the main concepts and recommendations. If you are troubleshooting an event and have specific questions that this document does not address, contact HashiCorp Support.
Preparation
The steps to recover a down service depend on the specific problem and on your cloud scenario.
Before you attempt to recover the service, use the node-drain command on the application servers one at a time. This command attempts to stop processing of background jobs on the Terraform Enterprise node in a graceful manner.
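As an illustration, on Flexible Deployment Options releases the drain is exposed through `tfectl`; a minimal sketch, assuming `tfectl` is available on each application node and using placeholder hostnames:

```shell
# Hedged sketch: drain each Terraform Enterprise node in turn before recovery.
# "tfe-node-1" and "tfe-node-2" are placeholder hostnames for your deployment.
for node in tfe-node-1 tfe-node-2; do
  # Stop the node from picking up new background work and let
  # in-flight jobs finish gracefully before you begin recovery.
  ssh "$node" 'tfectl node drain'
done
```

Draining one node at a time keeps the remaining nodes available to finish queued work while each node winds down.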
Application server
As part of BCM, expect and plan for outages. Well-known causes of Terraform Enterprise application server failure include underlying network or physical circuitry failures, operating system data issues, kernel panics, and human error.
Application server recovery and restoration
For both external and active-active operational mode deployments, design your application servers to automatically replace failed worker nodes and to handle availability zone failures. This provides redundancy at both the server and availability zone level.
The configuration must be the same in both the node and its replacement. Terraform Enterprise uses the encryption password (TFE_ENCRYPTION_PASSWORD) to decrypt the internal Vault unseal key and root token. If you do not have the encryption password, you cannot recover your application data.
Refer to the tab(s) below for specific recommendations relevant to your implementation.
- If you encounter a VM scaling group issue, redeploy using an automated deployment capability.
- You do not need to restore failed application servers; replace them using automated means.
- For both `external` and `active-active` operational mode deployments, when the VM automatically restarts, expect `502` and `503` errors in the browser while the service also restarts.
Object store
Terraform Enterprise object storage contains a historical set of state files for all of your workspaces, logs of workspace runs, plan records and slugs, and work lists that persist tasks from plans to applies.
Terraform Enterprise deployments compliant with the reference architectures and Backup Recommended Pattern use either public cloud storage with eleven nines of durability, or an equivalent S3-compatible private cloud storage facility. This means that multiple copies of object store data exist in multiple locations. As a result, during expected outages, like disk or network failure, single-region object storage recovery is automatic. If you need region redundancy, use regional object storage replication. Refer to the Multiple Regions section for more information.
This section covers best practices for object storage restoration from unexpected outages, like human error or corruption events.
Object store content loss and corruption
This section describes the effects of each object storage file type's loss and corruption on Terraform Enterprise. Refer to the next section for object storage file recovery recommendations. The recovery method depends on whether there is loss or corruption and also the cloud provider you are using.
Workspace current state loss and corruption
If the object store does not have a current workspace state file, this presents two problems:
- The missing state file represented either the addition or deletion of running infrastructure, so the recorded state no longer matches what is deployed.
- The next workspace run fails with the following error because the current state file is not readable.
Failed to save state: Error uploading state: Precondition Failed
The serial provided in the state file is not greater than the serial
currently known remotely. This means that there may be an update performed
since your version of that state. Update to the latest state to resolve
conflicts and try again
Error: Failed to persist state to backend.
The error shown above has prevented Terraform from writing the updated state
to the configured backend. To allow for recovery, the state has been written
to the file "errored.tfstate" in the current working directory.
Running "terraform apply" again at this point will create a forked state,
making it harder to recover.
To retry writing this state, use the following command:
terraform state push errored.tfstate
Found errored.tfstate file, dumping contents...
If corruption occurs in the current state file of a workspace, the next run hangs and eventually fails with the following output.
Configuring remote state backend...
Initializing Terraform configuration...
Setup failed: Failed terraform init (exit 1): <nil>
Output:
Initializing the backend...
Successfully configured the backend "remote"! Terraform will automatically
use this backend unless the backend configuration changes.
There was an error connecting to the remote backend. Please do not exit
Terraform to prevent data loss! Trying to restore the connection...
Still trying to restore the connection... (2s elapsed)
Still trying to restore the connection... (5s elapsed)
...
Still trying to restore the connection... (5m21s elapsed)
Still trying to restore the connection... (5m42s elapsed)
Error refreshing state: Error downloading state: 500 Internal Server Error
Workspace non-current state loss and corruption
A workspace may contain non-current states (previous versions of the state file). If the object store does not have a workspace's non-current state, this does not impact workspace runs. However, when you try to access the missing state version in the UI, the following error results.

If corruption occurs in a workspace's non-current state, this does not impact workspace runs. However, when you try to access the corrupted state version in the UI, the following error results.

Logs loss and corruption
Find Terraform Enterprise logs in the following location based on your deployment.
| Operational Mode | Log Path |
|---|---|
| `external` | `/archivist/terraform/logs` |
| `disk` | `/data/aux/archivist/terraform/logs` |
If log files for historical plans and applies are missing or corrupted, this does not impact workspace runs. However, when you try to access those logs in the UI, the following error occurs because Terraform Enterprise is unable to access the log files.
undefined
JSON plans and provider schemas loss and corruption
Terraform Enterprise writes JSON plans and provider schemas during normal operations.
- For Terraform Enterprise version `v202108-1`, only the Sentinel policy evaluation uses these objects.
- From Terraform Enterprise `v202109-1` and later, the UI uses both JSON plan and provider schema objects to render structured plan output if configured in the workspace.
If the JSON plans and provider schemas are missing or corrupted, you are unable to view the structured plan rendering for historical runs.
Slugs loss and corruption
Some slugs contain caches of the most recent workspace run configurations. If the object store does not have those slugs, the following error occurs when you try to start a new plan from an existing workspace.
Setup failed: failed unpacking terraform config: failed to uncompress slug: EOF
However, since Terraform Enterprise generates new slugs for new workspace runs, new workspace runs are not affected.
The loss of slugs that do not contain workspace run configurations does not affect system operation or workspace runs. There is no way to differentiate slug types from object storage content listings.
Bucket loss and corruption
If you lose the object store, the following message results, similar to lost slugs containing workspace run configurations.
Setup failed: failed unpacking terraform config: failed to uncompress slug: EOF
If corruption occurs to all of the objects in the object store, the UI can still operate with cached objects. When the cached objects expire, Terraform Enterprise tries to access the data from the object store. When this happens, the same preceding error occurs.
Recover lost object store content
The primary way to recover lost object store content is to enable object store versioning. Using this, you can recover deleted files from the versioned object storage by un-deleting the files. Refer to the tab(s) below for specific recommendations relevant to your implementation.
A versioned S3 bucket (or S3-compatible equivalent) creates a delete marker for the removed object. To recover this object, follow the steps below.
- Create a list of workspaces or individual objects impacted by the missing object and its respective severity.
- For each affected object, remove the delete marker(s) to restore each item.
- Refresh the UI after you recover all impacted objects to find the recovered objects.
- Optionally, start a new workspace plan to verify that you recovered the objects.
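For S3, the steps above can be sketched with the AWS CLI; a hedged example, assuming a placeholder bucket named `tfe-object-store` and credentials already configured:

```shell
# List the delete markers that currently hide objects in the bucket.
# "tfe-object-store" is a placeholder bucket name.
aws s3api list-object-versions \
  --bucket tfe-object-store \
  --query 'DeleteMarkers[?IsLatest]' \
  --output json

# Removing a delete marker "un-deletes" the object underneath it.
# Substitute the Key and VersionId values reported by the previous command.
aws s3api delete-object \
  --bucket tfe-object-store \
  --key "archivist/terraform/state/example" \
  --version-id "EXAMPLE_DELETE_MARKER_VERSION_ID"
```

Note that `delete-object` with a `--version-id` pointing at a delete marker removes the marker itself, restoring the most recent real version of the object.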
If you delete a workspace from Terraform Enterprise, you cannot recover it by manipulating the object storage alone. You also need to restore the database since links to the object store also need restoring. If the workspace deletion occurs after the last database backup, you cannot recover it.
In addition, when you delete a workspace, Terraform Enterprise deletes the corresponding objects from the object store, which creates delete markers in the S3 bucket. If you remove all delete markers to recover from an accidental object storage deletion, this also recovers objects from deleted workspaces. If you do not restore the database in this situation, the database would still not have links to those restored objects. This may be an acceptable option as opposed to identifying and undeleting only those deleted objects. Consult HashiCorp for specifics which may impact decision-making.
If you are unable to use these recommendations to recover the lost objects, use point-in-time-recovery (PITR) scripting to restore all the objects in the object store to just before the outage time (T0).
Since using PITR scripting effectively takes Terraform Enterprise back in time, restore the database to the same point. Follow the guidance in the Database section below. Ensure that the object store is more recent (younger) than the database; otherwise, the database may contain broken object storage links, which has further adverse impact on the platform.
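As an illustration of PITR scripting, `s3-pit-restore` accepts a bucket, a prefix, and a timestamp; a hedged sketch, assuming a placeholder bucket named `tfe-object-store` and an outage at 10:00 UTC:

```shell
# Hedged sketch: restore every object under the prefix to its version
# just before T0. The bucket name, prefix, destination directory, and
# timestamp are placeholders for your environment.
s3-pit-restore \
  -b tfe-object-store \
  -d ./restored-objects \
  -p archivist/ \
  -t "09-08-2021 09:59:00 +0"
```

Run the tool once per bucket, then copy the restored objects into the destination bucket your replacement instance will use.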
Recover corrupt object store content
You can recover corrupt files from versioned object storage by going back to the last-known "good" version.
In this section, current refers to the latest available state files which represent the existing deployed infrastructure. Last good refers to the most recent previous state file that Terraform Enterprise can process without error.
Terraform Enterprise writes a new state file in the object store for every applied workspace run that requires a change in resources. Terraform Enterprise stores each state file as separate objects. If corruption occurs in a workspace's current state file, all subsequent runs fail. Use the API to recover corrupted state data since you know the failed workspace's name.
The following example demonstrates this process. The table below represents an example workspace with three applies, each adding one VM to the cloud. The current state file represents three VMs running on the cloud. However, it is corrupt.
╭─────────────────────┬────────┬───────────────────────────────────╮
│ ID                  │ SERIAL │ CREATED                           │
├─────────────────────┼────────┼───────────────────────────────────┤
│ sv-2K1mF7GUimf12bEd │ 2      │ 2021-09-08 15:26:09.228 +0000 UTC │ <- corrupt current (+VM3)
│ sv-KxRnWxmpFsYqsNzp │ 1      │ 2021-09-08 15:24:15.584 +0000 UTC │ <- last good (+VM2)
│ sv-gdBm2KUqDuQdzSvQ │ 0      │ 2021-09-08 15:23:00.23 +0000 UTC  │ (+VM1)
╰─────────────────────┴────────┴───────────────────────────────────╯
In order to recover the corrupted state, you need to:
- download the state file for the last good run (serial 1),
- change the serial to `3`, and
- upload it as the new current.
This process assumes that Terraform Enterprise is running and the API is available.
The Version Remote State with the HCP Terraform API tutorial covers how to:
- download the state file
- modify and create the state payload (only update the serial to `3`)
- upload the state file

These instructions are similar for Terraform Enterprise. Update the hostname, organization, and workspace to reflect the workspace you want to recover.
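The local serial edit can be scripted; a minimal sketch, using a stand-in state file in place of the one you download (the API upload step, covered in the tutorial, then takes the new serial and the MD5 checksum):

```shell
# Create a stand-in for the downloaded last-good state file.
# In a real recovery, this is the state you downloaded via the API.
cat > last-good.tfstate <<'EOF'
{
  "version": 4,
  "serial": 1,
  "lineage": "00000000-0000-0000-0000-000000000000"
}
EOF

# Bump the serial so the recovered state becomes the new current (serial 3).
new_serial=3
sed -E "s/\"serial\": *[0-9]+/\"serial\": ${new_serial}/" last-good.tfstate > recovered.tfstate

# The state-versions upload payload requires the MD5 of the new file.
md5=$(md5sum recovered.tfstate | cut -d' ' -f1)
echo "serial=${new_serial} md5=${md5}"
```

The printed serial and checksum are the values you supply when you upload the file as a new state version.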
Once complete, the following occurs.
╭─────────────────────┬────────┬───────────────────────────────────╮
│ ID                  │ SERIAL │ CREATED                           │
├─────────────────────┼────────┼───────────────────────────────────┤
│ sv-xceRBCuExqFdvLEB │ 3      │ 2021-09-08 16:01:39.372 +0000 UTC │ <- new current (+VM2) <--+
│ sv-2K1mF7GUimf12bEd │ 2      │ 2021-09-08 15:26:09.228 +0000 UTC │ <- corrupted (+VM3)      |
│ sv-KxRnWxmpFsYqsNzp │ 1      │ 2021-09-08 15:24:15.584 +0000 UTC │ <- last good (+VM2) -----+
│ sv-gdBm2KUqDuQdzSvQ │ 0      │ 2021-09-08 15:23:00.23 +0000 UTC  │ (+VM1)
╰─────────────────────┴────────┴───────────────────────────────────╯
Since the new current state (serial 3) has the same run ID as the last good one, the state list in the UI shows two entries with the same run ID.
Set the apply method of the workspace to Manual apply.
Then, start a plan so Terraform Enterprise can identify which infrastructure to add to the state file so it reflects the current infrastructure. From the preceding table, the most recently added virtual machine is not represented in the latest state.
Plan: 1 to add, 0 to change, 0 to destroy.
In your local working copy of the repository, add, or modify, the remote backend in the terraform block to update the remote state on Terraform Enterprise. Replace hostname, organization and workspaces.name with your values.
terraform {
  required_version = "~> 1.14.1"

  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }

  backend "remote" {
    hostname     = "tfe.example.com"
    organization = "my-org"

    workspaces {
      name = "workspace1"
    }
  }
}
Then, configure the provider credentials in your local terminal.
Initialize your configuration when you are ready to import the existing infrastructure.
$ terraform init
Then, run terraform import for each listed object from the plan output to update the recovered state file. Obtain the necessary object ID(s) by referring to the relevant cloud resource. The following example command imports an EC2 instance with an ID of i-03a474677481b3380.
terraform import aws_instance.web i-03a474677481b3380
Import successful!
The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
When you are recovering corrupted state files, remember that infrastructure deletion is also possible. Consider the following example:
- At `T-480` (8 hours ago), you have a workspace that deployed a virtual machine.
- At `T-240` (4 hours ago), you back up your database.
- At `T-120` (2 hours ago), you apply your configuration, which deletes the VM.
- At `T-0`, you experience a corruption event.
To recover the last good state, you restore your database to `T-240` and rewind your object store to the same time. At this point, even though your state contains a VM, your configuration and the cloud API reflect no VM. If you generate a plan, it returns no changes.
Reconcile your state file with your configuration by running terraform apply -refresh-only.
$ terraform apply -refresh-only
You have restored and reconciled a corrupted or missing state file.
If the workspace is VCS-backed, when all imports have completed, re-establish your VCS connection to Terraform Enterprise and re-run a final terraform plan. This returns no changes.
Database
This section discusses how to restore and recover your Terraform Enterprise PostgreSQL database. It assumes you followed the Backup page of this HVD, and that your public cloud database provides point-in-time recovery for granular rollback.
Database recovery
For multi-availability zone, single-region database instances and multi-DC private cloud deployments, an outage causes the database to automatically fail over to the secondary AZ/DC, and service resumes. If a workspace run triggers while the database is failing over to a different availability zone, the run may initially hang, resulting in possible 504 or 502 errors.
Error queueing plan
504 Gateway Time-out
If this happens, restart the failed run when the database reconnects.
Database restoration
You need to restore your database when unexpected database failures or corruption occur (due to human error, or hardware and software failures). The time it takes to restore your database after an outage depends on your RPO and on the frequency of your database backups and snapshots.
Consider the following:
- If a database worker node goes down, all database connections drop. The database may report open connections or pool usage until it hits its configured timeout.
- Terraform Enterprise uses write locks to protect the database writes. However, if a connection terminates mid-write, database corruption may occur. This is a general concern related to PostgreSQL database management rather than specific to Terraform Enterprise. This is why it is important to ensure the database backup is sound.
- The restoration method depends on the severity of the database corruption. In some cases, you could just re-index the database; in others, you need to do full restore.
- The database stores the Vault unseal token. It is important because it allows Terraform Enterprise to decrypt the data in the object store. Since this is a small entity by comparison to the size of the database and not updated often, there is a small chance of corrupting the unseal token.
If you need to restore the database, the recommended strategy is:
- Notify your user base of the outage.
- Determine the outage time (`T0`), so you have an anchor point in time to focus restoration efforts.
- Take the application down.
- Create a new database instance by restoring from a backup and applying relevant snapshots as applicable.
- Create a new destination object store and copy the objects from the source object store attached to the broken instance.
- Deploy a fresh Terraform Enterprise instance, configured to use the new, restored database and destination object store.
Reference the tab(s) below for specific recommendations relevant to your implementation. These recommendations focus on single-region restores. Refer to the multi-region section for multi-region considerations.
- Migrate copies of the source S3 bucket objects to the destination bucket. The copies must match the age of the database as closely as possible. If you cannot restore them to the same time as the database restore point, the state must be more recent (nearer to `T0`) than the database.
- Script a solution to manage this transfer. The open source tool `s3-pit-restore` is also useful, although unofficial. You cannot use `s3-pit-restore` to restore all buckets in a single command, but it allows you to specify a precise time.
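When point-in-time precision is not required and the current object versions are sound, `aws s3 sync` can seed the destination bucket; a hedged sketch with placeholder bucket names:

```shell
# Copy the latest object versions from the source to the destination bucket.
# Both bucket names are placeholders. This copies current versions only,
# not historical versions, so use PITR tooling when you need an older point.
aws s3 sync s3://tfe-source-object-store s3://tfe-destination-object-store
```

Because `sync` only transfers the latest version of each object, pair it with versioning-aware tooling whenever the outage involves corruption rather than simple loss.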
- If you are using Amazon RDS continuous backup and PITR, set your recovery point to just before `T0`, so the time-to-recover (TTR) is short.
- In `AWS console > AWS Backup`, select your desired backup, then click the `Restore` button in the upper right. This creates a new database.
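The console steps above can also be performed with the AWS CLI; a hedged sketch, assuming placeholder instance identifiers and an outage at 10:00 UTC:

```shell
# Restore a new RDS instance to just before T0.
# The instance identifiers and timestamp are placeholders.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier tfe-postgres \
  --target-db-instance-identifier tfe-postgres-restored \
  --restore-time 2021-09-08T09:59:00Z

# Wait until the new instance is available before reconfiguring
# Terraform Enterprise to point at it.
aws rds wait db-instance-available \
  --db-instance-identifier tfe-postgres-restored
```

The restore creates a new instance rather than overwriting the source, which preserves the broken instance for later root cause analysis.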
- Create a new database from the backup using the same settings. Note the path and connection details.
- Copy the deployment configuration file (Kubernetes Helm overrides.yml file or compose.yml for Docker Compose) from the source server(s) to the destination instance's deployment configuration. When the autoscaling group boots a new server(s), each one must have the correct configuration and be able to connect to the restored destination database instance and destination object store.
- Create the destination Terraform Enterprise instance using the destination database and object storage.
- Confirm platform health.
- Amend DNS to point the service address to the new load balancer.
- Since the object store contains a previous state in time, each workspace applied between the database restore point and `T0` requires a `terraform import` of any running infrastructure not represented in the state. Follow the example in the Recover corrupt object store content section in this document to do this.
- Notify your user base when you are back online.
Redis cache
This section is only relevant if you are running the active-active operational mode.
There are no explicit backup requirements because Terraform Enterprise uses Redis as a cache. However, ensure your Redis instance has regional availability to protect against zone failure.
Restore a disk mode Terraform Enterprise instance
This section discusses the recovery and restoration process for disk operational mode Terraform Enterprise instances. Refer to the Operational Mode Decision documentation for information on Terraform Enterprise operating modes.
disk mode recovery
Because the same mounted volume stores the Terraform Enterprise object store and database, recovery of disk mode deployments involves recovering the machine or its associated disk using appropriate recovery technologies, depending on the outage.
If you are using tooling such as Dell EMC RecoverPoint for Virtual Machines or HPE Zerto, you already have a systems policy set up to recover the Terraform Enterprise workload automatically in the case of a failure. We strongly recommend online continuous data protection platforms for automated recovery when using single-machine Terraform Enterprise deployments, and we recommend using them to recover the system in an outage situation. The process differs depending on the recovery platform in place. The alternative is to restore the platform manually in the event of an outage.
Mounted disk mode restoration
Restoration of disk operational mode instances involves deploying a replacement VM from a backup. Because business RTO and RPO apply equally to disk mode and external mode deployments, the amount of remediation required after restoration depends on how far back you have to go to restore your data. Refer to the Object Store section in this document for recommendations on importing deployed infrastructure objects when updating workspace state.
When restoring Terraform Enterprise in disk mode, consider the following steps.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Based on the Reference Architecture for VMware, you have data intact in one or both data centers. You could have achieved this by replicating the data layer using a technology such as `lsyncd` in the case of isolated data disks, or by using a shared device from a SAN or NAS.
If you are using isolated data disks, the secondary Terraform Enterprise host must be up to facilitate data replication. However, to avoid corruption, shut down the application on the primary.
Bring up Terraform Enterprise on the secondary host, and confirm it can read the data disk.
If there is no data corruption, the load balancer uses the secondary data center. If you are not using a load balancer, amend DNS instead.
- When the restore is complete, start Docker. If data integrity is established, Terraform Enterprise starts.
- Availability governs which data center you rerun the preceding processes in. If you have brought up Terraform Enterprise in the secondary data center, there is either a natural business process regarding safe return to the primary data center, or, if the data centers are equivalent, no further technical work is necessary.
- Notify your user base when you are back online.