Restore Terraform Enterprise
All business verticals require business continuity management (BCM) for production services, and the rise in cyber crime has intensified these requirements. To ensure business continuity, you must implement and test a recovery and restoration plan for your Terraform Enterprise deployment prior to go-live. This plan must include all aspects of data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's recovery time objective (RTO) and to their recovery point objective (RPO).
This guide discusses the best practices, options, and considerations to recover and restore Terraform Enterprise. It also recommends redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring backups.
This guide assumes you are familiar with Terraform Enterprise backup processes and procedures. If not, first refer to the Terraform Enterprise Backup page to create backups for your Terraform Enterprise instance.
Most of this guide is only relevant to single-region, multi-availability zone, external or active-active operational mode deployments, except where otherwise stated. Deployments made using the Replicated Native Scheduler are not covered. A multi-region deployment does not preclude the need to restore data in one of the regions, because any automated data replication service used for active/passive architectural models replicates corruption events just as faithfully as valid data.
We recommend you automate the recommendations listed below to reduce the recovery time.
Definitions and best practices
Business continuity (BC) is a corporate capability: an organization achieves it when it can continue to deliver its products and services at acceptable, predefined levels whenever disruptive incidents occur.
Two factors heavily determine your organization's ability to achieve BC:
- Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
- Recovery Point Objective (RPO) is the maximum tolerable period of data loss after an incident. For example, if an organization has an RPO of one hour, they can tolerate the loss of a maximum of one hour's data before the service disruption.
Based on these definitions, we recommend establishing a valid RTO and RPO for your Terraform Enterprise instance and approaching BC accordingly. These factors determine your backup frequency and other considerations discussed below.
In this guide:
- Point-in-time recovery (PITR) is the ability to restore data to its state at a specific earlier moment in time.
- We express the outage time as `T0`. The number after `T` represents the time relative to the outage time in minutes. For example, `T-240` is equivalent to four hours before the established outage event start time.
- Source refers to the Terraform Enterprise instance that you need to recover. Destination refers to its replacement.
- A public cloud availability zone (AZ) is equivalent to a single VMware-based data center.
- A public cloud multi-availability zone is equivalent to a multi-data center VMware deployment.
- The main AZ is the primary. Any other AZs in the same region are the secondary.
Data management
Select the tab below for high level best practices for your scenario.
Use the Terraform: Solution Design Guide when designing your Terraform Enterprise deployment. This allows you to integrate CSP native tooling, where recommended, to support automated single-region recovery capabilities.
For single-region, multi-availability zone deployments:
- Your public cloud database deployment automatically replicates the database to at least one other availability zone.
- Your object storage service remains online during an availability zone failure.
Record event data
In an outage situation, for all scenarios, we recommend that you record event data. This enables you to:
- Perform root cause analysis.
- Work with HashiCorp Support to reduce losses incurred.
- Identify tasks to prevent similar outages from occurring in the future.
For incident management, record the following information as soon as possible.
- The date and time when the change that led to the outage occurred.
- The date and time of the most recent, available runs and state files.
- The date and time of the most recent, available PostgreSQL database snapshot or backup.
- Whether the Terraform Enterprise application configuration values are safe.
Relationship between database and object store
Consider the database and object storage as one conceptual data layer for Terraform Enterprise even though they are technically separate. The database stores links to objects in the object storage and the application uses these links to manipulate the objects stored. For example, this means that restoring the database to an earlier version would then mean lost links to objects created since, even though those objects would still be present in the object store.
It is important to establish when the incident began. For example, if a workspace run created a VM at 11 AM and the state file was corrupted afterwards, rewinding to a 10 AM state version means the state file no longer reflects the running VM.
The steps to remediate any one issue depend on the details of that issue, so this document provides the main concepts and recommendations. If you are troubleshooting an event and have specific questions that this document does not address, contact HashiCorp Support.
Preparation
The steps to recover a down service depend on the specific problem and on your cloud scenario.
Before you attempt to recover the service, use the node-drain command on the application servers one at a time. This command attempts to stop processing of background jobs on the Terraform Enterprise node in a graceful manner.
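As an illustration, on Flexible Deployment Options releases the drain is exposed through `tfectl`; a minimal sketch, assuming `tfectl` is available on each application node and using placeholder hostnames:

```shell
# Hedged sketch: drain each Terraform Enterprise node in turn before recovery.
# "tfe-node-1" and "tfe-node-2" are placeholder hostnames for your deployment.
for node in tfe-node-1 tfe-node-2; do
  # Stop the node from picking up new background work and let
  # in-flight jobs finish gracefully before you begin recovery.
  ssh "$node" 'tfectl node drain'
done
```

Draining one node at a time keeps the remaining nodes available to finish queued work while each node winds down.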
Application server
As part of BCM, expect and plan for outages. Well-known causes of Terraform Enterprise application server failure include underlying network or physical circuitry failures, operating system data issues, kernel panics, and human error.
Application server recovery and restoration
For both external and active-active operational mode deployments, design your application servers to automatically replace failed worker nodes and to handle availability zone failures. This provides redundancy at both the server and availability zone level.
The configuration must be the same in both the node and its replacement. Terraform Enterprise uses the encryption password (TFE_ENCRYPTION_PASSWORD) to decrypt the internal Vault unseal key and root token. If you do not have the encryption password, you cannot recover your application data.
Refer to the tab(s) below for specific recommendations relevant to your implementation.
- If you encounter a VM scaling group issue, redeploy using an automated deployment capability.
- You do not need to restore failed application servers; replace them using automated means.
- For both `external` and `active-active` operational mode deployments, when the VM automatically restarts, expect `502` and `503` errors in the browser while the service also restarts.
Object store
Terraform Enterprise object storage contains a historical set of state files for all of your workspaces, logs of workspace runs, plan records and slugs, and work lists that persist tasks from plans to applies.
Terraform Enterprise deployments compliant with the reference architectures and Backup Recommended Pattern use either public cloud storage with eleven nines of durability, or an equivalent S3-compatible private cloud storage facility. This means that multiple copies of object store data exist in multiple locations. As a result, during expected outages, like disk or network failure, single-region object storage recovery is automatic. If you need region redundancy, use regional object storage replication. Refer to the Multiple Regions section for more information.
This section covers best practices for object storage restoration from unexpected outages, like human error or corruption events.
Object store content loss and corruption
This section describes the effects of each object storage file type's loss and corruption on Terraform Enterprise. Refer to the next section for object storage file recovery recommendations. The recovery method depends on whether there is loss or corruption and also the cloud provider you are using.
Workspace current state loss and corruption
If the object store does not have a current workspace state file, this presents two problems:
- The missing state file represented either the addition or deletion of running infrastructure, so the recorded state no longer matches what is deployed.
- The next workspace run fails with the following error because the current state file is not readable.
Failed to save state: Error uploading state: Precondition Failed
The serial provided in the state file is not greater than the serial
currently known remotely. This means that there may be an update performed
since your version of that state. Update to the latest state to resolve
conflicts and try again
Error: Failed to persist state to backend.
The error shown above has prevented Terraform from writing the updated state
to the configured backend. To allow for recovery, the state has been written
to the file "errored.tfstate" in the current working directory.
Running "terraform apply" again at this point will create a forked state,
making it harder to recover.
To retry writing this state, use the following command:
terraform state push errored.tfstate
Found errored.tfstate file, dumping contents...
If corruption occurs in the current state file of a workspace, the next run hangs and eventually fails with the following output.
Configuring remote state backend...
Initializing Terraform configuration...
Setup failed: Failed terraform init (exit 1): <nil>
Output:
Initializing the backend...
Successfully configured the backend "remote"! Terraform will automatically
use this backend unless the backend configuration changes.
There was an error connecting to the remote backend. Please do not exit
Terraform to prevent data loss! Trying to restore the connection...
Still trying to restore the connection... (2s elapsed)
Still trying to restore the connection... (5s elapsed)
...
Still trying to restore the connection... (5m21s elapsed)
Still trying to restore the connection... (5m42s elapsed)
Error refreshing state: Error downloading state: 500 Internal Server Error
Workspace non-current state loss and corruption
A workspace may contain non-current states (previous versions of the state file). If the object store does not have a workspace's non-current state, this does not impact workspace runs. However, when you try to access the missing state version in the UI, the following error results.

If corruption occurs in a workspace's non-current state, this does not impact workspace runs. However, when you try to access the corrupted state version in the UI, the following error results.

Logs loss and corruption
Find Terraform Enterprise logs in the following location based on your deployment.
| Operational Mode | Log Path |
|---|---|
| `external` | `/archivist/terraform/logs` |
| `disk` | `/data/aux/archivist/terraform/logs` |
If log files for historical plans and applies are missing or corrupted, this does not impact workspace runs. However, when you try to access those logs in the UI, the following error occurs because Terraform Enterprise is unable to access the log files.
undefined
JSON plans and provider schemas loss and corruption
Terraform Enterprise writes JSON plans and provider schemas during normal operations.
- For Terraform Enterprise version `v202108-1`, only the Sentinel policy evaluation uses these objects.
- From Terraform Enterprise `v202109-1` and later, the UI uses both JSON plan and provider schema objects to render structured plan output if configured in the workspace.
If the JSON plans and provider schemas are missing or corrupted, you are unable to view the structured plan rendering for historical runs.
Slugs loss and corruption
Some slugs contain caches of the most recent workspace run configurations. If the object store does not have those slugs, the following error occurs when you try to start a new plan from an existing workspace.
Setup failed: failed unpacking terraform config: failed to uncompress slug: EOF
However, since Terraform Enterprise generates new slugs for new workspace runs, new workspace runs are not affected.
The loss of slugs that do not contain workspace run configurations does not affect system operation or workspace runs. There is no way to differentiate slug types from object storage content listings.
Bucket loss and corruption
If you lose the object store, the following message results, similar to lost slugs containing workspace run configurations.
Setup failed: failed unpacking terraform config: failed to uncompress slug: EOF
If corruption occurs to all of the objects in the object store, the UI can still operate with cached objects. When the cached objects expire, Terraform Enterprise tries to access the data from the object store. When this happens, the same preceding error occurs.
Recover lost object store content
The primary way to recover lost object store content is to enable object store versioning. Using this, you can recover deleted files from the versioned object storage by un-deleting the files. Refer to the tab(s) below for specific recommendations relevant to your implementation.
A versioned S3 bucket (or S3-compatible equivalent) creates a delete marker for the removed object. To recover this object, follow the steps below.
- Create a list of workspaces or individual objects impacted by the missing object and its respective severity.
- For each affected object, remove the delete marker(s) to restore each item.
- Refresh the UI after you recover all impacted objects to find the recovered objects.
- Optionally, start a new workspace plan to verify that you recovered the objects.
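For S3, the steps above can be sketched with the AWS CLI; a hedged example, assuming a placeholder bucket named `tfe-object-store` and credentials already configured:

```shell
# List the delete markers that currently hide objects in the bucket.
# "tfe-object-store" is a placeholder bucket name.
aws s3api list-object-versions \
  --bucket tfe-object-store \
  --query 'DeleteMarkers[?IsLatest]' \
  --output json

# Removing a delete marker "un-deletes" the object underneath it.
# Substitute the Key and VersionId values reported by the previous command.
aws s3api delete-object \
  --bucket tfe-object-store \
  --key "archivist/terraform/state/example" \
  --version-id "EXAMPLE_DELETE_MARKER_VERSION_ID"
```

Note that `delete-object` with a `--version-id` pointing at a delete marker removes the marker itself, restoring the most recent real version of the object.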
If you delete a workspace from Terraform Enterprise, you cannot recover it by manipulating the object storage alone. You also need to restore the database since links to the object store also need restoring. If the workspace deletion occurs after the last database backup, you cannot recover it.
In addition, when you delete a workspace, Terraform Enterprise deletes the corresponding objects from the object store, which creates delete markers in the S3 bucket. If you remove all delete markers to recover from an accidental object storage deletion, this also recovers objects from deleted workspaces. If you do not restore the database in this situation, the database would still not have links to those restored objects. This may be an acceptable option as opposed to identifying and undeleting only those deleted objects. Consult HashiCorp for specifics which may impact decision-making.
If you are unable to use these recommendations to recover the lost objects, use point-in-time-recovery (PITR) scripting to restore all the objects in the object store to just before the outage time (T0).
Since using PITR scripting effectively takes Terraform Enterprise back in time, restore the database to the same point. Follow the guidance in the Database section below. Ensure that the object store is more recent (younger) than the database; otherwise, the database may contain broken object storage links, which has further adverse impact on the platform.
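As an illustration of PITR scripting, `s3-pit-restore` accepts a bucket, a prefix, and a timestamp; a hedged sketch, assuming a placeholder bucket named `tfe-object-store` and an outage at 10:00 UTC:

```shell
# Hedged sketch: restore every object under the prefix to its version
# just before T0. The bucket name, prefix, destination directory, and
# timestamp are placeholders for your environment.
s3-pit-restore \
  -b tfe-object-store \
  -d ./restored-objects \
  -p archivist/ \
  -t "09-08-2021 09:59:00 +0"
```

Run the tool once per bucket, then copy the restored objects into the destination bucket your replacement instance will use.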
Recover corrupt object store content
You can recover corrupt files from versioned object storage by going back to the last-known "good" version.
In this section, current refers to the latest available state files which represent the existing deployed infrastructure. Last good refers to the most recent previous state file that Terraform Enterprise can process without error.
Terraform Enterprise writes a new state file in the object store for every applied workspace run that requires a change in resources. Terraform Enterprise stores each state file as separate objects. If corruption occurs in a workspace's current state file, all subsequent runs fail. Use the API to recover corrupted state data since you know the failed workspace's name.
The following example demonstrates this process. The table below represents an example workspace with three applies, each adding one VM to the cloud. The current state file represents three VMs running on the cloud. However, it is corrupt.
╭─────────────────────┬────────┬───────────────────────────────────╮
│ ID                  │ SERIAL │ CREATED                           │
├─────────────────────┼────────┼───────────────────────────────────┤
│ sv-2K1mF7GUimf12bEd │ 2      │ 2021-09-08 15:26:09.228 +0000 UTC │ <- corrupt current (+VM3)
│ sv-KxRnWxmpFsYqsNzp │ 1      │ 2021-09-08 15:24:15.584 +0000 UTC │ <- last good (+VM2)
│ sv-gdBm2KUqDuQdzSvQ │ 0      │ 2021-09-08 15:23:00.23 +0000 UTC  │ (+VM1)
╰─────────────────────┴────────┴───────────────────────────────────╯
In order to recover the corrupted state, you need to:
- download the state file for the last good run (serial 1),
- change the serial to `3`, and
- upload it as the new current.
This process assumes that Terraform Enterprise is running and the API is available.
The Version Remote State with the HCP Terraform API tutorial covers how to:
- download the state file
- modify and create the state payload (only update the serial to `3`)
- upload the state file

These instructions are similar for Terraform Enterprise. Update the hostname, organization, and workspace to reflect the workspace you want to recover.
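The local serial edit can be scripted; a minimal sketch, using a stand-in state file in place of the one you download (the API upload step, covered in the tutorial, then takes the new serial and the MD5 checksum):

```shell
# Create a stand-in for the downloaded last-good state file.
# In a real recovery, this is the state you downloaded via the API.
cat > last-good.tfstate <<'EOF'
{
  "version": 4,
  "serial": 1,
  "lineage": "00000000-0000-0000-0000-000000000000"
}
EOF

# Bump the serial so the recovered state becomes the new current (serial 3).
new_serial=3
sed -E "s/\"serial\": *[0-9]+/\"serial\": ${new_serial}/" last-good.tfstate > recovered.tfstate

# The state-versions upload payload requires the MD5 of the new file.
md5=$(md5sum recovered.tfstate | cut -d' ' -f1)
echo "serial=${new_serial} md5=${md5}"
```

The printed serial and checksum are the values you supply when you upload the file as a new state version.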
Once complete, the following occurs.
╭─────────────────────┬────────┬───────────────────────────────────╮
│ ID                  │ SERIAL │ CREATED                           │
├─────────────────────┼────────┼───────────────────────────────────┤
│ sv-xceRBCuExqFdvLEB │ 3      │ 2021-09-08 16:01:39.372 +0000 UTC │ <- new current (+VM2) <--+
│ sv-2K1mF7GUimf12bEd │ 2      │ 2021-09-08 15:26:09.228 +0000 UTC │ <- corrupted (+VM3)      |
│ sv-KxRnWxmpFsYqsNzp │ 1      │ 2021-09-08 15:24:15.584 +0000 UTC │ <- last good (+VM2) -----+
│ sv-gdBm2KUqDuQdzSvQ │ 0      │ 2021-09-08 15:23:00.23 +0000 UTC  │ (+VM1)
╰─────────────────────┴────────┴───────────────────────────────────╯
Since the new current state (serial 3) has the same run ID as the last good one, the state list in the UI shows two entries with the same run ID.
Set the apply method of the workspace to Manual apply.
Then, start a plan so Terraform Enterprise can identify which infrastructure to add to the state file so it reflects the current infrastructure. From the preceding table, the most recently added virtual machine is not represented in the latest state.
Plan: 1 to add, 0 to change, 0 to destroy.
In your local working copy of the repository, add, or modify, the remote backend in the terraform block to update the remote state on Terraform Enterprise. Replace hostname, organization and workspaces.name with your values.
terraform {
  required_version = "~> 1.14.1"

  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }

  backend "remote" {
    hostname     = "tfe.example.com"
    organization = "my-org"

    workspaces {
      name = "workspace1"
    }
  }
}
Then, configure the provider credentials in your local terminal.
Initialize your configuration when you are ready to import the existing infrastructure.
$ terraform init
Then, run terraform import for each listed object from the plan output to update the recovered state file. Obtain the necessary object ID(s) by referring to the relevant cloud resource. The following example command imports an EC2 instance with an ID of i-03a474677481b3380.
terraform import aws_instance.web i-03a474677481b3380
Import successful!
The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
When you are recovering corrupted state files, remember that infrastructure deletion is also possible. Consider the following example:
- At `T-480` (8 hours ago), you have a workspace that deployed a virtual machine.
- At `T-240` (4 hours ago), you back up your database.
- At `T-120` (2 hours ago), you apply your configuration, which deletes the VM.
- At `T-0`, you experience a corruption event.
To recover the last good state, you restore your database to `T-240` and rewind your object store to the same time. At this point, even though your state contains a VM, your configuration and the cloud API reflect no VM. If you generate a plan, it returns no changes.
Reconcile your state file with your configuration by running terraform apply -refresh-only.
$ terraform apply -refresh-only
You have restored and reconciled a corrupted or missing state file.
If the workspace is VCS-backed, when all imports have completed, re-establish your VCS connection to Terraform Enterprise and re-run a final terraform plan. This returns no changes.
Database
This section discusses how to restore and recover your Terraform Enterprise PostgreSQL database. It assumes you followed the Backup page of this HVD, and that your public cloud database provides point-in-time recovery for granular rollback.
Database recovery
For multi-availability zone, single-region database instances and multi-DC private cloud deployments, an outage causes the database to automatically fail over to the secondary AZ/DC, and service resumes. If a workspace run triggers while the database is failing over to a different availability zone, the run may initially hang, resulting in possible 504 or 502 errors.
Error queueing plan
504 Gateway Time-out
If this happens, restart the failed run when the database reconnects.
Database restoration
You need to restore your database when unexpected database failures or corruption occur (due to human error, or hardware and software failures). The time it takes to restore your database after an outage depends on your RPO and on the frequency of your database backups and snapshots.
Consider the following:
- If a database worker node goes down, all database connections drop. The database may report open connections or pool usage until it hits its configured timeout.
- Terraform Enterprise uses write locks to protect the database writes. However, if a connection terminates mid-write, database corruption may occur. This is a general concern related to PostgreSQL database management rather than specific to Terraform Enterprise. This is why it is important to ensure the database backup is sound.
- The restoration method depends on the severity of the database corruption. In some cases, you could just re-index the database; in others, you need to do full restore.
- The database stores the Vault unseal token. It is important because it allows Terraform Enterprise to decrypt the data in the object store. Since this is a small entity by comparison to the size of the database and not updated often, there is a small chance of corrupting the unseal token.
If you need to restore the database, the recommended strategy is:
- Notify your user base of the outage.
- Determine the outage time (`T0`), so you have an anchor point in time to focus restoration efforts.
- Take the application down.
- Create a new database instance by restoring from a backup and applying relevant snapshots as applicable.
- Create a new destination object store and copy the objects from the source object store attached to the broken instance.
- Deploy a fresh Terraform Enterprise instance, configured to use the new, restored database and destination object store.
Reference the tab(s) below for specific recommendations relevant to your implementation. These recommendations focus on single-region restores. Refer to the multi-region section for multi-region considerations.
- Migrate copies of the source S3 bucket objects to the destination bucket. The copies must match the age of the database as closely as possible. If you cannot restore them to the same time as the database restore point, the state must be more recent (nearer to `T0`) than the database.
- Script a solution to manage this transfer. The open source tool `s3-pit-restore` is also useful, although unofficial. You cannot use `s3-pit-restore` to restore all buckets in a single command, but it allows you to specify a precise time.
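When point-in-time precision is not required and the current object versions are sound, `aws s3 sync` can seed the destination bucket; a hedged sketch with placeholder bucket names:

```shell
# Copy the latest object versions from the source to the destination bucket.
# Both bucket names are placeholders. This copies current versions only,
# not historical versions, so use PITR tooling when you need an older point.
aws s3 sync s3://tfe-source-object-store s3://tfe-destination-object-store
```

Because `sync` only transfers the latest version of each object, pair it with versioning-aware tooling whenever the outage involves corruption rather than simple loss.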
- If you are using Amazon RDS continuous backup and PITR, set your recovery point to just before `T0`, so the time-to-recover (TTR) is short.
- In `AWS console > AWS Backup`, select your desired backup, then click the `Restore` button in the upper right. This creates a new database.
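The console steps above can also be performed with the AWS CLI; a hedged sketch, assuming placeholder instance identifiers and an outage at 10:00 UTC:

```shell
# Restore a new RDS instance to just before T0.
# The instance identifiers and timestamp are placeholders.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier tfe-postgres \
  --target-db-instance-identifier tfe-postgres-restored \
  --restore-time 2021-09-08T09:59:00Z

# Wait until the new instance is available before reconfiguring
# Terraform Enterprise to point at it.
aws rds wait db-instance-available \
  --db-instance-identifier tfe-postgres-restored
```

The restore creates a new instance rather than overwriting the source, which preserves the broken instance for later root cause analysis.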
- Create a new database from the backup using the same settings. Note the path and connection details.
- Copy the deployment configuration file (Kubernetes Helm overrides.yml file or compose.yml for Docker Compose) from the source server(s) to the destination instance's deployment configuration. When the autoscaling group boots a new server(s), each one must have the correct configuration and be able to connect to the restored destination database instance and destination object store.
- Create the destination Terraform Enterprise instance using the destination database and object storage.
- Confirm platform health.
- Amend DNS to point the service address to the new load balancer.
- Since the object store contains a previous state in time, each workspace applied between the database restore point and `T0` requires a `terraform import` of any running infrastructure not represented in the state. Follow the example in the Recover corrupt object store content section in this document to do this.
- Notify your user base when you are back online.
Redis cache
This section is only relevant if you are running the active-active operational mode.
There are no explicit backup requirements because Terraform Enterprise uses Redis as a cache. However, ensure your Redis instance has regional availability to protect against zone failure.
Restore a disk mode Terraform Enterprise instance
This section discusses the recovery and restoration process for disk operational mode Terraform Enterprise instances. Refer to the Operational Mode Decision documentation for information on Terraform Enterprise operating modes.
disk mode recovery
Because the same mounted volume stores the Terraform Enterprise object store and database, recovery of disk mode deployments involves recovering the machine or its associated disk using appropriate recovery technologies, depending on the outage.
If you are using tooling such as Dell EMC RecoverPoint for Virtual Machines or HPE Zerto, you already have a systems policy set up to recover the Terraform Enterprise workload automatically in the case of a failure. We strongly recommend online continuous data protection platforms for automated recovery when using single-machine Terraform Enterprise deployments, and we recommend using them to recover the system in an outage situation. The process differs depending on the recovery platform in place. The alternative is to restore the platform manually in the event of an outage.
Mounted disk mode restoration
Restoration of disk operational mode instances involves deploying a replacement VM from a backup. Because business RTO and RPO apply equally to disk mode and external mode deployments, the amount of remediation required after restoration depends on how far back you have to go to restore your data. Refer to the Object Store section in this document for recommendations on importing deployed infrastructure objects when updating workspace state.
When restoring Terraform Enterprise in disk mode, consider the following steps.
- Notify your user base of the outage.
- Determine the outage time (`T0`).
- Based on the Reference Architecture for VMware, you have data intact in one or both data centers. You could have achieved this by replicating the data layer using a technology such as `lsyncd` in the case of isolated data disks, or by using a shared device from a SAN or NAS.
If you are using isolated data disks, the secondary Terraform Enterprise host must be up to facilitate data replication. However, to avoid corruption, shut down the application on the primary.
Bring up Terraform Enterprise on the secondary host, and confirm it can read the data disk.
If there is no data corruption, the load balancer uses the secondary data center. If you are not using a load balancer, amend DNS instead.
- When the restore is complete, start Docker. If data integrity is established, Terraform Enterprise starts.
- Availability governs which data center you rerun the preceding processes in. If you have brought up Terraform Enterprise in the secondary data center, there is either a natural business process regarding safe return to the primary data center, or, if the data centers are equivalent, no further technical work is necessary.
- Notify your user base when you are back online.