Backup Terraform Enterprise
Introduction
Many business verticals require business continuity management (BCM) for production services. A reliable backup of your Terraform Enterprise deployment is crucial to ensuring business continuity. Your backup must include data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to their Recovery Point Objective (RPO).
This guide covers the best practices, options, and considerations for backing up Terraform Enterprise. It also recommends standard operating procedures for redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring restoration.
This guide references the official Terraform Enterprise operating modes and is only relevant to Terraform Enterprise, not HCP Terraform. Most of this guide applies only to external mode and active-active deployments except where otherwise stated. If you are running a disk deployment, refer to the Backup a disk deployment section below for specific details. If you have a disk deployment in production, we strongly recommend you liaise with your HashiCorp account team about urgent migration to one of the other operational modes because, while supported, disk mode is not recommended for production use.
For region redundancy, read this guide in concert with the Terraform: Solution Design Guide Multiple Regions page.
For recovery and restoration of Terraform Enterprise, refer to the next page of this guide.
Definitions
Business continuity (BC) is a corporate capability: an organization has it when it can continue to deliver its products and services at acceptable, predefined levels whenever disruptive incidents occur. When we refer to environments, these map to engineering/pre-development, development, various test environments (system integration testing, user acceptance testing, and so on), and production. Reserve dedicated engineering instances for the platform team only, for replication, backup/restore, and upgrade testing.
Two factors heavily determine your organization's ability to achieve BC:
- Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
- Recovery Point Objective (RPO) is the maximum tolerable period of data loss during an incident. For example, if an organization has an RPO of one hour, they can tolerate losing at most the hour of data immediately preceding the service disruption.
Based on these definitions, we recommend establishing the RTO and RPO for your Terraform Enterprise instance and approaching BC accordingly. These factors determine your backup frequency and other considerations discussed below.
In this guide:
- Reference to a public cloud availability zone (AZ) is equivalent to a single VMware-based data center.
- Reference to a public cloud multi-availability zone is equivalent to a multi-data center VMware deployment.
- The main AZ is the primary. Any other AZs in the same region are the secondary. The secondary region is a business continuity/failover region only and is not an active secondary location. Consider all availability zones equal for the purposes of illustration.
Best practices
Maintain the backup and restore process
When you deploy Terraform Enterprise:
- Test the backup and restoration process and measure the recovery time to ensure it satisfies your organization's RTO/RPO.
- Document the backup and restoration process.
- Arrange for staff who did not write this document to run a test restore using it. This measure increases confidence in the backup and restore process.
- Use a regular schedule to test the backup and restoration process to ensure the documentation is reliable. This increases familiarity with the backup and restoration process and covers situations where staff leave.
Manage sensitive values
For automated deployments, you must manage several sensitive values. On Kubernetes, refer to this support page for base information. The HVD deployment modules also provide further information specific to the deployment method and target platform; see below for further details. The methods below do not back up these values. We recommend HashiCorp Vault for storage of all pipeline secrets. Do not store any of these sensitive values in version control or allow them to leak into shell histories.
Process audit logs
Audit log processing helps you identify the root cause during a data recovery incident.
Follow the guidance on this Terraform Enterprise logs resource page to aggregate and index logs from the Terraform Enterprise node(s) using a central logging platform such as Splunk, ELK, or a cloud-native solution. Use these logs as a diagnostic tool in the event of an outage, scanning them case-insensitively for ERROR and FATAL messages as part of root cause analysis.
Terraform Enterprise backup API
Terraform Enterprise has a Backup API, but it primarily facilitates migrations from one operational mode to another. Only use the Backup API for such migrations. For day-to-day backup and recovery, use cloud-native tooling on public cloud and the standard approaches for on-premises deployments detailed below.
Initial considerations
The following recommendations improve your security posture, reduce the effort required to maintain an optimal Terraform Enterprise instance, and speed up deployment time during a restoration. Apply all relevant points in the list below for your target platform (VM-based or Kubernetes-based deployments).
- Harden server images using CIS benchmarking.
- Secure Terraform Enterprise deployments on both single-tenant and shared Kubernetes clusters. If using Kubernetes clusters, lock down all APIs properly using automated configuration capabilities such as Ansible.
- If using VM-based deployments of Terraform Enterprise, use single-tenant, immutable instances: use automation to repave instances with patched images rather than patching them in place. This process requires you to maintain the setup configuration in the code used to deploy the system and is out of the scope of this document.
- Remove all unnecessary packages from the operating system.
- Store deployment configuration in a version control system and follow Git best practices for the repositories used to version the code.
Application server
We recommend you automatically replace application server nodes when a node or availability zone fails. Replacing the node provides redundancy at the server and availability zone level. Public clouds and VMware have specific services for this.
Refer to the recommendations below that apply to your cloud deployment. For all Kubernetes-based deployments, deploy a cluster in line with the Terraform: Solution Design Guide recommendations. If you are deploying to EKS, your operational mode is active-active. You can deploy VM Docker-based instances in external mode or active-active mode.
- For `external` deployments, set the `min_size` and `max_size` of the ASG to 1. When a node or availability zone fails, the ASG automatically replaces the node. The time it takes for the ASG to replace the node depends on the time it takes the node to be ready. For example, the system comes up only after the node downloads the installation media from the network.
- For both `external` and VM-based `active-active` deployments, populate the ASG `vpc_zone_identifier` list with at least two subnets. If the region supports additional subnets, we recommend a minimum of three subnets since it provides n-2 AZ redundancy.
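The ASG settings above can be sketched as follows. This is a minimal, illustrative example, not the HVD module's actual configuration; the resource names, subnet references, and launch template are assumptions.

```hcl
# Illustrative ASG for an `external` mode Terraform Enterprise node.
# All names and referenced resources are examples, not HVD module inputs.
resource "aws_autoscaling_group" "tfe" {
  name             = "tfe-node"
  min_size         = 1
  max_size         = 1
  desired_capacity = 1

  # At least two subnets; three provides n-2 AZ redundancy where available.
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]

  launch_template {
    id      = aws_launch_template.tfe.id
    version = "$Latest"
  }

  health_check_type         = "ELB"
  health_check_grace_period = 900 # allow time to download installation media
}
```

The generous grace period reflects the note above: the node is only ready after it has fetched the installation media over the network.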
Object store
We recommend the following to support the object store's business continuity:
- Choose fast storage optimized for your use that scales well and automatically replicates to another zone in the same region. Each public cloud has a well-known option in this space. For private cloud `external` mode deployments, you must use S3-compatible storage.
- Configure MFA delete protection to guard against accidental deletion.
- Be aware that automated replication facilities provided by cloud service providers are eventually consistent, typically within a window of around 10-15 minutes depending on network congestion. This means that in a failover situation, not all object changes may have replicated - see below. It also means, however, that object corruption is replicated automatically.
The following recommendations are specific to AWS deployments.
Given AWS's claim of eleven nines of durability for S3, the most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability.
As a result, S3 Same-Region Replication is not explicitly required for the Terraform Enterprise object store because it does not add sufficient value. Corruption on the primary S3 bucket replicates to the secondary automatically.
We recommend the following to ensure you back up your application data appropriately.
- Refer to AWS's S3 FAQs for information about S3's durability.
- Implement the security best practices for Amazon S3.
- Enable versioning on the bucket used as the object store.
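As a sketch of the object store recommendations above, the bucket and its versioning can be declared as follows. The bucket name is illustrative; MFA delete is noted in a comment because enabling it requires the bucket owner's root credentials via the CLI/API rather than Terraform alone.

```hcl
# Illustrative Terraform Enterprise object store bucket; the name is an example.
resource "aws_s3_bucket" "tfe_object_store" {
  bucket = "example-tfe-object-store"
}

# Versioning protects individual objects against accidental overwrite
# or deletion. MFA delete adds further protection, but must be enabled
# out-of-band by the bucket owner's root credentials.
resource "aws_s3_bucket_versioning" "tfe_object_store" {
  bucket = aws_s3_bucket.tfe_object_store.id

  versioning_configuration {
    status = "Enabled"
  }
}
```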
Database
Configure the database to be in line with Terraform Enterprise's PostgreSQL requirements.
For high availability in a single public cloud region, we recommend deploying the database in a multi-availability zone configuration to add resilience against recoverable outages. For coverage against non-recoverable issues (such as data corruption), take regular snapshots of the database.
In addition to the preceding general recommendations, consider the following AWS-specific recommendations.
- Implement the security best practices for Amazon RDS.
- Use a multi-AZ deployment. AWS creates the primary DB instance in the primary AZ and synchronously replicates the contents to the standby instance in the secondary AZ. Refer to AWS's high availability for RDS documentation for more information.
- Configure AWS Backup for RDS, using continuous backup and point-in-time recovery (PITR).
- If using Aurora as the Terraform Enterprise RDS database, you automatically benefit from PITR, continuous backup to Amazon S3, and replication across three availability zones. The retention period you require is a business decision; HashiCorp recommends the maximum 35-day retention for maximum flexibility (the HVD Module uses this as a default), but perform a cost calculation relating to the data storage.
- Configure database snapshots.
- By default, AWS creates the standby instance DB backup once a day. To achieve an RPO of less than a day, take snapshots at shorter intervals, and facilitate region redundancy by using region-replicated buckets. Snapshots also persist beyond the 35-day PITR window.
- Trigger DB snapshots automatically at required points during the day. The organization's needs and RPO determine when to trigger DB snapshots.
- Continuously monitor how long backups take and compare this to the backup interval implied by your RPO to avoid overlapping backups.
- In both cases, ensure that backups and snapshots are secure and restrict access to only required staff.
- Keep up to date with the AWS RDS documentation.
- Actively and continuously monitor operational health, and configure automatic event notifications.
- Store your snapshots in Amazon S3 in the same region as the platform to reduce recovery time.
- The HVD Module used to deploy Terraform Enterprise to both VM- and EKS-based targets uses a variable `rds_preferred_backup_window` set to a default of `04:00-04:30` local time.
  - Override this in the terraform.tfvars file as needed. Check the value of `rds_preferred_maintenance_window` to ensure no overlap with `rds_preferred_backup_window`.
  - Set a minimum window of 30 minutes.
- The HVD Module also allows setting `is_secondary_region` to deploy to two regions. Refer to the Multiple regions section of the Terraform Enterprise solution design guide for more information on multi-region Terraform Enterprise deployments.
- If your Terraform Enterprise uses Aurora, in the `aws_rds_cluster` resource:
  - Set `availability_zones` to a list of at least three EC2 availability zones. AWS requires a minimum of three availability zones, which is also what HashiCorp recommends to maximize the database layer's recovery capability. The HVD Modules allow for multiple AZs.
- Test that the backups work, document the restore experience in your run book, and edit the backup procedure as necessary.
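The Aurora recommendations above can be sketched as a single `aws_rds_cluster` resource. This is an illustrative fragment under assumed names and values (region, identifiers, windows, and the `var.db_password` variable are not from the HVD Module):

```hcl
# Illustrative Aurora PostgreSQL cluster for Terraform Enterprise;
# identifiers and values are examples, not the HVD Module's defaults.
resource "aws_rds_cluster" "tfe" {
  cluster_identifier = "tfe-db"
  engine             = "aurora-postgresql"

  database_name   = "tfe"
  master_username = "tfe"
  master_password = var.db_password # supply from a secrets manager, never VCS

  # At least three AZs, per the recommendation above.
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  # 35-day retention gives the maximum PITR window; keep the
  # backup window clear of the maintenance window, with at
  # least 30 minutes allocated to the backup window.
  backup_retention_period      = 35
  preferred_backup_window      = "04:00-04:30"
  preferred_maintenance_window = "sun:05:00-sun:06:00"

  storage_encrypted         = true
  deletion_protection       = true
  final_snapshot_identifier = "tfe-db-final"
}
```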
Redis cache
Because the Redis instance serves as an active cache for Terraform Enterprise, you do not need to maintain backups of the caching layer. The core recommendation is to maintain intra-region redundancy in line with other recommendations in the HVDs. Assuming you are running identical service resources in the secondary region, failing over the cache involves switching to the cache cluster in the secondary region, which starts out unpopulated.
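On AWS, intra-region redundancy for the cache can be sketched with an ElastiCache replication group. This is an illustrative fragment with assumed names and sizing, not a prescribed configuration:

```hcl
# Illustrative multi-AZ Redis replication group. No snapshots are
# configured because Terraform Enterprise uses Redis purely as a cache.
resource "aws_elasticache_replication_group" "tfe" {
  replication_group_id = "tfe-redis"
  description          = "Terraform Enterprise cache"
  engine               = "redis"
  node_type            = "cache.m5.large"

  # A primary plus one replica in a second AZ provides
  # intra-region redundancy with automatic failover.
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  snapshot_retention_limit = 0 # cache only: no backups required
}
```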
Backup a disk deployment
The backup approach for a disk operational mode deployment is simpler than for active-active mode because it involves a single machine and possibly its business continuity instance. Also, a disk deployment backup ensures the integrity of the machine and its attached data disk.
We only recommend using disk mode when provisioning on private cloud where your environment does not already support the added complexity of managing an on-premises database, S3-compatible storage, and a Redis instance.
We do not recommend using disk deployments on public cloud since external and active-active modes provide better scalability. For Twelve-Factor compliance, use the same operational mode for both production and non-production.
- The advice in the preceding VMware sections applies to `disk` mode deployments as well.
- Quiesce the database on `disk` instances before taking a database snapshot. Your backup software may or may not do this automatically.
- `disk` mode uses a separate mountable volume (data disk) that can come in many flavors. To ensure data integrity, ensure the mountable volume supports the following capabilities (in this order):
  - Continuous volume replication
  - Use of the same volume mounted on the original instance
  - A backup restored to another volume
- Make copies available in multiple data centers to confer DC redundancy.
- If your primary `disk` node and a backup machine have their own isolated data disks and maintain a mirroring strategy such as `lsyncd`, corruption on the primary volume replicates to the disk attached to the passive node. Maintain regular additional snapshots/backups of the data disk.