Backup Terraform Enterprise
Introduction
Many business verticals require business continuity management (BCM) for production services. A reliable backup of your Terraform Enterprise deployment is crucial to ensuring business continuity. Your backup must include data held and processed by Terraform Enterprise's components so that operators can restore it within the organization's Recovery Time Objective (RTO) and to their Recovery Point Objective (RPO).
This guide covers the best practices, options, and considerations for backing up Terraform Enterprise. It also recommends standard operating procedures for redundant, self-healing configurations using public and private cloud infrastructure, which add resilience to your deployment and reduce the chances of requiring restoration.
This guide references the official Terraform Enterprise operating modes and is only relevant to Terraform Enterprise, not HCP Terraform. Most of this guide applies only to external mode and active-active deployments except where otherwise stated. If you are running a disk deployment, refer to the Backup a disk deployment section below for specific details. If you have a disk deployment in production, we strongly recommend you liaise with your HashiCorp account team about urgent migration to one of the other operational modes because, while supported, disk mode is not recommended for production use.
For region redundancy, read this guide in concert with the Terraform: Solution Design Guide Multiple Regions page.
For recovery and restoration of Terraform Enterprise, refer to the next page of this guide.
Definitions
Business continuity (BC) is a corporate capability: an organization has it when it can continue to deliver its products and services at acceptable, predefined levels whenever disruptive incidents occur. When we refer to environments, these map to engineering/pre-development, development, various test environments (system integration testing, user acceptance testing, and so on), and production. Reserve dedicated engineering instances for the platform team only, for replication, backup/restore, and upgrade testing.
Two factors heavily determine your organization's ability to achieve BC:
- Recovery Time Objective (RTO) is the target time set for the resumption of product, service, or activity delivery after an incident. For example, if an organization has an RTO of one hour, they aim to have their services running within one hour of the service disruption.
- Recovery Point Objective (RPO) is the maximum tolerable period of data loss during an incident. For example, if an organization has an RPO of one hour, they can tolerate losing at most the hour of data immediately preceding the service disruption.
Based on these definitions, we recommend establishing the RTO and RPO for your Terraform Enterprise instance and approaching BC accordingly. These factors determine your backup frequency and other considerations discussed below.
In this guide:
- Reference to a public cloud availability zone (AZ) is equivalent to a single VMware-based data center.
- Reference to a public cloud multi-availability zone is equivalent to a multi-data center VMware deployment.
- The main AZ is the primary. Any other AZs in the same region are the secondary. The secondary region is a business continuity/failover region only and is not an active secondary location. Consider all availability zones equal for the purposes of illustration.
Best practices
Maintain the backup and restore process
When you deploy Terraform Enterprise:
- Test the backup and restoration process and measure the recovery time to ensure it satisfies your organization's RTO/RPO.
- Document the backup and restoration process.
- Arrange for staff who did not write this document to run a test restore using it. This measure increases confidence in the backup and restore process.
- Use a regular schedule to test the backup and restoration process to ensure the documentation is reliable. This increases familiarity with the backup and restoration process and covers situations where staff leave.
Manage sensitive values
For automated deployments, you must manage several sensitive values. On Kubernetes, refer to this support page for base information. The HVD deployment modules also provide further information specific to the deployment method and target platform; see below for further details. The methods below do not back up these values. We recommend HashiCorp Vault for storage of all pipeline secrets. Do not store any of these sensitive values in version control or allow them to leak into shell histories.
Process audit logs
Audit log processing helps you identify the root cause during a data recovery incident.
Follow the guidance on this Terraform Enterprise logs resource page to aggregate and index logs from the Terraform Enterprise node(s) using a central logging platform such as Splunk, ELK, or a cloud-native solution. Use these logs as a diagnostic tool in the event of an outage, scanning them case-insensitively for ERROR and FATAL messages as part of root cause analysis.
Terraform Enterprise backup API
Terraform Enterprise has a Backup API, but it primarily facilitates migrations from one operational mode to another. Only use the Backup API for such migrations. For day-to-day backup and recovery, use cloud-native tooling on public cloud and the standard approaches for on-premises deployments detailed below.
Initial considerations
The following recommendations improve your security posture, reduce the effort required to maintain an optimal Terraform Enterprise instance, and speed up deployment time during a restoration. Apply all relevant points in the list below for your target platform (VM-based or Kubernetes-based deployments).
- Harden server images using CIS benchmarking.
- Secure Terraform Enterprise deployments on both single-tenant and shared Kubernetes clusters. If using Kubernetes clusters, lock down all APIs properly using automated configuration capabilities such as Ansible.
- If using VM-based deployments of Terraform Enterprise, use single-tenant, immutable instances: use automation to repave instances with patched images rather than patching them in place. This process requires you to maintain the setup configuration in the code used to deploy the system and is out of the scope of this document.
- Remove all unnecessary packages from the operating system.
- Store deployment configuration in a version control system and follow Git best practices for the repositories used to version the code.
Application server
We recommend you automatically replace application server nodes when a node or availability zone fails. Replacing the node provides redundancy at the server and availability zone level. Public clouds and VMware have specific services for this.
Refer to the recommendations below that apply to your cloud deployment. For all Kubernetes-based deployments, deploy a cluster in line with the Terraform: Solution Design Guide recommendations. If you are deploying to EKS, your operational mode is active-active. You can deploy VM Docker-based instances in external mode or active-active mode.
- For `external` deployments, set the `min_size` and `max_size` of the ASG to 1. When a node or availability zone fails, the ASG automatically replaces the node. The time it takes for the ASG to replace the node depends on the time it takes the node to be ready. For example, the system comes up only after the node downloads the installation media from the network.
- For both `external` and VM-based `active-active` deployments, populate the ASG `vpc_zone_identifier` list with at least two subnets. If the region supports additional subnets, we recommend a minimum of three subnets since it provides n-2 AZ redundancy.
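The ASG settings above can be sketched as follows. This is a minimal, illustrative example, not the HVD module's actual configuration; the resource names, subnet references, and launch template are assumptions.

```hcl
# Illustrative ASG for an `external` mode Terraform Enterprise node.
# All names and referenced resources are examples, not HVD module inputs.
resource "aws_autoscaling_group" "tfe" {
  name             = "tfe-node"
  min_size         = 1
  max_size         = 1
  desired_capacity = 1

  # At least two subnets; three provides n-2 AZ redundancy where available.
  vpc_zone_identifier = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]

  launch_template {
    id      = aws_launch_template.tfe.id
    version = "$Latest"
  }

  health_check_type         = "ELB"
  health_check_grace_period = 900 # allow time to download installation media
}
```

The generous grace period reflects the note above: the node is only ready after it has fetched the installation media over the network.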
Object store
We recommend the following to support the object store's business continuity:
- Choose fast storage optimized for your use that scales well and automatically replicates to another zone in the same region. Each public cloud has a well-known option in this space. For private cloud `external` mode deployments, you must use S3-compatible storage.
- Configure MFA delete protection to guard against accidental deletion.
- Be aware that automated replication facilities provided by cloud service providers are eventually consistent, typically within a window of around 10-15 minutes depending on network congestion. This means that in a failover situation, not all object changes may have replicated - see below. It also means, however, that object corruption is replicated automatically.
The following recommendations are specific to AWS deployments.
Given AWS's claim of eleven nines of durability for S3, the most likely problem with the object store is service inaccessibility or corruption through human error rather than loss of durability.
As a result, S3 Same-Region Replication is not explicitly required for the Terraform Enterprise object store because it does not add sufficient value. Corruption on the primary S3 bucket replicates to the secondary automatically.
We recommend the following to ensure you back up your application data appropriately.
- Refer to AWS's S3 FAQs for information about S3's durability.
- Implement the security best practices for Amazon S3.
- Enable versioning on the bucket used as the object store.
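As a sketch of the object store recommendations above, the bucket and its versioning can be declared as follows. The bucket name is illustrative; MFA delete is noted in a comment because enabling it requires the bucket owner's root credentials via the CLI/API rather than Terraform alone.

```hcl
# Illustrative Terraform Enterprise object store bucket; the name is an example.
resource "aws_s3_bucket" "tfe_object_store" {
  bucket = "example-tfe-object-store"
}

# Versioning protects individual objects against accidental overwrite
# or deletion. MFA delete adds further protection, but must be enabled
# out-of-band by the bucket owner's root credentials.
resource "aws_s3_bucket_versioning" "tfe_object_store" {
  bucket = aws_s3_bucket.tfe_object_store.id

  versioning_configuration {
    status = "Enabled"
  }
}
```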
Database
Configure the database to be in line with Terraform Enterprise's PostgreSQL requirements.
For high availability in a single public cloud region, we recommend deploying the database in a multi-availability zone configuration to add resilience against recoverable outages. For coverage against non-recoverable issues (such as data corruption), take regular snapshots of the database.
In addition to the preceding general recommendations, consider the following AWS-specific recommendations.
- Implement the security best practices for Amazon RDS.
- Use a multi-AZ deployment. AWS creates the primary DB instance in the primary AZ and synchronously replicates the contents to the standby instance in the secondary AZ. Refer to AWS's high availability for RDS documentation for more information.
- Configure AWS Backup for RDS, using continuous backup and point-in-time recovery (PITR).
- If using Aurora as the Terraform Enterprise RDS database, you automatically benefit from PITR, continuous backup to Amazon S3, and replication across three availability zones. The retention period you require is a business decision; HashiCorp recommends the maximum 35-day retention for maximum flexibility (the HVD Module uses this as a default), but perform a cost calculation relating to the data storage.
- Configure database snapshots.
- By default, AWS creates the standby instance DB backup once a day. To achieve an RPO of less than a day, take snapshots at shorter intervals, and facilitate region redundancy by using region-replicated buckets. Snapshots also persist beyond the 35-day PITR window.
- Trigger DB snapshots automatically at required points during the day. The organization's needs and RPO determine when to trigger DB snapshots.
- Continuously monitor how long backups take and compare this to the backup interval implied by your RPO to avoid overlapping backups.
- In both cases, ensure that backups and snapshots are secure and restrict access to only required staff.
- Keep up to date with the AWS RDS documentation.
- Actively and continuously monitor operational health, and configure automatic event notifications.
- Store your snapshots in Amazon S3 in the same region as the platform to reduce recovery time.
- The HVD Module used to deploy Terraform Enterprise to both VM- and EKS-based targets uses a variable `rds_preferred_backup_window` set to a default of `04:00-04:30` local time.
  - Override this in the terraform.tfvars file as needed. Check the value of `rds_preferred_maintenance_window` to ensure no overlap with `rds_preferred_backup_window`.
  - Set a minimum window of 30 minutes.
- The HVD Module also allows setting `is_secondary_region` to deploy to two regions. Refer to the Multiple regions section of the Terraform Enterprise solution design guide for more information on multi-region Terraform Enterprise deployments.
- If your Terraform Enterprise uses Aurora, in the `aws_rds_cluster` resource:
  - Set `availability_zones` to a list of at least three EC2 availability zones. AWS requires a minimum of three availability zones, which is also what HashiCorp recommends to maximize the database layer's recovery capability. The HVD Modules allow for multiple AZs.
- Test that the backups work, document the restore experience in your run book, and edit the backup procedure as necessary.
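The Aurora recommendations above can be sketched as a single `aws_rds_cluster` resource. This is an illustrative fragment under assumed names and values (region, identifiers, windows, and the `var.db_password` variable are not from the HVD Module):

```hcl
# Illustrative Aurora PostgreSQL cluster for Terraform Enterprise;
# identifiers and values are examples, not the HVD Module's defaults.
resource "aws_rds_cluster" "tfe" {
  cluster_identifier = "tfe-db"
  engine             = "aurora-postgresql"

  database_name   = "tfe"
  master_username = "tfe"
  master_password = var.db_password # supply from a secrets manager, never VCS

  # At least three AZs, per the recommendation above.
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  # 35-day retention gives the maximum PITR window; keep the
  # backup window clear of the maintenance window, with at
  # least 30 minutes allocated to the backup window.
  backup_retention_period      = 35
  preferred_backup_window      = "04:00-04:30"
  preferred_maintenance_window = "sun:05:00-sun:06:00"

  storage_encrypted         = true
  deletion_protection       = true
  final_snapshot_identifier = "tfe-db-final"
}
```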
Redis cache
Because the Redis instance serves as an active cache for Terraform Enterprise, you do not need to maintain backups of the caching layer. The core recommendation is to maintain intra-region redundancy in line with other recommendations in the HVDs. Assuming you are running identical service resources in the secondary region, failing over the cache involves switching to the cache cluster in the secondary region, which starts out unpopulated.
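On AWS, intra-region redundancy for the cache can be sketched with an ElastiCache replication group. This is an illustrative fragment with assumed names and sizing, not a prescribed configuration:

```hcl
# Illustrative multi-AZ Redis replication group. No snapshots are
# configured because Terraform Enterprise uses Redis purely as a cache.
resource "aws_elasticache_replication_group" "tfe" {
  replication_group_id = "tfe-redis"
  description          = "Terraform Enterprise cache"
  engine               = "redis"
  node_type            = "cache.m5.large"

  # A primary plus one replica in a second AZ provides
  # intra-region redundancy with automatic failover.
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  snapshot_retention_limit = 0 # cache only: no backups required
}
```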
Backup a disk deployment
The backup approach for a disk operational mode deployment is simpler than for active-active mode because it involves a single machine and possibly its business continuity instance. Also, a disk deployment backup ensures the integrity of the machine and its attached data disk.
We only recommend using disk mode when provisioning on private cloud where your environment does not already support the added complexity of managing an on-premises database, S3-compatible storage, and a Redis instance.
We do not recommend using disk deployments on public cloud since external and active-active modes provide better scalability. For Twelve-Factor compliance, use the same operational mode for both production and non-production.
- The advice in the preceding VMware sections applies to `disk` mode deployments as well.
- Quiesce the database on `disk` instances before taking a database snapshot. Your backup software may or may not do this automatically.
- `disk` mode uses a separate mountable volume (data disk) that can come in many flavors. To ensure data integrity, ensure the mountable volume supports the following capabilities (in this order):
  - Continuous volume replication
  - Use of the same volume mounted on the original instance
  - A backup restored to another volume
- Make copies available in multiple data centers to confer DC redundancy.
- If your primary `disk` node and a backup machine have their own isolated data disks and maintain a mirroring strategy such as `lsyncd`, corruption on the primary volume replicates to the disk attached to the passive node. Maintain regular additional snapshots/backups of the data disk.