Multiple regions
This page assumes readers are familiar with an existing Terraform Enterprise deployment and architecture.
Terraform Enterprise is a single-region application, even when running the active-active operational mode. This is primarily because PostgreSQL does not operate active-active across regions. However, scaled customers require multiple public or private cloud regions to provide redundancy if the active region is lost, and it is solely for this purpose that we provide the recommendations below. If your business is in this category, the primary recommendation is to operate two concurrent instances of Terraform Enterprise in two different cloud regions (public or private).
This page therefore provides detailed guidance on deploying Terraform Enterprise across two public or private cloud regions. Throughout, the primary region is the main operating region, and the secondary region is a disaster recovery region that provides business continuity if the primary fails.
Although there are three operational modes, this guide focuses on the active-active operational mode, as it covers all necessary components for DR.
Primary considerations
Cost versus risk
Terraform Enterprise provides mission-critical capabilities to the business, so the benefit of duplicating the system's tiers in a secondary region outweighs the cost.
Nonetheless, we recommend calculating the total cost of ownership (TCO) of two Terraform Enterprise instances (one in each of two regions) and ensuring your project management staff are aware of it as part of annual project finance. For comparison, we also recommend calculating the cost to your business if developers were unable to deploy applications. This is particularly pertinent if you have committed spend under an enterprise discount program. With these figures known, it is easier to balance cost against risk.
When calculating TCO, include the cost of geo-redundant copies of the data layer.
Cloud support
We recommend checking public cloud provider documentation to ensure there is support for cross-region replication for each Terraform Enterprise component between the regions you are targeting. Cloud service provisioning for this is still developing as cloud service providers add new regions to their platforms.
Automation
Deploy all infrastructure to the primary and secondary regions used for Terraform Enterprise by automated means only. This provides logical reproducibility, code versioning, and speed of redeployment. We recommend using cloud automation tools to replicate the platform's data layers to the secondary region. We refer to these below.
We strongly recommend using the HVD Modules to deploy Terraform Enterprise as this ensures you mirror the resources in each region. See below for more details specific to your chosen cloud partner. Deploy each region separately and maintain state for each separately.
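As a sketch of this pattern, each region gets its own root configuration and its own state backend, so a failure in one region never blocks state operations for the other. The module source, bucket names, and variable names below are illustrative, not the actual HVD module interface:

```hcl
# primary/main.tf — deployed and state-managed independently of the secondary region
terraform {
  backend "s3" {
    bucket = "tfe-state-primary"             # illustrative state bucket name
    key    = "tfe/primary/terraform.tfstate"
    region = "us-east-1"                     # illustrative primary region
  }
}

module "tfe_primary" {
  source = "../modules/tfe"                  # illustrative path wrapping the HVD module
  region = "us-east-1"
  # ...remaining inputs mirrored with the secondary configuration
}
```

The secondary region repeats this layout with its own backend bucket and region, keeping the two deployments and their state files fully independent.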
We do not recommend automating the end-to-end process of detecting a failure and failing Terraform Enterprise over from one region to the other. We do recommend automated health checking, and optionally automating the failover process itself; however, combining the two works poorly when transient network outages occur. In such scenarios, Terraform Enterprise may fail over unnecessarily, which can have undesirable consequences because cross-region data layer replication is sometimes delayed.
It is normal for senior management or dedicated business continuity staff to declare an outage, so we recommend hooking into this process. In a full-region failover scenario, a large number of applications are likely to require movement, which can cause noticeable network congestion between regions. This is of particular note on-premises, where VMware vMotion may introduce delays as massive numbers of applications migrate at the same time. We recommend liaising with the staff who manage such circumstances so you know where your Terraform Enterprise instances sit in the application priority list and can better manage failover.
Testing
It is of primary importance to test your region failover capability on a regular basis using a cadence in line with business policy. We recommend at least twice annually.
Document both the region failover and failback processes step-by-step in run books. Ensure that team members who did not write the documents use them to perform the failover and failback testing, as this verifies that the documents work and are clear. This also helps train staff.
Deploy a pair of engineering Terraform Enterprise instances, one in each of the same regions that your production instances use. Mirror your engineering instances from production in terms of resources for dev/prod parity and populate them with test data; this makes for a meaningful failover test. Document the experience in the run book.
Maintain independent instances. Each instance in each environment must have its own DNS, storage, and supporting services. Avoid configuring failover environments to point back to production DNS or services.
Perform fault injection testing by using cloud service provider features or third-party tooling to simulate availability zone or regional outages for realistic failover testing. Document the experience.
Configuration security
Ensure the following are securely available to both environments. These may vary depending on your configuration. Include the locations of these data, how they are stored, and how to access them in your run book, and review and maintain these details at regular intervals as normal business-as-usual operations.
- Terraform Enterprise license key.
- Registry credentials for pulling container images.
- TLS keys and certificates.
- The Terraform Enterprise encryption password used to access the VAULT_TOKEN at application start-up.
- Secrets to connect to PostgreSQL, Redis, and object storage.
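One illustrative pattern (not part of the HVD Modules) is to hold these values in a secrets store that is itself replicated to both regions, for example AWS Secrets Manager with cross-region secret replication, so the failover environment reads the same values without manual copying. The secret names below are hypothetical:

```hcl
# Hypothetical secret names; replicate each secret to the secondary region so
# both environments resolve the same values at deploy time.
data "aws_secretsmanager_secret_version" "tfe_license" {
  secret_id = "tfe/license"
}

data "aws_secretsmanager_secret_version" "tfe_encryption_password" {
  secret_id = "tfe/encryption-password"
}
```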
Component-specific guidance
The Terraform Enterprise active-active operational mode application architecture comprises compute, object storage, database and caching layers. We detail each of these below in a multi-region context.
Compute
The Terraform Enterprise active-active operational mode operates a stateless compute layer irrespective of whether it is running on VMs, Nomad or Kubernetes.
- Ensure VM and container images are version-controlled and available in the failover region. We recommend using Packer as the industry standard for machine image creation.
- Do not run the Terraform Enterprise application container(s) in the secondary region while the primary region is online. The main risk is premature promotion of the database read replica (see below) to read-write, which risks database corruption.
- Keep the compute cluster infrastructure deployed but scaled down until failover.
- Keep the Terraform Enterprise application containers installed on the cluster at the same version in both regions and upgrade them during the same change window. Docker- and Podman-based deployments can likewise keep a single host ready to run Terraform Enterprise.
- For performance reasons, co-locate your primary and secondary Terraform Enterprise compute layers in the same regions as the corresponding object store and database components respectively.
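A minimal sketch of the "deployed but scaled down" approach on AWS, assuming an Auto Scaling group for the secondary-region compute layer (names, sizing, and the launch template reference are illustrative):

```hcl
# Secondary-region TFE compute kept deployed but scaled to zero until failover.
resource "aws_autoscaling_group" "tfe_secondary" {
  name             = "tfe-secondary"
  min_size         = 0
  max_size         = 3
  desired_capacity = 0                 # raise during failover to start TFE nodes

  vpc_zone_identifier = var.secondary_private_subnet_ids

  launch_template {
    id      = aws_launch_template.tfe.id
    version = "$Latest"
  }
}
```

During a declared failover, raising `desired_capacity` starts the application nodes against the already-replicated data layer, keeping the secondary region cheap while idle.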
S3-compatible storage
- Use AWS S3 cross-region replication (CRR) on the object store; this means creating one S3 bucket in each region, configured to replicate from the primary to the secondary.
- Use live replication.
- Also use S3 CRR on buckets that store Terraform Enterprise database snapshots and on the `bootstrap` buckets that store the air-gapped installation media, if applicable. Doing this keeps critical data local to the ASG in the respective region.
- Use bidirectional replication to synchronize data between the two regions during failover.
- AWS confers control over data sovereignty only with S3 Same-Region Replication. When using CRR, select regions in line with your regulatory requirements.
- We recommend enabling Amazon S3 server access logging in both regions; the logs let you identify the last object updated in the Terraform Enterprise object store and compare that to the outage time to better ascertain the extent of data lost during the outage. Note that the HVD Modules do not enable this.
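A minimal sketch of the CRR rule above, assuming the two buckets and a replication IAM role are defined elsewhere (all names are illustrative; versioning must already be enabled on both buckets):

```hcl
# Replicate the primary TFE object store to the secondary region.
resource "aws_s3_bucket_replication_configuration" "tfe_objects" {
  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "tfe-object-store-crr"
    status = "Enabled"

    # Empty filter replicates every object in the bucket.
    filter {}

    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket = aws_s3_bucket.secondary.arn
    }
  }
}
```

For bidirectional replication during failover, a mirror-image configuration is applied on the secondary bucket pointing back at the primary.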
PostgreSQL
- If required, you can offer greater than n-1 region redundancy, as both RDS and Aurora can provide read replicas in multiple secondary regions on AWS.
- Use Aurora as the RDS DBaaS solution and enable cross-region read replicas.
- Ensure you are using a version of PostgreSQL which is supported both by Terraform Enterprise and by AWS Aurora read replication.
- The HVD Modules for AWS use the `aws_rds_global_cluster` resource, so we recommend using versions of PostgreSQL from the preceding link which Aurora global database supports (>= version 15.4).
- Aurora global databases replicate cross-region in about 1 second, so there is negligible expectation of database writes not completing on the secondary. If lost writes relate to state file objects, we advise use of `terraform import` commands or the HCL equivalent to restore lost configuration.
- We recommend monitoring Aurora replicas for writer disconnects, and also consider monitoring the write-through cache and logical slots for Aurora PostgreSQL logical replication if scaling your instance (monitor the `AuroraReplicaLag` and `AuroraGlobalDBReplicationLag` metrics).
- Ensure you can access database logs, and add documentation to your run book on accessing replica logs in the secondary region. We recommend following the details in the official AWS guide on this.
- If your business operates in a regulated market and you require database audit logs, we recommend using the PostgreSQL Audit extension (pgAudit) with your Aurora instance, following AWS's provisions for it. Note that the HVD does not enable pgAudit.
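A minimal sketch of the Aurora global database layering described above, assuming provider aliases `aws.primary` and `aws.secondary` and illustrative identifiers (this is not the HVD module's internal configuration):

```hcl
# Global cluster spanning both regions.
resource "aws_rds_global_cluster" "tfe" {
  global_cluster_identifier = "tfe-global"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"     # must be supported by TFE and Aurora global database
  database_name             = "tfe"
}

# Writer cluster in the primary region.
resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "tfe-primary"
  engine                    = aws_rds_global_cluster.tfe.engine
  engine_version            = aws_rds_global_cluster.tfe.engine_version
  global_cluster_identifier = aws_rds_global_cluster.tfe.id
  master_username           = var.db_username
  master_password           = var.db_password
}

# Read-replica cluster in the secondary region; promoted during failover.
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.secondary
  cluster_identifier        = "tfe-secondary"
  engine                    = aws_rds_global_cluster.tfe.engine
  engine_version            = aws_rds_global_cluster.tfe.engine_version
  global_cluster_identifier = aws_rds_global_cluster.tfe.id
  depends_on                = [aws_rds_cluster.primary]
}
```

Note that the secondary cluster carries no master credentials; it inherits data from the global cluster and only becomes writable when promoted.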
We also recommend the following:
- Embed the recommendations in HashiCorp's public document regarding PostgreSQL database failover for Terraform Enterprise into your multi-region failover run book.
- The HVD Modules for Terraform Enterprise support single-region deployments, as this is the primary application architecture from the HashiCorp Product team. From a database perspective, therefore, focus on enabling point-in-time recovery on the database, maximizing retention, choosing the SKU that confers the fastest cross-region update times, and enabling read replicas in the secondary region.
- Practice PITR-based recovery of the database to the secondary region, restoring to an earlier point than the latest. Doing this proves that your run book can direct recovery from both a region failover and a replicated database corruption scenario.
Redis
- Redis does not require replication between regions.
- Ensure you deploy Redis in both regions and make it ready in the failover region before starting Terraform Enterprise. The HVD Module ensures this if used iteratively to deploy instances into both regions.
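On AWS, having Redis ready in the failover region can be sketched as an independent ElastiCache replication group per region (no cross-region link; the provider alias, names, and sizing are illustrative):

```hcl
# Standalone Redis in the secondary region, ready before TFE starts there.
resource "aws_elasticache_replication_group" "tfe_secondary" {
  provider                   = aws.secondary
  replication_group_id       = "tfe-redis-secondary"
  description                = "TFE Redis (secondary region)"
  engine                     = "redis"
  node_type                  = "cache.m5.large"
  num_cache_clusters         = 2
  automatic_failover_enabled = true
}
```

Because Terraform Enterprise treats Redis as an ephemeral cache, the secondary instance starts empty; no data migration is required at failover time.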