Multiple regions
This page assumes readers are familiar with an existing Terraform Enterprise deployment and architecture.
Terraform Enterprise is a single-region application, even when running the active-active operational mode. This is primarily because PostgreSQL does not operate active-active across regions. However, scaled customers require multiple public or private cloud regions to provide redundancy if the active region is lost, and it is solely for this purpose that we provide the recommendations below. If your business is in this category, the primary recommendation is to operate two concurrent instances of Terraform Enterprise in two different cloud regions (public or private).
This page therefore provides detailed guidance on deploying Terraform Enterprise across two public or private cloud regions. Throughout, the primary region is the main operating region, and the secondary region is a disaster recovery region that provides business continuity if the primary fails.
Although there are three operational modes, this guide focuses on the active-active operational mode, as it covers all necessary components for DR.
Primary considerations
Cost versus risk
Terraform Enterprise provides mission-critical capabilities to the business, so the benefit of duplicating the system's tiers in a secondary region outweighs the cost.
Nonetheless, we recommend calculating the total cost of ownership (TCO) of two Terraform Enterprise instances (one in each of two regions) and ensuring your project management staff are aware of it as part of annual project finance. For comparison, we also recommend calculating the cost to your business if developers were unable to deploy applications. This is particularly pertinent if you have committed spend under an enterprise discount program. With these figures known, it is easier to balance cost against risk.
When calculating TCO, include the cost of geo-redundant copies of the data layer.
Cloud support
We recommend checking public cloud provider documentation to ensure there is support for cross-region replication for each Terraform Enterprise component between the regions you are targeting. Cloud service provisioning for this is still developing as cloud service providers add new regions to their platforms.
Automation
Deploy all infrastructure to the primary and secondary regions used for Terraform Enterprise by automated means only. This provides logical reproducibility, code versioning, and speed of redeployment. We recommend using cloud automation tools to replicate the platform's data layers to the secondary region. We refer to these below.
We strongly recommend using the HVD Modules to deploy Terraform Enterprise as this ensures you mirror the resources in each region. See below for more details specific to your chosen cloud partner. Deploy each region separately and maintain state for each separately.
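As a sketch of this pattern, each region gets its own root configuration and its own state backend, so a failure in one region never blocks state operations for the other. The module source, bucket names, and variable names below are illustrative, not the actual HVD module interface:

```hcl
# primary/main.tf — deployed and state-managed independently of the secondary region
terraform {
  backend "s3" {
    bucket = "tfe-state-primary"             # illustrative state bucket name
    key    = "tfe/primary/terraform.tfstate"
    region = "us-east-1"                     # illustrative primary region
  }
}

module "tfe_primary" {
  source = "../modules/tfe"                  # illustrative path wrapping the HVD module
  region = "us-east-1"
  # ...remaining inputs mirrored with the secondary configuration
}
```

The secondary region repeats this layout with its own backend bucket and region, keeping the two deployments and their state files fully independent.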
We do not recommend automating the end-to-end process of detecting a failure and failing Terraform Enterprise over from one region to the other. We do recommend automated health checking, and optionally automating the failover process itself; however, combining the two works poorly when transient network outages occur. In such scenarios, Terraform Enterprise may fail over unnecessarily, which can have undesirable consequences because cross-region data layer replication is sometimes delayed.
It is normal for senior management or dedicated business continuity staff to declare an outage, so we recommend hooking into this process. In a full-region failover scenario, a large number of applications are likely to require movement, which can cause noticeable network congestion between regions. This is of particular note on-premises, where VMware vMotion may introduce delays as massive numbers of applications migrate at the same time. We recommend liaising with the staff who manage such circumstances so you know where your Terraform Enterprise instances sit in the application priority list and can better manage failover.
Testing
It is of primary importance to test your region failover capability on a regular basis using a cadence in line with business policy. We recommend at least twice annually.
Document both the region failover and failback processes step-by-step in run books. Ensure that team members who did not write the documents use them to perform the failover and failback testing, as this verifies that the documents work and are clear. This also helps train staff.
Deploy a pair of engineering Terraform Enterprise instances, one in each of the same regions that your production instances use. Mirror your engineering instances from production in terms of resources for dev/prod parity and populate them with test data; this makes for a meaningful failover test. Document the experience in the run book.
Maintain independent instances. Each instance in each environment must have its own DNS, storage, and supporting services. Avoid configuring failover environments to point back to production DNS or services.
Perform fault injection testing by using cloud service provider features or third-party tooling to simulate availability zone or regional outages for realistic failover testing. Document the experience.
Configuration security
Ensure the following are securely available to both environments. These may vary depending on your configuration. Include the locations of these data, how they are stored, and how to access them in your run book, and review and maintain these details at regular intervals as normal business-as-usual operations.
- Terraform Enterprise license key.
- Registry credentials for pulling container images.
- TLS keys and certificates.
- The Terraform Enterprise encryption password used to access the VAULT_TOKEN at application start-up.
- Secrets to connect to PostgreSQL, Redis, and object storage.
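One illustrative pattern (not part of the HVD Modules) is to hold these values in a secrets store that is itself replicated to both regions, for example AWS Secrets Manager with cross-region secret replication, so the failover environment reads the same values without manual copying. The secret names below are hypothetical:

```hcl
# Hypothetical secret names; replicate each secret to the secondary region so
# both environments resolve the same values at deploy time.
data "aws_secretsmanager_secret_version" "tfe_license" {
  secret_id = "tfe/license"
}

data "aws_secretsmanager_secret_version" "tfe_encryption_password" {
  secret_id = "tfe/encryption-password"
}
```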
Component-specific guidance
The Terraform Enterprise active-active operational mode application architecture comprises compute, object storage, database and caching layers. We detail each of these below in a multi-region context.
Compute
The Terraform Enterprise active-active operational mode operates a stateless compute layer irrespective of whether it is running on VMs, Nomad or Kubernetes.
- Ensure VM and container images are version-controlled and available in the failover region. We recommend using Packer as the industry standard for machine image creation.
- Do not run the Terraform Enterprise application container(s) in the secondary region while the primary region is online. The main risk is premature promotion of the database read replica (see below) to read-write, which risks database corruption.
- Keep the compute cluster infrastructure deployed but scaled down until failover.
- Keep the Terraform Enterprise application containers installed on the cluster at the same version in both regions and upgrade them during the same change window. Docker- and Podman-based deployments can likewise keep a single host ready to run Terraform Enterprise.
- For performance reasons, co-locate your primary and secondary Terraform Enterprise compute layers in the same regions as the corresponding object store and database components respectively.
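A minimal sketch of the "deployed but scaled down" approach on AWS, assuming an Auto Scaling group for the secondary-region compute layer (names, sizing, and the launch template reference are illustrative):

```hcl
# Secondary-region TFE compute kept deployed but scaled to zero until failover.
resource "aws_autoscaling_group" "tfe_secondary" {
  name             = "tfe-secondary"
  min_size         = 0
  max_size         = 3
  desired_capacity = 0                 # raise during failover to start TFE nodes

  vpc_zone_identifier = var.secondary_private_subnet_ids

  launch_template {
    id      = aws_launch_template.tfe.id
    version = "$Latest"
  }
}
```

During a declared failover, raising `desired_capacity` starts the application nodes against the already-replicated data layer, keeping the secondary region cheap while idle.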
S3-compatible storage
- Use AWS S3 cross-region replication (CRR) on the object store; this means creating one S3 bucket in each region, configured to replicate from the primary to the secondary.
- Use live replication.
- Also use S3 CRR on buckets that store Terraform Enterprise database snapshots and on the `bootstrap` buckets that store the air-gapped installation media, if applicable. Doing this keeps critical data local to the ASG in the respective region.
- Use bidirectional replication to synchronize data between the two regions during failover.
- AWS confers control over data sovereignty only with S3 Same-Region Replication. When using CRR, select regions in line with your regulatory requirements.
- We recommend enabling Amazon S3 server access logging in both regions; the logs let you identify the last object updated in the Terraform Enterprise object store and compare that to the outage time to better ascertain the extent of data lost during the outage. Note that the HVD Modules do not enable this.
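A minimal sketch of the CRR rule above, assuming the two buckets and a replication IAM role are defined elsewhere (all names are illustrative; versioning must already be enabled on both buckets):

```hcl
# Replicate the primary TFE object store to the secondary region.
resource "aws_s3_bucket_replication_configuration" "tfe_objects" {
  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "tfe-object-store-crr"
    status = "Enabled"

    # Empty filter replicates every object in the bucket.
    filter {}

    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket = aws_s3_bucket.secondary.arn
    }
  }
}
```

For bidirectional replication during failover, a mirror-image configuration is applied on the secondary bucket pointing back at the primary.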
PostgreSQL
- If required, you can offer greater than n-1 region redundancy, as both RDS and Aurora can provide read replicas in multiple secondary regions on AWS.
- Use Aurora as the RDS DBaaS solution and enable cross-region read replicas.
- Ensure you are using a version of PostgreSQL which is supported both by Terraform Enterprise and by AWS Aurora read replication.
- The HVD Modules for AWS use the `aws_rds_global_cluster` resource, so we recommend using versions of PostgreSQL from the preceding link which Aurora global database supports (>= version 15.4).
- Aurora global databases replicate cross-region in about 1 second, so there is negligible expectation of database writes not completing on the secondary. If lost writes relate to state file objects, we advise use of `terraform import` commands or the HCL equivalent to restore lost configuration.
- We recommend monitoring Aurora replicas for writer disconnects, and also consider monitoring the write-through cache and logical slots for Aurora PostgreSQL logical replication if scaling your instance (monitor the `AuroraReplicaLag` and `AuroraGlobalDBReplicationLag` metrics).
- Ensure you can access database logs, and add documentation to your run book on accessing replica logs in the secondary region. We recommend following the details in the official AWS guide on this.
- If your business operates in a regulated market and you require database audit logs, we recommend using the PostgreSQL Audit extension (pgAudit) with your Aurora instance, following AWS's provisions for it. Note that the HVD does not enable pgAudit.
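A minimal sketch of the Aurora global database layering described above, assuming provider aliases `aws.primary` and `aws.secondary` and illustrative identifiers (this is not the HVD module's internal configuration):

```hcl
# Global cluster spanning both regions.
resource "aws_rds_global_cluster" "tfe" {
  global_cluster_identifier = "tfe-global"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"     # must be supported by TFE and Aurora global database
  database_name             = "tfe"
}

# Writer cluster in the primary region.
resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "tfe-primary"
  engine                    = aws_rds_global_cluster.tfe.engine
  engine_version            = aws_rds_global_cluster.tfe.engine_version
  global_cluster_identifier = aws_rds_global_cluster.tfe.id
  master_username           = var.db_username
  master_password           = var.db_password
}

# Read-replica cluster in the secondary region; promoted during failover.
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.secondary
  cluster_identifier        = "tfe-secondary"
  engine                    = aws_rds_global_cluster.tfe.engine
  engine_version            = aws_rds_global_cluster.tfe.engine_version
  global_cluster_identifier = aws_rds_global_cluster.tfe.id
  depends_on                = [aws_rds_cluster.primary]
}
```

Note that the secondary cluster carries no master credentials; it inherits data from the global cluster and only becomes writable when promoted.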
We also recommend the following:
- Embed the recommendations in HashiCorp's public document regarding PostgreSQL database failover for Terraform Enterprise into your multi-region failover run book.
- The HVD Modules for Terraform Enterprise support single-region deployments, as this is the primary application architecture from the HashiCorp Product team. From a database perspective, therefore, focus on enabling point-in-time recovery on the database, maximizing retention, choosing the SKU that confers the fastest cross-region update times, and enabling read replicas in the secondary region.
- Practice PITR-based recovery of the database to the secondary region, restoring to an earlier point than the latest. Doing this proves that your run book can direct recovery from both a region failover and a replicated database corruption scenario.
Redis
- Redis does not require replication between regions.
- Ensure you deploy Redis in both regions and make it ready in the failover region before starting Terraform Enterprise. The HVD Module ensures this if used iteratively to deploy instances into both regions.
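On AWS, having Redis ready in the failover region can be sketched as an independent ElastiCache replication group per region (no cross-region link; the provider alias, names, and sizing are illustrative):

```hcl
# Standalone Redis in the secondary region, ready before TFE starts there.
resource "aws_elasticache_replication_group" "tfe_secondary" {
  provider                   = aws.secondary
  replication_group_id       = "tfe-redis-secondary"
  description                = "TFE Redis (secondary region)"
  engine                     = "redis"
  node_type                  = "cache.m5.large"
  num_cache_clusters         = 2
  automatic_failover_enabled = true
}
```

Because Terraform Enterprise treats Redis as an ephemeral cache, the secondary instance starts empty; no data migration is required at failover time.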