Terraform Enterprise Failover
Fail over a Terraform Enterprise instance only during failover testing, or during a genuine outage by following the recommendations below in combination with your site-specific run book.
Customers deploy the application in a range of contexts and regulatory environments, so this page provides prescriptive, but generic, guidance in this regard.
In this guide, primary refers to your main active operating region, and secondary refers to the passive failover region to which you replicate your data in case the primary region goes offline. Once the outage has passed, you fail back from the secondary to the primary.
Detection of need
We recommend deep observability around Terraform Enterprise, both consuming live metrics and processing the diagnostics API. Further, we strongly recommend connecting alerts to these metrics so that you are notified whenever health endpoints stop returning 200 responses.
These provisions let the Terraform Enterprise platform owner keep a close eye on the health of the platform. It is always better to notify your user base of an outage than to have them notify you.
On receiving a system health alert, the platform team can assess the level of outage and notify senior management accordingly, but we suggest that the decision to fail over always requires human confirmation.
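The alerting described above can be sketched as a minimal external probe. This is an illustrative assumption, not a prescribed implementation: `TFE_HOST` is a placeholder for your instance's FQDN, and `/_health_check` is the Terraform Enterprise health endpoint.

```shell
#!/bin/sh
# Sketch of an external health probe for Terraform Enterprise.
# TFE_HOST is a placeholder hostname for your environment.
TFE_HOST="${TFE_HOST:-tfe.example.com}"

# Only an HTTP 200 counts as healthy; anything else should raise an alert.
classify_status() {
  if [ "$1" = "200" ]; then
    echo "OK"
  else
    echo "ALERT"
  fi
}

# Probe the health check endpoint and classify the response code.
# curl reports 000 when the host is unreachable, which also alerts.
probe_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://${TFE_HOST}/_health_check")
  classify_status "$code"
}
```

In practice you would run `probe_health` from a scheduler or monitoring agent in a different failure domain than the Terraform Enterprise instance itself, so that a regional outage does not also take out the probe.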
Standard operating procedure (SOP) for failover
Automate the actual failover process as much as possible to ensure you execute steps in the correct order once you confirm that you require failover.
Recommended failover steps
- Declare outage - Notify end users and stakeholders.
- Contact HashiCorp Support - Even if you have the situation under control, we recommend letting us know.
- Start failover - The next steps depend on the outage scope.
- If primary compute is accessible, drain Terraform Enterprise workloads and scale down to zero.
- If compute is inaccessible (for example, regional outage), skip this step.
- Failover actions (can run in parallel).
- DNS - Update global DNS to point to the failover load balancer.
- PostgreSQL - Promote the failover PostgreSQL from a read replica to read-write. Only do this after you have fully confirmed that the compute layer in the primary region cannot write to the primary region database instance, to avoid split-brain.
- Terraform Enterprise startup - Launch a single Terraform Enterprise instance in the failover region (Docker, Podman, Nomad, Kubernetes, OpenShift).
- Validate health - Ensure Terraform Enterprise is operational and accessible. If the cluster is unhealthy, refer to the restore page of this HVD for further assistance.
- Scale out - Once stable, add additional Terraform Enterprise instances as needed.
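The PostgreSQL promotion above is the riskiest step to automate because of the split-brain hazard. A minimal guard, assuming an operator sets an explicit confirmation variable (the variable name here is hypothetical), might look like:

```shell
#!/bin/sh
# Split-brain guard sketch: refuse to promote the secondary PostgreSQL
# unless an operator has explicitly confirmed that the primary-region
# compute layer can no longer write to the primary database.
# PRIMARY_WRITES_CONFIRMED_STOPPED is an assumed variable name.
guard_promote() {
  if [ "$PRIMARY_WRITES_CONFIRMED_STOPPED" = "yes" ]; then
    echo "promote"   # safe to promote the read replica to read-write
  else
    echo "abort"     # split-brain risk; require human confirmation first
  fi
}
```

An automation pipeline would call `guard_promote` before issuing the cloud provider's promote command, keeping the human-confirmation requirement from the SOP intact even in an otherwise automated failover.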
Standard operating procedure (SOP) after failover
When the primary region recovers from its outage:
- S3-compatible storage - Ensure the failover region's bucket is replicating data back to the primary region.
- PostgreSQL - If self-hosted, you must reinitialize the primary region's PostgreSQL instance so it receives replicated data from the failover region.
- This may not apply if you use a managed cloud database service.
- Ensure the database replicas are up-to-date, because the intention is to promote the replica in the primary region to read-write after failing back.
Standard operating procedure (SOP) for failback
When the primary region is ready to return to normalized operations, follow these steps:
- Schedule failback - Notify end users and stakeholders of a maintenance window for failback.
- Drain workloads - Scale down Terraform Enterprise workloads to zero in the secondary region.
- Failback infrastructure
- S3-compatible storage - Ensure replication is complete from failover to primary, and no additional objects need replication.
- PostgreSQL - Ensure replication is complete from failover to primary. Demote the PostgreSQL instance in the secondary (failover) region to read-only, then promote the PostgreSQL instance in the primary region to read-write. Double-check that you have only one read-write database instance.
- DNS - Update DNS to point back to the primary region.
- Launch Terraform Enterprise in primary region - Start a single instance and verify health.
- Scale out - Once stable, add additional Terraform Enterprise instances as needed.
- Post-failback verification
- Ensure that the Terraform Enterprise application is down in the secondary region.
- If applicable, reinitialize the failover PostgreSQL instance as a read-only replica so it is ready for future replication.
- Confirm S3-compatible storage replication from primary to failover is operational.
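The first post-failback check can be scripted as a simple pair of probes. A minimal sketch, assuming both regions expose the Terraform Enterprise health endpoint and using placeholder host names:

```shell
#!/bin/sh
# Post-failback verification sketch: the primary must answer 200 and
# the secondary must no longer serve the application.
# Host names are placeholders for your environment.
PRIMARY_HOST="${PRIMARY_HOST:-tfe.example.com}"
SECONDARY_HOST="${SECONDARY_HOST:-tfe-dr.example.com}"

# curl prints 000 when the host is unreachable.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' "https://$1/_health_check"
}

# $1 = primary HTTP code, $2 = secondary HTTP code.
# Succeeds only when the primary is healthy AND the secondary is down.
failback_ok() {
  [ "$1" = "200" ] && [ "$2" != "200" ]
}
```

A wrapper would call `failback_ok "$(http_code "$PRIMARY_HOST")" "$(http_code "$SECONDARY_HOST")"` and fail the maintenance window sign-off if it returns non-zero.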
Additional information
This section provides links and further details that are useful when following the preceding SOPs.
The sections below cover only public cloud services because, for example, confirming the status of replication between datacenter instances of an S3-compatible object store depends on the technology employed. For both the object store and database implementations supporting your private cloud Terraform Enterprise deployment, consult the operations team that owns the storage platform for more details.
Object store
Select the cloud service provider tab below for more object store-related failover information. None of the commands in the tabs below use exposed, dedicated "last write" APIs. They all work by listing objects and sorting by the LastModified timestamp metadata. For large buckets with millions of objects, this can be slow; at scale, consider enabling access logging instead. See the Terraform: Solution Design Guide Multiple Regions page for more information. The commands below aim to widen your toolkit for understanding more about your failover scenario.
Confirm the last modified object in an S3 bucket (for example the main object store bucket in the secondary region) by using this command:
aws s3api list-objects-v2 \
--bucket YOUR_BUCKET_NAME \
--query 'sort_by(Contents, &LastModified)[-1].{Key: Key, LastModified: LastModified}' \
--output json
- See the AWS documentation for information on confirming the status of object store replication between regions.
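Alongside the last-modified check, S3 exposes per-object replication metadata via head-object. A sketch of interpreting it, with placeholder bucket and key names; the status values handled below are the ones S3 can return, hedged to cover both spellings seen in the wild:

```shell
#!/bin/sh
# Sketch: interpret the per-object S3 ReplicationStatus field.
# Values include PENDING, COMPLETED/COMPLETE, FAILED, and REPLICA
# (the object is itself a replica in the destination bucket).
replication_state() {
  case "$1" in
    COMPLETED|COMPLETE) echo "replicated" ;;
    PENDING)            echo "in-flight" ;;
    FAILED)             echo "needs attention" ;;
    REPLICA)            echo "replica copy" ;;
    *)                  echo "unknown" ;;
  esac
}

# Live check; $1 = bucket and $2 = key are placeholders.
check_object() {
  status=$(aws s3api head-object --bucket "$1" --key "$2" \
    --query 'ReplicationStatus' --output text)
  replication_state "$status"
}
```

Running `check_object` against a recently written object in the primary bucket gives a quick yes/no on whether that specific object made it to the failover region.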
Database
Only AWS surfaces the status of cross-region PostgreSQL database replication in its logs. On any cloud, run the following command against the read replica in the secondary region to ascertain the last WAL position successfully replayed. Note that pg_current_wal_lsn() is only valid on a read-write instance and returns an error on a replica in recovery, so use pg_last_wal_replay_lsn() here:
psql -U postgres -d tfe -c "SELECT pg_last_wal_replay_lsn();"
where tfe is the name of your Terraform Enterprise PostgreSQL database.
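To express the gap in bytes rather than as a raw position, you can compare the primary's current write position with the replica's replay position using pg_wal_lsn_diff. A sketch with placeholder hosts; the 16 MB threshold (one default WAL segment) is an illustrative assumption, not a recommendation:

```shell
#!/bin/sh
# Sketch: byte-level replication lag between primary and replica.
# $1 = primary host, $2 = replica host (placeholders).
primary_replay_gap_bytes() {
  primary_lsn=$(psql -h "$1" -U postgres -d tfe -tA \
    -c "SELECT pg_current_wal_lsn();")
  psql -h "$2" -U postgres -d tfe -tA \
    -c "SELECT pg_wal_lsn_diff('${primary_lsn}', pg_last_wal_replay_lsn());"
}

# Classify a byte-lag value; thresholds are assumed for illustration.
lag_severity() {
  if [ "$1" -eq 0 ]; then
    echo "in-sync"
  elif [ "$1" -lt 16777216 ]; then   # less than one 16 MB WAL segment
    echo "catching-up"
  else
    echo "lagging"
  fi
}
```

Because pg_current_wal_lsn() requires the primary to be reachable, `primary_replay_gap_bytes` is only useful before an outage or after recovery; during the outage itself, fall back to the replay-timestamp query above.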
If the region with the primary Terraform Enterprise implementation fails and is offline long enough to start the failover recovery, follow the steps below.
- Notify your user base of the outage.
- Determine the outage time (T0).
- Check whether any pending objects did not replicate to the secondary region. This process depends on your cloud provider; recommendations are in the Terraform: Solution Design Guide Multiple Regions page.
- The preceding link includes cloud-specific metric information on replication lag. In addition, run the following script to find the replication lag time on the secondary PostgreSQL database by connecting directly to the secondary read replica, first updating the psql* environment variables with your database details.
export psqlHostSecondary="yourSecondaryPostgreSQL.example.com"
export psqlDatabase="tfe_primary"
export psqlUsername="psqladmin"
export psqlPassword="***"
PGPASSWORD=${psqlPassword} \
psql --host ${psqlHostSecondary} \
--dbname ${psqlDatabase} \
--username ${psqlUsername} \
--command "SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS slave_lag"
This script returns the number of seconds since the last update. Monitor this until it resets.
- Promote the read replica in the secondary region to read-write, and wait for it to come up.
- Turn on the secondary VM scaling group by scaling from zero nodes to > 0.
- Amend your DNS setup. Not every cloud supports global DNS at this time, so the approach differs between cloud vendors and will also differ on private cloud external operational mode setups.
- Notify your user base when you are back online.
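On AWS, the DNS amendment step above can be scripted with Route 53. A sketch in which the hosted zone ID, record name, and target load balancer name are all placeholders:

```shell
#!/bin/sh
# Sketch: repoint a CNAME at the failover load balancer via Route 53.

# $1 = record name, $2 = new target; emits an UPSERT change batch.
make_change_batch() {
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":60,"ResourceRecords":[{"Value":"%s"}]}}]}' "$1" "$2"
}

# Live call; ZONE_ID is a placeholder hosted zone ID.
repoint_dns() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$ZONE_ID" \
    --change-batch "$(make_change_batch "$1" "$2")"
}
```

The low TTL (60 seconds) is an assumed value chosen so that clients pick up the failover target quickly; set it ahead of time, since lowering a TTL only helps after the old, higher TTL has expired.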
Links for promoting public cloud PostgreSQL secondary region read replicas to read-write:
- AWS: Promoting a read replica to a DB cluster for Aurora (document cites MySQL but applies to PostgreSQL)
- Azure: Promote an Azure Database for PostgreSQL read replica
- GCP: Promote replicas for regional migration or disaster recovery
- VMware: Confer with your DBA team. VMware promotes use of pg_auto_failover between leader/follower nodes. To circumvent failures when you lose an entire vSAN cluster, VMware recommends vSphere Replication.