Increase Terraform Enterprise run capacity
This topic describes how to increase the run capacity of your Terraform Enterprise deployment on Kubernetes. For instructions on how to increase the number of replicas, refer to Increase number of replicas.
Introduction
Terraform Enterprise executes runs by creating agent jobs in a different namespace, which in turn create agent pods. Each run executes in its own agent pod. When a run finishes, Terraform Enterprise automatically cleans up the agent job and agent pod. You can increase the maximum number of concurrent agent pods to reduce run queue lengths and the wait time before runs begin execution.
Complete the following steps to increase run capacity:
- Configure the maximum number of concurrent agent jobs.
- Configure memory and CPU limits for individual pods.
- Adjust the Kubernetes worker timeout settings to allow Kubernetes to automatically scale the cluster.
Configure concurrency
To increase the number of agent pods that Terraform Enterprise can run concurrently, update the TFE_CAPACITY_CONCURRENCY value in the values file for the Helm chart and run helm upgrade to update the deployment.
The TFE_CAPACITY_CONCURRENCY value sets the maximum number of agent jobs that each Terraform Enterprise pod can create at a given time. The default concurrency is 10. You can specify up to 50 agent jobs. The following example sets the number of concurrent agent jobs allowed to 11:
env:
  ...
  variables:
    TFE_CAPACITY_CONCURRENCY: "11"
TFE_CAPACITY_CONCURRENCY applies to each terraform-enterprise pod. For example, if you have three terraform-enterprise pods and TFE_CAPACITY_CONCURRENCY is 10, the maximum number of agent pods for Terraform Enterprise is 30. Refer to TFE_CAPACITY_CONCURRENCY for additional information.
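After you update the values file, run helm upgrade to apply the change. The following command is a sketch; the release name terraform-enterprise, the namespace terraform-enterprise, and the overrides file name overrides.yaml are assumptions that depend on how you installed the chart:
# Apply the updated values to the existing release (release, namespace, and file names are placeholders)
helm upgrade terraform-enterprise hashicorp/terraform-enterprise \
  --namespace terraform-enterprise \
  --values overrides.yaml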
Configure limits for individual agent pods
You can increase the maximum amount of memory and CPU available to each agent pod by updating the TFE_CAPACITY_MEMORY and TFE_CAPACITY_CPU values and running helm upgrade to update the deployment. Refer to the Helm chart for additional information.
In the following example, the CPU limit is set to 0, which allows an unlimited amount of CPU. The memory limit is set to 2048, which allows up to 2048 mebibytes.
env:
  ...
  variables:
    TFE_CAPACITY_CONCURRENCY: "10"  # Set the maximum number of concurrent runs, for example 10
    TFE_CAPACITY_CPU: "0"           # Set the maximum CPU utilization. "0" equals unlimited.
    TFE_CAPACITY_MEMORY: "2048"     # Set the maximum memory utilization. "2048" equals 2048Mi.
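To verify the limits on a running agent pod, you can inspect it with kubectl. The following commands are a sketch that assumes your agent jobs run in a namespace named tfe-agents; substitute the namespace and pod name from your own deployment:
# List agent pods in the agents namespace (the namespace name is an assumption)
kubectl get pods --namespace tfe-agents

# Inspect the resource requests and limits on a specific agent pod
kubectl describe pod <agent-pod-name> --namespace tfe-agents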
Use Kubernetes cluster autoscaling
Enable the autoscaling setting for your Kubernetes cluster so that Kubernetes can automatically scale the node capacity when Kubernetes cannot schedule a run due to resource constraints. You must also adjust the TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT setting so that Terraform Enterprise does not time out before the Kubernetes environment can scale to meet resource demand for additional runs. Set this value to be greater than the number of seconds Kubernetes requires to scale out and initialize a new node that fits the constraints and requirements of the agent jobs that Terraform Enterprise generates.
When autoscaling is enabled for the Kubernetes cluster, Terraform Enterprise still complies with the maximum number of jobs it can run concurrently per the TFE_CAPACITY_CONCURRENCY configuration. We recommend that you carefully configure your Kubernetes environment with infrastructure layer upper and lower bounds on node availability to meet your business needs outside of Terraform Enterprise.
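For example, if new nodes in your cluster typically take a few minutes to provision and initialize, you can raise the worker timeout in the same values file. The 300-second value below is only an illustration; measure how long your cluster takes to scale out and choose a value above that:
env:
  ...
  variables:
    TFE_RUN_PIPELINE_KUBERNETES_WORKER_TIMEOUT: "300"  # Example value in seconds; size to your cluster's scale-out time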
Google Cloud Platform Kubernetes Engine with Autopilot
You can use Google Cloud Platform Kubernetes Engine (GKE) pod annotations to fine-tune the stability and availability of Terraform Enterprise. GKE Autopilot is a mode of operation in which Google manages the cluster's nodes and underlying infrastructure. Refer to the Autopilot documentation for additional information.
At a minimum, we recommend the following annotations and node selectors:
- Require that tfc-agent pods are not interruptible by using the following annotation: cluster-autoscaler.kubernetes.io/safe-to-evict=false
- Select a balanced compute class for both Terraform Enterprise pods and tfc-agent workloads by using the following node selector: cloud.google.com/compute-class: "Balanced"
- Set resource requests for CPU and memory for Terraform Enterprise and tfc-agent pods.
Manage these settings in the Terraform Enterprise Helm chart values.
The following example shows how to configure features significant to operating Terraform Enterprise in GKE Autopilot. Note that the example is incomplete and does not include additional configurations for operating Terraform Enterprise:
# Terraform Enterprise resource requests, annotations, and node selectors
resources:
  requests:
    memory: "8000Mi"
    cpu: "8"
nodeSelector:
  cloud.google.com/compute-class: "Balanced"
pod:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

# Agent resource requests, annotations, and node selectors, utilizing the agent pod template feature
agentWorkerPodTemplate:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    nodeSelector:
      cloud.google.com/compute-class: "Balanced"
    containers:
      - name: "tfc-agent"
        image: "hashicorp/tfc-agent:1.17.5"
        resources:
          requests:
            memory: 2Gi
            cpu: 2
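After you apply these values with helm upgrade, you can confirm that Autopilot scheduled the workloads on the requested compute class by checking pod placement and node labels. The commands below are a sketch and assume the agent namespace is tfe-agents:
# Show which nodes the agent pods were scheduled on (the namespace name is an assumption)
kubectl get pods --namespace tfe-agents -o wide

# Confirm the nodes carry the Balanced compute-class label
kubectl get nodes --show-labels | grep "cloud.google.com/compute-class"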