Detect infrastructure drift and enforce OPA policies
As your organization grows and your infrastructure provisioning workflows mature, it gets harder to enforce consistency and best practices with training and hand-built tooling alone. Terraform can automatically check that your infrastructure satisfies industry best practices and organization-specific standards, with resource and module-specific conditions, workspace-specific run tasks, and workspace or organization-wide policies, including Open Policy Agent (OPA).
Note
Terraform Cloud Free Edition includes one policy set of up to five policies. In Terraform Cloud Plus Edition, you can connect a policy set to a version control repository or create policy set versions via the API. Refer to Terraform Cloud pricing for details.
In this tutorial, you will use both Terraform preconditions and Terraform Cloud native OPA support to validate configuration and enforce compliance with organizational practices. First, you will use Terraform preconditions to enforce network security conventions. Then, you will learn how to configure and enforce OPA policies in Terraform Cloud, preventing infrastructure deployments on certain days of the week. Finally, you will use Terraform Cloud's drift detection to detect when infrastructure settings have diverged from your written Terraform configuration.
Pre- and post-conditions help you define resource requirements in Terraform configurations. By including custom conditions in module definitions, you can ensure that downstream consumers comply with configuration standards and use modules correctly.
Prerequisites
This tutorial assumes that you are familiar with the Terraform and Terraform Cloud workflows. If you are new to Terraform, complete the Get Started tutorials first. If you are new to Terraform Cloud, complete the Terraform Cloud Get Started tutorials first.
In order to complete this tutorial, you will need the following:
- Terraform v1.4+ installed locally.
- An AWS account.
- A Terraform Cloud account with Terraform Cloud locally authenticated.
- A Terraform Cloud variable set configured with your AWS credentials.
Create example repository
Visit the template repository for this tutorial. Click the Use this template button and select Create a New Repository. Choose the GitHub owner that you use with Terraform Cloud, and name the new repository learn-terraform-drift-and-opa. Leave the rest of the settings at their default values.
Clone example configuration
Clone your example repository, replacing USER with your own GitHub username. You will push to this repository later in the tutorial.
Change to the repository directory.
Review infrastructure configuration
This repository contains a local Terraform module that defines a network and bastion host, and a root configuration that uses the module. It also contains OPA policy definitions, which you will review later in this tutorial.
Open the modules/network/main.tf file in your code editor. This configuration uses the public vpc module to provision networking resources, including public and private subnets and a NAT gateway. It then launches a bastion host in one of the public subnets.
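To make the module structure concrete, the following is a hypothetical sketch of how a network module might call the public vpc module (terraform-aws-modules/vpc/aws). The version constraint, name, CIDRs, and number of availability zones are illustrative assumptions, not the values in the example repository.

```hcl
# Look up the availability zones in the current region.
data "aws_availability_zones" "available" {
  state = "available"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # assumed version constraint

  name = "learn-drift-opa" # hypothetical name
  cidr = "10.0.0.0/16"     # illustrative CIDR

  # Spread the subnets across the first two availability zones.
  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # One shared NAT gateway gives the private subnets outbound internet
  # access without accepting inbound connections.
  enable_nat_gateway = true
  single_nat_gateway = true
}
```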
The bastion host is intended to be the single point of entry for any SSH traffic to instances within the VPC’s private subnets. The configuration also includes a security group that scopes any ingress SSH traffic to the bastion to just the 192.80.0.0/16 CIDR block, a hypothetical CIDR representing your organization’s network.
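A security group with this restriction might look like the sketch below. It assumes the VPC comes from a module named vpc; the resource address aws_security_group.bastion and the bastion_ssh name match those used later in this tutorial, but the remaining arguments are illustrative.

```hcl
resource "aws_security_group" "bastion" {
  name   = "bastion_ssh"
  vpc_id = module.vpc.vpc_id # assumes a module named "vpc"

  # Allow SSH only from the (hypothetical) organization network.
  ingress {
    description = "SSH from the organization network"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["192.80.0.0/16"]
  }

  # Allow all outbound traffic.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```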
Though this configuration references this module locally, in a larger organization, you would likely publish it in your Terraform registry. By including a bastion in the boilerplate of your networking configuration, you can establish a standard for SSH access to instances in your networks.
Define a precondition
The network module defines a bastion_instance_type input variable so that users can size the bastion for their anticipated usage and workloads. While you want to allow users to specify an instance type, you do not want them to provision an instance that is too big. To keep your operating costs low, you will add a precondition that verifies the instance type has no more than 2 cores.
First, add a data source to the module configuration that retrieves the instance type details, including the number of cores, from the AWS provider.
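One way to write this lookup is with the AWS provider's aws_ec2_instance_type data source, sketched below. The local name bastion is an assumption; the data source exposes the instance type's default core count as default_cores.

```hcl
# Retrieve details about the requested bastion instance type from AWS,
# including its default number of CPU cores (default_cores).
data "aws_ec2_instance_type" "bastion" {
  instance_type = var.bastion_instance_type
}
```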
Now, add the precondition to the aws_instance.bastion resource definition.
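The sketch below shows where a precondition could go, assuming an instance type lookup named data.aws_ec2_instance_type.bastion that reads var.bastion_instance_type. Only the lifecycle block is new; keep the resource's other arguments as they are.

```hcl
resource "aws_instance" "bastion" {
  instance_type = var.bastion_instance_type
  # ... other existing arguments (AMI, subnet, security groups) ...

  lifecycle {
    # Terraform checks this condition before it creates or updates the
    # instance, and stops the run with error_message if it is false.
    precondition {
      condition     = data.aws_ec2_instance_type.bastion.default_cores <= 2
      error_message = "Choose a bastion_instance_type with 2 or fewer cores to avoid overprovisioning."
    }
  }
}
```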
Terraform evaluates preconditions before provisioning the surrounding block. In this case, it will check whether the configuration satisfies the condition before provisioning the bastion instance.
Deploy infrastructure
Navigate back to the root level of the repository directory. The root Terraform configuration uses the network module to create a bastion host and networking components including a VPC, subnets, a NAT gateway, and route tables.
It sets the values for input variables in the terraform.auto.tfvars file. The initial value for the bastion instance type is t2.2xlarge, which has 8 cores and will fail the precondition as expected.
Set your Terraform Cloud organization name as an environment variable to configure your Terraform Cloud integration.
Tip
If multiple users in your Terraform Cloud organization will run this tutorial, add a unique suffix to the workspace name in terraform.tf.
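For reference, a minimal cloud block in terraform.tf might look like the sketch below. It assumes the organization is supplied through the TF_CLOUD_ORGANIZATION environment variable, which Terraform reads when the organization argument is omitted. The required_version constraint mirrors this tutorial's Terraform v1.4+ prerequisite; the repository's actual terraform.tf may contain more settings.

```hcl
terraform {
  required_version = ">= 1.4"

  cloud {
    # No "organization" argument: Terraform falls back to the
    # TF_CLOUD_ORGANIZATION environment variable.
    workspaces {
      # Add a unique suffix here if several people share the organization.
      name = "learn-terraform-drift-and-opa"
    }
  }
}
```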
Initialize your configuration. As part of initialization, Terraform creates your learn-terraform-drift-and-opa Terraform Cloud workspace.
Now, attempt to apply your configuration. The apply will fail because the instance size you specified is too big, and the precondition will return an error.
Note
This tutorial assumes that you are using a tutorial-specific Terraform Cloud organization with a global variable set of your AWS credentials. Review the Create a Credential Variable Set tutorial for detailed guidance. If you are using a scoped variable set, assign it to your new workspace now.
The t2.2xlarge instance type has 8 cores, so this Terraform run failed the precondition defined in the network module. Overprovisioning the bastion would incur unnecessary cost for your organization.
Change the bastion_instance_type variable in terraform.auto.tfvars to t2.small.
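The relevant assignment in terraform.auto.tfvars would then read as follows. This is a fragment: only the changed line is shown, and any other assignments in the file stay as they are.

```hcl
# t2.small has 1 vCPU, so it satisfies the 2-core precondition.
bastion_instance_type = "t2.small"
```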
Apply your configuration again. Respond yes to the prompt to confirm the operation.
Using a precondition to verify resource allocation lets you use the most up-to-date information from AWS to determine whether your configuration satisfies the requirement. You could also have used variable validation to catch the violation, but that would require researching every instance type and its capacity and listing all of the acceptable types in your configuration, which makes it less flexible.
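For comparison, the variable validation approach might look like the hypothetical sketch below. The allowlist is an illustrative assumption; the point is that each approved type has to be researched and listed by hand, and the list must be maintained as your standards or AWS's instance offerings change.

```hcl
variable "bastion_instance_type" {
  description = "Instance type for the bastion host."
  type        = string

  validation {
    # Hand-maintained allowlist of types with 2 or fewer cores.
    condition = contains(
      ["t2.nano", "t2.micro", "t2.small", "t2.medium", "t3.micro", "t3.small"],
      var.bastion_instance_type
    )
    error_message = "The bastion instance type must be one of the approved types with 2 or fewer cores."
  }
}
```

Unlike the precondition, this check never consults AWS, so it rejects any suitable type that is missing from the list.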
Review OPA policy
Configuration-level validation such as variable constraints and preconditions lets you socialize standards from within your written configuration. However, module authors and users must voluntarily comply with the standards. Module authors must include conditions in module definitions, and users must consume those modules to provision infrastructure. To enforce infrastructure standards across entire workspaces or organizations, you can use OPA policies, which work without requiring your users to write their infrastructure configuration in a specific way.
Navigate to the opa directory in the example repository.
Open the policies.hcl file to review the policy set configuration.
This policy set defines two policies, friday_deploys and public_ingress. It sets the enforcement level to mandatory on both, which prevents infrastructure provisioning if a policy fails. Terraform Cloud policies also support an advisory enforcement level, which notifies users of failures but still allows them to provision resources. Each policy's query references the package name declared in its policy file and the name of the rule defined for the policy.
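A policy set configuration of this shape might look like the sketch below. The package paths in each query are assumptions; in practice they must match the package declared in each .rego file, followed by the rule name (here assumed to be deny).

```hcl
# Each query has the form data.<package>.<rule>.
policy "friday_deploys" {
  query             = "data.terraform.policies.fridays.deny" # assumed package path
  enforcement_level = "mandatory"
}

policy "public_ingress" {
  query             = "data.terraform.policies.public_ingress.deny" # assumed package path
  enforcement_level = "mandatory"
}
```

Switching enforcement_level to "advisory" on either policy would report failures without blocking the run.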
Change to the policies directory to review the policy definitions.
The public_ingress policy parses the planned changes for a Terraform run and checks whether they include security group updates to allow public ingress traffic from all CIDRs (0.0.0.0/0). This policy helps enforce your security posture by preventing the creation of any overly permissive security groups.
In addition to placing guardrails on infrastructure configuration, you may wish to enforce standards around your organization’s workflows themselves. One common practice is to prevent infrastructure deployments on Fridays in order to lower the risk of production incidents before the weekend. The friday_deploys policy prevents infrastructure deployments on a certain day of the week.
In the friday_deploys.rego file, replace DAY with the current day of the week (for example, Tuesday) to test that the policy blocks deploys today.
Stage your update to the policy.
Commit the change.
Then, push your change.
Create a policy set
Terraform Cloud organizes policies in policy sets. Policy sets can contain either Sentinel or OPA policies. You can apply a policy set across an organization, or only to specific workspaces.
There are three ways to manage policy sets and their policies: VCS repositories, the Terraform Cloud API, or directly through the Terraform Cloud UI. In this tutorial, you will configure policy sets through VCS. The VCS workflow lets you collaborate on, version, and safely develop your OPA policies, and establishes the repository as their source of truth.
Navigate to your organization's Settings, then to Policy Sets. Click Connect a new policy set.
Select your GitHub version control integration.
Tip
Review the cloud VCS tutorial for detailed guidance on how to configure your VCS integration.
Select your learn-terraform-drift-and-opa repository.
On the Settings page:
- Select OPA as the policy integration.
- Under Policy set source, expand the More options drop-down.
- Set the Policies Path to opa.
- Set the Scope of Policies to Policies enforced on selected workspaces.
- Uncheck the box to allow overrides of failures.
Under Workspaces, select your learn-terraform-drift-and-opa workspace. Then, click Add workspace.
Finally, click Connect policy set.
Trigger policy violation
The networking resources you provisioned earlier include a bastion host configured with a security group that restricts ingress traffic to your organization’s internal network. Imagine that an engineer is troubleshooting a production incident and tries to get around this restriction by making the security group more permissive.
To simulate this, update the ingress rule for the aws_security_group.bastion resource in modules/network/main.tf.
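The permissive change could look like the sketch below, where the restricted source CIDR in the SSH ingress rule is replaced with 0.0.0.0/0. The rule's other arguments are assumed for illustration; only cidr_blocks is the change that matters.

```hcl
resource "aws_security_group" "bastion" {
  # ... other existing arguments and rules unchanged ...

  # Overly permissive: allows SSH from any IPv4 address. The public_ingress
  # policy is meant to reject a plan containing this rule.
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```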
Navigate back to the root repository directory.
Run terraform apply to attempt to update the security group.
Terraform Cloud detected the policy failures: the security group allows public ingress, and deploys are blocked today. The CLI output and run details in Terraform Cloud list which policies failed.
Using OPA policies in Terraform Cloud, you prevented Terraform from making infrastructure changes that violate your infrastructure and organizational standards.
Before moving on, fix your policy and configuration to allow a successful apply.
First, update the friday_deploys policy to check for deployments on Fridays. (If today is Friday, pick another day.)
Stage your update to the policy.
Commit the change.
Then, push your change.
Revert your change to the aws_security_group.bastion resource in modules/network/main.tf so that it matches your deployed infrastructure again.
Reapply your configuration to bring your workspace back into a healthy state.
Introduce infrastructure drift
Note
Drift detection is available in Terraform Cloud Plus Edition. Skip to the clean up step if you do not have access, or refer to Terraform Cloud pricing for details.
Custom conditions, input validation, and policy enforcement help organizations maintain their standards at the time of resource provisioning. Terraform Cloud can also check whether existing resources in Terraform state still match the intended configuration.
Returning to the hypothetical production incident, imagine that an engineer tries to work around the policy by making manual resource changes while troubleshooting.
To simulate this, navigate to your security groups in the AWS console.
Find the bastion_ssh security group. Select the Inbound rules tab in the security group details, then click Edit inbound rules.
Delete the 192.80.0.0/16 source CIDR and replace it with 0.0.0.0/0. Then, click Save rules.
You have now introduced infrastructure drift: the security group no longer matches your Terraform configuration, because you modified it outside of the Terraform workflow.
Detect drift
Terraform Cloud’s automatic health assessments help make sure that existing resources match their Terraform configuration. To do so, Terraform Cloud runs non-actionable, refresh-only plans in configured workspaces to compare the actual settings of your infrastructure against the resources tracked in your workspace’s state file. The assessments do not update your state or infrastructure configuration.
Assessments include two types of checks, which you enable together. Drift detection determines whether resources have changed outside of the Terraform workflow. Health checks verify that any custom conditions defined in your configuration still pass, for example, whether a certificate has expired. You can enable assessments on specific workspaces or across all workspaces in an organization. Assessments only run on workspaces where the last apply was successful; if the last apply failed, the workspace already needs operator attention. Make sure your last apply succeeded before moving on.
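As an illustration of a custom condition that health checks could re-evaluate, the hypothetical sketch below adds a postcondition to the bastion instance that fails if the instance is no longer running. It relies on the aws_instance resource's instance_state attribute; whether a given condition is useful to monitor this way is a design choice for your configuration.

```hcl
resource "aws_instance" "bastion" {
  # ... existing arguments ...

  lifecycle {
    # ... keep any existing precondition in this same lifecycle block ...

    # Hypothetical condition: report a failure if the bastion has been
    # stopped or terminated since it was provisioned.
    postcondition {
      condition     = self.instance_state == "running"
      error_message = "The bastion host is no longer running."
    }
  }
}
```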
Navigate to your learn-terraform-drift-and-opa workspace in the Terraform Cloud UI. Under the workspace's Settings, select Health.
Select Enable, then click Save settings.
Shortly after you enable health assessments, the first assessment runs on the workspace. After that, assessments run once every 24 hours.
After a few minutes, Terraform Cloud will report failed assessments on the workspace overview page.
Click View Details to get more information. Terraform Cloud detected the change to your ingress rule and reported what will happen on your next run if you do not update your configuration.
Note
Drift detection only reports on changes to the resource attributes defined in your configuration. To avoid accidental drift, explicitly set any attributes critical to your operations in your configuration, even if you rely on a provider's default value for that attribute.
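For example, a bastion host depends on having a public IP address, which it would otherwise inherit from the subnet's settings. The hypothetical sketch below sets that attribute, and detailed monitoring, explicitly even though defaults exist, so that a manual change to either value appears in drift detection. Which attributes count as critical is an assumption you should adapt to your own operations.

```hcl
resource "aws_instance" "bastion" {
  # ... existing arguments ...

  # Set explicitly instead of relying on the subnet or provider default,
  # so drift detection reports if someone changes these values.
  associate_public_ip_address = true
  monitoring                  = false
}
```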
The health assessments detected infrastructure drift. These checks verify that your infrastructure still matches your written configuration and satisfies any defined custom conditions, extending your validation coverage beyond the time of provisioning. Fixing drift is a manual process, because you need to decide whether to keep the infrastructure changes made outside of Terraform or overwrite them. In this case, you could run another Terraform apply to overwrite the security group update.
Clean up infrastructure
Destroy the resources you created as part of this tutorial to avoid incurring unnecessary costs. Respond yes to the prompt to confirm the operation.
Optionally, delete your learn-terraform-drift-and-opa workspace and OPA policy set from your Terraform Cloud organization.
Next steps
In this tutorial, you used Terraform language features and Terraform Cloud policies to make sure that your infrastructure matches your configuration and complies with your organization’s needs and standards. Configuration-level validation such as preconditions lets you specify standards within Terraform configurations. Terraform Cloud policies let you enforce standards for an entire workspace or organization. You also used Terraform Cloud health assessments to make sure that existing infrastructure still matched your Terraform configuration and had not changed outside of the Terraform workflow.
To learn more about how Terraform features can help you validate your infrastructure configuration, check out the following resources:
- Review the OPA and policy documentation.
- Learn how to configure and use health assessments to detect infrastructure drift.
- Learn how to manage your infrastructure costs in Terraform Cloud.
- Learn how to use Terraform Cloud run tasks and HCP Packer to ensure machine image compliance.
- Review the health assessment documentation.