Observability
Observability is about being able to look at the live outputs of a system running in production and answer questions about what is going on. It is often associated with monitoring, but the two concepts are distinct:
- Observability is about setting up a system so that we have access to real-time data about what the system is doing
- Monitoring is about capturing this data and analyzing it to make decisions. Examples of monitoring goals are detecting failures or performance degradation, capacity planning, and intrusion detection.
Observability in HCP Terraform and Terraform Enterprise
Observability features play a critical role in HCP Terraform and Terraform Enterprise for maintaining a secure and compliant infrastructure-as-code workflow. These features provide detailed visibility into user actions, changes to infrastructure, and system events, creating a comprehensive record of all activities within the platform.
By capturing this information, organizations can effectively monitor and analyze activities, identify potential security threats, track configuration changes, and troubleshoot issues quickly. Audit and logging features help meet regulatory requirements, bolster accountability, and support incident response efforts.
HCP Terraform and Terraform Enterprise offer several features to support observability:
| Observability feature | HCP Terraform | Terraform Enterprise |
|---|---|---|
| Operational logs | No | Yes |
| Audit trail | Yes | Yes |
| Metrics | No | Yes |
| HCP Terraform agent logs | Yes | Yes |
Operational logs track the performance and behavior of the system. They provide information about the system's functioning, such as error messages, warnings, and other events that can help troubleshoot issues and identify performance bottlenecks. SRE teams typically use operational logs to monitor and maintain the service.
On the other hand, audit trail logs focus on tracking security-related events and activities within a system. They capture information about login attempts, access control changes, suspicious activities, and other security events. Audit trail logs are crucial for detecting and investigating security incidents, as they record who did what in a system. Security analysts and incident response teams often use these logs to identify and respond to potential threats.
Metrics measure application component performance and usage and are the raw data used to detect service quality issues or inform a capacity planning exercise.
HCP Terraform and Terraform Enterprise differ regarding operational logs and metrics, which stems from the shared responsibility model associated with HCP Terraform. With HCP Terraform, HashiCorp's SRE team tracks the operational logs and metrics as part of the service's operation. With Terraform Enterprise, your SRE team in charge of running the internal infrastructure-as-code service tracks those operation logs and metrics.
If you are using HCP Terraform agents, include their logs in the set of logs you collect and analyze. This helps ensure that you have a complete picture of all activities and can identify any issues or errors that may arise. Analyzing your logs together better equips you to make informed decisions about optimizing and improving your HCP Terraform deployment, and regular log review helps ensure your systems run securely.
Monitoring focuses for Terraform Enterprise
You can configure Prometheus to gather metrics from Terraform Enterprise and its underlying components. Terraform Enterprise generates operational metrics that Prometheus can collect. This setup allows you to monitor system metrics, including CPU, memory usage, and request latency. For more information on constructing Prometheus queries, refer to this documentation https://prometheus.io/docs/prometheus/latest/querying/functions/#functions.
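As a sketch, a Prometheus scrape job for this endpoint might look like the following. The target hostname is a placeholder, and the port and format parameter correspond to the HTTPS metrics defaults described later in this document.

```yaml
# prometheus.yml excerpt (sketch); tfe.example.com is a placeholder.
scrape_configs:
  - job_name: terraform-enterprise
    scheme: https
    metrics_path: /metrics
    params:
      format: [prometheus]   # request Prometheus-formatted metrics
    scrape_interval: 15s     # TFE flushes aggregated values every 15 seconds
    static_configs:
      - targets: ["tfe.example.com:9091"]
```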
Also, you can use Grafana to visualize the metrics collected by Prometheus. You can create dashboards that provide real-time insights into the health of your Terraform Enterprise infrastructure, such as resource utilization trends and request rates. You can access the Terraform Enterprise Grafana dashboard here https://grafana.com/grafana/dashboards/15630-terraform-enterprise/
Monitoring your instances with these tools helps your business account for the health, availability, and scalability of your system. We recommend the following monitoring guidelines for your Terraform Enterprise instances.
Traffic
To effectively schedule downtime for maintenance and determine the peak hours of your Terraform Enterprise instance, measure traffic flow by analyzing the throughput of completed jobs: take the rate of all jobs entering a terminal status. An example query follows.
rate(sum by(status)(tfe_run_current_count{status=~"applied|planned_and_finished|errored|discarded|canceled"})[30s:]) * 60
Demand capacity
As your business continues to grow, so do your operational capacity needs. When the number of runs exceeds the configured system capacity, new jobs queue in a pending state. To determine whether Terraform Enterprise has been consistently exceeding its configured capacity limits, the following query looks at the system as a whole and lets engineers identify spikes in usage.
sum (tfe_run_current_count{status="pending"})
You may also want to narrow down this search by aggregating by workspace or organization. Do this with the following example.
sum by (organization_name) (tfe_run_current_count{status="pending"})
Monitoring the demand capacity of your system lets your practitioners determine the best time to scale capacity resources. Also set alerts for metrics exceeding thresholds in any capacity, such as CPU or memory usage passing a percentage threshold determined by your business to signal potential overload. Alerts on the performance of PostgreSQL, Redis, and Vault help identify issues affecting Terraform Enterprise stability or performance.
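As an illustration, a Prometheus alerting rule for a sustained run backlog might look like the following sketch. The threshold of 10 pending runs and the five-minute window are assumptions to tune against your own capacity settings.

```yaml
# Prometheus alerting rule (sketch); threshold and window are assumptions.
groups:
  - name: terraform-enterprise
    rules:
      - alert: TFERunBacklog
        # Fire when pending runs stay above the threshold for five minutes.
        expr: sum(tfe_run_current_count{status="pending"}) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Terraform Enterprise run backlog exceeds configured capacity"
```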
Failure rate
An increase in run failures can suggest a widespread issue within your Terraform Enterprise deployment. To troubleshoot effectively, monitor the run counts for each run status. The tfe_run_current_count metric carries a status label, allowing you to filter run counts by status. An example query for the errored status follows.
sum (tfe_run_current_count{status="errored"})
To view failure trends, examine the rate with the following query.
rate(sum (tfe_run_current_count{status="errored"})[1m:])
Other container and global metrics are in our public documentation: https://developer.hashicorp.com/terraform/enterprise/flexible-deployments/monitoring/observability/metrics
Run resource usage
Proactively scaling your instances based on CPU or memory usage helps avoid performance degradation during peak loads and ultimately ensures a consistent and reliable user experience. Use the following query to determine what percentage of the instance's CPU each organization is consuming.
sum by (organization_name)((rate(tfe_container_cpu_usage_kernel_ns{run_type!=""}[1m]) + rate(tfe_container_cpu_usage_user_ns{run_type!=""}[1m])) / 1e7)
Use the same approach for memory usage, but without the rate function, since memory usage is reported as a gauge rather than a counter.
sum by(organization_name)(tfe_container_memory_used_bytes{run_type!=""})
You can further aggregate per workspace and filter down by organization:
sum by(workspace_name)(tfe_container_memory_used_bytes{run_type!="",organization_name="my-org"})
Health checks
Health checks are crucial for scaling Terraform Enterprise because they ensure the system's reliability and performance by proactively detecting and addressing issues before they impact availability or efficiency during scaling operations. Learn more about monitoring the health of the application through external health checks in our public documentation https://developer.hashicorp.com/terraform/enterprise/flexible-deployments/troubleshooting
In high-traffic environments, consider the network impact of external health checks: frequent checks add load and can exacerbate network congestion.
You can access the external health check endpoint using curl:
curl http://$(docker inspect ptfe_health_check | jq -r '.[].NetworkSettings.Networks[].IPAddress'):23005/_health_check
This returns the following.
{"passed": true, "checks": [
{"name": "Archivist Health Check", "passed": true},
{"name": "Terraform Enterprise Health Check", "passed": true},
{"name": "Terraform Enterprise Vault Health Check", "passed": true},
{"name": "Fluent Bit Health Check", "passed": false, "skipped": true},
{"name": "RabbitMQ Health Check", "passed": true},
{"name": "Vault Server Health Check", "passed": true}
]}
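When automating this check, you can parse the response with jq and flag only the checks that failed and were not marked as skipped. The sketch below uses an inline sample response (abbreviated from the output above, with hypothetical values) so the filter's behavior is visible; in practice you would pipe the curl output instead.

```shell
# Abbreviated sample of a /_health_check response (hypothetical values).
response='{"passed": false, "checks": [
  {"name": "Archivist Health Check", "passed": true},
  {"name": "Fluent Bit Health Check", "passed": false, "skipped": true},
  {"name": "Vault Server Health Check", "passed": false}
]}'

# List checks that failed and were not skipped.
failed=$(echo "$response" | jq -r '.checks[] | select((.passed == false) and ((.skipped // false) | not)) | .name')
echo "$failed"
```

This lets an external monitor alert on genuine failures while ignoring components that are intentionally disabled.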
Configuring data collection
In this section, we cover how to set up data collection for:
- Collecting metrics and logs (including audit trail logs) on Terraform Enterprise
- Audit trail logs on HCP Terraform
- Collecting metrics and logs on HCP Terraform agents (applicable for HCP Terraform and Terraform Enterprise)
Terraform Enterprise metrics and logs
Configuring metrics collection on Terraform Enterprise
Terraform Enterprise metrics collection is not enabled by default; enable it explicitly by setting the TFE_METRICS_ENABLE parameter to true.
Next, configure your monitoring tool of choice to periodically query the Terraform Enterprise metrics endpoint to collect and store this information.
| Configuration parameter | Description | Default value |
|---|---|---|
| TFE_METRICS_HTTP_PORT | The HTTP port that exposes metrics. | 9090 |
| TFE_METRICS_HTTPS_PORT | The HTTPS port that exposes metrics. | 9091 |
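For example, in a Docker Compose based Terraform Enterprise deployment (an assumption; adapt this to your installation method), these settings might appear in the service's environment block:

```yaml
# docker-compose.yml excerpt (sketch): enable metrics on the default ports.
services:
  terraform-enterprise:
    environment:
      TFE_METRICS_ENABLE: "true"
      TFE_METRICS_HTTP_PORT: "9090"
      TFE_METRICS_HTTPS_PORT: "9091"
```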
You can capture metrics in either of two formats using the metrics endpoint URL. The table below lists the available options and the URL to use for each. We recommend capturing metrics over an encrypted connection.
| Metrics Endpoint URL | Metrics format |
|---|---|
| https://<tfe_instance>:9091/metrics | JSON |
| https://<tfe_instance>:9091/metrics?format=prometheus | Prometheus |
Terraform Enterprise computes an aggregate metric value from a 5-second sample and keeps this value in memory for 15 seconds before flushing it. This means that if the monitoring tool's polling interval is longer than 15 seconds (for example, every 60 seconds), you may miss information necessary to detect short-lived issues.
If you are running multiple Terraform Enterprise instances, you must collect the metrics from each deployed Terraform Enterprise instance and aggregate the information to have a global view of the Terraform Enterprise service.
Configuring log collection on Terraform Enterprise
Terraform Enterprise emits logs to standard output and standard error. We recommend collecting Terraform Enterprise logs in a central location, preferably using a specialized tool that provides searching and alerting capabilities, although sending logs to object storage, for example, is also supported.
Terraform Enterprise supports a limited number of log destinations.
| Category | Supported log destinations |
|---|---|
| AWS | AWS S3, AWS CloudWatch |
| Microsoft Azure | Azure Blob Storage, Azure Log Analytics |
| Google Cloud Platform | Google Cloud Platform Cloud Logging |
| Specialized SaaS | Datadog, Splunk Enterprise HTTP Event Collector (HEC) |
| Other | Syslog, Fluent Bit or Fluentd instance |
Implementing audit trail on Terraform Enterprise
Terraform Enterprise generates audit trail logs along with the application logs. If you need to forward audit trail logs to a specialized system, such as a Security Information and Event Management (SIEM) solution, configure a filter that intercepts all logs containing the string [Audit Log] and forwards them to the SIEM system.
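If Fluent Bit sits in your log pipeline (one of the supported destinations listed above), a sketch of such a filter could use Fluent Bit's grep filter. The Match pattern and the record key holding the message (log here) are assumptions that depend on your pipeline configuration:

```ini
# Forward only records whose "log" field contains "[Audit Log]" (sketch).
[FILTER]
    Name    grep
    Match   *
    Regex   log \[Audit Log\]
```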
HCP Terraform audit trail logs
HCP Terraform features an Audit Log API endpoint that you must use to collect the audit events and store them in the appropriate system. To implement this solution, you need the following.
- A method to schedule and automate the audit events collection,
- A secure storage solution to store the audit events, and
- A data lifecycle solution to correctly dispose of the audit events once they are no longer required.
If you use a Security Information and Event Management (SIEM) system, make it the destination for those audit events. If you are not using a SIEM but instead use a centralized log management solution (Datadog, New Relic, Elastic, and so on), send the audit events to that system. If neither of these solutions is available, still collect the audit events and store them securely using an object storage solution, such as AWS S3.
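As a hedged sketch, a collection script might call the audit trail endpoint with an organization API token and append the events to local storage before shipping them onward. The endpoint path and the since parameter follow HashiCorp's public Audit Trails API documentation; TOKEN and the timestamp are placeholders.

```shell
# Fetch audit events newer than a timestamp (sketch; TOKEN is a placeholder).
since="2024-01-01T00:00:00Z"
url="https://app.terraform.io/api/v2/organization/audit-trail?since=${since}"

if [ -n "${TOKEN:-}" ]; then
  # Append the raw events locally before forwarding to a SIEM or object store.
  curl -s --header "Authorization: Bearer $TOKEN" "$url" >> audit-events.json
else
  echo "TOKEN not set; would fetch: $url"
fi
```

A scheduler such as cron can run this periodically, advancing the since timestamp each run so events are collected exactly once.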
Metrics and logs on HCP Terraform agents
Configuring HCP Terraform agent metrics collection
The HCP Terraform agent binary exposes telemetry data using the OpenTelemetry protocol. This behavior allows the user to use a standard OpenTelemetry collector to push the metrics to a monitoring solution that supports the protocol, such as Prometheus or Datadog.
To collect telemetry data from the agent, you therefore need:
- A way to deploy and operate OpenTelemetry collector(s)
- A monitoring system that can integrate with OpenTelemetry collectors
Details about the selection of such a monitoring system or the operations of OpenTelemetry collectors are beyond the scope of this document. However, we provide some guidelines regarding integrating OpenTelemetry collectors with HCP Terraform agents.
Because OpenTelemetry is a push system, you must start the collector before the HCP Terraform agent. Conversely, shut down the collector only after you have stopped all HCP Terraform agents using it. For long-running HCP Terraform agents, we recommend a one-to-one ratio of agent instances to OpenTelemetry collectors, as this simplifies management.
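As a sketch, a minimal OpenTelemetry Collector configuration that receives OTLP data from an agent and exposes the metrics for a Prometheus scrape might look like this. The ports are conventional defaults, and the prometheus exporter assumes a collector distribution that includes it:

```yaml
# OpenTelemetry Collector config (sketch); ports are conventional defaults.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # the agent pushes OTLP data here
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # Prometheus scrapes the collector here
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```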
The OpenTelemetry integration tags metrics with a number of useful fields, including the agent pool ID (agent_pool_id) and the agent name (agent_name). If you do not already have a naming convention for your HCP Terraform agents, then we recommend building one, as it helps you organize your dashboard with the metrics collected from HCP Terraform agents. You can then set the agent's name at startup time using the TFC_AGENT_NAME environment variable or the -name command line option.
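For instance, a naming convention of <environment>-<region>-<host> (a hypothetical scheme, not a HashiCorp recommendation) could be assembled and exported before starting the agent:

```shell
# Hypothetical naming convention: <environment>-<region>-<host>.
environment="prod"
region="eu-west-1"
host="agent-01"   # in practice, something like "$(hostname)"

TFC_AGENT_NAME="${environment}-${region}-${host}"
export TFC_AGENT_NAME
echo "$TFC_AGENT_NAME"
```

The agent then reports this name in its telemetry, making per-environment dashboards straightforward to build.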
Configuring HCP Terraform agent log collection
If you are using HCP Terraform agents in your Terraform Enterprise or HCP Terraform deployment, configure log collection for the agents as well.
Using observability data
Data collection and aggregation
As you collect and aggregate metrics and logs, you must consider the following.
- Volume (including granularity and frequency)
- Data retention
- Security and privacy
As you collect metrics, you must consider the volume of information relative to the value of individual metrics. This impacts the range of metrics collected and the collection frequency, and decisions in this area have clear tradeoffs.
Collecting a wide range of metrics at a high frequency yields the most data and potentially a more accurate view of the system's health and load. But it comes at a higher cost because you must collect, store, analyze, alert, and report on all this information.
One approach to control costs without reducing the range of metrics collected is to have clear data retention rules and leverage metrics roll-up.
We recommend keeping highly detailed metrics for shorter periods (we suggest two weeks). After two weeks, discard the metrics or roll them up (for example, aggregating 60-second resolution data to 5-minute, hourly, or daily resolution). If your monitoring platform supports rolling up data, you may choose to implement a staged roll-up as follows.
- Full resolution for the last two weeks.
- 5 minutes roll-up for metrics older than two weeks, up to a month.
- Hourly roll-up for metrics over a month.
- Discard metrics older than three months.
For logs, we recommend keeping application logs for a short period (we suggest two weeks) and discarding them after that. For audit trail logs, retention is typically longer, and you need to align with the policy of your security organization.
When it comes to logs, be mindful that they may contain sensitive information. We recommend putting proper security measures in place to protect the collected metrics and logs. This includes implementing access controls, encryption (at rest and in transit), and monitoring mechanisms to safeguard sensitive information.
Alerting and notification
Logging and audit capabilities in Terraform and Terraform Enterprise are essential for capturing and monitoring all events within the infrastructure management solution. Terraform Open Source mainly focuses on individual resource communication and API responses, while Terraform Enterprise provides comprehensive logs, including interactions with the solution, security events, and various inter-communication logs.
Terraform Enterprise offers different event streams and two types of logs: application logs and audit logs. Application logs provide information about the services that make up Terraform Enterprise, while audit logs record changes made to any resource managed by the platform. To ensure effective monitoring, we recommend tracking a set of notable log events. These events fall into several categories; below are the categories, each with a non-comprehensive list of the events that may fall under it:
Security Driven Events:
- Requests to Authentication Tokens
- Requests to Configuration Versions
- Changes in policy set assignments
- Changes in team permissions and user assignments
Login Events:
- Login and Logout
- Failed login attempts
- Accessing, editing, and removing policies
Configuration Events:
- Project and workspace operations
- Variable set operations
Usage and Consumption Driven Events:
- Execution or access of Terraform
- Creation of a run in Terraform
- Starting a plan in Terraform
- Initiating an apply in Terraform
- Policy overridden
Performance Driven Events:
- Monitoring Terraform System/Health-check Endpoint
- Tracking the number of active workers/agents
- Monitoring host resource utilization
The intent of presenting various events is to demonstrate different categories and approaches to separating duties. Some events may appear in multiple categories, and their classification might vary based on your organization's perspective. For instance, logins might be more informative to a security team and are visible through AD/LDAP, while the platform team would manage performance-driven events. The main focus is to highlight the diverse types of events and offer a logical approach for handling them.