Observability pillars
When monitoring Consul, it helps to divide the work between its two primary components:
Monitoring the control plane
- This includes your Consul servers, which manage the cluster's overall operations. These servers can run on bare metal, virtual machines (VMs), or within Kubernetes environments. However, it is strongly recommended to deploy them on VMs, following HashiCorp's recommended deployment architecture.
Monitoring the data plane
- This involves observing the Consul clients and associated elements. Key components include Consul agents (clients), sidecar proxies, Consul dataplane (when operating within containerized environments such as Kubernetes), and gateways (mesh, API, and terminating).
For both the control plane and the data plane, it is essential to establish standards across the three core pillars of observability: metrics, logs, and traces. This ensures that monitoring is configured comprehensively and gives you the visibility needed to manage and troubleshoot your Consul environment.
Monitor the control plane
Metrics
Consul emits a large volume of telemetry, which can be overwhelming. To start, focus on the Four Golden Signals of monitoring:
Latency: The time it takes to service a request.
Traffic: The volume of requests a system handles at any given time.
Errors: The rate of failed requests, either explicitly (e.g., HTTP 500s), implicitly (e.g., HTTP 200 success response with incorrect content), or by policy (e.g., any request over a committed response time).
Saturation: The percentage of resources consumed, indicating how full your system is.
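To make the three kinds of errors described above concrete, here is a minimal Python sketch that classifies a single request and derives an error rate. The Request record, its field names, and the 0.5 second threshold are hypothetical and chosen only for illustration; they are not part of Consul.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical record of one serviced request."""
    status: int          # HTTP status code returned to the caller
    content_ok: bool     # whether the response body was what the caller expected
    duration_s: float    # time taken to service the request, in seconds


def classify_error(req: Request, max_duration_s: float = 0.5) -> Optional[str]:
    """Return the kind of failure for a request, or None if it succeeded."""
    if req.status >= 500:
        return "explicit"   # for example, an HTTP 500
    if not req.content_ok:
        return "implicit"   # for example, an HTTP 200 with incorrect content
    if req.duration_s > max_duration_s:
        return "policy"     # slower than the committed response time
    return None


def error_rate(requests: List[Request], max_duration_s: float = 0.5) -> float:
    """Fraction of requests that failed for any of the three reasons."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if classify_error(r, max_duration_s) is not None)
    return failed / len(requests)


# Illustrative data only.
sample = [
    Request(200, True, 0.12),
    Request(500, False, 0.05),
    Request(200, False, 0.08),
    Request(200, True, 0.90),
]
print(error_rate(sample))
```

Counting policy violations as errors means a service can show a rising error rate before it returns a single failed status code, which is usually the earlier warning.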
Key metrics to monitor
System metrics
Visibility into system resource usage helps understand server saturation. Key resources include:
- CPU
- Memory
- Disk
Refer to “Server host metrics” for the specific metrics for these resources.
Consul metrics
Server health: Please refer to the Consul documentation “Agent telemetry - server health” for details.
Enable Consul telemetry
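Configure each Consul agent (server) to expose and capture local agent metrics with Prometheus and DogStatsD:
telemetry = {
  # Keep metrics available for Prometheus scraping for one hour
  prometheus_retention_time = "1h"
  # Send metrics to a DogStatsD listener on the local host
  dogstatsd_addr = "127.0.0.1:8125"
  # Do not prepend the machine's hostname to runtime telemetry
  disable_hostname = true
}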
Note
You may need to restart the Consul agent for the configuration to take effect. Follow best practices and restart the leader node last.
Recommendations
Recommended server metrics to monitor:
- Transaction timing
- Leadership changes
- Certificate authority expiration
- Autopilot
- Memory usage
- Garbage collection
- Network activity - RPC count
- Raft thread saturation
- Raft replication capacity issues
- License expiration
- Bolt DB performance
For full details on the specific metrics, refer to “Agent telemetry - server health”.
Additionally, monitor the following metrics:
consul.rpc.*
- Specifically, consul.rpc.query and consul.rpc.queries_blocking are the two most relevant.
consul.grpc.server.*
- Specifically, consul.grpc.server.stream.count and consul.grpc.server.streams show the number of streams being processed by the server.
consul.xds.server.streams
- The number of xDS streams. In version 1.14 and higher, these streams are a large source of read load.
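As a rough illustration of how the metrics listed above could be picked out of an agent's output, the following Python sketch filters Prometheus-format text for the relevant prefixes. It assumes the agent's /v1/agent/metrics endpoint is queried with format=prometheus and that dots in metric names appear as underscores in that output. The function names and the sample values are hypothetical.

```python
import urllib.request
from typing import Dict, Tuple

# Assumption: in Prometheus output, dots in Consul metric names become underscores.
WATCHED_PREFIXES: Tuple[str, ...] = (
    "consul_rpc_",
    "consul_grpc_server_",
    "consul_xds_server_streams",
)


def fetch_metrics(base_url: str = "http://127.0.0.1:8500") -> str:
    """Fetch metrics as Prometheus text from an agent's HTTP API (not called here)."""
    url = base_url + "/v1/agent/metrics?format=prometheus"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse 'series value' lines, skipping comments and unparseable lines.

    Simplification: lines that carry a trailing timestamp are not handled.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


def select_watched(samples: Dict[str, float],
                   prefixes: Tuple[str, ...] = WATCHED_PREFIXES) -> Dict[str, float]:
    """Keep only the series whose names start with one of the watched prefixes."""
    return {name: v for name, v in samples.items() if name.startswith(prefixes)}


# Illustrative input only; the values are made up.
SAMPLE = """\
# TYPE consul_rpc_query counter
consul_rpc_query 42
consul_grpc_server_streams 3
consul_xds_server_streams 7
go_goroutines 120
"""

watched = select_watched(parse_prometheus_text(SAMPLE))
for name in sorted(watched):
    print(name, watched[name])
```

In a real deployment, fetch_metrics would supply the text, and when ACLs are enabled the request would also need a token with permission to read agent metrics.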
Logs
Consul generates two types of logs: Consul application logs and audit logs. Each serves a distinct purpose for system monitoring and compliance. These logs contain valuable information about system interactions and current operational status. Integrating them into your Application Performance Monitoring (APM) platform lets you correlate metrics with log events, which makes troubleshooting more efficient.
Consul logs
By default, Consul generates application logs and writes them to stdout. Consul logs capture information about the internal state of Consul, such as errors and messages. This information can be valuable in troubleshooting issues. The logs can be redirected to syslog or a file.
To redirect Consul logs to a file, add the following to your agent configuration:
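# Write logs to this file
log_file = "/var/log/consul/consul.log"
log_level = "INFO"
# Rotate when the file reaches 50 MiB or is 24 hours old
log_rotate_bytes = 52428800
log_rotate_duration = "24h"
# Keep at most 7 archived log files
log_rotate_max_files = 7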
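When sizing the log volume, it can help to estimate how much disk these rotation settings allow. The Python sketch below computes a rough upper bound, assuming that log_rotate_max_files counts only archived files (so the active file is extra) and that every file grows to the size limit before rotating. The function name is hypothetical.

```python
def log_disk_budget_bytes(rotate_bytes: int, rotate_max_files: int) -> int:
    """Rough upper bound: archived files plus the one active log file."""
    if rotate_bytes <= 0 or rotate_max_files <= 0:
        # Zero or negative values have special meanings that this sketch ignores.
        raise ValueError("this sketch only covers positive rotation settings")
    return rotate_bytes * (rotate_max_files + 1)


# Values taken from the configuration above.
budget = log_disk_budget_bytes(52428800, 7)
print(budget // (1024 * 1024), "MiB")
```

Because the 24 hour rotation can close files before they reach the size limit, actual usage is often lower than this bound.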
Recommendations
- Configure Consul to write logs to a file.
- Configure log rotation by setting appropriate values for log_rotate_bytes, log_rotate_duration, and log_rotate_max_files.
- Limit who has access to the logs by setting appropriate file permissions.
- Configure your APM tool to ingest the logs from all Consul agents (e.g. "Datadog - enable log collection").
Audit logs
Consul can generate audit logs if enabled. This is an Enterprise feature, so you will need a valid license file. Audit logs capture the authenticated events (both attempted and committed) that Consul processes through its HTTP API. Each event is written in JSON format for easy export and contains a timestamp, the operation performed, and the user who initiated the action.
Audit logging enables security and compliance teams within an organization to get greater insight into Consul access and usage patterns.
To enable audit logging, add the following to your agent configuration:
audit {
  enabled = true
  sink "My sink" {
    type = "file"
    format = "json"
    path = "data/audit/audit.json"
    delivery_guarantee = "best-effort"
    rotate_duration = "24h"
    rotate_max_files = 15
    rotate_bytes = 25165824
  }
}
Note
At least one of rotate_duration or rotate_bytes must be configured to enable audit logging. To learn more about configuring and interpreting audit logs, please refer to the “Capture Consul events with audit logging” tutorial.
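Once audit events are written as JSON, they are straightforward to summarize. The following Python sketch counts events by who performed which operation on which endpoint. The field names it reads (payload, auth.accessor_id, request.operation, request.endpoint) are assumptions about the event layout and should be checked against your own audit log; the sample event is made up.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_audit_lines(lines: Iterable[str]) -> Counter:
    """Count audit events by (accessor ID, operation, endpoint)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        if not isinstance(event, dict):
            continue
        # Assumed layout: the details of each event sit under "payload".
        payload = event.get("payload") or {}
        auth = payload.get("auth") or {}
        request = payload.get("request") or {}
        key = (
            auth.get("accessor_id", "unknown"),
            request.get("operation", "unknown"),
            request.get("endpoint", "unknown"),
        )
        counts[key] += 1
    return counts


# Illustrative event only; real events carry more fields.
sample_event = {
    "created_time": "2024-01-01T00:00:00Z",
    "event_type": "audit",
    "payload": {
        "timestamp": "2024-01-01T00:00:00Z",
        "auth": {"accessor_id": "example-accessor"},
        "request": {"operation": "GET", "endpoint": "/v1/kv/foo"},
    },
}
sample_lines = [json.dumps(sample_event), json.dumps(sample_event), "not json"]

counts = summarize_audit_lines(sample_lines)
for key, n in counts.most_common():
    print(key, n)
```

Because both attempted and committed events are logged, one request can contribute more than one line, so treat these counts as event counts rather than request counts.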
Recommendations
- Configure log rotation by setting appropriate values for rotate_duration, rotate_max_files, and rotate_bytes.
- Limit who has access to audit logs.
- Configure your APM tool to ingest audit logs from all Consul agents (e.g. "Datadog - enable log collection").
Traces
Traces provide a view of the traffic flow from one service to another as users and applications interact with the system. In the context of Consul, tracing involves capturing layer 7 metrics from the sidecar proxies in the data plane. These metrics offer visibility into the flow of traffic, errors, and response times, helping to identify issues and bottlenecks.
Since this primarily relates to the data plane, it is covered in detail in the “Monitor the data plane” section below.
Tip
This is applicable only if Consul’s service mesh feature is enabled.
Monitor the data plane
Metrics
Key metrics to monitor
System metrics
Similar to the system metrics from Consul servers, it is important to monitor the system resources of your data plane nodes.
- CPU
- Memory
- Disk
Refer to server host metrics for the specific metrics for the above resources.
Consul metrics
The data plane comprises Consul agents (clients), sidecar proxies, and gateways (mesh, API, and terminating).
- Consul agent metrics: Like Consul servers, agents running as clients emit telemetry data. Refer to the metrics reference.
Gateways (mesh, API, terminating)
Gateways also emit telemetry related to the network traffic passing through them. Metrics include ingress/egress request details, errors, and performance information.
Consul dataplane
When running in containerized environments, Consul supports a lightweight agent called Consul dataplane, which also emits telemetry. Refer to "Consul dataplane telemetry" for more information.
Services and health checks
Additionally, monitor the services running in the Consul datacenter. Useful sources include:
- Catalog: Use the consul catalog command to query all registered services in a Consul datacenter.
- Service endpoints: Use the /agent/service/:service_id API endpoint to query individual services.
- Health checks: Use the local service health check to get the aggregated state of services on the local agent by name.
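The aggregated health state can also be read programmatically. The Python sketch below assumes the local service health check is the agent's /v1/agent/health/service/name/:service_name endpoint, which reports the aggregated state through the HTTP status code. The function names and the base URL are hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request

# HTTP status codes mapped to the aggregated state they represent.
STATE_BY_STATUS = {
    200: "passing",
    429: "warning",
    503: "critical",
    404: "not found",
}


def aggregated_state(status_code: int) -> str:
    """Translate the endpoint's HTTP status code into a health state."""
    return STATE_BY_STATUS.get(status_code, "unknown")


def service_state(name: str, base_url: str = "http://127.0.0.1:8500") -> str:
    """Query the local agent for a service's aggregated state (not called here)."""
    url = (base_url + "/v1/agent/health/service/name/"
           + urllib.parse.quote(name, safe=""))
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return aggregated_state(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx responses are raised as errors but still carry the state.
        return aggregated_state(err.code)


for code in (200, 429, 503, 404):
    print(code, aggregated_state(code))
```

A monitoring job could call service_state("web") on each node and alert on anything other than "passing".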
Network
In a microservices environment, the real-time traffic and health of the network are critical, so monitoring them is valuable. APM tools such as Datadog provide network performance monitoring (NPM), which you should take advantage of.
- Datadog: Refer to "Datadog NPM now supports Consul networking" for more details.
Enable Consul telemetry
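Configure each Consul agent (client) to expose and capture local agent metrics with Prometheus and DogStatsD:
telemetry = {
  # Keep metrics available for Prometheus scraping for one hour
  prometheus_retention_time = "1h"
  # Send metrics to a DogStatsD listener on the local host
  dogstatsd_addr = "127.0.0.1:8125"
  # Do not prepend the machine's hostname to runtime telemetry
  disable_hostname = true
}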
Note
You may need to restart the Consul agent for the configuration to take effect.
Proxy monitoring
Consul proxy metrics provide detailed health and performance information about your service mesh applications. This includes upstream/downstream network traffic metrics, error rates, and additional performance information that you can use to understand your distributed applications.
Once proxy metrics are enabled in Consul, you do not need to configure or instrument your applications in the service mesh to leverage proxy metrics.
Enabling Envoy sidecar proxy metrics
Envoy sidecar proxy metrics can be enabled in two ways:
Global default, which applies to all Envoy proxies in the Consul datacenter.
# File: proxy-defaults.hcl
Kind = "proxy-defaults"
Name = "global"
Config = {
  envoy_prometheus_bind_addr = "127.0.0.1:9102"
  envoy_dogstatsd_url = "udp://127.0.0.1:8125"
}
Refer to “Proxy defaults configuration reference” documentation for more details.
Service (sidecar) specific, which applies to the specifically configured proxy.
# web.hcl - web service and sidecar proxy registration
service {
  name = "web"
  id = "web-1"
  port = 8080
  connect {
    sidecar_service {
      proxy = {
        config = [
          {
            envoy_prometheus_bind_addr = "127.0.0.1:9102"
            envoy_dogstatsd_url = "udp://127.0.0.1:8125"
          }
        ]
      }
    }
  }
}
For the configured changes to take effect, restart Envoy or regenerate its bootstrap configuration using the consul connect envoy command.
Recommendations
For full end-to-end visibility, HashiCorp recommends collecting telemetry from all components of your service mesh: client agents, sidecar proxies (Envoy), gateways (Envoy), and Consul dataplane.
Logs
Like the control plane, the data plane components generate both application logs and access logs. The primary components are the Consul agents (clients).
Consul logs
The Consul agent (client) running in your Consul datacenter writes logs to stdout by default. Agents can be configured to redirect logs to syslog or a file.
To redirect Consul logs to a file, add the following to all of your agent (client) configurations:
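# Write logs to this file
log_file = "/var/log/consul/consul.log"
log_level = "INFO"
# Rotate when the file reaches 50 MiB or is 24 hours old
log_rotate_bytes = 52428800
log_rotate_duration = "24h"
# Keep at most 7 archived log files
log_rotate_max_files = 7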
Recommendations
- Enable logs from all components of your data plane.
- Configure log rotation by setting appropriate values for log_rotate_bytes, log_rotate_duration, and log_rotate_max_files.
- Limit who has access to the logs by setting appropriate file permissions.
- Configure your APM tool to ingest the logs from all Consul agents (e.g. "Datadog - enable log collection").
Access logs
Envoy proxy can emit access logs, which record application connections and requests that pass through proxies in a service mesh, including sidecar proxies and gateways.
This can help in troubleshooting issues, threat detection, and audit compliance.
Sidecar proxies can be configured to emit access logs, which record service-to-service connections and requests in a service mesh.
Access logging is enabled globally as follows:
Kind = "proxy-defaults"
Name = "global"
AccessLogs {
  Enabled = true
}
Gateway logs
Like the sidecar proxy (Envoy), gateways (mesh, API, and terminating) can emit access logs to record network traffic that passes through them in a service mesh.
Recommendations
- Enable access logs in JSON format so APM tools can ingest them.
- Envoy does not rotate the access logs it generates. Use an appropriate log rotation tool to rotate them periodically.
Audit logs
Consul agents that run as clients in your data plane also generate audit logs. Like your Consul servers, configure all your Consul client agents to enable audit logging.
To enable audit logging, add the following to all your agent (client) configurations:
audit {
  enabled = true
  sink "My sink" {
    type = "file"
    format = "json"
    path = "data/audit/audit.json"
    delivery_guarantee = "best-effort"
    rotate_duration = "24h"
    rotate_max_files = 15
    rotate_bytes = 25165824
  }
}
Recommendations
- Configure log rotation by setting appropriate values for rotate_duration, rotate_max_files, and rotate_bytes.
- Limit who has access to audit logs.
- Configure your APM tool to ingest audit logs from all Consul agents.
Traces
Traces provide a comprehensive and continuous view of an application’s network traffic by following a request as it moves from one service to another. Think of a trace as a single user’s journey through an entire application stack. Tracing is primarily a tool for optimization rather than reaction: by tracing through a stack, developers can pinpoint errors or performance bottlenecks.
When issues arise, tracing allows you to understand how the user encountered the problem by examining:
- Which function was involved
- The function’s duration
- Parameters passed
- How deep into the function the user progressed
The two parts of traces
Trace: Represents the entire journey of a request or action as it moves through various nodes of a distributed system, especially containerized applications or microservices architectures.
Span: An operation or unit of work taking place on a service, for example a web server responding to an HTTP request or a single invocation of a function. A span has a start time and an end time. A series of these tagged time intervals forms a single trace in distributed tracing.
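The relationship between the two can be sketched as a small data model. The Python example below is illustrative only: the Span and Trace classes, their field names, and the timings are hypothetical and do not correspond to any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One operation on one service, with a start time, an end time, and tags."""
    service: str
    operation: str
    start_ms: int
    end_ms: int
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """The full journey of one request: a collection of spans sharing a trace ID."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        """Time from the earliest span start to the latest span end."""
        if not self.spans:
            return 0
        return max(s.end_ms for s in self.spans) - min(s.start_ms for s in self.spans)


# Made-up timings for a request that passes through three services.
trace = Trace(
    trace_id="trace-123",
    spans=[
        Span("web", "GET /checkout", 0, 120, {"http.status": "200"}),
        Span("api", "create_order", 10, 100),
        Span("db", "INSERT orders", 30, 90),
    ],
)
print(trace.duration_ms)
for span in trace.spans:
    print(span.service, span.operation, span.duration_ms)
```

Comparing each span's duration with the trace's total is what lets a tracing tool show which service a slow request spent its time in.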
Note
Before you can start tracing, your applications must be instrumented with an appropriate tracing library.
Instrument your application with Datadog
Consul does not automatically implement tracing for your applications. They must be instrumented to support tracing.
If Datadog is your APM solution, you can use Datadog’s SDK and tracing library (Tracer) to instrument your applications.
For more implementation details and information on Datadog SDK’s supported environments and programming languages, please refer to Datadog’s "Application Instrumentation" documentation.
Consul trace configuration
Tracing is configured on your sidecar proxy (Envoy) so that it emits spans. The following two keys need to be configured:
envoy_tracing_json
envoy_extra_static_clusters_json
For details on trace configuration, please refer to the following articles:
Tip
Consul does not support all proxies and protocols. Refer to Considerations for details.
Configure proxy defaults so the settings apply automatically to all proxies.
Note
All proxies must be restarted for the configuration changes to take effect.
Recommendations
Correlate logs to traces
Inject span IDs into your log messages, so your APM tool can correlate them.
- Datadog: For implementation details please refer to Datadog’s “Correlate Logs and Traces” document.
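As a generic illustration of injecting span IDs into log messages, the Python sketch below uses the standard logging module to stamp every record with the active trace and span IDs. The context variables, the field names trace_id and span_id, and the ID values are hypothetical; a real tracing library would supply the IDs and may expect different field names.

```python
import contextvars
import io
import logging

# Hypothetical holders for the active IDs; a tracing library would set these.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")
current_span_id = contextvars.ContextVar("current_span_id", default="-")


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output lines carry trace_id and span_id fields."""
    logger = logging.getLogger("web")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Simulate being inside a traced request.
current_trace_id.set("trace-123")
current_span_id.set("span-456")
log.info("checkout failed")

print(buffer.getvalue(), end="")
```

With the IDs present on every line, an APM tool can link a log entry directly to the span that was active when it was written.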
While instrumenting your application, thoroughly test and verify the tracing data:
- Use Consul’s fault injection feature (or a third-party tool) to inject errors and timeouts.
- Stress test your complete environment and verify that your tracing implementation identifies the bottlenecks.
Data security
As with telemetry and logging data, tracing data can expose sensitive data, such as personally identifiable information (PII). Take measures to obfuscate such data.
Datadog: You can either use Sensitive Data Scanner or configure Datadog’s tracing library to redact such data before it is sent to Datadog.
Refer to Datadog’s Data Security documentation for more information.