Consul observability
This topic provides an overview of the observability features available to HCP Consul users through the management plane service. A dashboard visualizes server and proxy metrics to provide insights into Consul’s operations when monitoring or debugging your network’s performance.
Introduction
Consul exposes a variety of agent telemetry metrics that can provide observability into network operations. This telemetry data captures information about system operations such as transaction timing, Raft leadership and elections, server health, and resource usage. It also provides access to metrics collected by Envoy proxies in the service mesh.
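For orientation, the following is a minimal sketch of reading agent telemetry directly, assuming a local Consul agent that serves its HTTP API on the default port 8500 and does not require an ACL token. The helper names `fetch_agent_metrics` and `gauges_by_prefix` are hypothetical, and the sample snapshot uses invented values in the general shape returned by the `/v1/agent/metrics` endpoint. Depending on agent configuration, real metric names may include a hostname segment.

```python
import json
from urllib.request import urlopen

# Assumption: a local agent listening on Consul's default HTTP port.
AGENT_METRICS_URL = "http://127.0.0.1:8500/v1/agent/metrics"


def fetch_agent_metrics(url=AGENT_METRICS_URL):
    """Fetch the agent's current telemetry snapshot as a dict."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def gauges_by_prefix(snapshot, prefix):
    """Return {name: value} for every gauge whose name starts with prefix."""
    return {
        gauge["Name"]: gauge["Value"]
        for gauge in snapshot.get("Gauges", [])
        if gauge["Name"].startswith(prefix)
    }


# Illustrative snapshot only; the values are invented for this example.
SAMPLE_SNAPSHOT = {
    "Timestamp": "2024-01-01 00:00:00 +0000 UTC",
    "Gauges": [
        {"Name": "consul.runtime.alloc_bytes", "Value": 4194304, "Labels": {}},
        {"Name": "consul.runtime.num_goroutines", "Value": 120, "Labels": {}},
        {"Name": "consul.autopilot.healthy", "Value": 1, "Labels": {}},
    ],
    "Counters": [],
    "Samples": [],
}

runtime_gauges = gauges_by_prefix(SAMPLE_SNAPSHOT, "consul.runtime.")
print(runtime_gauges)
```

Against a running agent, `gauges_by_prefix(fetch_agent_metrics(), "consul.runtime.")` would apply the same filter to live data.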
Monitoring these metrics enables you to troubleshoot network performance. For example, the Consul documentation details strategies for using server metrics related to Raft performance when diagnosing issues in write-heavy workloads at scale.
HCP Consul’s management plane service supports automated agent and proxy telemetry collection for clusters running Consul v1.16 or later. For clusters running older versions of Consul, you can deploy a telemetry collector to forward data to the management plane, which then visualizes the data in an observability dashboard containing a series of widgets. You can use these data widgets to gain insight into Consul’s operations in support of both monitoring and debugging efforts.
Workflow
The overall workflow to set up and use the observability dashboards for HCP Consul consists of the following steps:
- Use HCP Consul to deploy or link clusters. You can get insights into HashiCorp-managed clusters deployed using HCP, self-managed clusters created with HCP, or previously existing self-managed clusters that are linked to the HCP Consul management plane service. For more information and guidance, refer to create and manage clusters or link self-managed clusters overview.
- Deploy the telemetry collector. This collector is automatically deployed to HashiCorp-managed and self-managed clusters running Consul v1.16 or later, as well as any cluster with an existing link to the HCP management plane that is upgraded to Consul v1.16 or later. You can also deploy the Consul telemetry collector manually when the process does not begin automatically.
- Get visual insights with the observability dashboard. HCP provides visualizations of Consul server and Envoy proxy data in widgets that populate the dashboard. Use them to monitor important indicators of your service network’s health so that you can prevent and recover from outages.
Observability dashboard
To access the observability dashboard, go to the Consul overview. Click the name of a cluster, and then click Observability. A separate observability dashboard is available for each Consul cluster linked to your HCP organization as long as the telemetry collector is deployed.
The observability dashboard consists of two sections:
- Server metrics visualizes data about Raft communication and resource usage among the Consul servers.
- Envoy proxy metrics visualizes data about proxy operations, including the number of requests and the number of proxies servicing requests.
If your cluster is linked to HCP Consul but the Envoy proxy metrics are missing, you may need to manually deploy and configure the Consul telemetry collector.
Server metrics
The observability dashboard provides the following information about Consul servers:
Leader status
The leader status widget provides observability into your self-managed cluster’s Raft quorum. The widget provides the following information:
- Leader status indicates whether or not an elected leader has been seen in the last 30 seconds.
- Election indicates the number of leader elections over the past day.
- Latest heartbeat tracks how much time has elapsed without contact from the leader. It reports the max heartbeat, which is the longest duration between leader contacts in the past day, and the latest heartbeat, which is the amount of time elapsed since the leader was last contacted.
This visualization is produced by gathering the following Consul metrics:
Leader transactions
The leader transactions widget provides observability into your self-managed cluster’s write load and write latency. This widget provides the following information:
- Leader write load displays the maximum, median, and minimum number of leader transactions over any 10-second period during the previous day.
- Write latency displays the maximum, median, and minimum amount of time required for a leader to commit an entry to the Raft log over any 10-second period during the previous day.
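As an illustration of this kind of summary, the sketch below counts transactions in fixed 10-second windows and reports the maximum, median, and minimum count. It is an assumption about how such a figure could be derived, not a description of how HCP computes it. The function name `summarize_write_load` and the input format (one timestamp in seconds per transaction) are hypothetical.

```python
import math
from statistics import median

WINDOW_SECONDS = 10


def summarize_write_load(commit_times, start, end, window_seconds=WINDOW_SECONDS):
    """Return (max, median, min) transactions per window between start and end.

    commit_times holds one timestamp (in seconds) per leader transaction.
    Windows with no transactions count as zero.
    """
    window_count = math.ceil((end - start) / window_seconds)
    if window_count <= 0:
        raise ValueError("end must be later than start")
    counts = [0] * window_count
    for timestamp in commit_times:
        index = int((timestamp - start) // window_seconds)
        if 0 <= index < window_count:
            counts[index] += 1
    return max(counts), median(counts), min(counts)


# Six transactions spread over three 10-second windows.
load_summary = summarize_write_load([0, 1, 2, 11, 12, 25], start=0, end=30)
print(load_summary)
```

The same windowing applies to write latency if each window stores durations instead of counts.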
This visualization is produced by gathering the following Consul metrics:
Server resource utilization
The server resource utilization widget provides observability into the overall memory and CPU utilization, as well as the I/O wait time for the server agents in a cluster. This widget provides the following information:
- Memory utilization displays the maximum, median, and minimum rate of memory utilization over any 10-second period during the previous day. Memory utilization is measured as a percentage of the server’s total memory.
- CPU utilization displays the maximum, median, and minimum rate of CPU utilization over any 10-second period during the previous day. CPU utilization is measured as a percentage of the server’s total capacity.
- I/O wait time displays the maximum, median, and minimum rate of CPU utilization during an I/O wait state over any 10-second period during the previous day. I/O wait time is measured as a percentage of the server’s total CPU capacity.
This visualization is produced by gathering the following Consul metrics:
You can find more detailed information about your servers’ resource usage in the Expanded server resource utilization widget, which is also located on the server metrics dashboard.
Heartbeat
The heartbeat widget provides observability into Raft leadership and elections over the last 24 hours. Heartbeat measures the amount of time elapsed, in milliseconds, since the leader contacted the follower nodes to check whether it is still the cluster’s leader or if an election should be held. The color of the line indicates which server was the leader and a change in colors indicates that an election was held at that time.
When you hover over the line, the widget displays a heartbeat and timestamp. Because of the graph’s resolution, the number displayed as the leader’s heartbeat is the maximum heartbeat detected during the previous 10-minute interval. Similarly, a leadership change indicates that an election occurred at some point during the previous 10-minute period.
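The effect of this resolution can be sketched as a downsampling step. The example assumes raw samples arrive as `(unix_seconds, leader_name, heartbeat_ms)` tuples, which is a hypothetical format chosen for illustration. Each 10-minute bucket keeps only the largest heartbeat and the leaders observed in it, so a bucket that lists more than one leader contains a leadership change without showing exactly when it happened.

```python
BUCKET_SECONDS = 600  # 10 minutes


def downsample_heartbeats(samples, bucket_seconds=BUCKET_SECONDS):
    """Collapse raw samples into (bucket_start, max_heartbeat_ms, leaders) rows."""
    buckets = {}
    for timestamp, leader, heartbeat_ms in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        entry = buckets.setdefault(start, {"max_ms": heartbeat_ms, "leaders": []})
        entry["max_ms"] = max(entry["max_ms"], heartbeat_ms)
        if leader not in entry["leaders"]:
            entry["leaders"].append(leader)
    return [
        (start, buckets[start]["max_ms"], buckets[start]["leaders"])
        for start in sorted(buckets)
    ]


# Invented samples: the leader changes inside the second bucket.
raw_samples = [
    (0, "server-a", 12.0),
    (300, "server-a", 48.5),
    (650, "server-a", 15.0),
    (900, "server-b", 30.0),
]
heartbeat_rows = downsample_heartbeats(raw_samples)
print(heartbeat_rows)
```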
This visualization is produced by gathering the following Consul metrics:
Leader write loads
The leader write loads widget provides observability into the number of Raft transactions the leader writes and the length of time it takes the leader to write them. The upper graph visualizes the rate at which the leader writes entries to the Raft store over the previous 24 hours, measured as the number of transactions per second. The lower graph visualizes the write latency, or the amount of time it takes for the transaction to complete.
In both graphs, the color of the line indicates which server was the leader and a change in colors indicates that an election was held at that time.
This visualization is produced by gathering the following Consul metrics:
Follower replications
The follower replications widget provides observability into the number of logs that the leader replicated for followers over the previous 24-hour period. This count is measured in one-minute intervals. The color of the line indicates which server was the leader and a change in colors indicates that an election was held at that time.
This visualization is produced by gathering the following Consul metrics:
Expanded server resource utilization
The expanded server resource utilization widget provides additional observability into server resource utilization by visualizing a server’s maximum, median, and minimum percentage of total resource usage as a series of bar graphs for the previous 24 hour period.
Each bar indicates a resource’s utilization rate over the previous 10-minute interval. The top of the bar is the highest measured rate, and the bottom of the bar is the lowest measured rate. The colored dots indicate each server’s median utilization rate for that resource during the interval.
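One way to read a single bar is sketched below, under the assumption that the bar spans the highest and lowest readings across all servers in the interval while each dot is one server's median. The function name `summarize_interval` and the input mapping are hypothetical.

```python
from statistics import median


def summarize_interval(readings):
    """Summarize one interval of utilization readings.

    readings maps a server name to its utilization percentages for the interval.
    Returns the bar's top and bottom plus one median per server.
    """
    all_values = [value for values in readings.values() for value in values]
    return {
        "high": max(all_values),
        "low": min(all_values),
        "medians": {server: median(values) for server, values in readings.items()},
    }


# Invented CPU readings for two servers during one interval.
interval_summary = summarize_interval(
    {
        "server-a": [20.0, 35.0, 50.0],
        "server-b": [10.0, 15.0, 40.0],
    }
)
print(interval_summary)
```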
This visualization is produced by gathering the following Consul metrics:
Envoy proxy metrics
The observability dashboard provides the following information about Envoy proxies:
Service mesh requests
The service mesh requests widget provides observability into Envoy’s role in communication errors through the number of requests the proxies make per second and the status codes returned during the previous 24-hour period.
The upper graph visualizes the summed rate of HTTP requests per second during a 10-minute interval, with the color of each line indicating the category of status code. The 2xx range indicates success. The 4xx range indicates client errors. The 5xx range indicates server errors.
The lower graph visualizes the overall success rate of the Envoy proxies. The graph calculates the success rate from the ratio of 5xx status codes returned, which indicate an issue with the Envoy servers, to the total number of requests sent.
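A minimal sketch of this calculation follows, assuming the success rate is the share of requests that did not return a 5xx status code, so that 4xx responses do not count against the proxies. The function name `success_rate` and the input mapping of status classes to request counts are hypothetical.

```python
def success_rate(counts_by_class):
    """Return the fraction of requests that did not end in a 5xx response.

    counts_by_class maps a status class such as "2xx" to a request count.
    Returns None when there were no requests in the interval.
    """
    total = sum(counts_by_class.values())
    if total == 0:
        return None
    server_errors = counts_by_class.get("5xx", 0)
    return (total - server_errors) / total


# Invented counts for one interval: 20 of 1000 requests returned 5xx.
interval_rate = success_rate({"2xx": 940, "4xx": 40, "5xx": 20})
print(f"{interval_rate:.1%}")
```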
This visualization is produced by gathering metrics from Envoy. Refer to Statistics metrics in the Envoy documentation for more information.
Envoy connections
The Envoy connections widget provides observability into the number of open Envoy connections during the previous 24 hour period. The upper graph visualizes the number of new connections per second established by Envoy proxies. The lower graph visualizes the total number of open Envoy connections over the same period.
This visualization is produced by gathering metrics from Envoy. Refer to Statistics metrics in the Envoy documentation for more information.
Envoy servers state count
The Envoy servers state count widget provides observability into the overall health of your network’s Envoy proxies over the previous 24-hour period. The color of a line indicates an Envoy server’s state, and the line visualizes the number of Envoy servers that are in that state during a 10-minute interval. As described in the Envoy documentation, the possible Envoy states are:
State | Description |
---|---|
Live | The Envoy server is live and serving traffic. |
Draining | The Envoy server is draining listeners in response to a failed external health check. |
Pre-initializing | The Envoy server has not completed cluster manager initialization. |
Initializing | The Envoy server is running the cluster manager initialization callbacks. |
When you hover over a line, the widget displays exact counts for the number of Envoy servers in each state.
This visualization is produced by gathering metrics from Envoy. Refer to Server state in the Envoy documentation for more information.
Envoy proxies management status
The Envoy proxies management status widget provides observability into the number of Envoy proxies that are connecting and disconnecting from Consul over the previous 24 hour period. The graph visualizes the number of connections made between Envoy and Consul in a 10 minute interval as a green bar extending up the y-axis. The red bar extending down the y-axis visualizes the number of Envoy proxies that lost their connection to Consul during the 10 minute interval.
When you hover over a bar, the widget displays exact counts for the number of Envoy proxies that connected and disconnected.
This visualization is produced by gathering metrics from Envoy. Refer to Management server in the Envoy documentation for more information.