Monitor Consul Datacenter Health
Now that you have set up your first datacenter, it is an ideal time to make sure your deployment is healthy and to establish a baseline. This tutorial will cover several types of metrics in two sections: Consul health and server health.
Consul health:
- Transaction timing
- Leadership changes
- Autopilot
- Garbage collection
Server health:
- File descriptors
- CPU usage
- Network activity
- Disk activity
- Memory usage
For each type of metric, we will review its importance and help you identify whether it indicates a healthy or unhealthy state.
First, we need to understand the three methods for collecting metrics. We will briefly cover using SIGUSR1, the HTTP API, and telemetry.
Before starting this tutorial, we recommend configuring ACLs.
How to collect metrics
There are three methods for collecting metrics. The first, and simplest, is to use `SIGUSR1` for a one-time dump of current telemetry values. The second method is to get a similar one-time dump using the HTTP API. The third method, and the one most commonly used for long-term monitoring, is to enable telemetry in the Consul configuration file.
SIGUSR1 for local use
To get a one-time dump of current metric values, we can send the `SIGUSR1` signal to the Consul process. This will send the output to the system logs (e.g. `/var/log/messages` or `journald`). If you are monitoring the Consul process in the terminal via `consul monitor`, you will get the metrics in the output.
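For example, assuming the agent runs as a single process named `consul` on the host, the dump can be triggered like this (use the agent's actual PID if `pgrep` is unavailable):

```shell-session
# Send SIGUSR1 to the Consul agent; the metrics dump appears in the agent's
# log output (system logs or the `consul monitor` stream).
$ kill -USR1 $(pgrep -x consul)
```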
Although this is the easiest way to get a quick read of a single Consul agent's health, it is much more useful to look at how the values change over time.
API GET request
Next, let's use the HTTP API to quickly collect metrics with curl.
In production you will want to set up credentials with an ACL token and enable TLS for secure communications. Once ACLs have been configured, you can pass a token with the request.
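A minimal example follows, assuming the agent's HTTP API is listening on its default 127.0.0.1:8500 address; the token value is a placeholder, and the header can be omitted if ACLs are not enabled:

```shell-session
# Request the current metrics from the local agent.
$ curl \
    --header "X-Consul-Token: <your_acl_token>" \
    http://127.0.0.1:8500/v1/agent/metrics
```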
In addition to being a good way to quickly collect metrics, this request can be added to a script or used with monitoring agents that support HTTP scraping, such as Prometheus, to visualize the data.
Enable telemetry
Finally, Consul can be configured to send telemetry data to a remote monitoring system. This allows you to monitor the health of agents over time, spot trends, and plan for future needs. You will need a monitoring agent and console for this.
Consul supports the following telemetry agents:
- Circonus
- DataDog (via `dogstatsd`)
- StatsD (via `statsd`, `statsite`, `telegraf`, etc.)
If you are using StatsD, you will also need a compatible database and server, such as Grafana, Chronograf, or Prometheus.
Telemetry can be enabled in the agent configuration file, for example `server.hcl`. It can be enabled on any agent, client or server. Normally, you would at least enable it on all the servers (both voting and read replicas) to monitor the health of the entire datacenter.
An example snippet of `server.hcl` to send telemetry to DataDog looks like this:
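This is a minimal sketch; the `dogstatsd_addr` value assumes the DataDog agent's DogStatsD listener is running locally on its default port 8125.

```hcl
telemetry {
  # Address of the local DogStatsD listener (assumed default port).
  dogstatsd_addr = "localhost:8125"

  # Do not prepend the machine's hostname to metric names, so metrics
  # aggregate cleanly across servers.
  disable_hostname = true
}
```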
When enabling telemetry on an existing datacenter, the Consul process will need to be reloaded. This can be done with `consul reload` or `kill -HUP <process_id>`. It is recommended to reload the servers one at a time, starting with the non-leaders.
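For example, on each server in turn (non-leaders first), you could run one of the following; the process ID is a placeholder:

```shell-session
# Reload the agent configuration without restarting the process.
$ consul reload

# Or send SIGHUP directly to the Consul process.
$ kill -HUP <process_id>
```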
Consul health
The Consul health metrics reveal information about the Consul datacenter. They include performance metrics for the key value store, transactions, Raft, leadership changes, autopilot tuning, and garbage collection.
Transaction timing
The following metrics indicate how long it takes to complete write operations in various parts of Consul, including the KV store and Raft, as reported by the Consul servers. Generally, these values should remain reasonably consistent and no more than a few milliseconds each.
Metric Name | Description |
---|---|
consul.kvs.apply | Measures the time it takes to complete an update to the KV store. |
consul.txn.apply | Measures the time spent applying a transaction operation. |
consul.raft.apply | Counts the number of Raft transactions occurring over the interval. |
consul.raft.commitTime | Measures the time it takes to commit a new entry to the Raft log on the leader. |
Sudden changes in any of the timing values could be due to unexpected load on the Consul servers or due to problems on the hosts themselves. Specifically, if any of these metrics deviate more than 50% from the baseline over the previous hour, this indicates an issue. Below are examples of healthy transaction metrics.
Leadership changes
In a healthy environment, your Consul datacenter should have a stable leader. There shouldn't be any leadership changes unless you manually change leadership (by taking a server out of the datacenter, for example). If there are unexpected elections or leadership changes, you should investigate possible network issues between the Consul servers. Another possible cause could be that the Consul servers are unable to keep up with the transaction load.
Note: These metrics are reported by the follower nodes, not by the leader.
Metric Name | Description |
---|---|
consul.raft.leader.lastContact | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
consul.raft.state.candidate | Increments when a Consul server starts an election process. |
consul.raft.state.leader | Increments when a Consul server becomes a leader. |
consul.server.isLeader | Tracks whether a server is the leader (1) or not (0). |
If the `candidate` or `leader` metrics are greater than 0 or the `lastContact` metric is greater than 200ms, you should look into one of the possible causes described above. Below are examples of healthy leadership metrics.
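When investigating unexpected leadership changes, it can also help to confirm which server currently holds leadership. One way to do this is with the operator CLI; output will vary by datacenter:

```shell-session
# List the Raft peer set; the State column shows which server is the leader.
$ consul operator raft list-peers
```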
Autopilot
There are two autopilot metrics that provide insight into the health of the Consul datacenter.
Metric Name | Description |
---|---|
consul.autopilot.healthy | Tracks the overall health of the local server cluster. If all servers are considered healthy by autopilot, this will be set to 1. If any are unhealthy, this will be 0. |
consul.autopilot.failure_tolerance | Tracks the number of voting servers that the local server cluster can lose while continuing to function. |
The `consul.autopilot.healthy` metric is a boolean. A value of 1 indicates a healthy datacenter and 0 indicates an unhealthy state. An alert should be set up for a returned value of 0. Below is an example of a healthy datacenter according to the autopilot metric.
The `consul.autopilot.failure_tolerance` metric provides insight into how many Consul servers can be lost before the datacenter enters an unhealthy state. A warning alert should be set up for a returned value of 0, so that proactive action can be taken to safeguard the datacenter from becoming unhealthy.
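If you want to spot-check these values on demand rather than through your monitoring pipeline, the autopilot health endpoint reports the same information. A minimal example, assuming the default local API address (pass an ACL token with `operator:read` if ACLs are enabled):

```shell-session
# The response includes an overall "Healthy" boolean and the current
# "FailureTolerance" count for the server cluster.
$ curl http://127.0.0.1:8500/v1/operator/autopilot/health
```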
Rate limiting
Rate limiting is a preventive feature on Consul servers that controls how many requests can be served to Consul clients. Hitting rate limits is an early indication that extra load is being processed by the cluster and that further action is needed, either to change the baseline for acceptable levels of workload or to investigate the reason for the unexpected extra load.
Metric Name | Description |
---|---|
consul.client.rpc | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server. |
consul.client.rpc.exceeded | Increments whenever an RPC request made by a Consul agent in client mode to a Consul server is rate limited by that agent's limits configuration. |
consul.client.rpc.failed | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server and fails. |
Signs that an agent is being rate limited or failing to make RPC requests to a Consul server include sudden large changes to the `consul.client.rpc` metric (for example, greater than 50% deviation from baseline), as well as non-zero values for `consul.client.rpc.exceeded` and `consul.client.rpc.failed`.
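For reference, the threshold behind `consul.client.rpc.exceeded` comes from the client agent's `limits` configuration. A sketch of such a stanza is shown below; the values are illustrative, not recommendations:

```hcl
limits {
  # Maximum RPC requests per second this client agent is allowed to make to
  # Consul servers; requests over this rate increment consul.client.rpc.exceeded.
  rpc_rate = 100

  # Size of the token bucket used to allow short bursts above the sustained rate.
  rpc_max_burst = 1000
}
```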
Garbage collection
Garbage collection (GC) pauses are a "stop-the-world" event: all runtime threads are blocked until GC completes. In a healthy environment these pauses should only last a few nanoseconds. If memory usage is high, the Go runtime may start the GC process so frequently that it will slow down Consul. You might observe more frequent leader elections or longer write times.
Metric Name | Description |
---|---|
consul.runtime.total_gc_pause_ns | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |
If GC pause time is more than 2 seconds per minute, you should start investigating the cause. If it exceeds 5 seconds per minute, you should consider the datacenter to be in a critical state and make sure failure recovery procedures are up to date. Below is an example of a healthy GC pause metric.
Note: `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates, such as GC/minute, you will need to apply a function such as non_negative_difference.
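If you just want a quick manual check without a monitoring database, you can approximate the per-minute rate yourself by sampling the counter twice. This sketch assumes curl, jq, and bc are installed and the agent's HTTP API is on the default 127.0.0.1:8500 address:

```shell
# Sample the cumulative GC pause counter twice, 60 seconds apart.
first=$(curl -s http://127.0.0.1:8500/v1/agent/metrics | \
  jq '.Gauges[] | select(.Name == "consul.runtime.total_gc_pause_ns") | .Value')
sleep 60
second=$(curl -s http://127.0.0.1:8500/v1/agent/metrics | \
  jq '.Gauges[] | select(.Name == "consul.runtime.total_gc_pause_ns") | .Value')

# Convert the nanosecond difference to seconds of GC pause over the last minute.
echo "scale=3; ($second - $first) / 1000000000" | bc
```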
License expiration
Consul Enterprise requires a license. When the license expires, some Consul Enterprise features will cease to work. For example, it will no longer be possible to create or modify resources in non-default namespaces or to manage namespace definitions themselves. However, reads of namespaced resources will still work.
Metric Name | Description |
---|---|
consul.system.licenseExpiration | Number of hours until the Consul Enterprise license will expire. |
The `consul.system.licenseExpiration` metric should be monitored to ensure the license is renewed before it expires and functionality degrades.
Server health
The server metrics provide information about the health of your datacenter including file handles, CPU usage, network activity, disk activity, and memory usage.
File descriptors
The majority of Consul operations require a file descriptor handle, including receiving a connection from another host, sending data between servers, and writing snapshots to disk. If Consul runs out of handles, it will stop accepting connections.
Metric Name | Description |
---|---|
linux_sysctl_fs.file-nr | Number of file handles being used across all processes on the host. |
linux_sysctl_fs.file-max | Total number of available file handles. |
By default, process and kernel limits are conservative, so you may want to increase the limits beyond the defaults. If the `linux_sysctl_fs.file-nr` value exceeds 80% of `linux_sysctl_fs.file-max`, the file handle limits should be increased. Below is an example of a file handle metric.
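To spot-check these values directly on a Linux host, outside of a monitoring agent such as Telegraf, you can read the underlying kernel counters; for example:

```shell-session
# The first field of file-nr is the number of allocated file handles;
# file-max is the system-wide maximum.
$ cat /proc/sys/fs/file-nr
$ cat /proc/sys/fs/file-max

# The same values are available through sysctl.
$ sysctl fs.file-nr fs.file-max
```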
CPU usage
Consul should not be demanding of CPU time on either servers or clients. A spike in CPU usage could indicate too many operations taking place at once.
Note: CPU metrics are only available via dogstatsd.
Metric Name | Description |
---|---|
cpu.user_cpu | Percentage of CPU being used by user processes (such as Vault or Consul). |
cpu.iowait_cpu | Percentage of CPU time spent waiting for I/O tasks to complete. |
If `cpu.iowait_cpu` is greater than 10%, it should be considered critical, as Consul is waiting for data to be written to disk. This could be a sign that Raft is writing snapshots to disk too often. Below is an example of a healthy CPU metric.
Network activity
Network activity should be consistent. A sudden spike in network traffic to Consul might be the result of a misconfigured client, such as Vault, that is causing too many requests.
Most agents will report separate metrics for each network interface, so be sure you are monitoring the right one.
Metric Name | Description |
---|---|
net.bytes_recv | Bytes received on each network interface. |
net.bytes_sent | Bytes transmitted on each network interface. |
Sudden increases to the `net` metrics, greater than 50% deviation from baseline, indicate too many requests that are not being handled. Below is an example of a network activity metric.
Note: The `net` metrics are counters, so in order to calculate rates, such as bytes/second, you will need to apply a function such as non_negative_difference.
Disk activity
Normally, there is low disk activity, because Consul keeps everything in memory. If the Consul host is writing a large amount of data to disk, it could mean that Consul is under heavy write load and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance.
Metric Name | Description |
---|---|
diskio.read_bytes | Bytes read from each block device. |
diskio.write_bytes | Bytes written to each block device. |
diskio.read_time | Time spent reading from disk, in cumulative milliseconds. |
diskio.write_time | Time spent writing to disk, in cumulative milliseconds. |
Sudden, large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline) indicate Consul has too much disk I/O. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Below are examples of disk activity metrics.
Note: The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as non_negative_difference.
Memory usage
As noted previously, Consul keeps all of its data (the KV store, the catalog, etc.) in memory. If Consul consumes all available memory, it will crash. You should monitor total available RAM to make sure some RAM is available for other system processes, and swap usage should remain at 0% for best performance.
Note: Memory metrics are only available via dogstatsd.
Metric Name | Description |
---|---|
consul.runtime.alloc_bytes | Measures the number of bytes allocated by the Consul process. |
consul.runtime.sys_bytes | The total number of bytes of memory obtained from the OS. |
mem.total | Total amount of physical memory (RAM) available on the server. |
mem.used_percent | Percentage of physical memory in use. |
swap.used_percent | Percentage of swap space in use. |
Consul servers are running low on memory if `consul.runtime.sys_bytes` exceeds 90% of `mem.total`, `mem.used_percent` is over 90%, or `swap.used_percent` is greater than 0. You should increase the memory available to Consul if any of these three conditions are met. Below are examples of memory usage metrics.
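For a quick manual check of the first condition, you can compare the agent's reported memory footprint with total RAM. This sketch assumes curl, jq, free, and bc are available on a Linux host and the agent's HTTP API is on the default 127.0.0.1:8500 address:

```shell
# Fetch Consul's memory obtained from the OS (consul.runtime.sys_bytes).
sys_bytes=$(curl -s http://127.0.0.1:8500/v1/agent/metrics | \
  jq '.Gauges[] | select(.Name == "consul.runtime.sys_bytes") | .Value')

# Total physical memory in bytes.
total_bytes=$(free -b | awk '/^Mem:/ {print $2}')

# Print the percentage of physical RAM held by the Consul process;
# values approaching 90% warrant adding memory.
echo "scale=1; $sys_bytes * 100 / $total_bytes" | bc
```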
Raft Protocol Health
Only Consul server nodes participate in Raft and are part of the peer set. All client nodes forward requests to servers. The following metrics are reported on server nodes only.
Metric Name | Description |
---|---|
consul.raft.thread.main.saturation | An approximate measurement of the proportion of time the main Raft goroutine is busy and unavailable to accept new work. |
consul.raft.thread.fsm.saturation | An approximate measurement of the proportion of time the Raft FSM goroutine is busy and unavailable to accept new work. |
consul.raft.fsm.lastRestoreDuration | Measures the time taken to restore the FSM from a snapshot on an agent restart or from the leader calling installSnapshot. |
consul.raft.rpc.installSnapshot | Measures the time taken to process the installSnapshot RPC call. This metric should only be seen on agents which are currently in the follower state. |
consul.raft.leader.oldestLogAge | The number of milliseconds since the oldest entry in the leader's Raft log store was written. In normal usage this gauge value will grow linearly over time until a snapshot completes on the leader and the Raft log is truncated. |
A `consul.raft.thread` saturation value of more than 50% can lead to elevated latency in the rest of the system and cause cluster instability.
Snapshot restore happens when a Consul agent is first started, or when specifically instructed to do so via RPC. For Consul servers, `consul.raft.fsm.lastRestoreDuration` tracks the duration of the operation. From the leader's perspective, when it installs a new snapshot on a follower, `consul.raft.rpc.installSnapshot` tracks the timing information. Both of these metrics should be consistent, without sudden large changes.
A plot of `consul.raft.leader.oldestLogAge` should look like a saw-tooth wave, increasing linearly with time until the leader takes a snapshot and then dropping as the oldest Raft logs are truncated. The lowest point on that line should remain comfortably higher (for example, 2x or more) than the time it takes to restore a snapshot.
Bolt DB Performance
Consul turns each write operation into a single Raft log to be committed. Raft processes these Raft logs and stores them within a Bolt DB instance in batches. Each call to store Raft logs within Bolt DB is measured to record how long it took as well as how many Raft logs were contained in the batch. Monitoring Bolt DB performance can give insights about platform usage and system performance.
Metric Name | Description |
---|---|
consul.raft.boltdb.storeLogs | Measures the amount of time spent writing Raft logs to the db. |
consul.raft.boltdb.freelistBytes | Represents the number of bytes necessary to encode the freelist metadata. |
consul.raft.boltdb.logsPerBatch | Measures the number of Raft logs being written per batch to the db. |
consul.raft.boltdb.writeCapacity | Theoretical write capacity in terms of the number of Raft logs that can be written per second. |
Sudden increases in the `consul.raft.boltdb.storeLogs` times will directly impact the upper limit on the throughput of write operations within Consul.
If the free space within the database grows excessively large, such as after a large spike in writes beyond the normal steady state followed by a slowdown in the write rate, then Bolt DB could end up writing a large amount of extra data to disk for each Raft log storage operation. This will lead to an increase in the `consul.raft.boltdb.freelistBytes` metric, a count of the extra bytes that are being written for each Raft log storage operation beyond the Raft log data itself. Sudden increases in this metric, correlated with increases in the `consul.raft.boltdb.storeLogs` metric, indicate an issue.
The maximum number of Raft log storage operations that can be performed each second is represented by the `consul.raft.boltdb.writeCapacity` metric. When Raft log storage operations become slower, you may not see an immediate decrease in write capacity due to the increased batch size of each operation. Sudden changes in this metric should be further investigated.
The `consul.raft.boltdb.logsPerBatch` metric keeps track of the current batch size for Raft log storage operations. The maximum allowed is 64 Raft logs per batch. Therefore, if this metric is near 64 and the `consul.raft.boltdb.storeLogs` metric shows an increased time to write each batch to disk, increased write latencies and other errors may occur.
Next steps
In this tutorial, we reviewed the three methods for collecting metrics. SIGUSR1 and the agent HTTP API are both quick ways to collect metrics, but enabling telemetry is the best method for moving data into monitoring software. Additionally, we outlined the various metrics collected and their significance.