Monitor Consul datacenter health with Telegraf
This page describes how to set up Telegraf to monitor Consul datacenter telemetry.
Introduction
Consul exposes a range of metrics in various formats so operators can measure the health and stability of a datacenter and diagnose or predict potential issues.
One monitoring solution is to use the Telegraf Consul plugin with the StatsD protocol supported by Consul. You can also use Grafana to organize and query the data you collect.
For the full list of Consul agent metrics, refer to the telemetry documentation.
Workflow
- Configure Telegraf to collect StatsD and host level metrics
- Configure Consul to send metrics to Telegraf
- Review Consul metrics
Configure Telegraf
Telegraf acts as a StatsD agent and can collect additional metrics about the hosts where Consul agents are running.
Telegraf includes input plugins to collect data such as CPU usage, memory usage, disk I/O, networking, and process status. The following example uses a telegraf.conf file configured to debug common Consul datacenter issues.
telegraf.conf
[global_tags]
role = "consul-server"
datacenter = "us-east-1"
[agent]
interval = "10s"
flush_interval = "10s"
omit_hostname = false
[[inputs.statsd]]
protocol = "udp"
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [90]
metric_separator = "_"
parse_data_dog_tags = true
allowed_pending_messages = 10000
percentile_limit = 1000
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
[[inputs.disk]]
# mount_points = ["/"]
# ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
# devices = ["sda", "sdb"]
# skip_serial_number = false
[[inputs.kernel]]
# no configuration
[[inputs.linux_sysctl_fs]]
# no configuration
[[inputs.mem]]
# no configuration
[[inputs.net]]
interfaces = ["eth*"]
[[inputs.netstat]]
# no configuration
[[inputs.processes]]
# no configuration
[[inputs.swap]]
# no configuration
[[inputs.system]]
# no configuration
[[inputs.procstat]]
pattern = "(consul)"
[[inputs.consul]]
address = "localhost:8500"
scheme = "http"
The telegraf.conf file starts with the global tags options, which set the role and datacenter variables. The agent section sets the default collection interval to 10 seconds and, because omit_hostname is false, instructs Telegraf to include a host tag in each metric.
Telegraf also allows you to set additional tags on the metrics that pass through it. This configuration adds tags for the server role consul-server and datacenter us-east-1. You can use these tags in Grafana to filter queries.
The next section of telegraf.conf sets up a StatsD listener on UDP port 8125 with instructions to calculate percentile metrics and to parse DogStatsD-compatible tags. Consul sends its telemetry to this listener. For more information, refer to the full reference documentation for available StatsD-related options in Telegraf.
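For illustration, each StatsD line this listener receives has the form name:value|type, with optional DogStatsD tags appended after a # character. The following hypothetical lines show a Consul counter without and with a tag attached:

consul.raft.apply:1|c
consul.raft.apply:1|c|#datacenter:us-east-1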
The next sections in the file configure inputs for collecting CPU, memory, network I/O, and disk I/O data.
Under inputs.net, it is important to make sure the interfaces value matches the interface names on your system. Most Linux systems use names like eth0 or enp0s0, but you can list any valid interface name from your system. The list supports glob patterns, so eth* matches all interfaces whose names start with eth.
The configuration also includes the procstat Telegraf plugin, which reports metrics for a process according to a given pattern. In this case, you are using it to monitor the Consul agent process itself.
Finally, the configuration includes a plugin that monitors the health checks associated with the Consul agent by using the Consul API to query the data.
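Note that this example configuration defines inputs only and omits an outputs section, which Telegraf needs in order to write metrics to a datastore that Grafana can query. The following is a minimal sketch that assumes a local InfluxDB 1.x server and a hypothetical telegraf database; substitute the output plugin and settings for your own metrics backend.

[[outputs.influxdb]]
# Hypothetical local InfluxDB 1.x endpoint; replace with your own backend.
urls = ["http://127.0.0.1:8086"]
database = "telegraf"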
After editing the telegraf.conf file, reload or restart the Telegraf agent to apply the new configuration.
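For example, on a host where Telegraf runs under systemd, assuming the default unit name telegraf:

sudo systemctl restart telegraf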
Configure Consul
To send telemetry to Telegraf, add a telemetry section to your Consul server or client agent configuration. Include the hostname and port of the StatsD daemon:
Consul agent configuration
telemetry {
dogstatsd_addr = "localhost:8125"
disable_hostname = true
}
The configuration specifies DogStatsD format instead of plain StatsD. As a result, Consul sends tags with each metric. You can use Grafana to filter data on your dashboards according to these tags. For example, you can display server agent data by filtering for role=consul-server. Telegraf is compatible with the DogStatsD format, and allows you to add your own tags too.
The disable_hostname option instructs Consul not to insert the hostname in the names of the metrics it sends to StatsD. For example, with disable_hostname set to false, consul.raft.apply becomes consul.<HOSTNAME>.raft.apply. For more information, refer to the Consul telemetry configuration reference. This example disables hostnames in metric names because telegraf.conf already attaches the hostname as a tag. If you require hostnames as part of the metric names, set this parameter to false.
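After you update the agent configuration, restart the Consul agent so that the new telemetry settings take effect. For example, on a systemd-managed host, assuming the service unit is named consul:

sudo systemctl restart consul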
Review Consul metrics
You can use a tool like Grafana or Chronograf to visualize metrics from Telegraf.
Some of the important metrics to monitor include:
- Memory usage metrics
- File descriptor metrics
- CPU usage metrics
- Network activity metrics
- Disk activity metrics
Memory usage metrics
| Metric Name | Description |
|---|---|
| mem.total | Total amount of physical memory (RAM) available on the server. |
| mem.used_percent | Percentage of physical memory in use. |
| swap.used_percent | Percentage of swap space in use. |
Why they are important: Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes. Swap usage should remain at 0% for best performance.
When to take action: If mem.used_percent is over 90%, or if swap.used_percent is greater than 0.
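As a sketch of how you might watch this threshold, assuming Telegraf writes to InfluxDB with its default measurement names, the following InfluxQL query returns recent memory usage per host; adapt it to your own datastore and query language:

SELECT mean("used_percent") FROM "mem" WHERE time > now() - 5m GROUP BY time(1m), "host"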
File descriptor metrics
| Metric Name | Description |
|---|---|
| linux_sysctl_fs.file-nr | Number of file handles being used across all processes on the host. |
| linux_sysctl_fs.file-max | Total number of available file handles. |
Why they are important: Practically anything Consul does, from receiving a connection from another host to sending data between servers or writing snapshots to disk, requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. Refer to the Consul FAQ for more details.
By default, process and kernel limits are fairly conservative. We recommend that you increase these limits beyond the defaults.
When to take action: If file-nr exceeds 80% of file-max.
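For example, on a systemd-managed host you can raise the process limit for Consul with a drop-in override. The path and the 65536 value below are illustrative starting points, not tuned recommendations:

# /etc/systemd/system/consul.service.d/limits.conf
[Service]
LimitNOFILE=65536

After adding the override, run sudo systemctl daemon-reload and restart the Consul service.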
CPU usage metrics
| Metric Name | Description |
|---|---|
| cpu.user_cpu | Percentage of CPU being used by user processes, including Consul. |
| cpu.iowait_cpu | Percentage of CPU time spent waiting for I/O tasks to complete. |
Why they are important: In normal circumstances, Consul is not particularly demanding on CPU time. A spike in CPU usage might indicate too many operations taking place at once. iowait_cpu is especially critical to watch because it means Consul is waiting for data to be written to disk. That may be a sign that Raft is writing snapshots to disk too often.
When to take action: If cpu.iowait_cpu is greater than 10%.
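As a sketch, assuming an InfluxDB backend and recent Telegraf cpu plugin field names (which report I/O wait as usage_iowait), the following query shows average I/O wait per host so you can alert on the 10% threshold:

SELECT mean("usage_iowait") FROM "cpu" WHERE "cpu" = 'cpu-total' AND time > now() - 5m GROUP BY time(1m), "host"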
Network activity metrics
| Metric Name | Description |
|---|---|
| net.bytes_recv | Bytes received on each network interface. |
| net.bytes_sent | Bytes transmitted on each network interface. |
Why they are important: A sudden spike in network traffic to Consul might be the result of a misconfigured application client causing too many requests to Consul. The source of this data is the system itself, not Consul. Be aware that the net metrics are counters, so in order to calculate rates such as bytes per second, you must apply a function such as non_negative_difference.
When to take action: If the net metrics show sudden large changes that deviate more than 50% from the baseline.
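For example, assuming an InfluxDB backend, the following InfluxQL query applies non_negative_difference to turn the bytes_recv counter into a per-interval rate; the same approach applies to the diskio counters described in the next section:

SELECT non_negative_difference(last("bytes_recv")) FROM "net" WHERE time > now() - 1h GROUP BY time(10s), "host", "interface"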
Disk activity metrics
| Metric Name | Description |
|---|---|
| diskio.read_bytes | Bytes read from each block device. |
| diskio.write_bytes | Bytes written to each block device. |
Why they are important: When you run high volume workloads, the Consul host writes a lot of data to disk. Be aware that the diskio metrics are counters, so to calculate rates such as bytes per second, you must apply a function such as non_negative_difference.
Major I/O spikes may occur frequently during leader elections, because Consul checkpoints Raft snapshots to disk often when under heavy load. Spikes may also occur when Consul runs with debug or trace logging enabled in production, which can impact performance.
Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.
When to take action: If the diskio metrics show sudden large changes greater than a 50% deviation, or more than 3 standard deviations, from the baseline.
Next steps
For more information about agent telemetry in Consul, refer to Consul Agent Telemetry and Consul Dataplane Telemetry.
To learn more about monitoring, alerting, and logging data generated by Consul agents, refer to Consul Monitoring.