This document provides recommendations for monitoring your Consul control plane and data plane. By keeping track of these components and setting up alerts, you can better maintain the overall health and resilience of your service mesh.
A Consul datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations like service discovery or service mesh. A datacenter contains at least one Consul server agent, but a real-world deployment contains three or five server agents and several Consul client agents.
The Consul control plane consists of server agents that store all state information, including service and node IP addresses, health checks, and configuration. The control plane is also responsible for securing the mesh, facilitating service discovery, health checking, policy enforcement, and other similar operational concerns. In addition, the control plane contains client agents that report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter.
The Consul data plane consists of proxies deployed locally alongside each service instance. These proxies, called sidecar proxies, receive mesh configuration data from the control plane, and control network communication between their local service instance and other services in the network. The sidecar proxy handles inbound and outbound service connections, and ensures TLS connections between services are both verified and encrypted.
If you have Kubernetes workloads, you can also run Consul with an alternate service mesh configuration that deploys Envoy proxies but not client agents. Refer to Simplified service mesh with Consul dataplanes for more information.
The Consul control plane consists of the following components:
- RPC communication between Consul servers and clients
- Data plane routing instructions for the Envoy Layer 7 proxy
- Serf traffic: LAN and WAN
- Consul cluster peering and server federation
It is important to monitor Consul control plane components and to establish baselines and alert thresholds for them so that you can detect abnormal behavior. Note that planned events such as Consul cluster upgrades, configuration changes, or leadership changes can also trigger these alerts.
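One way to make the idea of a baseline and an alert threshold concrete is the following minimal Python sketch. It assumes you have already collected historical values for some control plane metric. The sample values, the function names, and the choice of three standard deviations are illustrative assumptions, not Consul recommendations.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, standard deviation) for a list of historical metric values."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a standard deviation")
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline_mean, baseline_stdev, num_deviations=3.0):
    """Flag a value that is more than num_deviations standard deviations from the baseline."""
    return abs(value - baseline_mean) > num_deviations * baseline_stdev


# Hypothetical history of a control plane metric (for example, requests per second).
history = [100.0, 102.0, 98.0, 101.0, 99.0]
baseline_mean, baseline_stdev = build_baseline(history)

# A new observation is compared against the baseline before alerting.
should_alert = is_abnormal(110.0, baseline_mean, baseline_stdev)
```

Because planned events such as upgrades can also move a metric away from its baseline, a check like this is usually paired with a way to silence alerts during maintenance windows.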
To help monitor your Consul control plane, we recommend establishing a baseline and standard deviation for the following:
A highly performant network with low latency is important. Ensure that network latency for gossip in all datacenters is within the 8 ms latency budget for all Consul agents. Refer to the Production server requirements for more information.
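As a rough illustration of checking the latency budget, the sketch below scans round-trip estimates and reports node pairs that exceed 8 ms. It assumes input lines shaped like the output of the `consul rtt` command. The exact line format, the node names, and the decision to compare round-trip estimates against the budget are assumptions; confirm them in your own environment.

```python
import re

GOSSIP_LATENCY_BUDGET_MS = 8.0  # budget stated in the text above

# Assumed line shape, for example:
#   Estimated node-a <-> node-b rtt: 0.610 ms (using LAN coordinates)
RTT_PATTERN = re.compile(r"Estimated (\S+) <-> (\S+) rtt: ([0-9.]+) ms")


def over_budget(rtt_lines, budget_ms=GOSSIP_LATENCY_BUDGET_MS):
    """Return (node_a, node_b, rtt_ms) for every pair whose estimate exceeds the budget."""
    offenders = []
    for line in rtt_lines:
        match = RTT_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not look like an RTT estimate
        rtt_ms = float(match.group(3))
        if rtt_ms > budget_ms:
            offenders.append((match.group(1), match.group(2), rtt_ms))
    return offenders


# Hypothetical sample lines; node names are placeholders.
sample_output = [
    "Estimated node-a <-> node-b rtt: 0.610 ms (using LAN coordinates)",
    "Estimated node-a <-> node-c rtt: 12.400 ms (using LAN coordinates)",
]
slow_pairs = over_budget(sample_output)
```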
Consul uses the Raft consensus protocol. High saturation of the Raft goroutines can lead to elevated latency in the rest of the system and may cause the Consul cluster to become unstable. As a result, it is important to monitor Raft to track your control plane health. We recommend the following actions to keep your control plane healthy:
- Create an alert that notifies you when Raft thread saturation exceeds 50%.
- Monitor Raft replication capacity when Consul is handling large amounts of data and high write throughput.
- Adjust `raft_multiplier` to keep your Consul cluster stable. The value of `raft_multiplier` defines the scaling factor for Consul's Raft timing parameters. The default value for `raft_multiplier` is 5.
A short multiplier minimizes failure detection and election time but may trigger frequently in high latency situations. This can cause constant leadership churn and associated unavailability. A high multiplier reduces the chances that spurious failures will cause leadership churn but it does this at the expense of taking longer to detect real failures and thus takes longer to restore Consul cluster availability.
Wide networks with higher latency will perform better with larger values of `raft_multiplier`.
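The trade-off described above can be sketched numerically. The following Python example assumes that `raft_multiplier` scales a set of base Raft timeouts linearly and that valid values range from 1 to 10. The base timings shown are illustrative assumptions, so verify them against the documentation for your Consul version before relying on the numbers.

```python
# Assumed base timings in milliseconds (illustrative, not authoritative).
BASE_TIMINGS_MS = {
    "heartbeat_timeout": 1000,
    "election_timeout": 1000,
    "leader_lease_timeout": 500,
}


def scaled_timings(raft_multiplier):
    """Return the assumed effective Raft timings for a given raft_multiplier."""
    if not 1 <= raft_multiplier <= 10:
        raise ValueError("raft_multiplier is assumed to be an integer from 1 to 10")
    return {name: base * raft_multiplier for name, base in BASE_TIMINGS_MS.items()}


# Compare a short multiplier with the default of 5 mentioned in the text.
fast = scaled_timings(1)
default = scaled_timings(5)
```

Under these assumptions, moving from a multiplier of 5 to 1 shortens every timeout by a factor of five, which speeds up failure detection but leaves less tolerance for network latency.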
Raft uses BoltDB for storing data and maintaining its own state. Refer to the BoltDB performance metrics when you are troubleshooting Raft performance issues.
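Returning to the 50% Raft thread saturation alert recommended earlier, the sketch below shows one possible check. It assumes the metrics were already fetched as JSON from the agent metrics endpoint and parsed into a dictionary with a `Samples` list, that saturation metrics are named `consul.raft.thread.<name>.saturation`, and that their values are fractions between 0 and 1. Treat all three points as assumptions to confirm against your Consul telemetry.

```python
SATURATION_THRESHOLD = 0.5  # 50%, assuming the metric is a fraction between 0 and 1


def saturated_raft_threads(metrics, threshold=SATURATION_THRESHOLD):
    """Return {metric_name: mean} for Raft thread saturation samples above the threshold."""
    alerts = {}
    for sample in metrics.get("Samples", []):
        name = sample.get("Name", "")
        is_saturation = name.startswith("consul.raft.thread.") and name.endswith(".saturation")
        if is_saturation and sample.get("Mean", 0.0) > threshold:
            alerts[name] = sample["Mean"]
    return alerts


# Hypothetical metrics payload; the values are made up for illustration.
example_metrics = {
    "Samples": [
        {"Name": "consul.raft.thread.main.saturation", "Mean": 0.62},
        {"Name": "consul.raft.thread.fsm.saturation", "Mean": 0.10},
        {"Name": "consul.raft.commitTime", "Mean": 3.2},
    ]
}
alerts = saturated_raft_threads(example_metrics)
```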
The Consul data plane consists of Consul clients or Connect proxies that interact with each other through service-to-service communication. Service-to-service traffic always stays within the data plane, while the control plane only enforces traffic rules. Monitoring service-to-service communication is important, but it can become extremely complex in an enterprise setup where multiple services communicate with each other across federated Consul clusters through mesh, ingress, and terminating gateways.
You can extract the following service-related information:
- Use the `catalog` command or the Consul UI to query all registered services in a Consul datacenter.
- Use the `/agent/service/:service_id` API endpoint to query individual services. Connect proxies use this endpoint to discover embedded configuration.
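As a minimal sketch of querying an individual service, the following Python example builds the request URL and fetches the JSON response using only the standard library. The agent address `http://127.0.0.1:8500`, the `/v1` API prefix, and the service ID `web-1` are assumptions for illustration. The sketch also omits ACL tokens and TLS settings, which your datacenter may require.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

DEFAULT_AGENT_ADDR = "http://127.0.0.1:8500"  # assumed local agent address


def agent_service_url(service_id, agent_addr=DEFAULT_AGENT_ADDR):
    """Build the URL for the agent service endpoint, escaping the service ID."""
    return f"{agent_addr}/v1/agent/service/{quote(service_id, safe='')}"


def fetch_agent_service(service_id, agent_addr=DEFAULT_AGENT_ADDR, timeout=5.0):
    """Fetch and decode the service definition from a reachable Consul agent."""
    with urlopen(agent_service_url(service_id, agent_addr), timeout=timeout) as response:
        return json.load(response)


# "web-1" is a hypothetical service ID; calling fetch_agent_service requires a running agent.
url = agent_service_url("web-1")
```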
Envoy is the supported Connect proxy for Consul service mesh. For virtual machines (VMs), Envoy starts as a sidecar service process. For Kubernetes, Envoy starts as a sidecar container in a Kubernetes service pod. Refer to the Supported Envoy versions documentation to find the compatible Envoy versions for your version of Consul.
For troubleshooting service mesh issues, set Consul logs to `debug`. The following example annotation sets Envoy logging to `debug`.

    annotations:
      consul.hashicorp.com/envoy-extra-args: '--log-level debug --disable-hot-restart'

Refer to the Enable logging on Envoy sidecar pods documentation for more information.
To troubleshoot service-to-service communication issues, monitor Envoy host statistics. Envoy exposes a local administration interface on port `19000` by default, which you can use to query and modify different aspects of the server. Envoy also exposes a public listener on port `20000` by default to receive mTLS connections from other proxies in the mesh.
All endpoints exposed by Envoy are available at the node running Envoy on port `19000`. The node can be either a pod in Kubernetes or a VM running Consul service mesh. For example, if you forward the Envoy port to your local machine, you can access the Envoy admin interface at `http://localhost:19000`.
The following Envoy admin interface endpoints are particularly useful:
- The `/listeners` endpoint lists all listeners running on `localhost`. This allows you to confirm whether the upstream services are binding correctly to Envoy.

      $ curl http://localhost:19000/listeners
      public_listener:192.168.19.168:20000::192.168.19.168:20000
      Outbound_listener:127.0.0.1:15001::127.0.0.1:15001

- The `/clusters` endpoint displays information about the xDS clusters, such as service requests and mTLS-related data. The following example shows a truncated output.

      $ curl http://localhost:19000/clusters
      local_app::observability_name::local_app
      local_app::default_priority::max_connections::1024
      local_app::default_priority::max_pending_requests::1024
      local_app::default_priority::max_requests::1024
      local_app::default_priority::max_retries::3
      local_app::high_priority::max_connections::1024
      local_app::high_priority::max_pending_requests::1024
      local_app::high_priority::max_requests::1024
      local_app::high_priority::max_retries::3
      local_app::added_via_api::true
      ## ...

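If you want to inspect this output programmatically, the following Python sketch parses the plain-text responses shown above. It assumes that fields are separated by `::`, as in the examples, and that you have already captured the response body as a string. The function names are hypothetical, and the listener parser does not handle IPv6 addresses, which can themselves contain `::`.

```python
def parse_listeners(text):
    """Map each listener name to its address from `name::address` lines."""
    listeners = {}
    for line in text.splitlines():
        name, separator, address = line.strip().partition("::")
        if separator:  # ignore lines without the `::` separator
            listeners[name] = address
    return listeners


def parse_clusters(text):
    """Group `cluster::key...::value` lines into {cluster: {key_path: value}}."""
    clusters = {}
    for line in text.splitlines():
        parts = line.strip().split("::")
        if len(parts) < 3:
            continue  # skip blank, truncated, or comment-like lines
        cluster, *key_path, value = parts
        clusters.setdefault(cluster, {})["::".join(key_path)] = value
    return clusters


# Sample bodies copied from the example output above.
listeners = parse_listeners(
    "public_listener:192.168.19.168:20000::192.168.19.168:20000\n"
    "Outbound_listener:127.0.0.1:15001::127.0.0.1:15001\n"
)
clusters = parse_clusters(
    "local_app::observability_name::local_app\n"
    "local_app::default_priority::max_connections::1024\n"
    "## ...\n"
)
```

Values are kept as strings so that the parser makes no assumption about which fields are numeric.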
Visit the main admin interface (`http://localhost:19000`) to find the full list of available Envoy admin endpoints. Refer to the Envoy docs for more information.
In this guide, you reviewed recommendations for monitoring your Consul control plane and data plane.
To learn about monitoring the Consul host and instance resources, visit our Monitoring best practices documentation.