Consul
Troubleshoot Consul datacenter operations
Troubleshooting is a fundamental skill for any DevOps practitioner, and Consul includes several tools to help you view log messages, validate configuration files, examine the service catalog, and perform other debugging tasks.
Before you start, consider following a pre-determined troubleshooting workflow, which can keep distractions from interfering with finding and fixing the problem. You or your team might already use a workflow such as Observe, Orient, Decide, Act (OODA) or rubber duck debugging.
Workflow
We recommend the following approach to troubleshooting Consul:
- Gather data. Review Consul's log files, status messages, process lists, data feeds, and agent metrics to help you understand what went wrong.
- Verify what works. Consul sits at the center of other highly complex systems. To narrow down the issue, confirm that systems, networking, and services are working as expected.
- Solve one problem at a time. Consul is highly configurable. Try to solve one problem at a time, verify successful operations, and then proceed to the next issue. Consider building a smaller system dedicated to testing specific configuration options or features that seem to be operating incorrectly.
Repeat these steps to isolate each piece of the problem.
For especially large or complex deployments, we recommend hypothesis-based testing to help you troubleshoot. Write down your theory about what is causing the problem and how you can test it. Then, observe the data and take action to confirm your theory is correct and find out if it fixes the underlying problem. Follow a consistent process and track your actions to reduce the likelihood that your troubleshooting efforts complicate the situation.
Where to start
We recommend gathering the following information when you begin troubleshooting your Consul cluster.
Cluster members
The consul members command lists the servers and client agents in the Consul datacenter that the current agent is connected to.
$ consul members
Node Address Status Type Build Protocol DC Segment
laptop 127.0.0.1:8301 alive server 1.4.0 2 dc1 <all>
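If your deployment federates multiple datacenters, the -wan flag lists the servers in the WAN gossip pool, and the -detailed flag adds protocol versions and node metadata. The following usage is a sketch; confirm the available flags with consul members -h on your Consul version.
$ consul members -wan
$ consul members -detailed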
Raft peers
If you need details beyond what consul members provides, try the subcommands of consul operator raft. They work with Consul at a lower level but provide detail about the moment-to-moment status of the datacenter, including leader state, voting status, and Raft protocol version.
$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
laptop abc-def-g12 127.0.0.1:8300 leader true 3
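For a lower-level view of the local agent's state, the consul info command prints runtime counters, including a raft section with the agent's current role and log indexes. The exact fields vary by Consul version, so treat this as a sketch of one way to gather data.
$ consul info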
Agent log output
The consul monitor command streams log output from the Consul agent. Additional arguments increase the amount of information returned, such as -log-level debug or -log-level trace for large amounts of log data.
$ consul monitor
2019/01/25 17:48:33 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:abcdef Address:127.0.0.1:8300}]
2019/01/25 17:48:33 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2019/01/25 17:48:33 [INFO] serf: EventMemberJoin: laptop.dc1 127.0.0.1
2019/01/25 17:48:33 [INFO] serf: EventMemberJoin: laptop 127.0.0.1
2019/01/25 17:48:33 [INFO] consul: Adding LAN server laptop (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2019/01/25 17:48:33 [INFO] consul: Handled member-join event for server "laptop.dc1" in area "wan"
2019/01/25 17:48:33 [ERR] agent: Failed decoding service file "services/.DS_Store": invalid character '\x00' looking for beginning of value
2019/01/25 17:48:33 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/01/25 17:48:33 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/01/25 17:48:33 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/01/25 17:48:33 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
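For example, to stream more verbose output while you reproduce an issue, pass the -log-level flag mentioned above. The debug and trace levels can produce a large volume of output, so use them deliberately.
$ consul monitor -log-level debug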
Validate configuration
The consul validate command can be run on a single Consul configuration file or, more commonly, on an entire directory of configuration files. It checks basic syntax and logical correctness, and it catches misspelled configuration keys as well as missing or misconfigured crucial attributes.
$ consul validate /etc/consul.d/counting.json
* invalid config key serrrvice
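To check an entire configuration directory at once, point the command at the directory instead of a single file. The path below assumes the same /etc/consul.d configuration directory used in the example above.
$ consul validate /etc/consul.d/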
Generate debug file
The consul debug command can be run on a node without any other arguments. By default, it captures metrics, logs, profiling data, and other diagnostic information to the current directory over a two-minute interval. The archive's contents are written in plain text, so do not transmit the emitted data over unencrypted channels.
$ consul debug
==> Starting debugger and capturing static information...
Agent Version: '1.4.0'
Interval: '30s'
Duration: '2m0s'
Output: 'consul-debug-1548721978.tar.gz'
Capture: 'metrics, logs, pprof, host, agent, cluster'
==> Beginning capture interval 2019-01-28 16:32:58.56142 -0800 PST (0)
Extract the archive.
$ tar xvfz consul-debug-1548721978.tar.gz
View the contents.
$ tree consul-debug-1548721978
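If the default two-minute capture is not enough, consul debug accepts flags to adjust its behavior. The example below is a sketch that captures only logs and metrics for five minutes at one-minute intervals; confirm the flag names with consul debug -h on your version.
$ consul debug -duration=5m -interval=1m -capture=logs -capture=metrics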
Latest health check status
When enabled, health checks are a crucial part of Consul's catalog operations. Consul excludes unhealthy services from discovery through standard DNS queries and some HTTP API calls.
You can monitor the health of nodes and services from the Consul web UI, or by querying the HTTP API. The /v1/agent/services endpoint returns a list of all services registered with the local agent.
$ curl "http://localhost:8500/v1/agent/services"
You can filter results with query string parameters such as the service name and health check status. For example, to query all instances of the counting service with a passing health status, use the /v1/health/service/counting?passing endpoint.
$ curl 'http://localhost:8500/v1/health/service/counting?passing'
[
  {
    "Node": {
      "ID": "da8eb9d3-...",
      "Node": "laptop",
      "Address": "127.0.0.1",
      "Datacenter": "dc1"
    },
    "Checks": [
      {
        "Node": "laptop",
        "Output": "Agent alive and reachable"
      },
      {
        "Node": "laptop",
        "Name": "Service 'counting' check",
        "Status": "passing",
        "Output": "HTTP GET http://localhost:9003/health: 200 OK Output: Hello, you've hit /health\n",
        "ServiceName": "counting"
      }
    ]
  }
]
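The truncated output above omits the Service object that is also part of each entry. Assuming the jq utility is installed, you can use it to extract each healthy instance's address and port from that object; note that Service.Address may be empty when the registration relies on the node address.
$ curl -s 'http://localhost:8500/v1/health/service/counting?passing' | jq -r '.[] | "\(.Service.Address):\(.Service.Port)"'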
Kubernetes deployments
In Kubernetes deployments, you can inspect the running Consul workloads and interact with the cluster using the kubectl command.
Use kubectl get pods to verify the status of your Consul deployment. By default, Consul is deployed within a Kubernetes namespace called consul.
$ kubectl get pods --namespace consul
NAME READY STATUS RESTARTS AGE
consul-cni-45fgb 1/1 Running 0 3m
consul-connect-injector-574799b944-n6jf6 1/1 Running 0 3m
consul-connect-injector-574799b944-xvksv 1/1 Running 0 3m
consul-server-0 1/1 Running 0 3m
consul-webhook-cert-manager-74467cdd8d-88m6j 1/1 Running 0 3m
Inspect the Consul server pods, named consul-server-<X>, to confirm they are running successfully. You can run typical Consul troubleshooting commands such as consul members through kubectl in the following format:
$ kubectl exec consul-server-0 -c consul -- consul members
If any of the pods are in a status other than Running, you can further inspect their state by running kubectl describe pods <PODNAME>. For more information about troubleshooting the deployment's pods, refer to Debug Kubernetes Pods in the Kubernetes documentation.
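For example, to inspect a server pod that is not starting and review its agent logs, you might run the following commands, assuming the default consul namespace and the pod names shown above.
$ kubectl describe pods consul-server-0 --namespace consul
$ kubectl logs consul-server-0 -c consul --namespace consul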
OpenShift deployments
When deploying Consul on OpenShift clusters, you may encounter errors with the images you pull. Instead of pulling images directly from the Red Hat Registry, use the oc import-image command to pre-load Consul and Consul on Kubernetes images from an external image registry. If an older version was previously imported with oc import-image, the command updates the local image. For example, to locally cache the Consul image version 1.21.0-ubi, run the following command:
$ oc import-image hashicorp/consul:1.21.0-ubi --from=registry.connect.redhat.com/hashicorp/consul:1.21.0-ubi --confirm
imagestream.image.openshift.io/consul imported
If the image registry has an invalid HTTPS certificate or is hosted over HTTP, include the --insecure=true flag when you import images. Similarly, if the image registry is overloaded and you need to extend the request timeout, use the --request-timeout flag with a unit of time. For example, --request-timeout=5m extends the timeout to five minutes.
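For example, the following sketch combines both flags for a registry with a self-signed certificate that is also responding slowly; the registry address is the same one used above.
$ oc import-image hashicorp/consul:1.21.0-ubi --from=registry.connect.redhat.com/hashicorp/consul:1.21.0-ubi --confirm --insecure=true --request-timeout=5m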
External troubleshooting tools
Consul was designed to work within an established ecosystem of networking protocols, which means that you can use existing tools to gather data, verify what is working, and debug the network, applications, and security context surrounding Consul.
ps
A common tool on Unix systems is ps. Run it to verify that your Consul processes are running as expected.
$ ps | grep consul
79846 ttys001 0:00.07 /usr/local/bin/consul agent -server -config-file=/etc/consul.conf -data-dir=/tmp/consul -config-dir=/etc/consul.d
Alternatively, consider pstree, which you can install with your operating system's package manager. It displays running processes in a parent-child hierarchy.
$ pstree
| | \-+= 74259 geoffrey -zsh
| | \--= 79846 geoffrey /usr/local/bin/consul agent -server -config-file=/etc/consul.conf -data-dir=/tmp/consul -config-dir=/etc/consul.d
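If pgrep is available on your system, it offers a more targeted alternative to piping ps through grep; the -f flag matches against the full command line, and -l lists it alongside the PID.
$ pgrep -fl consul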
dig
The dig tool is a command line application that interacts with DNS servers. To retrieve details about a DNS record in Consul, pass the -p flag with the port number of the Consul DNS server (8600 by default), as well as the address of the Consul DNS server preceded by the @ symbol. The following example directly queries the Consul DNS server running on the current host for the counting service.
$ dig @127.0.0.1 -p 8600 counting.service.consul
; <<>> DiG 9.10.6 <<>> @127.0.0.1 -p 8600 counting.service.consul
...
;; ANSWER SECTION:
counting.service.consul. 0 IN A 192.168.0.35
A DNS query to Consul returns the IP addresses of healthy service instances only, not of all registered instances. Use the HTTP API if you want to retrieve a list of all healthy services, or all registered services regardless of health.
Consul also provides additional information with the SRV (service record) query type. Add SRV to the dig command to retrieve the port that the counting service operates on (port 9003 in the output below).
$ dig @127.0.0.1 -p 8600 counting.service.consul SRV
;; ANSWER SECTION:
counting.service.consul. 0 IN SRV 1 1 9003 Machine.local.node.dc1.consul.

;; ADDITIONAL SECTION:
Machine.local.node.dc1.consul. 0 IN A 192.168.0.35
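Consul's DNS interface also supports datacenter-qualified names in the form <service>.service.<datacenter>.consul. For example, the following query targets the counting service in the dc1 datacenter shown earlier.
$ dig @127.0.0.1 -p 8600 counting.service.dc1.consul SRV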
curl
As simple as it may seem, the curl command is extremely useful for debugging any web service and for inspecting response content and HTTP headers. You can use curl to verify connectivity to a service in the service mesh.
If you have configured DNS forwarding, you may be able to use Consul-specific domain names to reach the service, for example curl http://counting.service.consul/.
Note: If a service uses Consul service mesh prior to version 1.10, you may need to communicate with the service from a node in the datacenter using localhost and the local port configured with Consul service mesh.
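When you need to see the HTTP headers and connection details as well as the response body, add the -v flag. The port 9191 below is a hypothetical local upstream port; substitute the port configured for your service mesh upstream.
$ curl -v http://localhost:9191/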
ping
The ping command tests network connectivity to a host, provided the host is allowed to respond to ping requests. It is also useful for measuring the latency between nodes and the reliability of packet transfer between them.
$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=4.648 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=5.980 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=4.068 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=3.663 ms
^C
--- 192.168.1.1 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 3.663/4.590/5.980/0.876 ms
Next steps
For further information related to troubleshooting, check the Frequently Asked Questions.
To learn how to troubleshoot service mesh issues, refer to Debug service mesh.
If you are experiencing connectivity issues within the service mesh services, refer to Troubleshoot service-to-service communication.
If you are operating Consul at scale, review our best practices and recommendations to help you fine-tune Consul operations across your network.