Ensure Only Healthy Services are Discoverable
One important feature of the Consul agent is to manage system-level and application-level health checks. A health check is considered to be application-level if it is associated with a service. If not associated with a service, the check monitors the health of the entire node.
Checks are useful for monitoring the state of your services inside your datacenter and can be applied to many different use cases. Ultimately, Consul leverages checks to maintain accurate DNS query results by omitting services and nodes that are marked unhealthy.
In this tutorial, you'll learn how to select the best health check to configure depending on the entity you want to monitor, how to write a health check definition and how to register a service and a node health check manually. Finally, you'll learn how to monitor the state of the services using the resources natively provided by Consul: UI, HTTP API, and CLI.
Prerequisites
To complete this tutorial, you can use a local dev agent or an existing Consul deployment.
- Create a directory for consul-data.
- Create a directory for consul-conf.
- Be sure to enable the UI. If you're using a local dev agent, add the -ui flag.
You can test the health checks with the demo services provided below:
Download the Counting Service and run it with the following configuration.
Download the Dashboard Service and run it with the following configuration.
How to register a check
There are three steps for registering a check in Consul.
- Define monitoring scope: Decide if you want the check to monitor a service or a node.
- Write check definition: Select the type of check you want to register and write the definition.
- Register the check: Register the check using one of the available methods.
In this tutorial, you will complete all three steps.
Define monitoring scope
Before writing the check definition, it is best practice to define the monitoring scope.
Monitor a service
Checks can be registered in association with a service either by embedding the check definition inside the service definition or by associating them with a service using the ServiceID parameter.
When registered in association with a service definition, the check only affects the health of the service it is associated with. For example, if a check is associated with a service called database, its failure only affects the availability of the database service. All other services provided by the node remain unchanged. This is the right approach for monitoring the health of a service and making sure the DNS interface only returns services that are up and running. You will use this method in the tutorial.
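The two ways of associating a check with a service can be sketched as follows. This is a minimal illustration, not a definition taken from this tutorial: the database service, its port, and the check ID are hypothetical. Agent configuration files spell the parameter service_id, while the HTTP API spells it ServiceID.

```python
import json

# Hypothetical "database" service on port 5432; names and ports are illustrative only.

# Option 1: the check is embedded inside the service definition.
embedded = {
    "service": {
        "name": "database",
        "port": 5432,
        "check": {
            "id": "database-tcp",
            "name": "Database TCP on port 5432",
            "tcp": "localhost:5432",
            "interval": "10s",
        },
    }
}

# Option 2: a standalone check definition that points at the service by ID.
# When a service sets no explicit id, its ID defaults to its name.
standalone = {
    "check": {
        "id": "database-tcp",
        "name": "Database TCP on port 5432",
        "service_id": "database",
        "tcp": "localhost:5432",
        "interval": "10s",
    }
}

print(json.dumps(embedded, indent=2))
print(json.dumps(standalone, indent=2))
```

Either form ties the check to the database service only, so a failure leaves the node's other services untouched.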
Monitor an external service
The steps in this tutorial are for services that run on the same node as a local Consul agent. If you want to monitor a service that runs on a node where you cannot run a local Consul agent, follow the steps provided in External Services. Once you are familiar with registering an external service, you can apply the concepts in this tutorial to define checks for your external services too.
Monitor a node
When a check is not associated with a service, it monitors the health of the whole node. This is not a common configuration, but it is useful when you want to ensure that the node is not used to serve traffic while some basic health requirements are not met. One possible scenario is to set up a check for hardware resources (RAM, CPU usage, or disk space) and mark the node unhealthy until those parameters are back below the desired threshold.
Warning
If a node is marked unhealthy, none of the services exposed by the node will be returned by the DNS interface.
Write check definition
Consul provides a wide range of options when it comes to health checks; review the full list of available checks in the Consul documentation. In this tutorial, you'll get an overview of the most common ones.
- Script + Interval
- HTTP + Interval
- TCP + Interval
- Alias
Write a script + interval check
Often, especially when migrating legacy applications to the cloud, you already have some customized scripts that monitor your machines to ensure they're healthy. Script checks allow you to reuse those scripts with Consul. Another use case for script checks is when you want to perform more complex checks that might rely on the underlying OS.
Add the following script check definition to a Consul agent.
This script measures the memory usage of a Linux machine and returns a warning state if it rises above 70%.
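A definition along these lines could look like the sketch below, built as a Python dictionary and printed as the JSON you would place in the agent's configuration directory. The check ID mem-util matches the name used later in the logs section; the shell pipeline, the interval, and the timeout are assumptions chosen for illustration and are not necessarily what this tutorial originally shipped.

```python
import json

# Assumed pipeline: read the "Mem:" line of free(1), compute used/total as a
# percentage, print it, and exit with code 1 (warning) when it is above 70.
MEMORY_SCRIPT = (
    "/usr/bin/free | "
    "awk '/Mem/{pct=$3/$2*100; print pct; if (pct > 70) exit 1}'"
)

mem_check = {
    "check": {
        "id": "mem-util",
        "name": "Memory utilization",
        "args": ["/bin/sh", "-c", MEMORY_SCRIPT],
        "interval": "10s",  # assumed value
        "timeout": "1s",    # assumed value
    }
}

print(json.dumps(mem_check, indent=2))
```

Because the definition has no service_id, the check applies to the whole node.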
Tuning scripts to be compatible with Consul
Consul doesn't limit the operations a script can perform, but it uses a convention on the script exit code to decide the status of the check: exit code 0 means passing, exit code 1 means warning, and any other exit code means critical.
Read more on this in the check scripts documentation.
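The exit code convention can be sketched in a few lines of Python. The helper names below are hypothetical and this is not how the agent is implemented; it only shows how a wrapper could interpret a script's exit code the same way. Treating a timeout as critical is an assumption of this sketch.

```python
import subprocess
import sys


def status_from_exit_code(code: int) -> str:
    """Map a script exit code to a check status: 0 passing, 1 warning, else critical."""
    if code == 0:
        return "passing"
    if code == 1:
        return "warning"
    return "critical"


def run_script_check(args, timeout=1.0):
    """Run a command and return the status its exit code maps to."""
    try:
        completed = subprocess.run(args, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption of this sketch: a script that times out counts as critical.
        return "critical"
    return status_from_exit_code(completed.returncode)


# Example: a tiny script that exits with code 1.
print(run_script_check([sys.executable, "-c", "raise SystemExit(1)"], timeout=30))
```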
Enabling scripts on your Consul agent
Script checks must be enabled in the agent's configuration so that they have permissions to execute scripts locally.
Security Warning
Enabling script checks in some configurations may introduce a remote execution vulnerability which is known to be targeted by malware. We strongly recommend `-enable-local-script-checks` instead. See [this blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations) for more details.
Write an HTTP + interval check
HTTP checks are a good fit when the service you want to monitor provides an endpoint that reports state information.
The status of the service depends on the HTTP response code: any 2xx code is considered passing, a 429 Too Many Requests is a warning, and anything else is a failure. Prefer this type of check over a script that uses curl or another external process to perform a simple HTTP operation. The Dashboard service you configured in the prerequisites provides a /health endpoint that is a good target for these checks.
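The response code rule described above can be written as a small function. This is only an illustration of the mapping, with a hypothetical function name, and not Consul's own code.

```python
def status_from_http_code(code: int) -> str:
    """Map an HTTP response code to a check status."""
    if 200 <= code <= 299:
        return "passing"   # any 2xx code
    if code == 429:
        return "warning"   # 429 Too Many Requests
    return "critical"      # everything else is a failure


for code in (200, 204, 429, 404, 503):
    print(code, status_from_http_code(code))
```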
In the registration section you'll learn how to embed the following check definition inside a service definition in order to be able to use it with the HTTP API.
This check will make an HTTP GET request to the URL specified in the http field, waiting the specified interval amount of time between requests.
If you want to keep the check definition in a standalone file (that is, separate from the service definition), specify service_id so that the check is associated with the correct service.
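As a sketch, an HTTP check for the Dashboard service could be described like this. The /health path comes from the text above; the port 9002, the check ID, the interval, and the timeout are assumptions, so adjust them to how you started the service.

```python
import json

DASHBOARD_PORT = 9002  # assumption: use the port your Dashboard service listens on

# The check body, as it would appear embedded inside a service definition.
http_check = {
    "id": "dashboard-check",       # hypothetical ID
    "name": "Dashboard Status",
    "http": f"http://localhost:{DASHBOARD_PORT}/health",
    "method": "GET",
    "interval": "10s",             # assumed value
    "timeout": "1s",               # assumed value
}

# The same check as a standalone file: service_id ties it to the service.
standalone_file = {"check": {**http_check, "service_id": "dashboard"}}

print(json.dumps(standalone_file, indent=2))
```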
Write a TCP + interval check
Not all applications expose an HTTP endpoint to be monitored using an HTTP check. For these applications the best approach is to use TCP checks.
Once a TCP check is configured, Consul attempts to connect to the specified port (and address, if one is given) and sets the service health based on the connection attempt:
- if the connection is accepted, the status is success.
- otherwise the status is critical.
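The connection rule above can be imitated with a few lines of Python. This is a sketch with a hypothetical function name, not the agent's implementation; it returns passing, the status name Consul reports for a successful check.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> str:
    """Return "passing" if a TCP connection is accepted, "critical" otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "passing"
    except OSError:
        # Refused connections, timeouts, and resolution errors all land here.
        return "critical"
```

For example, tcp_check("localhost", 9003) would probe a service listening on port 9003, a port chosen here only for illustration.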
The Counting service you configured in the prerequisites is a good candidate for this check.
In the registration section you'll learn how to embed the check definition inside a service definition in order to be able to use it with the HTTP API.
This check makes a TCP connection attempt to the IP/hostname and port specified in the tcp field, waiting interval amount of time between attempts.
If you want to keep the check definition in a standalone file (that is, separate from the service definition), specify service_id so that the check is associated with the correct service.
Challenge: Write an alias check
Sometimes a service can be healthy while one or more of its dependencies are not. This can result in requests being sent to a service that, in the best case, would not answer but could also respond with unpredictable content. One example is a two-tier application with a frontend and a backend, where the backend is the dependency.
To avoid this scenario, one option is to add an additional check that monitors the backend service and will be associated with the frontend service. However, this can generate additional load or network traffic to check a service that is already monitored.
Consul provides an elegant solution to that by defining an alias check. An alias check aliases the health state of another registered node or service.
For aliased services on the same agent, the local state is monitored and no additional network resources are consumed. For other services and nodes, the check maintains a blocking query over the agent's connection with a current server and allows stale requests.
The counting service could be used to represent the backend and the dependency for the dashboard service. With an alias check, if the counting service fails, then the dashboard will also be marked as unhealthy and will not be returned by the DNS interface.
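One way to express such an alias check is sketched below. The helper function and the check ID are hypothetical; alias_service, alias_node, and service_id are the configuration fields an alias check uses.

```python
import json


def alias_check(check_id, service_id, alias_service=None, alias_node=None):
    """Build an alias check definition attached to service_id."""
    check = {"id": check_id, "service_id": service_id}
    if alias_service is not None:
        check["alias_service"] = alias_service
    if alias_node is not None:
        check["alias_node"] = alias_node
    return {"check": check}


# The dashboard service mirrors the health of the counting service,
# assuming both are registered with the same agent.
definition = alias_check("counting-alias", "dashboard", alias_service="counting")
print(json.dumps(definition, indent=2))
```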
Configuration info: The configuration above expects the alias to be registered on the same agent as the one you are aliasing. If the service is not registered with the same agent, "alias_node": "<node_id>" must also be specified.
When using alias_node, if no service is specified, the check will alias the health of the node. If a service is specified, the check will alias the specified service on this particular node.
Register the checks
The final step is to register your checks. You will manually register the checks to gain a better understanding of the process and the information that your automation tooling will ultimately need to provide Consul in order to take better advantage of service discovery.
Register a node check using the configuration directory
Checks are part of Consul's reloadable configuration, so you do not need to restart Consul in order to register or modify a check.
To apply the configuration you can follow these steps:
- Copy the configuration file inside the Consul config-dir.
- Apply the configuration by issuing consul reload.
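If you want to script these two steps, a minimal Python sketch could look like this. The helper names and the file name are hypothetical, and reload_agent assumes the consul binary is on your PATH and can reach the local agent.

```python
import json
import subprocess
from pathlib import Path


def install_check(config_dir, filename, definition):
    """Step 1: write the check definition into the agent's config-dir."""
    target = Path(config_dir) / filename
    target.write_text(json.dumps(definition, indent=2) + "\n")
    return target


def reload_agent():
    """Step 2: ask the running agent to re-read its configuration."""
    subprocess.run(["consul", "reload"], check=True)
```

For example, install_check("consul-conf", "mem-check.json", definition) followed by reload_agent() would register a node check whose definition is held in the dictionary named definition.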
Check persistence
Checks installed using this method are not persisted in the Consul data folder. You can remove a check by deleting its definition file, or change it by editing the file, and then running consul reload.
Register the counting service and check using the CLI
Consul CLI provides a command to register a service in the catalog using the same definition structure we used during the check creation.
Write the following definition inside a file called service_counting.json.
Once the definition is saved you can register the service by running:
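As a hedged sketch of this step, the snippet below builds a plausible service_counting.json and the matching registration command. The port 9003, the check ID, the interval, and the timeout are assumptions; consul services register is the CLI subcommand that accepts a definition file.

```python
import json

COUNTING_PORT = 9003  # assumption: use the port your Counting service listens on

service_counting = {
    "service": {
        "id": "counting",
        "name": "counting",
        "port": COUNTING_PORT,
        "check": {
            "id": "counting-check",          # hypothetical ID
            "name": "Counting Service TCP",
            "tcp": f"localhost:{COUNTING_PORT}",
            "interval": "10s",               # assumed value
            "timeout": "1s",                 # assumed value
        },
    }
}

# Command to run once the JSON below is saved as service_counting.json.
REGISTER_COMMAND = ["consul", "services", "register", "service_counting.json"]

print(json.dumps(service_counting, indent=2))
print(" ".join(REGISTER_COMMAND))
```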
Deregister the service
If you need to deregister a service that was registered using the CLI, along with its associated check, you can use the following command:
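The consul services deregister subcommand accepts either the definition file or a service ID. The helper below is a hypothetical sketch that only builds the argument list for each form.

```python
def deregister_command(service_id=None, definition_file=None):
    """Build the argv for deregistering a service by file or by ID."""
    if definition_file is not None:
        return ["consul", "services", "deregister", definition_file]
    if service_id is not None:
        return ["consul", "services", "deregister", f"-id={service_id}"]
    raise ValueError("provide a service_id or a definition_file")


print(" ".join(deregister_command(definition_file="service_counting.json")))
print(" ".join(deregister_command(service_id="counting")))
```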
Register the dashboard service and check using the API
The third option to register a service and a check is via the HTTP API.
Write the following definition inside a file called service_dashboard.json.
Once the definition is saved you can register the service by running:
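A sketch of the API variant follows. The payload uses the API's CamelCase field names and has no outer service wrapper; the request is a PUT to /v1/agent/service/register on the local agent. The port 9002, the check ID, and the timing values are assumptions, and the agent address assumes the default port 8500.

```python
import json
import urllib.request

CONSUL_ADDR = "http://localhost:8500"  # default HTTP port of a local agent
DASHBOARD_PORT = 9002                  # assumption: your Dashboard service port

service_dashboard = {
    "ID": "dashboard",
    "Name": "dashboard",
    "Port": DASHBOARD_PORT,
    "Check": {
        "CheckID": "dashboard-check",  # hypothetical ID
        "Name": "Dashboard Status",
        "HTTP": f"http://localhost:{DASHBOARD_PORT}/health",
        "Method": "GET",
        "Interval": "10s",             # assumed value
        "Timeout": "1s",               # assumed value
    },
}


def register_request(payload):
    """Build the PUT request that registers a service with the local agent."""
    return urllib.request.Request(
        f"{CONSUL_ADDR}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# With a running agent, send it with:
# urllib.request.urlopen(register_request(service_dashboard))
```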
You might have noticed the file syntax is a bit different when it comes to the file definition for the API. Make sure you double check your files before applying them to a production environment.
Deregister the service
In case you need to deregister a service registered using the API, and the associated check, you can use the following command:
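Deregistration through the API is a PUT to /v1/agent/service/deregister/ followed by the service ID. The sketch below only builds the request; the function name is hypothetical and the agent address assumes the default port 8500.

```python
import urllib.parse
import urllib.request

CONSUL_ADDR = "http://localhost:8500"


def deregister_request(service_id):
    """Build the PUT request that removes a service and its checks from the agent."""
    path = "/v1/agent/service/deregister/" + urllib.parse.quote(service_id, safe="")
    return urllib.request.Request(CONSUL_ADDR + path, method="PUT")


# With a running agent, send it with:
# urllib.request.urlopen(deregister_request("dashboard"))
```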
Troubleshooting Checks
At this point you should be all set: you registered your checks and hopefully they are healthy. However, as you have probably already experienced, reality is much more variable. Here are a few methods for monitoring checks.
Consul UI
The first way to check on the state of your services is to use the Consul UI.
Consul (if configured using the ui parameter) exposes a web interface by default on port 8500 of the node it is running on.
Here is one example of how the UI will look after we registered all services and checks:
In case something is going on with your services and the checks start failing the view is going to be less reassuring:
You can click on the different icons to discover which checks are failing and see if the output provides additional information.
Logs
Another place that can help you see what is going on with your checks is the log files.
Here is what will be shown in the logs if you have a check called mem-util in the different states:
Passing: you will need to have log-level set to DEBUG or TRACE to see the line in your logs.
Warning:
Critical:
REST API
Neither of the indicators above is the most accurate way to check the state of your services, and neither is easy to automate. If you want a more detailed set of information, you can use the REST API:
In case you want to filter the results you can use the filtering section of the REST endpoint.
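As a sketch of querying check state programmatically, the snippet below builds a URL for the /v1/health/checks/ endpoint with an optional filter expression and condenses a response into a mapping of check ID to status. The helper names and the sample response are hypothetical; CheckID and Status are fields of the check objects this endpoint returns.

```python
import urllib.parse

CONSUL_ADDR = "http://localhost:8500"


def health_checks_url(service, filter_expr=None):
    """Build the URL that lists the checks associated with a service."""
    url = f"{CONSUL_ADDR}/v1/health/checks/{urllib.parse.quote(service, safe='')}"
    if filter_expr:
        url += "?" + urllib.parse.urlencode({"filter": filter_expr})
    return url


def summarize(checks):
    """Reduce the JSON list returned by the endpoint to {CheckID: Status}."""
    return {check["CheckID"]: check["Status"] for check in checks}


# Only checks that are not passing, for a service named counting.
print(health_checks_url("counting", 'Status != "passing"'))

# Hypothetical response body, trimmed to the fields used here.
sample = [{"CheckID": "counting-check", "Status": "critical", "ServiceName": "counting"}]
print(summarize(sample))
```

Fetching the URL with urllib.request.urlopen and decoding the body with json.load gives you the list that summarize expects.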
Different endpoints in the API
Consul exposes several resources to interact with services and checks. The most commonly used for registration are the /agent and /catalog endpoints.
Warning
While the /catalog endpoint seems to offer a valid alternative to the /agent one when it comes to registering services and checks, it is not recommended for registering agent-related entities. The reason is that, thanks to the anti-entropy mechanism, Consul constantly re-aligns the global catalog with the local state of each agent. When this happens, services that were registered for an agent's node using only the /catalog endpoint will disappear.
The /catalog endpoint is the recommended way to register external services because in that case the service is registered as belonging to an external-node.
Next steps
In this tutorial, you registered health checks with Consul and learned how to monitor them using the UI and the HTTP API that Consul natively provides. You can find a complete list of health check registration fields in the API documentation, or learn more about health checks in the check definition documentation.