Health Checks

One of the primary roles of the agent is management of system-level and application-level health checks. A health check is considered to be application-level if it is associated with a service. If not associated with a service, the check monitors the health of the entire node. Review the health checks tutorial to get a more complete example on how to leverage health check capabilities in Consul.

A check is defined in a configuration file or added at runtime over the HTTP interface. Checks created via the HTTP interface persist with that node.

There are several different kinds of checks:

Script + Interval - These checks depend on invoking an external application that performs the health check, exits with an appropriate exit code, and potentially generates some output. A script is paired with an invocation interval (e.g. every 30 seconds). This is similar to the Nagios plugin system. The output of a script check is limited to 4KB. Output larger than this will be truncated. By default, Script checks will be configured with a timeout equal to 30 seconds. It is possible to configure a custom Script check timeout value by specifying the timeout field in the check definition. When the timeout is reached on Windows, Consul will wait for any child processes spawned by the script to finish. For any other system, Consul will attempt to force-kill the script and any child processes it has spawned once the timeout has passed. In Consul 0.9.0 and later, script checks are not enabled by default. To use them you can either use :
- enable_local_script_checks: enable script checks defined in local config files. Script checks defined via the HTTP API will not be allowed.
- enable_script_checks: enable script checks regardless of how they are defined.
Security Warning: Enabling script checks in some configurations may introduce a remote execution vulnerability which is known to be targeted by malware. We strongly recommend enable_local_script_checks instead. See this blog post for more details.
HTTP + Interval - These checks make an HTTP GET request to the specified URL, waiting the specified interval amount of time between requests (eg. 30 seconds). The status of the service depends on the HTTP response code: any 2xx code is considered passing, a 429 Too ManyRequests is a warning, and anything else is a failure. This type of check should be preferred over a script that uses curl or another external process to check a simple HTTP operation. By default, HTTP checks are GET requests unless the method field specifies a different method. Additional header fields can be set through the header field which is a map of lists of strings, e.g. {"x-foo": ["bar", "baz"]}. By default, HTTP checks will be configured with a request timeout equal to 10 seconds.
It is possible to configure a custom HTTP check timeout value by specifying the timeout field in the check definition. The output of the check is limited to roughly 4KB. Responses larger than this will be truncated. HTTP checks also support TLS. By default, a valid TLS certificate is expected. Certificate verification can be turned off by setting the tls_skip_verify field to true in the check definition. When using TLS, the SNI will be set automatically from the URL if it uses a hostname (as opposed to an IP address); the value can be overridden by setting tls_server_name.
Consul follows HTTP redirects by default. Set the disable_redirects field to true to disable redirects.
TCP + Interval - These checks make a TCP connection attempt to the specified IP/hostname and port, waiting interval amount of time between attempts (e.g. 30 seconds). If no hostname is specified, it defaults to "localhost". The status of the service depends on whether the connection attempt is successful (ie - the port is currently accepting connections). If the connection is accepted, the status is success, otherwise the status is critical. In the case of a hostname that resolves to both IPv4 and IPv6 addresses, an attempt will be made to both addresses, and the first successful connection attempt will result in a successful check. This type of check should be preferred over a script that uses netcat or another external process to check a simple socket operation. By default, TCP checks will be configured with a request timeout of 10 seconds. It is possible to configure a custom TCP check timeout value by specifying the timeout field in the check definition.
UDP + Interval - These checks direct the client to periodically send UDP datagrams to the specified IP/hostname and port. The duration specified in the interval field sets the amount of time between attempts, such as 30s to indicate 30 seconds. The check is logged as healthy if any response from the UDP server is received. Any other result sets the status to critical. The default interval for, UDP checks is 10s, but you can configure a custom UDP check timeout value by specifying the timeout field in the check definition. If any timeout on read exists, the check is still considered healthy.
Time to Live (TTL) - These checks retain their last known state for a given TTL. The state of the check must be updated periodically over the HTTP interface. If an external system fails to update the status within a given TTL, the check is set to the failed state. This mechanism, conceptually similar to a dead man's switch, relies on the application to directly report its health. For example, a healthy app can periodically PUT a status update to the HTTP endpoint; if the app fails, the TTL will expire and the health check enters a critical state. The endpoints used to update health information for a given check are: pass, warn, fail, and update. TTL checks also persist their last known status to disk. This allows the Consul agent to restore the last known status of the check across restarts. Persisted check status is valid through the end of the TTL from the time of the last check.
Docker + Interval - These checks depend on invoking an external application which is packaged within a Docker Container. The application is triggered within the running container via the Docker Exec API. We expect that the Consul agent user has access to either the Docker HTTP API or the unix socket. Consul uses $DOCKER_HOST to determine the Docker API endpoint. The application is expected to run, perform a health check of the service running inside the container, and exit with an appropriate exit code. The check should be paired with an invocation interval. The shell on which the check has to be performed is configurable which makes it possible to run containers which have different shells on the same host. Check output for Docker is limited to 4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent must be configured with enable_script_checks set to true in order to enable Docker health checks.
gRPC + Interval - These checks are intended for applications that support the standard gRPC health checking protocol. The state of the check will be updated by probing the configured endpoint, waiting interval amount of time between probes (eg. 30 seconds). By default, gRPC checks will be configured with a default timeout of 10 seconds. It is possible to configure a custom timeout value by specifying the timeout field in the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by setting grpc_use_tls in the check definition. If TLS is enabled, then by default, a valid TLS certificate is expected. Certificate verification can be turned off by setting the tls_skip_verify field to true in the check definition. To check on a specific service instead of the whole gRPC server, add the service identifier after the gRPC check's endpoint in the following format /:service_identifier.
H2ping + Interval - These checks test an endpoint that uses http2 by connecting to the endpoint and sending a ping frame. TLS is assumed to be configured by default. To disable TLS and use h2c, set h2ping_use_tls to false. If the ping is successful within a specified timeout, then the check is updated as passing. The timeout defaults to 10 seconds, but is configurable using the timeout field. If TLS is enabled a valid certificate is required, unless tls_skip_verify is set to true. The check will be run on the interval specified by the interval field.
Alias - These checks alias the health state of another registered node or service. The state of the check will be updated asynchronously, but is nearly instant. For aliased services on the same agent, the local state is monitored and no additional network resources are consumed. For other services and nodes, the check maintains a blocking query over the agent's connection with a current server and allows stale requests. If there are any errors in watching the aliased node or service, the check state will be critical. For the blocking query, the check will use the ACL token set on the service or check definition or otherwise will fall back to the default ACL token set with the agent (acl_token).

Check Definition

A script check:

Review the service health checks tutorial to get a more complete example on how to leverage health check capabilities in Consul.

Registering a health check

There are three ways to register a service with health checks:

Start or reload a Consul agent with a service definition file in the agent's configuration directory.
Call the /agent/service/register HTTP API endpoint to register the service.
Use the consul services register CLI command to register the service.

When a service is registered using the HTTP API endpoint or CLI command, the checks persist in the Consul data folder across Consul agent restarts.

Types of checks

This section describes the available types of health checks you can use to automatically monitor the health of a service instance or node.

To manually mark a service unhealthy: Use the maintenance mode CLI command or HTTP API endpoint to temporarily remove one or all service instances on a node from service discovery DNS and HTTP API query results.

Script check

Script checks periodically invoke an external application that performs the health check, exits with an appropriate exit code, and potentially generates some output. The specified interval determines the time between check invocations. The output of a script check is limited to 4KB. Larger outputs are truncated.

By default, script checks are configured with a timeout equal to 30 seconds. To configure a custom script check timeout value, specify the timeout field in the check definition. After reaching the timeout on a Windows system, Consul waits for any child processes spawned by the script to finish. After reaching the timeout on other systems, Consul attempts to force-kill the script and any child processes it spawned.

Script checks are not enabled by default. To enable a Consul agent to perform script checks, use one of the following agent configuration options:

enable_local_script_checks: Enable script checks defined in local config files. Script checks registered using the HTTP API are not allowed.
enable_script_checks: Enable script checks no matter how they are registered.
Security Warning: Enabling non-local script checks in some configurations may introduce a remote execution vulnerability known to be targeted by malware. We strongly recommend enable_local_script_checks instead. For more information, refer to this blog post.

The following service definition file snippet is an example of a script check definition:

Script Check

check = {
  id = "mem-util"
  name = "Memory utilization"
  args = ["/usr/local/bin/check_mem.py", "-limit", "256MB"]
  interval = "10s"
  timeout = "1s"
}

Check script conventions

A check script's exit code is used to determine the health check status:

Exit code 0 - Check is passing
Exit code 1 - Check is warning
Any other code - Check is failing

Any output of the script is captured and made available in the Output field of checks included in HTTP API responses, as in this example from the local service health endpoint.

HTTP check

HTTP checks periodically make an HTTP GET request to the specified URL, waiting the specified interval amount of time between requests. The status of the service depends on the HTTP response code: any 2xx code is considered passing, a 429 Too ManyRequests is a warning, and anything else is a failure. This type of check should be preferred over a script that uses curl or another external process to check a simple HTTP operation. By default, HTTP checks are GET requests unless the method field specifies a different method. Additional request headers can be set through the header field which is a map of lists of strings, such as {"x-foo": ["bar", "baz"]}.

By default, HTTP checks are configured with a request timeout equal to 10 seconds. To configure a custom HTTP check timeout value, specify the timeout field in the check definition. The output of an HTTP check is limited to approximately 4KB. Larger outputs are truncated. HTTP checks also support TLS. By default, a valid TLS certificate is expected. Certificate verification can be turned off by setting the tls_skip_verify field to true in the check definition. When using TLS, the SNI is implicitly determined from the URL if it uses a hostname instead of an IP address. You can explicitly set the SNI value by setting tls_server_name.

Consul follows HTTP redirects by default. To disable redirects, set the disable_redirects field to true.

The following service definition file snippet is an example of an HTTP check definition:

HTTP Check

check = {
  id = "api"
  name = "HTTP API on port 5000"
  http = "https://localhost:5000/health"
  tls_server_name =  ""
  tls_skip_verify = false
  method = "POST"
  header = {
     Content-Type = ["application/json"]
  }
  body = "{\"method\":\"health\"}"
  disable_redirects = true
  interval = "10s"
  timeout = "1s"
}

{
  "check": {
    "id": "api",
    "name": "HTTP API on port 5000",
    "http": "https://localhost:5000/health",
    "tls_server_name": "",
    "tls_skip_verify": false,
    "method": "POST",
    "header": { "Content-Type": ["application/json"] },
    "body": "{\"method\":\"health\"}",
    "interval": "10s",
    "timeout": "1s"
  }
}

TCP check

TCP checks periodically make a TCP connection attempt to the specified IP/hostname and port, waiting interval amount of time between attempts. If no hostname is specified, it defaults to "localhost". The health check status is success if the target host accepts the connection attempt, otherwise the status is critical. In the case of a hostname that resolves to both IPv4 and IPv6 addresses, an attempt is made to both addresses, and the first successful connection attempt results in a successful check. This type of check should be preferred over a script that uses netcat or another external process to check a simple socket operation.

By default, TCP checks are configured with a request timeout equal to 10 seconds. To configure a custom TCP check timeout value, specify the timeout field in the check definition.

The following service definition file snippet is an example of a TCP check definition:

TCP Check

check = {
  id = "ssh"
  name = "SSH TCP on port 22"
  tcp = "localhost:22"
  interval = "10s"
  timeout = "1s"
}

UDP check

UDP checks periodically direct the Consul agent to send UDP datagrams to the specified IP/hostname and port, waiting interval amount of time between attempts. The check status is set to success if any response is received from the targeted UDP server. Any other result sets the status to critical.

By default, UDP checks are configured with a request timeout equal to 10 seconds. To configure a custom UDP check timeout value, specify the timeout field in the check definition. If any timeout on read exists, the check is still considered healthy.

The following service definition file snippet is an example of a UDP check definition:

UDP Check

check = {
  id = "dns"
  name = "DNS UDP on port 53"
  udp = "localhost:53"
  interval = "10s"
  timeout = "1s"
}

Time to live (TTL) check

TTL checks retain their last known state for the specified ttl duration. If the ttl duration elapses before a new check update is provided over the HTTP interface, the check is set to critical state.

This mechanism relies on the application to directly report its health. For example, a healthy app can periodically PUT a status update to the HTTP endpoint. Then, if the app is disrupted and unable to perform this update before the TTL expires, the health check enters the critical state. The endpoints used to update health information for a given check are: pass, warn, fail, and update. TTL checks also persist their last known status to disk. This persistence allows the Consul agent to restore the last known status of the check across agent restarts. Persisted check status is valid through the end of the TTL from the time of the last check.

To manually mark a service unhealthy, it is far more convenient to use the maintenance mode CLI command or HTTP API endpoint rather than a TTL health check with arbitrarily high ttl.

The following service definition file snippet is an example of a TTL check definition:

TTL Check

check = {
  id = "web-app"
  name = "Web App Status"
  notes = "Web app does a curl internally every 10 seconds"
  ttl = "30s"
}

Docker check

These checks depend on periodically invoking an external application that is packaged within a Docker Container. The application is triggered within the running container through the Docker Exec API. We expect that the Consul agent user has access to either the Docker HTTP API or the unix socket. Consul uses $DOCKER_HOST to determine the Docker API endpoint. The application is expected to run, perform a health check of the service running inside the container, and exit with an appropriate exit code. The check should be paired with an invocation interval. The shell on which the check has to be performed is configurable, making it possible to run containers which have different shells on the same host. The output of a Docker check is limited to 4KB. Larger outputs are truncated. The agent must be configured with enable_script_checks set to true in order to enable Docker health checks.

The following service definition file snippet is an example of a Docker check definition:

Docker Check

check = {
  id = "mem-util"
  name = "Memory utilization"
  docker_container_id = "f972c95ebf0e"
  shell = "/bin/bash"
  args = ["/usr/local/bin/check_mem.py"]
  interval = "10s"
}

{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "docker_container_id": "f972c95ebf0e",
    "shell": "/bin/bash",
    "args": ["/usr/local/bin/check_mem.py"],
    "interval": "10s"
  }
}

gRPC check

gRPC checks are intended for applications that support the standard gRPC health checking protocol. The state of the check will be updated by periodically probing the configured endpoint, waiting interval amount of time between attempts.

By default, gRPC checks are configured with a timeout equal to 10 seconds. To configure a custom Docker check timeout value, specify the timeout field in the check definition.

gRPC checks default to not using TLS. To enable TLS, set grpc_use_tls in the check definition. If TLS is enabled, then by default, a valid TLS certificate is expected. Certificate verification can be turned off by setting the tls_skip_verify field to true in the check definition. To check on a specific service instead of the whole gRPC server, add the service identifier after the gRPC check's endpoint in the following format /:service_identifier.

The following service definition file snippet is an example of a gRPC check for a whole application:

gRPC Check

check = {
  id = "mem-util"
  name = "Service health status"
  grpc = "127.0.0.1:12345"
  grpc_use_tls = true
  interval = "10s"
}

The following service definition file snippet is an example of a gRPC check for the specific my_service service

gRPC Specific Service Check

check = {
  id = "mem-util"
  name = "Service health status"
  grpc = "127.0.0.1:12345/my_service"
  grpc_use_tls = true
  interval = "10s"
}

H2ping check

H2ping checks test an endpoint that uses http2 by connecting to the endpoint and sending a ping frame, waiting interval amount of time between attempts. If the ping is successful within a specified timeout, then the check status is set to success.

By default, h2ping checks are configured with a request timeout equal to 10 seconds. To configure a custom h2ping check timeout value, specify the timeout field in the check definition.

TLS is enabled by default. To disable TLS and use h2c, set h2ping_use_tls to false. If TLS is not disabled, a valid certificate is required unless tls_skip_verify is set to true.

The following service definition file snippet is an example of an h2ping check definition:

H2ping Check

check = {
  id = "h2ping-check"
  name = "h2ping"
  h2ping = "localhost:22222"
  interval = "10s"
  h2ping_use_tls = false
}

Alias check

These checks alias the health state of another registered node or service. The state of the check updates asynchronously, but is nearly instant. For aliased services on the same agent, the local state is monitored and no additional network resources are consumed. For other services and nodes, the check maintains a blocking query over the agent's connection with a current server and allows stale requests. If there are any errors in watching the aliased node or service, the check state is set to critical. For the blocking query, the check uses the ACL token set on the service or check definition. If no ACL token is set in the service or check definition, the blocking query uses the agent's default ACL token (acl.tokens.default).

Configuration info: The alias check configuration expects the alias to be registered on the same agent as the one you are aliasing. If the service is not registered with the same agent, "alias_node": "<node_id>" must also be specified. When using alias_node, if no service is specified, the check will alias the health of the node. If a service is specified, the check will alias the specified service on this particular node.

The following service definition file snippet is an example of an alias check for a local service:

Alias Check

check = {
  id = "web-alias"
  alias_service = "web"
}

Check definition

This section covers some of the most common options for check definitions. For a complete list of all check options, refer to the Register Check HTTP API endpoint documentation.

Casing for check options: The correct casing for an option depends on whether the check is defined in a service definition file or an HTTP API JSON request body. For example, the option deregister_critical_service_after in a service definition file is instead named DeregisterCriticalServiceAfter in an HTTP API JSON request body.

General options

name (string: <required>) - Specifies the name of the check.
id (string: "") - Specifies a unique ID for this check on this node.
If unspecified, Consul defines the check id by:
- If the check definition is embedded within a service definition file, a unique check id is auto-generated.
- Otherwise, the id is set to the value of name. If names might conflict, you must provide unique IDs to avoid overwriting existing checks with the same id on this node.
interval (string: <required for interval-based checks>) - Specifies the frequency at which to run this check. Required for all check types except TTL and alias checks.
The value is parsed by Go's time package, and has the following formatting specification:
A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
service_id (string: <required for service health checks>) - Specifies the ID of a service instance to associate this check with. That service instance must be on this node. If not specified, this check is treated as a node-level check. For more information, refer to the service-bound checks section.
status (string: "") - Specifies the initial status of the health check as "critical" (default), "warning", or "passing". For more details, refer to the initial health check status section.
Health defaults to critical: If health status it not initially specified, it defaults to "critical" to protect against including a service in discovery results before it is ready.
deregister_critical_service_after (string: "") - If specified, the associated service and all its checks are deregistered after this check is in the critical state for more than the specified value. The value has the same formatting specification as the interval field.
The minimum timeout is 1 minute, and the process that reaps critical services runs every 30 seconds, so it may take slightly longer than the configured timeout to trigger the deregistration. This field should generally be configured with a timeout that's significantly longer than any expected recoverable outage for the given service.
notes (string: "") - Provides a human-readable description of the check. This field is opaque to Consul and can be used however is useful to the user. For example, it could be used to describe the current state of the check.
token (string: "") - Specifies an ACL token used for any interaction with the catalog for the check, including anti-entropy syncs and deregistration.
For alias checks, this token is used if a remote blocking query is necessary to watch the state of the aliased node or service.

Success/failures before passing/warning/critical

To prevent flapping health checks and limit the load they cause on the cluster, a health check may be configured to become passing/warning/critical only after a specified number of consecutive checks return as passing/critical. The status does not transition states until the configured threshold is reached.

success_before_passing - Number of consecutive successful results required before check status transitions to passing. Defaults to 0. Added in Consul 1.7.0.
failures_before_warning - Number of consecutive unsuccessful results required before check status transitions to warning. Defaults to the same value as that of failures_before_critical to maintain the expected behavior of not changing the status of service checks to warning before critical unless configured to do so. Values higher than failures_before_critical are invalid. Added in Consul 1.11.0.
failures_before_critical - Number of consecutive unsuccessful results required before check status transitions to critical. Defaults to 0. Added in Consul 1.7.0.

This feature is available for all check types except TTL and alias checks. By default, both passing and critical thresholds are set to 0 so the check status always reflects the last check result.

Flapping Prevention Example

checks = [
  {
    name = "HTTP TCP on port 80"
    tcp = "localhost:80"
    interval = "10s"
    timeout  = "1s"
    success_before_passing =  3
    failures_before_warning =  1
    failures_before_critical =  3
  }
]

{
  "checks": [
    {
      "name": "HTTP TCP on port 80",
      "tcp": "localhost:80",
      "interval": "10s",
      "timeout": "1s",
      "success_before_passing": 3,
      "failures_before_warning": 1,
      "failures_before_critical": 3
    }
  ]
}

Initial health check status

By default, when checks are registered against a Consul agent, the state is set immediately to "critical". This is useful to prevent services from being registered as "passing" and entering the service pool before they are confirmed to be healthy. In certain cases, it may be desirable to specify the initial state of a health check. This can be done by specifying the status field in a health check definition, like so:

Status Field Example

check = {
  id = "mem"
  args = ["/bin/check_mem", "-limit", "256MB"]
  interval = "10s"
  status = "passing"
}

The above service definition would cause the new "mem" check to be registered with its initial state set to "passing".

Service-bound checks

Health checks may optionally be bound to a specific service. This ensures that the status of the health check will only affect the health status of the given service instead of the entire node. Service-bound health checks may be provided by adding a service_id field to a check configuration:

Status Field Example

check = {
  id = "web-app"
  name = "Web App Status"
  service_id = "web-app"
  ttl = "30s"
}

In the above configuration, if the web-app health check begins failing, it will only affect the availability of the web-app service. All other services provided by the node will remain unchanged.

Agent certificates for TLS checks

The enable_agent_tls_for_checks agent configuration option can be utilized to have HTTP or gRPC health checks to use the agent's credentials when configured for TLS.

Multiple check definitions

Multiple check definitions can be defined using the checks (plural) key in your configuration file.

Multiple Checks Example

checks = [
  {
    id       = "chk1"
    name     = "mem"
    args     = ["/bin/check_mem", "-limit", "256MB"]
    interval = "5s"
  },
  {
    id       = "chk2"
    name     = "/health"
    http     = "http://localhost:5000/health"
    interval = "15s"
  },
  {
    id       = "chk3"
    name     = "cpu"
    args     = ["/bin/check_cpu"]
    interval = "10s"
  },
  ...
]

{
  "checks": [
    {
      "id": "chk1",
      "name": "mem",
      "args": ["/bin/check_mem", "-limit", "256MB"],
      "interval": "5s"
    },
    {
      "id": "chk2",
      "name": "/health",
      "http": "http://localhost:5000/health",
      "interval": "15s"
    },
    {
      "id": "chk3",
      "name": "cpu",
      "args": ["/bin/check_cpu"],
      "interval": "10s"
    },
    ...
  ]
}