Nomad
Configure health check restart
The check_restart stanza instructs Nomad when to restart
tasks with unhealthy service checks. When a health check in Consul has been
unhealthy for the limit specified in a check_restart stanza, it is restarted
according to the task group's restart policy. Restarts are local to the node
running the task based on the tasks restart policy.
The limit field is used to specify the number of times a failing health check
is seen before local restarts are attempted. Operators can also specify a
grace duration to wait after a task restarts before checking its health.
You should configure the check restart on services when its likely that a restart would resolve the failure. An example of this might be restarting to correct a transient connection issue on the service.
The following check_restart stanza waits for two consecutive health check
failures with a grace period and considers both critical and warning
statuses as failures.
check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}
The following CLI example output shows health check failures triggering restarts until its restart limit is reached.
$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                   = e1b43128
Eval ID              = 249cbfe9
Name                 = demo.demo[0]
Node ID              = 221e998e
Job ID               = demo
Job Version          = 0
Client Status        = failed
Client Description   = <none>
Desired Status       = run
Desired Description  = <none>
Created              = 2m59s ago
Modified             = 39s ago
Task "test" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:28422
Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00
Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client