check_restart block in the job specification

Placement	`job -> group -> task -> service -> check_restart` `job -> group -> task -> service -> check -> check_restart`

The check_restart block instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Nomad or Consul has been unhealthy for the limit specified in a check_restart block, it is restarted according to the task group's restart policy. The check_restart settings apply to checks, but may also be placed on services to apply to all checks on a service. If check_restart is set on both the check and service, the blocks are merged with the check values taking precedence.

job "mysql" {
  group "mysqld" {

    restart {
      attempts = 3
      delay    = "10s"
      interval = "10m"
      mode     = "fail"
    }

    task "server" {
      service {
        tags = ["leader", "mysql"]

        port = "db"

        check {
          type     = "tcp"
          port     = "db"
          interval = "10s"
          timeout  = "2s"
        }

        check {
          type     = "script"
          name     = "check_table"
          command  = "/usr/local/bin/check_mysql_table_status"
          args     = ["--verbose"]
          interval = "60s"
          timeout  = "5s"

          check_restart {
            limit = 3
            grace = "90s"
            ignore_warnings = false
          }
        }
      }
    }
  }
}

limit (int: 0) - Restart task when a health check has failed limit times. For example 1 causes a restart on the first failure. The default, 0, disables health check based restarts. Failures must be consecutive. A single passing check will reset the count, so flapping services may not be restarted.
grace (string: "1s") - Duration to wait after a task starts or restarts before checking its health.
ignore_warnings (bool: false) - By default checks with both critical and warning statuses are considered unhealthy. Setting ignore_warnings = true treats a warning status like passing and will not trigger a restart. Only available in the Consul service provider.

Example behavior

Using the example mysql above would have the following behavior:

check_restart {
  # ...
  grace = "90s"
  # ...
}

When the server task first starts and is registered in Consul, its health will not be checked for 90 seconds. This gives the server time to startup.

check_restart {
  limit = 3
  # ...
}

After the grace period if the script check fails, it has 180 seconds (60s interval * 3 limit) to pass before a restart is triggered. Once a restart is triggered the task group's restart policy takes control:

restart {
  # ...
  delay    = "10s"
  # ...
}

The restart block controls the restart behavior of the task. In this case it will stop the task and then wait 10 seconds before starting it again.

Once the task restarts Nomad waits the grace period again before starting to check the task's health.

restart {
  attempts = 3
  # ...
  interval = "10m"
  mode     = "fail"
}

If the check continues to fail, the task will be restarted up to attempts times within an interval. If the restart attempts are reached within the limit then the mode controls the behavior. In this case the task would fail and not be restarted again. See the restart block for details.