Consul Autopilot
This page describes Consul Autopilot, a set of features that provide operator-friendly management automations for Consul servers.
Overview
Consul autopilot helps you maintain the health and stability of the Consul server cluster. It includes the following features:
- Server health checking
- Server stabilization time
- Dead server cleanup
- Redundancy zones (only available in Consul Enterprise)
- Automated upgrades (only available in Consul Enterprise)
Default configuration
To check the default autopilot values, use the consul operator autopilot get-config CLI command or the /v1/operator/autopilot/configuration HTTP endpoint.
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
The following table lists autopilot configuration parameters, their descriptions, and their default values:
| Autopilot setting | Type | Default value | Description |
|---|---|---|---|
| CleanupDeadServers | Boolean | true | Enables periodic dead server removal from the Raft peer set. |
| LastContactThreshold | Duration | 200ms | The interval that can elapse between a server's last contact with the current leader before Consul considers it unhealthy. |
| MaxTrailingLogs | Integer | 250 | Maximum number of Raft log entries that a server can trail the leader by and still be considered healthy. |
| MinQuorum | Integer | 0 | Minimum number of healthy voting servers required to maintain quorum in the datacenter. |
| ServerStabilizationTime | Duration | 10s | Time duration that a new server must remain healthy before it can become a voting member. |
| RedundancyZoneTag | String | "" | Tag name used to identify redundancy zones for servers in Consul Enterprise. |
| DisableUpgradeMigration | Boolean | false | Flag to disable automatic upgrade migrations in Consul Enterprise. |
| UpgradeVersionTag | String | "" | Tag name used to identify server versions for automated upgrades in Consul Enterprise. |
Consul servers maintain changes to the autopilot configuration in the Raft database. As a result, autopilot configurations are included in the Consul snapshot data.
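If you want to compare the active settings against the defaults from a script, one option is to parse the Key = value lines that get-config prints. The following Python is a minimal sketch under that assumption: the function name parse_autopilot_config and the SAMPLE_OUTPUT string are illustrative and not part of Consul, and the sample text is copied from the output shown earlier in this section.

```python
def parse_autopilot_config(text: str) -> dict:
    """Parse `Key = value` lines as printed by `consul operator autopilot get-config`.

    Hypothetical helper for illustration; it is not shipped with Consul.
    """
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines or anything that is not `Key = value`
        key, value = key.strip(), value.strip()
        if value in ("true", "false"):
            config[key] = value == "true"      # booleans
        elif value.isdigit():
            config[key] = int(value)           # plain integers such as 250
        else:
            config[key] = value.strip('"')     # durations and tags stay strings
    return config


# Sample text copied from the get-config output on this page.
SAMPLE_OUTPUT = """\
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
"""

config = parse_autopilot_config(SAMPLE_OUTPUT)
for name, value in config.items():
    print(f"{name}: {value!r}")
```

Durations such as 200ms are kept as strings here so that the sketch makes no assumption about how you want to convert them.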
Server health checking
An internal health check runs on the leader to track the stability of servers. A server is considered healthy if all of the following conditions are true:
- It has a SerfHealth status of 'Alive'.
- The time since its last contact with the current leader is below LastContactThreshold. The default value is 200ms.
- Its latest Raft term matches the leader's term.
- The number of Raft log entries it trails the leader by does not exceed MaxTrailingLogs. The default value is 250.
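The four conditions above combine with a logical AND. The following Python sketch restates them as a single predicate. It is an illustration of the rules as worded on this page, not Consul's implementation: the ServerStats class, its field names, and the is_healthy function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server view as the leader might see it."""
    serf_status: str        # Serf health status, for example "alive"
    last_contact_ms: float  # time since last contact with the leader
    last_term: int          # latest Raft term known to the server
    trailing_logs: int      # Raft log entries the server is behind the leader


def is_healthy(
    server: ServerStats,
    leader_term: int,
    last_contact_threshold_ms: float = 200,  # LastContactThreshold default
    max_trailing_logs: int = 250,            # MaxTrailingLogs default
) -> bool:
    """Return True only if every condition in the list above holds."""
    return (
        server.serf_status.lower() == "alive"
        # "below" is read as strictly less than; Consul's own boundary
        # handling may differ.
        and server.last_contact_ms < last_contact_threshold_ms
        and server.last_term == leader_term
        # "does not exceed" is read as less than or equal.
        and server.trailing_logs <= max_trailing_logs
    )


follower = ServerStats("alive", last_contact_ms=35, last_term=7, trailing_logs=12)
lagging = ServerStats("alive", last_contact_ms=35, last_term=7, trailing_logs=900)
print(is_healthy(follower, leader_term=7))
print(is_healthy(lagging, leader_term=7))
```

A server that fails any single condition, such as the lagging example that trails by more than MaxTrailingLogs entries, is reported as unhealthy.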
To return the status of these health checks, use the /v1/operator/autopilot/health HTTP endpoint. The Healthy field at the top indicates the overall status of the datacenter:
$ curl localhost:8500/v1/operator/autopilot/health | jq .
{
"Healthy": true,
"FailureTolerance": 1,
"Servers": [
{
# ...
"Name": "server-dc1-1",
"Address": "10.20.10.11:8300",
"SerfStatus": "alive",
"Version": "1.7.2",
"Leader": false,
# ...
"Healthy": true,
"Voter": true,
# ...
},
{
# ...
"Name": "server-2",
"Address": "10.20.10.12:8300",
"SerfStatus": "alive",
"Version": "1.7.2",
"Leader": false,
# ...
"Healthy": true,
"Voter": true,
# ...
},
{
# ...
"Name": "server-3",
"Address": "10.20.10.13:8300",
"SerfStatus": "alive",
"Version": "1.7.2",
"Leader": false,
# ...
"Healthy": true,
"Voter": false,
# ...
}
]
}
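When you monitor this endpoint from a script, you usually need only a few of the fields. The following Python sketch summarizes a response that has already been decoded from JSON. It assumes only the fields shown in the output above (Healthy, FailureTolerance, Servers, Name, Voter); the summarize_health function and the SAMPLE_RESPONSE payload are illustrative, and the sample is a trimmed copy of the output above.

```python
import json


def summarize_health(payload: dict) -> dict:
    """Reduce an autopilot health response to the fields most checks need.

    Hypothetical helper; it only reads fields shown in the sample output.
    """
    servers = payload.get("Servers", [])
    return {
        "healthy": payload.get("Healthy", False),
        "failure_tolerance": payload.get("FailureTolerance", 0),
        "voters": [s["Name"] for s in servers if s.get("Voter")],
        "non_voters": [s["Name"] for s in servers if not s.get("Voter")],
        "unhealthy": [s["Name"] for s in servers if not s.get("Healthy")],
    }


# Trimmed copy of the response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "Healthy": true,
  "FailureTolerance": 1,
  "Servers": [
    {"Name": "server-dc1-1", "SerfStatus": "alive", "Healthy": true, "Voter": true},
    {"Name": "server-2", "SerfStatus": "alive", "Healthy": true, "Voter": true},
    {"Name": "server-3", "SerfStatus": "alive", "Healthy": true, "Voter": false}
  ]
}
""")

summary = summarize_health(SAMPLE_RESPONSE)
print(json.dumps(summary, indent=2))
```

In this sample, server-3 is healthy but is not a voter, so it appears in the non_voters list and not in the unhealthy list.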
Server stabilization time
When a new server joins the datacenter, there is an initial waiting period during which it must stay healthy and stable before it can become a voting member. The ServerStabilizationTime parameter configures this duration. The default value is 10 seconds.
To change the waiting period, set the parameter to a different duration. The following example extends the waiting period to 15 seconds:
$ consul operator autopilot set-config -server-stabilization-time=15s
Configuration updated!
Use the get-config command to check the configuration.
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 15s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
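To make the effect of the setting concrete, the following Python sketch models the waiting period as a simple comparison: a new server qualifies once it has been continuously healthy for at least ServerStabilizationTime. This is an illustrative model, not Consul's promotion logic. The parse_duration and can_become_voter names are hypothetical, the parser handles only single-unit values such as 200ms or 15s, and treating the boundary as inclusive is an assumption.

```python
import re

# Seconds per unit for the simple duration strings used on this page.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(value: str) -> float:
    """Convert a single-unit duration such as "200ms" or "15s" to seconds.

    Compound values such as "1m30s" are deliberately not handled here.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


def can_become_voter(healthy_for_seconds: float, stabilization_time: str = "10s") -> bool:
    """Return True once the server has stayed healthy for the full period.

    Assumption: reaching exactly the configured duration is enough.
    """
    return healthy_for_seconds >= parse_duration(stabilization_time)


# A server that has been healthy for 12 seconds qualifies under the
# default of 10s, but not after the period is extended to 15s.
print(can_become_voter(12))
print(can_become_voter(12, "15s"))
```

Under this model, extending the period from 10s to 15s delays promotion of a newly joined server without affecting servers that are already voters.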
Dead server cleanup
When autopilot is disabled, it takes 72 hours for Consul to automatically reap dead servers. The alternative would be for an operator to manually issue the consul force-leave <dead-server-name> command for each dead server.
In this situation, another server failure could jeopardize the cluster's quorum. The Consul cluster still considers the missing server a member of the datacenter, even if the failed Consul server was automatically replaced.
Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process, they enter the "left" state and are not considered for the datacenter's quorum.
Autopilot also triggers the cleanup process automatically whenever a new server successfully joins the datacenter.
We recommend leaving dead server cleanup enabled to avoid issues with faulty nodes that require manual pruning. In test scenarios and dev environments, you can disable the faulty node pruning with the consul operator autopilot set-config -cleanup-dead-servers=false command.
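The quorum risk described in this section follows from the Raft majority rule: a peer set of N servers needs floor(N/2) + 1 of them available. The following Python sketch works through a three-server datacenter to show why a dead server that stays in the peer set leaves no margin, even after a replacement joins. The function names are hypothetical and the scenario is an illustration, not output from Consul.

```python
def quorum_size(peers: int) -> int:
    """Majority of the Raft peer set."""
    return peers // 2 + 1


def failure_tolerance(peers: int, alive: int) -> int:
    """Additional servers that can fail before quorum is lost.

    A negative result means quorum is already lost.
    """
    return alive - quorum_size(peers)


# (description, servers in the Raft peer set, servers actually alive)
scenarios = [
    ("3 servers, all alive", 3, 3),
    ("1 server dies, not yet removed", 3, 2),
    ("replacement joins, dead server still a peer", 4, 3),
    ("dead server cleaned up", 3, 3),
]

for description, peers, alive in scenarios:
    print(
        f"{description}: quorum={quorum_size(peers)}, "
        f"tolerance={failure_tolerance(peers, alive)}"
    )
```

In the third scenario the peer set has grown to four, so quorum rises to three and the datacenter still cannot tolerate another failure. Removing the dead server returns the peer set to three and restores a tolerance of one.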
Redundancy zones (Enterprise)
Redundancy zones provide high availability in case of server failure. With Consul Enterprise, autopilot helps you create redundancy zones by adding read replicas to your datacenter. When a voting server fails, autopilot promotes a read replica to voting status.
You can set up redundancy zones to implement isolated failure domains. For example, deploying a server and a read replica in each AWS Availability Zone (AZ) provides additional protection against failure within a region.
To learn more, refer to provide fault tolerance with redundancy zones.
Automated upgrades (Enterprise)
Automated upgrades are an Enterprise feature that helps you upgrade an existing Consul datacenter. With autopilot, you can add new servers running a new Consul version directly to the datacenter. When enough servers are running the new version, you can perform a leadership change and demote the old servers to "non-voters".
To learn more, refer to automate upgrades with Consul Enterprise.
Next steps
To learn more about the autopilot features described on this page, refer to read replicas and redundancy zones.
For agent specifications related to autopilot settings for stability, refer to the last_contact_threshold and max_trailing_logs parameters in the Consul agent configuration documentation.