Overview
Autopilot is a set of features for automatic, operator-friendly management of Nomad servers. It includes automated upgrades, monitoring of the Raft cluster state, and stable server introduction. Together, these features significantly simplify the Nomad upgrade process.
For a basic introduction to Autopilot and how to enable it, see the Autopilot tutorial.
This section covers recommendations and more in-depth topics around Autopilot.
Prerequisites
- `raft_protocol` needs to be set to 3 on all server nodes. This only applies if you are running Nomad 1.3 or older, where the default Raft protocol version was lower than 3, or if you upgraded from an older version but never raised `raft_protocol` to 3. If you are running an older Nomad version with the parameter set to 2 or lower, you will need to upgrade the Raft protocol version. Refer to this article for more details.
- The `autopilot{}` block and its parameters need to be added to the server node configuration before bootstrapping the cluster. Otherwise, you will need to add the `autopilot{}` block, start the agent, and enable Autopilot through the `nomad operator autopilot set-config` command or through the `/v1/operator/autopilot/configuration` API endpoint. Refer to this article for more information on how to enable Autopilot.
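The two prerequisites above can be sketched as a server agent configuration fragment. This is a minimal illustration rather than a complete configuration: the file name and the `bootstrap_expect` value are assumptions, and the `autopilot` block carries only one parameter here because the individual options are covered in the recommended configuration options later in this section.

```hcl
# server.hcl (hypothetical file name): Nomad server agent configuration fragment

server {
  enabled          = true
  bootstrap_expect = 3 # assumed cluster size, for illustration only

  # Explicitly pin Raft protocol version 3 on every server node.
  raft_protocol = 3
}

# Present before the cluster is bootstrapped, so Autopilot starts with these settings.
autopilot {
  cleanup_dead_servers = true
}
```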
We recommend a build pipeline that uses Packer to create golden images, Terraform to deploy the machines, and a configuration management tool such as Ansible to perform any OS-level operations after VM provisioning. The pipeline helps you deploy Nomad VMs with a new Nomad version, or roll out changes to the OS or Nomad configuration. See the Using Terraform to Configure Nomad section for more details and recommendations on the deployment process.
Example pipeline
Below is an example pipeline of what the upgrade process may look like with Autopilot, Packer, and HCP Terraform.
- Packer builds a new Nomad image and stores it in an image registry. This example uses an AWS EC2 AMI, but the same principles apply to vSphere Content Library, Azure Compute Gallery, and similar services.
- HCP Terraform uses a new workspace to deploy a new set of Nomad server VMs.
Figure 1: Terraform uses a new workspace to deploy a new set of Nomad servers.
- The new 1.8 nodes join the existing 1.7 cluster.
Figure 2: The new nodes are joining the cluster.
- Autopilot then automatically demotes the old server nodes to non-voting members, so they no longer participate in the quorum.
Figure 3: Autopilot automatically demotes the old server nodes.
- Run `nomad operator raft list-peers` to confirm that the Nomad nodes from "Workspace A" do not include the leader and are no longer voting members. Once you have confirmed that the Nomad instances deployed from "Workspace A" are non-voting, you can run `terraform destroy` to remove the old instances.
Recommended Autopilot configuration options
Cleanup Dead Server
The `cleanup_dead_servers` parameter ensures that dead servers are automatically removed from the cluster. This is crucial for maintaining a healthy cluster state.
autopilot {
  cleanup_dead_servers = true
}
Tip
Always set this to `true` to prevent dead servers from causing issues in the cluster.
Last Contact Threshold
The `last_contact_threshold` parameter defines the maximum allowed time since the last contact with a server before it is considered unhealthy.
autopilot {
  last_contact_threshold = "200ms"
}
Tip
Set this to a low value (e.g., `200ms`) to quickly detect and handle network partitions or server failures.
Server Stabilization Time
The `server_stabilization_time` parameter defines the time a server must be stable before it is considered healthy.
autopilot {
  server_stabilization_time = "10s"
}
Tip
Set this to a reasonable value (e.g., `10s`) to ensure servers are stable before being marked as healthy.
Max Trailing Logs
The `max_trailing_logs` parameter specifies the maximum number of log entries a server can trail behind the leader before it is considered unhealthy.
autopilot {
  max_trailing_logs = 250
}
Tip
Start with 250 and adjust this value based on your cluster's workload and log generation rate. A higher value (for example, 1000) may be necessary for high-throughput environments. Monitor the cluster's performance and adjust the value based on your observations. If servers are frequently marked as unhealthy for exceeding the maximum trailing logs, consider increasing the value.
Server Stabilization Time
The `server_stabilization_time` parameter specifies the minimum duration a server must be stable before being added to the cluster. The appropriate value depends on your server hardware and startup time.
autopilot {
  server_stabilization_time = "10s"
}
Tip
For servers with fast startup times and stable hardware, a lower value like `10s` can be used to quickly add new servers to the cluster. For servers with slower startup times, or if you want to give servers more time to stabilize before joining the cluster, a higher value like `30s` can be used. Observe the cluster's behavior during server additions and adjust if necessary.
Enable Redundancy Zones
The `enable_redundancy_zones` parameter allows you to enable redundancy zones, which improve the fault tolerance of your Nomad cluster by distributing voting servers across distinct failure domains. It's recommended to name the `redundancy_zone` in your server configuration the same as the underlying availability zone.
autopilot {
  enable_redundancy_zones = true
}
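Each server declares its own zone through `redundancy_zone` in its `server` block. The fragment below is a minimal sketch for a single server, assuming an AWS deployment; the zone name `us-east-1a` is a hypothetical value chosen to mirror the availability zone the VM runs in, and servers in other zones would carry their own names.

```hcl
server {
  enabled = true

  # Assumption: named after the availability zone this VM runs in.
  redundancy_zone = "us-east-1a"
}

autopilot {
  enable_redundancy_zones = true
}
```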
Tip
Enable this feature to improve the resilience of your cluster by distributing servers across different failure domains. When enabling redundancy zones, ensure that you have sufficient servers distributed across multiple zones to maintain quorum and avoid data loss.
Disable Upgrade Migration
The `disable_upgrade_migration` parameter controls whether Autopilot's automatic upgrade migration is disabled.
autopilot {
  disable_upgrade_migration = false
}
Tip
Set this to `false` to keep automatic upgrade migration enabled, so Autopilot moves the cluster to newer-versioned servers as they join.
Disable Upgrade Migration and Enable Custom Upgrades
This section covers the `disable_upgrade_migration` and `enable_custom_upgrades` parameters together.
Tip
Keep both set to `false` unless you have specific upgrade requirements.
- By default, Autopilot's upgrade migration strategy is enabled and custom upgrades are disabled. This allows Autopilot to automatically manage the upgrade process.
- If you have specific upgrade requirements, or want to control the upgrade process manually because a new image carries only configuration changes and no newer Nomad version, you can set `disable_upgrade_migration` to `true` and `enable_custom_upgrades` to `true`. This allows you to implement your own upgrade logic.
- For most cases, it's recommended to rely on Autopilot's default upgrade migration strategy for a seamless and automated upgrade experience.
- If you choose to enable custom upgrades, ensure that `upgrade_version` is set according to your own versioning scheme, so that you can increment the version tag on future images.
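As a rough sketch of the custom upgrade settings, the fragment below pairs `enable_custom_upgrades` with an `upgrade_version` in the `server` block. The value `1.0.1` is a hypothetical version in X.Y.Z form that you would increment for each new image. Only the two settings that carry the custom version are shown; `disable_upgrade_migration` is omitted and should be set according to the guidance above.

```hcl
server {
  enabled = true

  # Hypothetical custom version; increment it for each new image you roll out.
  upgrade_version = "1.0.1"
}

autopilot {
  # Compare servers by upgrade_version instead of the Nomad version.
  enable_custom_upgrades = true
}
```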
Summary
Note
Optimal settings may vary based on your specific environment, workload, and requirements. Monitor your Nomad cluster's performance, stability, and behavior, and make adjustments as needed. Test changes to Autopilot parameters thoroughly and roll them out gradually in a staging environment before applying them to your production cluster, so that you can validate the settings and ensure smooth operation.
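As a starting point for that testing, the recommended values from this section can be collected into a single `autopilot` block. This is a sketch of the suggestions above rather than a definitive configuration; every value is one already named in this section and should be tuned for your environment.

```hcl
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 250   # consider a higher value for high-throughput clusters
  server_stabilization_time = "10s" # consider "30s" for slower-starting servers

  # The settings below are Nomad Enterprise features.
  enable_redundancy_zones   = true
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}
```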