Nomad
Federated cluster operations
This page lists operational considerations for running multi-region federated clusters as well as instructions for migrating the authoritative region to a federated region.
Operational considerations
When operating multi-region federated Nomad clusters, consider the following:
Regular snapshots: You can back up Nomad server state using the nomad operator snapshot save and nomad operator snapshot agent commands. Performing regular backups expedites disaster recovery. The cadence depends on your cluster's rate of change and your internal SLAs. You should regularly test snapshots using the nomad operator snapshot restore command to ensure they work.
Local ACL management tokens: You need local management tokens to perform federated cluster administration when the authoritative region is down. Make sure you have existing break-glass tokens available for each region.
Known paths to creating local ACL tokens: If the authoritative region fails, creation of global ACL tokens fails. If this happens, having the ability to create local ACL tokens allows you to continue to interact with each available federated region.
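The snapshot and break-glass token considerations above can be sketched as two small shell helpers. This is a minimal sketch, not a prescribed procedure: it assumes NOMAD_ADDR and NOMAD_TOKEN are already exported with a management token, and the function names, directory layout, and token naming are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, paths, and token names are hypothetical.
# Assumes NOMAD_ADDR and NOMAD_TOKEN (a management token) are exported.

# Build a timestamped snapshot file name for a region.
snapshot_path() {
  local dir="$1" region="$2"
  printf '%s/nomad-%s-%s.snap\n' "$dir" "$region" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Save a snapshot of one region's server state and print the file path.
# CLI output goes to stderr so that stdout carries only the path.
backup_region() {
  local dir="$1" region="$2" file
  file="$(snapshot_path "$dir" "$region")"
  mkdir -p "$dir"
  nomad operator snapshot save -region="$region" "$file" >&2
  # Inspect the file so an unreadable snapshot is caught at backup time.
  # This does not replace a periodic test restore.
  nomad operator snapshot inspect "$file" >&2
  printf '%s\n' "$file"
}

# Create a local (non-global) management token in one region,
# intended to be stored securely as a break-glass credential.
create_local_management_token() {
  local region="$1" name="$2"
  nomad acl token create -region="$region" -name="$name" \
    -type="management" -global=false
}
```

For example, `backup_region /var/backups/nomad west` followed by `create_local_management_token west break-glass-west` would cover one region. Both calls would need to be repeated per region, and the tokens have to be created while the region can still authenticate the caller.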
Authoritative and federated regions
Can non-authoritative regions continue to operate if the authoritative region is unreachable? Yes. Running workloads are never interrupted due to federation failures, and neither is the scheduling of new workloads or the rescheduling of failed workloads. See Failure Scenarios for details.
Can the authoritative region be deployed with servers only? Yes, deploying the Nomad authoritative region with servers only, without clients, works as expected. This servers-only approach can expedite disaster recovery of the region. Restoring such a region does not involve objects such as nodes, jobs, or allocations, which are large and require compute-intensive reconciliation after restoration.
Can I migrate the authoritative region to a currently federated region? It is possible by following these steps:
- Update the authoritative_region configuration parameter on the desired authoritative region servers.
- Restart the server processes in the new authoritative region and ensure all data is present in state as expected. If the network was partitioned as part of the failure of the original authoritative region, writes of replicated objects may not have been successfully replicated to federated regions.
- Update the authoritative_region configuration parameter on the federated region servers and restart their processes.
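The configuration change in the migration steps can be sketched as a small shell helper that rewrites the authoritative_region value in a server's agent configuration file. This is a sketch under stated assumptions: the function name and file path are hypothetical, the file is assumed to contain exactly one line of the form authoritative_region = "name", and region names are assumed to contain no characters that are special to sed (such as / or &).

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and config path are hypothetical.
# Assumes the server config has one line like:
#   authoritative_region = "old-region"

set_authoritative_region() {
  local config_file="$1" new_region="$2"
  # Stop if the parameter is absent rather than guessing where to add it.
  if ! grep -Eq '^[[:space:]]*authoritative_region[[:space:]]*=' "$config_file"; then
    echo "authoritative_region not found in $config_file" >&2
    return 1
  fi
  # Keep a .bak copy, then replace only the quoted value on that line.
  sed -E -i.bak \
    "s/^([[:space:]]*authoritative_region[[:space:]]*=[[:space:]]*)\"[^\"]*\"/\\1\"${new_region}\"/" \
    "$config_file"
}
```

A call such as `set_authoritative_region /etc/nomad.d/server.hcl west` would be run first on the servers of the new authoritative region, followed by a restart and a check of the replicated state, and only then on the servers of the remaining federated regions. How the server process is restarted depends on how Nomad is supervised in your environment and is not covered by this sketch.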
Can federated regions be bootstrapped while the authoritative region is down? No, they cannot.