Failure scenarios
When running Nomad in federated mode, the impact of a failure depends on whether the impacted region is the authoritative region and on the failure mode. In a soft failure, the region's servers have lost quorum, but the Nomad processes are still up, running, and reachable. In a hard failure, the regional servers are completely unreachable, akin to the underlying hardware having been terminated (cloud) or powered off (on-prem).
The scenarios are based on a Nomad deployment running three federated regions:
- asia-south-1
- europe-west-1 - authoritative region
- us-east-1
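Each region in a setup like this runs its own server cluster, and the non-authoritative regions are configured to point at the authoritative region so that replicated objects can be pulled from it. The following is a minimal sketch of what a server agent configuration for a server in asia-south-1 could look like; the datacenter name and bootstrap count are illustrative assumptions, not values taken from the scenarios below.

```hcl
# Sketch of a server agent configuration for the federated region
# asia-south-1. The datacenter name and bootstrap_expect value are
# illustrative assumptions.
region     = "asia-south-1"
datacenter = "asia-south-1-dc1" # hypothetical datacenter name

server {
  enabled          = true
  bootstrap_expect = 3

  # Non-authoritative regions reference the authoritative region so that
  # replicated objects (namespaces, ACL policies, and so on) can be
  # replicated from it.
  authoritative_region = "europe-west-1"
}
```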
Federated region failure: soft
In this situation, the region asia-south-1 has lost leadership, but the servers are reachable and up.
All server logs in the impacted region have entries such as this example.
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=d19e6bb5-5ec9-8f75-9caf-47e2513fe28d error="No cluster leader"
✅ Request forwarding continues to work between all federated regions that are running with leadership.
🟨 API requests to the impacted region, whether made directly or via request forwarding, fail unless they use the stale=true flag.
✅ Creation and deletion of replicated objects, such as namespaces, are written to the authoritative region.
✅ Any federated region with leadership is able to continue replicating all of the objects detailed previously.
✅ Creation of local ACL tokens continues to work for all regions with leadership.
✅ Jobs without the multiregion block deploy to all regions with leadership.
❌ Jobs with the multiregion block defined fail to deploy; a sketch of such a job specification follows this list.
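For context, the multiregion behavior described in the last item applies to job specifications that include a multiregion block, such as the sketch below. Because a multiregion deployment is registered in and coordinated across every listed region, it cannot make progress while any of those regions, here asia-south-1, is without leadership. The job name, datacenter names, task, and image are illustrative placeholders.

```hcl
job "example" {
  # A multiregion deployment is coordinated across every region listed
  # below, so it cannot complete while any of them lacks a leader.
  multiregion {
    strategy {
      max_parallel = 1
      on_failure   = "fail_all"
    }

    region "europe-west-1" {
      count       = 1
      datacenters = ["europe-west-1-dc1"] # hypothetical datacenter
    }

    region "us-east-1" {
      count       = 1
      datacenters = ["us-east-1-dc1"] # hypothetical datacenter
    }

    region "asia-south-1" {
      count       = 1
      datacenters = ["asia-south-1-dc1"] # hypothetical datacenter
    }
  }

  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "example/app:1.0" # placeholder image
      }
    }
  }
}
```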
Federated region failure: hard
In this situation, the region asia-south-1 has gone down. When this happens, the Nomad server logs for the other regions have log entries similar to this example.
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Initiating push/pull sync with: us-east-1-server-1.us-east-1 192.168.1.193:9002
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: asia-south-1-server-1.asia-south-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect asia-south-1-server-1.asia-south-1 has failed, no acks received
✅ Request forwarding continues to work between all federated regions that are running with leadership.
❌ API requests to the impacted region, whether made directly or via request forwarding, fail.
✅ Creation and deletion of replicated objects, such as namespaces, are written to the authoritative region.
✅ Any federated regions with leadership continue to replicate all objects detailed above.
✅ Creation of local ACL tokens continues to work for all regions which are running with leadership.
✅ Jobs without the multiregion block deploy to all regions with leadership.
❌ Jobs with the multiregion block defined fail to deploy.
Authoritative region failure: soft
In this situation, the region europe-west-1 has lost leadership, but the servers are reachable and up.
The server logs in the authoritative region have entries such as this example.
[ERROR] nomad/worker.go:504: worker: failed to dequeue evaluation: worker_id=68b3abe2-5e16-8f04-be5a-f76aebb0e59e error="No cluster leader"
✅ Request forwarding continues to work between all federated regions that are running with leadership.
🟨 API requests to the impacted region, whether made directly or via request forwarding, fail unless they use the stale=true flag.
❌ Creation and deletion of replicated objects, such as namespaces, fail.
🟨 Any federated region is still able to read the data it needs to replicate, because replication uses the stale flag, but no writes to the authoritative region can occur, as described previously.
✅ Creation of local ACL tokens continues to work for all federated regions which are running with leadership.
✅ Jobs without the multiregion block deploy to all federated regions which are running with leadership.
❌ Jobs with the multiregion block defined fail to deploy.
Authoritative region failure: hard
In this situation, the region europe-west-1 has gone down. When this happens, the Nomad server leader logs for the other regions have log entries similar to this example.
[ERROR] nomad/leader.go:544: nomad: failed to fetch namespaces from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1767: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2498: nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader_ent.go:226: nomad: failed to fetch quota specifications from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:703: nomad: failed to fetch node pools from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:1909: nomad: failed to fetch tokens from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2083: nomad: failed to fetch ACL Roles from authoritative region: error="rpc error: EOF"
[DEBUG] nomad/leader_ent.go:84: nomad: failed to fetch policies from authoritative region: error="rpc error: EOF"
[ERROR] nomad/leader.go:2292: nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: EOF"
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
[INFO] go-hclog@v1.6.3/stdlog.go:60: nomad: memberlist: Suspect europe-west-1-server-1.europe-west-1 has failed, no acks received
[DEBUG] go-hclog@v1.6.3/stdlog.go:58: nomad: memberlist: Failed UDP ping: europe-west-1-server-1.europe-west-1 (timeout reached)
✅ Request forwarding continues to work between all federated regions that are running with leadership.
❌ API requests to the impacted region, whether made directly or via request forwarding, fail.
❌ Creation and deletion of replicated objects, such as namespaces, fail.
❌ Any federated region with leadership is unable to replicate the objects detailed in the log example above.
✅ Creation of local ACL tokens continues to work for all regions with leadership.
✅ Jobs without the multiregion block deploy to regions with leadership.
❌ Jobs with the multiregion block defined fail to deploy.