Consul
Consistency
This page provides conceptual information about Consul's anti-entropy mechanism, which keeps Consul catalog results consistent across nodes in a datacenter.
Introduction
Entropy is the tendency of systems to become increasingly disordered over time. Consul includes anti-entropy mechanisms to counter this tendency and keep the state of the cluster ordered, even when cluster components fail.
In Consul, there is a distinction between the global service catalog and the agent's local state. Agents forward information about services and their registered health checks to the leader node in the cluster, which replicates the authoritative global service catalog to the other server nodes. As a result, any node in a Consul cluster may have catalog information that differs from the other nodes at a specific moment in time.
Consul's anti-entropy mechanism reconciles catalog differences by periodically synchronizing the local agent state with the catalog.
For example, when a user registers a new service or check with the agent, the agent notifies the leader that this new check exists, and the leader updates the catalog. Similarly, when a check is deleted from the agent, it notifies the leader to update the catalog. Using this information, the catalog can respond intelligently to queries about its nodes and services based on their availability.
Consul treats the state of the agent as authoritative. If there are any differences between the agent's view and the catalog view, the agent uses its local view.
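The reconciliation described above can be pictured as a diff between the agent's local state and the catalog, with the agent's view winning. The following is a minimal sketch of that idea, not Consul's actual implementation; the function and data shapes are hypothetical:

```python
def reconcile(local_state, catalog_state):
    """Compute the catalog updates needed to make the catalog match
    the agent's local state. The agent's view is authoritative:
    services missing from the catalog are registered, services the
    catalog has but the agent does not are deregistered, and any
    mismatched entries are overwritten with the local definition."""
    ops = []
    for service_id, definition in local_state.items():
        if catalog_state.get(service_id) != definition:
            ops.append(("register", service_id, definition))
    for service_id in catalog_state:
        if service_id not in local_state:
            ops.append(("deregister", service_id, None))
    return ops
```

For example, if the agent knows about `web` on port 80 but the catalog lists it on port 8080 plus a stale `old` service, reconciliation emits a `register` for `web` and a `deregister` for `old`.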
Periodic synchronization
Consul's anti-entropy mechanism is a long-running process. In addition to detecting agent changes, it periodically syncs service and health check information to the catalog. This sync ensures that the catalog closely matches the agent's actual state.
This capability also allows Consul to re-populate the service catalog, even in the case of complete data loss.
The amount of time between periodic anti-entropy runs varies based on cluster size. The following table describes the relationship between cluster size, counted by the number of nodes in the cluster, and sync interval:
| Cluster Size | Periodic Sync Interval |
|---|---|
| 1 - 128 | 1 minute |
| 129 - 256 | 2 minutes |
| 257 - 512 | 3 minutes |
| 513 - 1024 | 4 minutes |
| ... | ... |
These intervals are approximate. To avoid too many nodes syncing at one time, each Consul agent randomly chooses a staggered start time within the interval window.
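The intervals in the table grow roughly with the logarithm of cluster size. The sketch below reproduces the table and the staggered start; the formula is inferred from the table above, not taken from Consul's source:

```python
import math
import random

def sync_interval_minutes(cluster_size):
    """Approximate the periodic anti-entropy sync interval.
    Up to 128 nodes, the base interval of 1 minute applies; beyond
    that, the interval grows logarithmically with cluster size,
    matching the table: 1-128 -> 1 min, 129-256 -> 2 min,
    257-512 -> 3 min, 513-1024 -> 4 min, and so on."""
    if cluster_size <= 128:
        return 1
    return math.ceil(math.log2(cluster_size) - math.log2(128)) + 1

def staggered_start_seconds(interval_minutes):
    """Each agent picks a random offset within the interval window,
    so nodes do not all sync with the servers at the same moment."""
    return random.uniform(0, interval_minutes * 60)
```

For instance, `sync_interval_minutes(1000)` returns 4, and each agent in that cluster would begin its run at a random point inside the 4-minute window.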
Synchronization failures
There are several situations in which Consul's anti-entropy sync can fail. These include:
- Agent misconfiguration
- Misconfiguration of the agent's operating environment
- I/O problems, such as a full disk or filesystem permission error
- Networking problems, such as an agent being unable to communicate with the server
If an error is encountered during an anti-entropy run, the agent logs the error and continues to run. Syncs are designed to run periodically to automatically recover from these types of transient failures.
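This recovery behavior can be illustrated with a loop that logs a failed run and relies on the next periodic run to retry. The `sync` callable and cycle count here are hypothetical stand-ins, not Consul internals:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("anti-entropy")

def run_sync_cycles(sync, cycles):
    """Run the periodic sync a fixed number of times. A failed run is
    logged and skipped rather than aborting the agent; the next
    periodic run retries automatically, so transient errors such as a
    full disk or a network blip self-heal."""
    successes = 0
    for i in range(cycles):
        try:
            sync()
            successes += 1
        except OSError as err:
            logger.error("sync run %d failed: %s; retrying next cycle", i, err)
    return successes
```

A sync that fails once (for example, with "disk full") still succeeds on the subsequent cycles without any operator intervention.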
Consistency modes
When you use Consul's service discovery features to return a registered service instance, Consul forwards the request to the cluster's leader by default. That way, Consul returns the most recent authoritative results from the catalog.
You can change a Consul agent's consistency mode so that the agent returns services with stronger or weaker consistency guarantees, depending on the needs of the workloads in your service networking environment.
There are three consistency modes for agents to return catalog information:
- `default` - To return accurate results as quickly as possible, agents forward catalog read requests to the cluster leader. In Raft, agents use leader leasing, which provides a set time window during which the leader assumes its role is stable. If an election occurs before the leader leasing window is complete, the old leader continues to service read requests on behalf of the entire cluster. Therefore, Consul may occasionally return a stale result, but it processes reads faster in exchange.
- `consistent` - This mode is strongly consistent without caveats. It requires the leader to verify with a quorum of peers that it is still the leader before it returns results. This mode introduces additional traffic to all server nodes as a result. For read requests, results are always consistent, but requests have additional latency.
- `stale` - This mode allows any server to service the read, regardless of whether it is the leader. Reads become faster and more scalable, but are more likely to return stale values. This mode also allows reads without a leader, meaning that Consul servers can still respond to requests during an outage.
For more information, refer to Consistency modes in the HTTP API documentation.
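On the HTTP API, the mode is selected per request: read endpoints accept the `?consistent` and `?stale` query parameters, and omitting both uses the default mode. The helper below builds such request URLs; the function name and base address are illustrative, not part of Consul:

```python
def catalog_read_url(path, mode="default", base="http://127.0.0.1:8500"):
    """Build a Consul HTTP API read URL for a given consistency mode.
    'default' adds no query parameter; 'consistent' and 'stale'
    append the corresponding parameter to the request."""
    if mode not in ("default", "consistent", "stale"):
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{base}{path}"
    if mode != "default":
        url += f"?{mode}"
    return url
```

For example, `catalog_read_url("/v1/catalog/nodes", "stale")` produces `http://127.0.0.1:8500/v1/catalog/nodes?stale`, which any server can answer even without a leader.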
Jepsen testing
Jepsen is a tool designed to test the partition tolerance of distributed systems. It creates network partitions while fuzzing the system with random operations. The results are analyzed to find out if the system violates any of the consistency properties it claims to have.
As part of our Consul testing, we ran a Jepsen test to determine if any consistency issues could be uncovered. In our testing, Consul gracefully recovered from partitions without introducing any consistency issues.
Test output
The following output was captured during our Jepsen testing.
$ lein test :only jepsen.system.consul-test
lein test jepsen.system.consul-test
INFO jepsen.os.debian - :n5 setting up debian
INFO jepsen.os.debian - :n3 setting up debian
INFO jepsen.os.debian - :n4 setting up debian
INFO jepsen.os.debian - :n1 setting up debian
INFO jepsen.os.debian - :n2 setting up debian
INFO jepsen.os.debian - :n4 debian set up
INFO jepsen.os.debian - :n5 debian set up
INFO jepsen.os.debian - :n3 debian set up
INFO jepsen.os.debian - :n1 debian set up
INFO jepsen.os.debian - :n2 debian set up
INFO jepsen.system.consul - :n1 consul nuked
INFO jepsen.system.consul - :n4 consul nuked
INFO jepsen.system.consul - :n5 consul nuked
INFO jepsen.system.consul - :n3 consul nuked
INFO jepsen.system.consul - :n2 consul nuked
INFO jepsen.system.consul - Running nodes: {:n1 false, :n2 false, :n3 false, :n4 false, :n5 false}
INFO jepsen.system.consul - :n2 consul nuked
INFO jepsen.system.consul - :n3 consul nuked
INFO jepsen.system.consul - :n4 consul nuked
INFO jepsen.system.consul - :n5 consul nuked
INFO jepsen.system.consul - :n1 consul nuked
INFO jepsen.system.consul - :n1 starting consul
INFO jepsen.system.consul - :n2 starting consul
INFO jepsen.system.consul - :n4 starting consul
INFO jepsen.system.consul - :n5 starting consul
INFO jepsen.system.consul - :n3 starting consul
INFO jepsen.system.consul - :n3 consul ready
INFO jepsen.system.consul - :n2 consul ready
INFO jepsen.system.consul - Running nodes: {:n1 true, :n2 true, :n3 true, :n4 true, :n5 true}
INFO jepsen.system.consul - :n5 consul ready
INFO jepsen.system.consul - :n1 consul ready
INFO jepsen.system.consul - :n4 consul ready
INFO jepsen.core - Worker 0 starting
INFO jepsen.core - Worker 2 starting
INFO jepsen.core - Worker 1 starting
INFO jepsen.core - Worker 3 starting
INFO jepsen.core - Worker 4 starting
INFO jepsen.util - 2 :invoke :read nil
INFO jepsen.util - 3 :invoke :cas [4 4]
INFO jepsen.util - 0 :invoke :write 4
INFO jepsen.util - 1 :invoke :write 1
INFO jepsen.util - 4 :invoke :cas [4 0]
INFO jepsen.util - 2 :ok :read nil
INFO jepsen.util - 4 :fail :cas [4 0]
(Log Truncated...)
INFO jepsen.util - 4 :invoke :cas [3 3]
INFO jepsen.util - 4 :fail :cas [3 3]
INFO jepsen.util - :nemesis :info :stop nil
INFO jepsen.util - :nemesis :info :stop "fully connected"
INFO jepsen.util - 0 :fail :read nil
INFO jepsen.util - 1 :fail :write 0
INFO jepsen.util - :nemesis :info :stop nil
INFO jepsen.util - :nemesis :info :stop "fully connected"
INFO jepsen.core - nemesis done
INFO jepsen.core - Worker 3 done
INFO jepsen.util - 1 :invoke :read nil
INFO jepsen.core - Worker 2 done
INFO jepsen.core - Worker 4 done
INFO jepsen.core - Worker 0 done
INFO jepsen.util - 1 :ok :read 3
INFO jepsen.core - Worker 1 done
INFO jepsen.core - Run complete, writing
INFO jepsen.core - Analyzing
(Log Truncated...)
INFO jepsen.core - Analysis complete
INFO jepsen.system.consul - :n3 consul nuked
INFO jepsen.system.consul - :n2 consul nuked
INFO jepsen.system.consul - :n4 consul nuked
INFO jepsen.system.consul - :n1 consul nuked
INFO jepsen.system.consul - :n5 consul nuked
1964 element history linearizable. :D
Ran 1 tests containing 1 assertions.
0 failures, 0 errors.
We ran Jepsen multiple times, and Consul passed each time. This output is only representative of a single run and has been edited for length.