Consul
Consensus
This page provides conceptual information about the Raft protocol and its implementation in Consul. For more information about the underlying architecture, refer to Persistent data backend architecture.
Introduction
Raft is a consensus algorithm that Consul implements to manage distributed datacenter operations. You should have a general understanding of the Raft Consensus Algorithm before learning how Consul implements Raft. Refer to The Secret Lives of Data for an interactive guide on how the Raft protocol works.
In Consul, the Raft protocol operates according to the following concepts:
- Log entries are the fundamental unit of work in a Raft system. A log is an ordered sequence of entries, and entries include any cluster changes, such as adding nodes, registering services, and updating key-value pairs.
- The Raft index records the position of each log entry in an authoritative order. We consider the log consistent when all members agree on the entries and their order.
- The peer set is the set of all members participating in log replication. In Consul, a datacenter's server nodes are the members of the peer set.
- Quorum refers to the majority of a peer set's members. Mathematically, for a set of size N, quorum requires at least `(N/2)+1` members, using integer division. For example, if there are 5 members in the peer set, you need 3 nodes to form a quorum. If a quorum of nodes is unavailable, the Consul cluster becomes unavailable because no new logs can be committed.
Node elections
A Consul server node is always in one of three states:
- Follower
- Candidate
- Leader
All nodes start out as followers. In the follower state, nodes can:
- Accept log entries from the leader
- Cast votes in elections
If a follower receives no entries for a period of time, it self-promotes to the candidate state. In the candidate state, a node requests votes from the other members of the peer set. If a candidate receives votes from a quorum of members, it is promoted to leader.
The leader is the member of the peer set that records the authoritative Raft log and replicates it to the other members. The leader must accept new log entries and replicate them to all followers. If stale reads are not acceptable, Consul also performs all queries on the leader.
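The follower, candidate, and leader transitions above can be sketched as a minimal state machine. This is an illustrative sketch of a single election round, not Consul's internal API; names such as `RaftNode` and `request_vote` are assumptions for the example.

```python
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class RaftNode:
    def __init__(self, name):
        self.name = name
        self.state = FOLLOWER
        self.term = 0
        self.voted_for = None  # each node casts at most one vote per term

    def request_vote(self, candidate_name, term):
        # A newer term resets this node's vote.
        if term > self.term:
            self.term = term
            self.voted_for = None
        # Grant the vote if we have not voted for anyone else this term.
        if term == self.term and self.voted_for in (None, candidate_name):
            self.voted_for = candidate_name
            return True
        return False

def run_election(candidate, peers):
    # Election timeout fired: the follower self-promotes to candidate.
    candidate.state = CANDIDATE
    candidate.term += 1
    candidate.voted_for = candidate.name  # votes for itself
    votes = 1 + sum(p.request_vote(candidate.name, candidate.term) for p in peers)
    quorum = (len(peers) + 1) // 2 + 1
    if votes >= quorum:
        candidate.state = LEADER  # quorum of votes -> promoted to leader
    return votes

nodes = [RaftNode(f"server-{i}") for i in range(5)]
votes = run_election(nodes[0], nodes[1:])
print(nodes[0].state, votes)  # all peers are free to vote, so: leader 5
```

Because every node votes at most once per term, at most one candidate can gather a quorum, which is why a term can have at most one leader.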
Log entries
When a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry. The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers.
An entry is considered committed when it is durably stored on a quorum of nodes. After an entry is committed, it can be applied to a finite state machine. Consul blocks writes until the entry is both committed and applied. This configuration enables consistent mode for HTTP API queries.
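The commit rule described above can be sketched in a few lines: the leader replicates an entry, counts durable acknowledgments, and applies the entry to the state machine only once a quorum has stored it. This is illustrative pseudologic under assumed names (`Peer`, `KVStore`), not Consul's implementation.

```python
class Peer:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.log = []

    def store(self, entry):
        # A peer acknowledges only after durably storing the entry.
        if self.healthy:
            self.log.append(entry)
            return True
        return False

class KVStore:
    """Stand-in finite state machine: a simple key-value map."""
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value

def append_and_commit(entry, peers, fsm):
    acks = 1  # the leader writes the entry to its own durable storage first
    acks += sum(p.store(entry) for p in peers)
    quorum = (len(peers) + 1) // 2 + 1
    if acks >= quorum:
        fsm.apply(entry)  # committed: now safe to apply to the FSM
        return True
    return False          # no quorum: the write blocks and cannot commit

fsm = KVStore()
peers = [Peer(), Peer(), Peer(healthy=False), Peer()]  # 5 servers incl. leader
print(append_and_commit(("config/ttl", "30s"), peers, fsm))  # True: 4 of 5 acks
```

Note that one unhealthy peer does not prevent the commit: 4 of 5 acknowledgments still exceeds the quorum of 3.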
Consul uses MemDB to maintain a consistent log. One of the advantages of using MemDB is that it allows Consul to continue accepting new transactions even while capturing the old state in a snapshot, which prevents any availability issues. Snapshots of the cluster's state compact the logs to prevent unbounded growth in their size and the resources required to replicate them to other nodes.
Quorum
Raft is fault tolerant up to the point where quorum is available. For example, a Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures.
When a quorum of nodes is unavailable, it is impossible to process log entries or reason about peer membership. Typically this situation requires manual intervention to re-establish a leader.
The following table shows quorum size and fault tolerance for various cluster sizes. We recommend either 3 or 5 servers for production deployments. Restrict single-server datacenters to development scenarios.
| Servers | Quorum size | Fault tolerance |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 7 | 4 | 3 |
| 8 | 5 | 3 |
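The table follows directly from the quorum formula: quorum is `(N/2)+1` with integer division, and fault tolerance is the number of servers that can fail while a quorum remains, `N - quorum`. A quick check that reproduces the table:

```python
# Reproduce the quorum table from the two formulas.
for n in range(1, 9):
    quorum = n // 2 + 1        # integer division: (N/2)+1
    tolerance = n - quorum     # failures the cluster can absorb
    print(f"{n} servers: quorum {quorum}, fault tolerance {tolerance}")
```

This also shows why even cluster sizes are discouraged: going from 3 to 4 servers (or 5 to 6) raises the quorum size without improving fault tolerance.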
For more information, refer to quorum size, which summarizes the potential cluster size options and their fault tolerance.
Raft in Consul operations
In Consul, only Consul server nodes are part of the peer set. Because client agents are not part of the peer set, a group of server agents can support thousands of client nodes, without scale affecting the cluster's Raft performance.
Bootstrap mode
To start a new datacenter, you must place a single Consul server in bootstrap mode. This mode allows it to self-elect as a leader. Once a leader is elected, other servers can join the peer set in a way that preserves consistency. After the first set of servers joins, you can disable bootstrap mode. For more information, refer to bootstrap a Consul datacenter.
Catalog transactions
Servers that are not the leader forward requests for catalog information or updates to the catalog to the leader. If the request is read-only, the leader generates the result based on the current state of the Raft log. If the request is an update, meaning it modifies state, the leader generates a new log entry and applies it using Raft. After the log entry is committed and applied, the transaction is complete.
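The two request paths can be sketched as follows. The `Server` class and its `handle` method are hypothetical names for illustration, not Consul's RPC interface, and the write path here stands in for the full commit-and-apply cycle described earlier.

```python
class Server:
    def __init__(self, is_leader, leader=None):
        self.is_leader = is_leader
        self.leader = leader
        self.state = {}  # stand-in for the catalog's committed state

    def handle(self, request):
        if not self.is_leader:
            # Servers that are not the leader forward the request.
            return self.leader.handle(request)
        if request["type"] == "read":
            # Read-only query: answered from the leader's current state.
            return self.state.get(request["key"])
        # Update: in Consul this becomes a new Raft log entry that must be
        # committed and applied before the transaction is complete.
        self.state[request["key"]] = request["value"]
        return "ok"

leader = Server(is_leader=True)
follower = Server(is_leader=False, leader=leader)
follower.handle({"type": "write", "key": "web", "value": "10.0.0.1"})
print(follower.handle({"type": "read", "key": "web"}))  # 10.0.0.1
```

Because both paths converge on the leader, every client sees results derived from the single authoritative log, regardless of which server received the request.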
Additional information
For more information about the Raft protocol, refer to the following external resources: