Integrated Raft storage

Vault supports several options for durable information storage. Each backend has its own advantages and trade-offs. For example, some backends support high availability while others provide a more robust backup and restoration process. Integrated storage is a "built-in" storage option that supports backup/restore workflows, high availability, and Enterprise replication features without relying on third-party systems.

Raft protocol overview

Tip

The Secret Lives of Data has a nice visual explanation of Raft storage.

Raft storage uses a consensus protocol based on Paxos and the work in "Raft: In Search of an Understandable Consensus Algorithm" to provide CAP consistency.

Raft performance is bound by disk I/O and network latency, and is comparable to that of Paxos. With stable leadership, committing a log entry requires a single round trip to half of the peer set.

Compared to Paxos, Raft is designed to have fewer states and a simpler, more understandable algorithm that depends on the following elements:

Log - An ordered sequence of entries (replicated log) that tracks cluster changes. For example, writing data is a new event, which creates a corresponding log entry.
Peer set - The set of all members participating in log replication. All server nodes are in the peer set of the local cluster.
Leader - At any given time, the peer set elects a single node to be the leader. Leaders ingest new log entries, replicate the log to followers, and manage when an entry should be committed. Because leaders manage log replication, inconsistencies within replicated log entries may indicate an issue with the leader.
Quorum - A majority of members from a peer set. For a peer set of size N, quorum requires at least ceil( (N + 1) / 2 ) members. For example, quorum in a peer set of 5 members requires 3 nodes. If a cluster cannot achieve quorum, the cluster becomes unavailable and cannot commit new logs.
Committed entry - A log entry that is replicated to a quorum of nodes. Log entries are only applied once they are committed.
Deterministic finite-state machine (DFSM) - A collection of known states with predictable transitions between the states. In Raft, the DFSM transitions between states whenever new logs are applied. By DFSM rules, multiple applications of the same sequence of logs must always result in the same final state.

Node states

Raft nodes are always in one of the following states:

follower - All nodes start as a follower. Followers accept log entries from a leader and cast votes for leader selection.
candidate - A node self-promotes to the candidate state whenever it goes without receiving log entries for a given period of time. During self-promotion, candidates request votes from the rest of their peer set.
leader - Nodes become leaders once they receive a quorum of votes as a candidate.
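The transitions above can be sketched as a small state function. This is a minimal illustration, not Vault's implementation: the event names (`timeout`, `votes_counted`, `leader_seen`) are hypothetical, and the rule that a node returns to follower when it sees another valid leader is an assumption taken from general Raft behavior rather than from the text above.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def quorum(peer_count: int) -> int:
    """Smallest majority of a peer set of the given size."""
    return peer_count // 2 + 1


def next_state(state: State, event: str, votes: int = 0, peer_count: int = 1) -> State:
    """Return the state a node moves to after a (hypothetical) event."""
    if state is State.FOLLOWER and event == "timeout":
        # Nothing arrived from a leader in time: self-promote to candidate.
        return State.CANDIDATE
    if state is State.CANDIDATE and event == "votes_counted":
        # A candidate only becomes leader with a quorum of votes.
        return State.LEADER if votes >= quorum(peer_count) else State.CANDIDATE
    if event == "leader_seen":
        # Assumption: seeing another valid leader returns the node to follower.
        return State.FOLLOWER
    return state
```

For example, in a 5-node peer set a candidate with 3 votes becomes leader, while a candidate with 2 votes stays a candidate.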

Writing logs

With Raft, a log entry is an opaque binary blob. Once the peer set elects a leader, the peer set can accept new log entries. When clients ask the set to append a new log entry, the leader writes the entry to durable storage and tries to replicate the data to a quorum of followers. Once the log entry is committed, the leader applies the log entry to a deterministic finite state machine to maintain the cluster state.
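As a rough sketch of that write path, the toy leader below appends an entry to its own log, counts acknowledgements, and applies the entry to its state machine only once a quorum holds it. The `Leader` class, the `key=value` entry encoding, and the list of follower acknowledgements are all assumptions made for illustration; real Raft replication is asynchronous and tracks per-follower indexes.

```python
class Leader:
    """Toy leader: an in-memory log plus a key/value state machine."""

    def __init__(self) -> None:
        self.log: list[bytes] = []       # stands in for durable log storage
        self.state: dict[str, str] = {}  # stands in for the DFSM

    def append(self, entry: bytes, follower_acks: list[bool]) -> bool:
        """Append an opaque entry; return True if it was committed and applied."""
        self.log.append(entry)                 # the leader writes the entry first
        peer_count = len(follower_acks) + 1    # followers plus the leader
        copies = 1 + sum(follower_acks)        # the leader's own copy counts
        committed = copies >= peer_count // 2 + 1
        if committed:
            # Only committed entries are applied to the state machine.
            key, value = entry.decode().split("=", 1)
            self.state[key] = value
        return committed


leader = Leader()
print(leader.append(b"a=1", [True, True, False, False]))   # 3 of 5 copies
print(leader.append(b"b=2", [True, False, False, False]))  # 2 of 5 copies
```

In this sketch the second entry stays in the leader's log but is never applied, because only 2 of 5 peers hold it.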

Raft in Vault

Vault uses BoltDB as the deterministic finite-state machine and blocks writes until they are both committed and applied.

Compacting logs

To avoid unbounded growth in the replicated logs, Raft saves the current state to snapshots then compacts the associated logs. Because the finite-state machine is deterministic, restoring a snapshot of the DFSM always results in the same state as replaying the sequence of logs associated with the snapshot. Taking snapshots lets Raft capture the DFSM state at any point in time and then remove the logs used to reach that state, thereby compacting the log data.

Raft in Vault

Vault compacts logs automatically to prevent unbounded disk usage while also minimizing the time spent replaying logs. Using BoltDB as the DFSM also keeps the Vault snapshots lightweight. Because the Vault data is already persisted to disk in BoltDB, the snapshot process just needs to truncate the Raft logs.

Quorum

Raft consensus is fault-tolerant when a peer set has quorum. However, when a quorum of nodes is not available, the peer set cannot process log entries, elect leaders, or manage peer membership.

For example, suppose there are only 2 peers: A and B. To have quorum, both nodes must participate, so the quorum size is 2. As a result, both nodes must agree before they can commit a log entry. If one of the nodes fails, the remaining node cannot reach quorum; the peer set can no longer add or remove nodes or commit additional log entries. When the peer set can no longer take action, it becomes unavailable. Once a peer set becomes unavailable, it can only be recovered manually by removing the failing node and restarting the remaining node in bootstrap mode so it self-elects as leader.

Raft leadership in Vault

When a single Vault server (node) initializes, it establishes a cluster (peer set) of size 1 and elects itself as leader. Once the cluster has a leader, additional servers can join the cluster using an encrypted challenge/answer workflow. For the join process to work, all nodes in a single Raft cluster must share the same seal configuration. If the cluster is configured to use auto-unseal, the join process automatically decrypts the challenge and responds with the answer using the configured seal. For other seal options, like a Shamir seal, nodes must have access to the unseal keys before joining so they can decrypt the challenge and respond with the decrypted answer.

In a high availability configuration, the active Vault node is the leader node and all standby nodes are followers.

Leadership elections

Nodes become the Raft leader through Raft leadership elections.

All nodes in a Raft cluster start as followers. Followers monitor leader health through a leader heartbeat. If a follower does not receive a heartbeat within the configured heartbeat timeout, the node becomes a candidate. Candidates watch for election notices from other nodes in the cluster. If the election timeout period expires, the candidate starts an election for leader. If the candidate gets responses from a quorum of other nodes in the cluster, the candidate becomes the new leader node.

Raft leaders may step down voluntarily if the node cannot connect to a quorum of nodes within the leader lease timeout period.

The relevant timeout periods (heartbeat timeout, election timeout, leader lease timeout) scale according to the performance_multiplier setting in your Vault configuration. By default, the performance_multiplier is 5, which translates to the following timeout values:

Timeout	Default duration
Heartbeat timeout	5 seconds
Election timeout	5 seconds
Leader lease timeout	2.5 seconds
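To see how other multiplier values would change these timeouts, the sketch below scales the defaults from the table. It assumes the timeouts scale linearly with `performance_multiplier`, which is inferred from the default values above; the function name `scaled_timeouts` is hypothetical.

```python
DEFAULT_MULTIPLIER = 5

# Default durations in seconds, copied from the table above.
DEFAULT_TIMEOUTS = {
    "heartbeat": 5.0,
    "election": 5.0,
    "leader_lease": 2.5,
}


def scaled_timeouts(multiplier: int) -> dict[str, float]:
    """Estimate timeouts for a multiplier, assuming linear scaling."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return {
        name: seconds / DEFAULT_MULTIPLIER * multiplier
        for name, seconds in DEFAULT_TIMEOUTS.items()
    }


for m in (1, 5, 10):
    print(m, scaled_timeouts(m))
```

Under this assumption, lowering the multiplier makes the cluster detect a failed leader sooner but also makes it more sensitive to slow disks and network jitter.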

We recommend using the default multiplier unless one of the following is true:

Platform telemetry strongly indicates the default behavior is insufficient.
The reliability of your platform or network requires different behavior.

BoltDB Raft logs

BoltDB is a single file database, which means BoltDB cannot shrink the file on disk to recover space when you delete data. Instead, BoltDB notes the places where the deleted data was stored on a "freelist". On subsequent writes, BoltDB consults the freelist to reuse old pages before allocating new space to persist the data.
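The freelist behavior can be pictured with a toy paged file. The `PagedFile` class below is a simplified model written for this explanation and does not reflect BoltDB's actual data structures: deleting a page records it on the freelist, later writes reuse freed pages first, and the file size never shrinks.

```python
class PagedFile:
    """Toy single-file store that reuses freed pages instead of shrinking."""

    def __init__(self) -> None:
        self.pages: list = []           # page contents; None marks a freed page
        self.freelist: list[int] = []   # indexes of pages available for reuse

    def write(self, data) -> int:
        """Store data in a freed page if one exists, otherwise grow the file."""
        if self.freelist:
            index = self.freelist.pop()
            self.pages[index] = data
        else:
            self.pages.append(data)
            index = len(self.pages) - 1
        return index

    def delete(self, index: int) -> None:
        """Free a page without shrinking the file."""
        self.pages[index] = None
        self.freelist.append(index)

    @property
    def size(self) -> int:
        """Number of pages allocated on disk, including freed ones."""
        return len(self.pages)


f = PagedFile()
ids = [f.write(x) for x in ("a", "b", "c")]
f.delete(ids[1])
assert f.size == 3            # deleting does not shrink the file
assert f.write("d") == ids[1] # the freed page is reused before growing
```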

BoltDB requires careful tuning

On Vault clusters with high churn, the BoltDB freelist can become quite large and the database file can become highly fragmented. Large freelists and fragmented database files can slow BoltDB transactions and directly impact the performance of your Vault cluster.
On busy Vault clusters, where new followers struggle to sync Raft snapshots before receiving subsequent snapshots from the leader, the BoltDB file is susceptible to sudden bursts of writes. Not only may new followers fail to join quorum, but Vault installations that do not provide for spiky file growth, or that over-allocate and waste disk space, will likely see poor performance.

Write-ahead Raft logs

Experimental

Experimental features are tested but unproved. Until the feature is verified through heavy production use, proceed with caution.

By default, Vault uses the raft-boltdb library for BoltDB to store Raft logs, but you can also configure Vault to use the raft-wal library for write-ahead Raft logs.

Library	Filename(s)	Storage directory
`raft-boltdb`	`raft.db`	`raft`
`raft-wal`	`wal-meta.db`, `XXXXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX.wal`	`raft/wal`

The raft-wal library is designed specifically for storing Raft logs. Rather than using a freelist like raft-boltdb, raft-wal maintains a directory of files as its data store and compacts data over time to free up space when a given file is no longer needed.

Storing data as files in a directory also means that the raft-wal library can easily increase or decrease the number of logs retained by leaders before truncating and compacting without risking poor performance from spiky writes.

Quorum management in Vault

With autopilot

With the autopilot feature, Vault uses a configurable set of parameters to confirm a node is healthy before considering it an eligible voter in the quorum list.

Autopilot is enabled by default and includes stabilization logic for nodes joining the cluster:

A node joins the cluster as a non-voter.
The joined node syncs with the current Raft index.
Once the configured stability threshold is met, the node becomes a full voting member of the cluster.
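The stabilization steps above can be sketched as a promotion check. The names `JoiningNode`, `maybe_promote`, `stability_threshold_s`, and `max_lag` are hypothetical and chosen for illustration; they are not Vault's Autopilot parameter names, and the real health checks consider more than the two conditions modeled here.

```python
from dataclasses import dataclass


@dataclass
class JoiningNode:
    raft_index: int        # last Raft index the node has replicated
    healthy_for_s: float   # how long the node has been continuously healthy
    voter: bool = False    # nodes join as non-voters


def maybe_promote(node: JoiningNode, leader_index: int,
                  stability_threshold_s: float, max_lag: int = 0) -> bool:
    """Promote a non-voter once it is in sync and has been stable long enough."""
    in_sync = leader_index - node.raft_index <= max_lag
    stable = node.healthy_for_s >= stability_threshold_s
    if in_sync and stable:
        node.voter = True
    return node.voter


node = JoiningNode(raft_index=900, healthy_for_s=3.0)
print(maybe_promote(node, leader_index=1000, stability_threshold_s=10.0))  # still syncing
node.raft_index, node.healthy_for_s = 1000, 12.0
print(maybe_promote(node, leader_index=1000, stability_threshold_s=10.0))  # promoted
```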

Verify your stability threshold is appropriate

Setting the stability threshold too low can lead to cluster instability because nodes may begin voting before they are fully in sync with the Raft index.

Autopilot also includes a dead server cleanup feature. When you enable dead server cleanup with the Autopilot API, Vault automatically removes unhealthy nodes from the Raft cluster without manual operator intervention.

Without autopilot

Without autopilot, when a node joins a Raft cluster, the node tries to catch up with the peer set just by replicating data received from the leader. While the node is in the initial synchronization state, it cannot vote, but is counted for the purposes of quorum. If multiple nodes join the cluster simultaneously (or within a small enough window) the cluster may exceed the expected failure tolerance, quorum may be lost, and the cluster can fail.

For example, consider a 3-node cluster with a large amount of data and a failure tolerance of 1. If 3 nodes join the cluster at the same time, the cluster size becomes 6 with an expected failure tolerance of 2. But 3 of the nodes are still synchronizing and cannot vote, which means the cluster loses quorum.
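The arithmetic in this example can be checked with a short sketch. It assumes, as described above, that syncing nodes count toward cluster size but cannot vote; the function names are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Majority of the peer set."""
    return cluster_size // 2 + 1


def join_outcome(synced_voters: int, syncing_joiners: int) -> dict:
    """Describe quorum while newly joined nodes are still synchronizing."""
    cluster_size = synced_voters + syncing_joiners
    needed = quorum_size(cluster_size)
    return {
        "cluster_size": cluster_size,
        "quorum_size": needed,
        "has_quorum": synced_voters >= needed,
        # Synced voters that could still fail without losing quorum.
        "voters_that_can_still_fail": max(synced_voters - needed, 0),
    }


print(join_outcome(3, 3))  # quorum size 4 with only 3 voters: quorum is lost
print(join_outcome(3, 1))  # quorum size 3 with 3 voters: quorum holds, no slack
```

Under this model, even a single joining node temporarily removes the cluster's slack until it finishes syncing, which is why adding nodes one at a time and waiting for each to sync is the safer path.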

If you are not using autopilot, we strongly recommend that you ensure all new nodes have Raft indexes that are in sync (or very close to in sync) with the leader before adding additional nodes. You can check the status of current Raft indexes with the vault status CLI command.

Quorum size and failure tolerance

The table below compares quorum size and failure tolerance for various cluster sizes.

Servers	Quorum size	Failure tolerance
1	1	0
2	2	0
3	2	1
4	3	1
5	3	2
6	4	2
7	4	3
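The table values follow directly from the majority rule. The sketch below recomputes them and also checks that the integer form `N // 2 + 1` agrees with the ceil((N + 1) / 2) form of the quorum formula; the function names are illustrative.

```python
import math


def quorum_size(servers: int) -> int:
    """Smallest majority of a cluster with the given number of servers."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Servers that can fail while a quorum remains."""
    return servers - quorum_size(servers)


rows = [(n, quorum_size(n), failure_tolerance(n)) for n in range(1, 8)]

for n, quorum, tolerance in rows:
    # Both ways of writing the majority rule give the same quorum size.
    assert quorum == math.ceil((n + 1) / 2)
    print(f"{n}\t{quorum}\t{tolerance}")
```

Note that moving from an odd cluster size to the next even size raises the quorum size without raising the failure tolerance.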

Best practice

For best performance, we recommend at least 5 servers for a standard production deployment to maintain a minimum failure tolerance of 2. We also recommend maintaining a cluster with an odd number of nodes to avoid voting stalemates.

We strongly discourage single server deployments for production use due to the high risk of data loss during failure scenarios.

To maintain failure tolerance during maintenance and other changes, we recommend sequentially scaling and reverting your cluster, 2 nodes at a time.

For example, if you start with a 5-node cluster:

Scale the cluster to 7 nodes.
Confirm the new nodes are joined and in sync with the rest of the peer set.
Stop or destroy 2 of the older nodes.
Repeat this process 2 more times to cycle out the rest of the pre-existing nodes.
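The cycle above can be simulated to confirm that failure tolerance never drops below 2. This sketch assumes every added node is fully joined and in sync before any node is removed (step 2), and that "older" means the nodes that have been in the cluster longest; the node names are made up.

```python
from collections import deque


def failure_tolerance(servers: int) -> int:
    """Servers that can fail while a majority remains."""
    return servers - (servers // 2 + 1)


def rolling_replace(original: int, batch: int = 2):
    """Add `batch` nodes, then remove the `batch` oldest, until no original node remains."""
    nodes = deque(f"old-{i}" for i in range(1, original + 1))
    history = []  # (cluster size, failure tolerance) after every step
    added = 0
    while any(name.startswith("old-") for name in nodes):
        for _ in range(batch):
            added += 1
            nodes.append(f"new-{added}")   # scale up first
        history.append((len(nodes), failure_tolerance(len(nodes))))
        for _ in range(batch):
            nodes.popleft()                # then retire the oldest nodes
        history.append((len(nodes), failure_tolerance(len(nodes))))
    return list(nodes), history


final_nodes, history = rolling_replace(5)
print(final_nodes)
print(min(tolerance for _, tolerance in history))
```

In the last round only one original node is left, so the second node retired is the oldest of the replacements.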

You should always maintain quorum to limit the impact on failure tolerance when changing or scaling your Vault instance.

Note

Adjust peers as needed during scaling events, because purposeful scaling events that increase or reduce cluster size can happen within a very short time window. If you use Autopilot for automatic dead server cleanup, your scaling window may be shorter than the default dead_server_last_contact_threshold, in which case you need to adjust that value.

Redundancy Zones

If you are using autopilot with redundancy zones, the total number of servers differs from the table above and depends on how many redundancy zones and servers per zone you choose.

The majority of the voting servers in a cluster need to be available to agree on changes in configuration. If a voting node becomes unavailable and that causes the cluster to have fewer voting nodes than the quorum size, then Autopilot will not be able to promote a non-voter to become a voter. This is the failure tolerance of the cluster. Redundancy zones are not able to improve the failure tolerance of a cluster.

Say that you have a cluster configured to have 2 redundancy zones and each zone has 2 servers within it (for a total of 4 nodes in the cluster). The quorum size is 2. If the zone voter in either of the redundancy zones becomes unavailable, the cluster does not have quorum and is not able to agree on the configuration change needed to promote the non-voter in the zone into a voter.

Redundancy zones do improve the optimistic failure tolerance of a cluster. The optimistic failure tolerance is the number of healthy active and back-up voting servers that can fail gradually without causing an outage. If the Vault cluster is able to maintain a quorum of voting nodes, then the cluster has the capability to lose nodes gradually and promote the standby redundancy zone nodes to take the place of voters.

For example, consider a cluster that is configured to have 3 redundancy zones with 2 nodes in each zone. If a voting node becomes unreachable, the zone standby in that zone is promoted. The cluster then maintains 3 voting nodes with 2 remaining standbys. The cluster can handle an additional 2 gradual failures before it loses quorum.

Best practice

If you choose to use redundancy zones, we strongly recommend using at least 3 zones to ensure failure tolerance.

Redundancy zones	Servers per zone	Quorum size	Failure tolerance	Optimistic failure tolerance
2	2	2	0	2
3	2	2	1	3
3	3	2	1	5
5	2	3	2	6