Vault supports a number of Storage options for the durable storage of Vault's information. Each backend has pros, cons, advantages, and trade-offs. For example, some backends support high availability while others provide a more robust backup and restoration process.
As of Vault 1.4 an Integrated Storage option is offered. This storage backend does not rely on any third party systems, it implements high availability, supports Enterprise Replication features, and provides backup/restore workflows.
Vault's Integrated Storage uses a consensus protocol to provide Consistency (as defined by CAP). The consensus protocol is based on "Raft: In search of an Understandable Consensus Algorithm". For a visual explanation of Raft, see The Secret Lives of Data.
Raft is a consensus algorithm that is based on Paxos. Compared to Paxos, Raft is designed to have fewer states and a simpler, more understandable algorithm.
There are a few key terms to know when discussing Raft:
Log - The primary unit of work in a Raft system is a log entry. The problem of consistency can be decomposed into a replicated log. A log is an ordered sequence of entries. Entries includes any cluster change: adding nodes, adding services, new key-value pairs, etc. We consider the log consistent if all members agree on the entries and their order.
FSM - Finite State Machine. An FSM is a collection of finite states with transitions between them. As new logs are applied, the FSM is allowed to transition between states. Application of the same sequence of logs must result in the same state, meaning behavior must be deterministic.
Peer set - The peer set is the set of all members participating in log replication. For Vault's purposes, all server nodes are in the peer set of the local cluster.
Quorum - A quorum is a majority of members from a peer set: for a set of size
n, quorum requires at least
(n+1)/2members. For example, if there are 5 members in the peer set, we would need 3 nodes to form a quorum. If a quorum of nodes is unavailable for any reason, the cluster becomes unavailable and no new logs can be committed.
Committed Entry - An entry is considered committed when it is durably stored on a quorum of nodes. Once an entry is committed it can be applied.
Leader - At any given time, the peer set elects a single node to be the leader. The leader is responsible for ingesting new log entries, replicating to followers, and managing when an entry is considered committed. For Vault's purposes, the leader node is also the Active vault node and followers are standby nodes. See the High Availability docs for more information.
Raft is a complex protocol and will not be covered here in detail (for those who desire a more comprehensive treatment, the full specification is available in this paper). We will, however, attempt to provide a high level description which may be useful for building a mental model.
Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start out as a follower. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to a leader. The leader must accept new log entries and replicate to all the other followers.
Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry (from Raft's perspective, a log entry is an opaque binary blob). The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific; in Vault's case, we use BoltDB to maintain cluster state. Vault's writes block until it is both committed and applied.
Obviously, it would be undesirable to allow a replicated log to grow in an unbounded fashion. Raft provides a mechanism by which the current state is snapshotted and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time and then remove all the logs that were used to reach that state. This is performed automatically without user intervention and prevents unbounded disk usage while also minimizing the time spent replaying logs. One of the advantages of using BoltDB is that it allows Vault's snapshots to be very light weight. Since Vault's data is already persisted to disk in BoltDB the snapshot process just needs to truncate the raft logs.
Consensus is fault-tolerant while a cluster has quorum. If a quorum of nodes is unavailable, it is impossible to process log entries or reason about peer membership. For example, suppose there are only 2 peers: A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add or remove a node or to commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B and to restart the remaining node in bootstrap mode.
A Raft cluster of 3 nodes can tolerate a single node failure while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to either run 3 or 5 Vault servers per cluster. This maximizes availability without greatly sacrificing performance. The deployment table below summarizes the potential cluster size options and the fault tolerance of each.
In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to half of the cluster. Thus, performance is bound by disk I/O and network latency.
When getting started, a single Vault server is initialized. At this point the cluster is of size 1 which allows the node to self-elect as a leader. Once a leader is elected, other servers can be added to the peer set in a way that preserves consistency and safety.
The join process is how new nodes are added to the vault cluster, it uses an encrypted challenge/answer workflow. To accomplish this, all nodes in a single raft cluster must share the same seal configuration. If using an Auto Unseal the join process can use the configured seal to automatically decrypt the challenge and respond with the answer. If using a Shamir seal the unseal keys must be provided to the node attempting to join the cluster before it can decrypt the challenge and respond with the decrypted answer.
Since all servers participate as part of the peer set, they all know the current leader. When an API request arrives at a non-leader server, the request is forwarded to the leader.
Similar to other storage backends, data that is written to the Raft log and FSM will be encrypted by Vault's barrier.
Vault does not currently offer automated dead server cleanup. If you wish to
decommission a node, or a node dies and must be replaced, the node must manually
be removed from the cluster with the
Below is a table that shows quorum size and failure tolerance for various cluster sizes. The recommended deployment is either 3 or 5 servers. A single server deployment is highly discouraged as data loss is inevitable in a failure scenario.
|Servers||Quorum Size||Failure Tolerance|