High availability

Vault can run in a high availability (HA) mode to protect against outages by running multiple Vault servers.

Design overview

The primary design goal for making Vault Highly Available (HA) is to minimize downtime without affecting horizontal scalability. Vault is bound by the IO limits of the storage backend rather than the compute requirements. Being bound by the IO limits simplifies the HA approach and avoids complex coordination.

Storage backends, such as Integrated Storage, provide additional coordinative functions enabling Vault to run in an HA configuration. Supported by the backend, Vault will automatically run in HA mode without further configuration.

When running in HA mode, Vault servers have two states they can be: standby and active. For multiple Vault servers sharing a storage backend, only a single instance is active at any time. All standby instances are placed in hot standbys.

Only the active server processes all requests; the standby server redirects all requests to an active Vault server.

Meanwhile, if the active server is sealed, fails, or loses network connectivity, then one of the standby Vault server becomes the active instance.

Please note that only unsealed Vault servers may act as a standby. If a server is in a sealed state, it cannot act as a standby. Servers in a sealed state cannot serve any requests if the active server fails.

Performance standby nodes (Enterprise)

Performance Standby Nodes service read-only requests from users or applications. Read-only requests are requests that do not modify Vault's storage. Vault quickly scales its ability to service these operations, providing a near-linear request-per-second scaling for most scenarios and secrets engines like KV and Transit. Traffic is distributed across performance standby nodes, allowing clients to scale these IOPS horizontally, and control high traffic workloads.

If a request comes into a Performance Standby Node that causes a storage to write the request, the request is forwarded to the active server. Read-only requests are serviced locally on the Performance Standby.

Like traditional HA standbys, a Performance Standby Node becomes the active instance when the active node is sealed, fails, or loses network connectivity.

Becoming an active node

An active node in an HA configuration holds an HA lock. When the active node becomes unavailable or steps down because an operator invokes vault operator step-down, all the nodes in the cluster attempt to grab the HA lock. The first node that succeeds in grabbing and holding the lock becomes the new active node.

The HA lock competition process depends on the HA storage backend of the cluster. For example, with raft integrated storage, Vault always gives the HA lock to whichever node wins the Raft leadership election.

After obtaining an HA lock, the node goes through a series of inauguration steps to formally become the active node for its cluster:

Seals the local Vault instance.
Reloads the seal configuration.
Migrates the seal, if needed.
Reloads encryption keys.
Creates a new HA intra-cluster TLS certificate.
Writes an entry to storage advertising its status as the active node.
Unseals the local Vault instance.
Start accepting connections from other nodes in the cluster.

After completing the inauguration steps, the new active node beings responding to new client requests.

HA standby nodes check for active node updates every 2.5 seconds. When the active node changes, standby nodes update their forwarding connection and open a connection to the new active node. The connection will fail until the new active node starts accepting connections from other nodes in the cluster, so standby nodes retry the connection every 5 seconds as a heartbeat, or whenever a new client request arrives that requires forwarding.

With performance standbys enabled, standby nodes also promote themselves as performance standbys the active node updates. The standby requests a certificate and key pair from the active node to receive replicated data. Once the standby node seals, performs the necessary setup tasks, completes its post-unseal setup, it can serve new client requests.

Tutorial

Refer to the following tutorials to learn more.