Vault with integrated storage reference architecture

This guide describes recommended best practices for infrastructure architects and operators to follow when deploying Vault using the Integrated Storage (Raft) storage backend in a production environment. You can review common recommended practices for designing resilient systems in the HashiCorp Well Architected Framework.

This guide includes general guidance as well as specific recommendations for popular cloud infrastructure platforms. These recommendations have also been encoded into official Terraform modules for AWS, Azure, and GCP.

Note

If you are deploying Vault to Kubernetes, please refer to the Vault on Kubernetes deployment guide.

Recommended architecture

The following diagram shows the recommended architecture for deploying a single Vault cluster with maximum resiliency:

Recommended architecture diagram

With five nodes in the Vault cluster distributed between three availability zones, this architecture can withstand the loss of two nodes from within the cluster or the loss of an entire availability zone.

If deploying to three availability zones is not possible, the same architecture may be used across two availability zones or within a single one, at the cost of a significant reliability risk if an availability zone fails.

For Vault Enterprise customers, additional resiliency is possible by implementing a multi-cluster architecture, which allows for additional performance and disaster recovery options. See the Multi-Cluster Architecture Guide for more information.

System requirements

This section contains specific hardware capacity recommendations, network requirements, and additional infrastructure considerations. Since every hosting environment is different and every customer's Vault usage profile is different, these recommendations should only serve as a starting point from which each customer's operations staff may observe and adjust to meet the unique needs of each deployment.

Warning

All specifications outlined in this document are minimum recommendations. They make no allowance for vertical scaling, redundancy, or other SRE needs, and they do not account for your user volumes or use cases. Resource requirements are directly proportional to the operations performed by the Vault cluster and to end-user utilization.

Note

To match your requirements and maximize the stability of your Vault instances, run load tests and continue to monitor resource usage as well as all metrics reported by Vault's telemetry.

Hardware sizing for Vault servers

Sizing recommendations have been divided into two common cluster sizes.

Small clusters would be appropriate for most initial production deployments or for development and testing environments.

Large clusters are production environments with a consistently high workload. That might be a large number of transactions, a large number of secrets, or a combination of the two.

Size	CPU	Memory	Disk Capacity	Disk IO	Disk Throughput
Small	2-4 core	8-16 GB RAM	100+ GB	3000+ IOPS	75+ MB/s
Large	4-8 core	32-64 GB RAM	200+ GB	10000+ IOPS	250+ MB/s

For each cluster size, the following table gives recommended hardware specs for each major cloud infrastructure provider.

Provider	Size	Instance/VM Types	Disk Volume Specs
AWS	Small	`m5.large`, `m5.xlarge`	100+GB `gp3`, 3000 IOPS, 125MB/s
	Large	`m5.2xlarge`, `m5.4xlarge`	200+GB `gp3`, 10000 IOPS, 250MB/s
Azure	Small	`Standard_D2s_v3`, `Standard_D4s_v3`	1024GB* `Premium_LRS`
	Large	`Standard_D8s_v3`, `Standard_D16s_v3`	1024GB* `Premium_LRS`
GCP	Small	`n2-standard-2`, `n2-standard-4`	500GB* `pd-balanced`
	Large	`n2-standard-8`, `n2-standard-16`	1000GB* `pd-ssd`

Note

For GCP and Azure recommendations, the disk sizes listed are larger than the minimum size recommended, because for the recommended disk type, available IOPS increases with disk capacity, and the listed sizes are necessary to provision the required IOPS.
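
As a rough illustration of the note above, the following sketch computes the smallest disk that satisfies both a capacity floor and an IOPS target when IOPS grow linearly with disk size. The function name and the rate of 6 IOPS per GB are assumptions, chosen only because they reproduce the 500 GB and 3000 IOPS pairing in the table. Consult your provider's documentation for actual rates and per-disk caps.

```python
import math


def min_disk_gb_for_iops(required_iops: int, iops_per_gb: float, min_capacity_gb: int) -> int:
    """Return the smallest disk size (GB) meeting both capacity and IOPS needs.

    Assumes IOPS scale linearly with provisioned capacity. This is a
    simplification: real disk types also have per-disk and per-VM caps.
    """
    if iops_per_gb <= 0:
        raise ValueError("iops_per_gb must be positive")
    size_for_iops = math.ceil(required_iops / iops_per_gb)
    return max(size_for_iops, min_capacity_gb)


# Hypothetical rate of 6 IOPS per GB; the "Small" profile needs 100+ GB and 3000+ IOPS.
print(min_disk_gb_for_iops(required_iops=3000, iops_per_gb=6, min_capacity_gb=100))  # 500
```

With these assumed inputs the IOPS target, not the capacity floor, determines the disk size.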

Note

For predictable performance on cloud providers, it's recommended to avoid "burstable" CPU and storage options (such as AWS t2 and t3 instance types) whose performance may degrade rapidly under continuous load.

Note

The internal database that Vault uses is optimized for modern SSD drives. Running Vault on magnetic spinning disks will incur a dramatic performance penalty.

Hardware considerations

In general, CPU and storage performance requirements will depend on the customer's exact usage profile (for example, types of requests, average request rate, and peak request rate). Memory requirements depend on the total size of data stored in memory and should be sized according to that data.

When using Integrated Storage, the Vault servers should have a relatively high-performance disk subsystem. If many secrets are generated or rotated frequently, this information must be flushed to disk often, and slower storage systems will significantly impact performance.

In addition, HashiCorp strongly recommends configuring Vault with audit logging enabled. The impact of the additional storage I/O from audit logging will vary depending on your particular pattern of requests. For best performance, audit logs should be written to a separate disk.

Network latency and bandwidth

In order for cluster members to stay properly in sync, network latency between availability zones should be less than eight milliseconds (8 ms).
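
One way to sanity-check the latency target is to time TCP handshakes between nodes in different availability zones. The sketch below is an approximation: a TCP connect takes roughly one network round trip, and the guide does not specify how the 8 ms figure should be measured. The function names and the host name in the usage comment are hypothetical.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """Time a single TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # The connection is closed immediately; only the handshake is timed.
    return (time.perf_counter() - start) * 1000.0


def within_latency_budget(samples_ms: list[float], budget_ms: float = 8.0) -> bool:
    """Return True when the median of the samples is below the budget."""
    return statistics.median(samples_ms) < budget_ms


# Example usage (hypothetical peer address in another availability zone):
#   samples = [tcp_connect_ms("vault-b.example.internal", 8201) for _ in range(20)]
#   print(within_latency_budget(samples))
```

The median is used so that a single slow sample does not fail the check. A stricter check could compare a high percentile against the budget instead.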

The amount of network bandwidth used by Vault will depend entirely on the specific customer's usage patterns. In many cases, even a high request volume will not translate to a large amount of network bandwidth consumption. However, all data written to Vault will be replicated to all cluster members. It's also important to consider bandwidth requirements to other external systems such as monitoring and logging collectors. And finally, a multi-cluster Vault setup will require Vault datasets to be transmitted between clusters to provide Performance and DR Replication.

Network connectivity

The following table outlines the network connectivity requirements for Vault cluster nodes. If general network egress is restricted, particular attention must be paid to granting outgoing access from the Vault servers to any external integration providers (for example, authentication and secret provider backends) as well as external log handlers, metrics collection, security and config management providers, and backup and restore systems.

Source	Destination	Port	Protocol	Direction	Purpose
Client machines	Load balancer	443	tcp	incoming	Request distribution
Load balancer	Vault servers	8200	tcp	incoming	Vault API
Vault servers	Vault servers	8200	tcp	bidirectional	Cluster bootstrapping
Vault servers	Vault servers	8201	tcp	bidirectional	Raft, replication, request forwarding
Vault servers	External systems	various	various	various	External APIs

Network traffic encryption

All Vault-related network traffic should be encrypted along every segment. From client machines to the load balancer, and from the load balancer to the Vault servers, standard HTTPS TLS encryption can be used.

For communication between Vault servers (port 8201 by default), including Raft gossip, data replication, and request forwarding traffic, Vault automatically negotiates an mTLS connection when a new server first joins the cluster through the API address port (8200 by default).

Load balancer recommendations

For the highest levels of reliability and stability, use a load balancer to distribute requests across your Vault cluster members. Each major cloud platform provides managed load balancer services, and there are also a number of self-hosted options as well as service discovery systems like Consul.

If you choose to terminate TLS at your load balancer, it is strongly recommended to also use TLS for the connection from the load balancer to Vault, to minimize the exposure of secret content on your network.

To monitor the health of Vault cluster nodes, the load balancer should be configured to poll the /v1/sys/health API endpoint to detect the status of the node and direct traffic accordingly. Refer to the sys/health API documentation for specific details on the query options and response codes and their meanings.
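
The health check logic can be sketched as follows. The status code meanings reflect the default sys/health responses as commonly documented (200 active, 429 standby, 472 DR secondary, 473 performance standby, 501 uninitialized, 503 sealed). Treat them as assumptions and confirm them against the sys/health API documentation for your Vault version. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional

# Assumed default status codes for /v1/sys/health; verify for your Vault version.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def probe_health(vault_addr: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of sys/health, or None if the node is unreachable."""
    url = f"{vault_addr}/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a meaningful health status code.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def should_receive_traffic(status_code: Optional[int], allow_performance_standby: bool = False) -> bool:
    """Decide whether a load balancer should route requests to this node."""
    if status_code == 200:
        return True
    if status_code == 473:
        return allow_performance_standby
    return False
```

A managed load balancer usually expresses the same rule as a list of "healthy" status codes rather than as code. The sketch only makes the decision explicit.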

Scaling considerations

As of Vault 1.7, in a cloud-based environment, it is recommended to use a managed scaling service (such as Auto Scaling Groups on AWS) to keep your Vault cluster populated with healthy instances. However, because of the nature of the Integrated Storage backend, it's important not to replace all instances in the managed scaling group too quickly; otherwise, you may have to restore data from a snapshot.
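
The idea of pacing instance replacement can be sketched as a loop that replaces one node at a time and waits for the cluster to report healthy before continuing. This is a minimal illustration, not a description of how any managed scaling service works. The `replace_node` and `cluster_healthy` callables are hypothetical hooks you would implement against your own platform and Vault health checks.

```python
import time
from typing import Callable, Iterable, List


def rolling_replace(
    nodes: Iterable[str],
    replace_node: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> List[str]:
    """Replace nodes one at a time, waiting for cluster health between steps."""
    replaced: List[str] = []
    for node in nodes:
        replace_node(node)
        # Wait until the replacement has joined and the cluster is healthy again.
        for _ in range(max_polls):
            if cluster_healthy():
                break
            time.sleep(poll_seconds)
        else:
            # Stop rather than continue removing nodes from an unhealthy cluster.
            raise TimeoutError(f"cluster did not recover after replacing {node}")
        replaced.append(node)
    return replaced
```

Stopping on the first timeout keeps the remaining original nodes in place, which preserves their copy of the data while the problem is investigated.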

Note

Auto-server cleanup is not enabled by default when using Integrated Storage. The feature must be enabled after cluster initialization via the Raft Autopilot API. Also see the Integrated Storage Autopilot Tutorial for more details.
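
As a sketch of enabling this feature, the following builds (but does not send) a request to the Autopilot configuration endpoint. The endpoint path and parameter names follow the Raft Autopilot API as I understand it at the time of writing; verify them against the API documentation for your Vault version. The address, token, and threshold values are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def build_autopilot_request(
    vault_addr: str,
    token: str,
    min_quorum: int,
    dead_server_threshold: str = "10m",
) -> urllib.request.Request:
    """Build a POST request that turns on dead-server cleanup via Autopilot."""
    payload = {
        "cleanup_dead_servers": True,
        "dead_server_last_contact_threshold": dead_server_threshold,
        # Autopilot will not prune servers below this number of voters.
        "min_quorum": min_quorum,
    }
    return urllib.request.Request(
        url=f"{vault_addr}/v1/sys/storage/raft/autopilot/configuration",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and token; min_quorum=3 is an illustrative value for five nodes.
request = build_autopilot_request("https://vault.example.internal:8200", "example-token", min_quorum=3)
print(request.get_method(), request.full_url)
# To apply the configuration, send it with urllib.request.urlopen(request).
```

Building the request separately from sending it makes the payload easy to review before it is applied to a live cluster.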

For scaling the performance of your Vault cluster, there are two factors to consider. Adding additional members to the Vault cluster will not increase performance for any activity that triggers writes to the Vault storage backend. However, for Vault Enterprise customers, adding performance standby nodes can provide horizontal scalability for read requests within a Vault cluster.

Failure tolerance characteristics

When deploying a Vault cluster, it's important to consider and design for your specific requirements for various failure scenarios:

Node failure

The Integrated Storage backend for Vault allows for individual node failure by replicating all data between each node of the cluster. If the leader node fails, the remaining cluster members will elect a new leader following the Raft protocol. To allow for the failure of up to two nodes in the cluster, the ideal size is five nodes for a Vault cluster using Integrated Storage.
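
The relationship between cluster size and node failure tolerance follows from Raft's majority rule and can be worked out with a few lines of arithmetic. The function names below are illustrative only.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voters."""
    return voters // 2 + 1


def node_failure_tolerance(voters: int) -> int:
    """Number of voters that can fail while a majority still remains."""
    return voters - quorum_size(voters)


for n in (3, 5, 7):
    print(f"{n} voters: quorum {quorum_size(n)}, tolerates {node_failure_tolerance(n)} failures")
```

For five voters the quorum is three, so the cluster keeps operating with two nodes down, which matches the recommendation above.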

Availability zone failure

By deploying a Vault cluster in the recommended architecture across three availability zones, the Raft consensus algorithm should be able to maintain consistency and availability given the failure of any one availability zone.

In cases where deployment across three zones is not possible, the failure of an availability zone may cause the Vault cluster to become inaccessible or unable to elect a leader. In a two availability zone deployment, for example, the failure of one availability zone would have a 50% chance of causing a cluster to lose its Raft quorum and be unable to service requests.
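
The zone failure scenarios above can be checked with a short calculation. The node placements used here (2, 2, 1 across three zones and 3, 2 across two zones) are assumptions about how five nodes would be distributed, and the function name is hypothetical.

```python
def survives_zone_loss(nodes_per_zone: dict[str, int]) -> dict[str, bool]:
    """For each zone, report whether quorum survives losing that whole zone."""
    total = sum(nodes_per_zone.values())
    quorum = total // 2 + 1
    return {zone: total - count >= quorum for zone, count in nodes_per_zone.items()}


# Assumed placements of a five-node cluster.
print(survives_zone_loss({"az-a": 2, "az-b": 2, "az-c": 1}))
print(survives_zone_loss({"az-a": 3, "az-b": 2}))
```

With three zones, every single-zone failure leaves at least three of five nodes. With two zones, losing the zone that holds three nodes breaks quorum while losing the other does not, which is where the 50% figure comes from.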

Region or cluster failure

In the event of a failure of an entire region or cluster, Vault Enterprise provides replication features that can help provide resiliency across multiple clusters and/or regions. Please see the Multi-Cluster Architecture Guide for more information.

External token storage

The Tokenization transformation feature reached General Availability in Vault 1.7. This feature introduces additional architectural considerations.

The tokenization feature requires an external data store to facilitate the mapping of tokens to cryptographic values. Be sure to architect your external data stores for high availability. Where possible, follow reliability and disaster-recovery architectural patterns that meet the same requirements you have for Vault itself. To ensure data consistency, the backup cadence of the external data store must be kept in sync with backups of Vault.

Glossary

Vault cluster

A Vault cluster is a set of Vault processes that together run a Vault service. These Vault processes could be running on physical or virtual servers or in containers.

Availability zone

An availability zone is a single network failure domain that hosts part or all of a Vault cluster. Examples of availability zones include:

An isolated datacenter
An isolated cage in a datacenter, if it is isolated from other cages by all other means (power, network, etc.)
An "Availability Zone" in AWS or Azure; a "Zone" in GCP

Region

A region is a collection of one or more availability zones on a low-latency network. Regions are typically separated by significant distances. A region could host one or more Vault clusters, but a single Vault cluster would not be spread across multiple regions due to network latency issues.

Autoscaling

Autoscaling is the process of automatically scaling computational resources based on service activity. Autoscaling may be either horizontal, meaning to add more machines into the pool of resources, or vertical, meaning to increase the capacity of existing machines.

Each major cloud provider offers a managed autoscaling service:

Cloud	Managed Autoscaling Service
AWS	Auto Scaling Groups
Azure	Virtual Machine Scale Sets
GCP	Managed Instance Groups

Load balancer

A load balancer is a system that distributes network requests across multiple servers. It may be a managed service from a cloud provider, a physical network appliance, a piece of software, or a service discovery platform such as Consul.

Each major cloud provider offers one or more managed load balancing services:

Cloud	Layer	Managed Load Balancing Service
AWS	Layer 4	Network Load Balancer
	Layer 7	Application Load Balancer
Azure	Layer 4	Azure Load Balancer
	Layer 7	Azure Application Gateway
GCP	Layer 4/7	Cloud Load Balancing

Knowledge checks

A quiz to test your knowledge.

Why does the recommended Vault integrated storage architecture use five nodes distributed across three availability zones?
This layout lets the cluster tolerate the loss of up to two nodes or one entire availability zone while maintaining Raft quorum and service availability.
What load balancer health check endpoint should you configure to monitor Vault cluster nodes?
❌ /v1/sys/seal-status
❌ /v1/sys/leader
✅ /v1/sys/health
❌ /v1/storage/raft/status
How do additional Vault cluster members affect performance when you use Integrated Storage?
❌ They increase performance for write operations because Raft spreads writes across more nodes.
✅ They do not increase performance for storage writes, but Vault Enterprise performance standby nodes can scale read requests.
❌ They eliminate the need for a load balancer because each node handles traffic independently.
❌ They reduce network latency between availability zones by replicating data locally.

Next steps

Additional references

Collection Overview

Day one preparation

Multi-cluster architecture guide
