Vault with Consul Storage Reference Architecture
This guide applies to Vault versions 1.7 and above and Consul versions 1.8 and above.
This guide describes recommended best practices for infrastructure architects and operators to follow when deploying Vault using the Consul storage backend in a production environment.
This guide includes general guidance as well as specific recommendations for popular cloud infrastructure platforms.
For production
Integrated Storage native to Vault is now recommended rather than using Consul for Vault storage. Use Consul for Vault storage only when there are clear reasons.
Kubernetes users
If you are deploying Vault to Kubernetes, please refer to the Vault on Kubernetes Reference Architecture.
Recommended architecture
The following diagram shows the recommended architecture for deploying a single Vault cluster with Consul storage, using the Enterprise releases of both Vault and Consul:
In this architecture, the primary availability risk is to the storage layer. With six nodes in the Consul cluster distributed between three availability zones configured as Consul Redundancy Zones with three voting members and three non-voting members, this architecture can withstand the loss of up to three Consul nodes or the loss of an entire availability zone and remain available. Since Vault uses only a single active node, the Vault cluster only needs three cluster members to withstand the loss of two nodes or an entire availability zone.
If deploying to three availability zones is not possible, the same architecture may be used across one or two availability zones, at the cost of a significant reliability risk in the event of an availability zone outage.
Additional resiliency is possible by implementing a multi-cluster architecture, which allows for additional performance and disaster recovery options. See the Multi-Cluster Architecture Guide for more information.
It is important to use a dedicated Consul cluster for Vault storage, separate from any Consul cluster used for other purposes, to minimize resource contention on the storage layer. This will likely necessitate using non-default ports for Consul network connectivity. In this architecture, ports 7300 and 7301 have been used rather than the defaults of ports 8300 and 8301.
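As a minimal sketch of the non-default port assignment described above, the following Python snippet renders a Consul agent configuration fragment as JSON. The `ports.server` and `ports.serf_lan` keys are Consul's settings for server RPC and LAN gossip. The helper name `consul_ports_config` is hypothetical, and how the fragment is written to disk and merged with the rest of the agent configuration is left as an assumption.

```python
import json

# Port values used in this reference architecture.
# Consul's defaults are 8300 (server RPC) and 8301 (LAN gossip).
SERVER_RPC_PORT = 7300
SERF_LAN_PORT = 7301


def consul_ports_config(server_rpc: int = SERVER_RPC_PORT,
                        serf_lan: int = SERF_LAN_PORT) -> str:
    """Return a JSON fragment that overrides Consul's RPC and LAN gossip ports."""
    config = {"ports": {"server": server_rpc, "serf_lan": serf_lan}}
    return json.dumps(config, indent=2)


print(consul_ports_config())
```

The same port values would need to be applied to every Consul agent in the dedicated cluster, including the agents running on the Vault servers.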
System requirements
This section contains specific hardware capacity recommendations, network requirements, and additional infrastructure considerations. Since every hosting environment is different and every customer's Vault usage profile is different, these recommendations should only serve as a starting point from which each customer's operations staff may observe and adjust to meet the unique needs of each deployment.
Warning
All specifications outlined in this document are minimum recommendations. They make no allowance for vertical scaling, redundancy, or other SRE needs, and they do not account for your user volumes or use cases. Resource requirements are directly proportional to the operations performed by the Vault cluster and to end-user utilization.
Note
To match your requirements and maximize the stability of your Vault instances, perform load tests and continuously monitor resource usage as well as all metrics reported by Vault's telemetry.
Hardware sizing for Vault servers
Sizing recommendations have been divided into two common cluster sizes.
Small clusters would be appropriate for most initial production deployments or for development and testing environments.
Large clusters are production environments with a consistently high workload. That might be a large number of transactions, a large number of secrets, or a combination of the two.
Size | CPU | Memory | Disk Capacity | Disk IO | Disk Throughput |
---|---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 100+ GB | 3000+ IOPS | 75+ MB/s |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | 3000+ IOPS | 125+ MB/s |
For each cluster size, the following table gives recommended hardware specs for each major cloud infrastructure provider.
Provider | Size | Instance/VM Types | Disk Volume Specs |
---|---|---|---|
AWS | Small | m5.large, m5.xlarge | 100+GB gp3, 3000 IOPS, 125MB/s |
AWS | Large | m5.2xlarge, m5.4xlarge | 200+GB gp3, 5000 IOPS, 125MB/s |
Azure | Small | Standard_D2s_v3, Standard_D4s_v3 | 1024GB* Premium_LRS |
Azure | Large | Standard_D8s_v3, Standard_D16s_v3 | 1024GB* Premium_LRS |
GCP | Small | n2-standard-2, n2-standard-4 | 500GB* pd-balanced |
GCP | Large | n2-standard-8, n2-standard-16 | 1000GB* pd-ssd |
Note
For GCP and Azure recommendations, the disk sizes listed are larger than the minimum size recommended, because for the recommended disk type, available IOPS increases with disk capacity, and the listed sizes are necessary to provision the required IOPS.
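The relationship in the note above can be illustrated with a small calculation. This is a simplified sketch that assumes provisioned IOPS grow linearly with capacity. The function name `min_disk_gb_for_iops` is hypothetical, and the rate of 6 IOPS per GB is an assumed figure for illustration only. Real disk types also have baselines, caps, and tiered sizes, so check your provider's current documentation before sizing.

```python
import math


def min_disk_gb_for_iops(required_iops: int, iops_per_gb: float) -> int:
    """Smallest disk size in GB that reaches required_iops,
    assuming IOPS scale linearly with capacity (a simplification)."""
    if required_iops <= 0 or iops_per_gb <= 0:
        raise ValueError("required_iops and iops_per_gb must be positive")
    return math.ceil(required_iops / iops_per_gb)


# Assumed rate for illustration only: 6 IOPS per GB.
print(min_disk_gb_for_iops(3000, 6))  # 500
```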
Note
For predictable performance on cloud providers, it's recommended to avoid "burstable" CPU and storage options (such as AWS t2 and t3 instance types) whose performance may degrade rapidly under continuous load.
Hardware sizing for Consul servers
Size | CPU | Memory | Disk Capacity | Disk IO | Disk Throughput |
---|---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 100+ GB | 3000+ IOPS | 75+ MB/s |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | 10000+ IOPS | 250+ MB/s |
For each cluster size, the following table gives recommended hardware specs for each major cloud infrastructure provider.
Provider | Size | Instance/VM Types | Disk Volume Specs |
---|---|---|---|
AWS | Small | m5.large, m5.xlarge | 100+GB gp3, 3000 IOPS, 125MB/s |
AWS | Large | m5.2xlarge, m5.4xlarge | 200+GB gp3, 10000 IOPS, 250MB/s |
Azure | Small | Standard_D2s_v3, Standard_D4s_v3 | 1024GB* Premium_LRS |
Azure | Large | Standard_D8s_v3, Standard_D16s_v3 | 1024GB* Premium_LRS |
GCP | Small | n2-standard-2, n2-standard-4 | 500GB* pd-balanced |
GCP | Large | n2-standard-8, n2-standard-16 | 1000GB* pd-ssd |
Note
For GCP and Azure recommendations, the disk sizes listed are larger than the minimum size recommended, because for the recommended disk type, available IOPS increases with disk capacity, and the listed sizes are necessary to provision the required IOPS.
Hardware considerations
In general, CPU and storage performance requirements will depend on the customer's exact usage profile (e.g., types of requests, average request rate, and peak request rate). Memory requirements depend on the total size of data stored in memory and should be sized according to that data.
HashiCorp strongly recommends configuring Vault with audit logging enabled. The impact of the additional storage I/O from audit logging will vary depending on your particular pattern of requests. For best performance, audit logs should be written to a separate disk.
Network latency and bandwidth
In order for cluster members to stay properly in sync, network latency between availability zones should be less than eight milliseconds (8 ms).
The amount of network bandwidth used by Vault and Consul will depend entirely on the specific customer's usage patterns. In many cases, even a high request volume will not translate to a large amount of network bandwidth consumption. However, all data written to Vault will be replicated to all Consul cluster members. It's also important to consider bandwidth requirements to other external systems such as monitoring and logging collectors. And finally, a multi-cluster Vault setup will require Vault datasets to be transmitted between clusters to provide Performance and DR Replication.
Network connectivity
The following table outlines the network connectivity requirements for Vault cluster nodes when using Consul storage. If general network egress is restricted, particular attention must be paid to granting outgoing access from the Vault servers to any external integration providers (for example, authentication and secret provider backends) as well as external log handlers, metrics collection, security and config management providers, and backup and restore systems.
Source | Destination | Port | Protocol | Direction | Purpose |
---|---|---|---|---|---|
Client machines | Load balancer | 443 | tcp | incoming | Request distribution |
Load balancer | Vault servers | 8200 | tcp | incoming | Vault API |
Vault servers | Vault servers | 8200 | tcp | bidirectional | Cluster bootstrapping |
Vault servers | Vault servers | 8201 | tcp | bidirectional | Raft, replication, request forwarding |
Vault servers | External systems | various | various | various | External APIs |
Consul and Vault servers | Consul servers | 7300* | tcp | incoming | Consul server RPC |
Consul and Vault servers | Consul and Vault servers | 7301* | tcp, udp | bidirectional | Consul LAN gossip |
Note
The ports marked with an asterisk (*) for Consul RPC and gossip traffic differ from the Consul defaults (8300 and 8301) in this architecture.
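To make the connectivity table easier to turn into firewall or security group rules, the sketch below encodes its fixed-port rows as data and derives the inbound ports each role must accept. The role labels (`load_balancer`, `vault`, `consul`), the `Rule` type, and the `inbound_ports` helper are hypothetical names chosen for illustration. The "External systems" row is omitted because its ports vary by integration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    destinations: frozenset  # roles that must accept the traffic
    port: int
    protocols: tuple
    purpose: str


# Fixed-port rows from the network connectivity table.
RULES = [
    Rule(frozenset({"load_balancer"}), 443, ("tcp",), "Request distribution"),
    Rule(frozenset({"vault"}), 8200, ("tcp",), "Vault API, cluster bootstrapping"),
    Rule(frozenset({"vault"}), 8201, ("tcp",), "Raft, replication, request forwarding"),
    Rule(frozenset({"consul"}), 7300, ("tcp",), "Consul server RPC"),
    Rule(frozenset({"consul", "vault"}), 7301, ("tcp", "udp"), "Consul LAN gossip"),
]


def inbound_ports(role: str) -> set:
    """Return the set of (port, protocol) pairs that hosts with this role must accept."""
    return {
        (rule.port, proto)
        for rule in RULES
        if role in rule.destinations
        for proto in rule.protocols
    }


for role in ("load_balancer", "vault", "consul"):
    print(role, sorted(inbound_ports(role)))
```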
Network traffic encryption
All Vault-related network traffic should be encrypted along every segment. From client machines to the load balancer, and from the load balancer to the Vault servers, standard HTTPS TLS encryption can be used.
For communication between Vault servers (port 8201 by default) for request forwarding traffic, Vault automatically negotiates an mTLS connection when new servers join the cluster initially via the API address port (8200 by default).
For communication between Consul agents on the Vault and Consul clusters, it is strongly recommended to configure gossip encryption, which is covered in the Deployment Guide.
Load balancer recommendations
For the highest levels of reliability and stability, it is highly recommended to use some load balancing technology to distribute requests to your Vault cluster members. Each major cloud platform provides good options for managed load balancer services, or there are a number of self-hosted options as well as service discovery systems like Consul.
If you choose to terminate TLS at your load balancer, it is strongly recommended to also use TLS for the connection from the load balancer to Vault, to minimize the exposure of secret content on your network.
To monitor the health of Vault cluster nodes, the load balancer should be configured to poll the /v1/sys/health API endpoint to detect the status of the node and direct traffic accordingly. Refer to the sys/health API documentation for specific details on the query options and response codes and their meanings.
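Managed load balancers implement this check natively, but the decision logic can be sketched in Python to show how the default response codes map to routing decisions. The status codes listed are the documented defaults for `/v1/sys/health`. The function names `probe` and `should_route`, and the policy of routing only to the active node unless performance standbys are explicitly allowed, are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Optional

# Default /v1/sys/health response codes and their meanings.
HEALTH_STATUS = {
    200: "initialized, unsealed, and active",
    429: "unsealed and standby",
    472: "disaster recovery secondary and active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def should_route(status_code: Optional[int], allow_perf_standby: bool = False) -> bool:
    """Decide whether a node should receive client traffic (example policy)."""
    if status_code == 200:
        return True
    return allow_perf_standby and status_code == 473


def probe(base_url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of the health endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/sys/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (429, 472, 473, 501, 503) arrive as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        return None
```

A caller might combine the two as `should_route(probe("https://vault-1.example.internal:8200"))`, where the hostname is a placeholder. A deployment using a private certificate authority would also need to supply an appropriate TLS context.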
Scaling considerations
In a cloud-based environment, it is recommended to use a managed scaling service (such as Auto Scaling Groups on AWS) to keep your Vault and Consul clusters populated with healthy instances. However, it's important not to replace all Consul instances in the managed scaling group too quickly, because doing so risks data loss.
For scaling the performance of your Vault cluster, there are two factors to consider. Adding additional members to the Vault cluster will not increase performance for any activity that triggers writes to the Vault storage backend. However, for Vault Enterprise customers, adding performance standby nodes can provide horizontal scalability for read requests within a Vault cluster.
Failure tolerance characteristics
When deploying a Vault cluster, it's important to consider and design for your specific requirements for various failure scenarios:
Node failure
In a high-availability Vault cluster using Consul storage, all data is stored in the Consul cluster, and so the failure of a Vault node does not risk data loss. To determine Vault cluster leadership, one of the Vault servers obtains a lock within the Consul data store to become the active Vault node.
If at any time the leader is lost, another Vault node will take its place as the cluster leader. To allow for the loss of two Vault nodes, the minimum recommended Vault cluster size is three.
Consul achieves replication and leadership through the use of its consensus and gossip protocols. In these protocols, a leader is elected by consensus and so a quorum of active servers must always exist. To allow for the loss of two nodes from the Consul cluster, the minimum recommended size of the Consul cluster is five nodes.
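The five-node recommendation follows from majority-quorum arithmetic: a cluster of n voting servers needs floor(n/2) + 1 of them available, so it tolerates n minus that many failures. A minimal sketch, with hypothetical function names:

```python
def quorum_size(voters: int) -> int:
    """Number of voting servers that must be available to elect a leader."""
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """Number of voting servers that can be lost while keeping quorum."""
    return voters - quorum_size(voters)


for n in (3, 5, 7):
    print(f"{n} voters: quorum {quorum_size(n)}, tolerates {failure_tolerance(n)} failure(s)")
# 3 voters: quorum 2, tolerates 1 failure(s)
# 5 voters: quorum 3, tolerates 2 failure(s)
# 7 voters: quorum 4, tolerates 3 failure(s)
```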
Availability zone failure
By deploying Vault and Consul cluster members in the recommended architecture across three availability zones, the overall architecture can tolerate the loss of any single availability zone.
In cases where deployment across three zones is not possible, the failure of an availability zone may cause the Vault cluster to become inaccessible or the Consul cluster to be unable to elect a leader. In a two availability zone deployment, for example, the failure of one availability zone would have a 50% chance of causing the Consul cluster to lose its Raft quorum and be unable to service requests.
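The 50% figure can be checked by enumerating zone failures. The sketch below assumes five voting Consul servers and that each zone is equally likely to fail. The zone names and the function `survives_zone_loss` are hypothetical.

```python
def survives_zone_loss(voters_per_zone: dict, lost_zone: str) -> bool:
    """Return True if the remaining voters still form a majority quorum."""
    total = sum(voters_per_zone.values())
    remaining = total - voters_per_zone[lost_zone]
    return remaining >= total // 2 + 1


# Five voters across two zones: one zone must hold the majority.
two_zones = {"az-1": 3, "az-2": 2}
# Five voters across three zones: no single zone holds a majority.
three_zones = {"az-1": 2, "az-2": 2, "az-3": 1}

for layout in (two_zones, three_zones):
    outcomes = {zone: survives_zone_loss(layout, zone) for zone in layout}
    print(outcomes)
# {'az-1': False, 'az-2': True}
# {'az-1': True, 'az-2': True, 'az-3': True}
```

With two zones, quorum survives only when the zone holding the minority of voters fails, which is one of the two possible zone failures.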
Region or cluster failure
In the event of a failure of an entire region or cluster, Vault Enterprise provides replication features that can help provide resiliency across multiple clusters and/or regions. Please see the Multi-Cluster Architecture Guide for more information.
External token storage
The Tokenization transformation feature reached General Availability in Vault 1.7. This feature introduces additional architectural considerations.
The tokenization feature requires an external data store to facilitate the mapping of tokens to cryptographic values. Be sure to architect your external data stores for high availability. Where possible, follow reliability and disaster-recovery architectural patterns that meet the same requirements you have for Vault itself. To ensure data consistency, the backup cadence of the external data store must be in sync with backups of Vault.
Glossary
Vault cluster
A Vault cluster is a set of Vault processes that together run a Vault service. These Vault processes could be running on physical or virtual servers or in containers.
Availability zone
An availability zone is a single network failure domain that hosts part or all of a Vault cluster. Examples of availability zones include:
- An isolated datacenter
- An isolated cage in a datacenter if it is isolated from other cages by all other means (power, network, etc)
- An "Availability Zone" in AWS or Azure, or a "Zone" in GCP
Region
A region is a collection of one or more availability zones on a low-latency network. Regions are typically separated by significant distances. A region could host one or more Vault clusters, but a single Vault cluster would not be spread across multiple regions due to network latency issues.
Autoscaling
Autoscaling is the process of automatically scaling computational resources based on service activity. Autoscaling may be either horizontal, meaning to add more machines into the pool of resources, or vertical, meaning to increase the capacity of existing machines.
Each major cloud provider offers a managed autoscaling service:
Cloud | Managed Autoscaling Service |
---|---|
AWS | Auto Scaling Groups |
Azure | Virtual Machine Scale Sets |
GCP | Managed Instance Groups |
Load balancer
A load balancer is a system that distributes network requests across multiple servers. It may be a managed service from a cloud provider, a physical network appliance, a piece of software, or a service discovery platform such as Consul.
Each major cloud provider offers one or more managed load balancing services:
Cloud | Layer | Managed Load Balancing Service |
---|---|---|
AWS | Layer 4 | Network Load Balancer |
AWS | Layer 7 | Application Load Balancer |
Azure | Layer 4 | Azure Load Balancer |
Azure | Layer 7 | Azure Application Gateway |
GCP | Layer 4/7 | Cloud Load Balancing |