Vault on Kubernetes Reference Architecture
This document outlines a reference architecture for deployment of HashiCorp Vault in the context of the Kubernetes cluster scheduler. Those interested in deploying a Vault service consistent with these recommendations should read the upcoming Vault on Kubernetes Deployment Guide which will include instructions on the usage of the official HashiCorp Vault Helm Chart.
As of Vault 1.4, this document supports both Vault Open Source and Vault Enterprise deployments utilizing HashiCorp Consul Enterprise as the persistent storage layer. Readers may want to refer to the non-Kubernetes Consul Reference Architecture and Consul Deployment Guide as general references. The recommendations in this document related to the Consul deployment are heavily informed by those documents.
The following topics are addressed in this guide:
- Kubernetes cluster features & configuration
- Infrastructure requirements
- Infrastructure Design
- Help and Reference
Kubernetes cluster features & configuration
Federation and cluster-level high availability
This document details designs for a resilient, reliable, and highly-available Vault service on a single, dedicated Kubernetes cluster deployment via effective use of availability zones and other forms of in-region datacenter redundancies.
Future versions of this document will include designs optimized for Vault Enterprise Disaster Recovery Replication and multi-datacenter Performance Replication.
Third-party redundancy/resiliency tooling
There is no expectation that your Kubernetes cluster has been configured for Kubernetes-specific forms of multi-datacenter redundancy such as Federation v2 or other third-party tools for improving Kubernetes reliability and disaster recovery. Future updates to this Reference Architecture may take these other technologies into account.
Secure scheduling via RBAC and NodeRestrictions
This document details various cluster scheduling constructs used to ensure the proper spread of Vault and Consul Pods amongst a pool of Nodes in multiple availability zones. The current recommendation is to run Vault and Consul on their own dedicated cluster; however, the same constructs also ensure, for security purposes, that the Vault and Consul Pods do not share a Node with non-Vault and non-Consul Pods. These constructs rely on Kubernetes Node Labels. Historically, the kubelets running on Nodes have been given privileges to modify their own Node labels and sometimes even the labels of other Nodes. This opens the possibility of rogue operators or workloads modifying Node Labels in such a way as to subvert isolation of Consul and Vault workloads from other workloads on the cluster. For this reason, the use of Kubernetes RBAC and the NodeRestriction admission plugin is required, though their configuration is not covered in this doc.
A dedicated Kubernetes cluster for the Vault and Consul deployment mitigates some of these security concerns; however, it does not preclude the use of a multi-tenant cluster. Running workloads on the same cluster as the Vault and Consul deployments does introduce a variety of security concerns as noted above. A future update to this reference architecture will address these concerns.
More details may be found in the upcoming Vault on Kubernetes Deployment Guide.
Network-attached storage volumes
For the purposes of this Reference Architecture, the Consul Pods have a mandatory requirement of durable storage via PersistentVolumes and PersistentVolumeClaims. It is also strongly encouraged, bordering on a hard requirement, that those volumes be network-attached and capable of being re-bound to new Pods should the original Pods holding the volume claim go offline due to permanent Node failure. Although it is possible to deploy this Reference Architecture using volumes which are not capable of being re-bound to replacement Pods (e.g., hostPath), doing so will significantly reduce the effectiveness of deploying across multiple availability zones for both Consul and Vault and is thus not recommended.
Examples of network-attached storage which would meet the above requirements include AWS EBS, GCE PD, Azure Disk, and Cinder volumes. Please see the Persistent Volumes documentation for more details.
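As a concrete illustration, the PersistentVolumeClaim below sketches a data volume for a single Consul server, assuming an AWS EBS-backed storage class named 'gp2'; the claim name, class name, and size are illustrative, and in practice such claims would normally be generated by the StatefulSet's volumeClaimTemplates.

```yaml
# Illustrative PVC for one Consul server's data volume.
# The storage class "gp2" (AWS EBS) and the 50Gi size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-consul-server-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 50Gi
```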
Infrastructure requirements
Dedicated Nodes/kubelets
Vault Pods should be scheduled to a dedicated Kubernetes cluster on which no other workloads can be scheduled. This prevents the possibility of co-tenant rogue workloads attempting to penetrate protections provided by the Node operating system and container runtime to gain access to Vault kernel-locked memory or Consul memory and persistent storage volumes. See below for details.
Sizing of Kubernetes Nodes (kubelets)
The suggested hardware requirements for kubelets hosting Consul and Vault do not vary substantially from the recommendations made in the non-Kubernetes Reference Architecture documents for Consul and Vault; those documents remain the canonical source for sizing information.
The sizing tables as specified at the time of this writing have been reproduced below for convenience:
Sizing for Consul Nodes:
| Size | CPU | Memory | Disk | Typical Cloud Instance Types |
|-------|-----------|--------------|--------|------------------------------|
| Small | 2-4 core | 8-16 GB RAM | 50 GB | AWS: m5.large, m5.xlarge; Azure: Standard_D2_v3, Standard_D4_v3; GCP: n2-standard-2, n2-standard-4 |
| Large | 8-16 core | 32-64 GB RAM | 100 GB | AWS: m5.2xlarge, m5.4xlarge; Azure: Standard_D8_v3, Standard_D16_v3; GCP: n2-standard-8, n2-standard-16 |
Sizing for Vault Nodes:
| Size | CPU | Memory | Disk | Typical Cloud Instance Types |
|-------|----------|--------------|-------|------------------------------|
| Small | 2 core | 4-8 GB RAM | 25 GB | AWS: m5.large; Azure: Standard_D2_v3; GCE: n1-standard-2, n1-standard-4 |
| Large | 4-8 core | 16-32 GB RAM | 50 GB | AWS: m5.xlarge, m5.2xlarge; Azure: Standard_D4_v3, Standard_D8_v3; GCE: n1-standard-8, n1-standard-16 |
Control plane nodes
The Kubernetes community generally recommends against running non-administrative workloads on control plane/master nodes. Most Kubernetes cluster installers and cloud-hosted Kubernetes clusters disallow scheduling general workloads on the control plane. Even if your cluster allows it, Vault and Consul Pods should not be scheduled on the control plane. As general workloads, neither Vault nor Consul place unusual demands on the control plane relative to other general workloads. For these reasons control plane node sizing is considered outside of the scope of this document and thus no specific recommendations are offered.
Infrastructure Design
Baseline Node layout
The following diagram represents the initial configuration of Kubernetes Nodes, without application of any of the constructs that the following sections will leverage for scheduling our Consul and Vault Pods. Although Nodes at the bottom of the diagram are set off visually from Nodes at the top of the diagram, at this point they represent identical configurations. Note there are three availability zones: Availability Zone 0, Availability Zone 1, and Availability Zone 2.
This is the baseline configuration upon which following sections of this doc will build.
Consul Server Pods and Vault Server Pods
Limiting our Consul Server Pods and Vault Server Pods to a subset of Nodes
Working from the non-configured baseline in the previous diagram, the set of Nodes must first be partitioned into those where Consul Server Pods and Vault Server Pods will run and those which are available for other non-Vault-related workloads. The Kubernetes constructs of Node Labels and Node Selectors are utilized to notify the scheduler on which Nodes we'd like Consul and Vault workloads to land. Later sections of this document will discuss how to enforce the requirement that this same set of Nodes is dedicated for use by Consul and Vault workloads.
Recent versions of Kubernetes will often auto-label Nodes with a set of built-in labels using metadata from the hosting cloud provider. If the cloud provider does not support auto-labeling, these labels can be manually populated. Built-in labels in the diagram below are in black text.
It is worth noting that 'topology.kubernetes.io/zone' has special meaning within Kubernetes when used as a topologyKey: during scheduling Kubernetes will best-effort spread Pods evenly amongst the specified zones. This doc shows generic zone names such as 'az0' and 'az1' but, as an example, in AWS these might look like 'ca-central-1a' or 'ap-south-1b'.
In this doc Nodes are assumed to have been provisioned with unique hostnames, and thus the built-in label 'kubernetes.io/hostname' can be used, again via topologyKey, to best-effort spread Consul Server Pods and Vault Server Pods across our selected Nodes.
Nodes can also be provisioned with custom labels. In the diagram below our custom label denoting that a Node is reserved for use by a Vault workload, vault_in_k8s=true, is in blue text. Nodes without that label will not be used for Vault-related workloads and are included in the diagram only to emphasize this point.
In the diagram above nine Nodes have been labeled with vault_in_k8s=true. This k/v pair is referenced by our nodeSelector to inform Kubernetes where to place our Consul and Vault Pods.
Example:
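A minimal sketch of the relevant portion of a Pod spec, assuming the custom label shown above (the full spec will be generated by the Helm chart):

```yaml
# Pod spec excerpt: schedule only onto Nodes labeled vault_in_k8s=true.
spec:
  nodeSelector:
    vault_in_k8s: "true"
```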
The k/v pair and nodeSelector are necessary but not sufficient for our requirements:
1. There is no guarantee that a single Node will have only a Consul Server Pod or only a Vault Server Pod exclusively.
2. There is no guarantee that the Consul Server and Vault Server Pods will be distributed evenly amongst our availability zones.
3. There is no guarantee that untrusted Pods will not be scheduled onto Nodes where Consul and Vault are running.
In the next section podAntiAffinity scheduling will resolve items 1 and 2 above.
Spread Consul Server Pods and Vault Server Pods across Availability Zones and Nodes
In the previous section a Node Selector was used to request that Consul Server Pods and Vault Server Pods run on a select subset of the available Nodes. The next requirement is that Pods be evenly distributed amongst the availability zones and amongst the selected Nodes. Pods are spread across the availability zones to limit exposure to problems in a particular availability zone. Consul Server Pods and Vault Server Pods must run on separate Nodes to limit exposure to extreme resource pressure in those services. For example, if the Vault service is suffering from unusually high k/v write requests, no single Node will ever be required to handle both the resulting Vault k/v load and the resulting Consul k/v load. Ensuring the Pods are never co-tenant also makes rolling upgrades of Consul Pods, Vault Pods, and Kubernetes itself less error-prone and more reliable.
Pod anti-affinity rules keyed on two topology domains are utilized to ensure the two dimensions of spread mentioned above. As mentioned in the previous section, spread amongst availability zones is via the 'topology.kubernetes.io/zone' key, and spread amongst the Nodes is via the 'kubernetes.io/hostname' key.
Example Pod spec for Consul Server Pod:
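The excerpt below is a sketch of the scheduling-related portion of the Consul Server Pod spec; the label selectors (app, component) are assumptions for illustration, and the labels generated by the Helm chart may differ.

```yaml
# Consul Server Pod spec excerpt (scheduling only); labels are illustrative.
spec:
  nodeSelector:
    vault_in_k8s: "true"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Never place two Consul servers on the same Node.
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: consul
              component: server
        # Never share a Node with a Vault server.
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: vault
              component: server
      preferredDuringSchedulingIgnoredDuringExecution:
        # Best-effort spread of Consul servers across availability zones.
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: consul
                component: server
```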
Example Pod spec for Vault Server Pod:
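And a corresponding sketch for the Vault Server Pod spec, with the same caveat that the label values are illustrative:

```yaml
# Vault Server Pod spec excerpt (scheduling only); labels are illustrative.
spec:
  nodeSelector:
    vault_in_k8s: "true"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Never place two Vault servers on the same Node.
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: vault
              component: server
        # Never share a Node with a Consul server.
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: consul
              component: server
      preferredDuringSchedulingIgnoredDuringExecution:
        # Best-effort spread of Vault servers across availability zones.
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: vault
                component: server
```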
With the above configs Kubernetes will ensure best-effort spread of Consul Server Pods and Vault Server Pods amongst both AZs and Nodes. Node-level isolation of Consul and Vault workloads from general workloads is ensured via Taints and Tolerations, which are covered in the next section.
Ensuring Node-level isolation of Vault-related workloads
In the previous section we leveraged Kubernetes anti-affinity scheduling to ensure a desired spread of Consul Server Pods and Vault Server Pods along axes of Availability Zones and Nodes (identified by unique hostname). In this section the Kubernetes constructs of Taints and Tolerations ensure that Vault-related workloads never share a node with non-Vault-related workloads. As mentioned in previous sections Node-level isolation is a safeguard to prevent rogue workloads from penetrating a Node's OS-level and container runtime-level protections for a possible attack on Vault shared memory, Consul process memory, and Consul persistent storage.
Taints
First, the labeled Nodes dedicated for use by Consul Server Pods and Vault Server Pods must be tainted to prevent general workloads from running on them. They are tainted 'NoExecute' so that any running Pods will be removed from the Nodes before we place our intended Pods. The diagram below shows our partitioned Nodes with a newly applied taint, taint_for_consul_xor_vault=true:NoExecute. The taint is shown in blue for emphasis.
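For reference, the sketch below shows how the custom label and taint appear on a Node object; the Node name and zone value are placeholders, and in practice the label and taint would be applied by kubectl or by the cluster provisioning tooling.

```yaml
# Node object excerpt showing the custom label and the NoExecute taint.
# Node name and zone value are placeholders.
apiVersion: v1
kind: Node
metadata:
  name: node-az0-0
  labels:
    topology.kubernetes.io/zone: az0
    vault_in_k8s: "true"
spec:
  taints:
    - key: taint_for_consul_xor_vault
      value: "true"
      effect: NoExecute
```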
Tolerations
Once a Taint has been placed on a Node, a Pod spec must include a Toleration if the Pod is to run on that Node.
Example Pod spec (Consul and Vault) with Toleration:
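A minimal sketch of the Toleration stanza, matching the taint defined above, as it would appear in both the Consul Server and Vault Server Pod specs:

```yaml
# Pod spec excerpt: tolerate the taint placed on the dedicated Nodes.
spec:
  tolerations:
    - key: taint_for_consul_xor_vault
      operator: Equal
      value: "true"
      effect: NoExecute
```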
The diagrams below show the various scheduling configurations of our Consul Server Pods and Vault Server Pods to this point. Some things to note:
- Node Selector and Tolerations are shown in yellow text for emphasis.
- The diagram uses a custom syntax to denote anti-affinity rules as the actual syntax is too verbose to easily fit onto the diagram.
Consul Server Pod:
Vault Server Pod:
Consul Client Pods
Previous sections focused on proper scheduler configuration for Consul Server Pods and Vault Server Pods. There's an additional Pod type which will be part of the infrastructure: Consul Client Pods. The Consul Client Pod is used by the Vault Server Pod to find the Consul cluster which will be used for Vault secure storage. Consul Client Pods will be scheduled onto each Node which has been dedicated to Vault-related workload hosting. Strictly speaking, there's no requirement that a Consul Server Pod also have a Consul Client Pod but for the sake of simplicity a Consul Client Pod is scheduled everywhere a Vault Server Pod might be scheduled. This simplification comes at negligible additional resource cost for the Node.
Many of the same scheduling constructs already used will also be leveraged for Consul Client Pods, though with much less complexity. The Consul Client Pods are scheduled as a DaemonSet, which is limited to the partitioned Nodes via use of a Node Selector and a Toleration.
Node Selector
As with the Consul Server Pod and Vault Server Pod, the Node Selector is part of the Pod spec and is quite simple:
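A sketch of the Node Selector as it would appear in the DaemonSet's Pod template:

```yaml
# DaemonSet Pod template excerpt: restrict Consul Client Pods to labeled Nodes.
spec:
  template:
    spec:
      nodeSelector:
        vault_in_k8s: "true"
```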
This ensures that Pods from the DaemonSet can only be deployed onto Nodes labeled with vault_in_k8s=true. Remember, however, that the dedicated Nodes have a Taint applied. The DaemonSet must include a Toleration.
Tolerations
As with the Consul Server Pods and Vault Server Pods, a Toleration is specified for taint_for_consul_xor_vault=true:NoExecute.
Example DaemonSet Pod spec with Toleration:
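The sketch below pulls the pieces together for the Consul Client DaemonSet; the object name, labels, and container image tag are illustrative rather than the Helm chart's exact output.

```yaml
# Illustrative Consul Client DaemonSet limited to the dedicated, tainted Nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: consul-client
spec:
  selector:
    matchLabels:
      app: consul
      component: client
  template:
    metadata:
      labels:
        app: consul
        component: client
    spec:
      nodeSelector:
        vault_in_k8s: "true"
      tolerations:
        - key: taint_for_consul_xor_vault
          operator: Equal
          value: "true"
          effect: NoExecute
      containers:
        - name: consul
          image: consul:1.7.2                  # image tag illustrative
          args: ["agent", "-client=0.0.0.0"]   # full client config (retry-join, etc.) omitted
```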
Deployed Infrastructure
All the required Labels, Node Selectors, Taints, Tolerations, and anti-affinity scheduling are now specified. The diagrams below demonstrate the resulting infrastructure, including Nodes and Pod placement.
Note that this results in a bit of extra capacity amongst the dedicated Nodes. There is a Node in Availability Zone 2 with a Consul Client but neither a Consul Server Pod nor a Vault Server Pod. That's by design. Remember that Kubernetes is doing a best-effort spread of Pods amongst the AZs and available Nodes. In the case of a Node failure that spare Node is available for re-scheduling of Pods previously on the failed Node.
Exposing the Vault Service
With the resilient Vault service available in the Kubernetes cluster, the next question becomes how to expose that Service to Vault Clients running outside of the Kubernetes cluster. There are three common constructs for exposing Kubernetes services to external/off-cluster clients: Load Balancer Services, Node Port Services, and Ingress.
Communication between Vault Clients and Vault Servers depends on the request path, client address, and TLS certificates, and thus only Load Balancer and Node Port, both Layer 4 proxies, are recommended at this time. Future versions of this document may include details on using Ingress. Both the Load Balancer and Node Port approaches require setting externalTrafficPolicy to 'Local' to preserve the Vault Client source addresses embedded in Vault client requests and responses.
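A sketch of such a Service, assuming the Vault Server Pods carry the illustrative labels used earlier and listen on the default Vault port 8200:

```yaml
# Illustrative external Service for the Vault API; the name and selector
# labels are assumptions, and the type could also be NodePort.
apiVersion: v1
kind: Service
metadata:
  name: vault-external
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client source addresses
  selector:
    app: vault
    component: server
  ports:
    - name: https
      port: 8200
      targetPort: 8200
```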