Consul: Solution Design Guide | Sizing Guidelines

Sizing guidelines

The Consul server nodes process all read and write operations from the agents and maintain consensus across the cluster. As a result, they are I/O bound for writes and CPU bound for reads. Take this into account when sizing servers, and put monitoring in place so that you can adjust resources to match the type of workload in the cluster.

Workloads on virtual machines require Consul client agents for service discovery and service mesh, so the following guidelines are based on that requirement.

As a general rule, we recommend a maximum of 5,000 Consul client agents in a single datacenter. This estimate is based on the impact on recovery time, on read and write request volume, and on other factors. For read-heavy clusters, we recommend deploying read replicas to improve scalability. We have customers who have scaled Consul to tens of thousands of agents per cluster, but how far a cluster can scale depends heavily on its read and write workloads. As such, customers must optimize for stability at the gossip layer as the cluster scales. The two main factors that affect gossip stability with client agents are:

Total size of the gossip pool
The churn of nodes/agents in the pool

Control plane on VMs

We recommend deploying, at a minimum, the following instance types for the Consul servers. The recommendations are broken down by cluster size, from Initial to Extra-Large. We recommend starting with the Initial size and, once adoption grows, vertically scaling the servers to a larger size.

AWS
Size	Potential Instance Type	CPU (vCPU)	Memory (GiB)	Disk Capacity	Disk IO
Initial	m5.large	2	8	min: 100 GB (gp3)	min: 3000 IOPS
Small	m5.xlarge	4	16	min: 100 GB (gp3)	min: 3000 IOPS
Large	m5.2xlarge	8	32	min: 200 GB (gp3)	min: 7500 IOPS
Extra-Large	m5.4xlarge	16	64	min: 200 GB (gp3)	min: 7500 IOPS

Google Cloud
Size	Potential Instance Type	CPU (vCPU)	Memory (GiB)	Disk Capacity	Disk IO
Initial	n2-standard-2	2	8	min: 100 GB (pd-balanced)	min: 3000 IOPS
Small	n2-standard-4	4	16	min: 100 GB (pd-balanced)	min: 3000 IOPS
Large	n2-standard-8	8	32	min: 200 GB (pd-ssd)	min: 7500 IOPS
Extra-Large	n2-standard-16	16	64	min: 200 GB (pd-ssd)	min: 7500 IOPS

Azure
Size	Potential Instance Type	CPU (vCPU)	Memory (GiB)	Disk Capacity	Disk IO
Initial	Standard_D2s_v3	2	8	min: 100 GB (Premium SSD)	min: 3000 IOPS
Small	Standard_D4s_v3	4	16	min: 100 GB (Premium SSD)	min: 3000 IOPS
Large	Standard_D8s_v3	8	32	min: 200 GB (Premium SSD)	min: 7500 IOPS
Extra-Large	Standard_D16s_v3	16	64	min: 200 GB (Premium SSD)	min: 7500 IOPS
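For provisioning scripts, the sizing rows can be captured as data and looked up by provider and size. The following is a minimal sketch, not part of the guide: the names `ServerSpec`, `SIZING`, and `server_spec` are hypothetical, the provider keys are inferred from the instance-type naming (m5 for AWS, n2-standard for Google Cloud, Standard_D*s_v3 for Azure), and the memory unit is assumed to be GiB.

```python
from typing import NamedTuple


class ServerSpec(NamedTuple):
    instance_type: str
    vcpus: int
    memory_gib: int   # unit assumed; the tables give a bare number
    disk_gb: int      # minimum disk capacity
    disk_type: str
    min_iops: int     # minimum disk IOPS


# Rows copied from the sizing tables. Provider keys are an assumption
# inferred from the instance-type names.
SIZING = {
    "aws": {
        "Initial": ServerSpec("m5.large", 2, 8, 100, "gp3", 3000),
        "Small": ServerSpec("m5.xlarge", 4, 16, 100, "gp3", 3000),
        "Large": ServerSpec("m5.2xlarge", 8, 32, 200, "gp3", 7500),
        "Extra-Large": ServerSpec("m5.4xlarge", 16, 64, 200, "gp3", 7500),
    },
    "gcp": {
        "Initial": ServerSpec("n2-standard-2", 2, 8, 100, "pd-balanced", 3000),
        "Small": ServerSpec("n2-standard-4", 4, 16, 100, "pd-balanced", 3000),
        "Large": ServerSpec("n2-standard-8", 8, 32, 200, "pd-ssd", 7500),
        "Extra-Large": ServerSpec("n2-standard-16", 16, 64, 200, "pd-ssd", 7500),
    },
    "azure": {
        "Initial": ServerSpec("Standard_D2s_v3", 2, 8, 100, "Premium SSD", 3000),
        "Small": ServerSpec("Standard_D4s_v3", 4, 16, 100, "Premium SSD", 3000),
        "Large": ServerSpec("Standard_D8s_v3", 8, 32, 200, "Premium SSD", 7500),
        "Extra-Large": ServerSpec("Standard_D16s_v3", 16, 64, 200, "Premium SSD", 7500),
    },
}


def server_spec(provider: str, size: str) -> ServerSpec:
    """Return the minimum server spec for a provider and cluster size."""
    try:
        return SIZING[provider.lower()][size]
    except KeyError:
        raise ValueError(
            f"no sizing row for provider={provider!r}, size={size!r}"
        ) from None


if __name__ == "__main__":
    print(server_spec("aws", "Large"))
```

Keeping the rows in one structure makes the vertical-scaling path explicit: moving from Initial to a larger size is a change of one key rather than a set of hand-edited values.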

The above architecture supports a large number of agent-based clients. However, if a single datacenter is provisioned with this architecture, we highly recommend that customers monitor cluster metrics both to establish a baseline and to set threshold levels.
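One simple way to turn collected samples into a baseline and a threshold is sketched below. This is an illustration under stated assumptions, not a HashiCorp recommendation: the function name `baseline_and_threshold` is hypothetical, the rule "mean plus k standard deviations" is chosen only for simplicity, and the sample values are made up. The example uses `consul.raft.commitTime` as the metric being tracked.

```python
import statistics


def baseline_and_threshold(samples, k=3.0):
    """Return (baseline, threshold) for a sequence of metric samples.

    The baseline is the mean of the samples. The threshold is the mean
    plus k sample standard deviations (an assumed rule for illustration).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    baseline = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return baseline, baseline + k * spread


if __name__ == "__main__":
    # Made-up consul.raft.commitTime samples, in milliseconds.
    commit_ms = [10, 12, 11, 13, 9]
    baseline, threshold = baseline_and_threshold(commit_ms)
    print(f"baseline={baseline:.1f} ms, alert above {threshold:.1f} ms")
```

In practice the samples would come from a metrics backend over a representative period, and the multiplier `k` would be tuned to how noisy the metric is in the given cluster.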

Control plane on Kubernetes

Use the CPU and memory recommendations from the sizing tables to set resource limits for the Consul pods, and apply the disk recommendations when configuring persistent volumes. Set both requests and limits in the Helm chart. Below is an example Helm configuration snippet for deploying a Consul server in a large environment.

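server:
  # CPU and memory values match the Large row of the sizing tables.
  resources: |
    requests:
      memory: "32Gi"
      cpu: "8"
    limits:
      memory: "32Gi"
      cpu: "8"
  # Persistent volume size, from the Large row's minimum disk capacity.
  storage: 200Gi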
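The same pattern applies to the other sizes. As a minimal sketch, the following hypothetical helper (`helm_server_values` is not part of any Consul tooling) renders the `server` fragment for a given CPU, memory, and disk size. It sets requests equal to limits and, like the snippet above, writes the disk capacity with a `Gi` suffix.

```python
def helm_server_values(vcpus: int, memory_gib: int, storage_gb: int) -> str:
    """Render a `server:` Helm values fragment with requests equal to limits."""
    lines = [
        "server:",
        "  resources: |",
        "    requests:",
        f'      memory: "{memory_gib}Gi"',
        f'      cpu: "{vcpus}"',
        "    limits:",
        f'      memory: "{memory_gib}Gi"',
        f'      cpu: "{vcpus}"',
        # Mirrors the guide's example, which writes the disk size as Gi.
        f"  storage: {storage_gb}Gi",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Initial size: 2 CPUs, 8 GiB memory, 100 GB minimum disk.
    print(helm_server_values(2, 8, 100), end="")
```

Generating the fragment from the sizing values keeps the requests and limits from drifting apart when the servers are scaled vertically.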

HashiCorp recommends monitoring your production deployment so that you can make data-driven decisions about raising the server resource limits or vertically scaling the VM deployments.
