Detailed design
This section takes the recommended architecture from the above section and provides more detail on each component. Carefully review this section to identify all technical and personnel requirements before moving to implementation.
Sizing
Every hosting environment is different, and every customer's Boundary usage profile is different. Refer to the tables below for sizing recommendations for controller and worker nodes in both small and large use cases, based on expected usage.
Small deployments would be appropriate for most initial production deployments or for development and testing environments.
Large deployments are production environments with a consistently high workload, such as a large number of sessions.
Controller nodes
Size | CPU | Memory | Disk Capacity | Network Throughput |
---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 50+ GB | Minimum 5 Gbps |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | Minimum 10 Gbps |
Worker nodes
Size | CPU | Memory | Disk Capacity | Network Throughput |
---|---|---|---|---|
Small | 2-4 core | 8-16 GB RAM | 50+ GB | Minimum 10 Gbps |
Large | 4-8 core | 32-64 GB RAM | 200+ GB | Minimum 10 Gbps |
Refer to the hardware sizing for Boundary servers for more details. These recommendations should only serve as a starting point for operations staff to observe and adjust to meet the unique needs of each deployment. To match your requirements and maximize the stability of your Boundary controller and worker instances, perform load tests and continuously monitor resource usage and all metrics reported by Boundary's telemetry.
Hardware considerations
CPU, memory, and storage performance requirements will depend on your exact usage profile (e.g., types of requests, average request rate, and peak request rate). Boundary controllers and worker nodes have distinct resource needs as they handle different tasks. Refer to the hardware considerations for more information.
We recommend enabling audit logging in Boundary. It’s best to write audit logs to a separate disk for optimal performance. We recommend monitoring both the file descriptor usage and the memory consumption for each Boundary worker node. These resources can become constrained depending on the number of clients connecting to Boundary targets at any given time. If session recording is enabled on a target, the worker stores the session recordings locally during the recording phase. Refer to the storage considerations to determine how much storage to allocate for recordings on the worker nodes.
Networking
Network bandwidth requirements for Boundary controllers and workers depend on your specific usage patterns. It's also essential to consider bandwidth requirements for other external systems, such as monitoring and logging collectors. Refer to network considerations for more information. Monitor the networking metrics of Boundary workers to prevent situations where they are unable to initiate session connections. Review your provider-specific virtual machine networking limitations; you may need to increase the VM size to achieve higher network throughput.
Network connectivity
Refer to the network connectivity section for the minimum requirements for Boundary cluster nodes. You may also need to grant the Boundary nodes outbound access to additional services that live elsewhere, either within your internal network or via the internet. Examples may include:
- Authentication provider backends, such as Okta, Auth0, or Microsoft Entra ID
- Remote log handlers, such as a Splunk or ELK environment
- Metrics collection, such as Prometheus
Storage
It is recommended to enable audit logging in Boundary. For optimal performance, writing audit logs on a separate disk is advisable. If session recording is enabled for a target, the worker stores session recordings locally during the recording process. When estimating worker storage needs, consider the number of concurrent sessions recorded on that worker. Refer to the storage guidelines to determine the appropriate amount of storage to allocate for recordings on the worker nodes.
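As an illustration, the following is a minimal sketch of a worker stanza that places session recordings on a dedicated volume; the mount path is an assumption, and the rest of the worker configuration is omitted.

worker {
  # Store session recordings on a dedicated disk (example mount point)
  recording_storage_path = "/opt/boundary/recordings"
}

Sizing the volume backing this path according to the expected number of concurrent recorded sessions keeps recordings from competing with audit logs and the operating system for disk space.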
KMS
Boundary controllers and workers require different types of cryptographic keys. The KMS provider provides the root of trust for keys used for various purposes, such as protecting secrets, authenticating workers, recovering data, encrypting values in Boundary’s configuration, and more. Refer to the KMS section for more information.
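For illustration, the following sketch shows KMS blocks in a controller configuration using AWS KMS; the key aliases and region are assumptions, and other providers (Azure Key Vault, GCP Cloud KMS, Vault Transit) follow the same pattern.

# Root key: protects Boundary's internally generated keys
kms "awskms" {
  purpose    = "root"
  region     = "us-east-1"
  kms_key_id = "alias/boundary-root"
}

# Worker-auth key: authenticates workers to the controllers
kms "awskms" {
  purpose    = "worker-auth"
  region     = "us-east-1"
  kms_key_id = "alias/boundary-worker-auth"
}

# Recovery key: used for recovery operations
kms "awskms" {
  purpose    = "recovery"
  region     = "us-east-1"
  kms_key_id = "alias/boundary-recovery"
}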
Traffic encryption
Boundary is secure by default, and all network communication is encrypted using TLS. Boundary has three types of connections, as described in the previous TLS section:
- Client-to-controller TLS
- Client-to-worker TLS
- Worker-to-upstream TLS
Refer to the TLS documentation for detailed information on each connection type, such as how TLS establishment is performed.
From a load balancing perspective, TLS should always be configured for passthrough, i.e., without TLS termination at the load balancer. The load balancing section provides more information on this.
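To illustrate, the following is a minimal sketch of a controller API listener that terminates TLS on the controller itself, so the load balancer can simply pass TLS through; the bind address is an assumption, and the certificate paths reuse those shown later in this guide.

# API listener terminating TLS on the controller (load balancer passes TLS through)
listener "tcp" {
  address       = "0.0.0.0:9200"
  purpose       = "api"
  tls_disable   = false
  tls_cert_file = "/etc/boundary.d/tls/boundary-cert.pem"
  tls_key_file  = "/etc/boundary.d/tls/boundary-key.pem"
}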
Load Balancing
A layer 4 load balancer meets Boundary’s requirements. However, organizations may implement a layer 7 capable load balancer for additional controls. Regardless of which load balancer is adopted, the following is required:
- HTTPS listener with valid TLS certificate for the domain it's serving or TLS passthrough
- Health checks should use TCP-9203
Each major cloud provider offers one or more managed load-balancing services suitable for Boundary. Follow the guidance provided in the load balancer recommendations.
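As a hedged example, the following Terraform sketch shows how these requirements could map onto an AWS Network Load Balancer target group, with TLS passed through on the listener port and health checks against the ops endpoint on TCP-9203; the resource name and the vpc_id variable are assumptions, and other cloud providers offer equivalent constructs.

# Target group for Boundary controllers behind an AWS NLB (TLS passthrough)
resource "aws_lb_target_group" "boundary_controller" {
  name     = "boundary-controller"
  port     = 9200
  protocol = "TCP"      # layer 4, no TLS termination at the load balancer
  vpc_id   = var.vpc_id # assumed to be defined elsewhere

  health_check {
    protocol = "HTTPS"
    port     = "9203"
    path     = "/health"
  }
}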
Client-to-controller
Boundary controller nodes are placed in a private network and not exposed directly to the public internet. A load balancer is used to expose services such as the API and administrative console. This design utilizes a layer 4 load balancer with additional network security controls, such as security groups or firewall ACLs, to restrict the network flow to the load balancer interface.
Health Check
To monitor the health of the Boundary controller nodes, configure the load balancer to poll the /health API endpoint to detect their status.
The endpoint is read-only and does not support any input. Responses are returned with an empty body.
Status | Description |
---|---|
200 | GET /health returns HTTP status 200 OK if the controller's API gRPC service is up |
5xx | GET /health returns HTTP status 5xx or a request timeout if the controller is unhealthy |
503 | GET /health returns HTTP status 503 Service Unavailable if the controller is shutting down |
The listener stanza is used to configure the controller's operational endpoints. By default, it listens on TCP-9203. The operational endpoint exposes both health and metrics endpoints.
Operational endpoint stanza configuration
# Ops listener for operations like health checks for load balancers
listener "tcp" {
  # The address of the interface where your external systems
  # (e.g., load balancers and metrics collectors) will connect.
  address = "0.0.0.0:9203"

  # The purpose of this listener block
  purpose = "ops"

  tls_disable   = false
  tls_cert_file = "/etc/boundary.d/tls/boundary-cert.pem"
  tls_key_file  = "/etc/boundary.d/tls/boundary-key.pem"
}
Worker-to-controller
Similar to client-to-controller traffic, ingress workers require access to Boundary's controller nodes, which are placed in a private network. For this design, where the deployment consists of a single cloud, an internal load balancer is sufficient to allow the ingress workers to establish connectivity to the controllers via port TCP-9201.
For multi-cloud deployments operating a single control plane (e.g., in AWS) with targets and ingress workers in other clouds or on-premises, it may be necessary to expose port TCP-9201 externally so that it is reachable. One option is to add another listener for port TCP-9201 to the load balancer used for client-to-controller communication.
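For reference, a minimal sketch of the cluster listener on the controllers that the internal (or, in the multi-cloud case, external) load balancer forwards TCP-9201 traffic to; the bind address is an assumption, and TLS on this listener is managed by Boundary itself.

# Cluster listener for incoming worker connections
listener "tcp" {
  address = "0.0.0.0:9201"
  purpose = "cluster"
}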
Worker-to-worker (multi-hop sessions)
With multi-hop sessions, workers operate as intermediaries or egress workers. If more than one worker provides identical capabilities (typically for increased availability, resilience, and scale), they should be part of a load-balanced set of workers. For example, the upstream configuration "initial_upstreams" would be set to the FQDN or virtual IP (VIP) address of the load-balanced pool of workers.
The upstream configuration "initial_upstreams" accepts a list of hosts/IPs. However, as workers can be dynamic, e.g., part of an auto scaling group, using a load balancer helps with future scale-out/in scenarios and ensures a robust architecture.
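For example, a minimal sketch of a downstream worker stanza pointing at a load-balanced pool of upstream workers; the hostname and port are assumptions for illustration, and the remaining worker settings are omitted.

worker {
  # FQDN (or VIP) of the load-balanced pool of upstream workers
  initial_upstreams = ["upstream-workers.example.internal:9202"]
}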
Monitoring
Gaining visibility into Boundary’s controllers and workers is essential for production environments. It enables operators to manage, scale, and troubleshoot Boundary efficiently and assists in detecting and mitigating anomalies in a deployment.
Logs
If events are not configured via the events stanza, Boundary outputs logs to standard output (stdout) and errors to standard error (stderr) by default. Linux distributions typically capture Boundary's log output in the system journal.
The events stanza should be used for production environments to allow increased control over event logging. Event logging configured via the events stanza overrides the default behavior. For example, if Boundary is configured to send events to a file, it will no longer emit logs to stdout or stderr.
Logs should be aggregated in a centralized platform for analysis, audit, and compliance, and to aid in troubleshooting.
Minimum configuration
events {
  audit_enabled        = true
  observations_enabled = true
  sysevents_enabled    = true
  telemetry_enabled    = false

  sink "stderr" {
    name        = "all-events"
    description = "All events sent to stderr"
    event_types = ["*"]
    format      = "hclog-text"
  }
}
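Building on the minimum configuration above, the following hedged sketch adds a file sink so that audit events are written to a dedicated log path (the path and file name are assumptions), from where a log agent can forward them to a centralized platform.

events {
  # audit_enabled, observations_enabled, etc. as in the minimum configuration above

  sink {
    name        = "audit-sink"
    description = "Audit events sent to a file for collection"
    event_types = ["audit"]
    format      = "cloudevents-json"

    file {
      path      = "/var/log/boundary"
      file_name = "audit.log"
    }
  }
}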
Metrics
Metrics for controllers and workers are available for ingestion by third-party telemetry platforms, such as Prometheus and Grafana. The metrics use the OpenMetrics exposition format. Refer to the Boundary metrics documentation for a list of all available metrics.
Boundary provides metrics through the /metrics path using a listener with the "ops" purpose.
The same "ops" listener shown in the health check section above exposes both the health and metrics endpoints; by default it listens on TCP-9203, and no additional listener configuration is required for metrics.
Failure considerations
Organizations should rely on the experience of their architecture, cloud, and platform teams to provide the levels of availability for Boundary that meet their requirements. This architecture design leverages several principles and standard infrastructure services available from major cloud providers to provide high availability and fault tolerance while balancing costs.
- Auto Scaling to enhance fault tolerance. For example, if a Boundary instance is unhealthy, the auto scaling service can terminate it and launch an instance to replace it. You can also configure the service to use multiple availability zones and launch instances in another availability zone to compensate if one becomes unavailable.
- Templating images to decrease the time to deploy controllers and workers during the initial deployment and, notably, in failure and scaling scenarios.
- Infrastructure-as-Code to produce consistent, known deployments that significantly reduce configuration errors.
- Availability Zones to protect against data center failures. Boundary components would be deployed and spread across at least three availability zones in production environments. If deploying Boundary to three availability zones is not possible, you can use the same architecture across one or two availability zones at the expense of a reliability risk in case of an availability zone outage.
- Load balancing to provide traffic redirection and health checking.
Controllers
Boundary controllers are stateless. They store all state and configuration within PostgreSQL and can withstand failure scenarios where only one node is accessible. When a controller node fails, users will still be able to interact with other Boundary controllers, assuming the presence of additional nodes behind a load balancer. Boundary controllers depend on a PostgreSQL database. The database should be deployed so all Boundary controller nodes can reach it. It should also inherit the same levels of availability as the controllers. PostgreSQL should not be deployed on the controller nodes.
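To illustrate the relationship, a minimal sketch of a controller stanza pointing at an external PostgreSQL database; the controller name and the environment variable holding the connection URL are assumptions.

controller {
  name        = "boundary-controller-1"
  description = "Boundary controller"

  database {
    # Read the PostgreSQL connection URL from an environment variable to
    # avoid hard-coding credentials in the configuration file
    url = "env://BOUNDARY_PG_URL"
  }
}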
Workers
Boundary workers act as either proxies or reverse proxies. Workers routinely communicate with the controllers to report their health. To withstand a worker node failure, it's best practice to run at least three workers per network boundary and per type (ingress and egress) in production environments; the controller will then assign a user's proxy session to an active Boundary worker node.
Availability zone failures
The following section provides recommendations for controllers and workers to overcome availability zone outages.
Controllers
By deploying Boundary controllers in the recommended architecture across three availability zones with load balancing in front of them, the Boundary control plane can survive an outage of up to two availability zones.
Workers
The best practice for deploying Boundary workers is to have at least one worker deployed per availability zone. In the case of an availability zone outage, if the networking service is still up, users will have their attempted session connection proxied through a worker in a different availability zone and then onto the target (provided the proper security rules are in place to allow for cross-subnet/availability zone communication).
Regional failures
Generally speaking, when there is a failure in an entire cloud region, the resources running in that region will most likely be inaccessible, especially if the networking service is affected.
Controllers
To continue serving Boundary controller requests during a regional outage, a deployment like the one outlined in this guide must exist in a different region. The nodes in the secondary region must be able to communicate with the PostgreSQL database, which can be accomplished with multi-regional database technologies from the various cloud providers (for example, AWS RDS read replicas can be promoted to the primary database if the primary resides in a failed region).
Another point of consideration is how to load balance Boundary controller requests across the regions that are not in a failed state. Services such as AWS Global Accelerator, Azure cross-region Load Balancer, and GCP Cloud Load Balancer provide this level of functionality.
Workers
In the case of a regional outage, if a Boundary worker cannot reach its upstream worker or a controller, if a user cannot reach the worker, or any combination of the above, the user will not be able to establish a proxied session to the target.