Well-Architected Framework
Design and scale compute instances
Inadequate compute capacity planning causes service outages, inflates infrastructure costs, and creates unpredictable user experiences. When infrastructure cannot meet demand, applications slow down or fail, directly impacting your users and revenue. However, over-provisioning infrastructure to avoid outages leads to unnecessary spending on unused resources.
Workloads can be unpredictable. Seasonal cycles can drive large traffic spikes within minutes, reaching demand levels you have not seen before. An online game can see huge growth in its player base long after release through organic, viral popularity. With proper planning, you can respond to these growing needs before your users experience issues.
Planning your infrastructure needs includes the following steps:
- Profile your workload: Load tests and stress tests help you understand how your applications use your infrastructure resources as the workload increases.
- Design for scale: Early design decisions like load balancing and caching strategy can help evenly distribute your workload across your infrastructure.
- Monitor and react: Know how to act when you begin reaching the thresholds of your infrastructure.
After profiling your workload with tests, use the insights to design autoscaling policies that respond to demand before users experience issues.
Why select and scale compute
Proper compute capacity planning addresses the following strategic challenges:
- Prevent revenue loss from outages: Undersized infrastructure causes applications to slow down or fail during traffic spikes, directly impacting user experience and revenue.
- Reduce infrastructure waste: Over-provisioning to avoid outages leads to paying for unused capacity that sits idle during normal traffic periods.
- Enable predictable growth: Without capacity planning and testing, you cannot predict how your infrastructure will respond to organic growth or viral traffic spikes.
- Minimize deployment risk: Scaling decisions made during incidents lead to rushed changes and configuration errors that can worsen outages.
The capacity planning workflow follows these steps:
- Profile your workload: Run load and stress tests to identify capacity limits and bottlenecks.
- Design for scale: Configure autoscaling policies, load balancers, and caching based on test results.
- Monitor and respond: Set up health checks and alerts to respond before reaching capacity limits.
Plan for your known capacity
This section explains how to use load and stress testing to understand your infrastructure's capacity limits and determine the best scaling approach for your workload.
Run tests to understand what your application workload looks like and where the bottlenecks are. Understanding your infrastructure's capacity helps you know how to scale your infrastructure as the workload increases. For example, in some use cases it may be better to scale by adding more virtual machines, known as horizontal scale, or it may be better to swap your virtual machines for more powerful ones, known as vertical scale.
Use vertical scaling when you need more processing power for single-threaded applications, memory-intensive workloads, or database operations that benefit from larger instances. Use horizontal scaling when you need high availability, fault tolerance, or cost-effective scaling for stateless applications.

Vertical scaling is often useful for applications that can take advantage of more raw resources such as memory or number of CPU cores, but vertical scaling can make budgeting and dynamic scaling harder. Developing your application to support horizontal scaling requires solving issues like data persistence and load balancing, but horizontal scaling makes scaling up and down easier, and can make your costs easier to budget.
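As a sketch of this trade-off, the following hypothetical Terraform configuration exposes both scaling dimensions as variables. The variable names, defaults, and AMI lookup are illustrative assumptions, not a specific recommendation:

```hcl
# Hypothetical sketch: expose both scaling dimensions as variables.
variable "instance_type" {
  description = "Vertical scale: a larger type gives each instance more CPU and memory."
  type        = string
  default     = "t3.micro"
}

variable "instance_count" {
  description = "Horizontal scale: more instances spread the workload."
  type        = number
  default     = 2
}

resource "aws_instance" "app" {
  count         = var.instance_count
  ami           = data.aws_ami.app_ami.id # assumes an AMI data source exists
  instance_type = var.instance_type
}
```

With this structure, a vertical scaling change is a new `instance_type` value, while a horizontal scaling change is a new `instance_count` value, which makes the cost impact of each choice easy to compare in a plan.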
Two types of tests that you can run to help plan your infrastructure capacity requirements are load tests and stress tests.
- Load tests: Simulate normal workloads and measure how your infrastructure resource consumption grows as the load increases. Load tests are useful for identifying possible bottlenecks early so that you can scale your infrastructure before users start experiencing issues.
- Stress tests: Simulate extreme, unnatural workloads to see how your infrastructure performs under high resource utilization. Stress tests are useful to determine how your infrastructure handles sudden spikes in traffic, and help identify how your infrastructure may break under extreme conditions.
How you run load and stress tests depends on your infrastructure, the applications running on it, and your users' behavior. For example, if your infrastructure runs a web application where users make many lightweight requests, your tests could simulate sending as many of these requests as possible across different endpoints. If your infrastructure runs a service that processes user uploads, your tests could upload many extremely large files at once.
As you run your load and stress tests, monitor metrics such as CPU usage, memory usage, response latency, and application error rates. These metrics give important insight into how your application responds to heavy traffic.
Terraform can create test environments that mirror your production environments with infrastructure as code (IaC). Mirrored environments ensure consistency between your testing and production environments, and help reduce cost by letting you run your test environment only when you need it.
When you integrate Terraform into your CI/CD pipeline, you can automatically spin up a temporary environment, perform a load test, and destroy the environment before pushing changes to production. This level of testing gives you greater confidence in your production changes while saving on operational costs.
When you use Terraform to help test capacity, you gain the following benefits:
- Reproducibility: Your test environment configuration is version-controlled and you can recreate it to be identical whenever you want, eliminating configuration drift between tests.
- Cost efficiency: Spin up test infrastructure only when running tests, then destroy it immediately after.
- Parallel testing: Create multiple isolated test environments simultaneously to test different configurations or scenarios without interference.
- Reusable modules: Use the same Terraform modules for both test and production environments, ensuring your tests accurately reflect your production code.
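For example, a single module can back both environments so that load tests exercise the same code that ships to production. The module path, variables, and counts below are illustrative assumptions:

```hcl
# Hypothetical sketch: the same module backs the test and production
# environments, so load tests exercise the code that ships to production.
module "app_load_test" {
  source = "./modules/app" # illustrative module path

  environment    = "load-test"
  instance_count = 2
}

module "app_production" {
  source = "./modules/app"

  environment    = "production"
  instance_count = 6
}
```

In a pipeline, you might run `terraform apply` to create the test environment, execute the load test, and run `terraform destroy` once you collect the results.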
HashiCorp resources
- What is Infrastructure as Code with Terraform?
- Running Terraform in automation
- Create development environments with Vagrant
- Monitor network traffic with Consul
External resources
- Learn more about tools to run load and stress tests, such as Apache JMeter, k6, and Locust.
After identifying your capacity limits through testing, design your infrastructure to respond to workload changes automatically.
Design to meet demand
Early design decisions not only improve your application performance, but can also make it easier to scale your infrastructure as your usage grows. Consider the following design decisions early to help plan for future growth:
- Load balancing: If your infrastructure scales horizontally, load balancers can work together with autoscaling to more evenly spread the workload on individual compute instances.
- Caching: If the application running on your infrastructure frequently returns data that does not change often, such as images, static webpages, or API requests, you can use caching and Content Delivery Networks (CDN) to lower response time latency and the need for individual compute instances to process requests.
- Autoscaling: Set rules based on key metrics, such as CPU and memory utilization, to automatically scale your infrastructure up as the workload increases and back down as it decreases. Designing your application to be stateless and storing your data externally makes it easier to scale horizontally.
You can deploy and manage load balancers, autoscaling policies, and caching solutions using IaC and Terraform. Terraform lets you configure your infrastructure scaling and caching policies alongside your infrastructure configuration.
The following example demonstrates a complete autoscaling configuration that automatically adjusts compute capacity based on CPU utilization while distributing traffic across healthy instances. This configuration creates an autoscaling group in AWS, configures how the group scales its instances, sets up a load balancer, and attaches it to the autoscaling group.
```hcl
# Look up the default VPC, its subnets, and an AMI for the instances.
data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

data "aws_ami" "app_ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = data.aws_ami.app_ami.id
  instance_type = "t3.micro"
}

resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  vpc_zone_identifier = data.aws_subnets.default.ids
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  target_group_arns   = [aws_lb_target_group.app.arn]

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "app-instance"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "cpu" {
  name                   = "cpu-scaling"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = 80.0
  }
}

resource "aws_lb" "test" {
  name               = "my-app-lb"
  internal           = false
  load_balancer_type = "application"
  subnets            = data.aws_subnets.default.ids

  tags = {
    Environment = "production"
  }
}

resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = data.aws_vpc.default.id
  target_type = "instance"

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }
}

resource "aws_lb_listener" "app" {
  load_balancer_arn = aws_lb.test.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}
```
This example includes the following resources to configure autoscaling:
- aws_launch_template: The AWS launch template that configures the instances the autoscaling group creates.
- aws_autoscaling_group: The AWS autoscaling group that defines information such as the minimum and maximum number of instances that AWS can scale to.
- aws_autoscaling_policy: A target tracking policy that tells the autoscaling group to keep average CPU utilization near 80%, adding instances when utilization rises above the target and removing them when it falls below. Terraform attaches this policy to the autoscaling group.
This example includes the following resources to configure load balancing:
- aws_lb: The application load balancer infrastructure to distribute traffic.
- aws_lb_target_group: Configures how AWS routes traffic between the load balancer and the application instances.
- aws_lb_listener: Listens for incoming traffic to the load balancer and routes it to the target group.
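You can also express the caching strategy described earlier in Terraform. The following is a minimal sketch of a CloudFront distribution that caches responses from the load balancer in the example above; the TTL values, protocol settings, and certificate choice are illustrative assumptions you should tune to your workload:

```hcl
# Hypothetical sketch: cache responses from the load balancer at the edge.
resource "aws_cloudfront_distribution" "app" {
  enabled = true

  origin {
    domain_name = aws_lb.test.dns_name
    origin_id   = "app-lb"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "http-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "app-lb"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    # Illustrative TTLs: tune to how often your content changes.
    min_ttl     = 0
    default_ttl = 300
    max_ttl     = 3600

    forwarded_values {
      query_string = false

      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```

Because CloudFront serves cached responses from edge locations, requests for static content never reach the autoscaling group, which lowers latency for users and reduces the load that your instances must scale to handle.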
If you deploy your application or service as a container, use an orchestrator to automatically scale containers based on resource utilization or custom metrics.
Use Kubernetes when you need extensive ecosystem tooling, complex networking requirements, or multi-cloud portability. Use Nomad when you need to orchestrate both containerized and non-containerized workloads, or want simpler operations with less infrastructure overhead.
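As a sketch of what this looks like in Nomad, the following hypothetical job defines a task group with a scaling policy that the Nomad Autoscaler can act on. The query, target, cooldown, and container image are illustrative assumptions, and the policy only takes effect when a Nomad Autoscaler is running against the cluster:

```hcl
# Hypothetical sketch: a task group the Nomad Autoscaler can scale
# between 2 and 10 instances based on allocated CPU.
job "web" {
  datacenters = ["dc1"]

  group "app" {
    count = 2

    scaling {
      enabled = true
      min     = 2
      max     = 10

      policy {
        cooldown = "1m"

        check "cpu" {
          source = "nomad-apm"
          query  = "percentage-allocated_cpu" # illustrative query

          strategy "target-value" {
            target = 70
          }
        }
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine" # illustrative image
      }
    }
  }
}
```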
HashiCorp resources
- Try the Manage AWS Auto Scaling Groups and Manage Azure Virtual Machine Scale Sets with Terraform tutorials.
- Terraform resource: aws_autoscaling_group
- Terraform resource: azurerm_linux_virtual_machine_scale_set
- Terraform resource: google_compute_autoscaler
- Learn how Nomad autoscaling works.
External resources
- Amazon EC2 Auto Scaling benefits for application architecture
- What are Azure Virtual Machine Scale Sets?
- Autoscaling groups of GCP instances
- Modular & Shift-Left Observability for Modern DevOps Pipelines
Respond to scale issues
As the workload on your infrastructure grows, you need to know how and when to react to scaling needs. When you understand how workload profiles grow, you can identify when to take action. With strong monitoring and alerting, you can respond to these needs before they become a problem for your users.
- Monitoring and alerting: Monitoring key metrics of your infrastructure, such as CPU, memory, and disk IOPS, lets you understand when you are approaching those bottlenecks you identified during load and stress testing. With early alerts, you can make the decision to add more resources to your infrastructure to keep the applications that depend on it running without error.
- Continuous validation: Along with monitoring and alerting, you can routinely evaluate workload patterns to decide when to make changes to your infrastructure architecture. Continuous validation is the practice of routinely performing checks against your infrastructure to ensure it meets certain requirements. For example, you can write a check that retrieves the expiration date of a TLS certificate and alerts you 30 days before it expires.
- Health checks: Custom logic that runs periodically to check that a service is performing as expected. Often this logic lives in the application itself, and can be invoked by infrastructure such as a load balancer. The logic lets the infrastructure know if the application is in a state to accept traffic.
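For example, HCP Terraform continuous validation evaluates Terraform check blocks after every apply. The following minimal sketch asserts that a health endpoint returns HTTP 200; the URL is an illustrative assumption, and the block relies on the hashicorp/http provider:

```hcl
# Hypothetical sketch: a continuous validation check that asserts the
# application's health endpoint responds with HTTP 200.
check "app_health" {
  data "http" "health" {
    url = "https://app.example.com/health" # illustrative URL
  }

  assert {
    condition     = data.http.health.status_code == 200
    error_message = "${data.http.health.url} returned an unhealthy status code."
  }
}
```

Unlike a plan-time validation, a failed assertion here does not block the run; it surfaces as a warning, so you learn about degraded infrastructure without breaking your pipeline.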
Define health checks with Consul
Health checks work together with autoscaling and load balancing to ensure traffic only routes to instances that can handle requests. Consul lets you define health checks that send HTTP requests, establish TCP connections, run local scripts, and more.
The following example defines a health check that sends an HTTP GET request to the /health endpoint every 15 seconds. If the service returns an HTTP 200 response code, Consul considers the service healthy and load balancers continue routing traffic to the instance.
```hcl
checks = [
  {
    id       = "chk1"
    name     = "/health"
    http     = "http://localhost:5000/health"
    interval = "15s"
  }
]
```
By moving the logic inside the service, you gain deeper insight into how the service is performing and can decide when it is no longer healthy. A delayed response might mean the service is functioning as expected but under very high load, while an unhealthy response might indicate a deeper issue in the service. With this pattern, you can build your application to inspect the dependencies that are critical to smooth operation, such as database and downstream service connectivity, which makes issues easier to track down.
HashiCorp resources
- Learn how to Manage AWS Auto Scaling Groups and Azure Virtual Machine Scale Sets with Terraform.
- Read the HCP Terraform continuous validation documentation.
- Learn how to Set up monitoring agents with Terraform and Packer.
- Read the Define health checks Consul documentation.
Next steps
In this section of Select and design infrastructure, you learned how to profile the workload on your infrastructure with load and stress tests, react to growing infrastructure needs, and design to minimize the load on your system. Select and design infrastructure is part of the Optimize systems pillar.
To learn more about how to design and scale your infrastructure, refer to the following resources: