React to metrics and monitoring
This guide describes how you can troubleshoot issues in your infrastructure with HashiCorp tools. Monitoring key metrics and configuring proper alerting can help you fix issues before they affect your production applications, or help you quickly diagnose issues as they arise. You can use this guide to develop your organization's troubleshooting strategy.
For more information on configuring your monitoring and alerting, refer to Manage infrastructure and service monitoring.
Manage resources
Your application's resource requirements change over time, and you can avoid performance issues by scaling proactively. You can get actionable insight from your applications and the infrastructure they run on by monitoring metrics like CPU and memory utilization, disk I/O and free space, and network I/O.
As your system runs over time, you need to understand how it changes in order to improve its performance and security. Automating the correction of issues such as high CPU usage and non-compliant infrastructure configuration drift before they cause problems for users and other services reduces the manual effort needed to manage an incident.
An advantage of infrastructure as code (IaC) is the ability to quickly make changes to your infrastructure. If your monitoring alerts you that you are approaching resource thresholds, like a disk getting low on storage space or a message queue that can no longer keep up with demand, you can update your Terraform configuration to allocate more resources.
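As a minimal sketch of this workflow, the following Terraform configuration grows a disk after a low-storage alert. The resource name and values are hypothetical, and assume an AWS EBS volume already managed in your configuration:

```hcl
# Hypothetical example: grow an EBS volume that monitoring flagged as
# nearly full.
resource "aws_ebs_volume" "app_data" {
  availability_zone = "us-east-1a"
  type              = "gp3"

  # Raised from 100 after a low-disk-space alert. Terraform plans an
  # in-place modification rather than replacing the volume.
  size = 200
}
```

After you update the value, `terraform plan` shows the change before you apply it, so you can confirm that the update modifies the volume in place.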
When you have a robust monitoring and alerting strategy, you can address some issues proactively before they cause problems for your users and other services. Even if you cannot catch every issue before it becomes a problem, planning to respond to these issues reactively can reduce downtime.
Scale servers
As the load on your application grows, so does its use of infrastructure resources. It is important to know how your application's resource usage grows over time so that you can develop a scaling strategy early.
Autoscaling enables you to add or remove infrastructure resources based on a set of metrics. The metric could be that the average CPU utilization is above a predefined threshold, or it could trigger based on application-specific metrics such as average response time or queue depth.
Many cloud providers offer the ability to create autoscaling groups, and you can manage these groups programmatically with Terraform. Refer to the provider's documentation for more information.
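As one sketch of this pattern, the following configuration defines an AWS Auto Scaling Group with a target-tracking policy keyed to average CPU utilization. The names, subnet variable, and launch template reference are hypothetical placeholders for your own configuration:

```hcl
# Hypothetical example: an Auto Scaling Group that scales on average CPU.
resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  max_size            = 10
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

# Target-tracking policy: add or remove instances to hold average CPU
# utilization near the 60% target.
resource "aws_autoscaling_policy" "cpu" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60
  }
}
```

Azure Virtual Machine Scale Sets and GCP autoscalers follow the same approach with their respective provider resources.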
HashiCorp resources:
- AWS provider Auto Scaling Group resource
- Manage AWS Auto Scaling Groups
- Azure provider Virtual Machine Scale Set resource
- Manage Azure Virtual Machine Scale Sets with Terraform
- GCP provider Autoscaler resource
External resources:
- AWS Auto Scaling groups
- Azure Virtual Machine Scale Sets documentation
- GCP autoscaling groups of instances
Scale container workloads and orchestrators
If you are using a container orchestrator such as Kubernetes or Nomad, you can configure your workloads to scale automatically depending on the current load demand. Containerized workloads are often quicker to scale and offer more flexibility than scaling virtual machines, letting you dynamically meet the resource requirements for your applications.
We recommend that you manage and monitor your container orchestrator the same way that you manage your other servers. This lets you build and scale your Nomad and Kubernetes clusters with Terraform just as you would manage the containers you run with them.
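For example, you can define a Kubernetes Horizontal Pod Autoscaler with the Terraform Kubernetes provider. The deployment name and thresholds below are hypothetical:

```hcl
# Hypothetical example: scale a Deployment between 2 and 20 replicas
# based on average CPU utilization.
resource "kubernetes_horizontal_pod_autoscaler_v2" "web" {
  metadata {
    name = "web"
  }

  spec {
    min_replicas = 2
    max_replicas = 20

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "web"
    }

    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}
```

Managing the autoscaler alongside the rest of your cluster configuration keeps your scaling policy in version control with the workloads it governs.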
HashiCorp resources:
- Kubernetes provider Horizontal Pod Autoscaler resource
- Dynamically Resize with Nomad Autoscaler
- Manage Kubernetes resources via Terraform
- Deploy applications with the Helm provider
- Deploy infrastructure with the Terraform Cloud Operator v2
- Manage Kubernetes with Terraform tutorial collection
- Nomad cluster setup with Terraform
- Monitoring Nomad
Automatically detect resource drift and health
There may be times when you or someone else makes changes to your infrastructure manually, causing your actual state to no longer match your Terraform configuration. For example, a teammate can manually update your infrastructure, or your cloud provider can automatically update a resource. This can often cause issues when you go to update your configuration. This difference between your Terraform state and real-world resources is called "drift".
You can also write custom conditions in Terraform to validate your resources with custom logic. For example, if your Terraform configuration deploys a web application, you can write a custom condition to send an HTTP request to the application and ensure it returns a healthy response.
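The paragraph above can be sketched with a Terraform `check` block that uses the HTTP data source to probe an endpoint. The URL and check name are hypothetical, and this assumes Terraform 1.5 or later:

```hcl
# Hypothetical example: assert that a deployed web application responds
# with HTTP 200 on its health endpoint.
check "app_health" {
  data "http" "app" {
    url = "https://app.example.com/healthz"
  }

  assert {
    condition     = data.http.app.status_code == 200
    error_message = "The application health endpoint did not return HTTP 200."
  }
}
```

Unlike preconditions and postconditions, a failed `check` produces a warning rather than blocking the apply, which makes it well suited to ongoing health assessments.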
HCP Terraform can run periodic refresh-only Terraform plans to automatically detect drift and run your custom conditions. You can configure these health checks on individual workspaces. When HCP Terraform detects drift, you can choose to either run a new Terraform apply, or update your configuration to match the new state.
HashiCorp resources:
- HCP Terraform health assessments
- Use health assessments to detect infrastructure drift
- Manage resource drift
- Terraform custom conditions
- Use checks to validate infrastructure
Network health checks
The network communication between your services can become very complex, especially if your applications communicate with several other internal services. Network stability and troubleshooting are complex topics in their own right, and discovering where your services are running in a dynamic service architecture adds another challenge.
Monitor your service network traffic
If you use a network service to manage communication between your services, such as HashiCorp's Consul, you should maintain a list of troubleshooting steps that can help you quickly resolve any issues that arise. Troubleshooting service-to-service communication can be especially complicated when you have many services all communicating with each other. In these situations, a visualization of your network's health can help quickly identify issues in your distributed services.
HashiCorp resources:
- Consul agent telemetry
- Monitoring Consul components
- Consul service-to-service troubleshooting overview
- Consul Service Mesh Observability: UI Visualization
- Consul monitoring and alerts recommendations
- Dashboards for Consul service mesh observability
- Monitor Consul server health and performance with metrics and logs
- Monitor application health and performance with Consul proxy metrics
Plan for failover
As important as it is to plan for performance degradation in your infrastructure, it's just as important to have a plan for unexpectedly losing critical infrastructure.
Load balancers can help the stability of your applications in two ways. First, they spread the load to your application across multiple instances, reducing the load on any single instance. Second, some load balancers have built-in health checks: if an instance fails, the load balancer stops sending traffic to that instance. You can use Terraform to manage load balancers in all major cloud providers, such as the `aws_lb` resource in AWS, the `azurerm_lb` resource in Azure, and the `google_compute_forwarding_rule` resource in GCP.
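As a minimal sketch of the health-check behavior, the following AWS target group removes failing instances from rotation automatically. The names, port, and thresholds are hypothetical:

```hcl
# Hypothetical example: an Application Load Balancer target group whose
# health check stops routing traffic to failing instances.
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"
    interval            = 15 # probe each instance every 15 seconds
    healthy_threshold   = 3  # passes required before receiving traffic
    unhealthy_threshold = 2  # failures required before removal
  }
}
```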
Additionally, you can use a service discovery and service mesh solution like Consul and its internal DNS to ensure your application only sends requests to healthy instances without needing to route all the way up to the load balancer and back down. Consul also automatically routes requests to services as they dynamically spin up and shut down, or otherwise become unresponsive.
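A Consul service definition with a health check illustrates this: Consul only returns healthy instances through DNS and service mesh routing. The service name, port, and endpoint below are hypothetical:

```hcl
# Hypothetical example: register a service with Consul along with an
# HTTP health check. Instances that fail the check are removed from
# DNS and service mesh query results until they recover.
service {
  name = "billing-api"
  port = 9090

  check {
    id       = "billing-api-http"
    http     = "http://localhost:9090/healthz"
    interval = "10s"
    timeout  = "2s"
  }
}
```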
Terraform lets you reuse your configuration to set up multi-region and multi-cloud deployments. With Stacks, you can roll out the same configuration from testing environments, to staging, to production, ensuring that your infrastructure changes are stable and production-ready.
HashiCorp resources:
- Monitor your application health with distributed checks
- Implement circuit breaking in Consul service mesh with Envoy
- Failover with sameness groups in Consul
- Use Application Load Balancers for blue-green and canary deployments
- Deploy federated multi-cloud Kubernetes clusters
- Deploy a Stack with HCP Terraform
Security and compliance management
Being proactive in building resilient security measures for your applications and having a plan to routinely rotate sensitive information not only keeps your infrastructure secure, but also helps simplify ongoing operation. Vault is a powerful tool that can help protect sensitive information as well as make it easy to routinely rotate secrets like TLS certificates to secure communication between your application and services.
Rotate expired certificates
We recommend that you define a consistent TTL for every certificate in your infrastructure and automatically rotate your certificates prior to their expiration. When implementing automatic certificate rotation, set up your alerting solution to notify you before your certificates become invalid, in case a service or piece of infrastructure fails to reload the new certificate. You can apply the process of handling an expired certificate to other situations, such as revoking a certificate outside of your usual rotation cycle. If a certificate is compromised or a private key is leaked, having a solution to quickly revoke the certificate and issue a new one can help reduce downtime and keep your infrastructure secure.
HashiCorp Vault can manage, issue, rotate, and revoke certificates throughout your infrastructure. You can also use the Vault Agent to automatically make requests on behalf of the client application. This means once you reissue a certificate, the Vault Agent automatically makes it available to your application. You can use Vault Agent's reload capability to restart the service to use the new certificates or build the reload into the application.
You can use the Vault Agent to supervise a specific process and take actions related to that process. For example, if you use the Vault Agent with MongoDB, the agent can restart the service or send a signal to the process to reload the configuration after it obtains a new TLS certificate.
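The Vault Agent workflow above can be sketched with a configuration that renders a certificate from Vault's PKI secrets engine and reloads the service when the file changes. The paths, role names, and reload command are hypothetical assumptions, not a definitive setup:

```hcl
# Hypothetical Vault Agent configuration: authenticate with AppRole,
# render a TLS certificate from the PKI engine, and reload the service
# whenever a new certificate is written.
auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault/role-id"
      secret_id_file_path = "/etc/vault/secret-id"
    }
  }
}

template {
  destination = "/etc/app/tls/cert.pem"
  contents    = "{{ with secret \"pki/issue/app\" \"common_name=app.example.com\" }}{{ .Data.certificate }}{{ end }}"

  # Assumed reload hook: signal the service after the template renders.
  exec {
    command = ["systemctl", "reload", "app"]
  }
}
```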
HashiCorp resources:
- Build your own certificate authority (CA)
- X.509 certificate management with Vault
- Vault Agent and Vault Proxy quickstart
- Vault Agent's Process Supervisor mode
- Vault Agent - secrets as environment variables tutorial
- Monitor Vault telemetry & audit devices
Seal Vault during a security incident
In the case of a security incident, it can be important to lock down your most sensitive services such as Vault until the issue is resolved. Incidents such as credential leakage, intrusion, or denial-of-service attacks mean that timely mitigation is top priority. Vault provides two features to help you lock the service down until you resolve the incident:
- Seal: Vault discards the in-memory key it uses to decrypt its data, preventing it from responding to any request for secrets until it is unsealed.
- API Lock: If you do not require Vault to be entirely sealed, you can instead lock the API for individual namespaces.
After a security incident, it's important to review what caused it, and invalidate any compromised credentials. Boundary provides audit logging and session recording, giving you valuable insight into how an attacker gained access to your infrastructure. Vault Radar automatically detects and identifies unmanaged secrets in your code, letting you know if there are any sensitive credentials that might be used to gain access to your infrastructure.
HashiCorp resources:
- Vault emergency break-glass features
- Boundary audit log streaming
- Boundary recorded sessions operations
- What is Vault Radar?
Next steps
While it's impossible to prepare for every troubleshooting scenario you may encounter in your infrastructure, having a design with reliability in mind and a solid strategy to diagnose potential issues can help reduce downtime and improve the experience for your users. By using Terraform, Consul, and Vault, you can build reliable infrastructure, monitor for potential issues, and get ahead of problems before they affect production.
To learn more about setting up monitoring and alerting for your infrastructure, refer to Manage infrastructure and service monitoring.