Run a reliable Nomad cluster


This document outlines implementation resources for maintaining reliable Nomad clusters. When you implement proper reliability measures, you ensure high availability, fault tolerance, and consistent performance of your Nomad infrastructure.

The following sections cover architecture, monitoring, resource management, and recovery.

Architecture

Learn about Nomad Community Edition and Enterprise architecture and best practices to build reliable Nomad environments.

Reference architecture for HashiCorp Nomad production deployments
Learn the technical details of Nomad with Nomad system architecture

Monitoring

Monitor your Nomad environment by collecting telemetry data on performance, audit activity, and infrastructure usage, so you can verify that Nomad stays reliable.

Monitor the Nomad client and server agents with the metrics those agents emit.
Use Nomad runtime metrics for debugging or understanding the performance of your Nomad cluster.
Monitor the underlying infrastructure that Nomad runs on.
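
As a rough illustration of working with agent metrics, the following sketch filters the gauges in a metrics payload by name prefix. It assumes the JSON shape returned by the agent's `/v1/metrics` HTTP endpoint (a `Gauges` list whose entries carry `Name` and `Value` fields). The helper name `gauges_by_prefix` and the sample payload are hypothetical, so check the field names against the Nomad version you run.

```python
import json


def gauges_by_prefix(payload: str, prefix: str) -> dict:
    """Return {gauge name: value} for gauges whose name starts with prefix.

    Assumes the payload is a JSON object with a "Gauges" list of
    {"Name": ..., "Value": ...} entries. A missing or null "Gauges"
    key yields an empty result.
    """
    data = json.loads(payload)
    result = {}
    for gauge in data.get("Gauges") or []:
        name = gauge.get("Name", "")
        if name.startswith(prefix):
            result[name] = gauge.get("Value")
    return result


# Hypothetical sample payload, shaped like the assumed response.
SAMPLE = json.dumps({
    "Gauges": [
        {"Name": "nomad.runtime.alloc_bytes", "Value": 1024.0},
        {"Name": "nomad.runtime.num_goroutines", "Value": 42.0},
        {"Name": "nomad.client.allocated.cpu", "Value": 500.0},
    ]
})

if __name__ == "__main__":
    # Print only the runtime gauges from the sample payload.
    print(gauges_by_prefix(SAMPLE, "nomad.runtime."))
```

In practice, you would fetch the payload from a running agent (by default the HTTP API listens on port 4646) rather than use a hard-coded sample, and forward the values to whichever monitoring system you already operate.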

Resource management

Efficiently manage your Nomad infrastructure, scaling, and performance.

Use Nomad Bench to run test scenarios that collect metrics and data from Nomad clusters running at scale.
Scale with Nomad Autoscaler, a horizontal application and cluster autoscaler for Nomad.
Manage resource quotas with Sentinel.

Recovery

Take regular backups so that you can recover Nomad if the cluster degrades.

Recover from a Nomad outage
Generate a snapshot of Nomad server state for disaster recovery.
Learn about failure recovery strategies for tasks and jobs.
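
One way to make snapshots routine is to wrap the `nomad operator snapshot save` command in a small script that a scheduler runs. The sketch below is a minimal example under stated assumptions: the backup directory, the `nomad-<timestamp>.snap` file name pattern, and the function names are all hypothetical. It also assumes the `nomad` binary is on the `PATH` and that the environment already points at your cluster with sufficient privileges.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(backup_dir: str, now: datetime) -> list:
    """Build the argv that saves a snapshot to a timestamped file.

    The directory layout and file name pattern are illustrative choices,
    not Nomad conventions.
    """
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(backup_dir) / f"nomad-{stamp}.snap"
    return ["nomad", "operator", "snapshot", "save", str(target)]


def save_snapshot(backup_dir: str) -> str:
    """Run the snapshot command and return the snapshot file path.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    cmd = snapshot_command(backup_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Keeping the command construction separate from its execution lets you test the naming logic without a running cluster. Store the resulting files outside the Nomad servers, and periodically rehearse a restore so you know the backups are usable.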

Next steps

In this document, you learned about the HashiCorp resources for implementing and running a reliable Nomad cluster. The following are implementation guides for other HashiCorp products.

Vault implementation resources
