Run a reliable Nomad cluster
This document lists resources for implementing and maintaining reliable Nomad clusters. Proper reliability measures help you achieve high availability, fault tolerance, and consistent performance across your Nomad infrastructure.
The following sections cover architecture, monitoring, resource management, and recovery.
Architecture
Learn about Nomad Community Edition and Enterprise architecture and best practices to build reliable Nomad environments.
- Reference architecture for HashiCorp Nomad production deployments
- Learn the technical details of Nomad with Nomad system architecture
Monitoring
Monitor your Nomad environment by collecting telemetry data on performance, audit events, and infrastructure usage to help keep Nomad reliable.
- Monitor the Nomad client and server agents with the metrics that those agents emit.
- Use Nomad runtime metrics to debug or understand the performance of your Nomad cluster.
- Monitor the underlying infrastructure that Nomad runs on.
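As an illustration of working with agent metrics, the following is a minimal Python sketch that filters gauge values out of a metrics payload by name prefix. It assumes the payload follows the JSON shape returned by the Nomad agent's `/v1/metrics` HTTP endpoint, with a `Gauges` list of objects carrying `Name` and `Value` fields. The `select_gauges` helper and the sample values are hypothetical and exist only for this example.

```python
import json
from typing import Any, Dict


def select_gauges(metrics: Dict[str, Any], prefix: str) -> Dict[str, float]:
    """Return {name: value} for every gauge whose name starts with prefix.

    Labels are ignored in this sketch, so if two gauges share a name,
    the later entry overwrites the earlier one.
    """
    selected: Dict[str, float] = {}
    for gauge in metrics.get("Gauges") or []:
        name = gauge.get("Name", "")
        if name.startswith(prefix):
            selected[name] = float(gauge.get("Value", 0.0))
    return selected


# Made-up sample payload (assumption: same shape as a /v1/metrics response).
SAMPLE_PAYLOAD = json.loads("""
{
  "Gauges": [
    {"Name": "nomad.runtime.num_goroutines", "Value": 120, "Labels": {}},
    {"Name": "nomad.runtime.alloc_bytes", "Value": 50000000, "Labels": {}},
    {"Name": "nomad.client.allocated.cpu", "Value": 500, "Labels": {"node_class": "none"}}
  ]
}
""")

if __name__ == "__main__":
    runtime = select_gauges(SAMPLE_PAYLOAD, "nomad.runtime.")
    for name, value in sorted(runtime.items()):
        print(f"{name} = {value}")
```

In practice you would likely send these metrics to a dedicated monitoring system rather than poll them by hand. The sketch only shows how the runtime gauges can be separated from client gauges for a quick look.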
Resource management
Efficiently manage your Nomad infrastructure, scaling, and performance.
- Use Nomad Bench to run test scenarios to collect metrics and data from Nomad clusters running at scale.
- Scale with Nomad Autoscaler, a horizontal application and cluster autoscaler for Nomad.
- Manage resource quotas with Sentinel.
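To make the idea of horizontal scaling concrete, the following Python sketch shows the general arithmetic behind target-based scaling: scale the current instance count by the ratio of an observed metric to its target, then clamp the result to configured bounds. This is a simplified illustration under stated assumptions, not the Nomad Autoscaler's implementation. The `desired_count` function and its parameters are hypothetical.

```python
import math


def desired_count(current: int, metric: float, target: float,
                  minimum: int, maximum: int) -> int:
    """Scale `current` by metric/target, round up, and clamp to [minimum, maximum]."""
    if target <= 0:
        raise ValueError("target must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    # Multiply before dividing so common cases stay exact in floating point.
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


if __name__ == "__main__":
    # 4 instances at 90% average utilization, aiming for 60%.
    print(desired_count(current=4, metric=90.0, target=60.0, minimum=1, maximum=10))
```

Rounding up favors spare capacity over cost, which is one possible design choice. A real autoscaler also typically applies cooldowns and evaluation windows, which this sketch omits.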
Recovery
Take regular backups so that you can recover Nomad if the cluster degrades or suffers an outage.
- Recover from a Nomad outage
- Generate a snapshot of Nomad server state for disaster recovery.
- Learn about failure recovery strategies for tasks and jobs.
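Regular snapshots can be scripted. The following Python sketch builds a timestamped `nomad operator snapshot save` command and decides which older snapshot files to prune. The file naming scheme, the retention count, and both helper functions are assumptions made for this example. The sketch only constructs the command and does not run it.

```python
from datetime import datetime, timezone
from typing import List


def snapshot_command(directory: str, now: datetime) -> List[str]:
    """Build the CLI invocation that saves a snapshot to a timestamped file."""
    # Hypothetical naming scheme: UTC timestamps sort lexicographically by age.
    filename = f"nomad-{now.strftime('%Y%m%dT%H%M%SZ')}.snap"
    return ["nomad", "operator", "snapshot", "save", f"{directory}/{filename}"]


def snapshots_to_prune(filenames: List[str], keep: int) -> List[str]:
    """Return the files to delete so that only the `keep` newest remain."""
    if keep < 0:
        raise ValueError("keep must not be negative")
    newest_first = sorted(filenames, reverse=True)
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    cmd = snapshot_command("/backups", datetime.now(timezone.utc))
    print(" ".join(cmd))
    # To run the command, pass `cmd` to subprocess.run(cmd, check=True)
    # on a host with the Nomad CLI and suitable credentials.
```

Keeping command construction separate from execution makes the schedule and retention logic easy to test without a running cluster.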
Next steps
In this document, you learned about the HashiCorp resources for implementing and running a reliable Nomad cluster. The following are implementation guides for other HashiCorp products.