Observability in Nomad
What is observability, what are its pillars, and how do they apply in a Nomad environment?
Observability is a crucial aspect of operating any system, and Nomad is no exception. It refers to the ability to understand the internal state of systems. In the context of Nomad, observability allows operators to gain insights into the health, performance, and behavior of both the Nomad cluster itself and the workloads it manages.
Tip
The three pillars of observability are:
Metrics: Quantitative measurements of system behavior over time.
Logs: Detailed records of events and actions within the system.
Tracing (sometimes overlapping with Application Performance Monitoring, or APM): The ability to follow a request or transaction through a distributed system.
In a Nomad environment, these pillars apply as follows:
Metrics
Nomad-provided metrics help monitor cluster health, resource utilization, and job performance.
Application metrics are typically emitted by the applications themselves (and are thus the responsibility of their developers), and help monitor application-specific behavior, which can be technical (number of requests per second, latency) or business-related (number of orders processed, revenue, etc.).
System metrics are provided by the underlying infrastructure (be it bare metal or virtualized) and help monitor the health and performance of the infrastructure on which Nomad runs.
Having all three in a single system and being able to dashboard and alert on them together is key to having a complete understanding. As an example, application request latency will be impacted by the underlying infrastructure's I/O performance, and seeing a spike in the former can be hard to explain without knowledge of the latter. Metrics can also be used for autoscaling to ensure that the cluster can handle the load it's under and that there is minimal waste.
Logs
Nomad-provided logs come in two types - operational logs, which provide detailed information about Nomad operations (evaluations, scheduling allocations, provisioning storage/networking, operating on templates, etc.) and audit logs which are mostly useful to keep track of human/machine actions in Nomad.
Application logs are generated by the applications themselves and help understand application-specific behavior and events.
System logs are provided by the underlying operating system and related infrastructure if it exists (e.g. hypervisor if running in VMs), and help understand the health of the infrastructure on which Nomad runs.
Having all three in a single system and being able to search and correlate them together is key to having a complete understanding. As an example, a network card driver issue (logged in the OS/hypervisor logs) can cause network-related errors in both Nomad and application logs, but neither of those would point to the root cause, and neither would an operator with access only to them.
Tracing
Tracing allows you to understand the flow of requests through applications deployed on Nomad. This is especially useful in a microservices architecture where a single request can span multiple services. By tracing the request, you can identify bottlenecks, latency issues, and other performance problems, and correlate them to metrics/logs explaining them.
Implementing a comprehensive observability strategy in Nomad is crucial for both operators and application developers - it enables proactive issue detection, faster troubleshooting, and informed capacity planning, but also helps to better understand applications' behaviour, identify bottlenecks and catch regressions.
The exact composition and configuration of observability tools and practices will depend on the specific requirements of each Nomad environment, the organization's needs and the existing tooling already in use. Thankfully, most observability agents support multiple inputs and outputs, so in theory any combination of tools is possible. The following sections provide generally applicable good practices. The key is to have a holistic view of the system, and to be able to correlate information and events across the three pillars to build a complete picture.
Observability of the Nomad cluster
Metrics to monitor
When monitoring the Nomad cluster itself, focus on the most crucial metrics as described in the Monitoring Nomad documentation. They include vital information about the health of the cluster itself, as well as scheduling performance and bottlenecks. Alongside metric collection, it is important to set up alerting on the most important and relevant metrics. There is also a complete Metrics Reference available - some metrics might not be crucial, but can still be useful for debugging or correlations.
To enable the collection of these metrics, Nomad provides the telemetry block in agent configuration. Metrics are available in a variety of formats, and are either pushed to a compatible system/agent or made available to be pulled. For push, formats such as StatsD and Datadog are supported; for pull, it's the widely adopted Prometheus format. Most modern metrics/observability agents support multiple metric formats, usually at least the Prometheus one (this applies to the OpenTelemetry Collector, Datadog Agent, Elastic Agent, Grafana Alloy, InfluxData Telegraf, New Relic, AppDynamics, Dynatrace, etc.), so practically any metrics system can be integrated with Nomad.
For example, to enable Prometheus-compatible metrics collection, you can add the following to your Nomad configuration:
telemetry {
  publish_allocation_metrics = true # publish metrics from the allocations running in Nomad
  publish_node_metrics       = true # publish metrics from Nomad nodes
  prometheus_metrics         = true # expose metrics in the Prometheus format on the /v1/metrics endpoint
}
Operational logs and Audit logs
Operational logs provide detailed information about Nomad's internal operations, while audit logs capture security-relevant events for compliance and forensics. Both should be shipped to a centralised location to ensure they survive node failures, and can be searched and ideally correlated with other logs and metrics. This central location should be a log aggregation system like ElasticSearch/OpenSearch, Loki, Splunk, Graylog, Datadog, etc.
Operational Logs
Server logs: Contain information about leadership changes, job submissions, and evaluations.
Client logs: Include details about task starts, stops, and related activities such as downloading container images, provisioning networking, attaching volumes, etc.
To configure logging, adjust the log_level and log_file settings in your Nomad agent configuration:
log_level = "INFO"
log_file = "/var/log/nomad/nomad.log"
There are also a number of settings around log rotation that might need to be adjusted, depending on the specifics of the environment such as available disk space.
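For instance, the rotation-related settings might look like the following sketch; the values are illustrative and should be tuned to your available disk space and retention needs:
log_rotate_bytes     = 104857600 # rotate once the active log file reaches ~100 MiB
log_rotate_duration  = "24h"     # rotate at least once a day
log_rotate_max_files = 7         # keep roughly a week's worth of rotated files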
Audit Logs
Audit logs capture all API requests and responses, and record authentication and authorization decisions.
Enable audit logging (a Nomad Enterprise feature) by adding an audit block to your Nomad configuration:
audit {
  enabled = true

  sink "file" {
    type   = "file"
    format = "json"
    path   = "/var/log/nomad/audit.log"
  }
}
Observability of Workloads Running in Nomad
Metrics to Monitor
By default, Nomad collects metrics about each workload it runs - how many resources it is using, how long it was throttled for trying to exceed its limits, how many times it was restarted, etc. This is useful for understanding how the workload is behaving in broad strokes, but it is not enough to understand the workload itself. For that, application-specific metrics need to be emitted and collected, and exposing them is the responsibility of the application developers.
For workloads running on Nomad, focus on these generic metrics and anything else that is relevant to your applications:
- Application performance (entirely up to application developers to expose):
Request rates and latencies
Error rates
Custom application-specific metrics such as number of transactions, unique users, third-party API failure rates, etc.
- Resource consumption:
CPU and memory usage per task
Number of allocations running/failed
Number of allocations that were OOM (out of memory) killed
Amount of time the allocation was CPU throttled
Nomad allows you to expose the second set of metrics through its telemetry system, and the first set by integrating metric collection tools with Nomad.
As an example, a Datadog Agent, Grafana Alloy, Prometheus Agent or any other agent can be run as a system job on all Nomad clients, collecting both the Nomad-provided metrics from the local Nomad agent, and the application-provided ones by either scraping the application's metrics endpoint (discoverable through service discovery or agent-native application discovery) or by having the application push them to the agent over a local port/socket. This allows for easy scaling and consumption, with the agent being centrally managed and thus easy to update/reconfigure, and the applications only needing very basic configuration.
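A minimal sketch of such a deployment, assuming a Docker-based agent image; the image name and port are illustrative, and the agent's own scrape/forwarding configuration is left out of scope:
job "metrics-agent" {
  type = "system" # one agent instance per Nomad client

  group "agent" {
    network {
      port "ingest" { static = 8125 } # host-local port applications can push metrics to
    }

    task "agent" {
      driver = "docker"

      config {
        image = "example/metrics-agent:1.0" # hypothetical image - Alloy, Datadog Agent, etc.
        ports = ["ingest"]
      }

      # The agent's own configuration (scraping the local Nomad agent's /v1/metrics
      # endpoint, discovering application endpoints, forwarding to the backend)
      # is agent-specific and omitted here.
    }
  }
}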
Logs
By default, Nomad collects stdout and stderr from running allocations (storing them in the alloc/logs folder) and makes them available via the API, UI and CLI. This is useful for live debugging and troubleshooting, but it is not enough for a production system because the logs are not persisted (they will at some point be garbage collected) and are not really searchable, nor correlatable with other allocations. To have complete log management, you need to collect these logs and ship them to a centralized log aggregation system like ElasticSearch/OpenSearch, Loki, Splunk, Graylog, Datadog, etc. As with everything else, ideally this would be a place where Nomad server and client logs, as well as OS logs, are also shipped, and where they can be correlated with each other and everything else (metrics, traces).
There are a number of different ways to collect allocation logs from Nomad, and the most appropriate one will depend on the log aggregation system in use, the agent intermediaries, and most importantly, the task driver in use (e.g. Docker, Java, exec). Some methods work in all scenarios, while others are task driver specific.
Sidecar
Since each allocation has access to all of its logs in the alloc/logs folder, a sidecar task can be run in the same task group as the main workload to collect the logs and ship them to the log aggregation system (see the sketch below). This can be done with a sidecar container that runs a log shipper like Filebeat, Fluentd, Logstash, Vector, Promtail, etc. It allows custom logic such as sampling (only send 50% of the logs), filtering (only error-level logs) and transformations (remove labels/tags XYZ, add custom metadata such as node name, version, etc.). Also, since the sidecar shares the same environment, it has access to all the same metadata (task/task group name, env variables, cloud metadata, etc.), which can be used to enrich logs. The main downside of this approach is that it is wasteful on resources (each allocation will have its own logging agent), and that the configuration is per task group, making it hard to scale without an advanced deployment flow implementing Nomad Pack and Pack dependencies or equivalent.
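A minimal sketch of the pattern, assuming a Docker-based workload and Vector as the shipper; the image names are illustrative and the shipper's own configuration is omitted:
job "web" {
  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "example/app:1.0" # hypothetical application image
      }
    }

    # Log-shipping sidecar; it shares the allocation directory with "app"
    task "log-shipper" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image = "timberio/vector:0.39.0-alpine" # or Filebeat, Fluent Bit, etc.
      }

      # The shipper is configured (configuration omitted) to tail
      # ${NOMAD_ALLOC_DIR}/logs/* and forward to the aggregation backend.
    }
  }
}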
Task driver logging
Some Task Drivers, most notably Docker (list of available log destinations out of the box here), support various logging options and logging plugins. They can be configured to directly send logs to a supported backend such as AWS CloudWatch Logs, Splunk, syslog, etc. (see the sketch below). This is one of the most efficient ways to collect logs, as it doesn't require an intermediary agent. However, it has a number of downsides - it relies on the Task Driver implementing the feature and thus won't work for heterogeneous workloads; logs are shipped as-is, so it's not possible to do any sampling or transformations client-side; and the configuration is per task, making it hard to scale without an advanced deployment flow implementing Nomad Pack and Pack dependencies or equivalent.
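For instance, a sketch of a Docker task shipping its logs straight to Fluentd via the Docker fluentd logging driver; the image, address and tag are illustrative:
task "app" {
  driver = "docker"

  config {
    image = "example/app:1.0" # hypothetical image

    logging {
      type = "fluentd" # any logging driver supported by the Docker daemon
      config {
        fluentd-address = "localhost:24224" # illustrative Fluentd endpoint
        tag             = "nomad.app"
      }
    }
  }
}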
Agent integrating with Task Driver
Most logging agents come with direct support for some of the most popular Task Drivers, such as Docker. This allows the agent to collect logs from the Docker daemon (which is orchestrated by Nomad) directly and ship them to the log aggregation system. This is a good middle ground between the two previous approaches - it is per Nomad client (the agent can easily be run as a system job) and thus more efficient, doesn't require a sidecar, allows for custom transformations/filters/sampling, and can be scaled easily. The downside is that it requires the agent to be able to integrate with the Task Driver, and thus it might not work for all Task Drivers. For Docker in particular, agents rely on Docker labels to enrich logs with metadata (to know which container the logs are from), and by default those only include the allocation ID; extra_labels need to be configured to expose more metadata, as shown below.
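A sketch of the relevant client-side Docker plugin configuration; the label set shown is illustrative:
plugin "docker" {
  config {
    # Expose additional Nomad metadata as Docker labels so logging agents
    # can enrich the logs they collect from the Docker daemon
    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
  }
}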
Agent integrating with Nomad
The most advanced approach is to have the agent (usually running as a system job) integrate with Nomad directly and collect logs from the Nomad API. This allows for the most flexibility, as the agent can collect logs from any allocation, regardless of the Task Driver, do any transformations/filters required, and can be configured to collect logs from all allocations, or only those that match certain criteria. The downside is that it requires the agent to be able to integrate with the Nomad API, which limits the agents that can be used. At the time of writing, the only agent with direct support for Nomad is Filebeat; for other agents, there are community utilities to bridge the gap (such as vector-logger for Vector, and nomad_follower for others).
Agent collecting logs from the filesystem
As all the allocation folders are stored locally on the Nomad client in the data_dir (typically under <data_dir>/alloc/<alloc_id>/alloc/logs), any logging agent can also collect the logs directly from the host filesystem. This approach usually isn't the best because the only metadata available is the allocation ID, which usually isn't enough, and enriching with more relevant information such as the task/task group name isn't easy to achieve.
Tracing
Implementing distributed tracing for microservices running on Nomad provides end-to-end visibility into request flows, service dependencies, and performance bottlenecks. Tracing is crucial for distributed environments to have a clear picture of how everything works. Nomad itself does not provide tracing, but it allows you to run popular tracing systems alongside your workloads.
Similarly to metrics and log collection, an agent (ideally the same one for all three) can be run as a system job, and applications can be pointed to send traces to it. A common pattern is to have the agent listen on a local (to the host) port/socket, and have all applications on that host send traces to it. The agent then collects them, performs any necessary sampling, transformations and filters, and ships them to the tracing backend. This allows for easy scaling and consumption, with the agent being centrally managed and thus easy to update/reconfigure, and the applications only needing very basic configuration - see the sketch below.
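A minimal sketch of such a trace-collecting system job, assuming an OpenTelemetry Collector image whose own pipeline configuration (receivers, sampling, exporters) is provided separately; the image, tag and port are illustrative:
job "otel-collector" {
  type = "system" # one collector per Nomad client

  group "collector" {
    network {
      port "otlp_grpc" { static = 4317 } # applications send OTLP traces to this host-local port
    }

    task "collector" {
      driver = "docker"

      config {
        image = "otel/opentelemetry-collector-contrib:0.100.0" # assumed image/tag
        ports = ["otlp_grpc"]
      }
    }
  }
}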
Observability of the underlying infrastructure
Monitoring the underlying infrastructure is crucial for maintaining a healthy Nomad cluster:
- Host-level metrics:
CPU, memory and disk usage
File descriptors and open connections
- Network performance:
Throughput and packet loss
- Storage metrics:
IOPS and latency
Read/write throughput
- Virtualisation layer metrics (if applicable):
Hypervisor metrics like CPU ready / CPU contention, memory ballooning, etc.
When running Nomad on bare metal, host metric collection can be orchestrated using Nomad itself (via an agent running as a system job, as sketched below) or via the underlying host OS's service management (such as systemd). When running Nomad on virtualised infrastructure, the virtualisation layer metrics can be collected using the virtualisation provider's APIs.
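For example, a minimal sketch of running Prometheus node_exporter on every client as a system job; the image tag is assumed, and the root filesystem bind mount shown here requires volumes to be enabled in the Docker plugin configuration:
job "node-exporter" {
  type = "system" # runs on every Nomad client

  group "exporter" {
    network {
      port "metrics" { static = 9100 }
    }

    task "node-exporter" {
      driver = "docker"

      config {
        image   = "quay.io/prometheus/node-exporter:v1.8.1" # assumed tag
        args    = ["--path.rootfs=/host"]
        ports   = ["metrics"]
        volumes = ["/:/host:ro"] # requires volumes to be enabled in the docker plugin config
      }
    }
  }
}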
Who watches the watchmen
If you are self-hosting the observability stack on top of Nomad, it is crucial to monitor the monitoring system itself, so that a Nomad cluster outage does not take out your ability to notice and debug it. This can be done by having a separate Nomad cluster running the monitoring stack, or by using a SaaS monitoring system for high-level health checks. This is especially important for the alerting system - if it goes down, you won't know if anything else does. "Dead man's switches" (an alert is raised if an "I'm alive" signal stops arriving from the monitoring system) are a good way to ensure that you are aware of the monitoring system being down.
Benchmarking Nomad
Benchmarking Nomad is a crucial part of setting up observability, and evaluating your setup. It allows you to understand the performance limits of your Nomad cluster and identify bottlenecks before they impact production workloads. HashiCorp provides a Nomad Benchmarking project to help you get started with Nomad cluster performance testing.
Refer to this blog post from the Nomad team on the project and their results.
Reference/Useful links:
Documentation: Nomad metrics reference
Documentation: Monitoring Nomad
Blog: Logging on Nomad with Vector
Blog: Logging on Nomad and log aggregation with Loki
OpenTelemetry collector running in Nomad example
Running the OpenTelemetry demo app on HashiCorp Nomad