Metrics

The Nomad agent collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained for one minute.

This data can be accessed via an HTTP endpoint or by sending a signal to the Nomad process.

As of Nomad version 0.7, this data is available via HTTP at /metrics. See Metrics for more information.

To view this data by signal, send the Nomad process USR1 on Unix or BREAK on Windows. Once Nomad receives the signal, it dumps the current telemetry information to the agent's stderr.
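As a minimal sketch of the signal route on Unix, the snippet below sends USR1 to a process by PID using Python's standard library. The function name `request_telemetry_dump` is hypothetical, and the PID must be looked up separately. Nothing comes back to the caller, because the dump is written to the agent's own stderr.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send SIGUSR1 to the process with the given PID (Unix only).

    The target process is expected to write its telemetry dump to its
    own stderr, so this function returns nothing.
    """
    os.kill(pid, signal.SIGUSR1)
```

To read the result, watch wherever the agent's stderr is captured, such as a log file or a service manager's journal.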

This telemetry information can be used for debugging or otherwise getting a better view of what Nomad is doing.

Telemetry information can be streamed to statsite, statsd, or both by providing the appropriate configuration options.
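For readers who have not worked with statsd, the sketch below parses one line of the plain statsd text format, `<name>:<value>|<type>`, which is what a statsd-compatible sink receives. The `StatsdSample` type and `parse_statsd_line` function are hypothetical helpers. The exact metric names and prefixes an agent emits depend on its configuration and are not guaranteed to match these examples.

```python
from typing import NamedTuple


class StatsdSample(NamedTuple):
    name: str
    value: float
    kind: str  # statsd type code, e.g. "g" (gauge), "c" (counter), "ms" (timer)


def parse_statsd_line(line: str) -> StatsdSample:
    """Parse one statsd line of the form '<name>:<value>|<type>'.

    Any extra '|'-separated fields (such as a sample rate) are ignored.
    """
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a statsd line: {line!r}")
    return StatsdSample(name, float(fields[0]), fields[1])
```

For example, `parse_statsd_line("nomad.runtime.num_goroutines:56|g")` yields a gauge sample with value 56.0.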

To configure the telemetry output please see the agent configuration.

Below is sample output of a telemetry dump:

[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
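The dump format above is regular enough to parse with a short script. The following is a sketch written against the sample lines only, not against a format specification. `DumpEntry` and `parse_dump_line` are hypothetical names, and reading the `G`, `C`, and `S` markers as gauge, counter, and sample is an assumption based on how the lines are shaped.

```python
import re
from typing import Dict, NamedTuple

# Matches: [<timestamp>][<G|C|S>] '<metric name>': <rest of line>
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[(?P<kind>[GCS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# Matches "Key: number" pairs such as "Count: 3" or "Mean: 0.037"
FIELD_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


class DumpEntry(NamedTuple):
    timestamp: str
    kind: str  # "G", "C" or "S", exactly as printed in the dump
    name: str
    fields: Dict[str, float]


def parse_dump_line(line: str) -> DumpEntry:
    """Parse a single line of the telemetry dump shown above."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized dump line: {line!r}")
    rest = match["rest"]
    if match["kind"] == "G":
        # Gauge lines carry a single bare number.
        fields = {"Value": float(rest)}
    else:
        # Counter and sample lines carry "Key: number" pairs.
        fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    return DumpEntry(match["timestamp"], match["kind"], match["name"], fields)
```

Note that lines with a single observation print only `Count` and `Sum`, so callers should not assume that `Min`, `Mean`, `Max`, or `Stddev` are always present.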

Key Metrics

When telemetry is streamed to statsite or statsd, `interval` is their flush interval. Otherwise, when metrics are retrieved using the signals described above, the interval can be assumed to be 10 seconds.

Metric	Description	Unit	Type
`nomad.runtime.num_goroutines`	Number of goroutines and general load pressure indicator	# of goroutines	Gauge
`nomad.runtime.alloc_bytes`	Memory utilization	# of bytes	Gauge
`nomad.runtime.heap_objects`	Number of objects on the heap. General memory pressure indicator	# of heap objects	Gauge
`nomad.raft.apply`	Number of Raft transactions	Raft transactions / `interval`	Counter
`nomad.raft.replication.appendEntries`	Raft transaction commit time	ms / Raft Log Append	Timer
`nomad.raft.leader.lastContact`	Time since last contact to leader. General indicator of Raft latency	ms / Leader Contact	Timer
`nomad.broker.total_ready`	Number of evaluations ready to be processed	# of evaluations	Gauge
`nomad.broker.total_unacked`	Evaluations dispatched for processing but incomplete	# of evaluations	Gauge
`nomad.broker.total_blocked`	Evaluations that are blocked until an existing evaluation for the same job completes	# of evaluations	Gauge
`nomad.plan.queue_depth`	Number of scheduler Plans waiting to be evaluated	# of plans	Gauge
`nomad.plan.submit`	Time to submit a scheduler Plan. Higher values cause lower scheduling throughput	ms / Plan Submit	Timer
`nomad.plan.evaluate`	Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to `nomad.plan.submit` but does not include RPC time or time in the Plan Queue	ms / Plan Evaluation	Timer
`nomad.worker.invoke_scheduler.<type>`	Time to run the scheduler of the given type	ms / Scheduler Run	Timer
`nomad.worker.wait_for_index`	Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput	ms / Raft Index Wait	Timer
`nomad.heartbeat.active`	Number of active heartbeat timers. Each timer represents a Nomad Client connection	# of heartbeat timers	Gauge
`nomad.heartbeat.invalidate`	The length of time it takes to invalidate a Nomad Client due to failed heartbeats	ms / Heartbeat Invalidation	Timer
`nomad.rpc.query`	Number of RPC queries	RPC Queries / `interval`	Counter
`nomad.rpc.request`	Number of RPC requests being handled	RPC Requests / `interval`	Counter
`nomad.rpc.request_error`	Number of RPC requests being handled that result in an error	RPC Errors / `interval`	Counter
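Counter units in the table above are expressed per `interval`, so comparing values collected with different intervals requires normalizing them. A minimal sketch, assuming the 10-second default that applies to signal dumps (the helper name `per_second_rate` is hypothetical):

```python
def per_second_rate(count: float, interval_seconds: float = 10.0) -> float:
    """Convert a per-interval counter value into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return count / interval_seconds
```

When streaming to statsite or statsd, pass that sink's flush interval as `interval_seconds` instead of relying on the default.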

Client Metrics

The Nomad client emits metrics related to the resource usage of the node itself and of the allocations and tasks running on it. Operators have to explicitly turn on publishing of host and allocation metrics by setting publish_allocation_metrics and publish_node_metrics to true.

By default the collection interval is 1 second, but it can be changed by setting the collection_interval key in the telemetry configuration block.

Please see the agent configuration page for more details.

As of Nomad 0.9, Nomad emits additional labels for parameterized and periodic jobs. The parent job ID is emitted as a new label, parent_id. The labels dispatch_id and periodic_id are also emitted, containing the ID of the specific invocation of the parameterized or periodic job respectively. For example, a dispatch job with the ID myjob/dispatch-1312323423423 will have the following labels:

Label	Value
job	`myjob/dispatch-1312323423423`
parent_id	myjob
dispatch_id	1312323423423
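The relationship between the job ID and these labels can be sketched as a simple string split. This mirrors the example in the table and is not Nomad's actual implementation. The function name `invocation_labels` is hypothetical, and the `/periodic-` separator is an assumption made by analogy with `/dispatch-`.

```python
from typing import Dict


def invocation_labels(job_id: str) -> Dict[str, str]:
    """Derive parent_id and dispatch_id/periodic_id labels from a job ID."""
    labels = {"job": job_id}
    markers = (("/dispatch-", "dispatch_id"), ("/periodic-", "periodic_id"))
    for marker, label in markers:
        # rpartition returns an empty separator when the marker is absent.
        parent, found, suffix = job_id.rpartition(marker)
        if found:
            labels["parent_id"] = parent
            labels[label] = suffix
            break
    return labels
```

A job ID without either marker, such as `myjob`, yields only the `job` label.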

Host Metrics (post Nomad version 0.7)

Starting in version 0.7, Nomad emits tagged metrics in the format below:

Metric	Description	Unit	Type	Labels
`nomad.client.allocated.cpu`	Total amount of CPU shares the scheduler has allocated to tasks	MHz	Gauge	node_id, datacenter
`nomad.client.unallocated.cpu`	Total amount of CPU shares free for the scheduler to allocate to tasks	MHz	Gauge	node_id, datacenter
`nomad.client.allocated.memory`	Total amount of memory the scheduler has allocated to tasks	Megabytes	Gauge	node_id, datacenter
`nomad.client.unallocated.memory`	Total amount of memory free for the scheduler to allocate to tasks	Megabytes	Gauge	node_id, datacenter
`nomad.client.allocated.disk`	Total amount of disk space the scheduler has allocated to tasks	Megabytes	Gauge	node_id, datacenter
`nomad.client.unallocated.disk`	Total amount of disk space free for the scheduler to allocate to tasks	Megabytes	Gauge	node_id, datacenter
`nomad.client.allocated.network`	Total amount of bandwidth the scheduler has allocated to tasks on the given device	Megabits	Gauge	node_id, datacenter, device
`nomad.client.unallocated.network`	Total amount of bandwidth free for the scheduler to allocate to tasks on the given device	Megabits	Gauge	node_id, datacenter, device
`nomad.client.host.memory.total`	Total amount of physical memory on the node	Bytes	Gauge	node_id, datacenter
`nomad.client.host.memory.available`	Total amount of memory available to processes which includes free and cached memory	Bytes	Gauge	node_id, datacenter
`nomad.client.host.memory.used`	Amount of memory used by processes	Bytes	Gauge	node_id, datacenter
`nomad.client.host.memory.free`	Amount of memory which is free	Bytes	Gauge	node_id, datacenter
`nomad.client.uptime`	Uptime of the host running the Nomad client	Seconds	Gauge	node_id, datacenter
`nomad.client.host.cpu.total`	Total CPU utilization	Percentage	Gauge	node_id, datacenter, cpu
`nomad.client.host.cpu.user`	CPU utilization in the user space	Percentage	Gauge	node_id, datacenter, cpu
`nomad.client.host.cpu.system`	CPU utilization in the system space	Percentage	Gauge	node_id, datacenter, cpu
`nomad.client.host.cpu.idle`	Idle time spent by the CPU	Percentage	Gauge	node_id, datacenter, cpu
`nomad.client.host.disk.size`	Total size of the device	Bytes	Gauge	node_id, datacenter, disk
`nomad.client.host.disk.used`	Amount of space which has been used	Bytes	Gauge	node_id, datacenter, disk
`nomad.client.host.disk.available`	Amount of space which is available	Bytes	Gauge	node_id, datacenter, disk
`nomad.client.host.disk.used_percent`	Percentage of disk space used	Percentage	Gauge	node_id, datacenter, disk
`nomad.client.host.disk.inodes_percent`	Disk space consumed by the inodes	Percentage	Gauge	node_id, datacenter, disk
`nomad.client.allocs.start`	Number of allocations starting	Integer	Counter	node_id, job, task_group
`nomad.client.allocs.running`	Number of allocations starting to run	Integer	Counter	node_id, job, task_group
`nomad.client.allocs.failed`	Number of allocations failing	Integer	Counter	node_id, job, task_group
`nomad.client.allocs.restart`	Number of allocations restarting	Integer	Counter	node_id, job, task_group
`nomad.client.allocs.complete`	Number of allocations completing	Integer	Counter	node_id, job, task_group
`nomad.client.allocs.destroy`	Number of allocations being destroyed	Integer	Counter	node_id, job, task_group

Nomad 0.9 adds an additional node_class label from the client's NodeClass attribute. This label is set to the string "none" if empty.

Host Metrics (deprecated post Nomad 0.7)

The metrics below are emitted by Nomad in versions prior to 0.7. From 0.7 onward they can still be emitted in this format, alongside the new format detailed above, but any new metrics will only be available in the new format.

Metric	Description	Unit	Type
`nomad.client.allocated.cpu.<HostID>`	Total amount of CPU shares the scheduler has allocated to tasks	MHz	Gauge
`nomad.client.unallocated.cpu.<HostID>`	Total amount of CPU shares free for the scheduler to allocate to tasks	MHz	Gauge
`nomad.client.allocated.memory.<HostID>`	Total amount of memory the scheduler has allocated to tasks	Megabytes	Gauge
`nomad.client.unallocated.memory.<HostID>`	Total amount of memory free for the scheduler to allocate to tasks	Megabytes	Gauge
`nomad.client.allocated.disk.<HostID>`	Total amount of disk space the scheduler has allocated to tasks	Megabytes	Gauge
`nomad.client.unallocated.disk.<HostID>`	Total amount of disk space free for the scheduler to allocate to tasks	Megabytes	Gauge
`nomad.client.allocated.network.<Device-Name>.<HostID>`	Total amount of bandwidth the scheduler has allocated to tasks on the given device	Megabits	Gauge
`nomad.client.unallocated.network.<Device-Name>.<HostID>`	Total amount of bandwidth free for the scheduler to allocate to tasks on the given device	Megabits	Gauge
`nomad.client.host.memory.<HostID>.total`	Total amount of physical memory on the node	Bytes	Gauge
`nomad.client.host.memory.<HostID>.available`	Total amount of memory available to processes which includes free and cached memory	Bytes	Gauge
`nomad.client.host.memory.<HostID>.used`	Amount of memory used by processes	Bytes	Gauge
`nomad.client.host.memory.<HostID>.free`	Amount of memory which is free	Bytes	Gauge
`nomad.client.uptime.<HostID>`	Uptime of the host running the Nomad client	Seconds	Gauge
`nomad.client.host.cpu.<HostID>.<CPU-Core>.total`	Total CPU utilization	Percentage	Gauge
`nomad.client.host.cpu.<HostID>.<CPU-Core>.user`	CPU utilization in the user space	Percentage	Gauge
`nomad.client.host.cpu.<HostID>.<CPU-Core>.system`	CPU utilization in the system space	Percentage	Gauge
`nomad.client.host.cpu.<HostID>.<CPU-Core>.idle`	Idle time spent by the CPU	Percentage	Gauge
`nomad.client.host.disk.<HostID>.<Device-Name>.size`	Total size of the device	Bytes	Gauge
`nomad.client.host.disk.<HostID>.<Device-Name>.used`	Amount of space which has been used	Bytes	Gauge
`nomad.client.host.disk.<HostID>.<Device-Name>.available`	Amount of space which is available	Bytes	Gauge
`nomad.client.host.disk.<HostID>.<Device-Name>.used_percent`	Percentage of disk space used	Percentage	Gauge
`nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent`	Disk space consumed by the inodes	Percentage	Gauge

Allocation Metrics

Metric	Description	Unit	Type
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss`	Amount of RSS memory consumed by the task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache`	Amount of memory cached by the task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap`	Amount of memory swapped by the task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage`	Maximum amount of memory ever used by the task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage`	Amount of memory used by the kernel for this task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage`	Maximum amount of memory ever used by the kernel for this task	Bytes	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent`	Total CPU resources consumed by the task across all cores	Percentage	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system`	Total CPU resources consumed by the task in the system space	Percentage	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user`	Total CPU resources consumed by the task in the user space	Percentage	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time`	Total time that the task was throttled	Nanoseconds	Gauge
`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks`	CPU ticks consumed by the process in the last collection interval	Integer	Gauge
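Because the job, task group, allocation ID, and task are embedded in the metric name, consumers typically need to split the name back into its parts. The sketch below does this under the assumption that none of those components contains a dot. `AllocMetric` and `parse_alloc_metric` are hypothetical names.

```python
from typing import NamedTuple

PREFIX = "nomad.client.allocs."


class AllocMetric(NamedTuple):
    job: str
    task_group: str
    alloc_id: str
    task: str
    measurement: str  # e.g. "memory.rss" or "cpu.total_percent"


def parse_alloc_metric(name: str) -> AllocMetric:
    """Split an allocation metric name into its embedded components.

    Assumes job, task group, allocation ID and task contain no dots.
    """
    if not name.startswith(PREFIX):
        raise ValueError(f"not an allocation metric: {name!r}")
    parts = name[len(PREFIX):].split(".")
    # Four identifying components plus a two-part measurement.
    if len(parts) != 6:
        raise ValueError(f"unexpected number of components: {name!r}")
    job, task_group, alloc_id, task = parts[:4]
    return AllocMetric(job, task_group, alloc_id, task, ".".join(parts[4:]))
```

Names that do not split into exactly six components raise `ValueError` rather than being guessed at, since a dotted job or task name makes the split ambiguous.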

Job Summary Metrics

Job summary metrics are emitted by the Nomad leader server.

Metric	Description	Unit	Type	Labels
`nomad.job_summary.queued`	Number of queued allocations for a job	Integer	Gauge	job, task_group
`nomad.job_summary.complete`	Number of complete allocations for a job	Integer	Gauge	job, task_group
`nomad.job_summary.failed`	Number of failed allocations for a job	Integer	Gauge	job, task_group
`nomad.job_summary.running`	Number of running allocations for a job	Integer	Gauge	job, task_group
`nomad.job_summary.starting`	Number of starting allocations for a job	Integer	Gauge	job, task_group
`nomad.job_summary.lost`	Number of lost allocations for a job	Integer	Gauge	job, task_group

Job Status Metrics

Job status metrics are emitted by the Nomad leader server.

Metric	Description	Unit	Type
`nomad.job_status.pending`	Number of pending jobs	Integer	Gauge
`nomad.job_status.running`	Number of running jobs	Integer	Gauge
`nomad.job_status.dead`	Number of dead jobs	Integer	Gauge

Metric Types

Type	Description	Quantiles
Gauge	Gauge types report an absolute number at the end of the aggregation interval	false
Counter	Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero	true
Timer	Timers measure the time to complete a task and include quantiles, means, standard deviation, etc. per interval	true
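The difference between the three types is easiest to see in how each behaves at the end of an aggregation interval. The following toy aggregator is an illustration of the descriptions in the table, not the code Nomad uses. The class `IntervalAggregator` and its method names are hypothetical, and quantiles and standard deviation are left out for brevity.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


class IntervalAggregator:
    """Toy model of gauge, counter and timer behavior over one interval."""

    def __init__(self) -> None:
        self.gauges: Dict[str, float] = {}
        self.counters: Dict[str, float] = defaultdict(float)
        self.timers: Dict[str, List[float]] = defaultdict(list)

    def set_gauge(self, name: str, value: float) -> None:
        # A gauge keeps only the most recent absolute value.
        self.gauges[name] = value

    def incr_counter(self, name: str, amount: float = 1.0) -> None:
        # A counter accumulates within the interval.
        self.counters[name] += amount

    def add_sample(self, name: str, millis: float) -> None:
        # A timer records every observation made in the interval.
        self.timers[name].append(millis)

    def flush(self) -> Dict[str, dict]:
        """Report the interval, then reset counters and drop timer samples."""
        report = {
            "gauges": dict(self.gauges),
            "counters": dict(self.counters),
            "timers": {
                name: {
                    "count": len(values),
                    "min": min(values),
                    "mean": statistics.mean(values),
                    "max": max(values),
                }
                for name, values in self.timers.items()
            },
        }
        for name in self.counters:
            self.counters[name] = 0.0
        self.timers.clear()
        return report
```

Calling `flush()` twice in a row shows the contrast: gauges repeat their last value, counters read zero, and timers report nothing until new samples arrive.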

Tagged Metrics

As of version 0.7, Nomad emits metrics in a tagged format. Each metric can carry more than one tag, which makes it possible to match on a tag value, such as a particular datacenter, and return all metrics that carry it. Nomad supports labels for namespaces as well.
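Matching on tags can be sketched as a filter over labeled data points. The record shape used here (dictionaries with `name`, `value`, and `labels` keys) is a hypothetical in-memory representation chosen for illustration, not a format defined by Nomad. `match_labels` is likewise a made-up helper.

```python
from typing import Dict, Iterable, List

Metric = Dict[str, object]


def match_labels(metrics: Iterable[Metric], **wanted: str) -> List[Metric]:
    """Return the metrics whose labels contain every wanted key/value pair."""
    matched = []
    for metric in metrics:
        labels = metric.get("labels") or {}
        if all(labels.get(key) == value for key, value in wanted.items()):
            matched.append(metric)
    return matched
```

For instance, `match_labels(points, datacenter="dc1")` keeps every data point tagged with that datacenter, regardless of which metric it belongs to.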