Nomad job schedulers

This page provides conceptual information about Nomad service, batch, system, and system batch job schedulers.

Service

The service scheduler is designed for scheduling long lived services that should never go down. As such, the service scheduler ranks a large portion of the nodes that meet the job's constraints and selects the optimal node to place a task group on. The service scheduler uses a best fit scoring algorithm influenced by Google's work on Borg. Ranking this larger set of candidate nodes increases scheduling time but provides greater guarantees about the optimality of a job placement, which given the service workload is highly desirable.

Service jobs are intended to run until explicitly stopped by an operator. If a service task exits it is considered a failure and handled according to the job's restart and reschedule blocks.

Batch

Batch jobs are much less sensitive to short term performance fluctuations and are short lived, finishing in a few minutes to a few days. Although the batch scheduler is very similar to the service scheduler, it makes certain optimizations for the batch workload. The main distinction is that after finding the set of nodes that meet the job's constraints it uses the power of two choices described in Berkeley's Sparrow scheduler to limit the number of nodes that are ranked.

Batch jobs are intended to run until they exit successfully. Batch tasks that exit with an error are handled according to the job's restart and reschedule blocks.

System

The system scheduler is used to register jobs that should be run on all clients that meet the job's constraints. The system scheduler is also invoked when clients join the cluster or transition into the ready state. This means that all registered system jobs will be re-evaluated and their tasks will be placed on the newly available nodes if the constraints are met.

This scheduler type is extremely useful for deploying and managing tasks that should be present on every node in the cluster. Since these tasks are managed by Nomad, they can take advantage of job updating, service discovery, and more.

The system scheduler will preempt eligible lower priority tasks running on a node if there isn't enough capacity to place a system job. See preemption for details on how tasks that get preempted are chosen.

Systems jobs are intended to run until explicitly stopped either by an operator or preemption. If a system task exits it is considered a failure and handled according to the job's restart block; system jobs do not have rescheduling.

When used with node pools, system jobs run on all nodes of the pool used by the job. The built-in node pool all allows placing allocations on all clients in the cluster.

System Batch

The sysbatch scheduler is used to register jobs that should be run to completion on all clients that meet the job's constraints. The sysbatch scheduler will schedule jobs similarly to the system scheduler, but like a batch job once a task exits successfully it is not restarted on that client.

This scheduler type is useful for issuing "one off" commands to be run on every node in the cluster. Sysbatch jobs can also be created as periodic and parameterized jobs. Since these tasks are managed by Nomad, they can take advantage of job updating, service discovery, monitoring, and more.

The sysbatch scheduler will preempt lower priority tasks running on a node if there is not enough capacity to place the job. See preemption details on how tasks that get preempted are chosen.

Sysbatch jobs are intended to run until successful completion, explicitly stopped by an operator, or evicted through preemption. Sysbatch tasks that exit with an error are handled according to the job's restart block.

Like the batch scheduler, task groups in system batch jobs may have a count greater than 1 to control how many instances are run. Instances that cannot be immediately placed will be scheduled when resources become available, potentially on a node that has already run another instance of the same job.