# Scheduling

## Scheduling algorithms
Nomad supports two scheduling algorithms.
- Bin packing - Packs as many workloads as possible onto the fewest clients to maximize resource usage. It is ideal for cloud environments where infrastructure billing is time and resource based and capacity can be scaled in and out quickly.
- Spread - Distributes allocations evenly across all available clients to reduce density and potential resource contention. It is suitable for environments where clients are pre-provisioned and scale slowly, such as on-premises deployments.
These algorithms can be configured as a cluster-wide default or per node pool.

It is important to distinguish between the `spread` block in a job's task group and the spread algorithm at the cluster level. The task group's `spread` block customizes how a single job's allocations are spread across a target attribute, which is client-based by default. The cluster-level spread algorithm controls how the scheduler places jobs across the cluster. The default behavior is to bin pack jobs together while spreading the allocations of each job.
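For example, a task group can spread its allocations across datacenters with a `spread` block, independent of the cluster-level algorithm (the job and image names below are illustrative):

```hcl
job "example" {
  group "web" {
    count = 6

    # Spread this group's allocations across datacenters,
    # regardless of the cluster-level scheduler algorithm.
    spread {
      attribute = "${node.datacenter}"
    }

    task "server" {
      driver = "docker"
      config {
        image = "nginx:1.25"
      }
    }
  }
}
```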
Configuring scheduling algorithms
To set the scheduling algorithm at the cluster level, use the agent configuration file, the CLI, or the API. If the cluster has not been bootstrapped yet, you can set it by adding a `default_scheduler_config` section to your agent configuration file.

```hcl
server {
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}
```
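If the cluster is already bootstrapped, the same setting can be changed at runtime with the CLI:

```shell
# Update the cluster-wide scheduler algorithm on a running cluster
nomad operator scheduler set-config -scheduler-algorithm=spread

# Verify the change
nomad operator scheduler get-config
```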
### Node pool level configuration
Node pools in Nomad Enterprise allow you to customize the scheduler algorithm per node pool. This is useful for mixed environments where different node types require different scheduling strategies.
#### Configuring a node pool
- Ensure your `NOMAD_ADDR` environment variable points to the correct server address, and log in if needed:

```shell
$ echo $NOMAD_ADDR
https://<correct IP or hostname>:4646
$ nomad login
```
- Create a configuration file, for example `nodepools.nomad.hcl`:

```hcl
node_pool {
  name = "cloud-pool"

  scheduler_config {
    scheduler_algorithm = "binpack"
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    scheduler_algorithm = "spread"
  }
}
```
- Run [nomad node pool apply](https://developer.hashicorp.com/nomad/docs/commands/node-pool/apply) to apply the configuration to the cluster:

```shell
$ nomad node pool apply nodepools.nomad.hcl
```

- Add the `node_pool` parameter to the client configuration file to add the client to the node pool:
```hcl
# client.hcl
client {
  node_pool = "cloud-pool" # change the name to suit your naming convention
}
```
- Restart the Nomad agent, e.g. `systemctl restart nomad`.
- Jobs can now opt in to a node pool by specifying the `node_pool` parameter.
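As a minimal sketch (the job and image names are illustrative), a job opts in to a node pool like this:

```hcl
job "billing-api" {
  # Place this job's allocations only on clients in "cloud-pool",
  # which schedules with the binpack algorithm configured above.
  node_pool = "cloud-pool"

  group "api" {
    task "server" {
      driver = "docker"
      config {
        image = "billing-api:1.0"
      }
    }
  }
}
```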
For production, we recommend keeping your node pool configurations in version control and applying them as part of your CI/CD pipeline.
**Note:** Node pool configurations override the default scheduler configuration. For example, if the cluster default is set to `binpack` and a node pool is configured with `spread`, any workloads placed on that node pool will use `spread`.

## Preemption configuration
Preemption allows Nomad to evict lower-priority allocations to make room for higher-priority ones when resources are scarce. It ensures that critical workloads can acquire the resources they need even when the cluster is under high utilization. This feature is enabled by default for `system` jobs.

We recommend enabling preemption on production clusters for all workload types, especially if critical tier 1 workloads share hosts with lower-tier workloads. This ensures that tier 1 workloads always receive resources first, at the expense of potential downtime for the lower-tier workloads.
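Preemption decisions are driven by each job's `priority` (1 to 100, default 50). A minimal sketch of a high-priority job (names and values are illustrative):

```hcl
job "payments" {
  # Higher-priority jobs may preempt lower-priority allocations
  # when the cluster is resource constrained.
  priority = 90

  group "api" {
    task "server" {
      driver = "docker"
      config {
        image = "payments-api:2.3"
      }
    }
  }
}
```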
If the cluster has not been bootstrapped yet, you can enable preemption by adding a `default_scheduler_config` section to your agent configuration file.

```hcl
server {
  default_scheduler_config {
    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true
    }
  }
}
```
#### CLI
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-batch-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-batch-scheduler=true`</a>
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-service-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-service-scheduler=true`</a>
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-sysbatch-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-sysbatch-scheduler=true`</a>
#### API
<a href="https://developer.hashicorp.com/nomad/api-docs/operator/scheduler#update-scheduler-configuration" target="_blank">`/v1/operator/scheduler/configuration`</a>
For additional details visit the <a href="https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption" target="_blank">Preemption</a> documentation page.
## Memory Oversubscription
Memory oversubscription is an opt-in feature which allows tasks to exceed their reserved memory limit if the client has excess memory capacity. It is recommended to enable this feature to help maximize cluster memory utilization while also allowing a margin of error in case a task has a sudden memory spike.
<Tip>This feature can be enabled globally or <a href="https://developer.hashicorp.com/nomad/docs/other-specifications/node-pool#memory_oversubscription_enabled">per node pool</a></Tip>
Currently, the ExecV2, `raw_exec`, Docker, Podman, and Java task drivers support memory oversubscription. Consult the documentation of community-supported task drivers for their memory oversubscription support.

Visit the [Oversubscribe Memory](https://developer.hashicorp.com/nomad/tutorials/advanced-scheduling/memory-oversubscription) tutorial for more information on how to configure it.
If the cluster has not been bootstrapped yet, you can enable memory oversubscription by adding a `default_scheduler_config` section to your agent configuration file.
```hcl
server {
  default_scheduler_config {
    memory_oversubscription_enabled = true
  }
}
```

#### CLI

[`nomad operator scheduler set-config -memory-oversubscription=true`](https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#memory-oversubscription)
#### API

[`/v1/operator/scheduler/configuration`](https://developer.hashicorp.com/nomad/api-docs/operator/scheduler#update-scheduler-configuration)
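On a bootstrapped cluster, the API can update the scheduler configuration directly. A minimal sketch with `curl` (the payload below shows only the field relevant here; the endpoint replaces the whole scheduler configuration, so in practice read the current configuration first and include your existing settings in the update):

```shell
# Read the current scheduler configuration
curl -s "$NOMAD_ADDR/v1/operator/scheduler/configuration"

# Enable memory oversubscription (illustrative, partial payload)
curl -s -X POST "$NOMAD_ADDR/v1/operator/scheduler/configuration" \
  --data '{"MemoryOversubscriptionEnabled": true}'
```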
#### Node pool-level configuration

```hcl
node_pool {
  name = "cloud-pool"

  scheduler_config {
    memory_oversubscription_enabled = true
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    memory_oversubscription_enabled = false
  }
}
```
#### Task configuration
Tasks must specify <a href="https://developer.hashicorp.com/nomad/docs/job-specification/resources#memory_max">`memory_max`</a> to take advantage of memory oversubscription.
```hcl
job "example-job" {
  group "example" {
    task "server" {
      resources {
        cpu        = 100
        memory     = 256 # soft limit (MB)
        memory_max = 768 # hard limit (MB), usable when the client has spare memory
      }
    }
  }
}
```
#### Additional recommendations
To avoid degrading the cluster experience, we recommend examining and monitoring resource utilization and considering the following suggestions:
- Set `oom_score_adj` for Linux host services that are not managed by Nomad, e.g. Docker, logging services, and the Nomad agent itself. For systemd services, you can use the `OOMScoreAdjust` field.
- Monitor hosts for memory utilization and set alerts on out-of-memory errors.
- Set the client's `reserved` memory high enough to cover host services that are not managed by Nomad, plus a buffer for memory excess. For example, if the client reserves 1 GB of memory, the allocations on the host may exceed their soft memory limits by almost 1 GB in aggregate before memory becomes contended and allocations are killed.
- Leverage resource quotas to restrict resource utilization within a namespace.
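The reserved-memory suggestion can be sketched in the client configuration (the 1024 MB value is illustrative; size it for your own host services):

```hcl
client {
  reserved {
    # Memory (MB) set aside for host services not managed by Nomad,
    # which also acts as a buffer for oversubscribed allocations.
    memory = 1024
  }
}
```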
Remember to thoroughly test and validate these configurations in a non-production environment before applying them to your production Nomad cluster. Monitor the cluster's performance and resource utilization closely and make adjustments based on your specific workload requirements.