# Scheduling

## Scheduling algorithms
Nomad supports two scheduling algorithms.
- Bin packing - Packs as many workloads as possible onto the fewest clients to maximize resource usage. It is ideal for cloud environments where infrastructure billing is time and resource based and capacity can be scaled in and out quickly.
- Spread - Distributes allocations evenly across all available clients to reduce density and potential resource contention. It is suitable for environments where clients are pre-provisioned and scale slowly, such as on-premises deployments.
These algorithms can be configured as a cluster-wide default or per node pool.

It is important to distinguish between the `spread` block in a job's task group and the spread algorithm at the cluster level. The task group's `spread` block customizes how a single job's allocations are spread across a target attribute, which is client-based by default. The cluster-level spread algorithm controls how the scheduler places jobs across the cluster. The default behavior is to bin pack jobs together while spreading the allocations of each job.
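For example, a task group can spread its allocations across datacenters with a `spread` block, independent of the cluster-level algorithm (the job and image names below are illustrative):

```hcl
job "example" {
  group "web" {
    count = 6

    # Spread this group's allocations across datacenters,
    # regardless of the cluster-level scheduler algorithm.
    spread {
      attribute = "${node.datacenter}"
    }

    task "server" {
      driver = "docker"
      config {
        image = "nginx:1.25"
      }
    }
  }
}
```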
Configuring scheduling algorithms
To set the scheduling algorithm at the cluster level, use the agent configuration file, the CLI, or the API. If the cluster has not been bootstrapped yet, you can set it by adding a `default_scheduler_config` section to your agent configuration file.

```hcl
server {
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}
```
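If the cluster is already bootstrapped, the same setting can be changed at runtime with the CLI:

```shell
# Update the cluster-wide scheduler algorithm on a running cluster
nomad operator scheduler set-config -scheduler-algorithm=spread

# Verify the change
nomad operator scheduler get-config
```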
### Node pool level configuration
Node pools in Nomad Enterprise allow you to customize the scheduler algorithm per node pool. This is useful for mixed environments where different node types require different scheduling strategies.
#### Configuring a node pool
- Ensure your `NOMAD_ADDR` environment variable points to the correct server address, and log in if needed:

```shell
$ echo $NOMAD_ADDR
https://<correct IP or hostname>:4646
$ nomad login
```
- Create a configuration file, for example `nodepools.nomad.hcl`:

```hcl
node_pool {
  name = "cloud-pool"

  scheduler_config {
    scheduler_algorithm = "binpack"
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    scheduler_algorithm = "spread"
  }
}
```
- Run [nomad node pool apply](https://developer.hashicorp.com/nomad/docs/commands/node-pool/apply) to apply the configuration to the cluster:

```shell
$ nomad node pool apply nodepools.nomad.hcl
```

- Add the `node_pool` parameter to the client configuration file to add the client to the node pool:
```hcl
# client.hcl
client {
  node_pool = "cloud-pool" # change the name to suit your naming convention
}
```
- Restart the Nomad agent, e.g. `systemctl restart nomad`.
- Jobs can now opt in to a node pool by specifying the `node_pool` parameter.
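As a minimal sketch (the job and image names are illustrative), a job opts in to a node pool like this:

```hcl
job "billing-api" {
  # Place this job's allocations only on clients in "cloud-pool",
  # which schedules with the binpack algorithm configured above.
  node_pool = "cloud-pool"

  group "api" {
    task "server" {
      driver = "docker"
      config {
        image = "billing-api:1.0"
      }
    }
  }
}
```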
For production, we recommend keeping your node pool configurations in version control and applying them as part of your CI/CD pipeline.
**Note:** Node pool configurations override the default scheduler configuration. For example, if the cluster default is set to `binpack` and a node pool is configured with `spread`, any workloads placed on that node pool will use `spread`.

## Preemption configuration
Preemption allows Nomad to evict lower-priority allocations to make room for higher-priority ones when resources are scarce. It ensures that critical workloads can acquire the resources they need even when the cluster is under high utilization. This feature is enabled by default for `system` jobs.

We recommend enabling preemption on production clusters for all workload types, especially if critical tier 1 workloads share hosts with lower-tier workloads. This ensures that tier 1 workloads always receive resources first, at the expense of potential downtime for the lower-tier workloads.
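Preemption decisions are driven by each job's `priority` (1 to 100, default 50). A minimal sketch of a high-priority job (names and values are illustrative):

```hcl
job "payments" {
  # Higher-priority jobs may preempt lower-priority allocations
  # when the cluster is resource constrained.
  priority = 90

  group "api" {
    task "server" {
      driver = "docker"
      config {
        image = "payments-api:2.3"
      }
    }
  }
}
```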
If the cluster has not been bootstrapped yet, you can enable preemption by adding a `default_scheduler_config` section to your agent configuration file.

```hcl
server {
  default_scheduler_config {
    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true
    }
  }
}
```
#### CLI
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-batch-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-batch-scheduler=true`</a>
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-service-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-service-scheduler=true`</a>
- <a href="https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#preempt-sysbatch-scheduler" target="_blank">`nomad operator scheduler set-config -preempt-sysbatch-scheduler=true`</a>
#### API
<a href="https://developer.hashicorp.com/nomad/api-docs/operator/scheduler#update-scheduler-configuration" target="_blank">`/v1/operator/scheduler/configuration`</a>
For additional details visit the <a href="https://developer.hashicorp.com/nomad/docs/concepts/scheduling/preemption" target="_blank">Preemption</a> documentation page.
## Memory Oversubscription
Memory oversubscription is an opt-in feature which allows tasks to exceed their reserved memory limit if the client has excess memory capacity. It is recommended to enable this feature to help maximize cluster memory utilization while also allowing a margin of error in case a task has a sudden memory spike.
<Tip>This feature can be enabled globally or <a href="https://developer.hashicorp.com/nomad/docs/other-specifications/node-pool#memory_oversubscription_enabled">per node pool</a></Tip>
Currently, the ExecV2, `raw_exec`, Docker, Podman, and Java task drivers support memory oversubscription. Consult the documentation of community-supported task drivers for their memory oversubscription support.

Visit the [Oversubscribe Memory](https://developer.hashicorp.com/nomad/tutorials/advanced-scheduling/memory-oversubscription) tutorial for more information on how to configure it.
If the cluster has not been bootstrapped yet, you can enable memory oversubscription by adding a `default_scheduler_config` section to your agent configuration file.
```hcl
server {
  default_scheduler_config {
    memory_oversubscription_enabled = true
  }
}
```

#### CLI

[`nomad operator scheduler set-config -memory-oversubscription=true`](https://developer.hashicorp.com/nomad/docs/commands/operator/scheduler/set-config#memory-oversubscription)
#### API

[`/v1/operator/scheduler/configuration`](https://developer.hashicorp.com/nomad/api-docs/operator/scheduler#update-scheduler-configuration)
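On a bootstrapped cluster, the API can update the scheduler configuration directly. A minimal sketch with `curl` (the payload below shows only the field relevant here; the endpoint replaces the whole scheduler configuration, so in practice read the current configuration first and include your existing settings in the update):

```shell
# Read the current scheduler configuration
curl -s "$NOMAD_ADDR/v1/operator/scheduler/configuration"

# Enable memory oversubscription (illustrative, partial payload)
curl -s -X POST "$NOMAD_ADDR/v1/operator/scheduler/configuration" \
  --data '{"MemoryOversubscriptionEnabled": true}'
```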
#### Node pool-level configuration

```hcl
node_pool {
  name = "cloud-pool"

  scheduler_config {
    memory_oversubscription_enabled = true
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    memory_oversubscription_enabled = false
  }
}
```
#### Task configuration
Tasks must specify <a href="https://developer.hashicorp.com/nomad/docs/job-specification/resources#memory_max">`memory_max`</a> to take advantage of memory oversubscription.
```hcl
job "example-job" {
  group "example" {
    task "server" {
      resources {
        cpu        = 100
        memory     = 256 # soft limit (MB)
        memory_max = 768 # hard limit (MB), usable when the client has spare memory
      }
    }
  }
}
```
#### Additional recommendations
To avoid degrading the cluster experience, we recommend examining and monitoring resource utilization and considering the following suggestions:
- Set `oom_score_adj` for Linux host services that are not managed by Nomad, e.g. Docker, logging services, and the Nomad agent itself. For systemd services, you can use the `OOMScoreAdjust` field.
- Monitor hosts for memory utilization and set alerts on out-of-memory errors.
- Set the client's `reserved` memory high enough to cover host services that are not managed by Nomad, plus a buffer for memory excess. For example, if the client reserves 1 GB of memory, the allocations on the host may exceed their soft memory limits by almost 1 GB in aggregate before memory becomes contended and allocations are killed.
- Leverage resource quotas to restrict resource utilization within a namespace.
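The reserved-memory suggestion can be sketched in the client configuration (the 1024 MB value is illustrative; size it for your own host services):

```hcl
client {
  reserved {
    # Memory (MB) set aside for host services not managed by Nomad,
    # which also acts as a buffer for oversubscribed allocations.
    memory = 1024
  }
}
```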
Remember to thoroughly test and validate these configurations in a non-production environment before applying them to your production Nomad cluster. Monitor the cluster's performance and resource utilization closely and make adjustments based on your specific workload requirements.