Scheduling
Scheduling algorithms
Nomad Enterprise supports two primary scheduling algorithms.
- Bin packing - Aims to maximize resource usage by packing workloads tightly onto available clients. It is ideal for cloud environments where infrastructure billing is time and resource based and capacity can be scaled in and out.
- Spread - This algorithm distributes jobs evenly across all available clients to reduce density and potential resource contention. It is suitable for environments where clients are pre-provisioned and scale gradually, such as on-premises deployments.
Configure these as a cluster-level default or at the node pool level.
It is important to distinguish between the spread block for task groups and the spread algorithm at the cluster level. The task group's spread block customizes how a single job's allocations are distributed across a target attribute, which by default is the client node. The spread scheduling algorithm controls how the scheduler places jobs across the cluster as a whole. The default behavior is to bin pack jobs together while spreading each job's allocations across clients.
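For illustration, here is a minimal sketch of a group-level spread block that distributes a job's allocations across datacenters; the job name, datacenters, and targets are hypothetical:

job "web" {
  datacenters = ["dc1", "dc2"]

  group "frontend" {
    count = 6

    # Spread this group's allocations evenly across the datacenter attribute
    spread {
      attribute = "${node.datacenter}"

      target "dc1" {
        percent = 50
      }

      target "dc2" {
        percent = 50
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}

This controls allocation placement for a single job and is independent of the cluster-wide scheduling algorithm.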
Configuring scheduling algorithms
To set the scheduling algorithm at the cluster level, use the agent configuration file, the command-line tool, or the API. If the cluster is not yet bootstrapped, set the algorithm by adding a default_scheduler_config block to your agent configuration file.
server {
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}
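If the cluster is already bootstrapped, the default_scheduler_config block only applies at bootstrap time, so update the live configuration with the command-line tool instead; for example:

$ nomad operator scheduler set-config -scheduler-algorithm=spread

# Confirm the current scheduler configuration
$ nomad operator scheduler get-config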
Node pool-level configuration
Node pools in Nomad Enterprise allow you to customize the scheduler algorithm per node pool. This is useful for mixed environments where different node types require different scheduling strategies.
Configuring a node pool
Set your Nomad endpoint environment variable to the correct server address and log in if needed.

$ echo $NOMAD_ADDR
https://<correct IP or hostname>:4646
$ nomad login

Create a configuration file named `nodepools.nomad.hcl`:

node_pool {
  name = "cloud-pool"

  scheduler_config {
    scheduler_algorithm = "binpack"
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    scheduler_algorithm = "spread"
  }
}

Apply the node pool configuration to the cluster:

$ nomad node pool apply nodepools.nomad.hcl

Add the `node_pool` parameter to the client configuration file to add the client to the node pool:

# client.hcl
client {
  node_pool = "cloud-pool" # change the name to suit your naming convention
}

Restart the Nomad agent, for example with `systemctl restart nomad`. Jobs can now opt in to a node pool by specifying the `node_pool` parameter.
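For example, a job can opt in to the cloud-pool node pool with the job-level `node_pool` parameter; the job and task names below are placeholders:

job "api" {
  # Place this job only on clients that belong to the cloud-pool node pool
  node_pool = "cloud-pool"

  group "api" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}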
For production, keep your node pool configurations in version control and apply them as part of your pipeline build process.
Note
Node pool configurations override the default scheduler configuration. For example, if the cluster default is binpack and a node pool is configured with spread, any workloads placed on that node pool use spread.
Preemption configuration
Preemption allows Nomad Enterprise to evict lower-priority tasks to make room for higher-priority tasks when resources are scarce. It ensures that critical workloads can acquire the necessary resources even when the cluster is under high utilization. Nomad Enterprise enables this feature by default for `system` jobs. We recommend enabling preemption on production clusters for all workload types if critical tier 1 workloads share hosts with lower-tier workloads and must take priority. This ensures that tier 1 workloads always receive resources first, at the expense of potential downtime for the lower-tier workloads.
If the cluster is not in a bootstrapped state yet, you can enable preemption by adding a default_scheduler_config section to your agent configuration file.
server {
  default_scheduler_config {
    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true
    }
  }
}
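Once preemption is enabled for a scheduler type, job priority determines which allocations may be preempted. As a sketch, a critical service can declare a high priority so it can displace lower-priority allocations when a client is full; the job name and priority value here are illustrative:

job "payments" {
  type = "service"

  # Priority ranges from 1 to 100 (default 50). With preemption enabled,
  # higher-priority jobs can evict allocations of lower-priority jobs.
  priority = 90

  group "api" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}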
Command-line tool
- `nomad operator scheduler set-config -preempt-batch-scheduler=true`
- `nomad operator scheduler set-config -preempt-service-scheduler=true`
- `nomad operator scheduler set-config -preempt-sysbatch-scheduler=true`
API
`/v1/operator/scheduler/configuration`

For additional details, visit the Preemption documentation page.

Memory oversubscription
Memory oversubscription is an opt-in feature which allows tasks to exceed their reserved memory limit if the client has excess memory capacity. We recommend enabling this feature to help maximize cluster memory utilization while also allowing a margin of error in case a task has a sudden memory spike.
Tip
Enable this feature globally or per node pool.

The ExecV2, raw_exec, Docker, Podman, and Java task drivers support memory oversubscription. Consult the documentation of community-supported task drivers for their memory oversubscription support.
Visit the Oversubscribe Memory tutorial for more information on how to configure it.
If the cluster is not in a bootstrapped state yet, you can enable memory oversubscription by adding a default_scheduler_config section to your agent configuration file.
server {
  default_scheduler_config {
    memory_oversubscription_enabled = true
  }
}
Command-line tool:
[nomad operator scheduler set-config -memory-oversubscription=true](https://developer.hashicorp.com/nomad/commands/operator/scheduler/set-config#memory-oversubscription)
API:
[/v1/operator/scheduler/configuration](https://developer.hashicorp.com/nomad/api-docs/operator/scheduler#update-scheduler-configuration)
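As a sketch of the API route, the scheduler configuration can be read and updated with standard HTTP requests. This example assumes a reachable agent address in NOMAD_ADDR and an ACL token in NOMAD_TOKEN, and shows only a subset of the payload fields; read the current configuration first and include any settings you want to preserve:

# Read the current scheduler configuration
$ curl -s \
    --header "X-Nomad-Token: $NOMAD_TOKEN" \
    $NOMAD_ADDR/v1/operator/scheduler/configuration

# Enable memory oversubscription cluster-wide
$ curl -s \
    --request POST \
    --header "X-Nomad-Token: $NOMAD_TOKEN" \
    --data '{"SchedulerAlgorithm": "binpack", "MemoryOversubscriptionEnabled": true}' \
    $NOMAD_ADDR/v1/operator/scheduler/configuration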
Node pool-level configuration
node_pool {
  name = "cloud-pool"

  scheduler_config {
    memory_oversubscription_enabled = true
  }
}

node_pool {
  name = "on-prem-pool"

  scheduler_config {
    memory_oversubscription_enabled = false
  }
}
Task configuration
Tasks must specify memory_max to take advantage of memory oversubscription.
job "example-job" {
group "example" {
task "server" {
resources {
cpu = 100
memory = 256
memory_max = 768
}
}
}
}
Additional recommendations
To avoid degrading the cluster experience, we recommend examining and monitoring resource utilization and considering the following suggestions:
- Set `oom_score_adj` for Linux host services that are not managed by Nomad Enterprise, for example Docker, logging services, and the Nomad agent itself. For `systemd` services, you can use the `OOMScoreAdj` field (see the sketch after this list).
- Monitor hosts for memory utilization and set alerts on out-of-memory errors.
- Set the client's reserved memory high enough to cover host services that are not managed by Nomad, as well as a buffer for the memory excess. For example, if the client reserved memory is 1GB, the allocations on the host may exceed their soft memory limit by almost 1GB in aggregate before the memory becomes contended and allocations get killed.
- Leverage resource quotas to restrict resource utilization within a namespace.
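As a minimal sketch of the first and third recommendations, a systemd drop-in can make the OOM killer less likely to target the Nomad agent, and the client's reserved block can hold back memory for host services. The drop-in path and values below are illustrative:

# /etc/systemd/system/nomad.service.d/oom.conf (illustrative path)
[Service]
# Negative values make the OOM killer less likely to target this process
OOMScoreAdj=-1000

And in the client configuration:

# client.hcl
client {
  reserved {
    # Reserve 1GB (value in MB) for host services and oversubscription headroom
    memory = 1024
  }
}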
Remember to thoroughly test and validate these configurations in a non-production environment before applying them to your production Nomad cluster. Monitor the cluster's performance and resource utilization and make adjustments based on your specific workload requirements.