Nomad
Schedule AI workloads efficiently on Nomad
AI workloads present unique challenges for container orchestration. Unlike traditional web applications, AI models require significant computational resources, varying memory footprints, and often specialized hardware like Graphics Processing Units (GPUs). Different AI models have different resource requirements — some are optimized for smaller hardware, while others need substantial compute power.
Nomad makes running AI workloads easier and keeps your data in your control. Private Large Language Models (LLMs) run on your infrastructure, reducing risk and ensuring your sensitive data is not used to train models without explicit consent. In addition, Nomad offers node pools for workload segmentation, native service discovery to automatically connect your services, and Nomad Actions to enable safe, repeatable operations on running applications.
In this tutorial, you will deploy a complete AI application stack on Nomad. You will learn how to set up a Nomad cluster with multiple node pools, deploy IBM Granite language models using Ollama, and use Nomad's native service discovery to connect AI backends with the frontend chat interface. You will also safely apply changes to your services with Nomad Actions, customize AI models by connecting to the running allocation, and scale your cluster by adding new nodes to the node pool.
AI workload components
This tutorial uses three components to build the complete AI application:
IBM Granite - An enterprise-grade large language model that offers several versions for different use cases.
Ollama - An open source framework for building and running LLMs. It handles model loading, memory management, and provides an API for interaction.
Open WebUI - An open source web-based chat interface that communicates with Ollama. It offers a ChatGPT-like experience with chat history, model switching, and RAG integration.
By the end of this tutorial, you will have hands-on experience running AI workloads on Nomad and understand how to scale and manage these workloads in production environments.
Prerequisites
To follow this tutorial on your local device, you need your AWS access key and secret key. You can also use the hosted Instruqt track to follow along with the tutorial.
Click on the following Start interactive lab button to open the Instruqt track. A page overlay appears that loads the track next to these instructions. We call this overlay "the lab." You can resize the lab at any time using the two horizontal lines at the top.
Launch Terminal
This tutorial includes a free interactive command-line lab that lets you follow along on actual cloud infrastructure.
At the top right, click Launch to start the track. The track will take about two minutes to load. Click the Start button in the bottom right corner after the track loads.
The lab has three tabs visible at the top:
- CLI, a terminal session
- Editor, a visual code editor
- Cloud Console, a web page displaying AWS cloud credentials.
During this tutorial, you will run commands in the CLI tab and edit code in the Editor tab.
Review the example repository
This tutorial uses the learn-nomad-ai-workloads
companion repository on GitHub. This repository contains all the Terraform and Nomad configuration files you will need to complete this tutorial.
Configure AWS credentials
You need to set your AWS region and credentials so Terraform can use them to deploy your Nomad cluster.
First, set your AWS region. Replace us-east-1
with your region.
$ export AWS_DEFAULT_REGION="us-east-1"
Then, set your AWS access key and secret key.
$ export AWS_ACCESS_KEY_ID="<Your AWS Access Key>" && \
export AWS_SECRET_ACCESS_KEY="<Your AWS Secret Key>"
Finally, set the TF_VAR_...
environment variables so Terraform uses your AWS keys and region to configure your Nomad cluster. Terraform also writes these values as Nomad variables, which let you store and consume sensitive values rather than saving them into images or code. In this scenario, the Open WebUI job uses these values to access AWS S3 storage to persist assets and chat artifacts.
$ export TF_VAR_aws_access_key="$AWS_ACCESS_KEY_ID" && \
export TF_VAR_aws_secret_access_key="$AWS_SECRET_ACCESS_KEY" && \
export TF_VAR_aws_default_region="$AWS_DEFAULT_REGION"
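Later, once the cluster is deployed and your Nomad CLI is configured, you can confirm that Terraform wrote these values into Nomad's variable store. For example, the following command reads the variable path that the Open WebUI job consumes later in this tutorial.
$ nomad var get nomad/jobs/open-webui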
Deploy infrastructure
In this section, you will deploy a Nomad cluster with multiple node pools optimized for different AI workload types. Then, you will access the Nomad cluster through both the Nomad UI and CLI.
Configure node pools
AI workloads have distinct infrastructure requirements across their components. A typical AI application includes a web interface for user interaction and a language model for handling the computational work.
The web interface serves as a lightweight frontend, primarily processing user input. Since it is stateless and relays requests to backend services, the web interface requires minimal CPU and memory resources but must be Internet-accessible.
The language model represents a fundamentally different workload profile, demanding substantial CPU and memory for inference processing. Keeping these models in private networks reduces attack surface and prevents unauthorized access. These requirements make them well-suited for larger instances within private network segments.
Without node pools, you must write complex constraints for every job to ensure proper placement on the appropriate nodes. The following example shows the constraints required to place a small job on a public node.
constraint {
  attribute = "${meta.public}"
  value     = "true"
}

constraint {
  attribute = "${node.class}"
  value     = "small"
}
This approach has several problems: jobs may need similar constraint logic, the syntax is error-prone, changes require updating multiple files, and adding new node types is difficult. Node pools solve this by grouping clients into logical sets for job placement. Instead of writing constraints for every job, you can define node pools once and reference them in job specifications, making configuration much simpler.
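For comparison, the following sketch (the job name is illustrative) shows how a job targets a pool once node pools exist; the node_pool attribute replaces most of the constraint logic above.
# Hypothetical job targeting the small public pool directly
job "web-ui" {
  node_pool = "small"
  # ...
}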
You will create two node pools:
- Small public nodes for lightweight workloads like web UIs requiring external access
- Large private nodes for heavy compute workloads like large AI models in private networks
Create a file named defaults.auto.tfvars
to specify the node pool configuration. Terraform uses these values to deploy the Nomad cluster (nomad-server.tf
) and client nodes (nomad-clients.tf
).
defaults.auto.tfvars
aws_region = "us-east-1"
aws_server_count = "1" # 1 Nomad server
aws_small_private_client_count = "0"
aws_small_public_client_count = "1" # 1 small client with public IP for UI
aws_medium_private_client_count = "0"
aws_medium_public_client_count = "0"
aws_large_private_client_count = "1" # 1 large client with no public IP for AI model
aws_large_public_client_count = "0"
This defines and creates the following Nomad cluster:
- One Nomad server: Manages the cluster and handles job scheduling decisions
- One small public client: Runs lightweight workloads like web UIs with external access
- One large private client: Runs compute-intensive AI models in a secure, private network
Later in this tutorial, you will add a medium private client to demonstrate how Nomad can scale to accommodate different model sizes and resource requirements.
Deploy the Nomad cluster
Initialize Terraform configuration.
$ terraform init
Apply the Terraform configuration to create a Nomad server cluster, client nodes in different node pools, networking infrastructure including VPC, subnets, and security groups, and initial Nomad configurations and management tokens. When prompted, enter yes
to confirm.
$ terraform apply
The Terraform configuration automatically configures each client with the appropriate node pool assignment. When you deploy jobs later, they can reference the node pool name rather than writing complex constraints.
shared/data-scripts/user-data-client.sh
client {
  enabled = true

  options {
    "driver.raw_exec.enable"    = "1"
    "docker.privileged.enabled" = "true"
  }

  meta {
    _NOMAD_AGENT_META
    externalAddress = "_PUBLIC_IP_ADDRESS"
  }

  server_join {
    retry_join = [ "_NOMAD_RETRY_JOIN" ]
  }

  node_pool = "_NODE_POOL"

  ## ...
}
Once complete, Terraform outputs the nomad_UI URL and the nomad_management_token, an ACL token with management privileges.
Access Nomad UI
Open the nomad_UI
address in your browser. The cluster uses a self-signed certificate, so your browser will likely show a warning. Accept the certificate and proceed to the Nomad UI.
On the main page of the Nomad UI, click on the token link under the Not Authorized
heading. Click on the red Sign Out button to log out of the anonymous user. Paste the nomad_management_token
from the Terraform output in the Secret ID field and click Sign in with secret.
You are now logged in as a Nomad admin. Click on Topology to view the dashboard with cluster overview showing one server and two clients.
Configure the Nomad CLI
Configure the Nomad CLI for job deployment.
$ export NOMAD_ADDR="$(terraform output -raw nomad_UI)" && \
export NOMAD_TOKEN="$(terraform output -raw nomad_management_token)" && \
export NOMAD_SKIP_VERIFY=true
These commands point the Nomad CLI to your cluster's address and provide the ACL token to authorize requests. You can now use nomad
CLI commands without repeatedly specifying the address or token.
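To confirm the CLI can reach the cluster, you can, for example, list the server members and client nodes.
$ nomad server members
$ nomad node status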
Deploy core AI services
In this section, you will deploy your AI application. This includes an Ollama service running the Granite 3.3 language model on a large private compute node and Open WebUI on a small public node. In addition, you will define a consistent workflow with Nomad Actions to create a secure admin user.
Deploy the Granite 3.3 model
First, you will deploy an AI language model server using Ollama to host IBM's Granite 3.3 2B parameter model.
The Granite 3.3 job specification defines a service that runs Ollama in a Docker container, automatically downloads the Granite 3.3 model weights during startup, allocates substantial CPU and memory resources for model inference, and targets the large
node pool to ensure placement on the compute-optimized node.
jobs/granite-3-3.nomad.hcl
job "ollama" {
type = "service"
node_pool = "large"
group "ollama" {
count = 1
network {
port "ollama" {
to = 11434
static = 8080
}
}
task "ollama-task" {
driver = "docker"
service {
name = "ollama-backend"
port = "ollama"
provider = "nomad"
}
config {
image = "ollama/ollama"
ports = ["ollama"]
}
resources {
cpu = 9100
memory = 15000
}
}
task "download-granite3.3-model" {
driver = "exec"
lifecycle {
hook = "poststart"
}
resources {
cpu = 100
memory = 100
}
template {
data = <<EOH
{{ range nomadService "ollama-backend" }}
OLLAMA_BASE_URL="http://{{ .Address }}:{{ .Port }}"
{{ end }}
EOH
destination = "local/env.txt"
env = true
}
config {
command = "/bin/bash"
args = [
"-c",
"curl -X POST ${OLLAMA_BASE_URL}/api/pull -d '{\"name\": \"granite3.3:2b\"}'"
]
}
}
}
}
Deploy the job. Nomad will allocate this job on the large private client node because the job specifies node_pool = "large"
in its configuration.
$ nomad job run jobs/granite-3-3.nomad.hcl
In the Nomad UI on the Jobs page, the ollama
job appears. Wait for the job to download the model and start up. Inspect the job's allocation in the UI by clicking Jobs and then ollama. You will find a running allocation where you can view logs and CPU and memory usage.
The Ollama service is now running and serving the Granite 3.3 model, but you need a way to interact with it.
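If you want to check the model registry directly before deploying the UI, you can query Ollama's API from a host that can reach the private subnet. The address below is a placeholder for the large client's private IP, and /api/tags lists the models Ollama has downloaded.
$ curl http://<LARGE_CLIENT_PRIVATE_IP>:8080/api/tags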
Deploy Open WebUI
The Open WebUI job specification defines a web interface that targets the small
node pool for public-facing deployment, configures native service discovery to automatically connect to the Ollama backend, implements secure admin user creation through Nomad Actions, and provisions persistent storage using a host volume for user data and chat history.
jobs/openwebui.nomad.hcl
job "open-webui" {
type = "service"
node_pool = "small"
group "open-webui" {
constraint {
attribute = "${meta.isPublic}"
operator = "="
value = "true"
}
volume "openwebui-data" {
type = "host"
source = "openwebui-data"
read_only = false
}
count = 1
network {
port "open-webui" {
to = 8080
static = 80
}
}
task "open-webui-task" {
driver = "docker"
service {
name = "open-webui-svc"
port = "open-webui"
provider = "nomad"
check {
type = "http"
name = "open-webui-health"
path = "/"
interval = "20s"
timeout = "5s"
}
}
volume_mount {
volume = "openwebui-data"
destination = "/app/backend/data"
read_only = false
}
config {
image = "ghcr.io/open-webui/open-webui:main"
ports = ["open-webui"]
}
resources {
cpu = 4000
memory = 3500
}
template {
data = <<EOH
OLLAMA_BASE_URLS={{ range nomadService "ollama-backend" -}}http://{{ .Address }}:{{ .Port }};{{- end }}
ENV="dev"
DEFAULT_MODELS="granite-3.3"
OFFLINE_MODE="True"
ENABLE_SIGNUP="False"
ENABLE_OPENAI_API="False"
STORAGE_PROVIDER="s3"
{{ with nomadVar "nomad/jobs/open-webui" }}
S3_ACCESS_KEY_ID="{{ .aws_access_key_id }}"
S3_SECRET_ACCESS_KEY="{{ .aws_access_secret_key }}"
S3_ENDPOINT_URL="https://s3.{{ .aws_default_region }}.amazonaws.com"
S3_REGION_NAME="{{ .aws_default_region }}"
S3_BUCKET_NAME="{{ .openwebui_bucket }}"
{{ end }}
EOH
destination = "local/env.txt"
env = true
}
# user-email: admin@local.local
# bcrypt the desired password and place the value into the auth table
# (substitute BCRYPTED_PASSWORD with value)
template {
data = <<EOH
INSERT INTO user (id,name,email,role,profile_image_url,last_active_at,updated_at,created_at) VALUES('ec80e845-976d-4f0e-beb7-30212e69da61','admin','admin@local.local','admin','...','1752842322','1752842322','1752842322');
INSERT INTO auth (id,email,password,active) VALUES ('ec80e845-976d-4f0e-beb7-30212e69da61','admin@local.local','BCRYPTED_PASSWORD','1');
EOH
destination = "local/create-admin-user.sql"
env = false
}
action "create-admin-user" {
command = "/bin/bash"
args = [
"-c",
"apt-get update && apt-get install -y sqlite3 && echo 'Running SQL insert commands...' && sqlite3 /app/backend/data/webui.db < /local/create-admin-user.sql && echo 'Finished running SQL commands'"
]
}
}
}
}
Open WebUI requires an admin login with a bcrypt-hashed password. For this tutorial, use the pre-generated hash for password hashiconf25
:
$2a$12$3iBwcMCx.eSLBwpIqLdbQ.H.xMwQGAixpe2z6Xqq9gNcdbZtkIRke
Open the Open WebUI Nomad job file jobs/openwebui.nomad.hcl
in your editor. Find the placeholder text BCRYPTED_PASSWORD
and replace it with the password hash.
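If you prefer to generate your own hash instead of using the pre-generated one, one option (assuming the Python bcrypt package is available on your workstation) is the following one-liner.
$ python3 -c 'import bcrypt; print(bcrypt.hashpw(b"your-password", bcrypt.gensalt(rounds=12)).decode())'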
Save the changes to the job file, then run the job.
$ nomad job run jobs/openwebui.nomad.hcl
In the Nomad UI, you will find a job named open-webui
alongside the ollama
job.
Create admin user for Open WebUI
Open WebUI is now deployed, but you need to finalize the setup by creating an admin user account so you can log in.
Nomad Actions define safe, repeatable operations that you can trigger on running tasks. Unlike opening a shell into a container, which is hard to secure and audit, Nomad Actions are declared in the job spec, making them version-controlled and repeatable. They are ACL-gated, auditable with logged executions, and accessible through both the Nomad UI and CLI.
The Open WebUI jobspec includes a custom Nomad Action called create-admin-user
that safely inserts an admin user into the application's database using the password hash provided.
Click on the Jobs page from the left navigation and then click on the open-webui job from the list. In the top right corner, click on the Actions dropdown and then click on the create-admin-user
option. This action inserts the admin user with the bcrypt password hash you added to the job file.
Open WebUI now has an admin user.
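If you prefer the CLI, you can trigger the same action with the nomad action command; the flags below assume the group and task names from the job spec above.
$ nomad action -job open-webui -group open-webui -task open-webui-task create-admin-user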
Interact with the AI model
Everything is now in place to interact with the AI model.
First, find the Open WebUI URL by locating the Open WebUI service running on the Nomad client with a public IP. In the Nomad UI, go to Clients and click on the client that was labeled in Terraform variables as the public client, likely named aws-small-public-client-0
or similar.
In the client details, scroll to the Attributes section and find the externalAddress
metadata variable. This is the public IP of the client node. Alternatively, you can find the IP address using the following command.
$ nomad node status -verbose
Copy the IP address.
Open a web browser tab and navigate to http://<IP_ADDRESS>. Then, log in to Open WebUI using the following credentials:
Name | Value |
---|---|
Email | admin@local.local |
Password | hashiconf25 or the password you used if you generated a different hash |
Once logged in, you will find the web interface with a chat prompt. In the top-left, the model selector dropdown shows that the default model selection is Granite 3.3, which is the model deployed with Ollama.
Now, test the model by ensuring the model selected is the Granite 3.3 model. In the message box, type "Why is the ocean blue?" and submit your message.
The Granite 3.3 model processes the question and provides a response in the chat UI. This may take a few seconds given the model size of 2B parameters and the hardware of the large node.
While the question is being processed, take a look at the Nomad UI or CLI metrics. Navigate to the Ollama job in Nomad UI and select the allocation running the Granite 3.3 model. Observe the CPU and memory resource usage of the ollama-task
while the model is preparing a response. CPU usage spikes and memory usage increases as the model runs inference. This shows Nomad's resource isolation and monitoring: the job is limited to its allocation and Nomad tracks its usage.
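You can view the same utilization from the CLI. For example, look up the allocation ID with nomad job status and then display its resource statistics.
$ nomad job status ollama
$ nomad alloc status -stats <ALLOCATION_ID>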
You have now successfully deployed an AI model on Nomad and interacted with it through a web interface. In doing so, you have touched on several Nomad concepts.
- Node pools ensured the heavy model runs on the appropriate node.
- Native service discovery allowed the Open WebUI to automatically discover the Ollama backend using Nomad's built-in service discovery.
- Secrets management provided AWS keys and admin password securely using environment variables and Nomad's variable store.
- Resource monitoring used the Nomad UI to find how the workload uses CPU and memory.
- Nomad Actions provided a safe, auditable way to create admin users.
Deploy a custom model
Nomad lets you execute commands inside running containers, which is useful for debugging or in-place modifications. Use nomad exec to adjust the model's behavior without redeploying anything. Note that nomad exec is designed for debugging and development, not for production operations. Any changes made inside the allocation's filesystem are ephemeral and will be lost when the allocation is rescheduled or restarted.
The goal is to create a variant of the Granite 3.3 model that responds like a pirate. Ollama supports creating new model variants by providing a model file with instructions like a system prompt based on an existing model.
Open an interactive shell session inside the running Ollama allocation container.
$ nomad exec -i -t -task ollama-task -job ollama /bin/bash
The -i and -t flags open an interactive TTY session. The -task ollama-task flag selects the task running the Ollama service (the task name in the job spec), and -job ollama tells Nomad to pick an allocation from the ollama job.
In the container shell session, create a new model definition file named pirate-granite3-3-2b.modelfile. This file instructs Ollama to derive a new model from granite3.3:2b with a modified system prompt that gives the model a pirate persona.
$ cat > pirate-granite3-3-2b.modelfile << EOF
FROM granite3.3:2b
# sets the temperature to 1 (higher is more creative, lower is more coherent)
PARAMETER temperature 1
# sets the context window size to 4096 tokens (how much context the model can consider)
PARAMETER num_ctx 4096
# sets a custom system message to specify the behavior of the assistant
SYSTEM You are a pirate, acting as an assistant.
EOF
Create the new model.
$ ollama create "pirate-granite-3.3:2b" -f pirate-granite3-3-2b.modelfile
Ollama loads the base model and applies the modifications, registering a new model named pirate-granite-3.3:2b. This operation is quick because it layers the new configuration on top of the existing model weights instead of copying them.
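Before exiting, you can confirm that the new model is registered alongside the base model.
$ ollama list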
Exit the container shell by typing exit
.
$ exit
Open the Open WebUI interface in your browser and refresh the page. The model dropdown now lists pirate-granite-3.3:2b
as an available model since the UI queries the backend for available models. Select pirate-granite-3.3:2b
from the dropdown. If it does not appear, wait a few seconds and refresh the page again.
Submit the same question, "Why is the ocean blue?". The response contains the same factual content from the Granite model but is phrased in a pirate-like style: "Arr, the ocean be blue because...".
Understand ephemeral vs persistent storage
The new model was stored in ephemeral storage. By default, Nomad uses ephemeral storage for allocations, so if the job is rescheduled or stopped, this new model file is lost. This is acceptable for this tutorial since you can always re-download the original model data. In a production scenario, if you want to preserve such changes, you need to use a persistent volume. Nomad supports host volumes and CSI volumes to attach durable storage to tasks.
The choice between ephemeral and persistent storage depends on whether your workload is stateful or stateless. Model weights are stateless data that can be re-downloaded, making ephemeral storage ideal for them. Ephemeral storage is typically faster than network-attached persistent storage, which improves model loading and inference performance, and it requires no additional configuration or management overhead. When scaling out, new instances can quickly download model weights rather than waiting for persistent storage to be provisioned and attached. User conversations and customizations, however, are stateful data that need to persist across restarts, making them better suited for persistent storage.
Nomad handles stateful workloads by combining ephemeral and external storage: model weight data is ephemeral but reproducible, and important state is shipped to external storage. If you truly need persistent local state, you could allocate a host volume such as an EBS volume on AWS to the Nomad job for storing models or data across restarts.
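As a sketch, a host volume is defined in the client agent configuration and then requested by a job with a volume block; the name and path below are illustrative.
# Client agent configuration (illustrative name and path)
client {
  host_volume "model-data" {
    path      = "/opt/nomad/volumes/model-data"
    read_only = false
  }
}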
Expand the cluster
In this section, you will add a medium-sized node to accommodate different AI model types, then deploy the Granite 4.0 model to show how Nomad's service discovery handles multiple backends.
Add a medium client node
So far, all workloads have been running on the two original client nodes. Now, you will simulate a scenario where you want to deploy a new type of workload that has different resource needs. IBM's Granite 4.0 Tiny Preview is a newer model optimized to run on smaller hardware. It is a 7B parameter MoE model, but with only ~1B active parameters.
Since Granite 4.0 can perform on smaller resources, you would want to create a node pool sized specifically for it. As a result, you will create a medium private node pool, then schedule Granite 4.0 on it.
Open defaults.auto.tfvars
again in your editor. Update the line for aws_medium_private_client_count
to 1
.
defaults.auto.tfvars
aws_region = "us-east-1"
aws_server_count = "1"
aws_small_private_client_count = "0"
aws_small_public_client_count = "1"
aws_medium_private_client_count = "1"
aws_medium_public_client_count = "0"
aws_large_private_client_count = "1"
aws_large_public_client_count = "0"
Save the file and apply the configuration. Confirm with yes
when prompted.
$ terraform apply
Terraform creates a new medium-sized Nomad client in the private subnet with no public IP. After a few minutes, the new client automatically joins the Nomad cluster. You can verify in the Nomad UI under Clients that there is now an aws-medium-private-client-0
alongside the existing small and large clients.
You have scaled the Nomad cluster by adding a node in a new node pool. The separation into node pools ensures you can schedule appropriate jobs there without impacting the original nodes.
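You can also inspect node pools from the CLI, for example:
$ nomad node pool list
$ nomad node pool nodes medium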
Deploy the Granite 4.0 model
Deploy the job for the Granite 4.0 Tiny Preview model.
$ nomad job run jobs/granite-4-0.nomad.hcl
This Nomad jobspec runs another Ollama instance that downloads the Granite 4.0 Tiny Preview model. Nomad schedules this job on the new medium-sized client and downloads the Granite 4.0 model from Hugging Face. Notice how Nomad registers this service as ollama-backend.
jobs/granite-4-0.nomad.hcl
job "ollama-granite-4-0" {
type = "service"
node_pool = "medium"
group "ollama-granite-4-0" {
## ...
task "ollama-task-granite-4-0" {
## ...
service {
name = "ollama-backend"
port = "ollama"
provider = "nomad"
}
## ...
}
task "download-granite4.0-model" {
## ...
config {
command = "/bin/bash"
args = [
"-c",
"curl -X POST ${OLLAMA_BASE_URL}/api/pull -d '{\"name\": \"hf.co/ibm-granite/granite-4.0-tiny-preview-GGUF:Q4_K_M\"}'"
]
}
}
}
}
Check the Nomad UI to ensure the new job is running on the medium client. Once it is running, the Granite 4.0 model is available in the cluster. At this point, you have two AI models deployed on Nomad: Granite 3.3 on a large node with an additional pirate variant, and Granite 4.0 on a medium node.
Open WebUI automatically picks up this new model because its OLLAMA_BASE_URLS configuration lists every instance of the ollama-backend service, delimited by semicolons. When you deploy another Ollama job that registers as ollama-backend, Open WebUI's environment retrieves the new backend address automatically. Open WebUI also invalidates the current authentication token and forces you to log in again.
OLLAMA_BASE_URLS={{ range nomadService "ollama-backend" -}}http://{{ .Address }}:{{ .Port }};{{- end }}
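You can confirm that both Ollama allocations are registered under the same service name with Nomad's native service discovery CLI, for example:
$ nomad service info ollama-backend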
Log in to Open WebUI again.
Name | Value |
---|---|
Email | admin@local.local |
Password | hashiconf25 or the password you used if you generated a different hash |
When you log in again, you will find granite-4.0-tiny-preview as an available model in the model dropdown. Submit a question to this model and note how quickly the response appears, despite the model being hosted on a smaller node.
Understand Nomad volumes and storage
The reason Open WebUI maintains your chat history, user accounts, and customizations across restarts is because it uses Nomad's volume system for persistent storage. Understanding Nomad's volume options is crucial for designing stateful applications.
Volume types and trade-offs
Nomad supports several volume types, each with different characteristics and use cases.
Host volumes provide direct access to the host filesystem, offering the best performance and lowest latency since data is stored locally on the node. They are ideal for applications that need high I/O performance or when you want to leverage local SSD storage. However, host volumes tie your data to a specific node, making it difficult to migrate workloads or scale across multiple nodes. If the node fails, you risk data loss unless you have external backup mechanisms.
CSI volumes integrate with Container Storage Interface drivers, enabling integration with cloud storage providers like AWS EBS, Azure Disk, or Google Persistent Disk. They provide persistent, network-attached storage that survives node failures and enables workload migration. CSI volumes offer better data durability and portability compared to host volumes, but they typically have higher latency and lower throughput than local storage. They are well-suited for applications that need data persistence across node failures.
Open WebUI's storage strategy
The Open WebUI job uses a host volume configuration that demonstrates a practical approach to AI workload storage.
volume "openwebui-data" {
type = "host"
source = "openwebui-data"
read_only = false
}
volume_mount {
volume = "openwebui-data"
destination = "/app/backend/data"
read_only = false
}
This configuration mounts a host directory to /app/backend/data
inside the container, where Open WebUI stores its SQLite database containing user accounts, chat history, and application state. The choice of host volumes makes sense for this use case because Open WebUI's data is relatively small. Open WebUI benefits from fast local storage for database operations, and the web UI is stateless enough that it can be easily recreated if the node fails.
For production AI workloads, you might choose different storage strategies based on your requirements. Model weights could be stored on fast local storage for optimal inference performance, while user data and conversation history might use CSI volumes for better durability and portability. The key is matching your storage choice to your application's data access patterns, durability requirements, and performance needs.
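For example, a group-level request for a CSI-backed volume might look like the following sketch, assuming a volume named chat-history-ebs has already been created and registered with a CSI plugin such as the AWS EBS driver.
volume "chat-history" {
  type            = "csi"
  source          = "chat-history-ebs"
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}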
Cluster scaling considerations
This section describes Nomad's autoscaling capabilities, which automatically adjust both workload resources and cluster nodes based on demand, helping you handle dynamic AI workload requirements efficiently.
Why autoscaling matters for AI workloads
Manually scaling the cluster and jobs is educational, but in production you would want this to happen automatically based on demand. AI workloads present unique scaling challenges that make autoscaling particularly valuable.
- AI models have variable resource requirements depending on the type of inference being performed. Some queries are simple and fast, while others require extensive processing.
- User demand for AI services can be highly unpredictable, with sudden spikes during peak hours or when new features are released.
- Different AI models have different resource profiles, and you may need to scale each model type independently based on its individual demand pattern.
- Over-provisioning leads to wasted resources and inefficient utilization, while under-provisioning results in poor user experience and potential service degradation.
- AI workloads often have long startup times due to model loading, making reactive scaling less effective than predictive scaling.
Configure Nomad autoscaling for workloads and nodes
Nomad Autoscaler is a component that can monitor metrics and scale workloads (change task group counts) or nodes (provision or terminate clients via cloud integration) in response to load.
Horizontal Application Scaling allows Nomad Autoscaler to monitor metrics (like CPU or queue length) and adjust the number of task instances for a job dynamically. For example, if Open WebUI had multiple replicas and high CPU, the autoscaler could increase the count of those tasks. Horizontal Cluster Scaling enables the autoscaler to integrate with cloud APIs to add or remove Nomad client nodes. For example, it can scale out another "large" node if the current one is over-utilized.
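As a sketch, a horizontal application scaling policy lives in a task group's scaling block; the limits, intervals, and the nomad-apm source below are illustrative, and the Nomad Autoscaler itself must be running in the cluster.
group "open-webui" {
  count = 1

  scaling {
    enabled = true
    min     = 1
    max     = 5

    policy {
      cooldown            = "1m"
      evaluation_interval = "30s"

      check "avg_cpu" {
        source = "nomad-apm"
        query  = "avg_cpu"

        strategy "target-value" {
          target = 70
        }
      }
    }
  }
}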
In this scenario, instead of manually running Terraform to add a medium node, you could have configured the autoscaler to add the node when the Granite 4.0 job was submitted or when resources became scarce.
The Nomad Autoscaler runs as a separate service, and it can even run as a Nomad job. It uses plugins to interface with monitoring systems and cloud providers. By using it, you can automatically maintain your cluster and workload instance count to respond to demand while ensuring optimal resource utilization.
In other words, Nomad plus its autoscaler provides autoscaling capabilities similar to other orchestrators: scaling up for performance under load and scaling down when demand decreases.
Clean up resources
Destroy the infrastructure. Confirm with yes
when prompted.
$ terraform destroy
Next steps
You have successfully deployed and interacted with AI workloads on Nomad. In this tutorial, you covered how to use Nomad for scheduling a real-world application (an AI model and UI) across different node pools.
You learned how to use node pools to segregate workloads onto appropriate nodes, giving jobs control over placement for better efficiency and isolation. You explored volumes and storage considerations, understanding that Nomad uses ephemeral storage by default for allocations, which is fine for stateless or temporary data, while for persistent data you need to integrate host or network volumes.
You used native service discovery to automatically connect services without needing a separate Consul cluster, reducing resource overhead and simplifying edge deployments. You implemented secrets management by injecting secrets (AWS keys, admin password) into Nomad jobs securely using environment variables and Nomad's variables store. In a more secure setup, you might use Vault integration with Nomad to fetch dynamic secrets or use Nomad's ACL policies to tightly control variable access. You used Nomad Actions to safely create admin users without shelling into containers, demonstrating a secure and auditable approach to operational tasks.
You used nomad exec
to enter a running container and create a new model variant. You expanded the cluster with an additional node and deployed a new workload to it without disrupting existing services, highlighting Nomad's flexibility to scale out and schedule new jobs on new capacity. You learned how Nomad's autoscaler helps you automatically scale workloads and clients based on real-time demand.
With these skills, you can deploy complex applications on Nomad, ensuring they run on the right hardware, can be discovered by consumers, and are managed securely. As a next step, you can explore more Nomad integrations like Nomad Pack for packaged job templates, Vault for secrets management, and Consul for secure service-to-service networking.
For more information, check out the following resources:
- Learn more about Nomad's node pools by visiting the Nomad documentation.
- Read more about Nomad Actions by visiting the Nomad documentation.
- Complete the tutorials in the Nomad ACL System Fundamentals collection to configure a Nomad cluster for ACLs, bootstrap the ACL system, author your first policy, and grant a token based on the policy.