Nomad
AI workloads on Nomad - Overview
Nomad supports application workloads that run artificial intelligence (AI) operations using large language models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and IBM’s Granite. These popular applications help users analyze data and information, as well as generate text, code, images, and video. Typically, these web-based applications provide a text interface where the user writes a prompt and submits it to the model for analysis and processing.
Nomad supports open source AI tools such as Ollama and Open WebUI so that you can run LLMs on local machines or at scale in the cloud.
In this tutorial, you will learn how to run AI workloads on Nomad by reviewing their required components and the job specifications that deploy them. In the subsequent tutorials in this collection, you will run these jobs and customize how and where Nomad schedules them.
Why run a private LLM?
Running an LLM privately, whether on your local machine, in a private datacenter, or on a private cloud, has several benefits that make it a worthwhile alternative to interacting with subscription-based LLM web applications.
Data privacy
When you run a private deployment of an LLM, you maintain control of the data used to train and interact with the model. As a result, there is a lower risk of your data becoming compromised because it is not stored remotely by another company. Moreover, your sensitive data is not used to train the model without your explicit consent, which is a common concern for smaller enterprises that want to leverage the benefits of AI applications.
Runs in air-gapped networks
When you run a private deployment of an LLM, you can take advantage of generative AI in an environment that is not connected to the internet. LLMs are self-contained in the sense that they do not need to reach out to the internet to respond to prompts. Running AI applications on an intranet disconnected from the internet offers additional security and privacy for sensitive data.
Cost controls
Running a private deployment of an LLM can help you control ongoing costs. Most hosted and paid web-based LLM applications charge a per user fee. Depending on the number of users and the types of generative tasks the AI performs, local and private deployments can help you save on expenditures.
Why run an LLM on Nomad?
Unless you have a particularly powerful computer, the resource requirements for LLM applications may prevent you from running them locally. For example, Ollama recommends 8GB of RAM for models with 7 billion (B) parameters, 16GB for those with 13B parameters, and 32GB for those with 33B parameters.
Cloud service providers are an easy way to access relatively low-cost infrastructure options. For example, at the time of writing, an x8g.large EC2 instance on AWS with 32GB of RAM costs about 20 cents an hour to run, or about $150 per month. A similar 16GB instance costs about $75 per month, and an 8GB instance costs about $40 per month.
Nomad supports LLM workloads with features like multi-region application deployments, reliable and scalable client nodes, and NVIDIA GPU support. Once a Nomad cluster is running, you can add clients of any size and target workloads to run on them. And by using Terraform and Nomad together, you can programmatically update your infrastructure according to a workload's needs.
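The jobspecs in this tutorial do not request GPUs, but as an illustration, the following is a minimal sketch of how a task could request an NVIDIA GPU, assuming the client node runs Nomad's NVIDIA device plugin:

  # Illustrative only: request one NVIDIA GPU for a task.
  # Assumes the NVIDIA device plugin is installed on the client node;
  # the companion jobspecs in this collection do not include this block.
  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }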
You can create these EC2 instances manually with the AWS GUI or the aws CLI tool, or automatically with Terraform. Once the infrastructure is set up, you can deploy the components by logging in to the instance and installing each one manually, or you can use Nomad and a job specification (jobspec) to codify and deploy the entire scenario automatically.
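As a rough Terraform sketch of one such instance, you could define an aws_instance resource like the following. The AMI variable and resource names here are hypothetical placeholders, not values from this tutorial's companion repository:

  # Hypothetical sketch: provisioning one Nomad client instance with Terraform.
  resource "aws_instance" "nomad_client_large" {
    ami           = var.arm64_ami_id # placeholder for an Arm64 (Graviton) AMI
    instance_type = "x8g.large"      # 32GB of RAM, matching the sizing above

    tags = {
      Name = "nomad-client-large"
    }
  }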
AI workload components
This tutorial schedules jobs for three components of an AI workload:
- A large language model; in this case, Granite
- Ollama, an open source framework for building and running LLMs
- Open WebUI, an open source web-based chat interface for LLM applications
The large language model
An LLM is made up of many files, including model weights and architectures, training and associated metadata, configurations, and other utility files. This data is compressed and packaged into a single file for ease of storage and transmission.
In this tutorial collection, you will use IBM Granite as the LLM. IBM offers several versions of Granite for different use cases, including granite3.2-vision for understanding visual documents like charts, diagrams, and visualizations, and granite-code for understanding and generating source code. For more information, refer to the IBM Granite documentation.
Ollama
Some components of the LLM, like configuration and training data files, are human readable. Others, like model weights and architectures, are not. In this tutorial, Ollama is the component that provides the framework to load, run, and interact with the LLMs. For more information, refer to the Ollama documentation.
Open WebUI
Open WebUI communicates with Ollama to provide a self-hosted browser-based interface for interacting with the LLM. It includes additional quality-of-life features including chat history, seamless switching between models, and Retrieval Augmented Generation (RAG) integration. For more information, refer to the Open WebUI documentation.
In this tutorial, Open WebUI is the component that acts as the interface between the user and Ollama.
Jobspecs for AI workloads
This tutorial uses one jobspec to deploy Ollama and download the Granite model, and another to deploy Open WebUI. Each jobspec is outlined below, and both are available in this tutorial's companion code repository on GitHub in ollama.nomad.hcl and openwebui.nomad.hcl.
The ollama job consists of two tasks:
- ollama-task runs Ollama itself
- download-granite3.3-model downloads the LLM model that Ollama uses
Ollama task
ollama.nomad.hcl
job "ollama" {
type = "service"
node_pool = "large"
group "ollama" {
count = 1
network {
port "ollama" {
to = 11434
static = 8080
}
}
task "ollama-task" {
driver = "docker"
service {
name = "ollama-backend"
port = "ollama"
provider = "nomad"
}
config {
image = "ollama/ollama"
ports = ["ollama"]
}
resources {
cpu = 9100
memory = 15000
}
}
# ...
}
}
This task runs the ollama/ollama Docker image, binds the Ollama port 11434 to a static port of 8080 on the client node, and creates a Nomad service named ollama-backend that points to port 8080. It requests 9.1GHz of CPU and 15GB of memory.
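The jobspec does not define a health check for the ollama-backend service. As an optional addition, the following sketch adds an HTTP check similar to the one used for Open WebUI later in this tutorial, assuming Ollama's root endpoint responds once the server is ready:

      # Optional sketch: mark the service healthy only when Ollama answers over HTTP.
      service {
        name     = "ollama-backend"
        port     = "ollama"
        provider = "nomad"

        check {
          type     = "http"
          name     = "ollama-health"
          path     = "/"
          interval = "20s"
          timeout  = "5s"
        }
      }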
Download model task
ollama.nomad.hcl
job "ollama" {
# ...
group "ollama" {
# ...
task "download-granite3.3-model" {
driver = "exec"
lifecycle {
hook = "poststart"
}
resources {
cpu = 100
memory = 100
}
template {
data = <<EOH
{{ range nomadService "ollama-backend" }}
OLLAMA_BASE_URL="http://{{ .Address }}:{{ .Port }}"
{{ end }}
EOH
destination = "local/env.txt"
env = true
}
config {
command = "/bin/bash"
args = [
"-c",
"curl -X POST ${OLLAMA_BASE_URL}/api/pull -d '{\"name\": \"granite3.3:2b\"}'"
]
}
}
}
}
This task runs after the Ollama task has started, as defined by the poststart hook in the lifecycle block. It sends a cURL request to the running Ollama service to pull the granite3.3:2b model from the Ollama library. The Ollama URL is retrieved from Nomad with the nomadService function, and Nomad provides it to the task as an environment variable.
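The companion jobspec does not include retry logic for this request, and the poststart task may fire before Ollama finishes starting up. If you want the pull to tolerate that, one option, sketched below under the assumption that your Nomad version supports task-level restart blocks, is to let Nomad retry the task a few times:

      task "download-granite3.3-model" {
        # ...

        # Optional sketch: retry the pull if Ollama is not yet accepting requests.
        restart {
          attempts = 3
          delay    = "15s"
          interval = "5m"
          mode     = "fail"
        }
      }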
Running Open WebUI
The open-webui job schedules the Open WebUI component.
openwebui.nomad.hcl
job "open-webui" {
type = "service"
node_pool = "small"
group "open-webui" {
constraint {
attribute = "${meta.isPublic}"
operator = "="
value = "true"
}
count = 1
network {
port "open-webui" {
to = 8080
static = 80
}
}
task "open-webui-task" {
driver = "docker"
service {
name = "open-webui-svc"
port = "open-webui"
provider = "nomad"
check {
type = "http"
name = "open-webui-health"
path = "/"
interval = "20s"
timeout = "5s"
}
}
config {
image = "ghcr.io/open-webui/open-webui:main"
ports = ["open-webui"]
}
resources {
cpu = 4000
memory = 3500
}
template {
data = <<EOH
OLLAMA_BASE_URL={{ range nomadService "ollama-backend" }}http://{{ .Address }}:{{ .Port }}{{ end }}
ENV="dev"
DEFAULT_MODELS="granite-3.3"
OFFLINE_MODE="True"
ENABLE_SIGNUP="False"
ENABLE_OPENAI_API="False"
STORAGE_PROVIDER="s3"
{{ with nomadVar "nomad/jobs/open-webui" }}
S3_ACCESS_KEY_ID="{{ .aws_access_key_id }}"
S3_SECRET_ACCESS_KEY="{{ .aws_access_secret_key }}"
S3_ENDPOINT_URL="https://s3.{{ .aws_default_region }}.amazonaws.com"
S3_REGION_NAME="{{ .aws_default_region }}"
S3_BUCKET_NAME="{{ .openwebui_bucket }}"
{{ end }}
EOH
destination = "local/env.txt"
env = true
}
template {
data = <<EOH
# SQL commands to insert administrator user
# ...
EOH
destination = "local/create-admin-user.sql"
env = false
}
action "create-admin-user" {
command = "/bin/bash"
args = [
"-c",
"apt-get update && apt-get install -y sqlite3 && echo 'Running SQL insert commands...' && sqlite3 /app/backend/data/webui.db < /local/create-admin-user.sql && echo 'Finished running SQL commands'"
]
}
}
}
}
This task runs Open WebUI on a node that is publicly accessible, as defined in the group's constraint block with the meta.isPublic attribute. It runs the ghcr.io/open-webui/open-webui:main Docker image and sets several environment variables for Open WebUI.
Importantly, it uses the nomadService function to retrieve and set the Ollama URL. It also retrieves the S3 bucket credentials from Nomad Variables with the nomadVar function. These credentials are written to Nomad Variables by Terraform during the apply phase.
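As a rough illustration of that Terraform step, a nomad_variable resource can write these values to the path the template reads from. The referenced AWS resources and variable names below are hypothetical, not the companion repository's actual code:

  # Hypothetical sketch: writing the S3 credentials that the open-webui template reads.
  resource "nomad_variable" "open_webui" {
    path = "nomad/jobs/open-webui"

    items = {
      aws_access_key_id     = aws_iam_access_key.openwebui.id # hypothetical IAM key
      aws_access_secret_key = aws_iam_access_key.openwebui.secret
      aws_default_region    = var.aws_region
      openwebui_bucket      = aws_s3_bucket.openwebui.bucket # hypothetical bucket
    }
  }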
Finally, this task creates a Nomad Action that inserts an administrator user into the Open WebUI database when you run it from within Nomad, for example with the nomad action CLI command or from the Nomad web UI. This step is required because the user sign-up form in Open WebUI is disabled by the ENABLE_SIGNUP="False" environment variable for added security.
Next steps
In this tutorial, you learned about the reasons to run an AI workload on Nomad and reviewed the job specifications for the required components.
In the next tutorial, you will create a Nomad cluster on AWS with Terraform and then run the LLM workload on Nomad by submitting the jobspec.