Scale node pools to run more AI models

10min
|
Nomad
Terraform

You can configure Ollama to load multiple models if a node has sufficient resource capacity, with the downside of a particularly large and costly instance in your application workload. An alternate approach is to run several instances of Ollama, each on a smaller instance and each with a specific model. Then, you can configure Open WebUI with the Ollama backend service addresses to use multiple backends with a single Open WebUI frontend. This approach allows you to run smaller and less costly instances that you use more efficiently.

This tutorial introduces the granite-4.0-h-tiny LLM to your application. It is a seven billion parameter model designed to handle general instruction-following tasks, and it is optimized to use fewer compute resources than the Granite 3.3 model that you deployed in the previous tutorial.

In this tutorial, you will scale the Nomad cluster’s infrastructure to add a client to the "medium" size node pool. Then you will install the granite-4.0-h-tiny model on the medium node. Finally, you will modify the Open WebUI jobspec to use Nomad’s native service discovery operations, which enables multiple private Ollama backends running different models for your application’s public Open WebUI frontend.

Prerequisites

This tutorial continues from the previous tutorial of this collection, Run a Granite AI workload. Select the version you would like to run.

An Instruqt track is available that allows you to complete this tutorial using a hosted web-based session. The only requirement is a compatible web browser, so you do not need to install additional software on your local machine.

Open the Create the Nomad cluster section of the previous tutorial and click on the Instruqt option. Complete that tutorial to set up the cluster, Ollama, and Open WebUI in the Instruqt track. Then return to this tutorial to continue.

Nomad node pools

Node pools are a way to group similar nodes in a Nomad cluster. In this tutorial, each node is part of a separate node pool grouped by size: small, medium, or large. You can group nodes into pools based on any factor you choose, from instance sizing or instance location, to the presence of certain hardware like Graphical Processing Units (GPUs).

In Nomad’s web UI, the Clients page lists each client node and the node pool it belongs to.

Advantages of using node pools

You already may be familiar with Nomad’s constraint block, which allows you to select nodes for job placement based on client attributes and custom metadata. This feature allows fine-grained control over where your job’s allocation should run but can add extra configuration to your jobspec.

Nomad’s node pools help you place specific allocations on specific nodes while managing node requirements in separate configurations. Without node pools, you must write constraints for every job into the jobspec to ensure proper placement on an appropriate node type. The following example shows the constraints required to place a small job on a public node.

constraint {
  attribute = "${meta.public}"
  value     = "true"
}

constraint {
  attribute = "${node.class}"
  value     = "small"
}

Using constraint blocks at scale presents several challenges:

Every job requires similar constraint logic.
Constraint syntax is more prone to human error.
Changes require updating multiple job files followed by rolling updates.
Adding new node types or modifying existing node requirements is difficult.

Node pools address these limitations by grouping clients into logical sets for appropriate job placement. Instead of writing constraints for every job, you define node pools once and then reference them in job specifications, significantly simplifying job configuration and management.

Node pools in jobspecs

Both the Ollama and Open WebUI job specifications already contain the node_pool attribute, which is set to an appropriate node based on the resource requirements of each application.

Ollama requires a large amount of resources so it was placed on a node in the large node pool, while Open WebUI was placed on a node in the small node pool because it requires much less.

The following examples show the node pool configuration in each jobspec.

job "ollama" {
  type      = "service"
  node_pool = "large"
  # ...
}

In both jobspecs, you only need a single line to define the node pool.

Create a node pool

There are two ways to create a node pool.

Use the nomad node pool apply command from the Nomad CLI and pass in a node pool spec file.
Create a client node using a client specification file that contains the node_pool attribute.

The Terraform configuration used in this tutorial collection uses a client specification file to create node pools. Nomad creates a new node automatically as soon as the first client in that node pool is registered with Nomad.

name = "small-public-aws-client-0"

client {
  enabled = true
  node_pool = "aws-small-public"
  ## …
}

In the above example, Nomad creates a new node pool named aws-small-public and adds the client to it during registration.

The Instruqt track has three node pools with one node each. In the next section, you will deploy a new Granite model to the node in the "medium" node pool.

Create the medium node

To create the medium node pool, update the defaults.auto.tfvars file as below. Save the file.

aws_region                        = "us-east-1"

aws_server_count                  = "1"

aws_small_private_client_count    = "0"
aws_small_public_client_count     = "1"
aws_medium_private_client_count   = "1"
aws_medium_public_client_count    = "0"
aws_large_private_client_count    = "1"
aws_large_public_client_count     = "0"

Apply the Terraform configuration. When prompted, type yes and then press Enter to confirm. This deployment process will take a few minutes to complete.

$ terraform apply

After Terraform completes the infrastructure update, open the Clients page in the Nomad UI and verify that the medium node is present in the client list.

Create the Granite 4.0 jobspec

The jobspec for the Granite 4.0 model is similar to the original Ollama job. The important differences are highlighted below.

job "ollama-granite-4-0" {
  type      = "service"
  node_pool = "medium"

  group "ollama" {
    count = 1
    network {
      port "ollama" {
        to     = 11434
        static = 8080
      }
    }

    task "ollama-task" {
      driver = "docker"

      service {
        name     = "ollama-backend"
        port     = "ollama"
        provider = "nomad"
      }
      config {
        image = "ollama/ollama"
        ports = ["ollama"]
      }

      resources {
        cpu    = 4000
        memory = 3500
      }
    }

    task "download-granite4.0-model" {
      driver = "exec"
      lifecycle {
        hook = "poststart"
      }
      resources {
        cpu    = 100
        memory = 100
      }
      template {
        data        = <<EOH
{{ range nomadService "ollama-backend" }}
OLLAMA_BASE_URL="http://{{ .Address }}:{{ .Port }}"
{{ end }}
EOH
        destination = "local/env.txt"
        env         = true
      }
      config {
        command = "/bin/bash"
        args = [
          "-c",
          "curl -X POST ${OLLAMA_BASE_URL}/api/pull -d '{\"name\": \"hf.co/ibm-granite/granite-4.0-h-tiny-GGUF:Q4_K_M\"}'"
        ]
      }
    }
  }
}

Make a new file, add the contents of the jobspec above, and then save the file with the name ollama-granite-4.nomad.hcl.

Submit the Ollama job to Nomad.

$ nomad job run ollama-granite-4.nomad.hcl

Nomad registers the Ollama instance running the Granite 4.0 model as an instance of the ollama-backend service. When Nomad receives a service discovery request for this service, it will now return both instances of Ollama: one running the Granite 3.2b model and the other running the Granite 4.0 model.

Update the Open WebUI jobspec

The Open WebUI job retrieves the networking addresses for the Ollama backends using Nomad’s native service discovery. Nomad returns the locations of allocations that run a service named ollama-backend.

Open WebUI can handle multiple Ollama backends with the OLLAMA_BASE_URLS environment variable. Open the jobspec, update the configuration as shown in the following example, and then save the file.

job "open-webui" {
  # …
    group "open-webui" {
      # …
      task "open-webui-task" {
        # …
        template {
          data = <<EOH
- OLLAMA_BASE_URL={{ range nomadService "ollama-backend" }}http://{{ .Address }}:{{ .Port }}{{ end }}
+ OLLAMA_BASE_URLS={{ range nomadService "ollama-backend" -}}http://{{ .Address }}:{{ .Port }};{{- end }}
EOH
        }
      }
    }
}

There are two important changes to the jobspec:

The URL variable is now the pluralized version, OLLAMA_BASE_URLS. This modification instructs Open WebUI to expect more than one Ollama address.
The additional semicolon after the port value. Open WebUI expects a semicolon as a delimiter. Now Nomad will iterate over the available instances of the service, print them to the value, and separate them with a semicolon.

Redeploy the Open WebUI job

Submit the Open WebUI job to Nomad.

$ nomad job run openwebui.nomad.hcl

Refresh the Open WebUI, and log in again if necessary.

At the top left, click the model selection dropdown. It now contains the Granite 4.0 model.

In the Nomad web UI, select Clients. Click on the large private client to open the client overview page. This page shows the total resource utilization of the client node.

In a new browser tab, open the Clients page again and click on the medium private client node.

Interact with each model and compare both response quality and response time between the available models. Take note of the resource usage on the client overview page for each node as the model is formulating a response.

When you are ready, continue to the next section to clean up the infrastructure and resources.

Clean up

Before you clean up your infrastructure, you should stop the jobs you are running.

Navigate to the Jobs page, click on the ollama job, and click on the Stop job button on the right side of the page.

Navigate back to the Jobs page and follow the same process to stop the ollama-granite-4-0 and open-webui jobs.

Now that Nomad is not running any jobs, you can clean up your infrastructure and cloud resources.

In the lab, click on the Check button to complete the scenario.

Next steps

In this tutorial, you added a new client node to your Nomad cluster to create a new node pool. Then you used Nomad’s native service discovery operations to enable multiple Ollama backends for a single frontend instance of Open WebUI. Now Nomad will allocate your AI application’s component services on nodes according to their underlying resource requirements and make them available to each other automatically.

Nomad’s native service discovery functions best in simple networking environments, and can be ideal for application development and debugging. For more complex networking scenarios with greater security requirements, we recommend HashiCorp Consul for service discovery. To learn more about how to use Consul’s service discovery features in your Nomad cluster, refer to the following resources:

Migrate a monolith tutorial demonstrates Nomad operations using Consul service discovery at several different levels of application complexity, from monolithic to microservice architectures.
The Consul integration documentation page explains the features and provides links to integration guides.
The consul block documentation page provides a reference for the available attributes of the block.

Nomad is a flexible workload orchestrator that can support many kinds of AI workloads. For example, you can also use Nomad to allocate jobs directly to NVIDIA GPUs. For more information, refer to NVIDIA GPU Device plugin.

If you are a Nomad user who wants to learn more about using large language models in workloads, we recommend the following external resources to continue your learning:

Run a Granite AI workload

Next Collection

Autoscaling