Nomad
Nvidia GPU Device Plugin
Name: nvidia-gpu
The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia plugin is built into Nomad and does not need to be downloaded separately.
Fingerprinted Attributes
| Attribute | Unit | 
|---|---|
| memory | MiB | 
| power | W (Watt) | 
| bar1 | MiB | 
| driver_version | string | 
| cores_clock | MHz | 
| memory_clock | MHz | 
| pci_bandwidth | MB/s | 
| display_state | string | 
| persistence_mode | string | 
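Fingerprinted attributes can be used to filter devices when a job requests a GPU. The fragment below is a minimal sketch, assuming the `constraint` syntax of the job specification's `device` block and the `${device.attr.<name>}` interpolation form. The 4 GiB threshold is an illustrative value, not a recommendation.

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs whose fingerprinted "memory" attribute is at
    # least 4 GiB (threshold chosen purely for illustration).
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```

Unlike an affinity, a constraint is a hard requirement: no matching device means no placement.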
Runtime Environment
The nvidia-gpu device plugin exposes the following environment variables:
- NVIDIA_VISIBLE_DEVICES - List of Nvidia GPU IDs available to the task.
Additional Task Configurations
Additional environment variables can be set by the task to influence the runtime environment. See Nvidia's documentation.
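As a sketch of how a task might set one of these variables, the fragment below uses the task-level `env` block to set `NVIDIA_DRIVER_CAPABILITIES`, a variable read by Nvidia's container runtime rather than by Nomad. The value `compute,utility` is an assumption chosen for illustration. Consult Nvidia's documentation for the variables and values your runtime version supports.

```hcl
task "smi" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:9.0-base"
    command = "nvidia-smi"
  }

  # Interpreted by Nvidia's container runtime, not by Nomad.
  # "compute,utility" is an illustrative value.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```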
Installation Requirements
To use the nvidia-gpu device plugin, the following prerequisites must be met:
- GNU/Linux x86_64 with kernel version > 3.10
- NVIDIA GPU with Architecture > Fermi (2.1)
- NVIDIA drivers >= 340.29 with binary nvidia-smi
Docker Driver Requirements
To use the Nvidia device plugin with the Docker driver, follow
the installation instructions for
nvidia-docker.
Plugin Configuration
plugin "nvidia-gpu" {
  ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
The nvidia-gpu device plugin supports the following configuration in the agent
config:
- ignored_gpu_ids (array<string>: []) - Specifies the set of GPU UUIDs that should be ignored when fingerprinting.
- fingerprint_period (string: "1m") - Specifies the interval at which the plugin fingerprints for device changes.
Restrictions
The Nvidia integration only works with task drivers that natively integrate with Nvidia's container runtime library.
Nomad has tested support with the docker driver and plans to
bring support to the built-in exec and java
drivers. Support for lxc should be possible by installing the
Nvidia hook but is not
tested or documented by Nomad.
Examples
Inspect a node with a GPU:
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec
Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered
Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB
Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB
Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
Allocations
No allocations placed
Display detailed statistics on a node with a GPU:
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec
Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered
Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB
Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB
Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB
Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
// ...TRUNCATED...
Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
Allocations
No allocations placed
Run the following example job to verify that the GPU was mounted in the container:
job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"
  group "smi" {
    task "smi" {
      driver = "docker"
      config {
        image = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }
      resources {
        device "nvidia/gpu" {
          count = 1
          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"
$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = gpu-test
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago
Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB
Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A
Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client
$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+