Nomad agents
A Nomad agent is a long-running process that runs on every machine in your Nomad cluster. The behavior of the agent depends on whether it runs in client or server mode. Clients run tasks, while servers manage the cluster.
Server agents take part in the consensus protocol and the gossip protocol. The consensus protocol, powered by Raft, lets the servers perform leader election and state replication. The gossip protocol allows for server clustering and multi-region federation. Because this extra work makes servers more resource intensive than clients, you should run them on dedicated instances.
Client agents use fingerprinting to determine the capabilities and resources of the host machine, as well as which drivers are available. Clients register with servers to provide node information and a heartbeat. Clients run tasks that the servers assign to them. Client nodes make up the majority of the cluster and are very lightweight. They interface with the server nodes and maintain very little state of their own. Each cluster usually has 3 or 5 server agents and potentially thousands of clients.
Run an agent
Start the agent with the nomad agent command.
This command blocks, running forever or until told to quit. The nomad agent command takes a variety of configuration options, but most have sane defaults.
Linux Users
You must run client agents as root, or with sudo, so that cpuset accounting and network namespaces work correctly.
This example starts the agent in development mode, which means the agent runs as both the server and the client. Do not use -dev in a production environment.
$ sudo nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: true
Log Level: INFO
Region: global (DC: dc1)
Server: true
==> Nomad agent started! Log data will stream in below:
[INFO] serf: EventMemberJoin: server-1.node.global 127.0.0.1
[INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
...
The nomad agent command outputs the following important information:
Client: This indicates whether the agent is running as a client. Client nodes fingerprint their host environment, register with servers, and run tasks.
Log Level: This indicates the configured log level. Nomad logs only messages with an equal or higher severity. You may change the log level to increase verbosity for debugging or reduce it to avoid noisy logging.
Region: This is the region and datacenter in which the agent runs. Nomad has first-class support for multi-datacenter and multi-region configurations. Use the -region and -dc flags to set the region and datacenter. The default is the global region in dc1.
Server: This indicates whether the agent is running as a server. Server nodes have the extra burden of participating in the consensus protocol, storing cluster state, and making scheduling decisions.
Stop an agent
By default, any stop signal, such as interrupt or terminate, causes the agent to exit after ensuring its internal state is written to disk as needed. You can configure additional shutdown behaviors by setting leave_on_interrupt or leave_on_terminate, which apply to the respective signals.
For servers, when you set leave_on_interrupt or leave_on_terminate, the servers notify other servers of their intention to leave the cluster, which allows them to leave the consensus peer set. It is especially important that a server node be allowed to leave gracefully so that there is minimal impact on availability as the server leaves the consensus peer set. If a server does not leave gracefully and will not return to service, use the server force-leave command to eject that server from the consensus peer set.
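As a rough sketch of that recovery step, the helper below first lists the servers so you can confirm which one is marked as failed, then ejects it. The function name eject_failed_server is hypothetical, and the node name you pass must be the one that nomad server members reports for the failed server.

```shell
# Hypothetical helper: eject a server that failed and will not return.
# $1 is the node name exactly as shown by `nomad server members`.
eject_failed_server() {
  if [ -z "$1" ]; then
    echo "usage: eject_failed_server <node-name>" >&2
    return 1
  fi
  # Show the current members so the failed server can be confirmed first.
  nomad server members
  # Ask the cluster to eject the failed server.
  nomad server force-leave "$1"
}
```

For example, eject_failed_server server-2.global would eject a server with that (made-up) name.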
For clients, when you set leave_on_interrupt or leave_on_terminate and the client is configured with drain_on_shutdown, the client drains its workloads before shutting down.
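A minimal sketch of these settings for a client follows. The config directory, the file name shutdown.hcl, and the drain values are assumptions chosen for illustration. On a server you would keep only the two leave_on_* lines.

```shell
# Assumed config directory for this sketch; override NOMAD_CONF_DIR as needed.
conf_dir="${NOMAD_CONF_DIR:-./nomad.d}"
mkdir -p "$conf_dir"

# Write a hypothetical config file that enables graceful leave and draining.
cat > "$conf_dir/shutdown.hcl" <<'EOF'
# Leave the cluster gracefully on SIGINT and SIGTERM.
leave_on_interrupt = true
leave_on_terminate = true

client {
  enabled = true

  # Drain workloads before the client shuts down (illustrative values).
  drain_on_shutdown {
    deadline           = "1h"
    force              = false
    ignore_system_jobs = false
  }
}
EOF
```

Start the agent with nomad agent -config="$conf_dir" to load the file.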
Signal handling
In addition to the optional handling of interrupt (SIGINT) and terminate (SIGTERM) signals described in the Stop an agent section, Nomad supports special behavior for several other signals useful for debugging.
SIGHUP causes Nomad to reload its configuration.
SIGUSR1 causes Nomad to print its metrics without stopping the agent.
SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGSTKFLT, SIGEMT, or SIGSYS signals are handled by the Go runtime. These cause the Nomad agent to exit and print its stack trace.
When using the official HashiCorp packages on Linux, you can send these signals via systemctl.
This example outputs the Nomad agent's metrics.
$ sudo systemctl kill nomad -s SIGUSR1
You can then read those metrics in the service logs:
$ journalctl -u nomad
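Because several of the signals listed above make the agent exit, a small wrapper can guard against sending one of those by mistake. This is only a sketch: the function name nomad_signal is hypothetical, and it assumes the systemd unit is named nomad, as in the example above.

```shell
# Hypothetical helper: send only the non-fatal signals to the nomad service.
nomad_signal() {
  case "$1" in
    SIGHUP|SIGUSR1)
      # SIGHUP reloads the configuration; SIGUSR1 prints metrics.
      sudo systemctl kill nomad -s "$1"
      ;;
    *)
      echo "refusing to send $1: not a reload or metrics signal" >&2
      return 1
      ;;
  esac
}
```

For example, nomad_signal SIGHUP asks the agent to reload its configuration, and journalctl -u nomad shows the resulting log lines.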
Lifecycle
Every agent in the Nomad cluster goes through a lifecycle. Understanding this lifecycle is useful for building a mental model of an agent's interactions with a cluster and how the cluster treats a node.
When a client agent starts, it fingerprints the host machine to identify its attributes, capabilities, and task drivers. The client then reports this information to the servers during an initial registration. You provide the addresses of known servers to the agent via configuration, potentially using DNS for resolution. Use Consul to avoid hard coding addresses and instead resolve them on demand.
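As an illustration of providing server addresses through configuration, the sketch below writes a client config that points at a single DNS name. The directory, the file name client.hcl, and the host name are placeholders, not values from a real deployment. Port 4647 is Nomad's default RPC port.

```shell
# Assumed config directory for this sketch; override NOMAD_CONF_DIR as needed.
conf_dir="${NOMAD_CONF_DIR:-./nomad.d}"
mkdir -p "$conf_dir"

# Write a hypothetical client config that lists known servers.
cat > "$conf_dir/client.hcl" <<'EOF'
client {
  enabled = true

  # Placeholder DNS name for the servers.
  servers = ["nomad-server.example.internal:4647"]
}
EOF
```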
While a client is running, it sends heartbeats to servers to maintain liveness. If the heartbeats fail, the servers assume the client node has failed. The servers then stop assigning new tasks to that node and migrate its existing tasks. It is impossible to distinguish between a network failure and an agent crash, so Nomad handles both cases in the same way. Once the network recovers or a crashed agent restarts, Nomad updates the node status and resumes normal operation.
To prevent an accumulation of nodes in a terminal state, Nomad does periodic garbage collection of nodes. By default, if a node is in a failed or 'down' state for over 24 hours, Nomad garbage collects that node.
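If the default window does not suit your cluster, the threshold can be set on the servers. The sketch below assumes the node_gc_threshold parameter of the server block is the relevant setting and simply restates the 24-hour default. The directory and the file name gc.hcl are placeholders.

```shell
# Assumed config directory for this sketch; override NOMAD_CONF_DIR as needed.
conf_dir="${NOMAD_CONF_DIR:-./nomad.d}"
mkdir -p "$conf_dir"

# Write a hypothetical server config that sets the node GC threshold.
cat > "$conf_dir/gc.hcl" <<'EOF'
server {
  enabled = true

  # How long a node stays in a terminal state before garbage collection.
  node_gc_threshold = "24h"
}
EOF
```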
Servers are slightly more complex since they perform additional functions. They participate in a gossip protocol both to cluster within a region and to support multi-region configurations. When a server starts, it does not know the address of other servers in the cluster. To discover its peers, it must join the cluster. You do this with the server join command or by providing the proper configuration on start. Once a node joins, this information is gossiped to the entire cluster, meaning all nodes will eventually be aware of each other.
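Both ways of joining can be sketched as follows. The helper name join_servers, the file name server.hcl, the peer addresses, and the bootstrap_expect value are all assumptions made for illustration.

```shell
# Option 1: join a running server to known peers by hand.
# Hypothetical helper; pass one or more peer addresses.
join_servers() {
  nomad server join "$@"
}

# Option 2: have the server join on start through configuration.
conf_dir="${NOMAD_CONF_DIR:-./nomad.d}"
mkdir -p "$conf_dir"
cat > "$conf_dir/server.hcl" <<'EOF'
server {
  enabled          = true
  bootstrap_expect = 3

  # Placeholder peer addresses for this sketch.
  server_join {
    retry_join = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
  }
}
EOF
```

For example, join_servers 10.0.0.10:4648 joins through one peer, after which gossip spreads the membership to the rest.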
When a server leaves, it announces its intent to do so, and the cluster marks that node as having left. Once the server has left, replication to it stops, and it is removed from the consensus peer set. If the server has instead failed, the cluster keeps attempting replication to it so that it can recover from a software or network failure.
Permissions
Nomad servers and Nomad clients have different requirements for permissions.
Run Nomad servers with the lowest possible permissions. The servers need access to their own data directory and the ability to bind to their ports. You should create a nomad user with the minimal set of required privileges.
Run Nomad clients as root due to the OS isolation mechanisms that require root privileges. While it is possible to run Nomad as an unprivileged user, you must do careful testing to ensure the task drivers and features you use function as expected. The Nomad client's data directory should be owned by root with filesystem permissions set to 0700.
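The two setups can be sketched as a pair of helpers. The function names are hypothetical, the data directory path is whatever your configuration uses, and the useradd options shown are one reasonable choice rather than a required set.

```shell
# Hypothetical helper for server hosts: create an unprivileged nomad user
# and give it ownership of the server data directory passed as $1.
prepare_server_user() {
  sudo useradd --system --home-dir /etc/nomad.d --shell /bin/false nomad
  sudo mkdir -p "$1"
  sudo chown nomad:nomad "$1"
}

# Hypothetical helper for client hosts: the data directory passed as $1
# is owned by root and closed to all other users.
prepare_client_data_dir() {
  sudo mkdir -p "$1"
  sudo chown root:root "$1"
  sudo chmod 0700 "$1"
}
```

For example, prepare_client_data_dir /opt/nomad/data prepares a client whose data directory is set to that path.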