Raft telemetry
Raft telemetry provides information on Vault integrated storage.
Default metrics
vault.raft.apply
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of transactions in the configured interval |
The vault.raft.apply metric is generally a good indicator of the write load
on your raft internal storage.
vault.raft.barrier
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times the node started the barrier |
A node starts the barrier by issuing a blocking call when it wants to ensure that all pending operations that need to be applied to the finite state machine are properly queued.
vault.raft.candidate.electSelf
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required for a node to send a vote request to a peer |
vault.raft.commitNumLogs
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of logs processed for application to the finite state machine in a single batch |
vault.raft.commitTime
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to commit a new entry to the raft log on the leader node |
vault.raft.compactLogs
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to trim unnecessary logs |
vault.raft.fsm.apply
| Metric type | Value | Description |
|---|---|---|
| summary | number | Number of logs committed by the finite state machine since the last interval |
vault.raft.fsm.applyBatch
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required by the finite state machine to apply the most recent batch of logs |
vault.raft.fsm.applyBatchNum
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of logs applied in the most recent batch |
vault.raft.fsm.enqueue
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to queue up a batch of logs for the finite state machine to apply |
vault.raft.fsm.restore
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required by the finite state machine to complete a restore operation from a snapshot |
vault.raft.fsm.snapshot
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required by the finite state machine to record state information for the current snapshot |
vault.raft.fsm.store_config
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to store the most recent raft configuration |
vault.raft.get
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to retrieve an entry from underlying storage |
vault.raft.list
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to retrieve a list of keys from underlying storage |
vault.raft.peers
| Metric type | Value | Description |
|---|---|---|
| gauge | number | The number of peers in the raft cluster configuration |
vault.raft.restore
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times that the node performed a restore operation |
In the context of raft storage, a restore operation refers to the process where raft consumes an external snapshot to restore its state.
vault.raft.restoreUserSnapshot
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to restore the finite state machine from a user snapshot |
vault.raft.rpc.appendEntries
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to process a remote appendEntries call from a node |
vault.raft.rpc.appendEntries.processLogs
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to completely process the outstanding logs for the given node |
vault.raft.rpc.appendEntries.storeLogs
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to record any outstanding logs since the last request to append entries for the given node |
vault.raft.rpc.installSnapshot
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to process an installSnapshot RPC call |
Only nodes currently in the follower state report
vault.raft.rpc.installSnapshot metrics.
vault.raft.rpc.processHeartbeat
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to process a heartbeat request |
vault.raft.rpc.requestVote
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required to complete a requestVote call |
vault.raft.snapshot.create
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to capture a new snapshot |
vault.raft.snapshot.persist
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to record snapshot meta information to disk while taking snapshots |
vault.raft.snapshot.takeSnapshot
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Total time required to create and persist the current snapshot |
In most cases, vault.raft.snapshot.takeSnapshot is approximately equal to
vault.raft.snapshot.create + vault.raft.snapshot.persist.
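The relationship above can be checked against collected timer values. The following is a minimal sketch, assuming you have already read the three timings (in milliseconds) from your metrics sink. The function names and the 10% tolerance are illustrative assumptions, not part of Vault.

```python
def snapshot_overhead_ms(take_ms: float, create_ms: float, persist_ms: float) -> float:
    """Return the time in takeSnapshot not explained by create + persist."""
    return take_ms - (create_ms + persist_ms)


def is_consistent(take_ms: float, create_ms: float, persist_ms: float,
                  tolerance: float = 0.10) -> bool:
    """Check that takeSnapshot is within `tolerance` (hypothetical default)
    of create + persist, expressed as a fraction of the expected sum."""
    expected = create_ms + persist_ms
    if expected == 0:
        return take_ms == 0
    return abs(take_ms - expected) / expected <= tolerance
```

A large, persistent gap between the total and the sum of its parts may be worth investigating, but the threshold you choose depends on your environment.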
vault.raft.state.candidate
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times the raft server initiated an election |
vault.raft.state.follower
| Metric type | Value | Description |
|---|---|---|
| summary | number | Number of times in the configured interval that the raft server became a follower |
Nodes transition to follower state under the following conditions:
- when the node joins the cluster
- when a leader is elected, but the node was not elected leader
vault.raft.state.leader
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times the raft server became a leader |
vault.raft.transition.heartbeat_timeout
| Metric type | Value | Description |
|---|---|---|
| summary | number | Number of times that the node transitioned to candidate state after not receiving a heartbeat message from the last known leader |
vault.raft.transition.leader_lease_timeout
| Metric type | Value | Description |
|---|---|---|
| counter | number | The number of times the leader could not contact a quorum of nodes and therefore stepped down |
vault.raft.verify_leader
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times in the configured interval that the node confirmed it is still the leader |
Autopilot metrics
Note
Autopilot only runs on the active node, so autopilot metrics are only captured for the current active node.
vault.autopilot.failure_tolerance
| Metric type | Value | Description |
|---|---|---|
| gauge | nodes | The number of healthy nodes in excess of quorum |
The failure tolerance indicates how many currently healthy nodes can fail without losing quorum.
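As a rough illustration of "healthy nodes in excess of quorum", the sketch below computes the value for a simple cluster. It assumes every node is a voter and that quorum is a strict majority of voters. The function names are hypothetical, and the calculation Autopilot actually performs may account for factors not modeled here.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of the voting nodes."""
    return voters // 2 + 1


def failure_tolerance(voters: int, healthy: int) -> int:
    """Number of healthy nodes that can fail before quorum is lost.

    Clamped at zero: a cluster already below quorum tolerates no failures.
    """
    return max(healthy - quorum_size(voters), 0)
```

For example, a five-node cluster with all nodes healthy has a quorum of three and can tolerate two failures. If one node becomes unhealthy, the tolerance drops to one.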
vault.autopilot.healthy
| Metric type | Value | Description |
|---|---|---|
| gauge | boolean | Indicates whether all nodes are healthy |
- A value of `1` on the gauge means that Autopilot deems all nodes healthy.
- A value of `0` on the gauge means that Autopilot deems at least 1 node unhealthy.
vault.autopilot.node.healthy
| Metric type | Value | Description |
|---|---|---|
| gauge | boolean | Indicates whether the active node is healthy |
- A value of `1` on the gauge means that Autopilot deems the node indicated by `node_id` healthy.
- A value of `0` on the gauge means that Autopilot cannot communicate with the node indicated by `node_id`, or deems the node unhealthy.
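To show how the per-node gauge might be interpreted, the sketch below takes a mapping from node ID to the latest gauge value and returns the nodes that need attention. The input shape and the function name are assumptions for illustration. How you obtain the samples depends on your telemetry sink.

```python
from typing import Dict, List


def unhealthy_nodes(samples: Dict[str, int]) -> List[str]:
    """Return the node IDs whose latest gauge value is 0.

    `samples` maps a node_id label to the most recent
    vault.autopilot.node.healthy value (1 = healthy, 0 = unhealthy
    or unreachable).
    """
    return sorted(node_id for node_id, value in samples.items() if value == 0)
```

An empty result corresponds to every sampled node reporting `1`.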
Leadership change metrics
Leadership change metrics indicate the overall performance of the integrated storage on raft servers and the network connection between raft nodes.
vault.raft.leader.dispatchLog
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required for the leader node to write a log entry to disk |
vault.raft.leader.dispatchNumLogs
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of logs committed to disk in the most recent batch |
vault.raft.leader.lastContact
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time since the leader was last able to contact the follower nodes when checking its leader lease |
Raft replication metrics
vault.raft.replication.appendEntries.log
| Metric type | Value | Description |
|---|---|---|
| summary | number | Number of logs replicated to a node to establish parity with leader logs |
vault.raft.replication.appendEntries.rpc
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to replicate leader node log entries to all follower nodes with appendEntries |
vault.raft.replication.heartbeat
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to invoke appendEntries on a peer so the peer does not time out |
vault.raft.replication.installSnapshot
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to process an installSnapshot RPC call |
Only nodes currently in the follower state report
vault.raft.replication.installSnapshot metrics.
Storage metrics
vault.raft_storage.bolt.cursor.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of cursors created in the Bolt database |
vault.raft_storage.bolt.freelist.allocated_bytes
| Metric type | Value | Description |
|---|---|---|
| gauge | bytes | Total space allocated for the freelist for the Bolt database |
vault.raft_storage.bolt.freelist.free_pages
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of free pages in the freelist for the Bolt database |
vault.raft_storage.bolt.freelist.pending_pages
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of pending pages in the freelist for the Bolt database |
vault.raft_storage.bolt.freelist.used_bytes
| Metric type | Value | Description |
|---|---|---|
| gauge | bytes | Total space used by the freelist for the Bolt database |
vault.raft_storage.bolt.node.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of node allocations for the Bolt database |
vault.raft_storage.bolt.node.dereferences
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Total number of node dereferences by the Bolt database |
vault.raft_storage.bolt.page.bytes_allocated
| Metric type | Value | Description |
|---|---|---|
| gauge | bytes | Total space allocated to the Bolt database |
vault.raft_storage.bolt.page.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of page allocations in the Bolt database |
vault.raft_storage.bolt.rebalance.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of node rebalances performed by the Bolt database |
vault.raft_storage.bolt.rebalance.time
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Time required by the Bolt database to rebalance nodes |
vault.raft_storage.bolt.spill.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of nodes spilled by the Bolt database |
vault.raft_storage.bolt.spill.time
| Metric type | Value | Description |
|---|---|---|
| summary | ms | Total time spent spilling by the Bolt database |
vault.raft_storage.bolt.split.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of nodes split by the Bolt database |
vault.raft_storage.bolt.transaction.currently_open_read_transactions
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of in-process read transactions for the Bolt database |
vault.raft_storage.bolt.transaction.started_read_transactions
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of read transactions started by the Bolt database |
vault.raft_storage.bolt.write.count
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of writes performed by the Bolt database |
vault.raft_storage.bolt.write.time
| Metric type | Value | Description |
|---|---|---|
| counter | ms | Total cumulative time the Bolt database has spent writing to disk. |
vault.raft_storage.follower.applied_index_delta
| Metric type | Value | Description |
|---|---|---|
| gauge | number | The difference between the index applied by the leader and the index applied by the follower as reported by echoes |
vault.raft_storage.follower.last_heartbeat_ms
| Metric type | Value | Description |
|---|---|---|
| gauge | ms | Time since the follower last received a heartbeat request |
vault.raft_storage.stats.applied_index
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Highest index of raft log last applied to the finite state machine or added to fsm_pending queue |
vault.raft_storage.stats.commit_index
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Index of the last raft log committed to disk on the node |
vault.raft_storage.stats.fsm_pending
| Metric type | Value | Description |
|---|---|---|
| gauge | number | Number of raft logs queued by the node for the finite state machine to apply |
vault.raft-storage.delete
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to insert a log entry that deletes a path |
vault.raft-storage.entry_size
| Metric type | Value | Description |
|---|---|---|
| summary | bytes | The total size of a raft entry during log application |
vault.raft-storage.get
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to retrieve a value for the given path from the finite state machine |
vault.raft-storage.list
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to list all entries under the prefix from the finite state machine |
vault.raft-storage.put
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to insert a log entry that persists a path |
vault.raft-storage.transaction
| Metric type | Value | Description |
|---|---|---|
| timer | ms | Time required to insert operations into a single log |
Thread saturation metrics
vault.raft.thread.main.saturation
| Metric type | Value | Description |
|---|---|---|
| gauge | percentage | Approximate proportion of time the main raft goroutine could not accept new work |
High saturation of the raft goroutines can increase latency in the rest of the system and cause cluster instability.
vault.raft.thread.fsm.saturation
| Metric type | Value | Description |
|---|---|---|
| gauge | percentage | Approximate proportion of time the raft FSM goroutine could not accept new work |
High saturation of the raft goroutines can increase latency in the rest of the system and cause cluster instability.
Write-ahead logging (WAL) metrics
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of log entries that have been truncated from the head. |
Counts the number of log entries truncated from the head (i.e. the oldest entries).
If you track the rate of change in head truncations over time, individual truncate calls appear as spikes.
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of log entries that have been truncated from the tail |
Counts the number of log entries truncated from the tail (i.e. the newest entries).
If you track the rate of change in tail truncations over time, individual truncate calls appear as spikes.
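The rate of change for either truncation counter can be derived from successive samples. The sketch below assumes the samples are cumulative counter values paired with timestamps in seconds, as a scraping system might record them. The function name and the reset handling are illustrative assumptions.

```python
from typing import List, Tuple


def counter_rates(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (timestamp_seconds, counter_value) samples, sorted by time,
    into (timestamp_seconds, entries_per_second) pairs."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            delta = v1  # assume the counter reset to zero, e.g. after a restart
        rates.append((t1, delta / dt))
    return rates
```

With samples such as `[(0, 100), (10, 100), (20, 600), (30, 600)]`, the rate is zero everywhere except the interval ending at `t=20`, where a single truncate call shows up as a spike.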
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of calls to GetLog() |
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of entries written |
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of bytes of log entries read from segments before decoding. |
The log-entry-bytes-read counter is technically an overestimate because it
includes bytes from headers, index entries, and secondary reads for entries
too large to fit in buffers.
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of bytes of log entry after encoding with Codec. |
The log-entry-bytes-written counter is technically an overestimate because it
includes bytes from headers and index entries.
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of calls to StableStore.Get() or GetUint64() |
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of calls to StableStore.Set() or SetUint64() |
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of calls to StoreLog() |
Counts the number of entry batches appended to the log with calls to StoreLog().
| Metric type | Value | Description |
|---|---|---|
| counter | number | Number of times Vault moves to a new segment file |
| Metric type | Value | Description |
|---|---|---|
| gauge | seconds | Number of seconds between segment creation and seal. |
The last-segment-age-seconds gauge shows the number of seconds between when a
segment is created and when it is sealed. The gauge resets each time Vault
rotates a segment and provides a rough estimate of how quickly writes are
filling the disk.
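One way to turn the segment age into a write-rate estimate is to divide the segment size by the age. The sketch below assumes you know the configured segment size for your deployment. The function name is hypothetical, and the 64 MiB value used in the test is only an example input.

```python
def estimated_write_rate(segment_size_bytes: int, last_segment_age_seconds: float) -> float:
    """Approximate bytes written per second, assuming the last segment was
    filled to `segment_size_bytes` over `last_segment_age_seconds`."""
    if last_segment_age_seconds <= 0:
        raise ValueError("segment age must be positive")
    return segment_size_bytes / last_segment_age_seconds
```

The result is only a rough estimate because the gauge reflects the most recently sealed segment and resets on each rotation.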