Telemetry Metrics Reference
The following reference provides detailed information about key metrics available from Vault telemetry which you can actively monitor to gauge Vault server health and performance.
Note
This is not a comprehensive collection of all telemetry metrics, which can be found in the Telemetry documentation. This reference includes critical metrics for monitoring as identified by Vault operators in the field, and is intended as a helpful starting point for identifying key metrics when building out your own monitoring solutions.
Table of contents
- Vault Operational Metrics
- Seal status Consul health check
- Garbage collection metrics
- Disk metrics
- Audit device related metrics
- Request handling metrics
- Route-specific metrics
- Leadership metrics
- Replication metrics
- Replication RPC metrics
- Write-ahead log metrics
- Identity metrics
- Expiration metrics
- Integrated Storage metrics
- Consul storage metrics
- Vault Usage Metrics
- Vault Agent Metrics
Vault Operational Metrics
The following are critical Vault operational metrics from Vault telemetry and from the Telegraf agent itself related to overall server health and system-level performance.
These metrics are the most useful for ops teams to monitor and alert on in production deployments.
Seal status Consul health check
Note
This seal status health check metric is relevant only when using Consul for high availability coordination or storage and in such cases, the metric is emitted by the Consul agent, not Vault itself.
This metric is formatted as:
For this metric, a value of 1 indicates Vault is unsealed, whereas 0 means that Vault is sealed.
Why it is important:
By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason. And until it's unsealed, it won't answer requests from clients.
What to look for:
A value of 0 being reported by any host.
vault.core.unsealed
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | Has a value 1 when Vault is unsealed, and 0 when Vault is sealed. | bool | gauge |
Why it is important:
Immediately answers the question: is this Vault server sealed or unsealed?
What to look for:
- 1: Vault server is unsealed
- 0: Vault server is sealed
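As a minimal sketch of alerting on this gauge, the following Python function lists hosts whose latest reading is 0. The `unsealed_by_host` mapping and the host names are hypothetical; how you collect the readings depends on your monitoring stack.

```python
def sealed_hosts(unsealed_by_host):
    """Return hosts whose latest vault.core.unsealed reading is 0 (sealed).

    unsealed_by_host is a hypothetical mapping of host name to the most
    recent gauge value collected by your monitoring system.
    """
    return sorted(host for host, value in unsealed_by_host.items() if value == 0)


# Illustrative readings only; these are not real metric samples.
readings = {"vault-0": 1, "vault-1": 0, "vault-2": 1}
print(sealed_hosts(readings))
```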
vault.runtime.alloc_bytes
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of bytes allocated by the vault process. | byte | summary |
- vault.runtime.alloc_bytes.value provides the value.
vault.runtime.sys_bytes
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the total number of bytes of memory obtained from the OS by the vault process. | byte | summary |
- vault.runtime.sys_bytes.value provides the value.
vault.runtime.num_goroutines
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of goroutines associated with the vault process. This metric can serve as a general system load indicator and is worth establishing a baseline and thresholds for alerting. | goroutines | summary |
- vault.runtime.num_goroutines.value provides the value.
Why it is important:
Blocked goroutines can increase memory usage and slow garbage collection.
swap.used_percent
Metric source | Description |
---|---|
Telegraf | This metric represents the percentage of swap space in use. |
Why it is important:
Vault requires sufficient memory to hold its working data set and if it exhausts available memory it can crash. You should also monitor total available memory to make sure some memory is available for other processes, and swap usage should remain at 0% for best performance.
What to look for:
If sys_bytes exceeds 90% of total_bytes, if mem.used_percent is over 90%, or if swap.used_percent is greater than 0.
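The three conditions above could be checked together along these lines. This is a sketch only: the function name and argument names are illustrative, and it assumes `sys_bytes` and `total_bytes` are reported in the same unit.

```python
def memory_alerts(sys_bytes, total_bytes, mem_used_percent, swap_used_percent):
    """Return a list of alert messages for the memory conditions described above."""
    alerts = []
    # vault.runtime.sys_bytes compared against total system memory.
    if total_bytes > 0 and sys_bytes > 0.9 * total_bytes:
        alerts.append("sys_bytes exceeds 90% of total_bytes")
    # Overall memory pressure on the host.
    if mem_used_percent > 90:
        alerts.append("mem.used_percent is over 90%")
    # Any swap usage at all is worth flagging.
    if swap_used_percent > 0:
        alerts.append("swap.used_percent is greater than 0")
    return alerts
```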
Garbage collection metrics
These metrics represent garbage collection related measurements that are provided by the Vault runtime.
vault.runtime.gc_pause_ns
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of nanoseconds consumed by garbage collection (GC) pauses since Vault started. | nanosecond | sample |
- vault.runtime.gc_pause_ns.count provides a count of GC pauses.
- vault.runtime.gc_pause_ns.lower provides the lower bound for time taken by GC pauses.
- vault.runtime.gc_pause_ns.mean provides the mean for time taken by GC pauses.
- vault.runtime.gc_pause_ns.stddev provides the standard deviation for time taken by GC pauses.
- vault.runtime.gc_pause_ns.sum provides the sum of time taken by GC pauses.
- vault.runtime.gc_pause_ns.upper provides the upper bound for time taken by GC pauses.
Why it is important:
A GC pause is a stop-the-world event, meaning that all runtime threads are blocked until GC completes. Normally these pauses last only a few nanoseconds, but if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Vault.
What to look for:
Warning if total_gc_pause_ns exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute
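One way to turn a cumulative pause total into the per-minute rate used by these thresholds is sketched below. It assumes you sample the running total in nanoseconds at two points in time; the function name and the handling of counter resets are illustrative choices, not part of Vault.

```python
NS_PER_SECOND = 1_000_000_000


def gc_pause_severity(prev_total_ns, curr_total_ns, elapsed_seconds):
    """Classify GC pause time per minute as 'ok', 'warning', or 'critical'."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # The total restarts from zero when Vault restarts; treat that as no growth.
    delta_ns = max(curr_total_ns - prev_total_ns, 0)
    pause_seconds_per_minute = (delta_ns / NS_PER_SECOND) * (60 / elapsed_seconds)
    if pause_seconds_per_minute > 5:
        return "critical"
    if pause_seconds_per_minute > 2:
        return "warning"
    return "ok"
```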
Disk metrics
These metrics represent system level disk measurements that are provided by a system level agent, such as Telegraf.
diskio.read_bytes
Metric source | Description |
---|---|
Telegraf | This metric represents bytes read from each block device. |
diskio.write_bytes
Metric source | Description |
---|---|
Telegraf | This metric represents bytes written to each block device. |
disk.used_percent
Metric source | Description |
---|---|
Telegraf | This metric represents per-mount-point block device utilization. |
Why it is important:
When using Integrated Storage, Vault disk I/O performance becomes a more critical factor and proactive monitoring and alerting on disk performance for Vault servers is crucial.
When using storage backends other than Integrated Storage, Vault generally doesn't require too much disk I/O, so a sudden change in disk activity could mean that debug or trace logging has accidentally been enabled in production, which can impact performance.
Too much disk I/O can cause the rest of the system to slow down or become unavailable as the kernel spends all its time waiting for I/O to complete.
What to look for:
Sudden large changes to the diskio metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline). Over 80% utilization on block device mount points on which Vault data are persisted.
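The baseline comparison described above can be sketched as follows. The baseline is assumed to be a list of at least two earlier samples of the same diskio metric; the function names are hypothetical.

```python
from statistics import mean, stdev


def diskio_is_anomalous(baseline, current):
    """Flag a sample that deviates more than 50% or 3 standard deviations from baseline.

    baseline must contain at least two samples so a standard deviation exists.
    """
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    deviation = abs(current - base_mean)
    over_half = base_mean > 0 and deviation > 0.5 * base_mean
    over_three_sigma = base_std > 0 and deviation > 3 * base_std
    return over_half or over_three_sigma


def disk_is_filling(used_percent):
    """Flag mount points holding Vault data that are over 80% utilized."""
    return used_percent > 80
```

Which window of samples makes a sensible baseline is a judgment call that depends on your workload.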
Audit device related metrics
These are critical Vault metrics, and can often provide a first alert that an audit device log is blocked.
vault.audit.file/.log_request
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of requests to an enabled file audit device. | ms | summary |
- vault.audit.file/.log_request.count provides a count of audit device requests.
- vault.audit.file/.log_request.lower provides the lower bound for time taken by audit device requests.
- vault.audit.file/.log_request.mean provides the mean for time taken by audit device requests.
- vault.audit.file/.log_request.stddev provides the standard deviation for time taken by audit device requests.
- vault.audit.file/.log_request.sum provides the sum of time taken by audit device requests.
- vault.audit.file/.log_request.upper provides the upper bound for time taken by audit device requests.
vault.audit.file/.log_response
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of responses to log requests specifically to an enabled file audit device. | ms | summary |
- vault.audit.file/.log_response.count provides a count of audit device responses.
- vault.audit.file/.log_response.lower provides the lower bound for time taken by audit device responses.
- vault.audit.file/.log_response.mean provides the mean for time taken by audit device responses.
- vault.audit.file/.log_response.stddev provides the standard deviation for time taken by audit device responses.
- vault.audit.file/.log_response.sum provides the sum of time taken by audit device responses.
- vault.audit.file/.log_response.upper provides the upper bound for time taken by audit device responses.
vault.audit.log_request
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of requests to an enabled audit device. | ms | summary |
- vault.audit.log_request.count provides a count of audit device requests.
- vault.audit.log_request.lower provides the lower bound for time taken by audit device requests.
- vault.audit.log_request.mean provides the mean for time taken by audit device requests.
- vault.audit.log_request.stddev provides the standard deviation for time taken by audit device requests.
- vault.audit.log_request.sum provides the sum of time taken by audit device requests.
- vault.audit.log_request.upper provides the upper bound for time taken by audit device requests.
vault.audit.log_response
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of responses to log requests to an enabled audit device. | ms | summary |
- vault.audit.log_response.count provides a count of audit device responses.
- vault.audit.log_response.lower provides the lower bound for time taken by audit device responses.
- vault.audit.log_response.mean provides the mean for time taken by audit device responses.
- vault.audit.log_response.stddev provides the standard deviation for time taken by audit device responses.
- vault.audit.log_response.sum provides the sum of time taken by audit device responses.
- vault.audit.log_response.upper provides the upper bound for time taken by audit device responses.
vault.audit.log_request_failure
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of failed attempts to log requests to an enabled audit device. | failures | counter |
- vault.audit.log_request_failure.value provides the number of audit device log request failures since startup.
vault.audit.log_response_failure
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents a count of failed attempts to log responses to an enabled audit device. | failures | counter |
- vault.audit.log_response_failure.value provides the number of audit device log response failures since startup.
Why it is important:
These metrics are of utmost importance as a blocked audit device can cause Vault to deliberately stop servicing requests. Review the Blocked Audit Devices documentation for more information.
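Because the two failure metrics count failures since startup, one possible approach is to compare consecutive scrapes and alert on any growth. The sketch below assumes each scrape is a plain dictionary of metric name to counter value; treating any increase as alert-worthy is an assumption, not documented guidance.

```python
WATCHED_FAILURE_COUNTERS = (
    "vault.audit.log_request_failure",
    "vault.audit.log_response_failure",
)


def audit_failure_increase(previous, current):
    """Return the failure counters that grew between two scrapes, with the growth."""
    increases = {}
    for name in WATCHED_FAILURE_COUNTERS:
        before = previous.get(name, 0)
        after = current.get(name, 0)
        if after > before:
            increases[name] = after - before
    return increases
```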
Request handling metrics
These metrics represent counts and measurements that are provided by Vault core request handlers.
vault.core.handle_request
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of requests handled by Vault core. | ms | summary |
- vault.core.handle_request.count provides a count of requests handled by core.
- vault.core.handle_request.lower provides the lower bound for time taken by requests handled by core.
- vault.core.handle_request.mean provides the mean for time taken by requests handled by core.
- vault.core.handle_request.stddev provides the standard deviation for time taken by requests handled by core.
- vault.core.handle_request.sum provides the sum of time taken by requests handled by core.
- vault.core.handle_request.upper provides the upper bound for time taken by requests handled by core.
Why it is important:
This is a key measure of Vault's response time or number of requests.
What to look for:
Changes to the count or mean fields that exceed 50% of baseline values, or more than 3 standard deviations above baseline.
vault.core.handle_login_request
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of login requests handled by Vault core. | ms | summary |
- vault.core.handle_login_request.count provides a count of login requests handled by core.
- vault.core.handle_login_request.lower provides the lower bound for time taken by login requests handled by core.
- vault.core.handle_login_request.mean provides the mean for time taken by login requests handled by core.
- vault.core.handle_login_request.stddev provides the standard deviation for time taken by login requests handled by core.
- vault.core.handle_login_request.sum provides the sum of time taken by login requests handled by core.
- vault.core.handle_login_request.upper provides the upper bound for time taken by login requests handled by core.
Why it is important:
This is a key measure of Vault's user login response time or number of login requests.
What to look for:
Changes to the count or mean fields that exceed 50% of baseline values, or more than 3 standard deviations above baseline.
Route-specific metrics
Vault also provides metrics about operations against specific routes, including those in use by enabled secrets engines.
vault.route.<operation>.<mount>
The general format of the route based metrics is as follows:
These Vault metrics represent the time to handle an operation by a particular mount point. Instead of labels, there is one metric per operation/mount pair.
When collected through telemetry, these metrics give a good approximation of response time per API endpoint. They are measured slightly below HTTP handling, auditing, and some common request processing in the Vault stack, but above any per-API handling.
The following are some specific examples taken from live metrics.
Displays the mean time for rollback operations against the system backend.
Another example is vault.route.read.auth-token-, which represents the authentication token read route; here are the available values:
- vault.route.read.auth-token-.count provides a count of authentication token reads.
- vault.route.read.auth-token-.lower provides the lower bound for time taken by authentication token reads.
- vault.route.read.auth-token-.mean provides the mean for time taken by authentication token reads.
- vault.route.read.auth-token-.stddev provides the standard deviation for time taken by authentication token reads.
- vault.route.read.auth-token-.sum provides the sum of time taken by authentication token reads.
- vault.route.read.auth-token-.upper provides the upper bound for time taken by authentication token reads.
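Since there is one metric per operation/mount pair, it can help to split a route metric name back into its parts when grouping dashboards or alerts. The following is a sketch that assumes the mount segment contains no dots; `parse_route_metric` is a hypothetical helper, not part of Vault.

```python
def parse_route_metric(name):
    """Split 'vault.route.<operation>.<mount>[.<field>]' into its components.

    Returns (operation, mount, field); field is None when no aggregate
    suffix such as 'mean' or 'count' is present.
    """
    parts = name.split(".")
    if len(parts) < 4 or parts[0] != "vault" or parts[1] != "route":
        raise ValueError(f"not a route metric: {name!r}")
    operation, mount = parts[2], parts[3]
    field = parts[4] if len(parts) > 4 else None
    return operation, mount, field


print(parse_route_metric("vault.route.read.auth-token-.mean"))
```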
Leadership metrics
These are critical operational metrics related to Vault cluster leadership changes, and can help you spot an unhealthy cluster or leadership flapping condition.
vault.core.leadership_setup_failed
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
- vault.core.leadership_setup_failed.lower provides the lower bound for time taken by cluster leadership setup failures.
- vault.core.leadership_setup_failed.mean provides the mean for time taken by cluster leadership setup failures.
- vault.core.leadership_setup_failed.stddev provides the standard deviation for time taken by cluster leadership setup failures.
- vault.core.leadership_setup_failed.sum provides the sum of time taken by cluster leadership setup failures.
- vault.core.leadership_setup_failed.upper provides the upper bound for time taken by cluster leadership setup failures.
vault.core.leadership_lost
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
- vault.core.leadership_lost.lower provides the lower bound for time taken by cluster leadership losses.
- vault.core.leadership_lost.mean provides the mean for time taken by cluster leadership losses.
- vault.core.leadership_lost.stddev provides the standard deviation for time taken by cluster leadership losses.
- vault.core.leadership_lost.sum provides the sum of time taken by cluster leadership losses.
- vault.core.leadership_lost.upper provides the upper bound for time taken by cluster leadership losses.
Why it is important:
The measured value of this metric answers the question "how long was this server the leader, when it lost leadership?"
What to look for:
Any count greater than zero means that Vault experienced a leadership change and could potentially be cause for alerting. Do note that a high mean value here is considered better than a low value, because a low value means there is leadership flapping.
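The reasoning above, where any count signals a leadership change and a low mean suggests flapping, could be expressed roughly as follows. The `flapping_below_ms` cutoff is a hypothetical value chosen for illustration; pick one that reflects how long a leader normally holds leadership in your cluster.

```python
def leadership_lost_assessment(count, mean_ms, flapping_below_ms=60_000):
    """Classify vault.core.leadership_lost readings.

    flapping_below_ms is an assumed cutoff, not a documented threshold.
    """
    if count == 0:
        return "stable"
    # A short time as leader before losing leadership suggests flapping.
    if mean_ms < flapping_below_ms:
        return "flapping"
    return "leadership-change"
```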
vault.core.post_unseal
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by post-unseal operations handled by Vault core. | ms | summary |
- vault.core.post_unseal.lower provides the lower bound for time taken by post-unseal setup.
- vault.core.post_unseal.mean provides the mean for time taken by post-unseal setup.
- vault.core.post_unseal.stddev provides the standard deviation for time taken by post-unseal setup.
- vault.core.post_unseal.sum provides the sum of time taken by post-unseal setup.
- vault.core.post_unseal.upper provides the upper bound for time taken by post-unseal setup.
Why it is important:
This metric is good for support or debugging problems with Vault startup after unsealing.
Replication metrics
If you use Vault Enterprise Replication, then these metrics will be of importance in monitoring for every primary and secondary cluster that participates in replication.
vault.merkle.flushDirty
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken to flush dirty pages to storage. | ms | summary |
- vault.merkle.flushDirty.lower provides the lower bound for time taken to flush dirty pages to storage.
- vault.merkle.flushDirty.mean provides the mean for time taken to flush dirty pages to storage.
- vault.merkle.flushDirty.stddev provides the standard deviation for time taken to flush dirty pages to storage.
- vault.merkle.flushDirty.sum provides the sum of time taken to flush dirty pages to storage.
- vault.merkle.flushDirty.upper provides the upper bound for time taken to flush dirty pages to storage.
Why it is important:
Monitoring this metric can provide insights into replication usage and saturation, while also revealing issues with storage performance.
vault.merkle.diff
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken to perform a Merkle diff between the two clusters. | ms | summary |
- vault.merkle.diff.lower provides the lower bound for time taken by a Merkle tree diff operation.
- vault.merkle.diff.mean provides the mean for time taken by a Merkle tree diff operation.
- vault.merkle.diff.stddev provides the standard deviation for time taken by a Merkle tree diff operation.
- vault.merkle.diff.sum provides the sum of time taken by a Merkle tree diff operation.
- vault.merkle.diff.upper provides the upper bound for time taken by a Merkle tree diff operation.
Why it is important:
Increased time spent in Merkle diff operations can be indicative of a larger problem in determining differences between data in the two clusters. Closely monitoring this metric helps to identify potential replication breaking issues.
vault.merkle.remote_state_snapshot
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken to perform a remote state snapshot before checking for differences. | ms | summary |
- vault.merkle.remote_state_snapshot.lower provides the lower bound for time taken by remote state snapshot operation.
- vault.merkle.remote_state_snapshot.mean provides the mean for time taken by remote state snapshot operation.
- vault.merkle.remote_state_snapshot.stddev provides the standard deviation for time taken by remote state snapshot operation.
- vault.merkle.remote_state_snapshot.sum provides the sum of time taken by remote state snapshot operation.
- vault.merkle.remote_state_snapshot.upper provides the upper bound for time taken by remote state snapshot operation.
vault.merkle.reindex
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by reindex operations. | ms | summary |
- vault.merkle.reindex.lower provides the lower bound for time taken by reindex.
- vault.merkle.reindex.mean provides the mean for time taken by reindex.
- vault.merkle.reindex.stddev provides the standard deviation for time taken by reindex.
- vault.merkle.reindex.sum provides the sum of time taken by reindex.
- vault.merkle.reindex.upper provides the upper bound for time taken by reindex.
Why it is important:
A reindex is typically IOPS intensive, affects all relevant keys, and requires updating of all sub trees. Monitoring reindex usage and saturation is key to ensuring healthy replication and detecting performance bottlenecks.
vault.replication.wal.last_wal
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the index of the last WAL. | sequence number | gauge |
- vault.replication.wal.last_wal.value provides the last WAL index.
vault.replication.wal.last_dr_wal
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the index of the last Disaster Recovery (DR) replication WAL. | sequence number | gauge |
- vault.replication.wal.last_dr_wal.value provides the last DR mode replication WAL index.
vault.replication.wal.last_performance_wal
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the index of the last Performance Replication WAL. | sequence number | gauge |
- vault.replication.wal.last_performance_wal.value provides the last Performance mode replication WAL index.
vault.replication.fsm.last_remote_wal
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the index of the last remote WAL. | sequence number | gauge |
- vault.replication.fsm.last_remote_wal.value provides the last remote WAL index.
Replication RPC metrics
These metrics represent replication RPC measurements that are provided by Vault.
replication.rpc.client.stream_wals
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by the client to stream WALs. | ms | summary |
- replication.rpc.client.stream_wals.lower provides the lower bound for time taken by a client to stream WALs.
- replication.rpc.client.stream_wals.mean provides the mean for time taken by a client to stream WALs.
- replication.rpc.client.stream_wals.stddev provides the standard deviation for time taken by a client to stream WALs.
- replication.rpc.client.stream_wals.sum provides the sum of time taken by a client to stream WALs.
- replication.rpc.client.stream_wals.upper provides the upper bound for time taken by a client to stream WALs.
vault.replication.rpc.client.fetch_keys
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by a client to perform a fetch keys request. | ms | summary |
- vault.replication.rpc.client.fetch_keys.lower provides the lower bound for time taken by a client to perform a fetch keys request.
- vault.replication.rpc.client.fetch_keys.mean provides the mean for time taken by a client to perform a fetch keys request.
- vault.replication.rpc.client.fetch_keys.stddev provides the standard deviation for time taken by a client to perform a fetch keys request.
- vault.replication.rpc.client.fetch_keys.sum provides the sum of time taken by a client to perform a fetch keys request.
- vault.replication.rpc.client.fetch_keys.upper provides the upper bound for time taken by a client to perform a fetch keys request.
vault.replication.rpc.client.conflicting_pages
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time taken by a client conflicting page request. | ms | summary |
- vault.replication.rpc.client.conflicting_pages.lower provides the lower bound for time taken by a client conflicting page request.
- vault.replication.rpc.client.conflicting_pages.mean provides the mean for time taken by a client conflicting page request.
- vault.replication.rpc.client.conflicting_pages.stddev provides the standard deviation for time taken by a client conflicting page request.
- vault.replication.rpc.client.conflicting_pages.sum provides the sum of time taken by a client conflicting page request.
- vault.replication.rpc.client.conflicting_pages.upper provides the upper bound for time taken by a client conflicting page request.
vault.replication.merkleSync
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication. | ms | summary |
- vault.replication.merkleSync.lower provides the lower bound for time to perform a Merkle Tree based synchronization.
- vault.replication.merkleSync.mean provides the mean for time to perform a Merkle Tree based synchronization.
- vault.replication.merkleSync.stddev provides the standard deviation for time to perform a Merkle Tree based synchronization.
- vault.replication.merkleSync.sum provides the sum of time to perform a Merkle Tree based synchronization.
- vault.replication.merkleSync.upper provides the upper bound for time to perform a Merkle Tree based synchronization.
vault.replication.merkleDiff
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time to perform a Merkle Tree based delta generation between the clusters participating in replication. | ms | summary |
- vault.replication.merkleDiff.lower provides the lower bound for time to perform a Merkle Tree based delta generation.
- vault.replication.merkleDiff.mean provides the mean for time to perform a Merkle Tree based delta generation.
- vault.replication.merkleDiff.stddev provides the standard deviation for time to perform a Merkle Tree based delta generation.
- vault.replication.merkleDiff.sum provides the sum of time to perform a Merkle Tree based delta generation.
- vault.replication.merkleDiff.upper provides the upper bound for time to perform a Merkle Tree based delta generation.
Write-ahead log metrics
These metrics relate to Vault Write Ahead Log (WAL) operations.
vault.wal_gc_total
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the total number of write-ahead logs (WAL) on disk. | WAL | counter |
- vault.wal_gc_total.value provides the total number.
vault.wal.persistWALs
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the amount of time required to persist the Vault write-ahead logs (WAL) to the storage backend. | ms | summary |
- vault.wal.persistWALs.lower provides the lower bound for time required to persist WALs to the storage backend.
- vault.wal.persistWALs.mean provides the mean for time required to persist WALs to the storage backend.
- vault.wal.persistWALs.stddev provides the standard deviation for time required to persist WALs to the storage backend.
- vault.wal.persistWALs.sum provides the sum of time required to persist WALs to the storage backend.
- vault.wal.persistWALs.upper provides the upper bound for time required to persist WALs to the storage backend.
vault.wal.flushReady
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the amount of time required to flush the Vault write-ahead logs (WAL) to the persist queue. | ms | summary |
- vault.wal.flushReady.lower provides the lower bound for time required to flush the Vault WALs to the persist queue.
- vault.wal.flushReady.mean provides the mean for time required to flush the Vault WALs to the persist queue.
- vault.wal.flushReady.stddev provides the standard deviation for time required to flush the Vault WALs to the persist queue.
- vault.wal.flushReady.sum provides the sum of time required to flush the Vault WALs to the persist queue.
- vault.wal.flushReady.upper provides the upper bound for time required to flush the Vault WALs to the persist queue.
Why it is important:
The Vault write-ahead logs (WALs) are used to replicate Vault data between clusters. WALs are written and stored even if Enterprise Replication is not currently enabled. The WAL is purged every few seconds by a garbage collector. But if Vault is under heavy load, the WALs may start to accumulate, putting pressure on the storage.
What to look for:
- flushReady is over 500ms
- persistWALs is over 1000ms
Note
Refer to Monitoring Vault Replication for additional information.
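The two WAL thresholds listed under "What to look for" can be checked with a small helper like the one below. The text does not say which field of each summary to compare, so using the mean values is an assumption, and the function name is illustrative.

```python
def wal_alerts(flush_ready_ms, persist_wals_ms):
    """Return alert messages when WAL timings exceed the suggested thresholds.

    Both arguments are assumed to be mean durations in milliseconds.
    """
    alerts = []
    if flush_ready_ms > 500:
        alerts.append("vault.wal.flushReady is over 500ms")
    if persist_wals_ms > 1000:
        alerts.append("vault.wal.persistWALs is over 1000ms")
    return alerts
```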
Identity metrics
These metrics represent identity entity measurements that are provided by Vault.
vault.identity.num_entities
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric was introduced in version 1.4.1 and represents the number of identity entities. | entities | gauge |
- vault.identity.num_entities.value provides the total number of identity entities.
Expiration metrics
These metrics represent lease measurements that are provided by Vault.
vault.expire.num_leases
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the number of all leases which are eligible for eventual expiry. | leases | gauge |
- vault.expire.num_leases.value provides the total number of leases which are eligible for eventual expiry.
Why it is important:
This value represents an approximate total lease count for Vault across all lease generating auth methods and secrets engines.
What to look for:
Large and unexpected delta in count can indicate a bulk operation, load testing, or runaway client application is generating excessive leases and should be immediately investigated.
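What counts as a "large" delta depends on your environment. As a sketch, the check below flags growth in the lease count beyond a configurable fraction between two samples; the default `max_growth` of 0.5 is a hypothetical value, not a recommendation from this reference.

```python
def lease_count_spike(previous, current, max_growth=0.5):
    """Return True when vault.expire.num_leases grew by more than max_growth.

    max_growth is a fraction of the previous sample (0.5 means 50%).
    """
    if previous <= 0:
        # No baseline to compare against, so nothing can be concluded.
        return False
    return (current - previous) / previous > max_growth
```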
vault.expire.revoke
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the duration of time to revoke a token. | ms | summary |
- vault.expire.revoke.lower provides the lower bound for time to revoke a token.
- vault.expire.revoke.mean provides the mean for time to revoke a token.
- vault.expire.revoke.stddev provides the standard deviation for time to revoke a token.
- vault.expire.revoke.sum provides the sum of time to revoke a token.
- vault.expire.revoke.upper provides the upper bound for time to revoke a token.
Integrated Storage metrics
These metrics relate to the Integrated Storage (Raft) backend. If you use this storage backend, you should monitor these metrics.
vault.raft.leader.lastContact
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. | ms | summary |
Why it is important:
A Vault cluster with Integrated Storage should have a stable leader. If there are frequent elections or leadership changes, this can indicate network issues between the servers, or that the servers themselves are unable to keep up with the load.
What to look for:
For a healthy cluster, you’re looking for a lastContact lower than 200ms, leader > 0 and candidate == 0. Deviations from this might indicate flapping leadership.
Issues with integrated storage should be escalated to the operations team.
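The healthy-cluster conditions above combine three metrics, so a single predicate can be convenient. This sketch assumes you pass in a recent `vault.raft.leader.lastContact` value in milliseconds along with the current `vault.raft.state.leader` and `vault.raft.state.candidate` readings; the function name is hypothetical.

```python
def raft_cluster_healthy(last_contact_ms, leader, candidate):
    """Apply the healthy-cluster rule: lastContact < 200ms, leader > 0, candidate == 0."""
    return last_contact_ms < 200 and leader > 0 and candidate == 0
```

A False result only indicates a deviation worth investigating, such as possible leadership flapping; it does not identify the cause.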
vault.raft.state.candidate
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | Increments whenever the Raft server starts an election. | elections | counter |
vault.raft.state.leader
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | Increments whenever the Raft server becomes a leader. | leaders | counter |
vault.raft.delete
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the time to delete a key from underlying storage. | ms | summary |
- vault.raft.delete.lower provides the lower bound for time to delete a key from underlying storage.
- vault.raft.delete.mean provides the mean for time to delete a key from underlying storage.
- vault.raft.delete.stddev provides the standard deviation for time to insert a log entry into the delete path.
- vault.raft.delete.sum provides the sum of time to insert a log entry into the delete path.
- vault.raft.delete.upper provides the upper bound for time to insert a log entry into the delete path.
vault.raft.get
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the time to retrieve a key from underlying storage. | ms | summary |
- vault.raft.get.lower provides the lower bound for time to retrieve a key from underlying storage.
- vault.raft.get.mean provides the mean for time to retrieve a key from underlying storage.
- vault.raft.get.stddev provides the standard deviation for time to retrieve a key from underlying storage.
- vault.raft.get.sum provides the sum of time to retrieve a key from underlying storage.
- vault.raft.get.upper provides the upper bound for time to retrieve a key from underlying storage.
vault.raft.put
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the time to persist a key in underlying storage. | ms | summary |
- vault.raft.put.lower provides the lower bound for time to persist a key in underlying storage.
- vault.raft.put.mean provides the mean for time to persist a key in underlying storage.
- vault.raft.put.stddev provides the standard deviation for time to persist a key in underlying storage.
- vault.raft.put.sum provides the sum of time to persist a key in underlying storage.
- vault.raft.put.upper provides the upper bound for time to persist a key in underlying storage.
vault.raft.list
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents the time to retrieve a list of keys from underlying storage. | ms | summary |
- vault.raft.list.lower provides the lower bound for time to retrieve a list of keys from underlying storage.
- vault.raft.list.mean provides the mean for time to retrieve a list of keys from underlying storage.
- vault.raft.list.stddev provides the standard deviation for time to retrieve a list of keys from underlying storage.
- vault.raft.list.sum provides the sum of time to retrieve a list of keys from underlying storage.
- vault.raft.list.upper provides the upper bound for time to retrieve a list of keys from underlying storage.
Consul storage metrics
These metrics relate to Consul when used as a storage backend. If you use this storage backend, you should monitor these metrics.
They are available in Vault telemetry. For a full list of Consul metrics, refer to the Monitoring Consul Datacenter Health tutorial.
vault.consul.get
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents GET operations against the Consul storage backend. | ms | summary |
- vault.consul.get.count provides the number of GET operations against the Consul storage backend.
- vault.consul.get.lower provides the lower bound for duration of GET operations against the Consul storage backend.
- vault.consul.get.mean provides the mean for duration of GET operations against the Consul storage backend.
- vault.consul.get.stddev provides the standard deviation for duration of GET operations against the Consul storage backend.
- vault.consul.get.sum provides the sum of duration of GET operations against the Consul storage backend.
- vault.consul.get.upper provides the upper bound for duration of GET operations against the Consul storage backend.
vault.consul.put
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents PUT operations against the Consul storage backend. | ms | summary |
- vault.consul.put.count provides the number of PUT operations against the Consul storage backend.
- vault.consul.put.lower provides the lower bound for duration of PUT operations against the Consul storage backend.
- vault.consul.put.mean provides the mean for duration of PUT operations against the Consul storage backend.
- vault.consul.put.stddev provides the standard deviation for duration of PUT operations against the Consul storage backend.
- vault.consul.put.sum provides the sum of duration of PUT operations against the Consul storage backend.
- vault.consul.put.upper provides the upper bound for duration of PUT operations against the Consul storage backend.
vault.consul.list
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents LIST operations against the Consul storage backend. | ms | summary |
- vault.consul.list.count provides the number of LIST operations against the Consul storage backend.
- vault.consul.list.lower provides the lower bound for duration of LIST operations against the Consul storage backend.
- vault.consul.list.mean provides the mean for duration of LIST operations against the Consul storage backend.
- vault.consul.list.stddev provides the standard deviation for duration of LIST operations against the Consul storage backend.
- vault.consul.list.sum provides the sum of duration of LIST operations against the Consul storage backend.
- vault.consul.list.upper provides the upper bound for duration of LIST operations against the Consul storage backend.
vault.consul.delete
Metric source | Description | Unit | Type |
---|---|---|---|
Vault | This metric represents DELETE operations against the Consul storage backend. | ms | summary |
- vault.consul.delete.count provides the number of DELETE operations against the Consul storage backend.
- vault.consul.delete.lower provides the lower bound for duration of DELETE operations against the Consul storage backend.
- vault.consul.delete.mean provides the mean for duration of DELETE operations against the Consul storage backend.
- vault.consul.delete.stddev provides the standard deviation for duration of DELETE operations against the Consul storage backend.
- vault.consul.delete.sum provides the sum of duration of DELETE operations against the Consul storage backend.
- vault.consul.delete.upper provides the upper bound for duration of DELETE operations against the Consul storage backend.
Why it is important:
These metrics indicate how long it takes for Consul to handle requests from Vault.
What to look for:
Large deltas in the count, upper, or 90_percentile fields.
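One way to make "large deltas" concrete is to compare two snapshots of the same metric and flag any watched field that grew by more than a chosen ratio. The sketch below assumes the snapshots are plain dictionaries keyed by field name. The function name `large_deltas`, the sample values, and the default ratio of 2.0 are illustrative assumptions, not recommended settings.

```python
from typing import Dict, List

# Fields named in the guidance above.
WATCHED_FIELDS = ("count", "upper", "90_percentile")


def large_deltas(previous: Dict[str, float],
                 current: Dict[str, float],
                 max_ratio: float = 2.0) -> List[str]:
    """Return the watched fields whose value grew by at least max_ratio."""
    flagged = []
    for field in WATCHED_FIELDS:
        before = previous.get(field)
        after = current.get(field)
        # Skip fields without a usable baseline to avoid dividing by zero.
        if before is None or after is None or before <= 0:
            continue
        if after / before >= max_ratio:
            flagged.append(field)
    return flagged


# Hypothetical snapshots of vault.consul.get taken a few minutes apart.
earlier = {"count": 1200, "upper": 18.0, "90_percentile": 9.5}
later = {"count": 1250, "upper": 95.0, "90_percentile": 11.0}
print(large_deltas(earlier, later))
```

In this made-up data only the upper bound jumps, which would suggest a few slow outlier requests rather than a general slowdown.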
Vault Usage Metrics
The following are fine-grained usage metrics from Vault telemetry introduced in version 1.5. They are related to common types of usage including identity, lease, secret, and token usage.
These metrics are the most useful for business users to measure Vault usage for metering, billing, and similar use cases.
vault.token.creation
Metric source | Description |
---|---|
Vault | A new service or batch token was created. (Name chosen to be distinct from vault.token.create, an existing sample metric.) |
- vault.token.creation.value provides the number.
vault.token.count
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of service tokens available for use. |
- vault.token.count.value provides the number.
vault.token.count.by_auth
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of existing tokens broken down by the auth method used to create them. |
- vault.token.count.by_auth.value provides the number.
vault.token.count.by_policy
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of existing tokens, counted in each policy assigned. |
- vault.token.count.by_policy.value provides the number.
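Because a token is counted once in each policy assigned to it, the per-policy values can add up to more than the total number of tokens. The sketch below illustrates that counting rule with made-up data. The function name `count_by_policy` and the policy names are hypothetical, and the code mirrors the description above rather than Vault's internal implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_policy(token_policies: Iterable[List[str]]) -> Dict[str, int]:
    """Count each token once for every distinct policy assigned to it."""
    counts: Counter = Counter()
    for policies in token_policies:
        for policy in set(policies):
            counts[policy] += 1
    return dict(counts)


# Three hypothetical tokens and the policies assigned to each.
tokens = [
    ["default"],
    ["default", "app-read"],
    ["default", "app-read", "ops"],
]
per_policy = count_by_policy(tokens)
print(per_policy)
print("tokens:", len(tokens), "sum of per-policy counts:", sum(per_policy.values()))
```

For metering purposes, this means the per-policy breakdown should not be summed to estimate the overall token count.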
vault.token.count.by_ttl
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of existing tokens, aggregated by their TTL at creation. |
- vault.token.count.by_ttl.value provides the number.
vault.secret.kv.count
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the count of secrets in key-value stores. |
- vault.secret.kv.count.value provides the number.
vault.secret.lease.creation
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents a count of leases created by a secret engine (excluding leases created internally for token expiration). |
- vault.secret.lease.creation.value provides the number.
vault.identity.entity.count
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of identity entities. |
- vault.identity.entity.count.value provides the number.
vault.identity.entity.creation
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents a count of identity entity creation, either from manual creation or automatically upon login with an auth method. |
- vault.identity.entity.creation.value provides the number.
vault.identity.entity.alias.count
Metric source | Description |
---|---|
Vault | This metric was introduced in version 1.5.0 and represents the number of identity aliases to entities. |
- vault.identity.entity.alias.count.value provides the number.
Vault Agent Metrics
As of version 1.10, Vault Agent supports the telemetry configuration stanza, and also emits a collection of useful Agent-specific operational metrics which are documented in the Vault Agent documentation.
Refer to this documentation if you wish to monitor Agent telemetry metrics.
Summary
You have been introduced to the most critical Vault operational, usage, and Agent metrics along with information about monitoring and responding to specific examples.
If you want to review a practical configuration example or try the example in an online tutorial, you can do so in either Monitor Telemetry & Audit Device Log Data or Monitor Telemetry with Prometheus & Grafana.
This tutorial focuses on key Vault telemetry metrics. Refer to Vault Limits and Maximums to understand the known upper limits on the size of certain fields and objects, and configurable limits on others.