Vault creates leases for both dynamic secrets and service tokens, and it maintains the lifecycle of those leases with an internal component called the expiration manager.
The expiration manager handles the revocation of a lease when the time to live value associated with the lease is reached.
Certain problems can prevent Vault from successfully revoking a lease. For example, leases on secrets issued from a dynamic secrets engine can become irrevocable if Vault cannot communicate with the server configured in the secrets engine.
Irrevocable leases accumulate over time and can cause degraded performance at critical stages of Vault operations, such as during startup or when the server assumes active cluster leadership.
Before Vault 1.8.0, the server would attempt to revoke all expired leases at once during startup. With tens of thousands of accumulated irrevocable leases, request handling can degrade while the expiration manager attempts revocation.
Vault 1.8.0 introduced enhanced expiration manager functionality to internally mark leases as irrevocable after 6 failed attempts at revocation.
This provides a way to stop attempting revocation on leases which are identified as irrevocable.
An HTTP API and CLI command are also available to assist operators in identifying irrevocable leases.
You can follow the example scenario in this tutorial to learn more about Vault lease handling and troubleshooting irrevocable leases.
To perform the steps in this tutorial, you need:
The example scenario runs in a Docker environment. You will create a Docker network and run a Vault dev mode server container. The scenario script creates the PostgreSQL container, configures the secrets engine, and creates a dynamic credential with a lease. This saves time on those steps so that you can focus on interpreting log output and using the new API and CLI functionality.
Before you can explore the scenarios, you need to prepare the environment.
First, define a learn-vault Docker network.
$ docker network create --attachable --subnet 10.42.74.0/24 learn-vault
Start a Vault dev mode container.
$ docker run \
    --name learn-vault \
    --publish 8200:8200 \
    --ip 10.42.74.100 \
    --network learn-vault \
    --detach \
    --rm \
    vault server -dev -dev-root-token-id root
Export environment variables for communicating with the Vault dev mode container using the root token value.
$ export VAULT_ADDR=http://localhost:8200 VAULT_TOKEN=root
Look up your token to ensure that you can communicate with the Vault dev mode container.
$ vault token lookup | grep policies
policies            [root]
Now that the Vault container is ready, you can begin exploring the example lease revocation scenarios.
Retrieve the example scenario scripts by cloning or downloading the hashicorp/vault-guides repository from GitHub.
Clone the repository.
$ git clone https://github.com/hashicorp/vault-guides.git
Or download the repository.
This repository contains supporting content for all of the Vault learn tutorials. The content specific to this tutorial can be found within a sub-directory.
Change the working directory to vault-guides/monitoring-troubleshooting/leases-lab.
$ cd vault-guides/monitoring-troubleshooting/leases-lab
Before you can begin to resolve issues with problematic leases, you should first learn how to identify situations in which Vault is unable to revoke leases.
In this scenario you will identify the appearance of successful and unsuccessful lease revocation entries in the Vault server log, along with identifying an irrevocable lease entry.
The example script starts a PostgreSQL container, configures the Vault container to connect to it, defines a role for creating dynamic credentials, and creates one dynamic credential.
Make the dynamic-postgres.sh file executable.
$ chmod +x dynamic-postgres.sh
With the Vault server running, execute the script.
$ ./dynamic-postgres.sh
Start PostgreSQL container.
572094a9ad9b7d8dd500945f22e7ff5692c378f10ebb7ac1c5de2024f09ac474
.
Configure PostgreSQL secrets engine.
Success! Enabled the database secrets engine at: database/
Success! Data written to: database/roles/db-dba
Success! Uploaded policy: db-dba
Create PostgreSQL dynamic credential using DBA token.
Complete.
Wait over 1 minute for the TTL value on the 2 leases to expire, then check the Vault server logs.
$ docker logs 2>&1 learn-vault | grep revoked
You should find a log line indicating successful revocation of the 1 lease which was created by the script and is now expired.
2021-07-27T10:52:23.083-0400 [INFO] expiration: revoked lease: lease_id=database/creds/db-dba/3er4DsaHXUbkwj0lh5wDGroJ
The log entry shows that the expiration manager successfully revoked the lease for the credential. Note that the lease_id entry is prefixed by the secrets engine type database and contains a reference to the role name, db-dba.
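The structure of the lease ID can be pulled apart with plain shell parameter expansion. The following is a minimal sketch, assuming the lease ID follows the same layout as the example above (mount path, then creds, then role name, then a unique suffix) and that the mount path is a single path segment. The function names are hypothetical helpers for illustration and are not part of the Vault CLI.

```shell
# Hypothetical helpers; the names are illustrative, not Vault commands.

# lease_prefix LEASE_ID: drop the trailing unique suffix from a lease ID.
lease_prefix() {
  printf '%s\n' "${1%/*}"
}

# lease_mount LEASE_ID: print the first path segment (the mount path).
# Assumes the secrets engine is mounted at a single-segment path.
lease_mount() {
  printf '%s\n' "${1%%/*}"
}

# lease_role LEASE_ID: print the segment just before the unique suffix.
lease_role() {
  lease_role_prefix=${1%/*}
  printf '%s\n' "${lease_role_prefix##*/}"
}

lease_prefix "database/creds/db-dba/3er4DsaHXUbkwj0lh5wDGroJ"  # prints database/creds/db-dba
lease_role "database/creds/db-dba/3er4DsaHXUbkwj0lh5wDGroJ"    # prints db-dba
```

The prefix form is the value that the lease revocation endpoints operate on, so a helper like this can be handy when you copy a lease ID straight out of a log line.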
Let's now examine a case where revocation is failing so you can understand how that situation is reflected in the server logs.
First, disable the database secrets engine that the script enabled. This will remove all associated configuration so that you can then reconfigure it with a second execution of the script.
$ vault secrets disable database Success! Disabled the secrets engine (if it existed) at: database/
Then stop the PostgreSQL container.
$ docker stop learn-postgres
NOTE: The script starts the container with the --rm flag, so Docker automatically removes the container when you stop it.
Run the dynamic-postgres.sh script again, but this time immediately stop the PostgreSQL container after the script execution completes.
$ ./dynamic-postgres.sh ; docker stop learn-postgres
By immediately stopping the PostgreSQL container, you prevent Vault from connecting to it and revoking the lease when it reaches expiration.
Wait a minute for the TTL value on the leases to expire, then check the Vault server logs.
$ docker logs 2>&1 learn-vault | grep 'failed to revoke'
You should find an [ERROR] line indicating failure to revoke the lease.
2021-07-27T10:56:05.428-0400 [ERROR] expiration: failed to revoke lease: lease_id=database/creds/db-dba/k7yLujVyreaN8HWXNKnpWTz1 error="failed to revoke entry: resp: (*logical.Response)(nil) err: dial tcp [::1]:5432: connect: connection refused"
The lease_id value contains details about the secrets engine type and role name.
Note also the error message, which states that Vault failed to revoke the entry and gives more detail about the cause. In this case, Vault cannot connect to the PostgreSQL server at [::1]:5432 because you stopped the Docker container.
Since Vault cannot connect to PostgreSQL, it cannot issue the revocation statements required to completely revoke the credentials and associated lease.
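When the log contains many such entries, it can help to reduce each one to the lease ID and the error text. The following is a minimal sketch, assuming the log lines keep the same shape as the example entry above. The function name revoke_errors is a hypothetical helper, not a Vault command.

```shell
# revoke_errors: read Vault server log lines on stdin and print
# "<lease_id> | <error text>" for each failed revocation entry.
# Hypothetical helper; it assumes the log line layout shown in this tutorial.
revoke_errors() {
  sed -n 's/.*failed to revoke lease: lease_id=\([^ ]*\) error="\(.*\)".*/\1 | \2/p'
}
```

With the Vault container from this scenario running, you could pipe the server log through it, for example: docker logs 2>&1 learn-vault | revoke_errors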
When Vault encounters irrevocable leases, it behaves differently depending on the version in use.
For versions prior to 1.8.0, Vault will always attempt to revoke all expired leases. This means that if you have a scenario like that which you just explored where the database server is unavailable, Vault will be periodically and indefinitely attempting connections with that server to finally revoke the credentials.
For versions at or beyond 1.8.0, Vault will attempt to revoke an expired lease 6 times. If it fails to revoke the lease on the sixth attempt, it will internally mark the lease as irrevocable. You can identify such leases with the CLI.
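The retry-then-give-up behavior can be illustrated with a small shell loop. This is only a sketch of the general pattern and not Vault's implementation: the function name revoke_with_backoff is hypothetical, and the starting delay and the doubling between attempts are assumptions chosen for illustration.

```shell
# revoke_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS failures have occurred,
# doubling the delay after each failure (an assumed backoff schedule).
# Hypothetical helper for illustration; Vault does this internally.
revoke_with_backoff() {
  rb_max=$1
  rb_delay=$2
  shift 2
  rb_n=1
  while [ "$rb_n" -le "$rb_max" ]; do
    if "$@"; then
      echo "revoked on attempt $rb_n"
      return 0
    fi
    echo "attempt $rb_n failed"
    # Wait only when another attempt remains, then double the delay.
    if [ "$rb_n" -lt "$rb_max" ]; then
      sleep "$rb_delay"
      rb_delay=$((rb_delay * 2))
    fi
    rb_n=$((rb_n + 1))
  done
  echo "marked irrevocable after $rb_max failed attempts"
  return 1
}
```

For example, revoke_with_backoff 6 1 false reports six failed attempts, waiting 1, 2, 4, 8, and 16 seconds between them, and then prints the final "marked irrevocable" line. The growing gaps are why several minutes can pass before all six attempts appear in the log.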
For this scenario, after several minutes have elapsed, you can check the logs again to determine if the expiration manager has attempted to revoke the lease at least 6 times.
NOTE: The time taken for revocation attempts is considerable because Vault uses exponential backoff to avoid overloading the PostgreSQL server with revocation requests.
$ docker logs 2>&1 learn-vault | grep 'failed to revoke lease' | wc -l
6
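A single total is enough when only one lease is failing. If several leases fail at once, counting attempts per lease ID is more useful. The following is a minimal sketch, assuming the log line layout shown earlier; attempts_per_lease is a hypothetical helper name.

```shell
# attempts_per_lease: read Vault server log lines on stdin and print
# "<failed attempts> <lease_id>" for each lease with failed revocations.
# Hypothetical helper; it assumes the log line layout shown in this tutorial.
attempts_per_lease() {
  sed -n 's/.*failed to revoke lease: lease_id=\([^ ]*\).*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{print $1, $2}'
}
```

You could run it against the scenario container with: docker logs 2>&1 learn-vault | attempts_per_lease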
Once you have observed that 6 revocation attempts have occurred and failed, use the vault CLI to report on the irrevocable leases.
$ vault read sys/leases/count type=irrevocable
Key            Value
---            -----
counts         map[database_23ec392d:1]
lease_count    1
The result is one irrevocable lease associated with the database secrets engine accessor 23ec392d.
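If you want the per-mount counts in a form that is easier to scan or feed into another tool, you can split the counts map into one line per mount accessor. The following is a minimal sketch that parses the human-readable table output shown above, which is a brittle approach; the vault read command also accepts -format=json if you prefer structured output. The function name irrevocable_by_mount is a hypothetical helper.

```shell
# irrevocable_by_mount: read the table output of
#   vault read sys/leases/count type=irrevocable
# on stdin and print "<mount accessor> <count>" per line.
# Hypothetical helper; it assumes the "counts  map[key:value ...]" layout.
irrevocable_by_mount() {
  sed -n 's/^counts *map\[\(.*\)\] *$/\1/p' \
    | tr ' ' '\n' \
    | awk -F: 'NF == 2 {print $1, $2}'
}
```

For example: vault read sys/leases/count type=irrevocable | irrevocable_by_mount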
You can clean up leases by revoking them based on their prefix.
In this case, the prefix corresponds to the path you have observed in the lease ID, database/creds/db-dba.
Attempt to revoke the irrevocable lease by its prefix.
$ vault write -force sys/leases/revoke-prefix/database/creds/db-dba
This fails with an error that is similar to the one logged when the expiration manager cannot revoke the lease.
Error writing data to sys/leases/revoke-prefix/database/creds/db-dba: Error making API request.

URL: PUT http://localhost:8200/v1/sys/leases/revoke-prefix/database/creds/db-dba
Code: 400. Errors:

* failed to revoke "database/creds/db-dba/k7yLujVyreaN8HWXNKnpWTz1" (1 / 1): failed to revoke entry: resp: (*logical.Response)(nil) err: dial tcp [::1]:5432: connect: connection refused
How can this irrevocable lease be cleaned up, then?
You can use the Revoke Force API, instead.
Try to forcibly revoke the lease.
$ vault write -force sys/leases/revoke-force/database/creds/db-dba
Success! Data written to: sys/leases/revoke-force/database/creds/db-dba
CAUTION: This operation will revoke all leases at the specified prefix.
Now attempt to list irrevocable leases again, and you should find that the 1 lease has now been forcibly revoked.
$ vault read sys/leases/count type=irrevocable
Key            Value
---            -----
counts         map[]
lease_count    0
Note that when you revoke large batches of leases and want the request to return only after revocation has completed, rather than after the revocation has been queued, you can change the sync parameter to true.
You can confirm token leases are revoked and cleaned up by listing the path, and noting that leases are no longer found.
$ vault list sys/leases/lookup/auth/$PREFIX
No value found at sys/leases/lookup/auth/$PREFIX
In addition to exploring the Vault server logs for indications of lease revocation issues, there are a number of key Vault telemetry metrics which you can monitor and alert on related to the expiration manager.
|Description|Unit|Type|
|---|---|---|
|Time taken to fetch lease times|ms|summary|
|Time taken to fetch lease times by token|ms|summary|
|Number of all leases which are eligible for eventual expiry|leases|gauge|
|Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via |leases|gauge|
|Count of lease expirations|leases|counter|
|Total pending revocation jobs|leases|sample|
|Total pending revocation jobs by auth method|leases|sample|
|Count of lease expirations|leases|counter|
|Time taken for lease to get to the front of the revoke queue|ms|summary|
|Count of lease expiration errors|errors|counter|
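To spot-check these metrics without a full monitoring stack, you can filter the expiration manager samples out of a Prometheus-format metrics dump. The following is a minimal sketch, assuming that expiration manager metrics are exposed with a vault_expire prefix in Prometheus text format and that your server has Prometheus-format telemetry available. The function name expire_metrics is a hypothetical helper.

```shell
# expire_metrics: read Prometheus text-format metrics on stdin and keep
# only the sample lines for expiration manager metrics. Comment lines
# (# HELP, # TYPE) are dropped because they do not start with the prefix.
# Hypothetical helper; the vault_expire prefix is an assumption.
expire_metrics() {
  # grep exits non-zero when nothing matches; treat that as an empty result.
  grep '^vault_expire' || true
}
```

One possible way to feed it, assuming the metrics endpoint is reachable with your token: curl --silent --header "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/metrics?format=prometheus" | expire_metrics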
Follow these steps to clean up your example scenario environment.
Stop the PostgreSQL and Vault containers.
$ docker stop learn-postgres learn-vault
Remove the Docker network.
$ docker network rm learn-vault
You learned about the Vault expiration manager and lease handling behavior, along with how to identify irrevocable leases and resolve issues with them.
You also learned about some key Vault telemetry metrics which you can monitor and alert on that are related to the expiration manager and lease handling.