Troubleshooting Vault
Troubleshooting is a fundamental task for Vault operators. However, diagnosing a Vault error can be complex: Vault connects to so many other systems that it can be difficult to determine what went wrong, yet doing so quickly and efficiently is critical.
Vault often emits a cluster of errors at once, and getting to the root of the issue may take some time. There are a few general steps, however, that you can take to gather as much information as possible about the error, determine what's responsible for it (Vault, a third-party service, the UI, the API, etc.), and then fix it. This article runs through a few general approaches for making sense of Vault errors by reproducing the error, looking at the logs, checking the error's source, and looking at our external resources.
- Vault Logs
- Troubleshoot the storage backend
- Troubleshoot common HTTP API and client errors
- Troubleshooting approach
- Troubleshooting tools
- Help and Reference
Vault Logs
Vault has two types of logs - Vault server operational logs and audit logs. The audit logs record every request made to Vault as well as the response sent from Vault. The server logs are operational logs that provide insights into what the server is doing internally and in the background as Vault runs.
Logging is extremely useful when you are troubleshooting because it provides context for the error. You can see the Vault server configuration, as well as the actions Vault tried to take in the moments that precede the error, which provides an insight into fixing it.
Audit Logs
Audit devices are the components in Vault that are responsible for managing audit logs. Every request to Vault and response from Vault goes through the configured audit devices. This provides a simple way to integrate Vault with multiple audit logging destinations of different types.
The generated audit log contains every authenticated interaction with Vault including errors. There is an audit log entry for each request and its response, a compressed JSON object that looks like this:
Note
The log output is pretty printed with jq for readability. Notice that sensitive information such as the client token value is obfuscated with HMAC-SHA256 by default to prioritize safety over availability.
Enable an audit device
When a Vault server is first started, no auditing is enabled. Audit devices must be enabled by a privileged user whose policy must include the following rules:
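As a minimal sketch of such rules, the helper below writes a policy file covering the sys/audit paths. The function name, the file name, and the policy name audit-admin are hypothetical, and the capability lists assume that managing audit devices requires sudo because these paths are root-protected.

```shell
# Hypothetical helper: write a policy that allows managing audit devices.
write_audit_policy() {
  cat > "$1" <<'EOF'
# Manage audit devices (root-protected, so "sudo" is assumed to be required)
path "sys/audit/*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}

# List enabled audit devices
path "sys/audit" {
  capabilities = ["read", "sudo"]
}
EOF
}

# Usage, assuming a token that is allowed to write policies:
#   write_audit_policy audit-admin.hcl
#   vault policy write audit-admin audit-admin.hcl
```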
To enable an audit device, execute the vault audit enable command.
Example:
The following command enables the file audit device at the file/ path. The output logs are stored in the /vault/vault-audit.log file.
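A minimal sketch of that command, wrapped in a hypothetical helper so the log path is a parameter. It assumes VAULT_ADDR and a sufficiently privileged token are already set; with no -path flag, the device is mounted at the default file/ path.

```shell
# Hypothetical helper: enable a file audit device at the default "file/" path.
enable_file_audit() {
  vault audit enable file file_path="$1"
}

# Usage:
#   enable_file_audit /vault/vault-audit.log
#   vault audit list -detailed    # confirm the device and its options
```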
As a best practice, enable multiple audit devices for your production servers; this way, you have some audit trace even if one of the audit devices becomes unavailable.
You can also use vault audit list -detailed to list enabled audit devices and get the full path for audit device options.
Errors encountered when enabling audit devices
You could potentially encounter errors when enabling an audit device. These are some of the most common errors with associated root causes.
If you attempt to enable a filesystem based audit device, but do not specify a log file path, the following error is emitted to the standard error output:
The following error is also logged to the Vault server operational log:
If you attempt to enable a filesystem based audit device, but the vault process user does not have access to the log file path, the following error is emitted to the standard error output:
The following error is also logged to the Vault server operational log:
If an error occurred with your request or response, the error message is included in the error field's value.
To quickly find a list of all non-empty and non-null error fields from the log, use the following command (where $AUDIT_LOG_FILE is the actual filename of the Vault audit device log you are analyzing):
If this command returns nothing, then there are no errors in the log.
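One way to write such a filter is sketched below with jq. The helper name is hypothetical, and the sketch assumes each audit log line is a JSON object with top-level time and error fields.

```shell
# Hypothetical helper: print the time and message of every audit entry
# whose "error" field is neither null nor empty (tab-separated).
audit_errors() {
  jq -r 'select(.error != null and .error != "") | [.time, .error] | @tsv' "$1"
}

# Usage:
#   audit_errors "$AUDIT_LOG_FILE"
```

Entries without an error field are skipped because jq evaluates a missing field as null.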
Note
When you run Vault in production, you absolutely should enable audit devices. However, keep in mind that if audit devices are enabled and Vault cannot write to any of them for any reason, Vault will not complete requests. Also, don't forget that audit logging introduces performance overhead, since every request and response must be recorded.
Vault Server Logs
When the Vault server is starting up, it logs the configuration information such as listener ports, logging level, storage backend type, and Vault version that you are running.
Once the server is started, the rest of the log entries include the time, the log level (e.g., INFO), the log source, and the log message. Even if you can't fix the error, these logs will be invaluable in troubleshooting.
In the server logs, you'll find errors at the ERR log level, but you may find further context at the WARN level as well as in the preceding and surrounding log entries.
Server Log Level
To specify the Vault server's log level, you can do one of the following:
- Use the -log-level CLI command flag
- Set the VAULT_LOG_LEVEL environment variable
- Specify the log_level parameter in the server configuration file
Supported values (in descending order of detail) are trace, debug, info, warn, and err. The default log level is info.
Using the CLI command
When starting the Vault server via CLI, pass the -log-level flag to specify the log level.
VAULT_LOG_LEVEL environment variable
Set the log level in an environment variable.
Server configuration file
Specify the log_level parameter in the server configuration file.
Note
The log level specified in the server configuration file can be overridden by the CLI or the VAULT_LOG_LEVEL environment variable.
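The three options can be sketched as follows. The helper names and the configuration file path in the usage comments are hypothetical; the flag, variable, and parameter names are the ones listed above.

```shell
# Option 1: pass the log level as a CLI flag.
start_with_flag() {
  vault server -config="$1" -log-level="$2"
}

# Option 2: set the log level through the environment for this invocation.
start_with_env() {
  VAULT_LOG_LEVEL="$2" vault server -config="$1"
}

# Option 3: set it in the server configuration file instead, with a line like:
#   log_level = "debug"

# Usage (the path is an assumption; use your own configuration file):
#   start_with_flag /etc/vault.d/vault.hcl debug
#   start_with_env  /etc/vault.d/vault.hcl trace
```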
Changing the log level
When you change the log level by editing the server configuration file or the VAULT_LOG_LEVEL environment variable value, the change won't take effect until the Vault server is restarted. When you have an HA cluster, apply the change on the standby nodes first, and then on the active node last. This ensures that if the active node fails and one of the standby nodes becomes the new active node, it has the desired level of server logs.
Finding server logs on Linux systems
On modern systemd based Linux distributions, the journald daemon will capture Vault's log output automatically to the system journal. Assuming your Vault service is named vault, use a command like this to retrieve only the Vault-specific log entries from the system journal:
If your Vault systemd service is not named vault or you're unsure of the service name, then you can use a more generic command:
The output should go back to the system boot time and will sometimes also include restarts of Vault. If the output from the above includes log lines prefixed with vault[NNNN]:, then you've found the operational logs.
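Both lookups can be sketched with journalctl as below. The helper names are hypothetical, and the generic variant simply filters the whole journal for the vault[NNNN]: prefix, which is one of several reasonable ways to do it.

```shell
# Entries for a named systemd unit since the last boot (defaults to "vault").
vault_journal() {
  journalctl -b --no-pager -u "${1:-vault}"
}

# Generic variant when the unit name is unknown: scan the whole journal
# for lines carrying the "vault[NNNN]:" prefix.
vault_journal_any() {
  journalctl -b --no-pager | grep 'vault\[[0-9][0-9]*]:'
}

# Usage:
#   vault_journal
#   vault_journal_any
```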
To package these logs for sharing, you can execute a command such as:
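A sketch of one such command, wrapped in a hypothetical helper. The archive naming scheme (host name plus a UTC timestamp) is an assumption chosen for illustration.

```shell
# Hypothetical helper: compress the journal entries of a unit (default "vault")
# into /tmp and print the path of the resulting archive.
package_vault_logs() {
  archive="/tmp/$(uname -n)-$(date -u +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
  journalctl -b --no-pager -u "${1:-vault}" | gzip -9 > "$archive"
  printf '%s\n' "$archive"
}

# Usage:
#   package_vault_logs
```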
This will generate a compressed log file in the /tmp directory:
Not finding the server logs?
If you don't find these vault[NNNN] lines in your output, your Vault startup script could instead be sending the log output elsewhere. To find it, take a look into the Vault systemd unit, which is often (but not always) located at /etc/systemd/system/vault.service. If you notice something similar to the following:
Then Vault is likely storing its operational logging in the static file located at /var/log/vault.log.
If Vault is not operating on Linux or is not operating on a systemd-based Linux distribution, it could be configured to log to the system log via a facility like logger, and so Vault's logs could be part of the main system logs in these locations:
Docker
Logs from Vault Docker containers can be retrieved with the docker logs command:
Where vault0 is the container name.
To grab all Vault logs from a container and compress them, use a command line like:
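Both Docker steps can be sketched as follows. The helper names and the archive naming scheme are hypothetical; the 2>&1 redirect is there because docker logs replays the container's stderr on stderr, which would otherwise bypass the pipe.

```shell
# Print all logs of a container, merging stderr into stdout.
container_logs() {
  docker logs "$1" 2>&1
}

# Compress all logs of a container into /tmp and print the archive path.
package_container_logs() {
  archive="/tmp/$1-$(date -u +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
  docker logs "$1" 2>&1 | gzip -9 > "$archive"
  printf '%s\n' "$archive"
}

# Usage:
#   container_logs vault0
#   package_container_logs vault0
```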
Kubernetes
Logs from Vault Kubernetes pods can be retrieved with the kubectl logs command:
Where vault-55bcb779b4-8mfn6 is the pod name.
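A small sketch of that lookup. The helper name and the optional namespace argument are additions for illustration; without a namespace, kubectl uses the one from your current context.

```shell
# Hypothetical helper: fetch logs for a Vault pod, optionally in a namespace.
vault_pod_logs() {
  if [ -n "${2:-}" ]; then
    kubectl logs --namespace "$2" "$1"
  else
    kubectl logs "$1"
  fi
}

# Usage:
#   vault_pod_logs vault-55bcb779b4-8mfn6
```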
Troubleshoot the storage backend
Vault offers a number of configurable storage options (e.g. Consul, MySQL, etc.), and the root cause of a Vault failure may be the storage backend.
When Vault encounters an outage, you may need to troubleshoot the storage backend as well.
If using Consul as the storage backend, refer to the Consul Troubleshooting tutorial.
Troubleshoot common HTTP API and client errors
Users of the Vault HTTP API or CLI can encounter some fairly common errors or warnings, which are fortunately straightforward to diagnose and resolve. The following are some of the most commonly encountered client errors.
Missing client token
Here is an example of this error when attempting to list enabled secrets engines through the HTTP API's /sys/mounts endpoint, which requires authentication.
This error can occur either when using the HTTP API and not passing in a valid "X-Vault-Token" header value or when using the CLI without a cached token that the token helper can load. This cached token is typically in a .vault-token file in the user home directory, and written there by the token helper after a successful authentication with Vault.
The simplest way to immediately resolve the first example is to include a valid "X-Vault-Token" header value in the request. This example does that and also adds the --silent option and pipes the output to jq for a clean and compact listing.
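A sketch of such a request. The helper name is hypothetical, and it assumes VAULT_ADDR and VAULT_TOKEN are set in the environment; the jq filter here just pretty-prints the whole response.

```shell
# Hypothetical helper: list enabled secrets engines through the HTTP API.
list_secrets_engines() {
  curl --silent \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/mounts" | jq .
}

# Usage:
#   export VAULT_ADDR=... VAULT_TOKEN=...
#   list_secrets_engines
```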
The command now successfully returns the results.
With the CLI, the error will appear as in this example.
To resolve this issue for the CLI, you need to authenticate against Vault and cache a new token with the token helper.
Here is a simple example using the username and password auth method to get a new Vault token and cache it locally. Substitute whichever auth method you normally use.
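A sketch of that login follows. The helper name and the username in the usage comment are hypothetical, and it assumes the userpass auth method is enabled at its default path. The CLI prompts for the password because none is supplied on the command line.

```shell
# Hypothetical helper: authenticate with userpass; on success the token
# helper caches the new token locally.
login_userpass() {
  vault login -method=userpass username="$1"
}

# Usage:
#   login_userpass alice
#   vault secrets list
```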
Now, try the command line to list secrets engines again.
The command succeeds because there is now a cached token value again, which you can check like this.
Note
This command will print your current Vault token to the screen.
server gave HTTP response to HTTPS client
Here is an example of the error when attempting to enable a KV version 2 secrets engine in a new Vault server that was started in dev mode.
This issue is frequently encountered in non-production environments, and occurs because the Vault server is operating with TLS disabled, but the CLI by default attempts to use a TLS-enabled connection to the server (note the "https" protocol in the Post from the error message).
Note
TLS is always disabled when the server runs in dev mode. It can also be disabled on a non-production server whose configuration explicitly sets the tls_disable configuration option to "true".
To immediately resolve this issue, export a VAULT_ADDR environment variable that explicitly sets the HTTP protocol instead of HTTPS, like this.
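For a dev mode server listening on its default address, the export could look like the sketch below. The address is an assumption; adjust the host and port to match your server.

```shell
# Point the CLI at a non-TLS listener (dev mode default address assumed).
export VAULT_ADDR='http://127.0.0.1:8200'

# Then retry the failing command, for example:
#   vault secrets enable -version=2 kv
```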
Now, upon trying the command again, it will succeed:
Troubleshooting approach
Reproduce the bug
Review the Vault configuration and environment as shown in the Vault server logs. If possible, try to reproduce the error in a clean environment and a new vault storage state. Try reproducing the bug as cleanly as possible; some errors in Vault can be temporary.
Source of the error
Determine if the error is coming from the Vault UI or the API, or if it's from Vault or a third-party service. For example, if an AWS backend is being used, is the error coming from the AWS API? If the issue is observed in the UI, check the network inspector to understand the API call and response. This should help you ascertain whether it is an API or a UI error.
If it's from Vault, check whether the parameters in your request are mentioned in the error at all, then check the documentation for those parameters. Remember that the audit logs can provide insight into every request that came into Vault.
During the troubleshooting, you may need the raw audit data with no hashing. To collect the raw data, you can enable an audit device with the log_raw=true parameter.
Reproduce the error to generate the audit log with raw data.
After collecting the information you need, be sure to disable the raw audit:
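The enable and disable steps can be sketched as a pair of helpers. The helper names, the file-raw mount path, and the log file path in the usage comment are hypothetical; log_raw=true is the parameter named above. Raw logs contain unhashed secrets, so keep the window short and protect the file.

```shell
# Enable a temporary file audit device that logs without hashing.
enable_raw_audit() {
  vault audit enable -path=file-raw file file_path="$1" log_raw=true
}

# Disable it again once the needed data has been collected.
disable_raw_audit() {
  vault audit disable file-raw
}

# Usage:
#   enable_raw_audit /vault/vault-audit-raw.log
#   ...reproduce the error...
#   disable_raw_audit
```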
Vault policies
When you receive the 403 permission denied error, review the policies. Permission denied errors are often the result of a policy path mismatch.
You can use the vault token capabilities command to check allowed operations against a path.
Example:
Create a token with the policy you want to test.
Using the token with the policy attached, check the capabilities against the path in question.
This example shows that the client token has no permission (deny) against the transit/decrypt/phone-number path, which explains why Vault returned the permission denied error when the application tried to invoke the endpoint.
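The two steps above can be combined into one helper, sketched below. The helper name, the policy name in the usage comment, and the five-minute TTL are assumptions; the short TTL just keeps the test token from lingering.

```shell
# Hypothetical helper: create a short-lived token with the given policy and
# print its capabilities on the given path.
check_policy_path() {
  token="$(vault token create -field=token -ttl=5m -policy="$1")" || return 1
  vault token capabilities "$token" "$2"
}

# Usage:
#   check_policy_path my-app-policy transit/decrypt/phone-number
```

An output of deny means no attached policy grants any capability on that path.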
Note
Some API endpoints are root-protected and require the sudo capability in addition to the usual capabilities. Refer to the Vault Policies tutorial.
Search the Vault GitHub and Google Group
Often, the issue you encountered is a known issue that has already been fixed or has a documented workaround. Search the Vault GitHub repository as well as our Google Group.
Also, search the Vault Changelog to see if the issue was fixed in a newer version.
If you are comfortable reading the source code, you can search for a particular error string in the Vault repository.
Narrowing down to the particular Vault version branch to match the version that you are running may speed up your search.
Troubleshooting tools
The following are HashiCorp supported tools that you can use to enhance your troubleshooting workflows.
Vault debug tool
The vault debug command can be executed on a Vault server node for a specific period of time, recording information about the node, its cluster and its host environment. The information collected is packaged and written to the user specified path.
To create a debug package using default duration (2 minutes) and interval (30 seconds) in the current directory capturing all applicable targets, execute the command with no parameter.
The output name scheme is vault-debug-<time-stamp>, which gets written to the current directory. To specify an output location and file name other than the default, use the -output flag.
To create a debug package that captures at a 1-minute interval for 10 minutes, execute the following command:
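A sketch of that invocation, with the interval and duration as parameters. The helper name is hypothetical; its fallbacks mirror the defaults described above (30 seconds and 2 minutes).

```shell
# Hypothetical helper: capture a debug package with a given interval and
# duration, falling back to the documented defaults.
capture_debug() {
  vault debug -interval="${1:-30s}" -duration="${2:-2m}"
}

# Usage (1-minute interval for 10 minutes):
#   capture_debug 1m 10m
```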
The generated debug package contents may look similar to the following.
First, untar the file.
List the extracted files and folders.
Note
Certain endpoints that this command uses require ACL permissions to access. If not permitted, the information from these endpoints will not be part of the output. The command uses the Vault address and token as specified via the login command, environment variables, or CLI flags.
Vault Metrics
The debug package contains Vault metrics data (metric.json).
To learn more about these metrics, refer to the Vault Telemetry documentation for the unit of measurement and definition.