Every operation with Vault is an API request or response and these requests and responses can be logged in detail by enabling one or more audit devices.
As applications and users make requests to Vault, those requests and their responses are written to the audit device as described in the audit device documentation. While the audit device is writable, Vault operates normally and continues to respond to requests.
It is important that audit devices are writable by Vault, and that they do not block Vault from connecting. In circumstances with blocking devices (network blocks, etc.), Vault can stop responding to requests until it can write to the audit device.
This tutorial provides examples of blocked audit device behavior, related Vault operational log messages, and suggestions for resolution.
The behavior of Vault by design is to no longer service requests if it cannot write to at least one enabled audit device.
The example diagram has been updated to show a blocked audit device condition. Vault has enabled a socket audit device at 127.0.0.1:9090, but that device is currently unreachable.
As a result, applications and users can make requests of Vault, but they will not be serviced until the socket audit device is again writable by Vault. Please review the blocked audit devices section of the audit devices documentation for more details.
NOTE: It is a general operational best practice to enable multiple audit devices when possible for redundancy and availability.
When you encounter a blocked audit device, your first goal should be to restore Vault's ability to write to the device.
The following sections provide details about discovering and resolving blocked audit devices by type.
A common condition that can arise to block a file audit device is lack of capacity on the storage device containing the audit device log file. If the storage capacity is exhausted, the audit device will effectively be blocked and Vault will stop servicing requests until sufficient storage capacity is made available.
To diagnose a blocked file audit device, check the Vault operational log output for ERROR level lines from the audit and core subsystems that reference failing to log responses, as shown in this example:
[ERROR] audit: backend failed to log response: backend=file/ error="write /mnt/log/vault-audit.log: file already closed"
[ERROR] core: failed to audit response: request_path=pki/issue/example-dot-com error="1 error occurred: * no audit backend succeeded in logging the response"
These types of log entries occur in pairs as shown in the example and will repeat for every request made to Vault until the file audit device issue has been resolved.
By analyzing the log lines, you can also determine the path to the device in question. In this example it is /mnt/log/vault-audit.log, so the partition at /mnt/log can then be inspected with operating system tools such as df to determine if it has indeed exhausted capacity or if there is another issue.
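The log analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not part of Vault: the function name `parse_audit_failure` and both regular expressions are assumptions modeled on the example log lines in this tutorial, and the field layout may differ between Vault versions.

```python
import re

# Pattern for the "audit" subsystem line shown in the example above.
# The layout is inferred from that example and is an assumption.
AUDIT_FAILURE = re.compile(
    r'\[ERROR\]\s+audit: backend failed to log (?:request|response): '
    r'backend=(?P<backend>\S+)\s+error="?(?P<error>.*?)"?\s*$'
)
# File write errors in the example have the form "write <path>: <reason>".
WRITE_PATH = re.compile(r'^write (?P<path>/\S*?): ')


def parse_audit_failure(line):
    """Return (backend, path_or_None, error) for an audit failure line, else None."""
    match = AUDIT_FAILURE.search(line)
    if match is None:
        return None
    error = match.group("error")
    path_match = WRITE_PATH.match(error)
    path = path_match.group("path") if path_match else None
    return match.group("backend"), path, error
```

Feeding the first example line to `parse_audit_failure` yields the backend `file/` and the path `/mnt/log/vault-audit.log`, which is the file whose parent filesystem you would then inspect.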
If you examine the /mnt/log device using df, you can observe the current capacity.
$ df --human-readable --exclude-type=tmpfs --exclude-type=devtmpfs
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/vagrant--vg-root   62G  2.0G   57G   4% /
/dev/sdb                      390M  390M     0 100% /mnt/log
In this example output, the /mnt/log filesystem associated with the device /dev/sdb shows a Use% of 100%, so Vault no longer has sufficient capacity to write to the audit device log. As an operator troubleshooting the issue, you should prioritize increasing available capacity on the target file audit device to restore Vault service to applications and users.
It is important to focus on restoring use of the storage device as specified in the Vault configuration for the audit device at the configured path. It is not possible to add an additional device or change the path to a storage device with suitable capacity as a means to resolve a blocked file audit device issue.
TIP: Use a monitoring solution to alert at high watermark values for the storage devices that contain Vault file audit device logs, so that you can respond proactively before an exhausted storage device impacts Vault operations. Where possible, also enable multiple audit devices so that Vault can write to another device when one is blocked.
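As a rough illustration of a high watermark check, the following Python sketch reads filesystem usage with the standard library's `shutil.disk_usage`. The function names and the 80 percent default threshold are assumptions chosen for the example. Note that it computes used bytes over total bytes, which can differ slightly from the Use% column that df reports on filesystems with reserved blocks.

```python
import shutil


def usage_percent(total, used):
    """Percentage of capacity in use, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return round(100.0 * used / total, 1)


def over_watermark(path, threshold_percent=80.0):
    """Return (percent_used, exceeded) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = usage_percent(usage.total, usage.used)
    return percent, percent >= threshold_percent
```

A scheduled job could call `over_watermark("/mnt/log")` and raise an alert when the second value is true, well before the device reaches 100%.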
An extreme example of a blocked socket audit device is a closed TCP socket.
For example, suppose that you have a log aggregation agent or other service listening locally on Vault servers through a TCP socket and you have enabled a Vault socket audit device to communicate with this agent.
If the agent process is stopped, crashes, or otherwise stops listening on the TCP socket, then Vault will no longer write to that device since it cannot connect to its socket. This situation would require the operator to restore service to the listening process or otherwise enable Vault to connect to it again.
TIP: In cases where Vault is writing to a socket audit device across a routed network, filters or firewalls could interfere with communication and cause a blocked audit device so keep this in mind when working with socket audit devices.
To diagnose a blocked socket audit device, check the Vault operational log output for ERROR level lines from the audit subsystems that reference failing to log responses as shown in this example:
[ERROR] audit: backend failed to log response: backend=socket/ error="2 errors occurred: * write tcp 127.0.0.1:59660->127.0.0.1:9090: write: broken pipe * dial tcp 127.0.0.1:9090: connect: connection refused
These log lines spell out important details about the issue.
- The socket audit device with a problem is enabled at the path socket/ (the backend value in the log line)
- There were two distinct errors:
- Vault could not write to the socket due to broken pipe
- Vault could not dial the socket due to connection refused
These details are enough to help you determine that the service is not listening on the socket and is no longer accepting connections from Vault.
The log messages will differ based on your environment, but will follow this general pattern.
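To confirm from the Vault server that nothing is accepting connections on the socket, you can attempt a TCP connection yourself. The sketch below uses only Python's standard `socket` module; the helper name `can_connect` and the two second timeout are assumptions for illustration, and the address you probe should match the one configured for your socket audit device.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # covers refused connections and timeouts
        return False, str(exc)
```

For the example in this tutorial, `can_connect("127.0.0.1", 9090)` returning a false first value with a "connection refused" reason would be consistent with the dial error in the log.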
TIP: If you have a log aggregation and analysis stack ingesting Vault operational logging, consider configuring an alert for occurrences of "no audit backend succeeded in logging the response", the message that appears in the core subsystem error lines, so that you can detect blocked device incidents early.
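If your logging stack lets you run custom matching logic, the alert condition might look something like the following Python sketch. It searches for the message text as it appears in the log examples in this tutorial and groups matches by the request_path field; the function name and the grouping are assumptions, not a feature of Vault or of any particular log platform.

```python
from collections import Counter

# Message text as it appears in the example log lines of this tutorial.
ALERT_TEXT = "no audit backend succeeded in logging the response"


def count_audit_failures(lines):
    """Count alert-worthy lines per request_path ("unknown" if absent)."""
    counts = Counter()
    for line in lines:
        if ALERT_TEXT not in line:
            continue
        path = "unknown"
        for field in line.split():
            if field.startswith("request_path="):
                path = field[len("request_path="):]
                break
        counts[path] += 1
    return counts
```

Grouping by request path also shows which Vault endpoints are affected, which helps when only some requests produce oversized or failing audit entries.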
Based on alerts and operational log output, your primary task for resolution is to unblock the audit device; you can expect Vault to return to service immediately after doing so.
The syslog audit device can become blocked when there are issues with process capabilities, user permissions, or the size of data Vault is attempting to write to the device.
In the first two cases, it is important to ensure that the Vault process user has the correct capabilities, such as CAP_SYSLOG, and permissions where required to write to the system log.
In the last case, certain cumulative Vault data, such as certificate revocation lists and LDAP groups, can grow large enough that, when logged to remote syslog audit devices, they exceed the allowed UDP datagram size required by the syslog protocol specification.
To diagnose a blocked syslog audit device, check the Vault operational log output for ERROR level lines from the audit subsystems that reference failing to log responses as shown in the following examples.
If the syslog service is not accessible on the system, you can encounter an error like this when Vault attempts to write to it (and also when you first attempt to enable it as shown in the following example).
[ERROR] enable audit mount failed: path=syslog/ error="Unix syslog delivery error"
[ERROR] core: failed to audit response: request_path=sys/audit/syslog error=1 error occurred: * no audit backend succeeded in logging the response
This example occurred when attempting to enable the audit device. The error Unix syslog delivery error can mean that the syslog service is not enabled on the host or that Vault is not able to access it. This is often due to restrictions imposed by the SELinux configuration on the host, for example.
If the items being written to the syslog audit device are larger than the syslog host's configured maximum socket send buffer, then you can encounter errors such as this example.
[ERROR] audit: backend failed to log response: backend=syslog/ error=write unixgram ->/var/run/log: write: message too long
[ERROR] core: failed to audit response: request_path=pki/certs/ error=1 error occurred: * no audit backend succeeded in logging the response
In this example the audit device is not permanently blocked, but the messages cannot be logged to the device because they are too large. This has an intermittent blocking effect, so it must be resolved.
The critical clue in these types of log messages is the error "write: message too long", and this is a good error to focus or alert on. The syslog socket being written to in the example is /var/run/log, but it could differ in your environment. The second error line contains another clue as to the source of the issue: the request_path value contains pki/certs, so the most likely cause is that a large number of PKI certificates were listed and this list operation resulted in Vault attempting to write an excessively large audit device log entry.
You can consult the Linux Programmer's Manual page for socket(7) to learn more about raising the value of the kernel tunable /proc/sys/net/core/wmem_default to increase the socket send buffer size for handling larger message bodies. Switching to a TCP-based syslog listener can also help with larger log messages.
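Before changing the tunable, it can help to compare its current value with the size of the entries Vault is trying to log. The Python sketch below reads the value from /proc/sys/net/core/wmem_default and performs a simple size comparison. Both function names are assumptions for illustration, and the comparison is only a rough heuristic: the kernel applies its own accounting overhead, so an entry close to the limit may still be rejected.

```python
from pathlib import Path

WMEM_DEFAULT = Path("/proc/sys/net/core/wmem_default")


def read_send_buffer_default(tunable=WMEM_DEFAULT):
    """Return the default socket send buffer size in bytes, or None if unreadable."""
    try:
        return int(Path(tunable).read_text().strip())
    except (OSError, ValueError):
        return None


def fits_send_buffer(entry, buffer_bytes):
    """Rough check: is the UTF-8 encoded entry no larger than the buffer?"""
    return len(entry.encode("utf-8")) <= buffer_bytes
```

If `read_send_buffer_default()` returns a value much smaller than the audit entries produced by large list operations, that supports raising the tunable or moving to a TCP-based listener.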
write: message too long errors can be indicative of a deeper issue with your Vault use cases, such as a poorly maintained revocation list (CRL) in a PKI Secrets Engine, an excessively large list of LDAP groups, or other list that can accumulate many members over time. It is usually worth digging further into these kinds of issues to determine if there is a use case that is perhaps not ideally implemented to help with mitigation of excessively large audit device log entries.