Recover blocked audit devices

Every operation with Vault is an API request or response. When you enable one or more audit devices, Vault logs these requests and responses in detail.

Example socket audit device configuration diagram

As applications and users make requests to Vault, it writes those requests and responses to the audit device as described in the audit device documentation. In this case, Vault operates as expected and continues to respond to requests.

Audit devices must be writable by Vault, and they must not block Vault from connecting. When a device blocks (for example, because of a network block), Vault can stop responding to requests until it can write to the audit device again.

This tutorial provides examples of blocked audit device behavior, related Vault operational log messages, and suggestions for resolution.

Audit device filters

Starting in Vault 1.16.0, you can enable audit devices with a filter option that Vault uses to evaluate audit entries and decide whether to write them to the log. Check whether your audit devices use filtering, and make any changes needed to expose the log fields that you need to monitor for your use case.

The documentation covers Vault filtering concepts, filtering audit entries, and how to enable audit filters.

Blocked audit device behavior

Example blocked socket audit device diagram

If any enabled audit devices fail in a blocking manner, Vault requests will not complete until you recover from the blocked device.

The example diagram shows a blocked audit device condition. Vault has enabled a socket audit device at 127.0.0.1:9090, but that device is not reachable.

This means applications and users can make requests to Vault, but Vault will not respond until it can write to the socket audit device again. Review the blocked audit devices section of the audit devices documentation for more details.

Note

It's an operational best practice to enable redundant audit devices where possible to improve availability.
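As a minimal sketch of that practice, assuming a reachable, unsealed Vault server and a token that is allowed to manage audit devices, you could enable two file audit devices. The mount paths `file_primary` and `file_secondary` and both log file locations are hypothetical, and the log files ideally live on different storage devices.

```shell
# Hypothetical mount paths and log file locations; adjust for your environment.
vault audit enable -path=file_primary file file_path=/var/log/vault/audit.log
vault audit enable -path=file_secondary file file_path=/mnt/audit2/vault-audit.log

# List the enabled audit devices and their options to confirm both are present.
vault audit list -detailed
```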

When you discover a blocked audit device, your first goal should be to restore Vault's ability to write to the device.

The following sections share details about discovering and resolving blocked audit devices by type.

Blocked file audit device

Example file audit device with blocked storage

A common cause of a blocked file audit device is a lack of capacity on the storage device that contains the audit device log file. If storage capacity is inadequate, the audit device will block, and Vault will stop servicing requests until you make adequate storage capacity available.

Example log message

To diagnose a blocked file audit device, check the Vault operational log output for ERROR level lines from the audit and core subsystems that resemble this example:

[ERROR] audit: backend failed to log response: backend=file/ error="write /mnt/log/vault-audit.log: file already closed"
[ERROR] core: failed to audit response: request_path=pki/issue/example-dot-com error="1 error occurred: * no audit backend succeeded in logging the response"

These types of log entries occur in pairs as shown in the example and will repeat for every request made to Vault until you resolve the file audit device issue.

By analyzing the log lines, you can also learn the path to the device in question. In this example the path is /mnt/log/vault-audit.log, so inspect the partition at /mnt/log with operating system tools such as df to determine whether it has exhausted its capacity or whether there is another issue.
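The following sketch shows one way to pull those details out of the operational log automatically. It assumes the log lines look like the example in this section; the function names are hypothetical, and the patterns may need adjusting if your Vault version words the messages differently.

```shell
# Print the mount path of every audit device named in
# "backend failed to log" lines read from standard input.
failed_audit_backends() {
  grep 'audit: backend failed to log' \
    | sed -n 's|.*backend=\([^ ]*\).*|\1|p' \
    | sort -u
}

# Print the file paths reported in file audit device write errors.
failed_audit_file_paths() {
  grep 'audit: backend failed to log' \
    | sed -n 's|.*error="write \(/[^:]*\):.*|\1|p' \
    | sort -u
}

# Example, assuming the operational log is in a hypothetical file:
# failed_audit_file_paths < /var/log/vault/operational.log
```

Pass the directory of any reported path to df to check the capacity of the filesystem that holds it.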

If you examine the /mnt/log device using df, you can observe the current capacity.

$ df --human-readable --exclude-type=tmpfs --exclude-type=devtmpfs
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/vagrant--vg-root   62G  2.0G   57G   4% /
/dev/sdb                      390M  390M     0 100% /mnt/log

In this example output, the /mnt/log filesystem on the device /dev/sdb shows a Use% of 100%, so there is not enough capacity for Vault to write to the audit device log. To restore Vault service to applications and users, increase the available capacity on the storage device that holds the file audit device log.

Focus on restoring use of the storage device at the path configured for the audit device in Vault. You cannot resolve a blocked file audit device by adding another audit device or by changing the path to a storage device with suitable capacity, because Vault cannot complete those requests while the device is blocked.

Tip

Use a monitoring solution to alert at high watermark values for the storage devices that contain Vault file audit device logs. This enables you to respond proactively to a storage device that is filling up and prevent it from impacting Vault operations. Where possible, enable redundant audit devices so that Vault can try writing to other devices when one blocks.
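If you do not yet have a monitoring solution in place, a small script run on a schedule can serve as a stopgap. This is a sketch only: the function names are hypothetical, the 80 percent default threshold is an arbitrary illustration, and the parsing assumes the device name in the df output contains no spaces.

```shell
# Read `df -P` output on standard input and print the Capacity column of the
# first filesystem as a bare integer (for example, 100 for "100%").
use_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when the filesystem that contains path $1 is at or above
# the threshold $2 (defaults to 80 percent).
above_watermark() {
  used=$(df -P "$1" | use_percent)
  [ -n "$used" ] && [ "$used" -ge "${2:-80}" ]
}

# Example for the /mnt/log mount used in this tutorial:
# above_watermark /mnt/log 80 && echo "WARNING: /mnt/log is nearly full"
```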

Blocked socket audit device

You can configure a socket audit device to use a TCP, UDP, or Unix socket type.

An extreme example of a blocked socket audit device is a closed TCP socket.

Suppose you have a log aggregation agent listening locally on Vault servers through a TCP socket, and you have enabled a Vault socket audit device to communicate with this agent.

If the agent process stops, crashes, or otherwise is not listening on the TCP socket, then Vault can no longer write to that device. In this situation, the operator must restore service to the listening process or otherwise enable Vault to connect to it again.

Tip

In cases where Vault is writing to a socket audit device across a routed network, a filter or firewall can interfere with communication, and cause a blocked audit device. Keep this in mind when working with socket audit devices.

Example log message

To diagnose a blocked socket audit device, check the operational log output for ERROR lines from the audit subsystem that reference failing to log responses, as shown in this example:

[ERROR] audit: backend failed to log response: backend=socket/ error="2 errors occurred:
* write tcp 127.0.0.1:59660->127.0.0.1:9090: write: broken pipe
* dial tcp 127.0.0.1:9090: connect: connection refused

These log lines spell out important details about the issue.

- The socket audit device enabled at the path socket/ has a problem.
- There were two distinct errors:
  - Vault can't write to the socket due to a broken pipe.
  - Vault can't dial the socket because the connection was refused.

These details are enough to conclude that the service is not listening on the socket and is no longer accepting connections from Vault.
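You can confirm that conclusion from the Vault server with a quick connection test. The sketch below assumes bash and the GNU coreutils timeout command are available; the function name is hypothetical, and 127.0.0.1:9090 is the socket audit device address used in this tutorial.

```shell
# Succeed (exit 0) if something accepts TCP connections on host $1, port $2.
# Uses bash's /dev/tcp redirection and gives up after 3 seconds.
tcp_listening() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

if tcp_listening 127.0.0.1 9090; then
  echo "a listener is accepting connections on 127.0.0.1:9090"
else
  echo "nothing is accepting connections on 127.0.0.1:9090"
fi
```

A failed test only shows that the connection could not be opened from this host. Across a routed network, a firewall or filter can produce the same result as a stopped listener.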

The log messages will differ based on your environment, but will follow this general pattern.

Tip

If you have a log aggregation and analysis stack ingesting Vault operational logging, consider configuring an alert for instances of "no audit backend succeeded in logging the response" to detect blocked device incidents.

Based on alerts and operational log output, your primary task for resolution is to unblock the audit device; you can expect Vault to return to service immediately after doing so.
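Without a log analysis stack, you can approximate such an alert by counting matching lines in the operational log. This sketch matches the message text from the example log lines in this tutorial, which may differ between Vault versions; the function name and the log file path in the usage comment are hypothetical.

```shell
# Count the lines on standard input which report that no audit device
# managed to log the entry. Prints 0 when there are none.
count_audit_failures() {
  grep -c 'no audit backend succeeded in logging the response'
}

# Example, assuming the operational log is in a hypothetical file:
# count_audit_failures < /var/log/vault/operational.log
```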

Blocked syslog audit device

The syslog audit device can become blocked when there are issues with process capabilities, user permissions, or the size of the data Vault is attempting to write to the device.

For capabilities and permissions, ensure that the Vault process user has the correct capabilities, such as CAP_SYSLOG, and the permissions required to write to the system log.

For data size, cumulative Vault data such as CRLs and LDAP groups can grow enough to exceed the UDP datagram size allowed by the syslog protocol specification. This applies specifically when Vault logs to remote syslog devices.

Example log messages

To diagnose blocked syslog audit devices, check the Vault operational log output for ERROR lines from the audit subsystem that reference failing to log responses, as shown in the following examples.

If syslog is not accessible on the system, you can observe errors like this when Vault tries to write to it, and when you first try to enable it.

[ERROR] enable audit mount failed: path=syslog/ error="Unix syslog delivery error"
[ERROR] core: failed to audit response: request_path=sys/audit/syslog error=1 error occurred:
* no audit backend succeeded in logging the response

This example occurred when attempting to enable the audit device. The error Unix syslog delivery error can mean that the syslog service is not enabled on the host or that Vault is not able to access it. This can often be due to restrictions imposed by SELinux configuration on the host, for example.

If the items written to the syslog audit device are larger than the syslog host's configured maximum socket send buffer, then Vault logs errors such as this example.

[ERROR] audit: backend failed to log response:  backend=syslog/ error=write unixgram ->/var/run/log: write: message too long
[ERROR] core: failed to audit response: request_path=pki/certs/ error=1 error occurred:
* no audit backend succeeded in logging the response

In this example the audit device itself is not blocked, but Vault cannot log the messages to the device because they are too large, which has the same blocking effect. An operator must resolve the issue.

The critical clue in these log messages is the error "write: message too long", which is a good error to focus or alert on. Vault writes to the syslog socket /var/run/log in this example, but that path can differ in your environment. The second error line holds another clue about the source of the issue in the request_path value.

Since the request_path value includes pki/certs, the usual cause is listing a great number of PKI certificates. The list operation results in Vault trying to write an excessively large audit device log entry.

Consult the Linux Programmer's Manual page for socket(7) to learn more about raising the kernel tunable /proc/sys/net/core/wmem_default to increase the socket send buffer size. Switching to a TCP-based syslog listener can also help with larger log messages.
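As a rough sketch, you can compare the size of your largest audit entries with the current send buffer default before changing anything. This assumes a Linux host and that you have sample audit entries available one per line, for example from a redundant file audit device. The function names are hypothetical, and the comparison is approximate because kernel bookkeeping overhead makes the usable size somewhat smaller than the configured value.

```shell
# Print the kernel's default socket send buffer size in bytes (Linux only).
wmem_default() {
  cat /proc/sys/net/core/wmem_default
}

# Print the length in bytes of the longest line read from standard input.
max_line_bytes() {
  LC_ALL=C awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Succeed (exit 0) when an entry of $1 bytes is smaller than a buffer of $2 bytes.
fits_send_buffer() {
  [ "$1" -lt "$2" ]
}

# Example with a hypothetical file audit log:
# largest=$(max_line_bytes < /var/log/vault/audit.log)
# fits_send_buffer "$largest" "$(wmem_default)" || echo "entries may be too long"
#
# Raising the default requires root; 1048576 is only an illustrative value:
# sysctl -w net.core.wmem_default=1048576
```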

Tip

These "write: message too long" errors can point to deeper issues with your use cases, such as an unmaintained certificate revocation list (CRL) in a PKI secrets engine or a large list of LDAP groups. Investigate these issues to determine whether a use case is not ideally implemented, and mitigate large audit device log entries.

Help and reference

Inspect data in Integrated Storage

Query audit device logs