pki health-check

The pki health-check command verifies the health of the given PKI secrets engine mount against an optional configuration.

This runs with the permissions of the given token, reading various APIs from the mount and /sys against the given Vault server

Mounts need to be specified with any namespaces prefixed in the path, e.g., ns1/pki.

Examples

Performs a basic health check against the pki-root mount:

$ vault pki health-check pki-root/

Configuration can be specified using the -health-config flag:

$ vault pki health-check -health-config=mycorp-root.json pki-root/

Using the -list flag will show the list of health checks and any known configuration values (with their defaults) that will be run against this mount:

$ vault pki health-check -list pki-root/

Usage

The following flags are unique to this command:

-default-disabled - When specified, results in all health checks being disabled by default unless enabled by the configuration file explicitly. The default is false, meaning all default-enabled health checks will run.
-health-config (string: "") - Path to JSON configuration file to modify health check execution and parameters.
-list - When specified, no health checks are run, but all known health checks are printed. Still requires a positional mount argument. The default is false, meaning no listing is printed and health checks will execute.
-return-indicator (string: "default") - Behavior of the return value (exit code) of this command:
- permission, for exiting with a non-zero code when the tool lacks permissions or has a version mismatch with the server;
- critical, for exiting with a non-zero code when a check returns a critical status in addition to the above;
- warning, for exiting with a non-zero status when a check returns a warning status in addition to the above;
- informational, for exiting with a non-zero status when a check returns an informational status in addition to the above;
- default, for the default behavior based on severity of message and only returning a zero exit status when all checks have passed and no execution errors have occurred.

This command respects the -format parameter to control the presentation of output sent to stdout. Fatal errors that prevent health checks from executing may not follow this formatting.

Return status and output

This command returns the following exit codes:

0 - Everything is good.
1 - Usage error (check CLI parameters).
2 - Informational message from a health check.
3 - Warning message from a health check.
4 - Critical message from a health check.
5 - A version mismatch between health check and Vault Server occurred, preventing one or more health checks from being fully run.
6 - A permission denied message was returned from Vault Server for one or more health checks.

Note that an exit code of 5 (due to a version mismatch) is not necessarily fatal to the health check. For example, the crl_validity_period health check will return an invalid version warning when run against Vault 1.11 as no Delta CRL exists for this version of Vault, but this will not impact its ability to check the complete CRL.

Each health check outputs one or results in a list. This list contains a mapping of keys (status, status_code, endpoint, and message) to values returned by the health check. An endpoint may occur in more than one health check and is not necessarily guaranteed to exist on the server (e.g., using wildcards to indicate all matching paths have the same result). Tabular form elides the status code, as this is meant to be consumed programatically.

These correspond to the following health check status values:

status not_applicable / status code 0: exit code 0.
status ok / status code 1: exit code 0
status informational / status code 2: exit code 2.
status warning / status code 3: exit code 3.
status critical / status code 4: exit code 4.
status invalid_version / status code 5: exit code 5.
status insufficient_permissions / status code 6: exit code 6.

Health checks

The following health checks are currently implemented. More health checks may be added in future releases and may default to being enabled.

CA validity period

Name: ca_validity_period

Accessed APIs:

LIST /issuers (unauthenticated)
READ /issuer/:issuer_ref/json (unauthenticated)

Config Parameters:

root_expiry_critical (duration: 182d) - for a duration within which the root's lifetime is considered critical
intermediate_expiry_critical (duration: 30d) - for a duration within which the intermediate's lifetime is considered critical
root_expiry_warning (duration: 365d) - for a duration within which the root's lifetime is considered warning
intermediate_expiry_warning (duration: 60d) - for a duration within which the intermediate's lifetime is considered warning
root_expiry_informational (duration: 730d) - for a duration within which the root's lifetime is considered informational
intermediate_expiry_informational (duration: 180d) - for a duration within which the intermediate's lifetime is considered informational

This health check will check each issuer in the mount for validity status, returning a list. If a CA expires within the next 30 days, the result will be critical. If a root CA expires within the next 12 months or an intermediate CA within the next 2 months, the result will be a warning. If a root CA expires within 24 months or an intermediate CA within 6 months, the result will be informational.

Remediation steps:

Perform a CA rotation operation to check for CAs that are about to expire.
Migrate from expiring CAs to new CAs.
Delete any expired CAs with one of the following options:

Run tidy manually with vault write <mount>/tidy tidy_expired_issuers=true.
Use the Vault API to call delete issuer.

CRL validity period

Name: crl_validity_period

Accessed APIs:

LIST /issuers (unauthenticated)
READ /config/crl (optional)
READ /issuer/:issuer_ref/crl (unauthenticated)
READ /issuer/:issuer_ref/crl/delta (unauthenticated)

Config Parameters:

crl_expiry_pct_critical (int: 95) - the percentage of validity period after which a CRL should be considered critically close to expiry
delta_crl_expiry_pct_critical (int: 95) - the percentage of validity period after which a Delta CRL should be considered critically close to expiry

This health check checks each issuer's CRL for validity status, returning a list. Unlike CAs, where a date-based duration makes sense due to effort required to successfully rotate, rotating CRLs are much easier, so a percentage based approach makes sense. If the chosen percentage exceeds that of the grace_period from the CRL configuration, an informational message will be issued rather than OK.

For informational purposes, it reads the CRL config and suggests enabling auto-rebuild CRLs if not enabled.

Remediation steps:

Use vault write to enable CRL auto-rebuild:

$ vault write <mount>/config/crl auto_rebuild=true

Hardware-Backed root certificate

Name: hardware_backed_root

APIs:

Config Parameters:

enabled (boolean: false) - defaults to not being run.

This health check checks issuers for root CAs backed by software keys. While Vault is secure, for production root certificates, we'd recommend the additional integrity of KMS-backed keys. This is an informational check only. When all roots are KMS-backed, we'll return OK; when no issuers are roots, we'll return not applicable.

Read more about hardware-backed keys within Vault Enterprise Managed Keys

Root certificate issued Non-CA leaves

Name: root_issued_leaves

APIs:

LIST /issuers (unauthenticated)
READ /issuer/:issuer_ref/pem (unauthenticated)
LIST /certs
READ /certs/:serial (unauthenticated)

Config Parameters:

certs_to_fetch (int: 100) - a quantity of leaf certificates to fetch to see if any leaves have been issued by a root directly.

This health check verifies whether a proper CA hierarchy is in use. We do this by fetching certs_to_fetch leaf certificates (configurable) and seeing if they are a non-issuer leaf and if they were signed by a root issuer in this mount. If one is found, we'll issue a warning about this, and recommend setting up an intermediate CA.

Remediation steps:

Restrict the use of sign, sign-verbatim, issue, and ACME APIs against the root issuer.
Create an intermediary issuer in a different mount.
Have the root issuer sign the new intermediary issuer.
Issue new leaf certificates using the intermediary issuer.

Role allows implicit localhost issuance

Name: role_allows_localhost

APIs:

Config Parameters: (none)

Checks whether any roles exist that allow implicit localhost based issuance (allow_localhost=true) with a non-empty allowed_domains value.

Remediation steps:

Set allow_localhost to false for all roles.
Update the allowed_domains field with an explicit list of allowed localhost-like domains.

Role allows Glob-Based wildcard issuance

Name: role_allows_glob_wildcards

APIs:

Config Parameters:

allowed_roles (list: nil) - an allow-list of roles to ignore.

Check each role to see whether or not it allows wildcard issuance and glob domains. Wildcards and globs can interact and result in nested wildcards among other (potentially dangerous) quirks.

Remediation steps:

Split any role that need both of allow_glob_domains and allow_wildcard_certificates to be true into two roles.
Continue splitting roles until both of the following are true for all roles:
- The role has allow_glob_domains or allow_wildcard_certificates, but not both.
- Roles with allow_glob_domains and allow_wildcard_certificates are the only roles required for all SANs on the certificate.
Add the roles that allow glob domains and wildcards to allowed_roles so Vault ignores them in future checks.

Role sets `no_store=false` and performance

Name: role_no_store_false

APIs:

Config Parameters:

allowed_roles (list: nil) - an allow-list of roles to ignore.

Checks each role to see whether no_store is set to false.

Warning

Vault will provide warnings and performance will suffer if you have a large number of certificates without temporal CRL auto-rebuilding and set no_store to true.

Remediation steps:

Update none-ACME roles with no_store=false. NOTE: Roles used for ACME issuance must have no_store set to true.
Set your certificate lifetimes as short as possible.
Use BYOC revocations to revoke certificates as needed.

Accessibility of audit information

Name: audit_visibility

APIs:

READ /sys/mounts/:mount/tune

Config Parameters:

ignored_parameters (list: nil) - a list of parameters to ignore their HMAC status.

This health check checks whether audit information is accessible to log consumers, validating whether our list of safe and unsafe audit parameters are generally followed. These are informational responses, if any are present.

Remediation steps:

Use vault secrets tune to set the desired audit parameters:

$ vault secrets tune \
  -audit-non-hmac-response-keys=certificate \
  -audit-non-hmac-response-keys=issuing_ca \
  -audit-non-hmac-response-keys=serial_number \
  -audit-non-hmac-response-keys=error \
  -audit-non-hmac-response-keys=ca_chain \
  -audit-non-hmac-request-keys=certificate \
  -audit-non-hmac-request-keys=issuer_ref \
  -audit-non-hmac-request-keys=common_name \
  -audit-non-hmac-request-keys=alt_names \
  -audit-non-hmac-request-keys=other_sans \
  -audit-non-hmac-request-keys=ip_sans \
  -audit-non-hmac-request-keys=uri_sans \
  -audit-non-hmac-request-keys=ttl \
  -audit-non-hmac-request-keys=not_after \
  -audit-non-hmac-request-keys=serial_number \
  -audit-non-hmac-request-keys=key_type \
  -audit-non-hmac-request-keys=private_key_format \
  -audit-non-hmac-request-keys=managed_key_name \
  -audit-non-hmac-request-keys=managed_key_id \
  -audit-non-hmac-request-keys=ou \
  -audit-non-hmac-request-keys=organization \
  -audit-non-hmac-request-keys=country \
  -audit-non-hmac-request-keys=locality \
  -audit-non-hmac-request-keys=province \
  -audit-non-hmac-request-keys=street_address \
  -audit-non-hmac-request-keys=postal_code \
  -audit-non-hmac-request-keys=permitted_dns_domains \
  -audit-non-hmac-request-keys=policy_identifiers \
  -audit-non-hmac-request-keys=ext_key_usage_oids \
  -audit-non-hmac-request-keys=csr \
   <mount>

ACL policies allow problematic endpoints

Name: policy_allow_endpoints

APIs:

Config Parameters:

allowed_policies (list: nil) - a list of policies to allow-list for access to insecure APIs.

This health check checks whether unsafe access to APIs (such as sign-intermediate, sign-verbatim, and sign-self-issued) are allowed. Any findings are a critical result and should be rectified by the administrator or explicitly allowed.

Allow If-Modified-Since requests

Name: allow_if_modified_since

APIs:

READ /sys/internal/ui/mounts

Config Parameters: (none)

This health check verifies if the If-Modified-Since header has been added to passthrough_request_headers and if Last-Modified header has been added to allowed_response_headers. This is an informational message if both haven't been configured, or a warning if only one has been configured.

Remediation steps:

Update allowed_response_headers and passthrough_request_headers for all policies with vault secrets tune:

$ vault secrets tune \
    -passthrough-request-headers="If-Modified-Since" \
    -allowed-response-headers="Last-Modified" \
    <mount>

Update ACME-specific headers with vault secrets tune (if you are using ACME):

$ vault secrets tune \
    -passthrough-request-headers="If-Modified-Since" \
    -allowed-response-headers="Last-Modified" \
    -allowed-response-headers="Replay-Nonce" \
    -allowed-response-headers="Link" \
    -allowed-response-headers="Location" \
    <mount>

Auto-Tidy disabled

Name: enable_auto_tidy

APIs:

READ /config/auto-tidy

Config Parameters:

interval_duration_critical (duration: 7d) - the maximum allowed interval_duration to hit critical threshold.
interval_duration_warning (duration: 2d) - the maximum allowed interval_duration to hit a warning threshold.
pause_duration_critical (duration: 1s) - the maximum allowed pause_duration to hit a critical threshold.
pause_duration_warning (duration: 200ms) - the maximum allowed pause_duration to hit a warning threshold.

This health check verifies that auto-tidy is enabled, with sane defaults for interval_duration and pause_duration. Any disabled findings will be informational, as this is a best-practice but not strictly required, but other findings w.r.t. interval_duration or pause_duration will be critical/warnings.

Remediation steps

Use vault write to enable auto-tidy with the recommended defaults:

$ vault write <mount>/config/auto-tidy \
    enabled=true \
    tidy_cert_store=true \
    tidy_revoked_certs=true \
    tidy_acme=true \
    tidy_revocation_queue=true \
    tidy_cross_cluster_revoked_certs=true \
    tidy_revoked_cert_issuer_associations=true

Tidy hasn't run

Name: tidy_last_run

APIs:

READ /tidy-status

Config Parameters:

last_run_critical (duration: 7d) - the critical delay threshold between when tidy should have last run.
last_run_warning (duration: 2d) - the warning delay threshold between when tidy should have last run.

This health check verifies that tidy has run within the last run window. This can be critical/warning alerts as this can start to seriously impact Vault's performance.

Remediation steps:

Schedule a manual run of tidy with vault write:

$ vault write <mount>/tidy \
    tidy_cert_store=true \
    tidy_revoked_certs=true \
    tidy_acme=true \
    tidy_revocation_queue=true \
    tidy_cross_cluster_revoked_certs=true \
    tidy_revoked_cert_issuer_associations=true

Review the tidy status endpoint, vault read <mount>/tidy-status for additional information.
Re-configure auto-tidy based on the log information and results of your manual run.

Too many certificates

Name: too_many_certs

APIs:

Config Parameters:

count_critical (int: 250000) - the critical threshold at which there are too many certs.
count_warning (int: 50000) - the warning threshold at which there are too many certs.

This health check verifies that this cluster has a reasonable number of certificates. Ideally this would be fetched from tidy's status or a new metric reporting format, but as a fallback when tidy hasn't run, a list operation will be performed instead.

Remediation steps:

Verify that tidy ran recently with vault read:
```
$ vault read <mount>/tidy-status
```

Schedule a manual run of tidy with vault write:

$ vault write <mount>/tidy \
  tidy_cert_store=true \
  tidy_revoked_certs=true \
  tidy_acme=true \
  tidy_revocation_queue=true \
  tidy_cross_cluster_revoked_certs=true \
  tidy_revoked_cert_issuer_associations=true

Enable auto-tidy.
Make sure that you are not renewing certificates too soon. Certificate lifetimes should reflect the expected usage of the certificate. If the TTL is set appropriately, most certificates renew at approximately 2/3 of their lifespan.
Consider setting the no_store field for all roles to true and use BYOC revocations to avoid storage.

Enable ACME issuance

Name: enable_acme_issuance

APIs:

READ /config/acme
READ /config/cluster
LIST /issuers (unauthenticated)
READ /issuer/:issuer_ref/json (unauthenticated)

Config Parameters: (none)

This health check verifies that ACME is enabled within a mount that contains an intermediary issuer, as this is considered a best-practice to support a self-rotating PKI infrastructure.

Review the ACME Certificate Issuance API documentation to learn about enabling ACME support in Vault.

ACME response headers

Name: allow_acme_headers

APIs:

READ /sys/internal/ui/mounts

Config Parameters: (none)

This health check verifies if the "Replay-Nonce, Link, and Location headers have been added to allowed_response_headers, when the ACME feature is enabled. The ACME protocol will not work if these headers are not added to the mount.

Remediation steps:

Use vault secrets tune to add the missing headers to allowed_response_headers:

$ vault secrets tune \
  -allowed-response-headers="Last-Modified" \
  -allowed-response-headers="Replay-Nonce" \
  -allowed-response-headers="Link" \
  -allowed-response-headers="Location" \
  <mount>

pki health-check

Examples

Usage

Return status and output

Health checks

CA validity period

CRL validity period

Hardware-Backed root certificate

Root certificate issued Non-CA leaves

Role allows implicit localhost issuance

Role allows Glob-Based wildcard issuance

Role sets no_store=false and performance

Accessibility of audit information

ACL policies allow problematic endpoints

Allow If-Modified-Since requests

Auto-Tidy disabled

Tidy hasn't run

Too many certificates

Enable ACME issuance

ACME response headers

Role sets `no_store=false` and performance