// secrets management · dynamic credentials · PKI · auth methods · policies · senior → principal
create, read, update, delete, list, sudo, deny. deny overrides all others.
Policies use path globs: secret/data/app/* matches any sub-path. + matches a single path segment: secret/data/+/config matches any team's config.
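The two wildcard forms can be illustrated with a small matcher. This is a simplified Python sketch, not Vault's implementation: it assumes a trailing `*` is a prefix glob and a `+` segment matches exactly one path segment, and the function name `path_matches` is hypothetical.

```python
def path_matches(pattern: str, path: str) -> bool:
    """Return True if `path` is covered by a Vault-style policy path pattern.

    Simplified model: a trailing '*' is a prefix glob, and a '+' segment
    matches exactly one path segment.
    """
    is_glob = pattern.endswith("*")
    if is_glob:
        pattern = pattern[:-1]  # drop the trailing '*', keep the prefix
    pat_segs = pattern.split("/")
    path_segs = path.split("/")

    if is_glob:
        if len(path_segs) < len(pat_segs):
            return False
        *whole, last = pat_segs
        # Every full segment must match (or be '+'); the last one is a string prefix.
        head_ok = all(p in ("+", s) for p, s in zip(whole, path_segs))
        return head_ok and path_segs[len(whole)].startswith(last)

    # No glob: segment counts must be equal, '+' stands in for any one segment.
    if len(path_segs) != len(pat_segs):
        return False
    return all(p in ("+", s) for p, s in zip(pat_segs, path_segs))
```

With this model, `secret/data/+/config` covers `secret/data/payments/config` but not `secret/data/payments/api/config`, because `+` never spans more than one segment.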
Root policy is all-powerful and cannot be modified. Tokens with root policy should be used only for bootstrapping, then revoked.
vault lease revoke) or automatic on expiry. Revoke-prefix revokes all leases under a path — useful for incident response.
Lease expiry without renewal causes app outages — the credential disappears mid-flight. Vault Agent or Vault SDK with background renewal handles this automatically.
template mode so it writes credentials to a file and atomically refreshes them, or use the Vault SDK's Renewer / LifetimeWatcher which handles renewal, grace periods, and re-fetch after max TTL automatically.
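The renew-or-refetch decision that such a renewal loop has to make can be sketched as follows. This is an illustrative Python model, not the SDK's actual algorithm: the `Lease` fields, the two-thirds renewal point, and the 30-second grace period are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    issued_at: float   # start of the current lease period (epoch seconds)
    ttl: float         # seconds granted by the last issue or renew
    max_expiry: float  # hard deadline imposed by the max TTL (epoch seconds)


def next_action(lease: Lease, now: float,
                renew_fraction: float = 2 / 3, grace: float = 30.0) -> str:
    """Decide what a background renewer should do at time `now`."""
    expiry = min(lease.issued_at + lease.ttl, lease.max_expiry)
    if now >= expiry:
        return "refetch"   # lease is gone; a new credential is needed
    if lease.max_expiry - now <= grace:
        return "refetch"   # renewal cannot extend past max TTL
    if now >= lease.issued_at + lease.ttl * renew_fraction:
        return "renew"     # renew well before expiry, not at the last second
    return "wait"
```

Renewing at a fraction of the TTL rather than just before expiry leaves room for retries if Vault is briefly unreachable.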
secret/data/<path> and metadata at secret/metadata/<path>. KV v1 uses secret/<path>. Policies and API calls written for v1 break on v2 mounts because the path is different. The Vault CLI (vault kv) and the UI insert the data/ segment for you, but the HTTP API does not, and policies always need the full path: a v1 policy granting secret/myapp/* does not cover secret/data/myapp/* on a v2 mount.
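The path difference is mechanical, so it can be captured in a small helper. This is a hedged Python sketch; `kv2_paths` is a hypothetical name, and it assumes the mount name is the first path component, as in the examples above.

```python
def kv2_paths(mount: str, v1_path: str) -> dict:
    """Map a KV v1 style path to the KV v2 data and metadata API paths."""
    mount = mount.strip("/")
    prefix = mount + "/"
    if not v1_path.startswith(prefix):
        raise ValueError(f"{v1_path!r} is not under mount {mount!r}")
    key = v1_path[len(prefix):]  # the part after the mount name
    return {
        "data": f"{mount}/data/{key}",
        "metadata": f"{mount}/metadata/{key}",
    }
```

The same mapping applies to policy paths: a rule written for the v1 path has to be rewritten against the `data` (and, for listing, `metadata`) path.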
vault operator init has unrestricted access and bypasses all policies. It should be used only to create initial auth methods and policies, then revoked immediately (vault token revoke <root>). Keep the unseal keys / recovery keys in separate, secure, offline storage (separate people or HSMs). Leaving a root token active is an audit finding and a major blast-radius risk.
vault token create -orphan) or via auth method login, which produces orphan tokens by default.
| Auth method | How it works / when to use |
|---|---|
| AppRole | Role ID (non-secret) + Secret ID (single-use or TTL-bound). Best for CI/CD, services without a platform identity. Secret ID delivery is the hard problem. |
| Kubernetes | Pod's service account JWT validated against the K8s API server. Zero secret to manage — platform provides identity. Standard for K8s-hosted workloads. |
| AWS IAM | Instance identity document or IAM role credentials signed by AWS. No secret to pre-provision. Automatic for EC2, ECS, Lambda. |
| GCP / Azure | GCP service account JWT or Azure managed identity token. Same zero-secret model as AWS IAM for GCP/Azure workloads. |
| OIDC / JWT | Validates JWT issued by any OIDC provider (Okta, Auth0, Google). For human users via browser SSO or machine tokens from GitHub Actions, GitLab CI. |
| LDAP / AD | Authenticates against an existing LDAP/Active Directory. Maps AD groups to Vault policies. For human operators in enterprises. |
| Token | Direct token auth. Used internally and for bootstrapping. Avoid issuing long-lived tokens to services; prefer auth method login. |
| Secrets engine | What it provides |
|---|---|
| KV v2 | Versioned static secrets. Paths: secret/data/ |
| Database | Dynamic creds for Postgres, MySQL, MongoDB, Cassandra, etc. Vault creates/drops DB users. Each lease = unique user. Rotation on revoke. |
| PKI | Certificate Authority. Issue X.509 certs with configurable TTL, SANs, key type. Supports root and intermediate CAs. Integrate with ACME for automatic renewal. |
| Transit | Encrypt, decrypt, sign, verify, HMAC. Keys never leave Vault. Key rotation + re-wrapping without decrypting all data. |
| AWS | Generates STS tokens or IAM user credentials scoped to an IAM policy or role. TTL-bound, auto-revoked. Avoids long-lived IAM access keys. |
| SSH | Signs SSH public keys with a CA key. Servers trust the CA. Short-lived signed certs replace static authorized_keys. Full audit trail for SSH access. |
| TOTP | Generates and validates TOTP codes. Use for MFA workflows inside applications. |
| Capability | Meaning |
|---|---|
| create | Create data at a path where none exists yet. Few backends distinguish create from update, so writes usually need both. |
| read | Read the secret or credential at a path. |
| update | Overwrite an existing secret. |
| delete | Delete a secret or revoke a credential. |
| list | List keys at a path (no values returned). |
| sudo | Access root-protected paths; required for some admin ops. |
| deny | Explicitly deny access; overrides all other capabilities. |
| Variable / parameter | Purpose |
|---|---|
| VAULT_ADDR | Vault server URL. Export in shell or pass to every CLI call. |
| VAULT_TOKEN | Active token. Set by vault login; used by CLI and SDK. |
| VAULT_NAMESPACE | Target namespace (Vault Enterprise). Omit for root namespace. |
| VAULT_CACERT | Path to CA cert for TLS verification of Vault's TLS cert. |
| max_lease_ttl | Server-wide ceiling on lease TTL. Role TTL cannot exceed this. |
| default_lease_ttl | Default TTL if role doesn't specify one. |
| audit_non_hmac_request_keys | Keys whose values are logged in plaintext in audit log (handle carefully). |
| Dimension | HashiCorp Vault | AWS Secrets Manager | Azure Key Vault | GCP Secret Manager |
|---|---|---|---|---|
| Dynamic credentials | Yes — database, AWS, GCP, Azure, PKI, SSH | Yes — RDS, Redshift rotation via Lambda | Limited — managed identities, no DB dynamic creds | No — static secrets only |
| Encryption as a service | Yes — Transit engine (encrypt/decrypt/sign) | No | Yes — key operations via Key Vault Keys | No |
| Auth methods | 20+ (K8s, AWS, GCP, OIDC, LDAP, AppRole…) | IAM only | Azure AD / managed identity only | GCP IAM only |
| Multi-cloud / on-prem | Yes — cloud-agnostic, runs anywhere | AWS only | Azure only | GCP only |
| Audit logging | Built-in, every request, pluggable backends | CloudTrail | Azure Monitor / Event Hub | Cloud Audit Logs |
| Policy model | Path-based HCL policies, fine-grained | IAM policies (resource-based) | Azure RBAC + access policies | IAM conditions |
| Operational burden | High — you run and manage the cluster | Low — fully managed | Low — fully managed | Low — fully managed |
| Namespaces / multi-tenancy | Yes (Enterprise) — hierarchical namespaces | Per-account isolation | Per-vault isolation | Per-project isolation |
| Open source | Source-available (BSL since 2023; open-source fork: OpenBao) | No | No | No |
| Best for | Multi-cloud, on-prem, rich dynamic creds, EaaS | AWS-only workloads, simple secret storage | Azure-native workloads | GCP-native, simple secret storage |
1. Pod starts with a Kubernetes service account. The kubelet mounts a service account JWT at /var/run/secrets/kubernetes.io/serviceaccount/token (projected, time-limited).
2. Vault Agent (sidecar) or the app itself sends the JWT to Vault's Kubernetes auth endpoint: POST /v1/auth/kubernetes/login with {role: "myapp", jwt: "<sa-jwt>"}.
3. Vault validates the JWT by calling the Kubernetes TokenReview API. It authenticates that call with a configured reviewer service account JWT, with its own pod's service account token (when Vault runs in the cluster and disable_local_ca_jwt=false), or with the client's JWT itself. It then checks that the service account and namespace match the configured role binding.
4. Vault issues a token with the policies attached to the Kubernetes auth role. The token has a TTL (e.g., 1 h) and is an orphan token (not tied to Vault's internal token hierarchy).
5. App reads secret: GET /v1/secret/data/myapp/config with the token in X-Vault-Token. Vault checks the token's policies and returns the secret.
6. Vault Agent writes the secret to a shared memory volume (tmpfs) as a rendered template file. The app reads the file, so no Vault SDK is needed in the app.
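The two HTTP calls in this flow can be sketched with the Python standard library. The functions below only build the requests and do not send them; the function names and the Vault address used in the usage note are hypothetical, while the endpoint, body fields, and X-Vault-Token header come from the steps above.

```python
import json
import urllib.request


def build_login_request(vault_addr: str, role: str, jwt: str) -> urllib.request.Request:
    """Build the Kubernetes auth login call: POST /v1/auth/kubernetes/login."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{vault_addr}/v1/auth/kubernetes/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_read_request(vault_addr: str, token: str, path: str) -> urllib.request.Request:
    """Build a secret read: GET /v1/<path> with the Vault token in X-Vault-Token."""
    return urllib.request.Request(
        f"{vault_addr}/v1/{path}",
        method="GET",
        headers={"X-Vault-Token": token},
    )
```

In a pod, the `jwt` argument would be the contents of the mounted service account token file, and the Vault token for the second call would be taken from the login response.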
secret/<key>.
KV v2: versioned. Every write creates a new version (default retention: 10 versions). Read always returns the latest unless a version is specified. Path changes: data is at secret/data/<key>, metadata at secret/metadata/<key>. This path difference breaks v1 policies: a policy on secret/myapp/* does not cover secret/data/myapp/*.
CAS (Check-And-Set): a write guard. You must pass the current version number with your write. If it doesn't match (another writer updated concurrently), Vault rejects the write. Enable with cas_required=true on the mount to prevent blind overwrites in concurrent pipelines.
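The check-and-set rule can be modeled in memory. This Python sketch is an illustration, not Vault's storage code: `VersionedKV` and `CasMismatch` are hypothetical names, and it assumes that `cas=0` means "write only if the key does not exist yet" and `cas=N` means "write only if the current version is N".

```python
class CasMismatch(Exception):
    """Raised when the supplied version does not match the current one."""


class VersionedKV:
    def __init__(self, max_versions: int = 10, cas_required: bool = False):
        self.max_versions = max_versions
        self.cas_required = cas_required
        self._versions = {}  # key -> {version number: data}
        self._current = {}   # key -> latest version number

    def write(self, key: str, data: dict, cas=None) -> int:
        current = self._current.get(key, 0)  # 0 means "no version yet"
        if cas is None and self.cas_required:
            raise CasMismatch("this mount requires a cas value on every write")
        if cas is not None and cas != current:
            raise CasMismatch(f"expected version {cas}, current is {current}")
        version = current + 1
        versions = self._versions.setdefault(key, {})
        versions[version] = dict(data)
        self._current[key] = version
        # Retain only the newest max_versions versions.
        for old in [v for v in versions if v <= version - self.max_versions]:
            del versions[old]
        return version

    def read(self, key: str, version=None) -> dict:
        wanted = self._current[key] if version is None else version
        return self._versions[key][wanted]
```

Two pipelines that both read version 1 and then write with `cas=1` cannot both succeed: the second write sees version 2 and is rejected instead of silently overwriting the first.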
Soft delete: marks versions as deleted (hides them from reads) but retains metadata and the data itself. vault kv undelete can restore them. vault kv destroy permanently removes version data and is irreversible.

1. The application requests credentials: vault read database/creds/<role>.
2. Vault connects to the database using a privileged connection (stored in Vault's encrypted storage).
3. Vault executes the creation SQL with a generated username/password.
4. Vault returns the new credentials with a lease TTL (e.g., 1 h).
5. At TTL expiry (or explicit revocation), Vault executes the revocation SQL and the database user is dropped.
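The issue and revoke lifecycle of a dynamic credential can be simulated without a real database. This Python sketch is a toy model under stated assumptions: `FakeDatabase` and `DynamicCredsEngine` are hypothetical classes, the username format is invented for the example, and real creation and revocation would run SQL statements.

```python
import itertools
import secrets


class FakeDatabase:
    """Stand-in for a real database: just tracks which users exist."""

    def __init__(self):
        self.users = {}

    def create_user(self, name: str, password: str) -> None:
        self.users[name] = password

    def drop_user(self, name: str) -> None:
        self.users.pop(name, None)


class DynamicCredsEngine:
    def __init__(self, db: FakeDatabase, default_ttl: int = 3600):
        self.db = db
        self.default_ttl = default_ttl
        self.leases = {}  # lease_id -> (username, expiry time)
        self._counter = itertools.count(1)

    def issue(self, role: str, now: float) -> dict:
        n = next(self._counter)
        username = f"v-{role}-{n}-{secrets.token_hex(4)}"  # unique per lease
        password = secrets.token_urlsafe(24)
        self.db.create_user(username, password)
        lease_id = f"database/creds/{role}/{n}"
        self.leases[lease_id] = (username, now + self.default_ttl)
        return {"lease_id": lease_id, "username": username,
                "password": password, "lease_duration": self.default_ttl}

    def revoke(self, lease_id: str) -> None:
        username, _ = self.leases.pop(lease_id)
        self.db.drop_user(username)  # revocation removes the database user

    def expire(self, now: float) -> None:
        for lease_id, (_, expiry) in list(self.leases.items()):
            if now >= expiry:
                self.revoke(lease_id)
```

Because each lease maps to exactly one database user, revoking a single lease removes a single consumer's access and leaves the others untouched.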
Why better than static credentials:
- Blast radius: a leaked credential is usable only until its short TTL expires, typically hours, not years.
- Uniqueness: each app instance gets its own credential. Compromise of one doesn't expose all.
- Audit: every credential is tied to the lease that created it, so you know exactly which Vault token (and which Kubernetes pod / AppRole) requested it.
- No rotation ceremony: static credential rotation requires coordinating all consumers simultaneously. Dynamic creds rotate continuously without a ceremony.

1. Collect the policies attached to the token (including the default policy).
2. For the requested path, find all matching policy rules. When several path patterns match, the most specific one wins.
3. Capabilities that different policies grant on that same path are merged, and a deny there overrides all of them: you cannot grant around a deny on the same path. A deny on a broad glob does not block a grant on a more specific path.
4. If no policy grants access to the path, the request is denied.
Policy HCL example:

```hcl
path "secret/data/myapp/*" {
  capabilities = ["read", "list"]
}

path "secret/data/myapp/admin/*" {
  capabilities = ["deny"]
}
```

The token can read any myapp secret, but the admin sub-path is explicitly denied.
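The evaluation of the example policy can be sketched in Python. This is a simplified model and not Vault's actual algorithm: it handles only trailing-`*` globs, picks the most specific matching pattern (exact paths before globs, longer prefixes before shorter ones), merges capabilities for that pattern across policies, and lets deny win. The function names are hypothetical.

```python
def _matches(pattern: str, path: str) -> bool:
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])  # prefix glob
    return path == pattern                    # exact path


def _specificity(pattern: str) -> tuple:
    # Exact paths outrank globs; among globs, the longer prefix outranks.
    return (not pattern.endswith("*"), len(pattern))


def capabilities(policies: list, path: str) -> set:
    """Each policy is a dict mapping a path pattern to a list of capabilities."""
    matching = {p for policy in policies for p in policy if _matches(p, path)}
    if not matching:
        return set()  # default deny: nothing grants access
    best = max(matching, key=_specificity)
    caps = set()
    for policy in policies:
        caps |= set(policy.get(best, ()))
    return set() if "deny" in caps else caps
```

Applied to the policy above, a request for `secret/data/myapp/admin/key` matches both rules, the admin rule is more specific, and its deny leaves the token with no capabilities on that path.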
default policy: automatically attached to all tokens. By default it allows token self-lookup and renewal. Modify carefully, because it applies to every token.

{{identity.entity.name}}) so a single policy template governs many entities without duplicating policy documents. Use namespaces (Enterprise) to isolate teams: each team gets a namespace with its own auth methods, secret engines, and policies, governed by the platform team's root namespace. Treat policies as code: version them in Git, review changes, and apply with Terraform (hashicorp/vault provider).

POST /v1/auth/approle/login → Vault issues a token.
Secret ID delivery is the hard problem. Common patterns:
- Cubbyhole response wrapping: the CI/CD pipeline generates a wrapped secret ID (a single-use token that, when redeemed, returns the secret ID). The wrapped token is passed to the app; only one system can unwrap it, so if it has already been unwrapped, treat it as compromised.
- Vault Agent: runs with a bootstrap token (from a more trusted auth method like Kubernetes) to obtain and renew AppRole tokens automatically.
- Platform injection: the secrets management platform generates and injects secret IDs into the environment at deploy time.

Work systematically from the most common causes:
1. Token expired: vault token lookup -accessor <accessor>. Check expire_time. If expired, the service failed to renew; check Vault Agent logs or the SDK's renewal loop.
2. Policy changed: someone updated the policy and accidentally removed the capability. Check the audit log for policy writes around the time of failure and compare with the current policy.
3. Token revoked: check the audit log for a revoke event on the token or its parent. Parent token revocation cascades to children (unless orphan).
4. Secret engine unmounted / path changed: if the mount path changed, the policy and the application path must both be updated.
5. KV v1 → v2 migration: path changed from secret/myapp to secret/data/myapp. Policy doesn't cover the new path.
6. Namespace wrong (Enterprise): token is in a different namespace than the secret.
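Most of these checks end up in the audit log, so a small filter for denied requests is a useful triage tool. The Python sketch below assumes JSON-lines audit entries with `type`, `error`, `auth.accessor`, `request.operation`, and `request.path` fields; treat that layout as an assumption to verify against your own audit device output. Depending on configuration, the accessor in the log may be HMAC-hashed, in which case the hashed value is the one to filter on. The function name is hypothetical.

```python
import json


def denied_requests(lines, accessor: str) -> list:
    """Return (operation, path, error) for failed responses of one token accessor."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only response entries that carry an error are of interest.
        if entry.get("type") != "response" or not entry.get("error"):
            continue
        if entry.get("auth", {}).get("accessor") != accessor:
            continue
        request = entry.get("request", {})
        hits.append((request.get("operation"), request.get("path"), entry["error"]))
    return hits
```

The resulting operation and path pairs point at the capability and path that the token's policies do not cover.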
The audit log is your ground truth (confirm a device is enabled with vault audit list): it records every request, the token accessor, the path, and the result. Enable the file audit device and ship logs to your SIEM. A 403 in Vault is a token issue, a policy issue, or a path issue, nothing else.

"type":"response" "error":"*" filtered to your service's token accessor gives you an instant view of every denied request with its path, which tells you exactly which policy capability is missing. Without this, you're guessing. Add a runbook for "403 from Vault" to your service's ops docs: it's a predictable and recurring incident type.

secret/data/{env}/{team}/{service}/{key}   # e.g. secret/data/prod/payments/api/db_password

Putting environment and team at the top lets policies be scoped cleanly.
Policy pattern using templating (avoids one policy per team):

```hcl
path "secret/data/{{identity.groups.names.team-name.metadata.team}}/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
```

Map LDAP/OIDC groups to Vault identity groups. The policy is parameterized by the caller's identity, so one policy document serves all teams.
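How one templated path expands per caller can be shown with a small renderer. This Python sketch only illustrates the substitution idea; it is not how Vault resolves identity templates internally. The flat `identity` dictionary keyed by the full template parameter and the function name are assumptions made for the example.

```python
import re

# Matches {{ some.parameter }} and captures the parameter name.
_TEMPLATE = re.compile(r"\{\{\s*([^}]+?)\s*\}\}")


def render_policy_path(template: str, identity: dict) -> str:
    """Substitute {{...}} parameters in a policy path with identity values."""
    def lookup(match):
        key = match.group(1)
        if key not in identity:
            raise KeyError(f"no identity value for template parameter {key!r}")
        return identity[key]

    return _TEMPLATE.sub(lookup, template)
```

A caller whose group metadata says `team=payments` ends up with the effective path `secret/data/payments/*`, while a caller from another team gets a different path from the same policy document.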
Tiered access:
- Service tokens: read-only to their own {team}/{service}/* path
- Team admins: read/write to {team}/* (via OIDC/LDAP auth)
- Platform team: write to all paths, manage auth methods and policies
- Operators: vault system paths (sys/*) only
Namespaces (Enterprise): stronger isolation. Each namespace has its own auth, engines, and policies. The platform team's root namespace can manage child namespaces. Prevents one team's policy mistake from affecting another.

secret/{service}/* structure works for 10 services but fails at 100 when you add staging environments and want to isolate prod from non-prod. Build environment and team into the hierarchy from day one. Managing Vault config with Terraform (vault_policy, vault_auth_backend, vault_generic_secret) is non-negotiable at scale: clicking through the UI doesn't scale to hundreds of services and creates configuration drift. Treat Vault configuration exactly like infrastructure code.

exec command: Vault Agent can watch for secret version changes and run a command when the template changes (e.g., SIGHUP the service). The service re-reads the rendered file. This makes rotation a Vault-side operation with no service code change.
Avoid: rotating the secret and invalidating the old key simultaneously. Running instances will fail between the rotation and their next reload. Always have an overlap window.

Vault Agent (sidecar/daemon):
- Handles auth, token renewal, and secret templating outside the application
- Writes rendered secrets to files (tmpfs mount) or provides a local API proxy
- Application reads files or calls http://127.0.0.1:8200 (Agent's cache proxy)
- Application code has zero Vault dependency: it reads env vars or config files
- Best for: polyglot environments, apps you don't control, gradual adoption, Kubernetes

Vault SDK (direct integration):
- Application authenticates and manages its own token lifecycle
- Fine-grained control: request dynamic creds at exact callsites, handle lease renewal per credential
- Better for: Go/Java/Python services where the Vault SDK is well-supported, when you need per-operation audit identity, when dynamic creds (not static) are the primary use case
- Requires implementing LifetimeWatcher/Renewer correctly; getting this wrong causes credential expiry or token exhaustion
Hybrid: Agent for auth and token management + app uses the Agent's local cache proxy with the SDK for reads. Simplifies auth while keeping SDK flexibility. In Kubernetes, Agent as a sidecar with Kubernetes auth is the standard pattern for most services. Direct SDK integration is appropriate when the service already has Go/Java and a team experienced with the SDK.
Raft cluster setup: 3 or 5 nodes (odd number for quorum). One node is the active leader; others are standby. Standby nodes forward requests to the leader. In Vault Enterprise, performance standbys answer reads locally, and those reads can be slightly stale unless X-Vault-Index consistency tokens are used.
Quorum requirement: Raft requires a majority. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2. During a network partition, the minority side cannot elect a leader or commit writes, so it stops serving requests, which prevents split-brain writes.
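The majority arithmetic is short enough to write down. A minimal Python sketch, with hypothetical function names:

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def failures_tolerated(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum(nodes)
```

For 3 nodes the quorum is 2 and one failure is tolerated; for 5 nodes the quorum is 3 and two failures are tolerated. A 4-node cluster needs 3 for quorum and still tolerates only one failure, which is why odd cluster sizes are preferred.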
Failure modes:
- Leader crash: Raft elects a new leader from standby nodes in seconds. Auto-unseal ensures the new leader doesn't need manual unsealing. Brief downtime during election (~5–10 s).
- 2 of 3 nodes down: cluster loses quorum and stops serving requests. Manual intervention required.
- Network partition: minority nodes cannot reach quorum and stop serving. The majority partition continues. After the partition heals, minority nodes rejoin and catch up via log replication.
- Disk full on leader: Vault writes fail. The Raft log grows unboundedly without snapshots. Tune snapshot_threshold in the raft storage stanza and ensure adequate disk.
Operations: take regular Raft snapshots (vault operator raft snapshot save). Store snapshots offsite. Test restore (vault operator raft snapshot restore) — untested snapshots are not backups.
vault operator raft remove-peer) and snapshot restore. Run a "Vault DR drill" annually: simulate 2-node failure, practice restore from snapshot, measure RTO. The operations runbook for "Vault unavailable" must be rehearsed before the first production incident. The second failure mode to test is auto-unseal KMS unavailability: if AWS KMS is down during a Vault restart, Vault cannot unseal. Recovery keys cannot unseal Vault on their own, so the procedure for that scenario has to restore access to the KMS key rather than rely on key holders.

Multi-namespace architecture (Vault Enterprise) or dedicated clusters per trust boundary (open source). Namespaces provide logical isolation: each business unit or regulated environment (PCI, SOC2) gets its own namespace with independent auth methods, secret engines, and policies. The platform team's root namespace governs namespace lifecycle and cross-namespace policies.
Platform team as enabler, not bottleneck:
- Publish a Vault onboarding Terraform module: teams call the module with their app name, team, and environment. Module creates the KV mount, the Kubernetes auth role, the policy, and the identity group. No manual Vault operations required by the platform team for standard cases.
- Golden path: standardize on Vault Secrets Operator (K8s) or Vault Agent Injector. Teams don't integrate directly with Vault's HTTP API; they declare a VaultStaticSecret CR and the operator delivers the secret.
- Policy as code: all policies in Git. Pull requests for policy changes. A CI job applies changes via Terraform. Audit trail is the Git history.

Compliance segmentation: PCI-scoped secrets go in a namespace with stricter policies and a dedicated audit log forwarded to the PCI SIEM. No cross-namespace access from non-PCI namespaces. Automated access reviews quarterly.

Observability for the platform:
- Vault cluster health dashboard: sealed status, active node, Raft peer count, GC, request rate
- Token expiry trending (token_count, policy distribution)
- Audit log pipeline to SIEM with alerts on failed auth spikes

Disaster recovery: primary cluster per region, DR replication (Enterprise) or cold standby via snapshot restore. RTO target drives the architecture choice.
Immediate containment (minutes):
1. Seal the cluster: vault operator seal from a trusted node. This stops all request processing and clears the master key from memory.
2. If the attacker may have the unseal keys, take the cluster offline at the network level (security group / firewall rules) before unsealing again.
3. Revoke the root token: if not already used to escalate, vault token revoke <root>.

Blast radius assessment (hours):
4. Pull the audit log: every path the compromised token accessed, every secret read, every credential generated. The audit log is append-only and was written before the seal. Ship it to a forensics system.
5. Identify all dynamic credentials (database, cloud IAM, PKI certs) generated by the compromised token or its children. These must be revoked immediately at the upstream systems, not just in Vault.
6. Audit any Vault policies modified by the attacker; they may have broadened access before using it.

Recovery (hours to days):
7. Rotate the master key (vault operator rekey) with a new key set, so old unseal keys no longer work. Issue new unseal keys / recovery keys to key holders.
8. Rotate all static secrets stored in Vault (API keys, passwords). The attacker read them, so assume all are compromised.
9. Revoke and re-issue all Vault tokens (all token hierarchies are suspect).
10. Perform a configuration audit: compare current state against your last known-good Terraform state. Any drift is suspect.
11. Rotate Vault's TLS certificates and the storage encryption keys.

Root cause and hardening:
- How was the root token obtained? Was it never revoked after init?
- Add controls: break-glass root token procedure with alerting on every use, SIEM alert on any root-policy token login.
VaultStaticSecret and VaultDynamicSecret CRDs. Application teams declare what secrets they need; the operator syncs them to Kubernetes Secrets. Application reads from env vars or a mounted file, with zero Vault SDK in app code.

vault-app-onboarding. Inputs: app_name, team, environment. Module creates: KV mount path, Kubernetes auth role bound to the service account, policy granting read on secret/data/{team}/{app}/*, identity group. Application team runs terraform apply, with no platform team ticket.

{{identity.entity.name}} templating ensures a service can only access its own path even if it somehow acquires another service's token.

default_ttl=1h, max_ttl=4h: short enough for blast radius control, long enough to reduce churn.

BeforeAcquire callback: validate the connection's credential age against the current rendered file. On mismatch, close the connection; the pool opens a new one with the new credential. No connection is dropped mid-transaction.

max_connections to accommodate both. Drain old connections before they expire by shortening their idle timeout.
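The acquire-time credential check can be sketched with a toy pool. This Python model is an assumption-laden illustration, not a real driver integration: `Connection`, `RotatingPool`, and the `current_username` callable (standing in for reading the rendered credential file) are hypothetical, and it compares usernames rather than credential age for simplicity.

```python
class Connection:
    """Toy connection that remembers which credential opened it."""

    def __init__(self, username: str):
        self.username = username
        self.closed = False

    def close(self) -> None:
        self.closed = True


class RotatingPool:
    def __init__(self, current_username):
        # current_username: callable returning the username currently rendered
        # by Vault Agent (for example, parsed from the credentials file).
        self._current_username = current_username
        self._idle = []

    def acquire(self) -> Connection:
        wanted = self._current_username()
        while self._idle:
            conn = self._idle.pop()
            if conn.username == wanted:
                return conn
            # Stale credential: the connection is idle, so closing it
            # cannot interrupt a transaction.
            conn.close()
        return Connection(wanted)  # open a new connection with the new credential

    def release(self, conn: Connection) -> None:
        if not conn.closed:
            self._idle.append(conn)
```

Connections that are checked out when the credential rotates keep working until they are released; they are only discarded the next time the pool considers handing them out.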