Why does secrets management need an architecture decision record?

Secrets management is one of the most consequential and least-documented architectural choices in production systems. The secrets store you chose — environment variables, AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, or a managed SaaS platform — determines three structural properties that compound over time: the rotation automation model (whether credentials can be rotated without restarting application processes, and whether rotation is manual or automated on a schedule), the audit trail granularity (whether you can produce a complete access log showing which service accessed which credential at what time, essential for SOC 2 Type II and incident post-mortems), and the emergency revocation latency (how many minutes elapse between discovering a compromised credential and the moment every consumer of that credential is denied access). Without a secrets management ADR, the team responding to a credential leak at 3am discovers that the rotation procedure in the runbook describes a secrets store that was replaced six months ago, that nobody knows which services read which credentials, and that the estimated revocation time — if anyone has estimated it at all — is measured in hours rather than minutes. The ADR is also where the developer access model must be documented: how engineers working locally access secrets needed for development without the production store, and without production credentials leaking to developer laptops.

What is the difference between static secret rotation and dynamic secrets in HashiCorp Vault?

Static secret rotation is the process of changing a long-lived credential — a database password, an API key, a TLS certificate — and distributing the new value to all consumers. The credential has a fixed identity (the same username, the same key name) but a rotating value. Rotation requires that every consumer either polls for the new value on a schedule, receives a push notification that the value has changed, or is restarted to pick up the new value from the environment. AWS Secrets Manager implements static secret rotation via Lambda functions that execute on a schedule: the Lambda calls the target system to generate a new credential, stores the new value in Secrets Manager with the AWSPENDING label, tests the new credential, and then promotes it to AWSCURRENT while demoting the old value to AWSPREVIOUS. Applications that cache the secret must handle the AWSPREVIOUS value during the rotation window to avoid authentication failures while some instances hold the old value and others hold the new value. Dynamic secrets in HashiCorp Vault are a fundamentally different model: instead of rotating a shared credential, Vault generates a unique credential on demand for each requester and automatically revokes that credential when its lease expires. For database dynamic secrets, Vault connects to the database, creates a new role with a name like 'vault-1a2b3c' and the configured privileges, issues a credential with a TTL of (for example) one hour, and automatically executes the revocation SQL when the lease expires. Each application instance holds a distinct credential that is only valid for its lease duration. There is no 'shared password to rotate': when an application instance restarts, it requests a new credential and receives a new unique value. Emergency revocation of a compromised dynamic credential requires revoking the specific lease rather than rotating a shared secret — and all other application instances are unaffected because their credentials are distinct. 
The tradeoff is operational complexity: Vault must be a highly available, persistently operated service, and the application must implement lease renewal logic to extend the TTL before it expires during long-running transactions.

When should a team use AWS Secrets Manager versus HashiCorp Vault?

AWS Secrets Manager is the appropriate choice when the application runs primarily on AWS infrastructure, the team wants a fully managed service with no operational overhead for the secrets store itself, the rotation model is static secret rotation on a schedule, and the audit trail requirement is met by CloudTrail logging of GetSecretValue API calls with IAM principal attribution. Secrets Manager costs $0.40 per secret per month plus $0.05 per 10,000 API calls — predictable pricing for teams with a known number of secrets. The rotation Lambda functions provided by AWS cover common cases: RDS credentials, Redshift credentials, DocumentDB credentials, and custom Lambda for arbitrary targets. The caching SDK reduces API call costs and latency by holding the secret in application memory and refreshing it on a configurable interval or when an authentication failure indicates the cached value is stale. HashiCorp Vault is appropriate when the team needs dynamic secrets (unique credentials per consumer per lease period), when the secrets management requirement extends beyond AWS (multi-cloud or on-premises workloads), when fine-grained access policies are needed (Vault's policy language allows expressing complex allow/deny rules based on path, method, and identity metadata), or when the audit log must record every Vault API operation rather than only GetSecretValue calls. Vault's audit log includes both request and response details for every operation, making it more comprehensive than CloudTrail for secrets access auditing. HCP Vault (HashiCorp's managed Vault offering) eliminates cluster operation overhead while retaining Vault's full feature set. The decision point that most teams miss: dynamic secrets require that the target system (the database, the cloud provider, the certificate authority) supports the programmatic credential generation pattern — Vault cannot generate dynamic secrets for systems that do not expose an API for creating and revoking credentials. 
For systems that do support it, dynamic secrets eliminate the most common failure mode in static rotation: the window during which the old and new credentials both need to be valid.

How do you measure emergency credential revocation latency for a secrets management ADR?

Emergency credential revocation latency is the time from the moment a credential compromise is detected to the moment every consumer of that credential is denied access. Measuring it requires tracing the revocation path for each secret type in the system. For a database credential stored in AWS Secrets Manager: the revocation path is (1) rotate the secret in Secrets Manager, generating a new password for the database role, (2) wait for all application instances to refresh their cached value — applications using the caching SDK with a default 1-hour refresh interval may hold the old (now invalid) credential for up to an hour, (3) applications that cached the secret in process memory at startup require a restart to pick up the new value, adding the time for a rolling deployment to the revocation path. Total latency for a startup-cached credential with no automatic refresh: deployment time (15–45 minutes for a rolling restart across all instances). For an API key stored as an environment variable: revocation requires contacting the API provider to invalidate the key, then updating the environment variable on every deployment target and restarting all consuming processes. If the key is embedded in a Kubernetes Secret that is mounted as an environment variable, the pod must be restarted; Kubernetes Secrets mounted as volume files update automatically when the Secret is updated, but environment variable injection requires pod restart. For a HashiCorp Vault dynamic database credential: revocation is a single API call to revoke the specific lease, which immediately causes Vault to execute the revocation SQL on the database, invalidating the credential. All other application instances hold distinct credentials and are unaffected. Time from detection to revocation: seconds. 
The ADR must document the measured or estimated revocation latency for each credential type and each consuming system — and must include whether the credential caching layer (application-level cache, Vault agent cache, Secrets Manager caching client) is bounded or unbounded in its refresh interval, since an unbounded cache converts a credential rotation into a service disruption when the old value is invalidated before consumers refresh.

2026-06-21 · ~20 min read

The secrets management decision record: why the secrets store you chose determines your rotation automation capability and your audit trail at every production access

Secrets management looks like a security detail until a leaked credential triggers an incident response at 3am and your team discovers that the rotation procedure in the runbook — the one linked from the incident response playbook — assumes the secrets store you replaced six months ago. The store you chose in year one determines your rotation automation model, your audit trail granularity, and how many minutes elapse between detecting a compromised credential and denying access to every consumer of it.

A fourteen-person fintech startup stored their database credentials, Stripe secret key, and third-party API keys in environment variables. Not `.env` files committed to the repository; they had learned that lesson. Instead: environment variables set directly on three EC2 instances via a shared `.env.production` file managed by their DevOps contractor, plus a copy in their GitHub Actions secrets for CI, plus a copy in their Heroku config for the staging environment, plus a copy in a Notion page titled "Production Secrets (DO NOT SHARE)" that the founder had created in the company's fourth month and which was still shared with every engineer who had ever joined.

When the DevOps contractor left the company, the CTO rotated the Stripe key manually. It took three and a half hours to update GitHub Actions, the three EC2 instances, the Heroku staging config, the Notion page, and the personal notes of the two engineers who had separately copied the key to their local `.env` files for a feature they'd been building. Two weeks later, during the due diligence for a Series A, the lead investor's security firm flagged that the old Stripe key — the one that had been rotated eighteen days earlier — was still present in a forked copy of the private repository that a contractor had kept on their own GitLab account after their engagement ended.

The question "what is your credential rotation procedure?" produced six different answers from six engineers. The most recent answer was closest to correct but described the Heroku config flow the company had moved away from eight months earlier. The ChatGPT session where the CTO had evaluated HashiCorp Vault versus AWS Secrets Manager versus Doppler during the company's infrastructure build-out had been closed nineteen months ago. The reasoning behind the choice — which turned out to have been "environment variables for now, we'll evaluate Vault when we need to" — was gone. Nobody remembered that the evaluation had happened at all.

The investor's security firm graded the credential management finding as a medium-severity issue, not a blocker. The company spent four engineer-days migrating to AWS Secrets Manager, updating IAM policies, rewriting the rotation runbook, and filing evidence for the SOC 2 Type II pre-audit. The four engineer-days were not the expensive part. The expensive part was discovering, during the migration, that five services were reading the same database password from five different places — three EC2 instances, one Lambda function, and one container in ECS — and that the Lambda function had been added by a contractor eight months earlier whose access to the Notion secrets page had never been reviewed. The secret had spread further than anyone had mapped.

This is the failure mode that a secrets management ADR is designed to prevent: not the credential leak itself, but the inability to answer "what reads this credential?" and "how fast can we revoke access to it?" when the credential is compromised.

The three structural properties that secrets management determines

When engineers evaluate secrets management options, the discussion centers on features: does it support automatic rotation? does it have a nice dashboard? does it integrate with Kubernetes? These are implementation details. The structural properties that determine your security posture and incident response capability are more fundamental.

Rotation automation capability

Rotation automation is the ability to change a credential value across all consumers without manual intervention and without service disruption. The capability is not binary — it exists on a spectrum from fully manual (change it in one place, remember to update everywhere) to fully automated with zero-downtime transition (generate new credential, distribute it, validate it works, retire the old one, all without application restart).

The rotation model is determined by two independent choices: where the credential is stored (the secrets store) and how the application reads it (at startup vs at runtime on each access vs via a sidecar agent). These two choices interact in ways that are not obvious until you are in an incident.

An application that reads a credential from an environment variable at process startup cannot rotate the credential without restarting the process. The process holds the credential value in memory for its entire lifetime. A rolling restart of the application fleet takes 15–45 minutes for a typical multi-instance deployment. During that window, instances that have not yet restarted hold the old credential, and if the old credential has been revoked (because it was compromised rather than rotated on a schedule), those instances will fail authentication. This is not a rotation window problem — it is a fundamental architectural constraint imposed by the decision to read secrets at startup rather than at request time.

An application that reads a credential via an API call on each authentication attempt (or with a cached value that refreshes on a configurable interval) can rotate the credential without restarting: the new value is written to the secrets store, and each application instance picks it up when its cache expires or when it receives a cache invalidation signal. AWS Secrets Manager's caching SDK defaults to a 1-hour refresh interval. The cache itself never sees an authentication failure, so the application (or a wrapper such as the AWS-provided JDBC driver for Secrets Manager) has to force a re-fetch when the target rejects the cached value. With that retry in place, the disruption is limited to roughly one failed request per consumer before the cache refreshes.
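A minimal sketch of the cache-and-refresh pattern described above, not the caching SDK itself. `CachedSecret`, `AuthError`, and `call_with_secret` are hypothetical names, and the `fetch` callable is an assumption standing in for whatever call reads the secrets store.

```python
import time
from typing import Callable, Optional


class AuthError(Exception):
    """Hypothetical error raised when the target system rejects a credential."""


class CachedSecret:
    """Hold a secret in memory and re-fetch it when it is stale or rejected."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # assumption: wraps the real secrets store read
        self._ttl = ttl_seconds    # bounded refresh interval (1 hour here)
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached value, refreshing once the TTL has elapsed."""
        if self._value is None or self._clock() - self._fetched_at >= self._ttl:
            self.refresh()
        return self._value

    def refresh(self) -> str:
        """Force a re-fetch, for example after the target rejected the value."""
        self._value = self._fetch()
        self._fetched_at = self._clock()
        return self._value


def call_with_secret(secret: CachedSecret, action: Callable[[str], str]) -> str:
    """Run action with the cached secret; on AuthError refresh once and retry."""
    try:
        return action(secret.get())
    except AuthError:
        return action(secret.refresh())
```

The single retry is what bounds the disruption to one failed attempt per consumer; without it, a consumer keeps presenting the stale value until the TTL elapses.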

HashiCorp Vault's dynamic secrets model changes the rotation model entirely: the credential is not rotated because the credential is not shared. Each consumer holds a distinct credential with a lease duration, and Vault automatically revokes it when the lease expires. There is no rotation event that must be distributed to consumers — each consumer manages its own lease renewal. The rotation automation capability is implicit in the architecture rather than an explicit scheduled operation.
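To make the contrast with shared credentials concrete, here is a toy in-memory model of the lease idea. It is not Vault's API: `DynamicCredentialBroker`, `Lease`, and the `lease-` and `vault-` name formats are hypothetical, chosen only to mirror the description above.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    lease_id: str
    username: str
    password: str
    expires_at: float


class DynamicCredentialBroker:
    """Toy stand-in for a dynamic secrets engine: one credential per request."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._leases = {}  # lease_id -> Lease

    def issue(self, now: float) -> Lease:
        """Create a unique credential that expires ttl_seconds after now."""
        suffix = secrets.token_hex(3)
        lease = Lease(
            lease_id=f"lease-{suffix}",
            username=f"vault-{suffix}",
            password=secrets.token_urlsafe(16),
            expires_at=now + self._ttl,
        )
        self._leases[lease.lease_id] = lease
        return lease

    def revoke(self, lease_id: str) -> None:
        """Invalidate one lease; every other lease is left untouched."""
        self._leases.pop(lease_id, None)

    def is_valid(self, lease_id: str, now: float) -> bool:
        """A lease is valid only if it was not revoked and has not expired."""
        lease = self._leases.get(lease_id)
        return lease is not None and now < lease.expires_at
```

Revoking one lease leaves the others valid, which is the property that makes emergency revocation cheap in this model.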

Audit trail granularity

An audit trail for secrets access answers the question: "which identity accessed which credential at which time?" The granularity and completeness of that answer determines what you can say during a SOC 2 audit or an incident post-mortem.

Environment variables produce no audit trail. The credential was set at an unknown point in the past; whoever had SSH access to the server could read it via printenv; the application reads it at startup and holds it in memory for the process lifetime. There is no log of which services read the credential, when they first accessed it, or whether an operator read it interactively.

AWS Secrets Manager logs every GetSecretValue API call to CloudTrail with the IAM principal identity (the role ARN and the assumed-role session name), along with the source IP address and the user agent. For applications running on EC2 or ECS with IAM roles, this produces a complete access log attributable to the specific instance or task. For applications running on Lambda, the assumed-role session name identifies the function. CloudTrail records cannot be edited through the API, and the delivered log files are retained for the period configured in the trail's S3 bucket lifecycle policy; enabling log file integrity validation makes tampering with the delivered files detectable. The audit answer is: "at 14:23:07 UTC, the ECS task running in us-east-1 with role ARN arn:aws:iam::123456789012:role/app-production read the secret named /production/database/primary-password."
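As an illustration, the sketch below turns one CloudTrail record into that kind of audit sentence. The field names (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`, `requestParameters.secretId`, `sourceIPAddress`) follow the CloudTrail record format; the function name `audit_line` and the sample values in the usage note are made up for the example.

```python
from typing import Optional


def audit_line(record: dict) -> Optional[str]:
    """Summarize a CloudTrail record if it is a Secrets Manager secret read."""
    if record.get("eventSource") != "secretsmanager.amazonaws.com":
        return None
    if record.get("eventName") != "GetSecretValue":
        return None
    principal = (record.get("userIdentity") or {}).get("arn", "unknown principal")
    secret_id = (record.get("requestParameters") or {}).get("secretId", "unknown secret")
    when = record.get("eventTime", "unknown time")
    source = record.get("sourceIPAddress", "unknown address")
    return f"{when} {principal} read {secret_id} from {source}"
```

Fed a record whose `userIdentity.arn` is an assumed-role ARN ending in a task ID, this yields one line per access that names the task, the secret, and the time.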

HashiCorp Vault's audit log is more comprehensive: it records every API request and response — not only GetSecretValue equivalents but also token creation, policy lookups, lease renewals, and revocations. The audit log includes the Vault client token's associated policies and metadata, allowing attribution to the specific application identity rather than just the IAM role. For dynamic secrets, the audit log records the credential issuance with the generated username, the lease ID, and the requesting identity — creating a complete trail from credential generation to automatic revocation that is not available in any static secret rotation model.

Emergency revocation latency

Emergency revocation latency is the time between detecting a compromised credential and the moment every consumer of that credential is denied access. It is the most important metric for the secrets management ADR and the one most rarely measured before an incident forces the measurement to happen in production.

For a credential stored as an environment variable, the revocation path has no automation. The steps are: identify every place the credential is used (which requires institutional memory or a search of deployment configs, CI secrets, and any documentation), update the credential in every location, and restart every consuming process. Estimated latency: hours, depending on the completeness of the mental map and the speed of the deployment pipeline.

For a credential stored in AWS Secrets Manager with automatic rotation: the revocation path is to rotate the secret, which invalidates the old value and generates a new one. Consumers using the caching SDK receive the new value on their next cache refresh (up to 1 hour by default, or immediately on next authentication failure if the old value is rejected). Consumers that read the secret at startup require a restart. Estimated latency for a fully dynamic application using the caching SDK: minutes. Estimated latency for an application with startup-time credential reading: deployment time (15–45 minutes).

For a HashiCorp Vault dynamic database credential: the revocation path is a single API call — vault lease revoke <lease-id> — which immediately causes Vault to execute the revocation SQL on the database, invalidating the compromised credential within seconds. All other application instances hold distinct credentials and are completely unaffected: their database sessions continue without interruption. Estimated latency: seconds, regardless of the number of consumers.

The ADR must document the measured or estimated revocation latency for each credential type in the system, broken down by consuming service and reading pattern. This document is what the incident commander reaches for at 3am when the question is "how do we contain this in the next fifteen minutes?"
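One way to keep that breakdown honest is to compute it from a small inventory rather than estimate it per incident. The sketch below is an assumption about how such an inventory could look: `Consumer`, its fields, and the three reading-pattern labels are hypothetical, and the minute values in the usage note are the figures quoted in this article, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Consumer:
    service: str
    read_pattern: str               # "startup", "cached", or "dynamic_lease"
    cache_ttl_minutes: float = 0.0  # used when read_pattern == "cached"
    restart_minutes: float = 0.0    # used when read_pattern == "startup"


def worst_case_minutes(consumer: Consumer) -> float:
    """Worst-case time for one consumer to stop relying on the old value."""
    if consumer.read_pattern == "dynamic_lease":
        return 0.0  # a single lease revocation, treated as roughly zero minutes
    if consumer.read_pattern == "cached":
        return consumer.cache_ttl_minutes
    if consumer.read_pattern == "startup":
        return consumer.restart_minutes
    raise ValueError(f"unknown read pattern: {consumer.read_pattern}")


def revocation_latency_minutes(consumers: Iterable[Consumer]) -> float:
    """The credential is only contained once the slowest consumer has moved."""
    return max(worst_case_minutes(c) for c in consumers)
```

For example, a credential read by one service with a 60-minute cache and one service that needs a 45-minute rolling restart has a worst case of 60 minutes: the slowest consumer sets the number the incident commander needs.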

The options and their structural tradeoffs

Environment variables (no secrets store)

Environment variables set on the server or in the deployment platform's config are the default starting point for almost every production system. They are simple, universally supported, and require no infrastructure. They are also the worst choice for all three structural properties at any meaningful scale.

Rotation is manual and error-prone because the credential exists in multiple places simultaneously: the deployment config (Heroku config vars, ECS task definition environment, Kubernetes Secret, deployment script), CI/CD secrets, local developer environments, and any documentation. Each location must be updated independently, and there is no mechanism to verify that the update was complete. The most common rotation failure mode is not the update itself but the discovery, during the update, that the credential had been copied to a location that was not in the mental map.

The audit trail is absent. There is no log of which process read the variable or when. The credential is visible to any process running with the same environment and to any operator with access to the deployment platform's config interface.

Emergency revocation requires knowing where the credential is used, which requires institutional memory that degrades as the team changes. The fintech startup's four-engineer-day migration cost was mostly paid in credential discovery: finding all the places a credential had spread before the new store could be authoritative.

Environment variables remain appropriate for: local development (where credentials are not production-grade and do not require rotation), non-secret configuration values (feature flag overrides, log levels, service URLs), and the bootstrapping credential for the secrets store itself — the credential that gives the application permission to read from Secrets Manager or Vault must come from somewhere, and that somewhere is typically an IAM role (on AWS) or a Vault AppRole or Kubernetes service account token injected at the platform level.

AWS Secrets Manager

AWS Secrets Manager stores secrets as versioned JSON blobs with automatic rotation via Lambda functions. The managed rotation Lambdas cover RDS credentials (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Redshift, DocumentDB, and custom targets via a template Lambda that implements a four-phase rotation protocol: createSecret, setSecret, testSecret, finishSecret.

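The rotation protocol is designed for zero-downtime credential changes. After rotation completes, the new value carries the AWSCURRENT label and the old value carries AWSPREVIOUS. Whether the old value still authenticates depends on the rotation strategy: with the alternating-users strategy the previous user's password is left untouched until the next rotation, while with the single-user strategy the password is changed in place and the old value stops working as soon as setSecret runs. An application that receives an authentication failure with its cached value should re-fetch the secret, which returns the new AWSCURRENT value. The rotation window (the period during which consumers may still present the old value) depends on the maximum staleness of any consumer's cached value.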
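The four steps and the label movement can be sketched with an in-memory model. This is a simplification and not the AWS API: `VersionedSecret`, `rotate`, and the `target` dict standing in for the database are hypothetical, and the model drops the AWSPENDING label on completion for clarity.

```python
class VersionedSecret:
    """In-memory stand-in for a secret with staging labels (not the AWS API)."""

    def __init__(self, initial: str) -> None:
        self.labels = {"AWSCURRENT": initial}

    def create_secret(self, new_value: str) -> None:
        # Step 1 (createSecret): stage the candidate without touching AWSCURRENT.
        self.labels["AWSPENDING"] = new_value

    def set_secret(self, target: dict) -> None:
        # Step 2 (setSecret): apply the pending value on the target system.
        target["password"] = self.labels["AWSPENDING"]

    def test_secret(self, target: dict) -> bool:
        # Step 3 (testSecret): check that the target accepts the pending value.
        return target.get("password") == self.labels["AWSPENDING"]

    def finish_secret(self) -> None:
        # Step 4 (finishSecret): promote pending, demote the old current value.
        self.labels["AWSPREVIOUS"] = self.labels["AWSCURRENT"]
        self.labels["AWSCURRENT"] = self.labels.pop("AWSPENDING")


def rotate(secret: VersionedSecret, target: dict, new_value: str) -> None:
    """Run the four steps in order; stop before promotion if the test fails."""
    secret.create_secret(new_value)
    secret.set_secret(target)
    if not secret.test_secret(target):
        raise RuntimeError("pending value rejected; AWSCURRENT left unchanged")
    secret.finish_secret()
```

Because promotion is the last step, a failure in any earlier step leaves AWSCURRENT pointing at the value consumers already hold.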

The IAM access model is the key advantage over environment variables: each service has an IAM role with a policy that allows secretsmanager:GetSecretValue for only the specific secret ARNs that service needs. This produces a service-to-secret dependency map enforced by IAM policy rather than maintained by documentation. When a new service is deployed, the IAM policy for that service's role is the canonical record of which secrets it reads.
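A rough sketch of how that dependency map could be derived from policy documents. It assumes policies are already loaded as Python dicts keyed by role name, and it deliberately ignores wildcard actions, Deny statements, and conditions, so treat its output as a starting point. The function names are hypothetical.

```python
def secrets_readable_by(policy: dict) -> set:
    """Collect resource ARNs from Allow statements granting GetSecretValue."""
    arns = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "secretsmanager:GetSecretValue" not in actions:
            continue
        resources = statement.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        arns.update(resources)
    return arns


def dependency_map(role_policies: dict) -> dict:
    """Map each role name to the sorted list of secret ARNs it may read."""
    return {role: sorted(secrets_readable_by(policy))
            for role, policy in role_policies.items()}
```

Inverting the resulting map (secret ARN to roles) answers "what reads this credential?" directly from the access policy.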

Pricing is predictable: $0.40 per secret per month, plus API call costs ($0.05 per 10,000 calls) that are negligible with the caching SDK. For a system with 50 secrets, the monthly cost is $20 in secret storage, a rounding error relative to infrastructure costs. Cross-region replication adds $0.40 per secret per replica region per month for disaster recovery scenarios where the primary region is unavailable.
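The arithmetic can be written down as a short helper. The prices are the figures quoted in this article ($0.40 per secret per month, the same again per replica region, and $0.05 per 10,000 API calls); the function name and parameters are hypothetical, and current pricing should be checked before relying on the result.

```python
def monthly_cost_usd(secret_count: int, replica_regions: int = 0,
                     api_calls: int = 0) -> float:
    """Estimate the monthly bill from the per-secret and per-call prices."""
    storage = secret_count * 0.40 * (1 + replica_regions)  # primary plus replicas
    calls = api_calls / 10_000 * 0.05
    return round(storage + calls, 2)
```

With these inputs, 50 secrets cost $20 per month, one replica region doubles the storage portion to $40, and a million API calls add $5.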

The limitation: Secrets Manager does not support dynamic secrets. Every credential in Secrets Manager is a static value that must be rotated. For the highest-security requirement — database credentials that are unique per consuming service and automatically expire — Secrets Manager is insufficient; Vault's dynamic secrets engine is required.

HashiCorp Vault

Vault is a secrets management platform rather than a managed service: it requires operating a cluster (or using HCP Vault, HashiCorp's managed offering), configuring auth methods, defining policies, and enabling secrets engines. The operational overhead is substantially higher than Secrets Manager. The capability set is correspondingly broader.

The database secrets engine is Vault's most differentiated feature for application-layer secrets. Configuration: mount the database secrets engine, configure a connection to the target database using a management credential with role creation privileges, define a role template (the SQL for creating and revoking a database role), and set the TTL for issued credentials. Application usage: the application authenticates to Vault using an auth method (AWS IAM auth, Kubernetes service account token, AppRole), requests a database credential from the configured path, receives a unique username and password with a TTL, and renews the lease before expiry. When the lease expires, Vault executes the revocation SQL (for PostgreSQL, a statement such as DROP ROLE "vault-1a2b3c") without application involvement.

The Vault agent sidecar pattern eliminates the need for application code to call the Vault API directly. Vault agent runs as a sidecar container, authenticates to Vault, writes the requested secret to a shared volume as a file, and handles lease renewal automatically. The application reads the credential from the file; when Vault agent refreshes the credential, it overwrites the file. Applications that re-read the credential file on each authentication attempt pick up the new value without restart. Applications that cache the file value in memory require a process signal to reload — the ADR must specify which pattern the application uses and how credential refresh is triggered.
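A minimal sketch of the re-read pattern on the application side. It assumes the agent rewrites the file at a known path; `FileCredential` is a hypothetical helper, and comparing file metadata is just one way to avoid re-reading unchanged content.

```python
import os
from typing import Optional, Tuple


class FileCredential:
    """Return the credential from a file, re-reading it whenever it changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._signature: Optional[Tuple[int, int, int]] = None
        self._value: Optional[str] = None

    def get(self) -> str:
        """Call this on every connection attempt instead of caching the value."""
        stat = os.stat(self._path)
        # A rewrite or atomic replace changes mtime, inode, or size.
        signature = (stat.st_mtime_ns, stat.st_ino, stat.st_size)
        if signature != self._signature:
            with open(self._path, encoding="utf-8") as handle:
                self._value = handle.read().strip()
            self._signature = signature
        return self._value
```

An application that calls `get()` each time it opens a connection picks up a refreshed credential without a restart or a reload signal.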

The policy system allows expressing fine-grained access rules: path "database/creds/read-role" { capabilities = ["read"] } grants the read capability on only the specific role path, not on the database engine root or any other role. Policies compose via token-bound policy sets; the Vault audit log records which policies were attached to the token behind each operation. Secrets Manager's IAM-based model grants access at the secret ARN level and can add conditions through IAM condition keys (for example, aws:SourceVpce to deny reads that do not arrive through a specific VPC endpoint), but those controls apply only to AWS principals. Vault's advantage is that the same path-based policy language covers every secrets engine and every auth method, including workloads outside AWS.

The operational requirement: Vault requires high availability. A single-node Vault is a single point of failure for every application that reads secrets from it. Integrated storage (Raft) enables HA without an external Consul cluster, with a three-node cluster providing fault tolerance for one node failure. The unsealing ceremony — required on startup and after a seal — must be proceduralized: either using auto-unseal via AWS KMS (the Vault master key is encrypted with a KMS key and Vault unseals automatically on startup) or via manual unseal shares (where unsealing requires a quorum of key holders). The ADR must document the unsealing procedure and the KMS key ARN, and must specify the key rotation policy for the auto-unseal key.
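The fault tolerance figure follows from Raft's majority quorum, which is easy to check with two helper functions (the names are hypothetical):

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with the given number of voting nodes."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum(nodes)
```

A three-node cluster needs two nodes for quorum and tolerates one failure; a fourth node raises the quorum to three without adding tolerance, which is why odd cluster sizes are the usual choice.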

GCP Secret Manager

GCP Secret Manager is the GCP-native equivalent of AWS Secrets Manager: a managed service for storing secret values with versioning, IAM-based access control, and audit logging via Cloud Audit Logs. Rotation is supported via Pub/Sub notifications that trigger a Cloud Run or Cloud Function to execute a custom rotation handler — a slightly more complex wiring than Secrets Manager's native Lambda rotation, but functionally equivalent.

The access model uses GCP IAM roles: roles/secretmanager.secretAccessor grants read access to the secret payload; roles/secretmanager.viewer grants access to secret metadata but not the payload value. Service accounts bound to GCP resources (Cloud Run services, GKE workloads via Workload Identity, Compute Engine instances via service account attachment) receive IAM-based access without credential management — no API keys or service account key files needed.

Regional and multi-regional storage: secrets can be stored in a specific region or replicated across GCP's multi-regional locations, with Google managing the replication. The replication policy is set at secret creation and cannot be changed after the fact — a constraint that must be in the ADR if cross-region availability requirements exist.

External Secrets Operator (Kubernetes workloads)

External Secrets Operator (ESO) is a Kubernetes controller that synchronizes secrets from an external store (AWS Secrets Manager, Vault, GCP Secret Manager, Azure Key Vault, and others) into Kubernetes Secrets. The application reads from a standard Kubernetes Secret — mounted as a volume file or environment variable — while ESO handles the fetch and refresh from the upstream store.

The key advantage: applications require no code changes to move from Kubernetes Secrets (which are stored base64-encoded in etcd and have no access control beyond RBAC) to a managed secrets store. ESO handles the synchronization transparently. The ExternalSecret custom resource defines which path in the external store maps to which Kubernetes Secret, the refresh interval for polling the upstream store, and the template for transforming the external value into the Kubernetes Secret format.

The limitation: ESO-synchronized Kubernetes Secrets are still stored in etcd. The security benefit of the external store (encryption at rest with store-managed keys, IAM-controlled access) exists only at the store layer. The Kubernetes Secret in etcd is accessible via RBAC to any principal with get or list on the Secrets resource in that namespace. Encrypting etcd at rest (a cluster-level configuration on managed Kubernetes offerings) protects the stored data, but closing the gap also requires restricting RBAC access to Secrets in that namespace.
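The point that base64 is an encoding rather than a protection is easy to demonstrate. The sketch assumes `data` has the shape of a Kubernetes Secret's `data` field (string keys mapped to base64 strings); the function name is hypothetical.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Recover plaintext values from a Secret's base64-encoded data field."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}
```

Anyone who can read the Secret object can run the equivalent of this one-liner, so the access boundary is RBAC, not the encoding.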

ESO is appropriate when the team cannot modify application code to use a secrets SDK, when the deployment target is Kubernetes and the team wants to retain the Kubernetes Secret interface for compatibility with Helm charts and operators that expect secrets in a specific format, or when the upstream store is changing (migrating from Secrets Manager to Vault) and the team wants to decouple the application interface from the upstream store during the migration.

Managed SaaS platforms (Doppler, Infisical)

Managed secrets platforms like Doppler and Infisical provide a centralized secrets dashboard that syncs to deployment targets: Heroku config vars, Vercel environment variables, AWS Parameter Store, Kubernetes Secrets, and others. The developer experience is the primary value proposition: a single interface for managing secrets across multiple deployment environments (development, staging, production), with automatic sync to each target when a secret changes.

The rotation model depends on the downstream target: Doppler syncs to the target on a push or pull model, but the consuming application still reads from the target (the Heroku config var, the Kubernetes Secret) rather than directly from Doppler. The audit trail records changes in Doppler's dashboard but not downstream access — CloudTrail or Vault audit would be needed for per-access attribution.

These platforms are appropriate for teams with heterogeneous deployment targets (one service on Heroku, one on Vercel, one on AWS) where maintaining per-platform secret management is the primary operational burden. For teams running entirely on AWS or GCP, Secrets Manager or Secret Manager is typically more cohesive: deeper integration with IAM, better CloudTrail/Cloud Audit Logs coverage, and no additional vendor dependency.

The developer access model

The secrets management ADR must specify how engineers access secrets locally during development. This is where security decisions made for production are most commonly undermined.

The least-secure approach: a separate set of development credentials that are stored in a team-shared `.env` file committed to a private repository, or distributed via Slack or email when new engineers join. Development credentials shared this way spread to every laptop of every engineer who has ever needed them, and are rarely rotated when engineers leave the team. If the development and production credentials are different, the risk is bounded; if developers use production credentials locally "just for this one test," the risk is not bounded at all.

AWS Secrets Manager with IAM role assumption: engineers assume an IAM role for development (via `aws sso login` or `aws sts assume-role`), which grants read access to development-tier secrets. The same application code reads from Secrets Manager locally and in production, with no credential files and no shared secrets. The local AWS credential session expires on a schedule (typically 8 hours for SSO sessions), requiring the engineer to re-authenticate. This re-authentication is the security checkpoint: it is the moment when access for engineers who have left the team is naturally terminated, assuming their SSO account has been deprovisioned.
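A minimal sketch of that shared read path, assuming the boto3 SDK and a JSON-formatted secret. The secret name `dev/app/database` and the `client` parameter are illustrative assumptions (the parameter exists only so the function can be exercised without AWS access). No credentials are passed in: boto3 resolves them from the environment, which is what lets the same function work with an SSO session on a laptop and an IAM role in production.

```python
import json
from typing import Any, Optional


def read_secret(secret_id: str, client: Optional[Any] = None) -> dict:
    """Return a JSON secret from AWS Secrets Manager as a dict.

    Credentials come from boto3's default chain: an SSO session locally,
    an instance or task role in production.
    """
    if client is None:
        import boto3  # imported lazily so the module loads without the SDK

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret is stored as a JSON string, not as binary.
    return json.loads(response["SecretString"])
```

After `aws sso login`, a call such as `read_secret("dev/app/database")` would return the development credential; once the session expires, the same call fails until the engineer re-authenticates.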

Vault with CLI login: production workloads typically authenticate to Vault with the Kubernetes or AppRole auth methods, but engineers authenticate as themselves via `vault login` (or the Vault UI), ideally through an identity-provider-backed method such as OIDC. The resulting token is bound to a policy granting access to development-tier secrets only, and token TTLs enforce re-authentication. The Vault agent sidecar pattern can be emulated locally by running the Vault Agent binary, which writes secrets to local files that the application reads, identical to the production deployment pattern.

The ADR must specify: the authentication method for local development, the TTL for developer tokens or sessions, the process for revoking access when an engineer leaves the team (account deprovisioning in the identity provider, which should automatically revoke derived IAM sessions or Vault tokens), and whether production secrets are accessible to engineers at all or only to deployment systems.

The four AI chat session types that create undocumented secrets management decisions

Secrets management decisions appear in ChatGPT and Claude sessions in predictable patterns. The decisions made in these sessions are rarely captured in any persistent record because they feel like implementation steps rather than architectural choices — until an incident reveals that the "implementation step" had structural consequences nobody can now explain.

The initial infrastructure session. "We're setting up our AWS account. Should we use Secrets Manager, Parameter Store, or environment variables for our database password?" The ChatGPT response covers the tradeoffs, the team picks an approach, and the session closes. The reasoning — "Secrets Manager is worth the $0.40/month per secret because of the automatic rotation and CloudTrail integration" or "Parameter Store is free and sufficient at this scale" — is gone. What remains: the infrastructure code that implements the choice, which will be maintained and extended by engineers who never saw the evaluation.

The rotation incident session. "We need to rotate our database password immediately. How do we do this safely with Secrets Manager without downtime?" The ChatGPT session produces a procedure: update the Secrets Manager value, trigger the rotation Lambda, wait for AWSPENDING to promote to AWSCURRENT, monitor application logs for authentication failures. The procedure is executed. It works. The session closes. The procedure is never documented. The next rotation — triggered by a different engineer eighteen months later — starts from first principles.

The compliance preparation session. "We're doing a SOC 2 Type II audit. What do we need to show for secrets management?" The ChatGPT session produces a list: access control evidence (IAM policy showing least-privilege access per service), rotation evidence (CloudTrail showing rotation events and their frequency), audit log evidence (CloudTrail retention period and immutability proof), and key management evidence (KMS key rotation policy). The engineer gathers the evidence, passes the audit, and the session closes. The next audit, two years later, finds that the rotation policy was set to "as needed" in the SOC 2 evidence but has never actually been triggered — because the criteria for "as needed" were never defined in a rotation policy document.

The post-incident forensics session. "We had a credential leak. We've rotated the Stripe key. How do we know if it was used maliciously?" The ChatGPT session produces a query: check Stripe's event log for API calls during the window when the key was potentially compromised. The engineer checks, finds no fraudulent charges, and closes the incident. The session closes. The question that was not asked — "how would we know which service originally exposed the credential?" — is not answered, and the audit trail that would answer it either doesn't exist (environment variables) or was never looked at (CloudTrail, if Secrets Manager was in use).

Each of these sessions produces a decision that shapes the secrets management posture for years. The decisions that never get written down are not the big architectural choices — those get documented because they feel important. They are the operational decisions made under pressure, during incidents, or as "obvious next steps" during compliance preparation. The rotation policy, the audit log retention period, the credential refresh interval, the developer access model — these are set once in a ChatGPT session and then treated as immutable facts of the system until a new incident reveals they were always just undocumented assumptions.

What the secrets management ADR must contain

An architecture decision record for secrets management has different requirements from an ADR for a stateless architectural choice. The consequences are operational as well as structural: they determine what your team can do in the first fifteen minutes of a credential compromise, not only how the system is structured. The ADR must therefore document operational procedures alongside the architectural choice.

Section 1: Context and current state

Where do production secrets currently reside? List every category of secret and its current storage location: database credentials, API keys for external services, webhook signing secrets, TLS private keys, encryption keys for data at rest, OAuth client secrets. For each category, note how many distinct secrets exist and how many consuming services exist. This inventory is not obvious — the fintech startup's migration revealed five consumers of the database password that were not in anyone's mental map. The context section of the ADR is where the actual inventory is recorded, not the assumed inventory.
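One way to keep the recorded inventory honest is to diff it against observed access. The sketch below assumes two hypothetical inputs: a declared map of secret name to consuming services, and an observed map distilled from whatever access log the store provides. All names in it are illustrative.

```python
from typing import Dict, Set


def undocumented_consumers(
    declared: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per secret, consumers seen in access logs but absent from the inventory."""
    gaps: Dict[str, Set[str]] = {}
    for secret, seen in observed.items():
        missing = seen - declared.get(secret, set())
        if missing:
            gaps[secret] = missing
    return gaps


# Hypothetical inventory: what the team believes reads each secret.
declared = {"prod/db/password": {"api", "worker"}}
# Hypothetical observations, for example distilled from an audit log.
observed = {"prod/db/password": {"api", "worker", "reporting-cron"}}

print(undocumented_consumers(declared, observed))
```

Any non-empty result is a consumer that belongs in the context section before the decision is recorded.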

Section 2: Secrets store decision

Which store was chosen and why. The why must address the three structural properties: rotation automation model (what is automated, what requires manual action, what requires a restart), audit trail coverage (which access events are logged, where the log is retained, how to query it), and emergency revocation latency (measured or estimated, per credential type, per consuming service). It must also address the alternative considered and the specific reason it was rejected — "we considered Vault but the operational complexity of running a Vault cluster was not justified given our current team size and the absence of dynamic secrets requirements" is a complete rationale; "we chose Secrets Manager because it's easier" is not.

Section 3: Rotation policy per secret type

Rotation schedules are not uniform across secret types. Database credentials may rotate every 90 days on a schedule, or immediately on engineer departure. API keys for external services may be tied to the external provider's rotation capability: some providers support rotation without disruption, others require a coordination window with the provider's support team. TLS certificates have their own rotation timeline driven by the certificate lifetime (90 days for Let's Encrypt, up to 398 days for most other publicly trusted certificates, and whatever the issuing policy allows for a private CA). Encryption keys for data at rest can be rotated without re-encrypting existing data when the scheme uses envelope encryption (AWS KMS handles this transparently).

The rotation policy section must specify: rotation trigger (schedule or event-triggered), rotation procedure (automated via Lambda or Vault, or manual with step-by-step procedure), the credential refresh interval in consuming applications (bounded cache TTL or startup-time read), and the expected consumer disruption during rotation (zero-disruption, one failed request, requires restart).
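The four fields above can be captured as a small record so the policy is checkable rather than only described. This is a sketch under assumptions: the field names, the string values for `procedure` and `disruption`, and the example policy are illustrative, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RotationPolicy:
    secret_type: str
    max_age: Optional[timedelta]   # None means event-triggered only
    procedure: str                 # "automated" or "manual"
    refresh_interval: timedelta    # how long consumers may cache the value
    disruption: str                # "none", "one-failed-request", or "restart"


def is_overdue(policy: RotationPolicy, last_rotated: datetime, now: datetime) -> bool:
    """True when a scheduled secret has exceeded its maximum age."""
    if policy.max_age is None:
        return False
    return now - last_rotated > policy.max_age


# Hypothetical policy for a database credential on a 90-day schedule.
db_policy = RotationPolicy(
    secret_type="database-credential",
    max_age=timedelta(days=90),
    procedure="automated",
    refresh_interval=timedelta(minutes=5),
    disruption="none",
)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_overdue(db_policy, datetime(2024, 1, 1, tzinfo=timezone.utc), now))
```

A check like `is_overdue` is also what turns "as needed" into a defined criterion that an audit can verify.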

Section 4: Audit trail configuration

The audit trail section must specify what is logged and where. For Secrets Manager: the CloudTrail trail configuration, the S3 bucket retention period, and whether CloudTrail log file integrity validation is enabled (which provides cryptographic evidence of log immutability for audit purposes). For Vault: the audit device type (file, syslog, or socket), the log destination, the retention and rotation policy for the audit log files, and whether sensitive values in the log are HMAC-hashed (the default) or written in the clear via the `log_raw` option. For either store, the section must include a sample query demonstrating how to answer "which services accessed the database password between 14:00 and 15:00 UTC on a specific date". This query is what the incident commander will run, and it should be documented before the incident rather than discovered during it.
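As an illustration of such a sample query for Secrets Manager, the sketch below filters CloudTrail records that have already been loaded from log files (the `Records` array of each file). It assumes the caller referenced the secret by the same identifier being searched for, since `requestParameters.secretId` holds whatever name or ARN the caller passed; the function name and the example secret name are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Set


def _parse_time(value: str) -> datetime:
    # CloudTrail timestamps look like "2024-03-05T14:07:31Z" (UTC).
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def who_read_secret(
    records: Iterable[dict], secret_id: str, start: datetime, end: datetime
) -> Set[str]:
    """Return principal ARNs that called GetSecretValue on secret_id in [start, end)."""
    principals: Set[str] = set()
    for record in records:
        if record.get("eventSource") != "secretsmanager.amazonaws.com":
            continue
        if record.get("eventName") != "GetSecretValue":
            continue
        params = record.get("requestParameters") or {}
        if params.get("secretId") != secret_id:
            continue
        if start <= _parse_time(record["eventTime"]) < end:
            identity = record.get("userIdentity") or {}
            principals.add(identity.get("arn", "unknown"))
    return principals
```

Calling it with a start of 14:00 UTC and an end of 15:00 UTC on the date in question returns the set of roles that read the secret in that hour. The same question can be answered with a query service over the log bucket; the point is that one working form of the query is written down in the ADR.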

Section 5: Emergency revocation procedure

This is the most important section and the most frequently absent one. For each credential type, document the step-by-step procedure for emergency revocation: what command to run, what flag to set, what policy to update, and what application behavior to expect during and after revocation. Include the estimated time for each step and the total elapsed time from detection to full revocation.

For AWS Secrets Manager: step 1, log into the AWS console or use the CLI; step 2, navigate to the secret and initiate rotation or update the value directly; step 3, note the time at which the AWSCURRENT value was updated; step 4, monitor CloudTrail for GetSecretValue calls made after that time (CloudTrail records who called and when, not which value was returned, so any consumer with no call since the update should be assumed to still hold the old value); step 5, for consumers that read at startup, initiate a rolling deployment. Estimated total time: 2–5 minutes to rotate, 1 hour maximum for all caching consumers to refresh, 15–45 minutes for startup-reading consumers after a rolling deployment is triggered. The worst case from detection to full revocation is therefore roughly 65 minutes (5 minutes to rotate plus the 1-hour cache bound), unless caching consumers are restarted as well.
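Step 3 depends on knowing when the new version actually became AWSCURRENT. A sketch of that check, operating on the `VersionIdsToStages` map that the DescribeSecret API returns: it treats rotation as finished once AWSCURRENT has moved to a different version and no version is left holding AWSPENDING on its own. The function names, the injected `describe` callable, and the timeout values are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


def current_version(version_stages: Dict[str, List[str]]) -> Optional[str]:
    """Return the version ID that carries the AWSCURRENT label, if any."""
    for version_id, stages in version_stages.items():
        if "AWSCURRENT" in stages:
            return version_id
    return None


def rotation_finished(
    version_stages: Dict[str, List[str]], previous_current: Optional[str]
) -> bool:
    """True once AWSCURRENT has left previous_current and no version holds AWSPENDING alone."""
    if current_version(version_stages) in (None, previous_current):
        return False
    return not any(
        "AWSPENDING" in stages and "AWSCURRENT" not in stages
        for stages in version_stages.values()
    )


def wait_for_rotation(
    describe: Callable[[], dict],
    previous_current: Optional[str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll describe() until rotation finishes; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        stages = describe().get("VersionIdsToStages", {})
        if rotation_finished(stages, previous_current):
            return True
        sleep(poll_s)
    return False
```

With boto3, `describe` could be `lambda: client.describe_secret(SecretId=secret_id)`, with `previous_current` captured from the same call before rotation is triggered.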

For Vault dynamic secrets: step 1, identify the lease ID for the compromised credential from the Vault audit log; step 2, run `vault lease revoke <lease-id>`; step 3, verify in the Vault audit log that the revocation was recorded, and in the database that the generated role no longer exists. Estimated total time: 30 seconds for a specific lease, 60 seconds including log verification. To revoke all credentials issued by a specific Vault role, run `vault lease revoke -prefix database/creds/app-role/`, which revokes all active leases for that role simultaneously.
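When revocation is driven from incident tooling rather than a terminal, the same two operations are available over Vault's HTTP API (`sys/leases/revoke` and `sys/leases/revoke-prefix`). The sketch below only builds the requests, using the standard library; the function names are illustrative, and the Vault address and token in the usage note are placeholders.

```python
import json
import urllib.parse
import urllib.request


def revoke_lease_request(vault_addr: str, token: str, lease_id: str) -> urllib.request.Request:
    """Build the HTTP request equivalent to `vault lease revoke <lease-id>`."""
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke",
        data=json.dumps({"lease_id": lease_id}).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="PUT",
    )


def revoke_prefix_request(vault_addr: str, token: str, prefix: str) -> urllib.request.Request:
    """Build the HTTP request equivalent to `vault lease revoke -prefix <prefix>`."""
    quoted = urllib.parse.quote(prefix.strip("/"))  # keeps "/" separators intact
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke-prefix/{quoted}",
        headers={"X-Vault-Token": token},
        method="PUT",
    )
```

Sending is a matter of `urllib.request.urlopen(request, timeout=10)` against a real Vault address. Prefix revocation is a privileged operation in Vault, so the ADR should also record which token or policy the on-call engineer is expected to use for it.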

Section 6: Developer access model

How engineers access secrets locally without production credentials. Specify the authentication method (AWS SSO role assumption, Vault CLI login, service account key file for GCP), the credential TTL and re-authentication procedure, the process for revoking developer access on engineer departure (which identity provider group membership must be removed, and what downstream access that removal terminates), and whether any production secrets are accessible to engineers directly or only via the deployment system.

Also specify the handling of local development secrets that are not production credentials — the development database URL, the test API key for the payment provider's sandbox environment. These are lower-risk than production credentials but still require a defined distribution and rotation model, since they spread to developer laptops and CI environments in the same way production credentials spread if not managed deliberately.

The rotation model determines what you can do in the first fifteen minutes

Every secrets management decision is evaluated once, during the initial infrastructure build, and then tested exactly once — during the first real credential compromise. The test is pass or fail. There is no partial credit for "we would have rotated faster if we'd had better tooling."

The fintech startup's rotation took three and a half hours for the Stripe key because the credential had spread to six locations and the manual rotation procedure touched each location sequentially. If the same incident happened with AWS Secrets Manager in place, and all consumers were using the caching SDK with a 5-minute refresh interval, the total rotation time would have been under 10 minutes: 30 seconds to update the secret, 5 minutes maximum for all consumers to refresh their cached value, no restarts required. The old key would be in AWSPREVIOUS for the rotation window and then retired. The contractor's forked repository would contain a key value that was already invalid at the time the security firm found it.
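The 5-minute bound in that estimate comes entirely from the consumer's refresh interval. A hand-rolled sketch of the idea follows; it is a stand-in for a caching SDK, not its implementation, and the `fetch` callable and injectable `clock` are assumptions made so the behaviour can be shown without a real secrets store.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    """Cache a secret value and re-fetch it once the refresh interval has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        refresh_interval_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._interval = refresh_interval_s
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Optional[str]:
        now = self._clock()
        # Fetch on first use, then again whenever the cached copy is too old.
        if self._fetched_at is None or now - self._fetched_at >= self._interval:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A consumer that calls `get()` on every request holds a rotated-out value for at most `refresh_interval_s` seconds, which is why the refresh interval, rather than the rotation step itself, dominates the revocation latency recorded in the ADR.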

The security ADR and threat model establishes the threat landscape; the secrets management ADR is where the specific mitigations are operationalized. A threat model that lists "credential exposure" as a risk and points to "secrets management" as a mitigation is incomplete without a secrets management ADR that specifies the rotation automation model, the audit trail coverage, and the emergency revocation latency — because those are the numbers that determine whether the mitigation is adequate for the risk level.

The WhyChose decision extractor was built specifically for the kind of decision that the fintech startup lost: the ChatGPT session where the CTO evaluated Vault versus Secrets Manager versus environment variables, decided on environment variables "for now," and never wrote down the reasoning or the acceptance criteria for when "for now" would expire. That session is the secrets management ADR that the company needed and didn't have. The extractor recovers it — not from the closed ChatGPT window, but from the next engineering team that makes the same decision and captures it this time.

The rotation model, the audit trail, the revocation latency — these are not implementation details. They are the operational parameters that determine whether your incident response is measured in minutes or hours. The decision that set those parameters was made in a ChatGPT session that is now closed. The ADR is how you keep it from being made again from scratch, under pressure, in the dark.

Further reading on related architectural decision records:

The authentication strategy decision record — the auth system and the secrets store interact: the credentials that prove application identity to the secrets store (IAM roles, Vault AppRole) are themselves secrets that require a bootstrap model.
Security ADR: threat model and compliance — where the secrets management ADR fits in the broader security posture and SOC 2 evidence chain.
The infrastructure-as-code strategy decision record — how secrets are referenced (not stored) in Terraform and whether the IaC tool has access to the secrets store during plan and apply operations.
The database connection pooling decision record — the connection pooler's authentication to the database is one of the credentials that the secrets management system must manage and rotate.
The CI/CD pipeline decision record — how CI/CD accesses secrets for deployment (OIDC federation to Vault, GitHub OIDC to AWS IAM, or static CI secrets) is a secrets management decision made in the CI/CD ADR.
How to document architecture decisions — the ADR format and conventions used across all decision records in this series.
WhyChose decision extractor — recover the secrets management decisions buried in your AI chat history.