The infrastructure-as-code strategy decision record: why the IaC approach you chose in year one determines your drift detection capability and your compliance audit surface
Infrastructure-as-code looks like a solved problem until a manually-applied security group fix from a late-night incident sits in production for eight months, invisible to your Terraform state, and a junior engineer's next terraform plan would delete it. The IaC tool and module structure you chose — Terraform, Pulumi, CloudFormation, direct CLI — determines whether drift between your declared configuration and your actual infrastructure is detectable before it becomes a production incident, whether your compliance audit has a change trail that satisfies SOC 2 requirements, and whether the engineer running apply has enough context to understand what they are about to change and who authorized it. Most teams discover the limits of their IaC approach not during initial setup, but during their first compliance audit, their first major refactor, or their first incident where the production state and the IaC configuration have diverged past the point where reconciliation is straightforward.
A five-person SaaS team has been using Terraform to manage their AWS infrastructure for two years. The setup was done well in year one: remote state in S3 with DynamoDB locking, separate directories for dev and production, modules for the repeating patterns (VPC, RDS cluster, ECS service), and a CI pipeline that runs terraform plan on every pull request. Infrastructure changes go through the same code review process as application code. The team is proud of this. It is more disciplined than most startups at the same stage.
In October, a production incident at 2am requires the on-call engineer to allow a new IP range to reach the RDS cluster. The security group managed by Terraform is the right place to add the rule, but running the IaC pipeline (the CI run, the plan review, the apply) takes twelve minutes, and the database connection errors are cascading. The engineer adds the security group rule directly from the AWS console. The incident is resolved in four minutes. A Slack message goes into #incidents: "Added SG rule manually to unblock — someone clean this up in IaC." Nobody does. The incident postmortem does not include an action item for it.
Eight months later, a new engineer joins the team and is asked to tighten the security group rules as part of a compliance preparation sprint. She runs terraform plan. The plan shows that Terraform will remove the manually added ingress rule and add two new ones from the IaC configuration, which was updated in November for a different reason. She does not know what the removed rule is for. The Slack message from October is buried in eight months of channel history. The rule is protecting access from a managed service that the team added post-incident but never fully documented. She opens a pull request with the plan output attached. It goes to the lead engineer for review. He approves it without noticing the removal. The apply runs. The postmortem that follows teaches the team more about their IaC approach than two years of normal operation did.
The incident reveals four things about the IaC strategy that were never written down: there is no mechanism for detecting drift between the Terraform state and the actual AWS configuration in the months between applies; there is no procedure for emergency manual changes that would have created an obligation to reconcile the change into IaC within 48 hours; the plan output in CI shows what Terraform will do but does not explain why, and the reviewer did not have the context to evaluate whether the removal was intentional; and the compliance audit beginning in January will require an evidence trail of every production infrastructure change that git blame on .tf files does not cleanly satisfy. These decisions about how IaC integrates with the team's operational process were never written down — not because they were unimportant, but because they were discovered gradually through operation rather than made explicitly at the start.
Why IaC tool and structure is an architectural decision, not a configuration preference
Infrastructure-as-code choices are often framed as tooling preferences — "we use Terraform" or "we're a CloudFormation shop" — rather than as architectural decisions with long-lived structural consequences. This framing misses what the IaC approach actually determines.
The IaC tool determines the state model and its failure modes. Terraform and OpenTofu maintain a state file that records the last-known configuration of every managed resource. The state file is the source of truth for what Terraform believes exists — and when that belief diverges from reality (due to manual changes, provider-side updates, or resources deleted outside of Terraform), the divergence is only discovered at plan time, not continuously. CloudFormation maintains state internally within the AWS service, tied to stack operations — it has no external state file to manage, lock, or lose, but the tradeoff is that CloudFormation's stack operations are atomic at the stack level, meaning a failed update requires rollback to the previous state rather than partial recovery at the resource level. Pulumi maintains state in a backend (Pulumi Cloud, S3, or local) and supports the same remote-state patterns as Terraform, with the additional complexity that the program that generates the resource graph is general-purpose code (TypeScript, Python, Go) rather than a declarative configuration language — which means the state is coupled to the version of the code that generated it in a way that Terraform's HCL is not.
The IaC tool determines the compliance audit surface. SOC 2 CC8.1, the change management criterion, requires evidence that changes to production infrastructure were authorized. The evidence trail available depends entirely on the IaC approach: raw Terraform without a plan-and-apply orchestration layer provides git history (who committed the configuration) but not apply history (who executed the apply, when, and with what plan output). Terraform Cloud and Atlantis add plan-and-apply runs with approval gates; the compliance-relevant signal is the run record showing who approved and who applied. CloudFormation's stack updates are recorded in AWS CloudTrail with the IAM principal that initiated the change, giving a native audit trail without additional tooling. Pulumi Cloud provides stack update history with the user who initiated each update. The choice between these approaches determines whether the compliance audit evidence is generated automatically as a byproduct of normal operation, or requires retroactive log extraction and correlation that may not satisfy the auditor's requirement for a clear and direct trail.
The module and environment structure determines the blast radius of errors. A Terraform configuration organized as a single root module with all environments in one directory tree means a terraform apply with an incorrect workspace selection can modify production infrastructure when the engineer intended to change the dev environment. A structure organized with separate root modules per environment — environments/dev/, environments/staging/, environments/production/ — means the blast radius of a mistyped directory is bounded to a single environment. Module versioning — whether child modules are pinned to specific versions or referenced from a local path — determines whether a change to a shared module applies immediately to all environments that reference it or requires explicit version bumps per environment.
The apply authorization model determines the team's ability to maintain change discipline at scale. In year one, with five engineers and one environment, the implicit rule "anyone can apply but be careful" works well enough. In year three, with twenty engineers, three environments, and a compliance audit requirement, the implicit rule produces a situation where the auditor asks "who was authorized to apply this production change?" and the answer is "anyone on the team with AWS access." This is not an acceptable answer for most compliance frameworks. The apply authorization model must be explicit: which environments require approval before apply, what form that approval takes (pull request review, Atlantis plan approval, Jira ticket number attached to the run), and what the break-glass procedure is for emergency applies that bypass normal approval gates.
IaC tools: the structural properties of each approach
Five infrastructure management approaches are in common production use. Each has structural properties that shape what is easy, what is possible with additional tooling, and what requires architectural investment to achieve.
Terraform / OpenTofu is the dominant choice for cloud-agnostic infrastructure management. The HCL configuration language is declarative and readable without a general-purpose programming background. The provider ecosystem is the largest of any IaC tool — virtually every cloud service and SaaS product with an API has a Terraform provider. The structural properties that matter for the IaC ADR: state management is the team's responsibility — remote state must be configured, locking must be configured, state backups must be maintained, and state drift must be detected by running terraform plan rather than by continuous monitoring. The plan-and-apply separation is a first-class concept: plan is a safe, read-only preview of what will change, and apply executes the changes — this separation makes the CI gate (plan on PR, apply on merge) natural to implement. The module system is straightforward but has composition limits: modules cannot call other modules across providers without explicit data sources, and the absence of general-purpose programming constructs means complex conditional logic is expressed through count / for_each patterns that are readable once familiar but unfamiliar to engineers from a general-purpose programming background. OpenTofu, the open-source Terraform fork maintained by the Linux Foundation, is API-compatible with Terraform 1.5 and appropriate for organizations that require an OSS license or want independence from HashiCorp's commercial licensing decisions.
Pulumi manages infrastructure using general-purpose programming languages: TypeScript, Python, Go, C#, Java. The structural advantage is the full expressiveness of a real programming language — loops, functions, classes, type-safe interfaces, IDE autocompletion, unit tests against the resource graph — which makes complex infrastructure configurations substantially more maintainable than the equivalent in HCL. The structural limitation is the same advantage: the infrastructure definition is code that must be executed to produce the resource graph, and the resource graph depends on the version of the code that runs it. Refactoring Pulumi code can change the resource graph in non-obvious ways — a resource whose name was generated by a loop index now has a different logical name after the loop is refactored, requiring a Pulumi alias or an explicit import to avoid destroy-and-recreate. Engineers from a software background find Pulumi natural; engineers from an operations background who are comfortable with YAML and HCL but not with TypeScript or Python may find the language overhead disorienting. The state management model is similar to Terraform: a backend stores the stack state, and the program is run to produce the desired state, with the diff applied. Pulumi Cloud provides a managed backend with update history and approval gates comparable to Terraform Cloud.
AWS CloudFormation / CDK is the native infrastructure-as-code tool for AWS-only infrastructure. CloudFormation's state management is handled by the AWS service itself: there is no external state file, no locking configuration, and no state file that can fall out of sync, because CloudFormation's source of truth is the stack's last-applied template, maintained by AWS. Resources can still be changed outside CloudFormation; the service's drift detection feature reports those differences, but only when it is run. Stack updates are atomic: if a change fails partway through, CloudFormation rolls back to the previous state automatically, which is both a feature (no partial failures that require manual remediation) and a limitation (large stacks can take 30+ minutes to roll back, and rollback itself can fail for resources with complex lifecycle dependencies). The AWS CDK (Cloud Development Kit) allows CloudFormation stacks to be defined in general-purpose languages (TypeScript, Python, Go, Java) and synthesized to CloudFormation templates, combining CloudFormation's managed state model with the expressiveness of a programming language. The structural limitation of the CloudFormation / CDK approach is that it is AWS-only: managing infrastructure across AWS, GCP, or Azure, or managing SaaS provider configurations (Datadog, PagerDuty, Cloudflare) alongside cloud resources, requires either separate tooling or the use of the CloudFormation registry for third-party providers, a more limited ecosystem than Terraform's provider catalog.
Ansible is a configuration management and orchestration tool that can also provision cloud resources, but its model (ordered task lists rather than a declarative desired state backed by a state file) produces a fundamentally different operational property: Ansible playbooks are not guaranteed to be idempotent. Running a Terraform configuration twice produces the same result as running it once (idempotency is the core contract of the declarative model). Most built-in Ansible modules are idempotent, but tasks that run arbitrary commands or scripts are not unless the author guards them, so running a playbook twice may produce different results depending on the task implementation. Ansible is appropriate for configuration management of existing instances (OS configuration, software installation, service setup) and for orchestrating multi-step processes that do not fit the declarative resource model. It is less appropriate as the primary cloud resource provisioning tool for infrastructure that must be consistently recreatable: the absence of a state model means there is no plan operation to preview what will change, and no drift detection beyond running the playbook and observing whether any tasks report changes.
Direct cloud provider CLI or console — ClickOps, in the industry vernacular — is not a tool choice so much as the absence of one. It is the approach that every IaC migration starts from and that every mature team has some residual of (the manually-created resource that was never imported, the console-applied security group rule from an incident). The structural properties of direct-provider management: zero overhead to apply a change, maximum risk of drift between the actual configuration and any documentation, no audit trail beyond provider CloudTrail or equivalent, and no plan operation to preview the effect of a change. The compliance and drift implications make direct-provider management unacceptable as a primary approach for any infrastructure that is part of a regulated product surface or that is subject to a change management requirement.
State management and drift detection
For Terraform-based approaches, state management is the most operationally consequential decision in the IaC strategy: it determines what happens when things go wrong. The IaC ADR must specify the state management approach in enough detail that a new engineer can operate the infrastructure without needing to discover the conventions by reading the code.
Remote state backends (S3 with DynamoDB locking for AWS teams, GCS with its built-in locking for GCP teams, Terraform Cloud or Azure Blob Storage for cross-cloud teams) are non-negotiable for any infrastructure beyond a solo developer's personal projects. Local state means the state file lives on the engineer's laptop: when that engineer leaves, the state goes with them; when two engineers apply from their own copies, the copies diverge and neither reflects the real infrastructure. The ADR must specify the backend, the bucket / container name and path convention, the locking mechanism, and the backup / versioning configuration (S3 bucket versioning to allow state recovery after accidental terraform state rm).
State locking prevents concurrent applies from corrupting the state file. The locking mechanism must match the backend: DynamoDB for S3-backed state, GCS object preconditions for GCS-backed state, or the managed locking in Terraform Cloud / Atlantis. The ADR must specify what happens when a lock cannot be acquired — whether the CI pipeline fails with a clear message, whether engineers are permitted to force-unlock in specific circumstances, and who is authorized to force-unlock. A force-unlock applied while another apply is running produces state corruption; the authorization to force-unlock must be tightly controlled and logged.
Drift detection requires a proactive approach, not just relying on drift appearing in the next plan. The recommended approach is a scheduled CI job that runs terraform plan against each environment on a defined cadence — daily for production, after each deployment for dev and staging — and alerts on any non-empty plan output that was not triggered by a pending configuration change. This separates expected drift (a pending PR is waiting to be applied) from unexpected drift (a manual change was applied that is not in any PR). The alert must route to the team's incident channel with enough context for the on-call engineer to determine whether the drift is dangerous or benign: the resource type, the attribute that changed, and whether the drift would result in a destructive operation on the next apply. A drift alert on a modified instance tag is a minor discrepancy; a drift alert on a modified security group or IAM policy requires immediate investigation. The observability strategy ADR must address where drift alerts route and who is on-call to respond.
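The triage rule described above can be sketched in Python. This is a minimal sketch, assuming the scheduled job has already exported the plan as JSON with terraform show -json; the HIGH_RISK_TYPES set, the classify_drift function, and the severity labels are illustrative assumptions, not part of Terraform or any other tool.

```python
from __future__ import annotations

import json

# Hypothetical list of resource types whose drift needs immediate attention.
HIGH_RISK_TYPES = {
    "aws_security_group",
    "aws_security_group_rule",
    "aws_iam_policy",
    "aws_iam_role_policy",
}


def classify_drift(plan: dict) -> list[dict]:
    """Return one finding per resource that the next apply would change."""
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing would change for this resource
        destructive = "delete" in actions
        if destructive or rc["type"] in HIGH_RISK_TYPES:
            severity = "investigate-now"
        else:
            severity = "minor"
        findings.append({
            "address": rc["address"],
            "type": rc["type"],
            "actions": actions,
            "destructive": destructive,
            "severity": severity,
        })
    return findings


# Trimmed, hand-written sample in the shape of `terraform show -json` output.
SAMPLE_PLAN = json.loads("""
{"resource_changes": [
  {"address": "aws_instance.web", "type": "aws_instance",
   "change": {"actions": ["update"]}},
  {"address": "aws_security_group.db", "type": "aws_security_group",
   "change": {"actions": ["update"]}},
  {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
   "change": {"actions": ["no-op"]}}
]}
""")

if __name__ == "__main__":
    for finding in classify_drift(SAMPLE_PLAN):
        print(finding["severity"], finding["address"], finding["actions"])
```

A scheduled job could decide whether to call a function like this at all by running terraform plan -detailed-exitcode, which exits with code 2 when the plan is non-empty. Separating expected drift from unexpected drift (a pending PR versus a manual change) still needs information from the repository that this sketch does not model.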
Importing manually created resources is the reconciliation procedure for resources that exist in the cloud but not in the IaC state. terraform import adds an existing resource to the Terraform state without modifying the resource; the corresponding configuration must then be written manually (or generated from the state with terraform show -json and converted to HCL). The ADR must specify the import procedure as a required step in the break-glass emergency change protocol: any manual change to production infrastructure must be followed by an import PR within 48 hours, with the import linked to the incident that produced the manual change. Without this requirement, emergency changes are never reconciled, and they accumulate until a plan shows destructive changes that nobody can explain.
Module and environment structure
The module and environment structure is the set of conventions that determines how the IaC configuration scales from one environment to several and from five engineers to fifty. These conventions are the decisions most frequently made implicitly, in the directory layout of the initial commit, and they are the most painful to change later, when the team has accumulated months of state against a structure that no longer serves the operational needs.
The environment directory structure must separate root modules per environment so that an incorrect workspace or directory selection cannot apply changes to the wrong environment. The common pattern: environments/dev/main.tf, environments/staging/main.tf, environments/production/main.tf, each calling shared modules from modules/ with environment-specific variable files. The alternative — a single root module with Terraform workspaces selecting the environment — shares state backends across environments, makes it easy to accidentally apply to the wrong workspace, and complicates environment-specific variable management. The ADR must specify which pattern is used and why, so that new engineers do not introduce a second convention when creating a new environment.
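One way a pipeline wrapper might enforce the per-environment layout is to compare the working directory with the environment the operator says they intend to change. The following is a minimal sketch under that assumption; environment_of and assert_target are hypothetical helper names, and the environments/<env>/ layout is the one described above.

```python
from pathlib import PurePosixPath


def environment_of(workdir: str) -> str:
    """Extract the environment name from a path like .../environments/<env>/..."""
    parts = PurePosixPath(workdir).parts
    if "environments" not in parts:
        raise ValueError(f"{workdir!r} is not under an environments/ directory")
    idx = parts.index("environments")
    if idx + 1 >= len(parts):
        raise ValueError(f"{workdir!r} does not name an environment")
    return parts[idx + 1]


def assert_target(workdir: str, intended: str) -> None:
    """Refuse to continue when the directory and the stated intent disagree."""
    actual = environment_of(workdir)
    if actual != intended:
        raise ValueError(
            f"refusing to apply: directory is {actual!r}, intended {intended!r}"
        )


if __name__ == "__main__":
    assert_target("/repo/environments/dev", "dev")  # passes silently
    try:
        assert_target("/repo/environments/production", "dev")
    except ValueError as err:
        print(err)
```

The check is deliberately redundant with the directory layout: the layout bounds the blast radius, and the guard turns a mistyped directory into an explicit failure rather than a surprising plan.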
Module versioning determines how a change to a shared module propagates across environments. If modules are referenced with local paths (source = "../../modules/rds"), a change to the module applies to all environments that reference it on the next apply — which is convenient for rapid iteration in early stages but dangerous in year three when a module change must be validated in staging before production. Module versioning via Git tags (source = "git::https://gitlab.com/org/infra-modules.git//rds?ref=v2.1.0") or a Terraform registry allows each environment to pin to a specific version and promotes changes explicitly. The ADR must specify the module versioning approach and the promotion procedure — how a module change moves from dev to staging to production, what the required validation steps are at each stage, and who approves the promotion.
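A CI check for the pinning rule could look roughly like the sketch below. It assumes Git-sourced modules pinned with ?ref= as in the example above, and it uses a regular expression rather than a real HCL parser, so it will also match source lines in other blocks (for example provider requirements) and does not understand registry modules pinned with a version argument. The function name and sample configuration are hypothetical.

```python
from __future__ import annotations

import re

# Matches `source = "..."` assignments. A regex is a simplification, not an
# HCL parser: run it only on files that contain module calls.
SOURCE_RE = re.compile(r'^\s*source\s*=\s*"([^"]+)"', re.MULTILINE)


def unpinned_module_sources(tf_text: str) -> list[str]:
    """Return module sources that are local paths or lack a ?ref= pin."""
    unpinned = []
    for source in SOURCE_RE.findall(tf_text):
        is_local = source.startswith("./") or source.startswith("../")
        has_ref = "?ref=" in source
        if is_local or not has_ref:
            unpinned.append(source)
    return unpinned


# Hypothetical production root module, used only to exercise the check.
SAMPLE_TF = '''
module "rds" {
  source = "git::https://gitlab.com/org/infra-modules.git//rds?ref=v2.1.0"
}

module "vpc" {
  source = "../../modules/vpc"
}
'''

if __name__ == "__main__":
    for source in unpinned_module_sources(SAMPLE_TF):
        print("unpinned module source:", source)
```

A team that allows local paths in dev but not in production could run a check like this only against the production root module.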
Naming and tagging conventions for cloud resources are an IaC-strategy concern because they determine whether a resource visible in the cloud console can be traced back to its IaC definition. A resource named app-db-prod in the console can be located in the IaC configuration by searching for that name. A resource named rds-20241015182334 (auto-generated timestamp suffix from a poorly-designed module) cannot. The ADR must specify the required tags — at minimum: Environment, ManagedBy (value: "terraform"), Module (the module that created the resource), TerraformWorkspace — that are applied to every resource by the shared tagging module. These tags are the link between the cloud console view of the infrastructure and the IaC definition, and their absence is the primary reason a security group rule from an incident is untraceable eight months later.
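The required-tag rule can also be checked mechanically against the plan. The sketch below assumes the plan has been exported with terraform show -json and inspects only each resource's tags attribute; tags injected through provider-level default tags or not known until apply are outside what it sees. REQUIRED_TAGS mirrors the list above, and missing_tags is a hypothetical helper name.

```python
from __future__ import annotations

REQUIRED_TAGS = ("Environment", "ManagedBy", "Module", "TerraformWorkspace")


def missing_tags(plan: dict) -> dict[str, list[str]]:
    """Map each taggable resource address to the required tags it lacks."""
    result = {}
    for rc in plan.get("resource_changes", []):
        after = rc["change"].get("after")
        if after is None or "tags" not in after:
            continue  # resource is being deleted, or has no tags attribute
        tags = after["tags"] or {}
        missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
        if missing:
            result[rc["address"]] = missing
    return result


# Hand-written sample in the shape of `terraform show -json` output.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "aws_db_instance.app",
            "change": {"actions": ["update"], "after": {"tags": {
                "Environment": "production",
                "ManagedBy": "terraform",
                "Module": "rds",
                "TerraformWorkspace": "default",
            }}},
        },
        {
            "address": "aws_s3_bucket.logs",
            "change": {"actions": ["create"],
                       "after": {"tags": {"Environment": "production"}}},
        },
        {
            "address": "aws_instance.old",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
}

if __name__ == "__main__":
    for address, tags in missing_tags(SAMPLE_PLAN).items():
        print(address, "is missing", ", ".join(tags))
```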
Testing strategy and CI gates
IaC testing is a category where most teams under-invest in year one and pay the cost in year three when a module refactor silently changes the resource graph for production infrastructure. The IaC ADR must specify the testing strategy across three levels: static analysis, plan-level validation, and integration tests against ephemeral environments.
Static analysis runs on every pull request without executing any cloud API calls. The standard toolchain: terraform fmt -check to enforce formatting, terraform validate to catch syntax errors and invalid resource references, tflint to catch deprecated arguments and provider-specific best practice violations (unused variables, missing required tags, instance types not in the allowed list). These checks catch the class of errors that would fail at plan time without the cost of running a plan. The security review ADR should specify which tflint rules enforce security-relevant constraints: rules that prohibit burstable instance types in production, rules that prohibit 0.0.0.0/0 ingress on sensitive ports, and rules that require encryption and versioning on S3 buckets.
Policy-as-code gates run against the Terraform plan output to enforce compliance and security rules before apply. Open Policy Agent (OPA) with Conftest, HashiCorp Sentinel (Terraform Cloud Enterprise), or Checkov with a custom rule set can evaluate the plan JSON against a rule library: "no security group may allow 0.0.0.0/0 on port 22," "all RDS instances must have deletion protection enabled," "all S3 buckets must have block public access enabled," "all IAM policies must not include * as the Resource." These rules enforce the security baseline that the security ADR specifies, but at the infrastructure provisioning layer rather than the application layer. A policy-as-code gate that fails the CI pipeline when a plan would create a misconfigured resource is the IaC equivalent of a failing unit test — it catches the misconfiguration before it exists in the cloud environment, before a security scanner finds it in production, and before an auditor flags it during a compliance review.
Plan-level CI gates run terraform plan against the real remote state on every pull request and post the plan output as a PR comment. This serves two purposes: it confirms that the configuration is valid against the actual current state (not just syntactically valid as a file), and it gives the reviewer concrete information about what the merge will change. The plan output must be reviewed as carefully as the code change — a PR that removes a resource should be reviewed with the same scrutiny as a PR that removes a code path. The ADR must specify that destructive operations (resources marked as -/+ or - in the plan) require explicit reviewer acknowledgement in the PR description before approval. Without this requirement, reviewers approve the code change without reading the plan output, which is the failure mode in the opening narrative.
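The acknowledgement requirement can be enforced rather than merely requested. The sketch below assumes a team convention, invented here for illustration, in which the PR description carries one line of the form "ack-destroy: <resource address>" per destructive operation; the function names and sample data are likewise hypothetical.

```python
from __future__ import annotations

ACK_PREFIX = "ack-destroy:"  # hypothetical PR-description convention


def destructive_addresses(plan: dict) -> list[str]:
    """Addresses the plan would delete or replace (shown as - or -/+)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]


def unacknowledged(plan: dict, pr_description: str) -> list[str]:
    """Destructive addresses with no matching acknowledgement line."""
    acked = set()
    for line in pr_description.splitlines():
        line = line.strip()
        if line.startswith(ACK_PREFIX):
            acked.add(line[len(ACK_PREFIX):].strip())
    return [addr for addr in destructive_addresses(plan) if addr not in acked]


# Hand-written sample in the shape of `terraform show -json` output.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_security_group_rule.partner_ingress",
         "change": {"actions": ["delete"]}},
        {"address": "aws_db_instance.app",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_instance.web",
         "change": {"actions": ["update"]}},
    ]
}

SAMPLE_PR = """Tighten security group rules for the compliance sprint.

ack-destroy: aws_db_instance.app
"""

if __name__ == "__main__":
    for address in unacknowledged(SAMPLE_PLAN, SAMPLE_PR):
        print("destructive change not acknowledged:", address)
```

One caveat applies to the opening narrative: when ingress rules are declared inline on the security group, removing a rule appears in the plan as an in-place update of the group rather than as a deleted resource, so a check like this flags the removal only if rules are managed as separate resources or if updates to sensitive resource types are also treated as requiring acknowledgement.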
Ephemeral environment tests apply the IaC configuration to a temporary environment, run integration tests against the resulting infrastructure, and then destroy the environment. This is the highest-confidence test but also the most expensive: it requires a cloud account or namespace that is safe to create and destroy resources in, and the test run takes as long as the actual provisioning. The ADR must specify whether ephemeral environment tests are required before production apply or only for major infrastructure changes (new service, new database, new network topology). The cost of ephemeral tests is proportional to the resources created; a test that provisions an RDS cluster on every PR may be cost-prohibitive. The right scope is often determined by risk: ephemeral tests for the module that manages the database cluster, CI plan gates for modules that manage supporting resources like S3 buckets or CloudWatch alarms.
Compliance audit surface and apply authorization
The compliance audit surface of the IaC strategy is the set of evidence that can be produced to demonstrate that infrastructure changes were authorized, reviewed, and applied by authorized principals following a defined process. This surface is determined by the toolchain choices and the apply authorization model — and it must be specified in the ADR, not left to be reconstructed under audit pressure.
Apply authorization in CI/CD is the primary mechanism for ensuring production infrastructure changes are authorized. The pattern: a pull request with a passing CI plan output is approved by a designated reviewer, merged to the default branch, and the merge triggers an automated apply. The apply is not available to individual engineers outside of the CI pipeline. This model means the audit evidence is the git log showing approvals plus the CI run log showing the plan output and the apply result — a complete record of what was proposed, who approved, and what was executed. The alternative — engineers running terraform apply from their local machines with their own AWS credentials — means the audit evidence is only the IAM CloudTrail entry showing a CLI call, without the associated plan output or approval record. The ADR must specify which model is used and, if local applies are permitted, under what circumstances and with what logging requirements.
The break-glass procedure is the authorized path for emergency infrastructure changes that must bypass the normal CI/CD pipeline. The procedure must be as explicit as the normal apply path: who is authorized to initiate a break-glass apply, what authorization is required before the apply runs (a second approver via Slack message, a PagerDuty incident number recorded in the run log), what the maximum time window is for reconciling the emergency change into IaC (48 hours is a common choice in SOC 2 preparation, often matched to the incident postmortem window), and what happens if the reconciliation window is missed (an automated alert triggers a compliance exception record). Without an explicit break-glass procedure, emergency manual changes happen under incident pressure without any associated authorization record, and the compliance auditor's question "who authorized this security group change at 2am in October?" cannot be answered from any available log.
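The record that a break-glass procedure produces can be small. The following is a minimal sketch, assuming a 48-hour reconciliation window; the BreakGlassChange class, its field names, and the identifiers in the example are hypothetical and stand in for whatever the team's incident tooling actually records.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RECONCILE_WINDOW = timedelta(hours=48)  # assumed window; set by the ADR


@dataclass
class BreakGlassChange:
    incident_id: str   # e.g. the PagerDuty incident number
    applied_by: str    # who made the manual change
    approved_by: str   # the second approver
    applied_at: datetime
    reconciled_at: Optional[datetime] = None  # when the import PR merged

    def deadline(self) -> datetime:
        return self.applied_at + RECONCILE_WINDOW

    def is_overdue(self, now: datetime) -> bool:
        """True if reconciliation happened late or has not happened in time."""
        if self.reconciled_at is not None:
            return self.reconciled_at > self.deadline()
        return now > self.deadline()


if __name__ == "__main__":
    change = BreakGlassChange(
        incident_id="INC-0000",        # placeholder identifier
        applied_by="oncall-engineer",  # placeholder identity
        approved_by="lead-engineer",   # placeholder identity
        applied_at=datetime(2024, 10, 15, 2, 0, tzinfo=timezone.utc),
    )
    now = datetime(2024, 10, 18, 2, 0, tzinfo=timezone.utc)
    if change.is_overdue(now):
        print(f"{change.incident_id}: reconciliation overdue since {change.deadline()}")
```

A scheduled job that evaluates open records like this one is what turns "someone clean this up in IaC" into an obligation with a deadline and an owner.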
The change audit trail for compliance is the combination of git history, CI run logs, and cloud provider logs that together constitute evidence of a controlled infrastructure change process. For Terraform with Atlantis or Terraform Cloud: the PR shows the proposed change and the plan output; the Atlantis or Terraform Cloud run log shows the approval and the apply; the S3 state backend versioning shows the state file before and after. For CloudFormation: CloudTrail records every CreateStack, UpdateStack, and DeleteStack call with the IAM principal and timestamp. The ADR must specify where each component of the audit trail is stored, how long it is retained (for SOC 2 the typical expectation is 12 months, covering the audit period; PCI DSS requires 12 months with 3 months immediately available), and how the trail is accessed during an audit. An auditor who asks for "the change log for the production RDS security group from January to March" must receive a specific answer with specific evidence, not "check git blame and the AWS console."
Access control for production apply — which IAM roles or identities are permitted to create, modify, or destroy production resources — must be specified in the ADR as a complement to the apply authorization process. The CI/CD pipeline should run with a least-privilege IAM role that has permission to manage only the resources declared in the IaC configuration, not AdministratorAccess. Individual engineer access to production should be read-only (describe/list/get) with write access granted only for break-glass scenarios through a separate role that requires MFA and generates a CloudTrail audit entry. This separation — pipeline applies with a specific limited role, engineers read-only in production — is the structural control that makes the apply authorization model enforceable rather than advisory.
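A simple review aid for the pipeline role is to scan its policy documents for statements that amount to administrator access. The sketch below assumes standard IAM policy JSON, where Statement, Action, and Resource may each be a single value or a list; admin_like_statements and the sample policy are hypothetical, and the check is far narrower than a real least-privilege analysis.

```python
from __future__ import annotations


def _as_list(value) -> list:
    """IAM allows a single value or a list in several fields."""
    return value if isinstance(value, list) else [value]


def admin_like_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant every action on every resource."""
    flagged = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy attached to a CI/CD pipeline role.
SAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["rds:Describe*", "rds:ModifyDBInstance"],
         "Resource": "*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

if __name__ == "__main__":
    for stmt in admin_like_statements(SAMPLE_POLICY):
        print("administrator-equivalent statement:", stmt)
```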
IaC decisions in AI chat history
Infrastructure-as-code strategy produces a specific pattern in AI chat history: the initial tool selection appears once at the start of the project, a cluster of state management and module structure conversations appears during the first major infrastructure refactor, and the compliance and drift conversations appear under deadline pressure during the first audit or after the first significant incident. None of these conversations are connected to each other in any organized documentation.
The initial tool selection sessions are typically brief and conclusive in the wrong direction. The conversation pattern: "should we use Terraform or Pulumi for our AWS infrastructure?" followed by a discussion that covers the surface-level tradeoffs (Terraform is more mature, Pulumi uses TypeScript, CloudFormation is AWS-native) without reaching conclusions about the state management approach, the compliance audit requirements for the product's target market, or the engineering team's proficiency with HCL versus general-purpose languages. The decision is made — often "let's use Terraform, everyone knows it" — and the conversation ends without documenting what alternatives were evaluated, why CloudFormation was ruled out, what the module versioning strategy will be, or what the plan CI gate requirement is. The rationale for the tool selection is in a closed chat session from two years ago; the engineer who made the decision has since left; and the new engineer who is evaluating whether to migrate to OpenTofu for licensing reasons has no access to why Terraform was chosen in the first place.
The first major drift or state corruption sessions contain the operational learning that should have updated the IaC ADR but rarely does. The conversation pattern: "our terraform plan is showing that it wants to destroy a resource we need — how do we figure out why?" or "someone applied a manual change to production and now the plan shows a conflict — what's the safest way to reconcile this?" or "we have resources in AWS that aren't in our Terraform state — how do we import them without breaking anything?" These sessions contain the team's concrete encounter with state management complexity: the order in which terraform import and configuration writing must happen, how to confirm that terraform plan no longer proposes changes to a resource once it has been imported, the procedure for splitting a monolithic state file into smaller state files when the module structure is refactored. Three months of AI chat history on infrastructure topics typically contains two or three of these high-density operational sessions that hold more practical knowledge about the team's IaC approach than any written documentation that exists.
The compliance preparation sessions are the most valuable recovery target for IaC decisions in AI chat history. The conversation pattern: "we're doing a SOC 2 audit in 90 days and they're asking for evidence of a change management process for infrastructure — what do we need to show?" or "the auditor is asking who can approve and apply Terraform changes to production — how do we document that?" or "we need to implement policy-as-code for our Terraform CI — what's the right tool and what rules should we enforce first?" These sessions contain the decisions that were made explicitly in response to a compliance requirement: which CI pipeline tool was chosen (Atlantis versus Terraform Cloud), which OPA rules were written for the initial policy-as-code gate, what the break-glass procedure looks like, how long CI run logs are retained. These decisions were made deliberately, under deadline pressure, with real understanding of the compliance requirement they were satisfying. They belong in the IaC strategy ADR, but they are captured only in the chat sessions from the 90-day audit preparation sprint and are never consolidated into the documentation that would let the next compliance cycle start from a known baseline rather than from scratch. The open-source extractor surfaces these sessions from the AI chat history, where the compliance rationale is preserved in full but has never been linked to the infrastructure repository's documentation.
The incident postmortem sessions capture the retrospective analysis, which is often the clearest articulation of what the IaC strategy should have specified. The conversation pattern after an incident like the opening narrative: "how do we prevent manual changes from being applied to production without going through IaC?" or "we need a way to detect when someone adds a security group rule outside of Terraform — is there a way to get alerted?" or "we had a terraform plan that would have deleted a necessary resource and the reviewer didn't catch it — how do we prevent this?" The answers in these sessions produce the most operationally grounded IaC strategy decisions (the drift detection job specification, the destructive-operation review requirement, the break-glass reconciliation window) because they are derived from a specific failure rather than from general best practices. A postmortem that produces action items for the IaC strategy ADR is more durable than one that produces Jira tickets: the ADR is what the next engineer reads when setting up a new environment, not the ticket backlog from three years ago.
Writing the IaC strategy ADR
The IaC strategy ADR needs five sections. Each section answers questions that different stakeholders will ask at different stages of the infrastructure lifecycle — from the new engineer setting up a development environment to the compliance auditor reviewing the production change trail to the on-call engineer responding to a drift alert at 2am.
Section 1: Tool selection. The IaC tool chosen — Terraform (or OpenTofu), Pulumi, CloudFormation / CDK, or a hybrid — with the specific version or version range pinned. The evaluation of alternatives that were considered, with rejection rationale specific to the team's cloud provider footprint, language expertise, and compliance requirements: why Pulumi was not selected (team's HCL familiarity outweighed the language expressiveness benefit at the current infrastructure scale), why CloudFormation was not selected (multi-cloud or multi-provider requirements ruled out an AWS-only tool), why Terraform was chosen over OpenTofu (or vice versa, with the licensing rationale). The provider list: which cloud providers and SaaS providers are managed by IaC, which are managed by other means (console, provider-specific CLI, separate automation) and why. The constraint that would trigger a tool reconsideration: if the provider ecosystem for a required integration does not have a Terraform provider, if the infrastructure complexity grows to a level where HCL's expressiveness limits are reached, if a compliance requirement mandates a specific audit trail that the chosen tool cannot provide without additional tooling.
Section 2: State management and drift detection. The remote state backend configuration: backend type, bucket or storage container name and path convention, locking mechanism, backup and versioning configuration, and access control for the state bucket (which IAM roles can read, write, and delete state — the state file contains sensitive information including database connection strings and API keys embedded in resource configurations). The drift detection approach: whether a scheduled plan job runs against each environment, the cadence, where the plan output is reported, what alert threshold distinguishes expected drift (pending PR) from unexpected drift (untracked manual change), and the response procedure for a drift alert. The import procedure for manually-created resources: when import is required, the specific command sequence, the requirement to write matching configuration before importing, and the post-import verification process. The procedure for state file recovery if the state is corrupted or accidentally deleted: which backup mechanism enables recovery, the expected recovery time, and who is authorized to execute a state recovery operation.
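The expected-versus-unexpected distinction above can be sketched in a few lines, assuming the scheduled job saves a plan and converts it with `terraform show -json`. The `resource_changes` list and its `change.actions` field come from Terraform's JSON plan representation; the function names and the `expected_addresses` input (resource addresses touched by open pull requests, however the team chooses to collect them) are hypothetical, and the sample plan is hand-written for illustration.

```python
from typing import Iterable

# Actions that mean "nothing will change" in Terraform's JSON plan output.
NON_CHANGING = (["no-op"], ["read"])


def changed_addresses(plan: dict) -> set:
    """Return addresses of resources the plan would create, update, or delete."""
    changed = set()
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions and actions not in NON_CHANGING:
            changed.add(rc["address"])
    return changed


def classify_drift(plan: dict, expected_addresses: Iterable[str]) -> dict:
    """Split changed resources into those explained by a pending PR and the rest."""
    expected = set(expected_addresses)
    changed = changed_addresses(plan)
    return {
        "expected": sorted(changed & expected),
        "unexpected": sorted(changed - expected),
    }


# Hand-written sample shaped like `terraform show -json` output (not real data).
sample_plan = {
    "resource_changes": [
        {"address": "aws_security_group.rds", "change": {"actions": ["update"]}},
        {"address": "aws_ecs_service.api", "change": {"actions": ["update"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
}

# Assumption: an open PR is known to touch the ECS service.
report = classify_drift(sample_plan, expected_addresses=["aws_ecs_service.api"])
print(report)
```

In this sketch an alert would fire only when `unexpected` is non-empty. How the expected set is built (for example, from the plan output of each open PR) is a design choice the ADR would need to state.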
Section 3: Module and environment structure. The environment directory layout with the rationale for the chosen structure (separate root modules per environment versus workspace-per-environment). The module versioning approach: local path references (appropriate for early stage, fast iteration, bounded team) or versioned registry references (required once module changes need controlled promotion). The variable promotion pattern: how environment-specific values are managed (separate .tfvars files, Terraform Cloud workspace variables, AWS SSM Parameter Store references), with the naming convention for each category of variable (non-sensitive configuration, sensitive credentials, environment identifiers). The naming convention for managed resources: the required name prefix or suffix pattern that makes resources identifiable in the cloud console, and the required tags that must be applied to every resource. The convention for managing Terraform output values that are referenced across module boundaries (output references versus separate data sources).
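A naming and tagging convention is easier to enforce when it is written as a check rather than as prose. The following is a minimal sketch under stated assumptions: the `<env>-<service>-<purpose>` pattern, the environment names, and the three required tag keys are all hypothetical placeholders for whatever the ADR specifies.

```python
import re

# Hypothetical convention: <env>-<service>-<purpose>, lowercase, hyphen-separated.
NAME_PATTERN = re.compile(r"^(dev|staging|prod)-[a-z0-9]+(-[a-z0-9]+)+$")

# Hypothetical required tag keys; the ADR would list the real ones.
REQUIRED_TAGS = {"Environment", "Owner", "ManagedBy"}


def convention_violations(name: str, tags: dict) -> list:
    """Return human-readable problems for one resource; empty list means compliant."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not match <env>-<service>-<purpose>")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        problems.append("missing required tags: " + ", ".join(missing))
    return problems


print(convention_violations(
    "prod-billing-db",
    {"Environment": "prod", "Owner": "platform", "ManagedBy": "terraform"},
))
print(convention_violations("rds-cluster", {"Owner": "platform"}))
```

A check like this could run against the planned resource attributes in CI, so that an unnamed or untagged resource fails review before it reaches the cloud console.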
Section 4: Testing and CI strategy. The CI pipeline stages: which static analysis checks run on every PR commit, which plan gates run on every PR, whether policy-as-code gates run on the plan output, and whether ephemeral environment tests run before production apply. For each stage: the specific tools and commands, the failure behavior (fail the PR, post a warning comment, require explicit reviewer acknowledgement), and the authorization required to override a failing gate in an emergency. The policy-as-code rule library: which security and compliance rules are enforced at the plan gate, how rules are added (PR to the rule library, reviewed by security role), and how exceptions are tracked (a documented exception with an expiry date and a justification, reviewed at each compliance cycle). The destructive operation review requirement: what constitutes a destructive operation in the plan output, how the reviewer must acknowledge it in the PR description, and whether a destructive operation in production requires a secondary approver beyond the normal PR review process. The database migration strategy intersects here for infrastructure that manages database resources: a plan that would replace an RDS instance (destroy and recreate) is a database migration, not just an infrastructure change, and the review process must account for data loss risk.
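The destructive operation review requirement can be backed by an automated check. This sketch assumes the plan is available as JSON from `terraform show -json` and that the team's acknowledgement convention is to name each affected resource address in the PR description; that convention, the function names, and the sample data are hypothetical. In Terraform's JSON plan output, a deletion appears as the action list `["delete"]` and a replacement as a list containing both `"delete"` and `"create"`.

```python
def destructive_changes(plan: dict) -> dict:
    """Map resource address -> 'delete' or 'replace' for destructive plan actions."""
    found = {}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            found[rc["address"]] = "replace" if "create" in actions else "delete"
    return found


def unacknowledged(plan: dict, pr_description: str) -> list:
    """Destructive addresses that the PR description does not mention by name."""
    return sorted(a for a in destructive_changes(plan) if a not in pr_description)


# Hand-written sample shaped like `terraform show -json` output (not real data).
sample_plan = {
    "resource_changes": [
        {"address": "aws_security_group_rule.partner_ingress",
         "change": {"actions": ["delete"]}},
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_ecs_service.api",
         "change": {"actions": ["update"]}},
    ]
}
pr_description = "Tightens SG rules. Acknowledged: aws_db_instance.main (replace)."

print(destructive_changes(sample_plan))
print(unacknowledged(sample_plan, pr_description))
```

One limitation is worth stating in the ADR: if security group rules are declared inline on the security group resource, removing a rule shows up as an `update` of the group, not as a `delete`. An action-based check alone would therefore miss a removal like the one in the opening incident; catching it requires comparing the before and after values of the rule attributes.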
Section 5: Apply authorization and compliance audit policy. Who is authorized to run apply in each environment: dev (any engineer with the appropriate IAM role), staging (any engineer via CI pipeline, no manual applies), production (CI pipeline only, triggered by merge to main after PR approval). The specific CI pipeline tool and configuration that enforces this: Atlantis with an atlantis.yaml that requires approval before apply; Terraform Cloud with a run approval gate; GitHub Actions with environment protection rules that restrict production deploys to specified reviewers. The IAM roles structure: the pipeline IAM role with least-privilege permissions for each managed resource type, the read-only engineer role for production inspection, the break-glass role with write permissions requiring MFA. The break-glass procedure: who can initiate a break-glass apply, what authorization is required before apply (on-call channel message, PagerDuty incident link, second engineer acknowledgement), where the apply is executed (a dedicated break-glass run environment that logs the plan output and apply result with the initiating engineer's identity), and the reconciliation obligation (IaC PR within 48 hours that brings the manual change into state, linked to the break-glass event). The change audit trail: where each component is stored (git history for configuration changes, CI run logs for plan and apply records, state backend versions for state snapshots, cloud provider audit logs for all API calls), how long each component is retained, and the procedure for producing the audit evidence for a compliance review or incident investigation. The data retention strategy applies to IaC audit artifacts as well as application data — the retention period for CI run logs and state file versions must be specified and enforced, not left to the default retention of the storage service.
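The 48-hour reconciliation obligation only holds if something checks it. A minimal sketch, assuming break-glass events are recorded somewhere queryable: the `BreakGlassEvent` record, its field names, and the sample incidents are hypothetical, and only the 48-hour window comes from the policy described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

RECONCILIATION_WINDOW = timedelta(hours=48)  # window stated in the break-glass policy


@dataclass
class BreakGlassEvent:
    incident_id: str
    engineer: str
    applied_at: datetime
    reconciliation_pr: Optional[str] = None  # set once the IaC PR is linked


def overdue(events: List[BreakGlassEvent], now: datetime,
            window: timedelta = RECONCILIATION_WINDOW) -> List[BreakGlassEvent]:
    """Events with no linked IaC PR whose reconciliation window has elapsed."""
    return [
        e for e in events
        if e.reconciliation_pr is None and now - e.applied_at > window
    ]


# Made-up sample events for illustration only.
utc = timezone.utc
now = datetime(2024, 6, 10, 12, 0, tzinfo=utc)
events = [
    BreakGlassEvent("INC-1", "alice", datetime(2024, 6, 7, 2, 0, tzinfo=utc)),
    BreakGlassEvent("INC-2", "bob", datetime(2024, 6, 10, 1, 0, tzinfo=utc)),
    BreakGlassEvent("INC-3", "carol", datetime(2024, 6, 1, 9, 0, tzinfo=utc),
                    reconciliation_pr="linked"),
]

for event in overdue(events, now):
    print(f"{event.incident_id}: reconciliation PR overdue ({event.engineer})")
```

Run on a schedule and posted to the on-call channel, a check like this turns "someone clean this up in IaC" into a tracked obligation with an owner and a deadline.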
The infrastructure-as-code strategy is the decision that determines what every future infrastructure change costs: in review time, in compliance overhead, and in incident recovery time. A tool chosen in year one with "let's just use Terraform, everyone knows it" is not wrong by itself. It becomes wrong when it is operated without a state management discipline, without a drift detection process, without a module structure that scales to twenty engineers, and without an apply authorization model that satisfies the compliance audit requirement that appears in year three. The decisions that determine infrastructure manageability are made early, when the cost of the wrong choice is invisible, and discovered late, when reconciling three years of accumulated manual changes and undocumented conventions under audit pressure is the most expensive engineering work on the roadmap. The ADR preserves the tool selection rationale alongside the state management approach, the module conventions, the drift detection procedure, and the apply authorization policy, because those operational decisions, more than the tool choice that laid the foundation, shape the infrastructure practice. The open-source extractor recovers the AI chat sessions where the operational and compliance decisions were made, from the conversations where a real incident or a real audit deadline produced a real policy, before that policy was archived in a closed chat session and forgotten.