Why does a CI/CD pipeline need an architecture decision record?

The CI/CD pipeline decisions made in year one — build tool, artifact model, deployment strategy, rollback mechanism, environment promotion model — are not easily changed once a team has built operational muscle around them. A team that starts with a simple push-to-deploy model and adds staging environments, artifact registries, blue-green infrastructure, and approval gates incrementally will have a pipeline in year three that nobody can fully describe, where the rollback procedure requires tribal knowledge not written in any runbook, and where the deploy time has grown from 4 minutes to 25 minutes through accumulated tooling without anyone making an explicit decision about the tradeoff. The pipeline ADR captures the decisions about build pipeline structure, test gate placement, artifact promotion model, deployment strategy, rollback mechanism, and pipeline security posture — the decisions that determine what happens when something goes wrong in production, not just how code ships on normal days.

What is the difference between a deployment strategy and a rollback strategy?

A deployment strategy defines how new code reaches production: incrementally, instance by instance or batch by batch (rolling replace), to a parallel environment that receives live traffic after validation (blue-green), to an increasing percentage of traffic (canary), or behind a feature flag that controls which users see it (feature-flagged deploy). A rollback strategy defines what happens when the deployed code causes problems: whether rollback is achieved by re-deploying the previous artifact (deploy-based rollback), by switching traffic back to the previous environment (blue-green cutback), by reducing the canary percentage to zero (canary abort), or by toggling a feature flag off (flag-based rollback). The two strategies are coupled: the rollback mechanism available to the team is determined by the deployment strategy chosen. Blue-green enables near-instant traffic cutback because the previous environment is still running and healthy. Rolling replace requires re-deploying the previous artifact, which takes as long as the original deployment. Canary allows percentage-based rollback without full re-deployment. Feature flags allow rollback without any infrastructure change. The pipeline ADR must specify both strategies together, because a deployment strategy that has no corresponding rollback path is an incomplete decision: it specifies how to ship, but not how to recover.

What should a CI/CD pipeline decision record include?

A CI/CD pipeline ADR needs five sections. First, the pipeline tool and structure: CI platform choice (GitHub Actions, GitLab CI, Jenkins, CircleCI, Buildkite), the pipeline-as-code format, how pipelines are shared across services, and the constraints that would trigger a platform reconsideration. Second, the build and artifact model: how artifacts are built and stored, artifact versioning and immutability requirements, the artifact promotion model from CI through staging to production, and the retention policy for build artifacts. Third, the deployment strategy: rolling, blue-green, canary, or feature-flag-driven, with the rationale for each service type and the infrastructure requirements each strategy imposes. Fourth, the rollback strategy: the specific rollback mechanism for each deployment strategy, the expected rollback time, who is authorized to initiate a rollback, and what the runbook looks like under incident pressure. Fifth, the pipeline security posture: secrets management, build environment isolation, artifact attestation requirements, and the supply chain security controls (SLSA level, SBOM generation) required by the product's compliance posture.

How does the build artifact model affect deployment reliability?

The build artifact model — whether the pipeline builds and stores immutable artifacts (Docker images in a registry, compiled binaries in object storage, language-specific packages) or builds from source at deploy time — determines two properties that become critical under incident pressure. First, deploy time under rollback: an immutable artifact that already exists in the registry deploys in the time it takes to pull and run the image; building from source at deploy time means rollback requires re-running the full build pipeline, adding the build time to the rollback latency. A team whose normal deploy takes 8 minutes with artifact-based deployment discovers during an incident that their rollback takes 22 minutes because the rollback path goes through the build pipeline. Second, reproducibility: an immutable artifact built from a specific commit SHA is the same artifact every time it is deployed, in every environment. An environment that builds from source at deploy time may produce different artifacts depending on dependency resolution at build time — a dependency that was updated between the original deploy and the rollback attempt can produce a different binary from the same source, which is not the artifact that was validated in staging.

2026-06-20 · ~20 min read

The CI/CD pipeline decision record: why the deployment pipeline you chose determines your rollback capability and your mean time to deploy under incident pressure

CI/CD pipelines look like plumbing until a production incident requires an emergency rollback, and the pipeline that ships code in 11 minutes under normal conditions takes 47 minutes to roll back because nobody designed the rollback path. The deployment pipeline decisions made in year one (build tool, artifact model, deployment strategy, rollback mechanism, test gate structure, environment promotion model) are not revisited until something breaks badly in year two or three, by which point the team has built operational dependencies on the original choices that are expensive to unwind. The pipeline architecture determines whether your team can respond to production incidents with a single command and 90 seconds of wait time, or with a multi-step manual procedure, an all-hands Slack channel, and thirty minutes of uncertainty about whether the previous version is actually what is running in production now.

A B2B SaaS team ships a payment processing update to production on a Thursday afternoon. The change looked clean in code review — a new field validation in the checkout form, a minor adjustment to the Stripe session creation payload. CI passed: 847 tests, all green. The deploy finished in 11 minutes. Forty seconds later, the first error alert fires: the payment endpoint is returning 422 on every request. The Stripe dashboard shows zero successful charges in the last two minutes. The on-call engineer opens PagerDuty, acknowledges the alert, and types the obvious question in the incident channel: "How do we roll back?"

The answer the team discovers at that moment reveals something they had never explicitly decided. The deployment strategy is a rolling update: the new container replaces old containers one at a time, and once the rollout completes, the old containers are gone. The artifact registry retains the previous image — tagged with the prior commit SHA — but deploying it requires triggering a new pipeline run pointing at that SHA, which means re-running the entire CI pipeline: build, test, push, deploy. The CI pipeline takes 11 minutes under normal conditions, but under this incident the staging environment is also being used for a different rollback investigation, and the pipeline queue adds 14 minutes of wait time. At the 47-minute mark, the previous version is finally running in production. The incident cost 47 minutes of revenue on a payment-critical endpoint, and the post-incident review reveals that the rollback path was never documented in the runbook — the team had assumed "just re-deploy the previous tag" and had never measured what that actually took.

The incident reveals something that was never written down as a decision: the team had a deployment strategy (rolling update) but not a rollback strategy. The rollback mechanism is not the inverse of the deployment strategy — it is a separate decision with its own infrastructure requirements, its own time budget, and its own failure modes. A rolling update strategy with an artifact registry gives you a rollback path, but not a fast one. Blue-green gives you a fast rollback path (traffic cutback to the idle environment) but requires double the infrastructure. Canary gives you a rollback path that limits blast radius (abort the canary, return traffic to the stable version) but requires traffic-splitting infrastructure. Feature flags give you instant rollback without any infrastructure change, but require the code change to be behind a flag — which this payment validation change was not.

Why CI/CD pipeline architecture is a set of decisions, not a configuration

CI/CD pipelines are typically introduced as a configuration problem: pick a CI platform, write a YAML file, ship code. The decisions embedded in that initial configuration — which test gates block the deploy, whether artifacts are built and stored or built at deploy time, how environments are promoted through, what the rollback mechanism is — are treated as implementation details rather than architectural choices with long-lived consequences.

The pipeline structure determines the team's ability to validate changes before they affect production. A pipeline that runs all tests in a single stage, then deploys to production, provides a binary signal: tests pass or fail, then the deploy happens. A pipeline that promotes artifacts through environments — dev, staging, production — with validation gates at each stage provides a signal about behavior in production-equivalent conditions, not just about unit and integration test coverage. The difference matters most for changes that are hard to test in isolation: configuration changes, infrastructure changes, changes to third-party integration behavior, changes to database query plans that only manifest under production data volumes. The test strategy decision record and the pipeline structure are coupled — the test gates in the pipeline are only meaningful if the test strategy specifies what each gate validates and what failure rate is acceptable before blocking the deploy.

The artifact model determines the reproducibility of every deploy and every rollback. An immutable artifact — a Docker image tagged with a specific commit SHA, a compiled binary stored in object storage with a content-addressed key, a language package pinned to a specific version — is the same artifact in every environment and at every point in time. Deploying the previous SHA in production is exactly the same operation as deploying the current SHA, because the artifact already exists. Building from source at deploy time means the artifact is a function of the source code AND the dependency resolution at build time AND the build environment state — three variables that can diverge between the original build and the rollback build.
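The three variables can be made concrete by fingerprinting the build inputs. The following is a minimal sketch, not a feature of any CI tool: the function name `build_fingerprint` and the choice of inputs (a commit SHA, the resolved lockfile text, and a base image digest standing in for build environment state) are assumptions made for illustration.

```python
import hashlib


def build_fingerprint(commit_sha: str, lockfile_text: str, base_image_digest: str) -> str:
    """Hash the three inputs that together determine what a build produces."""
    h = hashlib.sha256()
    for part in (commit_sha, lockfile_text, base_image_digest):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


# Hypothetical values: same commit, but a dependency resolved to a newer version.
original = build_fingerprint("3f2a9c1", "requests==2.31.0\n", "sha256:aaa")
rebuild = build_fingerprint("3f2a9c1", "requests==2.32.0\n", "sha256:aaa")

print(original == rebuild)  # False
```

If the fingerprint recorded at the original build differs from the one computed at rollback time, the rebuilt artifact is not the one that was validated, even though the commit SHA is identical.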

The deployment strategy determines the blast radius of a bad deploy. A strategy that replaces all running instances simultaneously maximizes both deployment speed and blast radius: all users see the new version, and if it is broken, all users are affected. A canary strategy that routes 5% of traffic to the new version first limits the blast radius to 5% of users while the canary is being monitored. Blue-green keeps the blast radius at zero while the new version is being validated, because the new version receives no user traffic until the cutover. After the cutover every user is on the new version, so the protection comes from pre-cutover validation and a fast cutback rather than from limited exposure, and it requires infrastructure to run two full production environments simultaneously. The right blast-radius tradeoff is not universal: a payment-critical endpoint warrants a more conservative strategy than a read-only API that serves cached data.

The pipeline security posture determines the attack surface of the build and deploy process itself. A pipeline that has write access to the production environment with long-lived credentials stored as environment variables is an attractive target: a compromised CI runner, a malicious pull request that exfiltrates the credentials, or a supply chain attack on a build dependency can use the pipeline's credentials to deploy arbitrary code to production. The decisions about secrets management (short-lived tokens versus long-lived credentials), build environment isolation (ephemeral runners versus persistent runners with shared state), artifact provenance (signed artifacts with attestation versus unsigned artifacts), and the scope of pipeline credentials (least-privilege deploy role versus AdministratorAccess) are security decisions that belong in the pipeline ADR, not in a later security review that happens after the pipeline is already operational.

CI platform and pipeline structure

The CI platform choice is the most visible pipeline decision and the one most frequently made by whoever sets up the first repository. The usual defaults are GitHub Actions for teams already on GitHub, GitLab CI for teams on GitLab, Jenkins for teams with a legacy investment or a need for self-hosted runners, and Buildkite for teams with complex runner requirements that hosted CI cannot satisfy. Each has structural properties that shape what is easy to implement and what requires custom tooling.

GitHub Actions is the dominant choice for teams on GitHub, with a large marketplace of pre-built actions and native integration with GitHub pull requests, environments, and deployment protection rules. The structural properties that matter for the pipeline ADR: pipeline configuration lives in .github/workflows/ as YAML files that are part of the repository — changes to the pipeline go through the same code review as changes to the application. Reusable workflows allow shared pipeline logic across repositories without copy-paste, but require a workflow call syntax rather than direct YAML inclusion. GitHub-hosted runners are ephemeral — each job starts in a clean environment, which eliminates state leakage between builds but means dependencies must be installed on every run (or cached explicitly). GitHub Actions' OIDC integration allows the pipeline to obtain short-lived cloud provider credentials (AWS STS, GCP Workload Identity) without storing long-lived credentials as secrets, which is the correct secrets management model for production deployments.

GitLab CI is the natural choice for teams on GitLab, with an architecture where the pipeline definition lives in .gitlab-ci.yml and pipelines are first-class GitLab objects with a full UI, API, and audit log. The structural properties: GitLab CI's DAG pipeline model (using needs:) allows jobs to start as soon as their dependencies complete rather than waiting for an entire stage to finish, which reduces total pipeline time for pipelines with independent parallel jobs. GitLab's environments and deployments model is more structured than GitHub's: each deployment creates an environment record with a deployment ID, a deploy time, and a link to the pipeline that produced it — which provides the audit trail that compliance requirements need without additional tooling. GitLab CI's protected environments allow environment-specific approval gates (a production deploy requires approval from a member of the production-approvers group) natively, without requiring a third-party deployment tool.

Jenkins is the legacy choice for teams with an existing Jenkins investment or a requirement for self-hosted CI that managed CI services cannot satisfy (air-gapped environments, specific compliance requirements for build environment isolation, custom hardware requirements for build jobs). The structural properties that make Jenkins operationally costly: Jenkins itself requires maintenance — version upgrades, plugin compatibility management, runner capacity management — which is engineering work that managed CI services absorb. Jenkinsfiles (the pipeline-as-code format) are Groovy-based and have a steeper learning curve than YAML-based pipeline definitions. The agent model (executors on Jenkins agents rather than ephemeral VMs) means builds can accumulate shared state on the agent between runs, which produces the class of bugs where "it works on CI" refers to the specific agent state that happened to be present, not a clean environment. Modern Jenkins configurations with ephemeral cloud agents (Kubernetes pod agents, EC2 spot agents) address the state problem but add operational complexity.

Buildkite separates the CI orchestration (Buildkite's SaaS) from the execution (team-managed agents that run on any infrastructure). This model is appropriate for teams with specific runner requirements: GPU builds for ML pipelines, macOS runners for mobile CI, specific network access requirements for integration tests, or cost optimization through Spot/Preemptible instances at scale. The structural property that matters for the pipeline ADR: the team is responsible for operating the agent fleet, which adds operational complexity that managed CI eliminates. The tradeoff — lower per-minute cost and full control over the build environment — is worth it at scale (hundreds of CI minutes per day) and not worth it at early stage (dozens of CI minutes per day).

Pipeline-as-code and shared pipeline templates are the mechanism for avoiding copy-paste pipeline configuration across multiple services. GitHub Actions reusable workflows, GitLab CI include: directives, Jenkins shared libraries, or a purpose-built internal tool (a pipeline generate command that emits standard pipeline configuration for a service type) all serve the same function: a change to the shared pipeline logic applies to all services that use it without requiring per-repository changes. The pipeline ADR must specify how shared pipeline logic is managed — where it lives, how it is versioned, how updates propagate, and how service teams opt into or out of specific pipeline stages.

Build and artifact model

The artifact model is the decision that most directly affects rollback latency, deploy reproducibility, and the team's ability to reason about what exactly is running in each environment.

Immutable artifact builds produce an artifact — Docker image, compiled binary, Lambda ZIP, Helm chart — that is tagged with an immutable identifier (typically the git commit SHA), stored in a registry or object store, and promoted through environments by deploying the same artifact rather than building it again. The properties of this model: the artifact that ran in staging is exactly the artifact that will run in production — not a new build from the same source, but the same bytes. Rollback is a deploy of a previously-stored artifact, which takes exactly as long as a normal deploy (pulling and running the existing image), not as long as a build-and-deploy cycle. The artifact registry is the authoritative record of what has been deployed to each environment and when — a query against the registry's deployment history answers "what version was running in production at 14:37 UTC on June 15?" without requiring log archaeology. The cost of this model is the artifact registry infrastructure (an OCI-compatible registry like ECR, GCR, GHCR, or a self-hosted Harbor) and the discipline of tagging and retention policies.

Build-at-deploy models — where the pipeline checks out source and builds at deploy time rather than promoting a pre-built artifact — are simpler to set up in year one and produce the reproducibility and rollback-latency problems described in the opening narrative. The team that builds at deploy time discovers the gap when a rollback requires re-running the build pipeline under incident time pressure, and a dependency that changed between the original build and the rollback build produces a different binary from the same source SHA. The most common example: a pip install -r requirements.txt or npm install without locked versions pulls whatever is current at install time, which may be different from what was pulled during the original build three days ago.

Artifact tagging and promotion is the record-keeping model for which artifact is in which environment. The minimum tagging convention: :sha-{commit-sha} for the immutable build artifact, :staging and :production as mutable tags that are moved to point at the current deployed artifact. This convention allows both "deploy a specific SHA" (for a targeted rollback to an exact previous version) and "what is deployed in production right now?" (read the :production tag). A richer convention adds the pipeline run ID, the branch name, and the semantic version where applicable. The pipeline ADR must specify the tagging convention, who moves the mutable environment tags (the pipeline, after automated validation passes), and the retention policy for old artifacts (how long does the registry retain images tagged only with a commit SHA before garbage collection?).
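The tagging convention above can be sketched as a small in-memory model. This is an illustration under stated assumptions: `TagStore` is a hypothetical stand-in for a registry's tag-to-digest mapping, and the SHAs and digests are made up. A real registry enforces tag immutability through its own settings rather than through client code.

```python
from typing import Optional


class TagStore:
    """In-memory stand-in for a registry's tag-to-digest mapping (hypothetical)."""

    ENVIRONMENTS = {"staging", "production"}

    def __init__(self) -> None:
        self._tags = {}

    def push(self, commit_sha: str, digest: str) -> str:
        """Record an immutable sha-{commit} tag; refuse to repoint it."""
        tag = f"sha-{commit_sha}"
        existing = self._tags.get(tag)
        if existing is not None and existing != digest:
            raise ValueError(f"{tag} is immutable and already points at {existing}")
        self._tags[tag] = digest
        return tag

    def promote(self, commit_sha: str, environment: str) -> None:
        """Move a mutable environment tag to an artifact that was already pushed."""
        if environment not in self.ENVIRONMENTS:
            raise ValueError(f"unknown environment tag: {environment}")
        self._tags[environment] = self._tags[f"sha-{commit_sha}"]

    def deployed(self, environment: str) -> Optional[str]:
        """Answer 'what is deployed here right now?' as an immutable tag name."""
        digest = self._tags.get(environment)
        for tag, value in self._tags.items():
            if tag.startswith("sha-") and value == digest:
                return tag
        return None


store = TagStore()
store.push("a1b2c3d", "sha256:1111")
store.push("e4f5a6b", "sha256:2222")
store.promote("a1b2c3d", "production")
store.promote("e4f5a6b", "production")  # new release
store.promote("a1b2c3d", "production")  # targeted rollback: move the tag back
print(store.deployed("production"))  # sha-a1b2c3d
```

In this model a rollback is the same operation as a promotion: no build runs, and only the mutable tag moves.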

Artifact attestation and SBOM generation are the supply chain security controls that compliance frameworks increasingly require. SLSA (Supply-chain Levels for Software Artifacts) provides a framework for the provenance claims an artifact can make. In the SLSA v1.0 build track, Level 1 requires that provenance describing how the artifact was built exists; Level 2 adds a hosted build platform that generates and signs that provenance; Level 3 adds a hardened build platform whose provenance cannot be forged by the build itself. A Software Bill of Materials (SBOM) documents the dependencies included in the artifact: the packages, their versions, and their known vulnerabilities at build time. The pipeline ADR must specify the target SLSA level, the SBOM format (SPDX or CycloneDX), and where attestations are stored and verified, either in the artifact registry (OCI attestation references) or in a separate attestation store (Sigstore Rekor, an internal transparency log).

Deployment strategy

The deployment strategy determines how new code replaces old code in production. The choice is not a single decision for the entire product — different services with different availability requirements, traffic patterns, and blast-radius tolerances may use different strategies. The pipeline ADR must specify the default strategy and the criteria for deviating from it.

Rolling update replaces running instances of the old version one at a time (or in batches) with instances of the new version, until all instances are running the new version. The structural properties: both versions run simultaneously during the rollout window, which means the application and its API must be backward-compatible with the previous version's clients for the duration of the rollout. Database migrations that are applied before the rolling update must be backward-compatible with the old version of the code. The rollback mechanism is re-deploying the previous artifact — which takes as long as the original deployment and requires that the previous artifact is available in the registry. Rolling updates are appropriate for stateless services with good backward-compatibility discipline. They are inappropriate for services where the old and new versions cannot run simultaneously — any schema migration that makes the old code non-functional is incompatible with a rolling update strategy.

Blue-green deployment maintains two identical production environments (blue and green), with one serving live traffic and the other idle. A deployment updates the idle environment with the new version, validates it while it is not serving traffic, and then cuts over the traffic (typically via a load balancer rule change or a DNS update). The structural properties: with a load balancer rule change the cutover is close to instantaneous, because new requests are routed to the new environment as soon as the rule takes effect. A DNS-based cutover is slower, because clients keep resolving the old address until the record's TTL and any resolver caches expire. Rollback is equally fast: cut traffic back to the previous environment, which is still running and healthy. The cost is double the production infrastructure: two ECS services (or two Kubernetes deployments, or two Auto Scaling groups) running simultaneously at all times. The idle environment must be kept warm and up-to-date with production configuration (environment variables, secrets, external service credentials) so it is deployable without a cold start. Blue-green is appropriate for services where rollback speed is critical (payment processing, authentication, anything that directly blocks revenue) and where the infrastructure cost of two environments is acceptable. For the opening narrative's payment endpoint, a blue-green cutback could have reduced the rollback from 47 minutes to under 90 seconds.
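The cutover and cutback can be modeled in a few lines. This is a toy sketch, assuming a router that sends all traffic to exactly one environment; `BlueGreenRouter` and the version labels are hypothetical, and a real cutover would be a load balancer or service selector change.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy model of a load balancer that sends all traffic to one environment."""

    versions: dict = field(default_factory=lambda: {"blue": None, "green": None})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # The live environment is untouched, so users see nothing yet.
        self.versions[self.idle] = version

    def cut_over(self) -> None:
        # Swap which environment receives traffic; refuse an empty target.
        if self.versions[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serving(self) -> str:
        return self.versions[self.live]


router = BlueGreenRouter()
router.versions["blue"] = "v41"  # currently live
router.deploy_to_idle("v42")
router.cut_over()
print(router.serving())  # v42
router.cut_over()  # rollback: the previous environment still runs v41
print(router.serving())  # v41
```

The rollback is the same `cut_over` call as the release, which is why it costs no build or deploy time, and also why it only works while the previous environment is left running.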

Canary deployment routes a small percentage of production traffic (typically 1–10%) to the new version while the rest continues to receive the old version. The canary is monitored — error rate, latency, business metrics — and either progressively increased to 100% (if metrics are healthy) or aborted back to 0% (if metrics regress). The structural properties: the blast radius is bounded to the canary percentage during the validation window. Rollback is aborting the canary — moving the traffic back to 100% old version, which takes seconds. Canary requires traffic-splitting infrastructure: a load balancer or service mesh that can route by percentage (NGINX upstream weights, AWS ALB weighted target groups, Istio VirtualService weight rules, Argo Rollouts). The validation logic must be specified: which metrics are monitored, what threshold triggers automatic abort versus human review, and what the maximum canary duration is before a human must decide to promote or abort. Canary is appropriate for high-traffic services where the error rate signal is statistically meaningful at small percentages, and where the infrastructure and observability required to monitor the canary are already in place. The observability ADR is a prerequisite — a canary that is not monitored provides no signal and no automatic rollback capability.
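The validation logic described above can be written down as an explicit policy. The sketch below is illustrative only: every threshold in `CanaryPolicy` is an assumed placeholder, not a recommendation, and `evaluate_canary` is a hypothetical function rather than the API of Argo Rollouts or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPolicy:
    # All values are assumptions for illustration.
    max_error_rate_delta: float = 0.01  # automatic abort above this gap to stable
    max_p99_latency_ratio: float = 1.5  # escalate to a human above this ratio
    max_duration_minutes: int = 60      # a human must decide after this long
    min_requests: int = 500             # below this the sample is too small to judge


def evaluate_canary(policy: CanaryPolicy, canary_requests: int, canary_errors: int,
                    stable_error_rate: float, p99_ratio: float,
                    elapsed_minutes: int) -> str:
    """Return 'abort', 'human_review', 'hold', or 'promote'."""
    if canary_requests >= policy.min_requests:
        canary_error_rate = canary_errors / canary_requests
        if canary_error_rate - stable_error_rate > policy.max_error_rate_delta:
            return "abort"
        if p99_ratio > policy.max_p99_latency_ratio:
            return "human_review"
    if elapsed_minutes >= policy.max_duration_minutes:
        return "human_review"
    if canary_requests < policy.min_requests:
        return "hold"  # keep the current percentage and keep collecting data
    return "promote"


policy = CanaryPolicy()
print(evaluate_canary(policy, 1000, 50, 0.002, 1.1, 10))  # abort
print(evaluate_canary(policy, 100, 0, 0.002, 1.0, 10))    # hold
print(evaluate_canary(policy, 1000, 2, 0.002, 1.1, 10))   # promote
```

The `min_requests` guard reflects the point about statistical meaning: on a low-traffic service the canary may never leave `hold` before the maximum duration forces a human decision.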

Feature-flag-driven deployment deploys code to all instances but controls which users see the new behavior via a runtime flag. The deploy is complete when the new code is running on all instances; the release happens when the flag is enabled for the target user population. The structural properties: the deploy and the release are decoupled. A rollback is a flag toggle — taking less than a second and requiring no infrastructure change. The blast radius can be controlled by the flag targeting rules — enabling the new behavior for 1% of users, for internal users only, for users in a specific geographic region, or for users who have opted in. The cost of this model is the code complexity: every feature-flagged change requires branching logic in the code, the flag must be cleaned up after the release is confirmed stable, and the flag service itself is a dependency that must be available for the application to function (a flag service outage can affect the application behavior if the fallback behavior for a flag fetch failure is not specified). A LaunchDarkly, Unleash, Flipt, or OpenFeature-compatible flag service is required; a homegrown flag implementation in a database table is appropriate only for simple boolean flags on low-traffic surfaces.
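The fallback behavior for a failed flag fetch is worth making explicit in code. This is a minimal sketch, assuming a flag lookup is any callable that may raise; `flag_enabled`, the two stub services, and the flag key `new-checkout-validation` are hypothetical. Real flag SDKs generally accept a default value at the call site for the same reason.

```python
from typing import Callable


def flag_enabled(fetch: Callable[[str], bool], flag_key: str, default: bool) -> bool:
    """Evaluate a flag, falling back to an explicit default if the lookup fails."""
    try:
        return bool(fetch(flag_key))
    except Exception:
        # Flag service unreachable: use the declared default instead of
        # letting the outage decide application behavior.
        return default


def healthy_flag_service(flag_key: str) -> bool:
    # Stub that knows one flag; unknown flags are treated as off.
    return {"new-checkout-validation": True}.get(flag_key, False)


def broken_flag_service(flag_key: str) -> bool:
    # Stub that simulates a flag service outage.
    raise ConnectionError("flag service unavailable")


print(flag_enabled(healthy_flag_service, "new-checkout-validation", default=False))  # True
print(flag_enabled(broken_flag_service, "new-checkout-validation", default=False))   # False
```

Defaulting to `False` here means an outage silently rolls users back to the old behavior. Whether that is the right default is a per-flag decision that belongs next to the flag definition.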

Rollback strategy

The rollback strategy is the decision that is most frequently undocumented and most urgently needed during a production incident. The pipeline ADR must specify the rollback mechanism, the expected rollback time under incident conditions, who is authorized to initiate a rollback, and where the runbook is.

The rollback mechanism follows from the deployment strategy, but it is not automatic. Blue-green gives you the capability for a fast traffic cutback, but someone must know how to execute it: which load balancer rule to change, which Kubernetes service selector to update, what the CLI command is, and whether that command requires special IAM permissions or an approval gate that will delay execution under incident pressure. The rollback runbook must be tested in a non-incident context — not on a staging environment where the failure consequences are low, but on a production-equivalent environment where the same permissions, access controls, and approval requirements apply. A rollback procedure that requires a specific person who is on vacation is not a rollback procedure; it is a hope.

Rollback latency under incident conditions is almost always longer than the expected latency under normal conditions. The gap comes from: the approval gate that blocks automated rollbacks and requires human judgment under incident pressure; the CI pipeline queue that adds wait time when other teams are also deploying during the incident window; the artifact registry pull time when the previous image is large and has not been pulled recently (and therefore is not in the container runtime's local cache); and the time to verify that the rollback is actually serving production traffic rather than still running the broken version. The rollback time in the pipeline ADR must be measured — not estimated — in a controlled drill, with the same tooling and permissions that the on-call engineer would use at 2am.
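Measuring a drill is easier when the components are itemized and compared against the observed total. The sketch below uses the figures from the incident described earlier (an 11-minute pipeline run, 14 minutes of queue wait, a 47-minute total). The function name `rollback_budget` and the component labels are assumptions for illustration.

```python
def rollback_budget(components: dict, observed_total: float) -> dict:
    """Sum the measured components and report time not attributed to any of them."""
    if any(minutes < 0 for minutes in components.values()):
        raise ValueError("component durations must be non-negative")
    accounted = sum(components.values())
    return {
        "accounted_minutes": accounted,
        "unattributed_minutes": observed_total - accounted,
    }


result = rollback_budget({"pipeline_run": 11, "queue_wait": 14}, observed_total=47)
print(result)  # {'accounted_minutes': 25, 'unattributed_minutes': 22}
```

The two itemized components explain 25 of the 47 minutes. The remaining 22 are not broken down in the narrative, and that unattributed time is exactly what a measured drill is meant to surface.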

Rollback authorization is the question of who can trigger a rollback without approval gates and under what conditions. A rollback is a production deployment — it must go through the same authorization model as a normal deployment if the authorization model is designed to prevent unauthorized production changes. But incident response time is a competing constraint: a rollback that requires a PR review, CI pipeline completion, and environment protection approval may be correct for normal deployments and completely unworkable during a revenue-impacting incident. The pipeline ADR must specify the break-glass rollback procedure: who can bypass the normal approval gates, how they signal that a rollback is a break-glass event, what the audit trail is (the rollback must still be logged and attributed, even if it bypasses the approval gate), and what the reconciliation obligation is after the incident is resolved.

Database migration compatibility is the constraint on rollback that the pipeline ADR must address explicitly. A deploy that includes a database schema migration (a column added, a table renamed, an index added to unblock a query) may not be possible to roll back without also reverting the migration. If the migration is applied before the code deploy (the common convention for additive migrations), rolling back the code to the previous version leaves the migration in place. If the previous code version is not compatible with the migrated schema, for example because the migration removes a column that the previous code reads, a pure code rollback will produce a different failure than the original problem. The pipeline ADR must specify the migration compatibility requirement (all migrations must be backward-compatible with the previous code version: additive-only, with no destructive changes outside a multi-phase migration strategy) and the deploy order (migrations before the code deploy, not simultaneously). The database migration strategy ADR must be consistent with the pipeline ADR on this point: a migration strategy that permits destructive schema changes constrains the pipeline's rollback capability in ways that must be acknowledged and accounted for in the pipeline design.
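An additive-only rule is simple enough to check mechanically before a deploy. The following is a sketch under assumptions: the operation names (`add_column`, `drop_column`, and so on) are a hypothetical vocabulary, not the output of any specific migration tool, and anything not on the additive list is treated conservatively as blocking.

```python
# Operations assumed safe for the previous code version to run against.
ADDITIVE_OPERATIONS = {"add_column", "add_table", "add_index"}


def rollback_safe(operations: list) -> tuple:
    """Return (safe, blocking_operations) for a list of migration operations.

    Anything not known to be additive is treated as blocking a code-only rollback.
    """
    blocking = [op for op in operations if op not in ADDITIVE_OPERATIONS]
    return (not blocking, blocking)


print(rollback_safe(["add_column", "add_index"]))       # (True, [])
print(rollback_safe(["add_column", "drop_column"]))     # (False, ['drop_column'])
print(rollback_safe(["rename_table"]))                  # (False, ['rename_table'])
```

Classifying by operation name is a simplification. An added column that is NOT NULL with no default is nominally additive but still breaks inserts from the previous code version, so a real check needs to look at column constraints as well.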

Test gate structure and environment promotion

The test gate structure is the set of validation steps that a change must pass before it is allowed to promote to the next environment. The structure is a decision with direct tradeoffs between deployment speed and confidence in the deployed artifact.

The fast feedback gate runs on every commit (unit tests, linting, type checking, security scanning) and must complete within a budget that keeps the commit-to-feedback loop short enough that developers actually wait for it. A fast gate that takes more than 10 minutes begins to lose its value: developers context-switch away or move on without waiting for the result. The 10-minute budget forces prioritization: not all tests can run in the fast gate. The pipeline ADR must specify which test categories run in the fast gate and which are deferred to later stages, with the rationale for each placement. A Playwright end-to-end test suite that takes 45 minutes does not belong in the fast gate; a subset of smoke tests covering the critical user paths does.
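The placement decision can be sketched as a budget allocation. This is a simplified illustration: it assumes suites run one after another, and apart from the 45-minute end-to-end suite and the 10-minute budget mentioned above, the suite names and durations are made up.

```python
def plan_fast_gate(suites: list, budget_minutes: float = 10.0) -> tuple:
    """Split suites into (fast_gate, deferred).

    `suites` is a list of (name, estimated_minutes) in priority order.
    A suite joins the fast gate only if it still fits in the remaining budget.
    """
    fast, deferred, used = [], [], 0.0
    for name, minutes in suites:
        if used + minutes <= budget_minutes:
            fast.append(name)
            used += minutes
        else:
            deferred.append(name)
    return fast, deferred


suites = [
    ("lint", 1),
    ("typecheck", 2),
    ("unit", 4),
    ("e2e_full", 45),
    ("e2e_smoke", 3),
]
fast, deferred = plan_fast_gate(suites)
print(fast)      # ['lint', 'typecheck', 'unit', 'e2e_smoke']
print(deferred)  # ['e2e_full']
```

With parallel jobs the budget applies to the longest job rather than the sum, but the ADR question stays the same: which suites earn a place in the gate, and why.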

The integration gate runs against a deployed environment — typically a per-branch ephemeral environment or a shared integration environment — and validates behavior that cannot be tested against mocked dependencies: database query correctness against a real schema, external API integration behavior, message queue consumer behavior, background job execution. The integration gate is the expensive gate: it requires provisioning environment infrastructure, seeding test data, running the services under test, and tearing down the environment after. The cost is justified for changes that touch infrastructure boundaries; it may be disproportionate for changes that affect only business logic that unit tests can exercise in isolation. The pipeline ADR must specify when the integration gate is required (all changes, or only changes that touch specified directories), the environment provisioning strategy (shared environment with test isolation versus ephemeral per-branch environment), and the expected gate duration.
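If the ADR chooses the "only changes that touch specified directories" option, the trigger can be a simple path check over the changed files. This is a sketch under assumptions: the directory names are hypothetical, and the list of changed files is assumed to come from the CI platform or a `git diff --name-only` call.

```python
from pathlib import PurePosixPath

# Hypothetical directories that cross an infrastructure boundary.
INTEGRATION_DIRS = ("db/migrations", "services/payments/clients", "infra")


def requires_integration_gate(changed_files, trigger_dirs=INTEGRATION_DIRS):
    """True if any changed file lives under one of the trigger directories.

    Comparison is by path component, so "db/migrations_archive/x.sql" does
    not match the "db/migrations" trigger.
    """
    triggers = [PurePosixPath(d).parts for d in trigger_dirs]
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if any(parts[:len(t)] == t for t in triggers):
            return True
    return False


print(requires_integration_gate(["src/app.py", "db/migrations/0042_add_index.sql"]))
print(requires_integration_gate(["src/pricing/rules.py", "README.md"]))
```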

The staging promotion gate deploys the artifact to the staging environment and runs validation before the artifact is promoted to production. Staging must be production-equivalent in configuration: same service topology, same external service integrations (pointed at sandbox/test endpoints where the provider supports it), same infrastructure specifications (instance types, database sizes — or documented exceptions where production-equivalent sizing is cost-prohibitive). A staging environment with a SQLite database versus a production PostgreSQL cluster does not validate the queries, the connection pool behavior, or the index performance that matter in production. The staging gate must include a smoke test suite that verifies the critical user paths end-to-end: not just that the service starts, but that it can complete a representative set of operations successfully. The test strategy ADR must specify the smoke test coverage requirement for the staging gate.
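The "production-equivalent, or documented exceptions" rule lends itself to an automated comparison. A minimal sketch, assuming each environment's configuration can be flattened into a dictionary; the keys and values below are illustrative only.

```python
def parity_violations(production, staging, documented_exceptions=frozenset()):
    """Return the sorted keys where staging differs from production and the
    difference is not covered by a documented exception."""
    keys = set(production) | set(staging)
    return sorted(
        key for key in keys
        if production.get(key) != staging.get(key)
        and key not in documented_exceptions
    )


# Hypothetical flattened environment descriptions.
production = {"database_engine": "postgresql", "db_instance_size": "large", "service_count": 4}
staging = {"database_engine": "sqlite", "db_instance_size": "small", "service_count": 4}

# Smaller database sizing in staging is accepted for cost reasons and documented.
exceptions = {"db_instance_size"}

print(parity_violations(production, staging, exceptions))
```

Here the database engine mismatch is reported while the documented sizing difference is not, which mirrors the distinction the paragraph draws.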

Environment promotion gates and approval requirements specify who must approve a promotion from staging to production and under what conditions automated promotion is permitted. The pipeline ADR must specify the promotion authorization model: fully automated promotion if the staging gate passes (appropriate for teams with high test confidence and low-risk services), human approval required for every production deploy (appropriate for regulated environments or high-risk services), or human approval required only for changes tagged as high-risk (appropriate for teams that want speed for routine changes and caution for structural changes). The approval mechanism — a GitHub environment protection rule, a GitLab environment approval gate, a Buildkite block step, a PagerDuty change management event — must be specified, tested, and included in the rollback authorization model described above.
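The three authorization models reduce to a small decision function. The sketch below is illustrative: the enum and function names are hypothetical, and how a change gets tagged as high-risk is left to the team's own convention.

```python
from enum import Enum


class PromotionPolicy(Enum):
    AUTOMATED = "automated"                  # promote whenever staging passes
    ALWAYS_APPROVE = "always_approve"        # human approval for every deploy
    APPROVE_HIGH_RISK = "approve_high_risk"  # approval only for tagged changes


def needs_human_approval(policy, staging_gate_passed, high_risk):
    """Decide whether a production promotion must wait for a human.

    A failed staging gate is never promotable, regardless of policy.
    """
    if not staging_gate_passed:
        raise ValueError("artifact has not passed the staging gate")
    if policy is PromotionPolicy.AUTOMATED:
        return False
    if policy is PromotionPolicy.ALWAYS_APPROVE:
        return True
    return high_risk


print(needs_human_approval(PromotionPolicy.APPROVE_HIGH_RISK, True, high_risk=True))
print(needs_human_approval(PromotionPolicy.APPROVE_HIGH_RISK, True, high_risk=False))
```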

Pipeline security

The CI/CD pipeline is a privileged component of the production environment: it has credentials to build artifacts, push to registries, and deploy to production. A pipeline security posture that was acceptable in year one — long-lived credentials, broad IAM permissions, shared runner environments — becomes an active risk in year two when the team grows, the product is in production, and the pipeline is an attractive lateral movement target.

Secrets management in the pipeline must use short-lived credentials wherever the CI platform and cloud provider support it. GitHub Actions' OIDC integration with AWS STS eliminates the need for long-lived AWS access keys stored as repository secrets: the pipeline job presents the OIDC token issued by GitHub to STS and receives temporary credentials that expire after a short, configurable session duration and are valid only for the IAM role whose trust policy matches the repository and branch. GitLab CI supports the same pattern with GCP Workload Identity Federation and AWS STS. The pipeline ADR must specify which secrets require long-lived credentials (external services that do not support OIDC, such as a third-party API key or a database connection string) and how those credentials are stored and rotated. Long-lived secrets stored as repository-level variables in the CI platform are typically accessible to any pipeline job in the repository, so the scope of access must be restricted to the specific jobs that need each credential.
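The repository-and-branch restriction lives in the IAM role's trust policy. The sketch below builds such a policy as a Python dictionary; the account ID, organization, and repository are placeholders, and the condition keys follow the commonly documented GitHub OIDC setup, so they should be checked against current AWS and GitHub documentation before use.

```python
import json

GITHUB_OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, repo, branch):
    """Trust policy letting one repository branch assume the role via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used by the standard AWS credentials action.
                        f"{GITHUB_OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                        # Subject claim pins the role to one repo and branch.
                        f"{GITHUB_OIDC_PROVIDER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = github_oidc_trust_policy("123456789012", "example-org/example-repo", "main")
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on the subject claim, and not a wildcard match, is what keeps a pull request branch or a fork from assuming the production role.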

Build environment isolation determines whether a compromised build job can affect other jobs or the host environment. GitHub-hosted runners are ephemeral: each job gets a clean virtual machine, and the VM is destroyed after the job completes, so a compromised build job cannot persist state for a future job on the same runner. Self-hosted runners with persistent state are the opposite: a compromised build job can write files to the runner's disk, modify environment variables that persist across jobs, or exfiltrate credentials from the runner's environment. The pipeline ADR must specify the runner model for each pipeline stage, with the rationale for any persistent runner usage and the compensating controls (runner process isolation, regular re-imaging, network egress restrictions).

Supply chain attestation is the mechanism for verifying that an artifact was built by the expected pipeline from the expected source, and not tampered with between build and deployment. The minimum: artifact signing (Docker Content Trust, cosign for OCI artifacts) so that a deploy can verify the image is the one built by the CI pipeline and not a modified copy. Beyond signing: SLSA provenance attestation that records the builder identity, the build inputs (source SHA, parameters), and the build output (artifact digest) in a format that a verifying system can check before deploying. The pipeline ADR must specify the target SLSA level and the verification step in the deploy process — an attestation that is not verified before deployment provides provenance information for post-incident forensics, but does not prevent a compromised artifact from being deployed.
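The point that an attestation only protects the deploy if it is checked can be illustrated with a simplified verification step. This sketch is not the SLSA or in-toto schema: the `provenance` dictionary and its field names are hypothetical, and it assumes the attestation's own signature has already been verified by a tool such as cosign.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Digest in the 'sha256:<hex>' form used for OCI artifacts."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def provenance_failures(artifact, provenance, expected_builder, expected_source_sha):
    """Return the reasons the artifact must not be deployed (empty if none)."""
    failures = []
    if provenance.get("artifact_digest") != sha256_digest(artifact):
        failures.append("artifact digest does not match provenance")
    if provenance.get("builder_id") != expected_builder:
        failures.append("unexpected builder identity")
    if provenance.get("source_sha") != expected_source_sha:
        failures.append("unexpected source revision")
    return failures


artifact = b"example image bytes"
provenance = {
    "artifact_digest": sha256_digest(artifact),
    "builder_id": "ci/release-pipeline",  # hypothetical builder identity
    "source_sha": "abc123",               # hypothetical commit SHA
}

print(provenance_failures(artifact, provenance, "ci/release-pipeline", "abc123"))
print(provenance_failures(b"tampered bytes", provenance, "ci/release-pipeline", "abc123"))
```

A deploy step would refuse to proceed whenever the returned list is non-empty, which is the difference between prevention and post-incident forensics.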

Pipeline credential scope and least-privilege IAM limit the blast radius of a compromised pipeline. The deploy credential — the IAM role or service account that the pipeline uses to deploy to production — must have exactly the permissions required to deploy the specific services managed by the pipeline, and no more. A deploy credential with AdministratorAccess is a production-wide write credential: a compromised pipeline job with that credential can modify any resource in the account. A deploy credential scoped to the specific ECS service, Lambda function, or Kubernetes namespace being deployed can only affect that scope. The pipeline ADR must specify the IAM role for each pipeline stage (the build stage credential needs registry write access; the staging deploy credential needs access to the staging cluster; the production deploy credential needs access to the production cluster and requires MFA or a separate approval gate for activation).
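A coarse lint over the deploy role's policy document can catch the AdministratorAccess-style case before it ships. The following sketch only flags statements for human review; it is not a substitute for a real policy analyzer, some legitimate actions do require a `"*"` resource, and the account ID and ARN in the example are placeholders.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def statements_needing_review(policy):
    """Return Allow statements with wildcard actions or a wildcard resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action or "*" in resources:
            flagged.append(stmt)
    return flagged


scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ecs:UpdateService", "ecs:DescribeServices"],
        "Resource": "arn:aws:ecs:us-east-1:123456789012:service/prod-cluster/checkout",
    }],
}
admin_policy = {
    "Version": "2012-10-17",
    "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"},
}

print(len(statements_needing_review(scoped_policy)))
print(len(statements_needing_review(admin_policy)))
```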

CI/CD decisions in AI chat history

CI/CD pipeline decisions produce a specific pattern in AI chat history: the initial setup is often a brief conversation that establishes the platform choice and the basic structure, followed by a long tail of incremental conversations that add pipeline stages, fix environment-specific failures, optimize build times, and debug mysterious pipeline failures. The cumulative decisions embedded in this history are rarely consolidated into documentation.

The initial pipeline setup sessions capture the platform choice and the basic structure, but rarely capture the decisions about artifact model, rollback strategy, or deployment authorization that will matter in year two. The conversation pattern: "how do I set up GitHub Actions to deploy to ECS on merge to main?" followed by a pipeline YAML that does exactly that — build, push to ECR, update the ECS service — without an explicit decision about whether the previous task definition version is retained for rollback, whether the deploy should be blue-green or rolling, or what the test gate structure should be. The rollback strategy is absent because rollback is not the use case the initial setup conversation was solving.

The pipeline performance optimization sessions contain decisions about caching strategy, parallelization, and test gate structure that are made under the pressure of "the pipeline is too slow" rather than "what is the right gate structure?" The conversation pattern: "our GitHub Actions pipeline takes 22 minutes, how do we speed it up?" followed by a series of changes — dependency caching, test parallelization, conditional stage execution, Docker layer caching — each of which is an implicit decision about which test categories can be deferred or skipped without reducing confidence in the deployed artifact. Three months of AI chat history on pipeline optimization may contain a dozen incremental decisions about test gate placement that collectively define the team's quality bar for production deploys, none of them documented as a deliberate policy.

The incident-driven pipeline sessions are the most operationally grounded decisions in the pipeline's history. The conversation pattern after an incident like the opening narrative: "our rolling update took 47 minutes to roll back — how do we implement blue-green deployment on ECS?" or "we need a way to deploy a specific SHA to production immediately, bypassing the normal pipeline — how do we set that up as an emergency rollback procedure?" or "we had an incident where a database migration made rollback impossible — how do we structure deployments to prevent that?" These sessions contain the decisions that convert a pipeline from a code-shipping mechanism into an incident-response tool. A postmortem that produces pipeline architecture changes is the most reliable source of rollback strategy decisions — the requirements come directly from a real failure rather than from a general best-practices survey.

The compliance and security pipeline sessions contain the artifact attestation and pipeline credential decisions that are made under the pressure of a security review or a compliance audit. The conversation pattern: "our security team is asking about SLSA compliance for our build pipeline — what does SLSA Level 2 require and how do we implement it?" or "the SOC 2 auditor is asking about our change management process for production deploys — how do we get an approval gate on our production pipeline?" or "we need to prove that the Docker image running in production was built from the source code in our repository and hasn't been modified — how do we implement artifact signing?" These sessions contain the explicit compliance rationale for pipeline security decisions — the rationale that disappears into closed chat sessions unless it is extracted and linked to the pipeline ADR. The open-source extractor surfaces these sessions from the AI chat history so the security and compliance reasoning is available to the next security review cycle, not just the engineer who was in the original conversation.

Writing the CI/CD pipeline ADR

The CI/CD pipeline ADR needs five sections, each addressing the questions that different stakeholders ask at different stages of the pipeline's operational life.

Section 1: CI platform and pipeline structure. The CI platform chosen — GitHub Actions, GitLab CI, Jenkins, Buildkite — with the rationale for the choice relative to the team's source control platform, runner requirements, and budget. The pipeline-as-code format and where pipeline definitions live relative to the application code. The shared pipeline template strategy: how common pipeline stages are shared across services, where the shared templates are maintained, and how updates propagate. The service type taxonomy: what categories of service (web API, background worker, static frontend, CLI tool, data pipeline) exist, and what the default pipeline structure is for each type. The constraints that would trigger a platform reconsideration: runner capacity requirements, pricing at scale, compliance requirements for build environment isolation, or integration gaps with the production environment tooling.

Section 2: Build and artifact model. Whether the pipeline builds immutable artifacts or builds at deploy time, with the rationale for the choice relative to the team's rollback latency requirement and reproducibility requirement. The artifact registry choice and location. The artifact tagging convention: which tags are immutable (commit SHA tags), which are mutable (environment tags), and how mutable tags are moved during promotion. The artifact retention policy: how long commit-SHA-tagged artifacts are retained before garbage collection, what the process is for retaining specific artifacts beyond the normal retention period (a version that is the current production version must not be garbage-collected). The SBOM generation and artifact attestation requirement: whether attestation is required, the target SLSA level, the attestation format, and where attestations are stored and verified.
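The retention rule, including the exception for artifacts that an environment currently points at, can be sketched as a filter. The 90-day period and the data shapes are assumptions for illustration; a real registry would supply push times and tag mappings through its own API.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # hypothetical retention period


def collectable_shas(pushed_at, environment_tags, now, retention=RETENTION):
    """Return commit-SHA tags that are safe to garbage-collect.

    pushed_at:        dict of commit SHA -> push time (immutable tags)
    environment_tags: dict of mutable tag (e.g. "production") -> commit SHA
    Anything a mutable environment tag points at is protected, however old.
    """
    protected = set(environment_tags.values())
    return sorted(
        sha for sha, pushed in pushed_at.items()
        if sha not in protected and now - pushed > retention
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
pushed_at = {
    "a1b2c3": now - timedelta(days=200),  # old, but currently in production
    "d4e5f6": now - timedelta(days=120),  # old and unreferenced
    "0a9b8c": now - timedelta(days=5),    # recent
}
environment_tags = {"production": "a1b2c3", "staging": "0a9b8c"}

print(collectable_shas(pushed_at, environment_tags, now))
```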

Section 3: Deployment strategy. The default deployment strategy and the criteria for using an alternative strategy. For each service type: which deployment strategy is used (rolling, blue-green, canary, feature-flag), the infrastructure required by that strategy, the deploy time budget, and the backward-compatibility requirements that the strategy imposes on application code and database migrations. The database migration compatibility requirement and its interaction with the deploy order. The feature flag service choice if feature-flag-driven deploys are used: which service, the flag cleanup procedure, and the fallback behavior for flag fetch failures. The environment promotion model: how an artifact moves from CI through staging to production, what validation is required at each stage, and who or what triggers each promotion.

Section 4: Rollback strategy. The rollback mechanism for each deployment strategy, with the specific commands and the expected rollback time measured in a non-incident context. The rollback authorization model: who can initiate a rollback, whether the rollback requires the same approval gates as a normal deploy, and the break-glass rollback procedure for bypassing approval gates under incident conditions. The rollback runbook location and the cadence for testing it. The database migration constraint on rollback: which migration types block rollback of the corresponding code change, how the deploy process enforces the backward-compatibility requirement, and what the procedure is when a migration makes rollback impossible (forward-fix rather than rollback as the primary incident response). The rollback audit trail: how a rollback is logged, attributed to the initiating engineer, and linked to the incident that triggered it.
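The audit trail requirement can be made concrete with a structured record per rollback. This is a sketch with hypothetical field names; requiring an incident ID before a record can be created is an assumption layered on the text's "linked to the incident" wording, and where the resulting line is shipped is left open.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RollbackEvent:
    service: str
    from_sha: str       # artifact that was running when rollback started
    to_sha: str         # artifact that was restored
    initiated_by: str   # engineer the rollback is attributed to
    incident_id: str    # incident that triggered the rollback
    break_glass: bool   # True if approval gates were bypassed
    occurred_at: str    # ISO 8601 timestamp in UTC


def record_rollback(service, from_sha, to_sha, initiated_by, incident_id,
                    break_glass=False, now=None):
    """Build an audit record, refusing rollbacks with no linked incident."""
    if not incident_id:
        raise ValueError("a rollback must be linked to an incident")
    when = now or datetime.now(timezone.utc)
    return RollbackEvent(service, from_sha, to_sha, initiated_by,
                         incident_id, break_glass, when.isoformat())


def audit_line(event):
    """Serialize the record as one JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


event = record_rollback("checkout", "d4e5f6", "a1b2c3", "alice", "INC-1042",
                        break_glass=True,
                        now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(audit_line(event))
```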

Section 5: Pipeline security posture. The secrets management model: which credentials use OIDC short-lived tokens, which require long-lived secrets, where long-lived secrets are stored, and the rotation cadence. The runner model: hosted ephemeral runners versus self-hosted runners, and the compensating controls if persistent runners are used. The pipeline credential scope: the IAM role or service account for each pipeline stage, the specific permissions granted, and the scope restrictions (repository, branch, environment) applied to each credential. The artifact signing and attestation requirement. The supply chain risk management approach: dependency pinning strategy (exact version locks, hash pinning, Dependabot auto-PRs for security updates), the process for responding to a CVE in a build dependency, and whether the SBOM is scanned against a vulnerability database as a pipeline gate. The security ADR must be consistent with the pipeline security posture: the threat model for supply chain attacks must include the CI/CD pipeline as an attack surface, and the controls specified in the pipeline ADR must be reflected in the threat model.

A CI/CD pipeline is the team's primary mechanism for translating code changes into running software and for recovering from failures in production software. The decisions embedded in it — which are made incrementally, under performance pressure, under incident pressure, and under compliance pressure over the course of years — determine whether deployment is a routine, low-anxiety operation or a high-stakes procedure that requires senior engineer involvement for every production change. The decisions about rollback mechanism, artifact model, deployment strategy, and pipeline security posture do not have to be made under pressure; they can be made deliberately, documented in an ADR, and revisited when the constraints change. The pipeline decisions that shape the team's incident response capability in year three are made in year one, when rollback latency is not yet a problem, blast radius is not yet a risk, and supply chain security is not yet a requirement — and without a record of those decisions, the team in year three will spend months reconstructing the rationale for choices that took minutes to make. The open-source extractor surfaces the AI chat sessions where those pipeline decisions were made, from the initial setup conversation to the incident-driven blue-green migration to the compliance-driven attestation implementation — connecting the rationale from the moment of decision to the documentation that will inform the next pipeline evolution.