The feature flag decision record: why the flag evaluation mechanism you chose in year one constrains how you do gradual rollouts and A/B testing in year three
Feature flag adoption is treated as a developer experience enhancement, not an architecture decision. The mechanism is chosen quickly during the first dark launch need and rarely documented. Two years later, the evaluation model determines whether gradual rollouts are safe under concurrent deployment versions, the flag store design determines whether A/B testing produces reliable impression data, and the absence of a lifecycle policy produces a codebase with 200 flags where 30 are actively used and none can be safely deleted. None of this was visible when the first flag was added. None of it is written down.
A team ships a 10% gradual rollout for a new checkout flow. The feature flag is configured as "10% of traffic." Within 45 minutes, three users report confusing behavior: the checkout page looks different on different visits within the same session, and one user's cart contents disappeared after the page reloaded. The on-call engineer checks the flag configuration and finds it is working correctly — 10% of evaluations return the new flow. The problem is not the flag percentage. The problem is the assignment model: the flag system is making an independent random draw on every page load, so a single user has a 10% chance of seeing the new checkout flow on each request, independently. A user who sees the new checkout on the cart page, then reloads after adding an item, falls into the 90% bucket and sees the old checkout — which cannot read the cart state that the new checkout flow wrote in a different format.
The fix is user-ID-consistent hashing: derive the bucket assignment from a hash of the user's identifier, so the same user always sees the same variant for the duration of the rollout. Most feature flag SDKs support this by default — if the user context is passed to the evaluation function. Whether the evaluation uses consistent hashing or per-request random selection depends on how the team initially wired the SDK into their request handler, which was decided during the first adoption session, which nobody documented.
Like most foundational infrastructure decisions, the feature flag mechanism is visible as a fact — the codebase uses LaunchDarkly, the flag evaluation is in a shared middleware, the flag store is the LaunchDarkly SaaS backend — but invisible as a decision. The fact answers "what is true now?" The decision record answers "what evaluation model is being used, what the assignment model means for gradual rollouts, what the flag store's availability behavior is under network partition, and what the lifecycle policy is for flags after the rollout completes." Without the record, the rollout incident is a surprise rather than a documented constraint.
What "we use feature flags" actually means across five patterns
The first decision inside "we have feature flags" is the evaluation mechanism — the architectural pattern by which flag configuration is stored, read by the application, and used to determine which code path executes. The mechanism choice is often made by whoever is blocked on a deployment, driven by whatever tool is first found in a search, and carries specific latency, availability, and targeting model commitments that determine the safety and capability of every flag-gated rollout that follows.
Boolean environment variables and config files are the zero-infrastructure entry point. The flag value is set as an environment variable or in a config file read at startup; the application reads it once and the decision is fixed for the lifetime of the process. Changing the flag requires redeploying the application or restarting the process — which means "toggling a flag" is operationally equivalent to a deployment. This mechanism is correct for teams that need to ship code paths that are not yet meant to run in production (the path exists but can never be reached until the env var is set), and it is the wrong mechanism for teams that need to change flag state without a deployment, respond to an incident by turning off a feature in seconds, or roll out to a percentage of users. Teams that start here often remain here longer than the use case warrants, because adding the first env-var flag creates a pattern that is easy to replicate before the limitations are discovered.
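A minimal sketch of the read-once pattern described above, assuming a hypothetical NEW_CHECKOUT_ENABLED variable and a hypothetical read_bool_flag helper (neither comes from a specific library). It illustrates why a toggle requires a restart: the value is captured when the module loads.

```python
import os
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def read_bool_flag(name: str, default: bool = False,
                   environ: Mapping[str, str] = os.environ) -> bool:
    """Parse a boolean flag; unset or unrecognized values fall back to the default."""
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default


# Evaluated once at import time: changing the environment variable later
# has no effect until the process restarts.
NEW_CHECKOUT_ENABLED = read_bool_flag("NEW_CHECKOUT_ENABLED")
```

Falling back to the default on an unrecognized value is a design choice made here for illustration; a stricter variant could raise at startup instead.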
Database-backed server-side evaluation stores flag configuration in the application's own database (a flags table with flag name, enabled state, targeting rules, and percentage) and evaluates flags synchronously on each request by reading from the database or from a short-lived in-process cache. This mechanism gives the application direct control over flag configuration without a third-party dependency, makes flag state readable with a SQL query, and eliminates the vendor availability concern. The latency commitment depends entirely on the caching design: if every flag evaluation reads from the database, the flag check adds a database query to every flagged code path; if flag configuration is cached in memory and refreshed on a polling interval, flag changes propagate with the cache TTL delay. The caching policy — TTL, cache invalidation trigger, behavior under cache miss — is the detail that most database-backed flag implementations leave undocumented. It determines how quickly a flag toggle takes effect in production and what happens to in-flight requests during the TTL window when the old configuration is still cached.
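The caching policy can be made concrete with a small sketch. The class below is hypothetical: load_flags stands in for whatever query reads the flags table, and the 5-second TTL is an assumed value. It makes the three undocumented details explicit: the TTL, the refresh trigger, and the behavior when a reload fails.

```python
import time
from typing import Callable, Dict, Optional


class CachedFlagStore:
    """In-process snapshot of flag state, refreshed lazily after a TTL."""

    def __init__(self, load_flags: Callable[[], Dict[str, bool]],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_flags = load_flags      # assumed to query the flags table
        self._ttl = ttl_seconds
        self._clock = clock
        self._flags: Dict[str, bool] = {}
        self._loaded_at: Optional[float] = None

    def is_enabled(self, name: str, default: bool = False) -> bool:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            try:
                self._flags = dict(self._load_flags())
                self._loaded_at = now
            except Exception:
                # Reload failed: keep serving the last snapshot (or defaults
                # if nothing was ever loaded) and retry on the next call.
                pass
        return self._flags.get(name, default)
```

With this design a toggle takes effect at most one TTL after it is written, and each process crosses that boundary at a different moment, so different instances can disagree about flag state for up to one TTL.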
SDK-based local evaluation (LaunchDarkly, Unleash, Flipt, GrowthBook) decouples flag configuration storage from flag evaluation latency. The SDK runs a background process that connects to the flag configuration backend via a streaming connection or long-polling and maintains a local, in-process copy of all flag configurations. When the application calls the evaluation function, the SDK evaluates the targeting rules against the provided user context entirely in memory — there is no network call on the evaluation path. Propagation latency is bounded by the streaming connection's delivery latency (typically milliseconds for streaming SDKs, seconds for polling intervals). This mechanism gives sub-millisecond evaluation latency, tolerates network partitions (the local copy remains available as long as the process is running), and supports complex targeting rules (percentage rollouts, user attribute matching, custom targeting functions) without a database query per evaluation.
The bootstrap problem is the key undocumented constraint of this mechanism: when a new application instance starts (a Lambda cold start, a new Kubernetes pod, a container restart), the SDK needs time to connect to the flag backend and receive the initial flag configuration. Until that initial sync completes, what does the SDK return for flag evaluations? Default values defined in the application code? A cached copy from a previous instance's last-known state? A blocking wait until the sync completes? Each SDK has a different default behavior, and teams that use SDK-based evaluation in serverless or frequently-restarting environments tend to discover the bootstrap behavior under production load rather than during tool selection. A Lambda function that starts, evaluates a flag, and gets the default value (because the SDK has not yet synced the real configuration) produces incorrect flag behavior that is invisible in unit tests and appears only in production traffic patterns.
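The bootstrap choices can be sketched with a toy client. LocalFlagClient and its method names are hypothetical and do not mirror any vendor SDK; the point is that "return the default" and "block until synced" are distinct behaviors, and that reporting which one happened makes the bootstrap window observable.

```python
import threading
from typing import Dict, Tuple


class LocalFlagClient:
    """Toy local-evaluation client that makes bootstrap behavior explicit."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._flags: Dict[str, bool] = {}

    def apply_sync(self, flags: Dict[str, bool]) -> None:
        """Called by the (assumed) background sync once configuration arrives."""
        with self._lock:
            self._flags = dict(flags)
        self._ready.set()

    def is_enabled(self, name: str, default: bool = False,
                   wait_seconds: float = 0.0) -> Tuple[bool, str]:
        """Return (value, reason); optionally block up to wait_seconds for the first sync."""
        if not self._ready.is_set() and wait_seconds > 0:
            self._ready.wait(timeout=wait_seconds)
        if not self._ready.is_set():
            return default, "default_before_sync"
        with self._lock:
            return self._flags.get(name, default), "synced"
```

Logging the reason alongside the value is one way to measure how many production evaluations happen before the first sync, which unit tests will not reveal.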
Remote evaluation via vendor API (some self-hosted Unleash configurations, custom flag services where evaluation happens server-side at the flag vendor) evaluates each flag by making an outbound HTTP request to the flag service with the user context and receiving the evaluation result in the response. The flag service applies the targeting rules and returns the boolean or variant. This model places the targeting logic outside the application codebase: the application does not need the SDK's local evaluation engine, only a simple HTTP client. The consequence is that every flag evaluation on the critical request path requires a network round-trip to the flag service. Vendor API latency (5–50ms for SaaS vendors, 1–5ms for a co-located self-hosted instance) is added to every flagged code path. Vendor availability determines flag availability: if the vendor API is unreachable, the flag evaluation is unavailable and the application must decide whether to fail open (use default values and proceed), fail closed (block the request), or use a stale cache. Teams that adopt remote API evaluation for simplicity and later instrument their application's critical path with multiple flags discover the compounding latency when four flagged code paths in a single request each add 20ms, for 80ms of added latency if the evaluations run sequentially.
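The three failure policies can be written down as an explicit choice rather than left to whatever the HTTP client does on error. In this sketch, fetch is a hypothetical callable standing in for the outbound request to the flag service; the class and enum names are illustrative only.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class FailurePolicy(Enum):
    FAIL_OPEN = "fail_open"      # use the default value and proceed
    FAIL_CLOSED = "fail_closed"  # refuse to proceed without a real answer
    STALE_CACHE = "stale_cache"  # reuse the last successful result if any


class FlagUnavailable(Exception):
    """Raised under FAIL_CLOSED when the flag service cannot be reached."""


class RemoteFlagEvaluator:
    def __init__(self, fetch: Callable[[str, str], bool],
                 policy: FailurePolicy) -> None:
        self._fetch = fetch  # assumed to raise on timeout or network error
        self._policy = policy
        self._last_known: Dict[Tuple[str, str], bool] = {}

    def evaluate(self, flag: str, user_id: str, default: bool = False) -> bool:
        key = (flag, user_id)
        try:
            result = self._fetch(flag, user_id)
        except Exception as exc:
            if self._policy is FailurePolicy.FAIL_CLOSED:
                raise FlagUnavailable(flag) from exc
            if self._policy is FailurePolicy.STALE_CACHE and key in self._last_known:
                return self._last_known[key]
            return default  # FAIL_OPEN, or STALE_CACHE with nothing cached
        self._last_known[key] = result
        return result
```

A real integration would also set a request timeout inside fetch; without one, an unreachable service stalls the request instead of triggering the policy.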
Client-side evaluation delivers flag configuration to the browser or CDN edge node, where evaluation occurs without a server round-trip. The flag vendor's SDK bootstraps in the browser with the current flag configuration for the authenticated user, evaluates locally in JavaScript, and the application code reads flag values from the browser-side SDK. This mechanism enables personalization — showing different UI variants to different users — without a server round-trip per page interaction, which is a significant rendering performance benefit for highly dynamic interfaces. The constraint is that the flag configuration object delivered to the browser is visible to the client: targeting rules, user segment definitions, and any user attributes included in the evaluation context are transmitted to and stored in the browser environment. Teams that include PII in the user context object (email, company name, plan tier, role) for server-side targeting purposes sometimes transmit that same context to the client-side SDK without realizing that client-side delivery makes the context object inspectable in browser developer tools. The privacy review of which user attributes are in the browser-side evaluation context is rarely conducted at SDK adoption time; it typically surfaces during the first GDPR or SOC 2 audit.
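One way to keep the server-side context from leaking into the browser is an explicit allowlist applied before anything is handed to a client-side SDK. The attribute names below are hypothetical; the actual allowlist is a decision for the privacy review, not something this sketch can settle.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical allowlist: only attributes approved for browser delivery.
CLIENT_SAFE_ATTRIBUTES: FrozenSet[str] = frozenset({"user_id", "locale"})


def client_safe_context(context: Dict[str, Any],
                        allowed: FrozenSet[str] = CLIENT_SAFE_ATTRIBUTES) -> Dict[str, Any]:
    """Return a copy of the evaluation context containing only allowlisted keys."""
    return {key: value for key, value in context.items() if key in allowed}


server_context = {
    "user_id": "3f2c9a",
    "email": "person@example.com",
    "company": "Example Corp",
    "locale": "en-GB",
}
browser_context = client_safe_context(server_context)
```

An allowlist fails safe: an attribute added to the server-side context later stays out of the browser until someone approves it, whereas a denylist would pass it through by default.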
The gradual rollout safety constraint
Gradual rollouts — releasing a feature to a percentage of users and expanding the percentage as confidence grows — are the primary reason teams adopt a feature flag system beyond simple boolean toggles. The assignment model used to implement the percentage determines whether gradual rollouts are safe for stateful flows, and the interaction between the assignment model and concurrent deployment versions determines whether the rollout is safe during the deployment window itself.
The assignment model distinction is between per-request random selection and user-ID-consistent hashing. Per-request random selection draws a random number on every flag evaluation and compares it to the percentage threshold — 10% means a 10% probability of assignment to the treatment bucket on each evaluation, independently. User-ID-consistent hashing computes a hash of the user's identifier, takes the result modulo 100, and compares to the percentage threshold — a user whose hash modulo 100 is 7 is always in the treatment bucket for any flag with a threshold above 7%, regardless of how many times the flag is evaluated. The first model distributes traffic probabilistically over time; the second distributes users deterministically by identifier.
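The two assignment models can be compared side by side in a short sketch. The function names are illustrative, and the hashing details are assumptions: SHA-256 is used for convenience, and the flag key is mixed into the hash so that the same users are not first in line for every rollout. Real SDKs use their own hash functions and salts, so the buckets computed here will not match any vendor's.

```python
import hashlib
import random


def bucket_for(user_id: str, flag_key: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_treatment_consistent(user_id: str, flag_key: str, percentage: int) -> bool:
    """Deterministic: the same user gets the same answer on every evaluation."""
    return bucket_for(user_id, flag_key) < percentage


def in_treatment_per_request(percentage: int, rng=random) -> bool:
    """Independent draw per evaluation: the answer can change between requests."""
    return rng.random() * 100 < percentage
```

Because the comparison is bucket < percentage, raising the threshold from 10 to 50 only adds users to treatment; nobody who already had the feature loses it as the rollout expands.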
The practical difference appears in stateful operations. A checkout flow that writes cart state in one format under treatment and reads it in the original format under control requires that a user who begins the flow in treatment completes the flow in treatment. With per-request random selection, the user has a 10% chance of being in treatment on every page load — which means a 90% chance of encountering the format mismatch on the next request after an initial treatment assignment. With user-ID-consistent hashing, the user assigned to treatment on page one is always assigned to treatment for every subsequent evaluation during that flag's lifetime. The assignment model is the difference between a safe gradual rollout and a rollout that produces data corruption or session inconsistency for 10% × 90% = 9% of users on their second request.
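The arithmetic above extends to sessions longer than two requests. The sketch below assumes each request is an independent draw with treatment probability p, which is the per-request random model; the function names are illustrative.

```python
def p_treatment_then_control(p: float) -> float:
    """Probability that request one lands in treatment and request two in control."""
    return p * (1 - p)


def p_mixed_session(p: float, n_requests: int) -> float:
    """Probability that a session of n independent draws sees both variants."""
    all_treatment = p ** n_requests
    all_control = (1 - p) ** n_requests
    return 1 - all_treatment - all_control


two_step = p_treatment_then_control(0.10)      # 0.10 * 0.90 = 0.09
five_request_session = p_mixed_session(0.10, 5)  # 1 - 0.00001 - 0.59049 = 0.4095
```

Under these assumptions roughly 41% of five-request sessions see both variants at a 10% rollout, so the exposure to mixed state grows quickly with session length.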
The rolling deployment window adds a second dimension. During a Kubernetes rolling deployment, old and new versions of the application run simultaneously: new pods handle some traffic while old pods continue handling other traffic until they are replaced. Two scenarios produce problems. First, a flag whose call site exists only in the new version's code: old pods never evaluate the flag and always run the existing behavior, while new pods evaluate it and may return treatment. Users whose requests are load-balanced to old pods see control and users load-balanced to new pods may see treatment, not because of a deliberate percentage rollout but because of the deployment window. A user whose consecutive requests land on different pod versions can switch variants even when the assignment model itself is consistent. Second, a flag that is being toggled off (rolling back a feature) during a deployment: if flag configuration is cached in each pod and the cache TTL is longer than the deployment window, some pods keep serving the stale flag state after the toggle until their cache expires. As with service mesh configurations that interact with Kubernetes upgrade timing, the interaction between flag evaluation caching and rolling deployment windows is an architectural constraint that is not visible during flag system selection but determines rollback safety in production.
The rollout safety constraint belongs in the flag decision record because it is the consequence of an evaluation model decision, not an operational detail. A team that adopted an SDK-based local evaluation tool and configured it with user-ID-consistent hashing for percentage targeting made a rollout safety decision implicitly. A team that built a database-backed flag system with per-request random selection made the opposite decision, also implicitly. The decision record converts both from implicit assumptions into explicit policies that new engineers can rely on when designing flagged rollouts.
The A/B testing and impression recording gap
Feature flag systems and A/B testing systems solve adjacent problems, and teams frequently repurpose their flag infrastructure for experiments — adding a multivariate flag for a UI variant test after the flag system was originally adopted for dark launches and gradual rollouts. The gap appears when the flag system was chosen for its rollout characteristics and not evaluated for its experiment characteristics: specifically, whether it produces the impression data that connects variant assignment to business metrics.
An A/B test requires three data points to produce a valid result: which variant each user saw, when they saw it, and what they subsequently did (the conversion event). The first data point is the impression: the record that user X was shown variant A at time T. Without impressions, the experiment analysis must infer variant assignment from the flag evaluation configuration (all users with hash-bucket ≤ 50 saw variant A), which is correct in theory but unreliable if the flag was not exposed to every user in every session — if some users visited before the flag was enabled, or if some users' requests were handled by application instances that had not yet synced the flag configuration, their assignment is unknown. The impression record is the authoritative source of truth for variant exposure, and it must be generated at the moment of flag evaluation.
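A minimal sketch of recording the impression at the moment of evaluation. The assign and sink parameters are hypothetical stand-ins for the flag SDK's variant lookup and the analytics pipeline; the Impression fields are an assumed schema, not one defined by any vendor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class Impression:
    flag_key: str
    variant: str
    user_id: str
    timestamp: str  # ISO 8601, UTC


def evaluate_with_impression(flag_key: str, user_id: str,
                             assign: Callable[[str, str], str],
                             sink: Callable[[Impression], None],
                             now: Optional[datetime] = None) -> str:
    """Evaluate a flag and emit the exposure record in the same call."""
    variant = assign(flag_key, user_id)
    moment = now if now is not None else datetime.now(timezone.utc)
    sink(Impression(flag_key, variant, user_id, moment.isoformat()))
    return variant
```

Putting the emit inside the evaluation wrapper means a call site cannot read a variant without producing an exposure record, which is the property the experiment analysis depends on.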
SDK-based local evaluation tools handle impressions differently, and the default behavior is the source of the gap. In LaunchDarkly, the server-side SDK evaluates flags locally and does not automatically deliver per-user impression events to the team's analytics backend: by default it reports summarized evaluation data to LaunchDarkly, and per-evaluation detail requires a flag-specific setting or explicit application code that forwards each exposure. In GrowthBook, experiment impressions are tracked via an explicit trackingCallback that the application is responsible for wiring to its analytics destination. In Unleash, the server-side SDK emits no impression events unless impression data is enabled on the flag (in SDKs that support it), and routing those events and analyzing the experiment are the application's responsibility. Teams that adopt these tools for dark launches and gradual rollouts, where impression data is irrelevant, and later use them for A/B tests discover the impression gap when they attempt their first experiment analysis and find that the variant assignment data in the analytics tool is incomplete or missing.
Assignment persistence is the second component of the A/B testing gap. A valid A/B test requires that a user assigned to variant A in session one is still assigned to variant A in session two, three, and four — across browser sessions, device switches, and account logins on different devices. Consistent hashing on a stable identifier (user ID for authenticated users) provides this across sessions on the same account. Browser-side flags using a random UUID cookie provide consistency within a device but break when the user clears cookies, switches browsers, or logs in on a different device. The assignment persistence model determines whether the A/B test is measuring the effect on a consistent cohort or a constantly-shifting population, which affects the statistical validity of the result. Most teams discover this constraint when their first long-running experiment shows unexpected variation in the control group size rather than the expected stable 50%.
The connection from variant assignment to business metrics requires that the analytics events (purchase completed, trial converted, churned) carry the variant assignment as a property, or that the analytics system can join on a shared user identifier. If the feature flag system writes impression events to one data store (the LaunchDarkly dashboard, a Segment track call) and the business metrics land in another (Mixpanel, Amplitude, a warehouse), the joining logic that connects impression to conversion must be designed and implemented by the application team. The retention policy for impression data determines how long historical experiment results are queryable — a flag system that deletes impression data after 30 days cannot support analysis of experiments where the conversion event (annual subscription renewal, churn) occurs more than 30 days after the impression.
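The joining logic can be sketched on in-memory rows. The tuple layouts are assumptions made for illustration; in practice this would be a warehouse query over the impression and conversion tables. The sketch attributes each user to the variant of their first impression and counts a conversion only if it happened at or after that exposure.

```python
from typing import Dict, Iterable, Tuple

ImpressionRow = Tuple[str, str, float]   # (user_id, variant, timestamp)
ConversionRow = Tuple[str, float]        # (user_id, timestamp)


def conversion_rate_by_variant(impressions: Iterable[ImpressionRow],
                               conversions: Iterable[ConversionRow]) -> Dict[str, float]:
    # First exposure per user decides the variant the user is attributed to.
    first_seen: Dict[str, Tuple[str, float]] = {}
    for user_id, variant, ts in sorted(impressions, key=lambda row: row[2]):
        first_seen.setdefault(user_id, (variant, ts))

    # A conversion counts only for exposed users, and only after exposure.
    converted = set()
    for user_id, ts in conversions:
        exposure = first_seen.get(user_id)
        if exposure is not None and ts >= exposure[1]:
            converted.add(user_id)

    exposed: Dict[str, int] = {}
    wins: Dict[str, int] = {}
    for user_id, (variant, _) in first_seen.items():
        exposed[variant] = exposed.get(variant, 0) + 1
        if user_id in converted:
            wins[variant] = wins.get(variant, 0) + 1
    return {variant: wins.get(variant, 0) / exposed[variant] for variant in exposed}
```

Users with a conversion but no impression are dropped rather than guessed at, which is why missing impressions shrink the analyzable population instead of silently biasing it.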
None of this is visible when the first flag is added for a dark launch. The A/B testing constraint belongs in the decision record as a policy section, not as a discovered gap when the first experiment fails to produce analyzable data.
The flag lifecycle and cleanup debt
Feature flags accumulate. The gradual rollout that shipped in Q1 is complete — the feature is at 100% and has been for eight months — but nobody removed the flag from the codebase because doing so requires coordinating changes across the call site, the flag management UI, and any monitoring or alerting that watches flag state. The A/B test flag from Q2 has been serving 100% variant A since the experiment concluded, but the experiment concluded informally (the engineer looked at the dashboard and said "variant A wins") without a ticket to clean up the flag. The operational toggle added for the on-call team to disable a rate limiter under load is permanent infrastructure — it will never be removed — but it lives alongside the temporary release toggles with no distinction. Two years later, the flag management dashboard has 200 flags. The on-call engineer cannot tell which ones are load-bearing infrastructure toggles and which ones are completed experiments from two engineering tenures ago.
The flag lifecycle problem is a policy problem, not a technical one. The mechanism provides the capability; the lifecycle policy provides the rules for when flags are created, who owns them, how long they are expected to exist, and what the removal process is. Without a lifecycle policy, every flag becomes permanent because there is no trigger for removal. The flag-as-permanent-feature problem is predictable from the nature of the mechanism — removing a flag requires changing application code, not just clicking a button — but most teams experience it as a surprise two years after adoption.
Flag types with distinct expected lifetimes are the foundation of a lifecycle policy. Release toggles — flags that gate a feature during its rollout, intended to reach 100% and then be removed — should have a defined maximum lifetime: typically two to four weeks from the start of the rollout to code removal. Operational toggles — flags that allow the on-call team to disable behavior under load, kill a third-party integration during an outage, or reduce system load by dropping non-critical background work — are permanent infrastructure and should be documented as such, with a distinct naming convention that distinguishes them from temporary release toggles. Experiment flags — flags used for A/B testing — should be tied to the experiment's planned end date, with a removal trigger when the experiment reaches statistical significance or its maximum runtime. Permission toggles — flags that gate features by account tier or user role, intended to persist for the lifetime of the feature — are also permanent infrastructure, but their permanence is different from operational toggles: they are removed when the feature is sunset, not when an incident occurs.
The flag removal process has a specific execution order that determines whether the removal is safe. The correct sequence is: (1) confirm the flag is at 100% of the intended population (or 0% for a feature being rolled back) and has been stable for at least one release cycle; (2) remove all flag evaluation call sites from the application code, leaving only the code path that was in the 100% bucket; (3) deploy the code removal; (4) confirm the deployment is healthy; (5) archive or delete the flag from the flag management UI. The risky inversion — deleting from the flag management UI before removing the call site — causes flag evaluation to return the default value defined in the application code, which may not match the final deployed state. Teams that invert the order and delete from the UI first create a brief window where the application reverts to its pre-flag behavior, which is an unintended rollback if the feature was fully shipped. The flag removal process is itself a decision that should be documented so every engineer follows the same sequence.
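The removal sequence can be encoded as a check that reports the first unmet step, so that archiving is only offered once the earlier steps are done. RemovalState and its fields are hypothetical; how each field is populated (code search, deploy status, flag dashboard) is left open.

```python
from dataclasses import dataclass


@dataclass
class RemovalState:
    at_final_percentage: bool        # 100%, or 0% for a rolled-back feature
    stable_for_release_cycle: bool
    call_sites_remaining: int
    removal_deployed: bool
    deployment_healthy: bool


def next_removal_step(state: RemovalState) -> str:
    """Return the first step of the removal sequence that is not yet complete."""
    if not (state.at_final_percentage and state.stable_for_release_cycle):
        return "1: wait until the flag is at its final percentage and stable"
    if state.call_sites_remaining > 0:
        return "2: remove the remaining call sites"
    if not state.removal_deployed:
        return "3: deploy the code removal"
    if not state.deployment_healthy:
        return "4: confirm the deployment is healthy"
    return "5: archive or delete the flag in the management UI"


def safe_to_archive(state: RemovalState) -> bool:
    return next_removal_step(state).startswith("5")
```

The risky inversion corresponds to archiving while call_sites_remaining is still above zero, which this check refuses.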
Nested flag evaluation — flags that condition on other flags, or application logic that combines multiple flag values to determine behavior — produces combinatorial conditions that are difficult to test and impossible to remove cleanly. A checkout flow that checks newCheckoutEnabled AND newPaymentProcessor AND experimentCheckoutVariant === 'B' to determine which code path to execute creates a condition that cannot be tested exhaustively at all flag combinations, produces behavior that is unpredictable when one of the three flags changes state independently, and requires removing all three flags in a coordinated way. Like test strategy decisions that determine which failure modes the test suite can detect, the flag composition policy — whether flags are allowed to condition on other flags — determines the testability and removability of every complex flagged flow.
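The combinatorial growth is easy to enumerate. This sketch assumes the experiment flag has exactly two variants, 'A' and 'B'; the helper name is illustrative.

```python
from itertools import product
from typing import Any, Dict, List


def flag_combinations(flag_values: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Enumerate every combination of the given flags' possible values."""
    names = list(flag_values)
    return [dict(zip(names, combo))
            for combo in product(*(flag_values[name] for name in names))]


def uses_new_flow(flags: Dict[str, Any]) -> bool:
    """The nested condition from the checkout example."""
    return (flags["newCheckoutEnabled"]
            and flags["newPaymentProcessor"]
            and flags["experimentCheckoutVariant"] == "B")


combos = flag_combinations({
    "newCheckoutEnabled": [False, True],
    "newPaymentProcessor": [False, True],
    "experimentCheckoutVariant": ["A", "B"],  # assumed: two variants only
})
new_flow_states = [c for c in combos if uses_new_flow(c)]
```

Three flags already give eight states, only one of which reaches the new path; each additional boolean flag in the condition doubles the count.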
Writing the feature flag decision record
The Nygard ADR format adapts for feature flag decisions with five sections that most adoptions skip entirely.
The evaluation mechanism and flag store decision. Name the evaluation pattern, the flag store backend, and the alternatives evaluated with rejection reasons. "We evaluated three mechanisms in March 2024: LaunchDarkly SDK with local evaluation (SaaS backend, streaming sync, sub-millisecond evaluation latency, $400/month at our seat count), Unleash self-hosted (open source, requires hosted Postgres instance and an Unleash server deployment we maintain, free), and a database-backed custom implementation using our existing Postgres cluster (zero new infrastructure, direct SQL visibility into flag state, evaluation via a 5-second TTL in-process cache). LaunchDarkly was evaluated and rejected on cost grounds — the per-seat pricing model means cost grows with engineering team size, not flag usage, and at our projected team size of 30 engineers in 18 months the cost reaches $600/month. The custom database-backed implementation was evaluated and rejected on feature grounds — implementing a targeting rule engine that supports user attribute matching, percentage rollouts with consistent hashing, and A/B testing impression tracking from scratch is a material engineering investment that the team does not have capacity for. Unleash self-hosted was selected: it provides the local evaluation SDK with consistent hashing support, a hosted UI for flag management, and an open-source targeting rule engine. Operational cost is a Postgres instance ($15/month) and the Unleash server (one container on the existing Kubernetes cluster). Bootstrap behavior under the Unleash SDK: if the SDK has not yet received its first flag sync, it evaluates all flags against an empty configuration and returns the default value specified in the application code — this is the 'fail open' behavior. Lambda and ephemeral compute workloads must either pre-warm the SDK with a local cache file or treat flag evaluations in the bootstrap window as defaulting to control."
The user targeting and assignment model. Name the consistent hashing policy, the user context object, and the PII policy. "All percentage rollouts use user-ID-consistent hashing via the Unleash SDK's built-in gradualRolloutUserId strategy. The user ID passed to the evaluation context is the authenticated user's UUID from our users table — not email, username, or any rotating identifier. For unauthenticated users (the pre-login flow), the evaluation context uses a session UUID generated at session creation and stored in the session cookie; unauthenticated evaluations default to control for any flag that uses gradualRolloutUserId, because the session UUID is not stable across sessions. The evaluation context object passed to the SDK must not include email, full name, or any other PII beyond user ID and account tier. User ID and account tier are acceptable because they are not directly identifying in the context of flag targeting. Email is not acceptable because Unleash transmits the evaluation context to its server for remote evaluation features and metric collection — if PII is in the context, it is transmitted to the Unleash server and may appear in Unleash's logging. This restriction applies to the server-side SDK only; the client-side Unleash SDK is not used in this application."
The A/B testing and impression recording policy. Name whether the flag system is the experiment system and how impressions are tracked. "The Unleash SDK does not generate impression events automatically. For flags used as A/B test assignments, impression data must be enabled on the flag in Unleash, and the application must subscribe to the SDK's impression event (unleashClient.on('impression', impressionHandler)) and route the impression event to our analytics pipeline (Segment). The impression event must include: flagKey (the Unleash flag name), variant (the variant name or 'disabled' for control), userId (the evaluation user ID), timestamp (ISO 8601 UTC). Impressions land in the feature_flag_impressions table in the analytics warehouse via the Segment integration, retained for 365 days to support long-horizon experiment analysis. The Unleash flag system is used only for variant assignment in experiments; it does not analyze experiment results. Statistical analysis uses the analytics warehouse directly. For a flag to be used as an A/B experiment, it must be registered in the experiment tracking spreadsheet with: the hypothesis, the primary metric, the planned sample size, and the planned end date. Flags not registered as experiments are not eligible for A/B analysis regardless of their multivariate configuration."
The flag lifecycle policy. Name flag types with expected lifetimes, ownership, and removal process. "Four flag types with distinct lifecycle rules: (1) Release toggle — gates a feature during rollout. Expected lifetime: not more than 4 weeks from 0% to 100%. Owner: the engineering manager for the feature's team. Removal trigger: 100% for at least one sprint, with a removal ticket created by the owner before the rollout reaches 100%. (2) Operational toggle — allows on-call to disable behavior under load or during incidents. Expected lifetime: indefinite. Owner: the platform team. Naming convention: prefix ops_ (e.g. ops_disable_rate_limiter) to distinguish from temporary release toggles. Removal trigger: only when the feature the toggle guards is permanently removed from the codebase. (3) Experiment flag — used for A/B testing. Expected lifetime: the experiment's planned runtime, not more than 8 weeks. Owner: the engineer who registered the experiment. Removal trigger: when the experiment reaches statistical significance or the planned end date, whichever comes first. The flag is archived in Unleash (not deleted, for audit trail) and the winning variant's code path is cleaned up within 2 sprints. (4) Permission toggle — gates features by account tier or user role. Expected lifetime: the lifetime of the feature. Owner: the product team. Naming convention: prefix perm_. Removal trigger: when the gated feature is sunset. Quarterly flag audit: on the first Monday of each quarter, the platform team reviews all non-ops, non-perm flags older than 6 weeks and contacts the owner to confirm removal status. Flags with no owner response within one week are evaluated against production traffic; if zero evaluations in the previous 30 days, the flag is archived."
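The quarterly audit in the example policy can be sketched as two filters. FlagRecord and its fields are hypothetical; the ops_ and perm_ prefixes, the six-week age threshold, and the zero-evaluations rule are taken from the example policy text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List

PERMANENT_PREFIXES = ("ops_", "perm_")


@dataclass
class FlagRecord:
    name: str
    created: date
    evaluations_last_30_days: int


def is_permanent(flag: FlagRecord) -> bool:
    """Operational and permission toggles are exempt from the audit."""
    return flag.name.startswith(PERMANENT_PREFIXES)


def flags_to_review(flags: Iterable[FlagRecord], today: date,
                    max_age: timedelta = timedelta(weeks=6)) -> List[FlagRecord]:
    """Temporary flags older than the threshold: contact the owner."""
    return [f for f in flags if not is_permanent(f) and today - f.created > max_age]


def flags_to_archive(unanswered: Iterable[FlagRecord]) -> List[FlagRecord]:
    """Among flags whose owner did not respond, archive those with no traffic."""
    return [f for f in unanswered if f.evaluations_last_30_days == 0]
```

The naming convention is what makes the audit mechanical: without the prefixes, separating permanent toggles from stale release flags requires asking a person about each one.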
The critical path policy. Name which flag types are permitted in the synchronous user-request path. "Operational toggles and permission toggles are permitted in the synchronous user-request path — they control whether features are available and must be evaluated at request time. Release toggles that control UI behavior or response format are permitted in the synchronous path, because the rendering and response must reflect the assigned variant. Release toggles that gate background work (batch processing, async enrichment, scheduled jobs) must be evaluated at job dispatch time or job execution time, not at request time — they do not belong on the user-request critical path. Experiment flags must not be evaluated on the critical path in a way that blocks the response while waiting for a network call — local SDK evaluation is required for any flag evaluated in the synchronous request path. Any flag evaluation on the critical path must be wrapped in a timeout that catches SDK bootstrap state and returns the default value within 1ms — the Unleash SDK's local evaluation already meets this requirement, but new flag integrations must be reviewed for critical-path safety before merging."
Finding feature flag decisions in AI chat
The WhyChose extractor surfaces feature flag decisions from four session types that contain the reasoning most teams cannot reconstruct when a new engineer asks why the flag system works the way it does, or when the third gradual rollout incident in a year prompts someone to ask whether the assignment model is correct.
The initial adoption session. "LaunchDarkly vs. Unleash vs. Flipt for a 20-person startup", "should we build our own feature flag system or use a third-party SaaS?", "how to implement feature flags in Node.js without a vendor dependency", "Unleash self-hosted vs. LaunchDarkly pricing comparison", "what is the difference between a feature toggle and a feature flag?" These sessions contain the mechanism selection and the alternatives rejected. The adoption session is the most important to recover because the mechanism chosen at adoption carries all the downstream constraints — and the rejection reasons are why the mechanism cannot simply be replaced with a different tool when a constraint appears two years later without incurring the migration cost. A team that rejected LaunchDarkly on cost grounds has a documented reason to re-evaluate LaunchDarkly if cost projections change; without the record, the re-evaluation starts from scratch rather than from a known prior position.
The gradual rollout session. "How to roll out a feature flag to 10% of users safely", "feature flag consistent hashing vs. random assignment for percentage rollouts", "how to prevent a user from seeing different UI variants on different page loads", "feature flags during a rolling Kubernetes deployment: what happens if two pod versions evaluate the same flag?", "how to check if my feature flag SDK uses sticky sessions for percentage rollouts." These sessions contain the rollout assignment model decision, or the incident that revealed the assignment model's consequences. Like performance debugging sessions that reveal the actual system constraints under production load, the gradual rollout incident session is the most valuable for documenting what the assignment model actually does, because it describes the behavior that appeared in production rather than the behavior that the documentation promises. The team that worked through the "user sees different variants on page reload" incident in a ChatGPT session has the complete diagnosis of why consistent hashing is required for stateful flows. A decision record written after that session captures the constraint and its resolution, so the next rollout does not repeat the incident.
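The difference between the two assignment models named in those sessions fits in a few lines. The sketch below is an illustration under stated assumptions, not the algorithm of any particular SDK: the function names, the SHA-256 hash, and the "flag:user" key format are all chosen for the example, and real SDKs use their own hash functions and key layouts.

```python
import hashlib
import random


def bucket_for(user_id: str, flag_name: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout_sticky(user_id: str, flag_name: str, percentage: int) -> bool:
    """Same user and flag always give the same answer at a given percentage."""
    return bucket_for(user_id, flag_name) < percentage


def in_rollout_per_request(percentage: int, rng: random.Random) -> bool:
    """Independent draw on every call: the model that breaks stateful flows."""
    return rng.random() * 100 < percentage


# Twenty page loads by one user under each model.
sticky_views = {in_rollout_sticky("user-42", "new-checkout", 10) for _ in range(20)}
rng = random.Random(0)
random_views = [in_rollout_per_request(10, rng) for _ in range(20)]
# sticky_views always holds exactly one value; random_views may mix True and False.
```

Including the flag name in the hash input means a user's bucket differs from flag to flag, so the same 10% of users are not the first cohort for every rollout. Because the comparison is bucket < percentage, raising the rollout from 10% to 25% keeps every already-enrolled user enrolled.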
The A/B testing session. "How to do A/B testing with feature flags in React", "LaunchDarkly experiment vs. feature flag: what is the difference?", "how to track which users saw a specific feature flag variant", "connecting feature flag variant assignment to conversion data in Mixpanel", "Unleash A/B testing: how do I know which variant a user was assigned to?", "why do my A/B test results not match between the flag dashboard and my analytics tool?" These sessions mark the point at which the team first encountered the gap between flag evaluation and experiment measurement. The session that asks "why do my A/B test results not match?" is the impression recording gap being discovered in real time: it contains the diagnosis (impressions not being sent, or sent inconsistently), the analysis of which users are missing from the experiment, and the fix applied. Recovering this session from AI chat history produces the A/B testing policy section of the decision record without requiring the team to reconstruct the problem from first principles. For platform teams defining the experiment infrastructure, recovering the A/B testing sessions from individual engineers identifies the gaps that each team worked around independently, revealing the common infrastructure need that a platform-level impression recording standard would address.
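The impression recording gap is easier to see in code: evaluation and measurement are two separate actions, and the second only happens if someone wires it. The following is a minimal sketch under assumptions. ExperimentEvaluator, Impression, and the sink callback are hypothetical names, the 50/50 split by hash is illustrative, and the in-process deduplication stands in for whatever deduplication the team's event pipeline or warehouse actually performs.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Impression:
    experiment: str
    user_id: str
    variant: str


class ExperimentEvaluator:
    """Assigns a variant and emits one impression per (experiment, user)."""

    def __init__(self, sink: Callable[[Impression], None]) -> None:
        self._sink = sink  # hypothetical: forwards events to the analytics pipeline
        self._seen: Set[Tuple[str, str]] = set()

    def variant_for(self, experiment: str, user_id: str) -> str:
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
        variant = "treatment" if digest[0] % 2 == 0 else "control"
        key = (experiment, user_id)
        if key not in self._seen:
            # Without this call the flag still works, but the experiment has no cohort.
            self._seen.add(key)
            self._sink(Impression(experiment, user_id, variant))
        return variant


events: List[Impression] = []
evaluator = ExperimentEvaluator(sink=events.append)
first = evaluator.variant_for("checkout-test", "user-1")
second = evaluator.variant_for("checkout-test", "user-1")
evaluator.variant_for("checkout-test", "user-2")
# events now holds two impressions: one per user, not one per evaluation.
```

The variant recorded in the impression is the value returned to the caller, which is the property the "results do not match" sessions usually find missing: the analytics tool was joining conversions against an assignment it inferred rather than one it was told.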
The flag removal session. "How to safely remove a feature flag from the codebase", "stale feature flag cleanup process", "feature flag technical debt — how do we decide which flags can be deleted?", "is there a way to tell if a feature flag is still being evaluated in production?", "how to remove a feature flag without accidentally reverting behavior", "feature flag cleanup — should I delete from the flag tool first or from the code first?" These sessions emerge after flag debt has accumulated — typically in the second or third year of flag system usage — and document the removal process the team constructed reactively. The session that asks "should I delete from the flag tool first or from the code first?" contains the correct removal sequence or the discovery of why the team got it wrong. A technical leader who inherits a codebase with 150+ flags and no flag audit policy cannot determine which flags are operational toggles (permanent, load-bearing) and which are completed release toggles from two engineering tenures ago without reading through each flag's call sites and git history. The removal sessions from the previous engineering team contain the flag classification reasoning that the new leader needs and cannot find anywhere else.
What the decision record prevents
A documented feature flag decision prevents three recurring problems that teams encounter as their flag usage scales and their engineering team turns over.
It prevents the rollout incident that looks like a flag bug. The gradual rollout that produces inconsistent behavior within a user session is not a flag bug — it is the expected behavior of per-request random selection applied to a stateful flow. The team that does not know their flag system uses per-request random selection diagnoses the incident as a flag malfunction and spends hours confirming that the flag is returning the correct percentage before discovering the assignment model is the root cause. The decision record that documents the assignment model converts "why is the flag behaving strangely?" into "the flag is using per-request random assignment — is this flow stateful?" — a question that can be answered in five minutes. Like error handling decisions that determine how errors propagate through the system, the flag assignment model determines how state propagates through the rollout — and both need to be documented before production behavior reveals the gap.
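A short calculation shows why the behavior is expected rather than anomalous. This sketch assumes each page load is an independent draw at the rollout fraction, and the five-page session length is a hypothetical figure chosen for the example.

```python
def chance_of_mixed_session(rollout_fraction: float, page_loads: int) -> float:
    """Probability that one user sees both variants across independent page loads."""
    all_new = rollout_fraction ** page_loads
    all_old = (1 - rollout_fraction) ** page_loads
    return 1 - (all_new + all_old)


# 10% rollout, five page loads in one session.
mixed = chance_of_mixed_session(0.10, 5)
```

Under these assumptions, roughly 41% of five-page sessions see both the old and the new flow at least once, even though the flag is returning exactly the configured percentage.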
It prevents the A/B testing surprise. A team that adopts a feature flag tool for dark launches and later uses it for experiments discovers the impression recording gap at the moment they attempt to analyze their first experiment. The analysis produces no data, or partial data, or data that does not match between the flag dashboard and the analytics tool. The engineer who discovers the gap must diagnose the impression flow from scratch — SDK documentation, event pipeline inspection, analytics warehouse query — and either fix the impression wiring or reconstruct the experiment cohort from flag evaluation logs. The decision record that documents the A/B testing policy (how impressions are generated, what data store they land in, retention period) converts the discovery into a verification: "the experiment was instrumented according to the policy; if no impression data appears in the warehouse, the impression handler wiring is the first place to check."
It prevents the unintended rollback on flag deletion. Deleting a flag from the flag management UI before removing the call site from the application code leaves the application evaluating a flag that no longer exists, so the evaluation returns the default value, which is typically the pre-flag behavior. If the feature was fully shipped at 100% and the flag was protecting the new behavior, the UI deletion silently reverts the feature for all users until the on-call engineer realizes what happened. The decision record that documents the removal sequence (code first, then UI archive) converts the dangerous inversion from a possible mistake into a clearly wrong procedure. Like the log level contract that converts logging behavior from per-engineer judgment into a shared standard, the flag removal process documented in the decision record converts a risky manual operation into a procedure that new engineers can follow correctly without knowing the history of why the sequence matters.
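The "code first, then UI archive" sequence can be enforced with a small pre-archive check. The sketch below is one possible shape, with assumptions: the function names are hypothetical, the file suffixes are examples, and the check is a plain string search, so it will miss flag names that are built dynamically or read from configuration.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def find_call_sites(flag_name: str, source_root: Path,
                    suffixes: Sequence[str] = (".py", ".ts", ".tsx", ".js")) -> List[Path]:
    """Return source files that still mention the flag name as a literal string."""
    hits: List[Path] = []
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            if flag_name in path.read_text(encoding="utf-8", errors="ignore"):
                hits.append(path)
    return hits


def safe_to_archive(flag_name: str, source_root: Path) -> bool:
    """Code first, then archive: allow the archive only once no call site remains."""
    return not find_call_sites(flag_name, source_root)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    handler = root / "checkout.py"
    handler.write_text('if flags.enabled("new-checkout"):\n    render_new()\n', encoding="utf-8")
    blocked = safe_to_archive("new-checkout", root)  # call site still present

    handler.write_text("render_new()\n", encoding="utf-8")
    allowed = safe_to_archive("new-checkout", root)  # call site removed
```

A check like this only covers the repository it scans. A flag evaluated by several services, or by a version still running during a rolling deployment, needs the same check against every deployed codebase before the archive step.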
Further reading
- Decisions that never get written down — the flag lifecycle policy is one of the decisions most likely to be undocumented: it feels like an operational convention rather than an architecture decision, it is applied per-flag by individual engineers, and the consequences (200 accumulated flags, none safely removable) appear slowly rather than as a single incident; the lifecycle policy is the decision that makes the flag system sustainable beyond the first two years of adoption
- The logging infrastructure decision record — feature flags and logging infrastructure share a structural property: both are adopted as engineering convenience tools (dark launches / centralized log viewing) and both carry query model or assignment model commitments that determine their behavior under production conditions; the log level contract is to monitoring signal what the flag lifecycle policy is to codebase cleanliness — both require a documented standard to prevent silent erosion
- The service mesh decision record — the rolling deployment window interaction is shared between service mesh decisions and feature flag decisions; a sidecar mesh that requires pod restarts for proxy upgrades has the same "concurrent version" problem as a flag evaluated by two simultaneously-running application versions, and both require documenting the interaction between the infrastructure mechanism and Kubernetes deployment behavior
- The performance optimization decision record — flag evaluation latency is a performance characteristic that becomes visible only when multiple flags are evaluated on the critical request path; the decision to use local SDK evaluation versus remote API evaluation is a performance decision as much as an availability decision — the latency budget for flag evaluation on the critical path needs a documented threshold, not just a mechanism description
- The test strategy decision record — the flag composition policy (whether flags can condition on other flags) determines the testability of flagged flows; a checkout flow that requires three concurrent flag states to produce a specific behavior cannot be exhaustively tested at all flag combinations, and the composition policy that produced the nested condition needs to be documented as a constraint that the test strategy must account for
- ADRs for platform teams — the flag lifecycle policy and the A/B testing impression recording standard are platform team decisions with direct consequences for every application team's experiment instrumentation and every on-call engineer's rollout safety; documenting them as platform decisions rather than per-team conventions makes them shared infrastructure rather than individually-reconstructed procedures
- ADR lifecycle: superseding and deprecating decisions — feature flag mechanism migrations (from database-backed to SDK-based evaluation, from one vendor to another) are expensive because the evaluation model is embedded in every flagged call site; documenting the revisitation conditions (cost threshold, evaluation latency threshold, team size threshold) is what makes a mechanism migration deliberate rather than reactive after a scaling event reveals the current mechanism's limits
- The data retention decision record — impression data retention is a sub-decision of the A/B testing policy that interacts with the retention decisions made elsewhere in the stack; a flag system that deletes impressions after 30 days cannot support analysis of long-horizon experiments where conversion events (annual subscription renewal, churn) occur outside the impression retention window, and the interaction between impression retention and experiment design needs to be documented in both records
- Three months of AI chat history, undocumented — feature flag decisions surface in four session types that contain the complete decision record most teams are missing: the adoption session (mechanism choice), the gradual rollout incident session (assignment model constraints), the A/B testing session (impression recording gap), and the flag removal session (lifecycle policy constructed reactively); the rollout incident session is the most valuable because it documents the assignment model's behavior as it appeared in a real production incident rather than as a theoretical property of the SDK
- The new-CTO onboarding problem — a technical leader who inherits a codebase with 150+ flags in the flag management dashboard cannot determine which flags are permanent operational toggles and which are completed release toggles from two engineering tenures ago without reading every flag's call sites and git history; the lifecycle policy and the flag type taxonomy in the decision record convert the flag audit from a weeks-long forensic exercise into a lookup against documented flag types and ownership records
- Nygard ADR template — the standard format adapts for feature flag decisions with the user targeting and assignment model and the flag lifecycle policy as the most critical additions to the standard Consequences section; both are team-level policies that determine the safety and sustainability of every flagged rollout and experiment the team runs for the lifetime of the flag system
- WhyChose extractor — feature flag decisions appear in AI chat in four session types: the initial adoption session (mechanism choice, alternatives rejected); the gradual rollout session (assignment model decisions and incidents); the A/B testing session (impression recording gap discovery); and the flag removal session (lifecycle policy constructed reactively after debt accumulates); the gradual rollout incident session is the most actionable for the decision record because it documents the specific production constraint in concrete terms rather than as an abstract property of the mechanism