What is the difference between server-side and client-side feature flag evaluation?

Server-side flag evaluation means the flag service receives an evaluation request containing the user or request context (user ID, plan tier, region, request attributes) and returns the evaluated flag value — true or false, or a variant string — for that specific context. The flag rules, targeting conditions, and rollout percentages live on the server; the calling code never sees them. Client-side flag evaluation means the flag service sends the full rule set to the client SDK — typically as a JSON payload on SDK initialization — and the SDK evaluates flag rules locally using the context it already has. The structural difference is what happens when a flag rule changes. With server-side evaluation, the change takes effect on the next evaluation request, which may be the next incoming HTTP request — propagation is essentially immediate relative to traffic. With client-side evaluation, the client SDK holds the rule set in memory and must either receive a push notification via Server-Sent Events or poll for an update before the change takes effect. A client that polled two seconds ago and holds a stale rule set will evaluate the old rules for up to the polling interval before it picks up the change. This difference is the primary reason rollback latency varies by orders of magnitude between flag service configurations: server-side evaluation with a streaming flag store can propagate a kill-switch change in under 100 milliseconds; client-side evaluation with a 30-second polling interval has a worst-case rollback latency of 30 seconds plus the time for the rule set to reach every connected SDK instance.

Why does consistent user bucketing matter for A/B testing with feature flags?

Consistent user bucketing means that the same user always receives the same flag variant across multiple evaluation calls, across sessions, and across different application instances. Inconsistent bucketing — where the variant assignment is random on each evaluation rather than deterministic based on the user identifier — produces a Heisenberg experiment: the act of measuring (navigating to a new page, refreshing, switching devices) changes which variant the user sees. The user experiences a flickering interface where the new checkout flow appears on one click and the old one on the next. More damaging for A/B testing validity: if the variant assignment is random on each evaluation, a user who makes ten conversion-relevant requests during a session may be assigned to the treatment group on six requests and the control group on four. Attribution of the conversion event becomes ambiguous. The statistical analysis depends on knowing, for each user who converted, which variant they experienced during the session — and per-request random bucketing makes that question unanswerable. The correct implementation of percentage rollouts uses deterministic bucketing: hash the user identifier (plus an optional experiment seed to prevent correlation between simultaneous experiments), divide the hash space proportionally, and assign the variant based on which bucket the hash falls into. This produces consistent assignment for any given user identifier regardless of how many times evaluation is called. LaunchDarkly, Unleash with sticky sessions enabled, and Flipt all implement deterministic bucketing by default. Homegrown implementations frequently start with random sampling and discover the consistency problem only after running their first real A/B test.

What is OpenFeature and does it solve feature flag vendor lock-in?

OpenFeature is a CNCF (Cloud Native Computing Foundation) incubating project that defines a vendor-neutral SDK interface for feature flag evaluation. An application that evaluates flags via the OpenFeature SDK calls a standardized API — client.getBooleanValue('flag-name', false, evaluationContext) — and the underlying flag evaluation is performed by a provider plugin that connects to the actual flag service: LaunchDarkly, Unleash, Flipt, FlagD, or any other OpenFeature-compatible backend. Switching flag services requires only swapping the provider plugin, not rewriting every flag evaluation call in the application. Without OpenFeature, a migration from LaunchDarkly to Unleash requires finding and updating every call to the LaunchDarkly SDK API (ldClient.variation(), ldClient.allFlagsState()) across the entire codebase — a refactor proportional to how many flag evaluations exist. With OpenFeature, the application code is identical regardless of which flag service backs it, so the migration is a configuration change. OpenFeature partially solves vendor lock-in at the evaluation call site, but does not solve lock-in at the flag management layer: flag rules, targeting configurations, experiment definitions, and SDK initializations are still stored in and managed by the specific flag service, and migrating them requires exporting, transforming, and re-importing the flag definitions into the new service. OpenFeature is most valuable for teams that are not yet locked in — teams building new services who want to avoid the evaluation-layer lock-in by starting with the standard interface — and for teams that operate multiple flag services (production LaunchDarkly, local FlagD for development, Unleash for a specific business unit) who want a unified SDK surface regardless of which backend each environment uses.

When should a team self-host a flag service like Unleash or Flipt instead of using LaunchDarkly?

Self-hosting a flag service is appropriate when two conditions hold simultaneously: the team has the operational capacity to run and maintain the flag service infrastructure (a database, a flag service instance with HA configuration, monitoring, and backup/restore procedures), and the cost difference between LaunchDarkly and a self-hosted alternative is large enough to justify that operational investment. LaunchDarkly's pricing is context-based: a context is any unique combination of user ID and other attributes evaluated against the flag service. At 10,000 monthly active users, LaunchDarkly's Starter tier is $150/month; at 100,000 MAUs, pricing moves to custom contracts in the range of $1,500–5,000/month depending on negotiation and seat count. Unleash self-hosted is free for the open-source version; Unleash Pro cloud starts at $80/month. For a team with 50,000 MAUs and the engineering capacity to run a Postgres-backed Unleash instance, the cost difference may be $1,000+/month — sufficient to justify self-hosting. For a team with 5,000 MAUs and limited DevOps capacity, LaunchDarkly's managed service eliminates operational overhead worth more than the licensing cost difference. The decision is also affected by data residency requirements (some compliance environments prohibit sending user context attributes to third-party services, making self-hosting mandatory regardless of cost) and by the specific capabilities required (LaunchDarkly's Experimentation add-on provides Bayesian experiment analysis and metric tracking that would require integrating a separate analytics platform with a self-hosted flag service).

2026-06-21 · ~22 min read

The flag service infrastructure decision record: why the evaluation model and SDK you chose determine your rollback latency and your A/B testing capability

Feature flag service selection looks like a tooling choice until a bad rollout is burning through your Black Friday traffic and you discover your homegrown flag system uses per-request random bucketing instead of consistent user bucketing — so 10% of users is actually 10% of requests, and the same user sees both versions on alternate page loads. The flag service you chose determines your worst-case rollback latency, your user-consistency guarantee, your A/B testing trustworthiness, and how expensive a future migration will be.

An eleven-person e-commerce startup had been running a homegrown feature flag system for fourteen months. The system was simple and they were proud of it: a Postgres table with flag names and rollout percentages, a middleware function that called Math.random() < rolloutPercentage on each incoming request, and a Redis cache that stored the evaluated result for each request for 60 seconds. It had worked fine for their usual pattern — dark-launching a feature to 5% of traffic, watching error rates, bumping to 100% three days later if nothing broke.

Black Friday. They dark-launched a new checkout flow at 10% traffic. The new flow had a subtle bug: the discount code field on the checkout page was not submitting when users clicked the Apply button on mobile Safari. The bug was invisible in staging because their test users all used Chrome. The first production signal arrived seventeen minutes after the rollout: a spike in checkout abandonment rate from mobile users.

The on-call engineer flipped the kill switch in the internal flag dashboard, setting the rollout percentage to 0. Traffic continued flowing to the buggy checkout for the next five minutes while the Redis cache expired. The engineer checked the code and discovered the cache TTL was 300 seconds, not 60 — someone had "optimized" it during a database performance investigation eight months earlier and never changed it back. The ChatGPT session where the optimization had been discussed was long gone.

They confirmed the percentage was at 0 and waited. More abandonment signals came in. Five minutes after the flip, traffic to the new checkout dropped to zero. Then they found the second problem.

Users who had seen the new checkout on one visit and the old checkout on the next were confused: per-request random bucketing had been flipping them between versions throughout the rollout, and the kill switch then moved everyone back to the old flow. Support tickets started arriving: "your checkout looked different and now it looks different again." Several users had abandoned mid-cart because they thought the site was broken. The bucketing inconsistency had created a worse user experience than the original bug would have caused if it had simply stayed visible and been fixed.

Post-incident: the team discovered that their "10% rollout" had never been consistently applied to the same 10% of users. It had been a 10% random sample of individual page requests. A mobile user who visited the checkout four times in a session could see the old UI on visits 1 and 3 and the new UI on visits 2 and 4. The A/B test they had been running for the previous two months, comparing a new product page layout against the old one, had produced statistically meaningless results for the same reason: users had been reassigned to treatment or control on every page view, which made the conversion attribution impossible to interpret.

The evaluation model, the caching strategy, and the bucketing algorithm had all been chosen in a series of incremental ChatGPT sessions: "how do I implement a simple feature flag system in Node?", "how do I cache database lookups with Redis?", "how do I avoid performance issues with per-request flag evaluation?" None of the sessions had discussed consistent user bucketing. None of them had discussed rollback propagation latency. None of them had discussed A/B testing statistical validity. Each session solved the immediate problem it was asked about and closed.

This is the failure mode that a flag service infrastructure ADR is built to prevent: not the individual incident, but the accumulation of undocumented decisions that, together, determine that your rollback kill switch takes five minutes to propagate and your A/B tests have been statistically invalid since you started running them.

The three structural properties that flag service selection determines

When teams evaluate flag service options, the discussion centers on dashboards, API pricing, and SDK language coverage. These are surface properties. The structural properties that determine your incident response capability and your experiment validity are more fundamental — and they are set at the time of vendor or architecture selection, not during feature development.

Evaluation locality and propagation latency

Evaluation locality is the question of where the flag rule is evaluated: on the client (the browser, the mobile app, the edge worker) or on the server (the application server that calls the flag service API). Propagation latency is the time between a flag change being committed in the flag service and that change taking effect in all active evaluation contexts.

Server-side evaluation with a streaming-connected SDK has propagation latency measured in milliseconds. The flag service maintains a persistent connection — typically Server-Sent Events (SSE) or a WebSocket — to each SDK instance. When a flag rule changes, the flag service pushes the update to all connected SDK instances immediately. The time between the flag change and the change reaching a running application server is typically 50–200 milliseconds in a well-implemented streaming flag service. The next request evaluated after the update arrives uses the new rule. Worst-case rollback latency for a server-side streaming setup: roughly one request cycle after the update arrives, which at typical server response times is under 500 milliseconds total.
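The mechanism is easier to see in code than in prose. What follows is a minimal sketch, not any vendor's SDK: FlagStore, applyUpdate, and isEnabled are hypothetical names, and the transport that would call applyUpdate (SSE or WebSocket) is left out. The point it illustrates is that evaluation reads from memory, so the first request after a pushed update already sees the new rule.

```javascript
// Sketch of a streaming-style SDK core: rules live in memory, pushed
// updates replace them, and evaluation never touches the network.
// All names here are hypothetical.
class FlagStore {
  constructor(initialRules) {
    this.rules = new Map(Object.entries(initialRules));
    this.version = 0;
  }

  // Called by the transport when the flag service pushes a change.
  applyUpdate(flagKey, rule, version) {
    if (version <= this.version) return false; // ignore stale or duplicate pushes
    this.rules.set(flagKey, rule);
    this.version = version;
    return true;
  }

  // Local, synchronous read: no network round trip per evaluation.
  isEnabled(flagKey, defaultValue = false) {
    const rule = this.rules.get(flagKey);
    return rule ? rule.enabled : defaultValue;
  }
}

const flagStore = new FlagStore({ 'new-checkout': { enabled: true } });
const enabledBeforePush = flagStore.isEnabled('new-checkout');
flagStore.applyUpdate('new-checkout', { enabled: false }, 1); // kill switch arrives
const enabledAfterPush = flagStore.isEnabled('new-checkout');
```

The version check is a design choice worth keeping even in a sketch: pushes can arrive out of order after a reconnect, and an older update must not overwrite a newer kill switch.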

Client-side evaluation with polling has propagation latency bounded by the polling interval. The SDK downloads the full rule set from the flag service on initialization and then polls for updates every N seconds. If the polling interval is 15 seconds (the default in the Unleash Node.js SDK), the worst-case rollback latency is 15 seconds plus network round-trip time: an SDK instance that polled just before the flag change holds the old rules for up to 15 more seconds. If the effective refresh interval is 300 seconds (the cache TTL the e-commerce startup had accidentally configured), the worst-case rollback latency is five minutes. In an incident where a bad flag causes production errors, the difference between 500 milliseconds and five minutes is a material difference in the damage window.

The interaction with caching is critical. Every layer of caching adds to the propagation latency. A server-side streaming SDK that updates its in-memory rule set in 100 milliseconds is still bounded by any cache the application places in front of it: if the application caches flag evaluation results in Redis with a 60-second TTL, the propagation latency is not 100 milliseconds. It is up to 60 seconds, until the cached result expires and the SDK evaluates the updated rule. The ADR must account for every caching layer in the propagation path, not only the flag service SDK's own refresh mechanism.
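As a rough way to put a number in the ADR, the worst case can be treated as the sum of each layer's maximum staleness. The sketch below assumes the layers are independent and additive, which gives an upper bound rather than a measured figure; the function name and layer labels are illustrative, and the numbers are the ones used in the paragraph above.

```javascript
// Upper bound on propagation latency: add up the worst-case staleness
// of every layer between the flag change and the evaluated result.
function worstCasePropagationMs(layers) {
  return layers.reduce((total, layer) => total + layer.maxStalenessMs, 0);
}

const streamingOnlyMs = worstCasePropagationMs([
  { name: 'sdk-stream', maxStalenessMs: 100 },
]);

const streamingPlusRedisMs = worstCasePropagationMs([
  { name: 'sdk-stream', maxStalenessMs: 100 },
  { name: 'redis-result-cache', maxStalenessMs: 60000 },
]);

console.log(streamingOnlyMs);      // 100
console.log(streamingPlusRedisMs); // 60100
```

The cache dominates: removing the SDK's 100 milliseconds entirely would change the total by less than 0.2%.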

Consistent user bucketing and experiment validity

Consistent user bucketing is the guarantee that the same user identifier always evaluates to the same flag variant for the duration of a flag's rollout — across multiple requests, across different application server instances, across sessions, and across devices when the same user identifier is present. This guarantee is a correctness requirement, not a performance optimization.

The implementation is deterministic hashing: take the user identifier (user ID, session ID, or any stable identifier), concatenate it with the flag name and an optional experiment seed to prevent variant correlation across simultaneous experiments, hash the concatenated string (MurmurHash and Fowler–Noll–Vo are commonly used for their speed and distribution uniformity), map the hash value into a bucket from 0 to 99, and place the user in the treatment group when the bucket is below the rollout percentage. A user whose bucket for a given flag is 7 is always in the treatment group when that flag's rollout percentage is 8% or higher and always in the control group when it is 7% or lower. The assignment is invariant to when the evaluation happens, which server handles the request, or how many times the flag is evaluated in a session.
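A minimal sketch of that procedure, using 32-bit FNV-1a because it fits in a few lines. The function names, the key layout (seed, flag name, user ID joined with colons), and the choice of 100 buckets are assumptions made for illustration; production SDKs differ in hash function and bucket granularity, and this version hashes UTF-16 code units, which matches byte-wise FNV-1a only for ASCII identifiers.

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units
// (identical to byte-wise FNV-1a for ASCII input).
function fnv1a32(input) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // interpret as unsigned
}

// Bucket in the range 0..99 for this user on this flag.
// Including the flag name keeps buckets independent across flags.
function bucketFor(userId, flagName, seed = '') {
  return fnv1a32(`${seed}:${flagName}:${userId}`) % 100;
}

// A user is in the rollout when their bucket is below the percentage.
function inRollout(userId, flagName, rolloutPercentage, seed = '') {
  return bucketFor(userId, flagName, seed) < rolloutPercentage;
}

// Same inputs always give the same answer, on any server, at any time.
const firstCall = inRollout('user-42', 'new-checkout', 10);
const secondCall = inRollout('user-42', 'new-checkout', 10);
console.log(firstCall === secondCall); // true
```

Because the comparison is "bucket below percentage", raising a rollout from 10% to 25% only adds users; nobody who already had the feature loses it.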

Without consistent bucketing, percentage rollouts are not rollouts — they are per-request sampling. Ten percent of requests see the new behavior, but there is no guarantee that any individual user's experience is coherent. A user making twenty requests during a shopping session may see the new product page layout on requests 1, 4, 7, 11, and 19, and the old layout on all other requests. The user experience flickers. More importantly for product analytics: the conversion event that occurs on request 20 cannot be attributed to either variant with confidence, because the user saw both variants during the session.
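The scale of the problem is easy to underestimate. At a 10% rollout with twenty requests per user, the chance that per-request sampling shows a user only one variant is 0.9^20 + 0.1^20, so roughly 88% of users see both. The simulation below is a sketch under those assumptions; it uses a small seeded generator (mulberry32) in place of Math.random() so the run is repeatable, and a modulo on the user index as a stand-in for a stable hash bucket.

```javascript
// Seeded PRNG (mulberry32) so the simulation is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Counts users who saw more than one variant across their requests.
function countMixedUsers(assignVariant, users, requestsPerUser) {
  let mixed = 0;
  for (let u = 0; u < users; u++) {
    const seen = new Set();
    for (let r = 0; r < requestsPerUser; r++) seen.add(assignVariant(u));
    if (seen.size > 1) mixed++;
  }
  return mixed;
}

const rand = mulberry32(42);
const perRequest = () => rand() < 0.1;               // new draw on every request
const perUser = (userIndex) => userIndex % 10 === 0; // stand-in for a stable bucket

const mixedPerRequest = countMixedUsers(perRequest, 1000, 20);
const mixedPerUser = countMixedUsers(perUser, 1000, 20);
console.log({ mixedPerRequest, mixedPerUser });
```

With deterministic assignment the mixed count is zero by construction; with per-request sampling it is the large majority of users, which is why the flicker is the norm rather than an edge case.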

The statistical validity consequence is severe. A/B test analysis depends on having a clean assignment: each user is in the treatment group or the control group for the duration of the experiment, and their conversion behavior is attributed to the variant they experienced. Per-request random sampling produces contaminated data: users in both groups simultaneously, conversion attribution ambiguous, sample sizes inflated (each request counts as an independent sample rather than each user), and the statistical tests produce apparent significance that does not correspond to a real treatment effect. The e-commerce startup's two months of A/B test results were unusable.

Vendor surface area and migration cost

The flag service creates coupling at a specific layer of the application: the evaluation call site. Every code path that gates behavior on a flag (for example, if ldClient.variation('new-checkout', user, false)) contains a direct reference to the flag service SDK. For a mature application, there may be hundreds of these call sites across dozens of services. The migration cost from one flag service to another is proportional to the number of call sites and the API differences between the source and target SDK.

LaunchDarkly's server-side Node.js SDK evaluates flags via ldClient.variation('flag-key', context, defaultValue). Unleash's Node.js SDK evaluates via unleash.isEnabled('flag-name', context). Flipt's API is a gRPC or REST call: POST /evaluate/v1/boolean. These APIs are not compatible. Migrating from LaunchDarkly to Unleash requires finding every ldClient.variation() call, understanding the context structure (LaunchDarkly contexts have a specific schema with kind, key, and custom attributes; Unleash context uses different field names), and rewriting the call to match the target SDK. In a codebase with 200 flag evaluation calls across 15 services, this is a week-long refactor plus regression testing for every flag-gated code path.

OpenFeature insulates the call site from this migration cost by providing a vendor-neutral evaluation interface. The OpenFeature SDK call is client.getBooleanValue('flag-key', false, evaluationContext) regardless of whether the underlying provider is LaunchDarkly, Unleash, Flipt, or anything else. Migrating flag services with an OpenFeature-instrumented codebase requires swapping the provider plugin configuration — not rewriting evaluation call sites. The coupling still exists, but it is localized to the provider initialization code rather than distributed across every evaluation call in the application.
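The pattern can be shown without any real SDK. The sketch below is a hypothetical stand-in, not the OpenFeature library: FlagClient, setProvider, resolveBoolean, and the two toy providers are invented for illustration, and only the getBooleanValue call shape is taken from the text. What it demonstrates is that the call site stays fixed while the provider changes.

```javascript
// Hypothetical vendor-neutral client; a stand-in for the pattern only.
class FlagClient {
  constructor(provider) {
    this.provider = provider;
  }

  setProvider(provider) {
    this.provider = provider;
  }

  getBooleanValue(flagKey, defaultValue, context = {}) {
    try {
      const value = this.provider.resolveBoolean(flagKey, context);
      return typeof value === 'boolean' ? value : defaultValue;
    } catch (err) {
      return defaultValue; // provider failure falls back to the caller's default
    }
  }
}

// Two toy providers standing in for two different flag services.
const vendorA = {
  resolveBoolean: (key, ctx) => key === 'new-checkout' && ctx.plan === 'pro',
};
const vendorB = {
  resolveBoolean: () => {
    throw new Error('backend unreachable');
  },
};

// The call site: identical whichever provider is configured.
function checkoutVersion(client, user) {
  return client.getBooleanValue('new-checkout', false, user) ? 'new' : 'old';
}

const flagClient = new FlagClient(vendorA);
const versionWithA = checkoutVersion(flagClient, { plan: 'pro' });
flagClient.setProvider(vendorB); // the only line a migration touches
const versionWithB = checkoutVersion(flagClient, { plan: 'pro' });
console.log(versionWithA, versionWithB); // new old
```

The second result also shows why the default value is part of the call signature: when the backing service fails, the caller's default decides the behavior, so a safe default doubles as a fallback kill switch.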

The options and their structural tradeoffs

LaunchDarkly

LaunchDarkly is the market-leading managed flag service, and the one most teams reach for when they decide to move off a homegrown system. Its primary structural advantages are streaming propagation and a mature experimentation layer.

All LaunchDarkly server-side SDKs use persistent streaming connections to the LaunchDarkly flag delivery network by default. Flag changes propagate from the LaunchDarkly dashboard to connected SDK instances in under 200 milliseconds in typical conditions. The rollback propagation latency for a server-side streaming deployment is bounded by the application's own request cycle, not by a polling interval. Client-side SDKs (JavaScript, mobile) also support streaming via SSE, meaning browser-side flag evaluation can receive kill-switch updates in the same 200-millisecond window rather than waiting for a polling cycle — a significant safety improvement over polling-based client-side SDKs.

LaunchDarkly's targeting model supports complex rules: percentage rollouts by user attribute, individual user targeting overrides, prerequisite flags (flag B only evaluates if flag A is true), multivariate flags (string, number, or JSON variants rather than only boolean), and targeting rules that combine multiple attribute conditions. The context model (introduced in a major-version release of each SDK, with version numbers that differ by language, replacing the older user model) allows flags to be evaluated against arbitrary context kinds: not only users but organizations, devices, request attributes, or any domain object.

The Experimentation add-on provides Bayesian experiment analysis, metric collection, and statistical significance reporting within the LaunchDarkly UI. For teams running A/B tests, the combination of consistent user bucketing, streaming propagation, and integrated metric tracking eliminates the need for a separate experimentation platform for flag-based experiments. The analysis runs against LaunchDarkly's collected impression and metric data rather than requiring a join between flag assignment logs and product analytics events in a data warehouse.

The cost model creates the primary adoption blocker. LaunchDarkly prices on unique context count — the number of unique user or entity contexts evaluated against flags in a billing period. The free tier allows 1,000 monthly contexts. The Starter tier ($150/month) supports 10,000. Above that, pricing moves to custom contracts. For a consumer application with 50,000 monthly active users, LaunchDarkly costs are in the range of $500–2,000/month depending on tier and negotiated rates. For a B2B SaaS with 500 organizations as the flagging unit (organization-level flags rather than user-level), the context count is much lower and the cost more manageable. The billing model means that the cost-effectiveness calculation depends heavily on what the flagging unit is and how many distinct units exist.

SDK maturity and documentation quality are the best in the category. Every mainstream language has a well-maintained LaunchDarkly SDK with streaming support, and the evaluation semantics are consistent across SDKs. Teams that have struggled with incomplete or inconsistently behaving SDKs in open-source flag services frequently cite LaunchDarkly's SDK reliability as a primary retention factor even when cost pressure is significant.

Unleash

Unleash is the dominant open-source feature flag platform. The self-hosted version is free; the managed cloud offering (Unleash Pro) starts at $80/month. The structural tradeoff versus LaunchDarkly is operational overhead for cost savings.

The propagation model in the open-source Unleash Node.js SDK uses polling by default, with a 15-second interval. The Unleash Pro cloud offering and the Enterprise tier add SSE streaming, reducing propagation latency for those tiers to LaunchDarkly-comparable levels. For self-hosted Unleash on the open-source version, the worst-case propagation latency for a flag change is 15 seconds — manageable for most gradual rollout scenarios, but a meaningful limitation for kill-switch scenarios in high-severity incidents where every additional second of bad behavior has cost.

Unleash's bucketing implementation uses deterministic hashing by default for percentage rollouts (the groupId parameter seeds the hash, allowing different percentage rollouts to be statistically independent). The stickiness setting controls which context field is hashed: a user ID, a session ID, or a random value. With stickiness tied to a stable identifier, the same user gets the same assignment on every evaluation and on every server instance; with random stickiness, the assignment is redrawn on each evaluation. For A/B testing, stickiness on a stable identifier is a correctness requirement, not optional.

The flag type model distinguishes operational flags (boolean, for kill switches and dark launches) from experiment flags (A/B with impression tracking) from permission flags (user subset access). The type distinction enforces different review and cleanup workflows: operational flags have a defined operational lifetime and are expected to be removed after the rollout completes; experiment flags have a defined experiment window. This built-in lifecycle classification is something homegrown systems consistently fail to implement, resulting in the 200-flag graveyard where nobody knows which flags are still active.

Unleash's Playground feature allows testing targeting rule evaluation against a hypothetical context before deploying the rule — useful for verifying that a complex targeting rule (enterprise plan AND region is EU AND not on the beta exclusion list) evaluates correctly for representative contexts before it reaches production users. Debugging targeting rule evaluation is one of the most frequent operational pain points with flag services, and having an evaluation sandbox in the dashboard reduces the trial-and-error cycle in the flag configuration UI.

The operational requirements for self-hosted Unleash: a PostgreSQL database for flag definitions and assignment history, a Node.js service for the Unleash API server, and an optional Redis layer for caching evaluated results at scale. Backup and restore procedures for the database are part of the operational responsibility. High availability requires multiple Unleash server instances with the database as the consistency layer. The team that self-hosts Unleash is responsible for upgrades, security patches, and incident response for the flag service infrastructure itself — the cost of the monthly bill is traded for the cost of operational ownership.

Flipt

Flipt is an open-source flag service written in Go, designed for infrastructure-centric teams that want flag definitions managed as configuration rather than as database records. The distinguishing characteristic is GitOps support: flag definitions can be stored as YAML files in a git repository and Flipt reads them from the filesystem or a git remote. A flag change is a pull request that goes through code review, gets merged to main, and Flipt picks up the new definition on its next reload cycle.

The propagation model for GitOps-backed Flipt is bounded by the git polling interval (configurable, default 30 seconds) plus the time for the change to be committed and pushed to the remote. This makes Flipt unsuitable for kill-switch scenarios that require sub-minute propagation — the workflow requires a PR merge and a polling cycle. For feature rollout management where the deployment workflow is the same as the code review workflow and rapid propagation is not a requirement, the git-native model reduces operational complexity: the flag state is in source control with full history, rollback is a git revert, and the flag history is the git log.

Flipt's evaluation API is gRPC-first with a REST gateway, in contrast to the HTTP-native APIs of LaunchDarkly and Unleash. For services already using gRPC (common in microservices architectures), this reduces the integration friction — flag evaluation is another gRPC service call rather than requiring a separate HTTP client. For services that are HTTP-only, the REST gateway provides equivalent functionality at the cost of higher latency per evaluation compared to gRPC (typically 2–5ms vs 0.5–2ms for gRPC over an internal network, negligible for most use cases).

Flipt does not have a dedicated experiment analysis layer. Variant flags (multivariate) are supported, but impression tracking and statistical significance analysis require integrating a separate analytics platform — the evaluation results must be streamed to the data warehouse and joined with outcome metrics to produce experiment results. For teams already operating a mature analytics stack, this integration is straightforward; for teams that want experiment analysis without additional infrastructure, Flipt's experimentation capability is limited relative to LaunchDarkly Experimentation or a purpose-built A/B testing platform.

OpenFeature

OpenFeature is not a flag service — it is a CNCF-standardized SDK interface for flag evaluation that decouples the evaluation call site from the underlying flag service. An application built against the OpenFeature SDK evaluates flags using a provider-neutral API; a provider plugin connects the SDK to the actual flag service backend.

The standard evaluation API covers the primitives that every flag service supports: boolean flags (getBooleanValue), string variants (getStringValue), number variants (getNumberValue), and structured object variants (getObjectValue). Each evaluation method accepts a default value (the value to return if the flag service is unreachable or the flag is not defined) and an evaluation context (the attributes used for targeting). The detailed variants of these methods (getBooleanDetails and its siblings) return not only the evaluated value but also the reason for that value (TARGETING_MATCH, DEFAULT, ERROR, CACHED), metadata that is useful for debugging and for understanding which targeting rule produced the result.

OpenFeature hooks provide a standard mechanism for adding behavior around flag evaluation: logging every evaluation with the flag key and result for observability, reporting flag impressions to an analytics sink for A/B testing purposes, enforcing a flag name convention check at evaluation time, or timing evaluation calls for latency monitoring. Hooks compose: a logging hook and an analytics hook and a validation hook can all run on every evaluation without any of them knowing about each other. In a homegrown flag system, this instrumentation is typically added piecemeal and inconsistently; OpenFeature's hook model standardizes it across all flag evaluations.
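Composition is the part worth seeing concretely. The following is a simplified sketch with only "before" and "after" stages; evaluateWithHooks and the hook objects are hypothetical and are not the OpenFeature SDK's API, which has additional stages and richer arguments. Each hook is written without reference to the others, and the pipeline runs all of them.

```javascript
// Simplified hook pipeline: every hook's "before" runs, then the flag
// is resolved, then every hook's "after" runs. Names are hypothetical.
function evaluateWithHooks(hooks, resolve, flagKey, defaultValue, context) {
  for (const hook of hooks) {
    if (hook.before) hook.before({ flagKey, context });
  }
  const value = resolve(flagKey, defaultValue, context);
  for (const hook of hooks) {
    if (hook.after) hook.after({ flagKey, context }, { value });
  }
  return value;
}

const logLines = [];
const impressions = [];

// Observability: record every evaluation.
const loggingHook = {
  after: (ctx, details) => logLines.push(`${ctx.flagKey}=${details.value}`),
};

// Experimentation: report an impression per evaluation.
const analyticsHook = {
  after: (ctx, details) =>
    impressions.push({ user: ctx.context.userId, flag: ctx.flagKey, value: details.value }),
};

// Convention check: reject flag keys that are not lowercase kebab-case.
const namingHook = {
  before: (ctx) => {
    if (!/^[a-z0-9-]+$/.test(ctx.flagKey)) throw new Error(`bad flag name: ${ctx.flagKey}`);
  },
};

const hookedResult = evaluateWithHooks(
  [namingHook, loggingHook, analyticsHook],
  () => true, // stand-in resolver
  'new-checkout',
  false,
  { userId: 'u-1' }
);
```

Adding a fourth hook, such as a latency timer, means appending to the array; none of the existing hooks or call sites change.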

The FlagD project (also CNCF) provides a reference OpenFeature provider backed by a JSON file or Kubernetes ConfigMap, enabling local development and testing without a connection to a real flag service. In CI environments where external network access to LaunchDarkly or Unleash is undesirable, FlagD provides a fully local flag evaluation backend that uses the same OpenFeature API the application uses in production. This means test suites can control flag state without mocking the flag service SDK — they set flag values in the FlagD configuration, and the OpenFeature provider evaluates against those values using the same code path that production uses.

OpenFeature's limitation: it standardizes the evaluation interface but not the management interface. There is no OpenFeature-standard API for creating flags, defining targeting rules, scheduling rollouts, or managing flag lifecycle. Each flag service has its own management UI, API, and concept model. Teams that use OpenFeature for evaluation still interact with vendor-specific management interfaces for all administrative operations. The migration path OpenFeature enables is the evaluation code migration, not the operational migration — moving flag definitions from LaunchDarkly to Unleash still requires manual recreation or custom export/import tooling.

Homegrown flag systems

Homegrown flag systems start with a database table and a middleware function and grow by accretion. Each new requirement — targeting by user attribute, percentage rollout, environment-specific values, emergency kill switches — adds code in a session or two, typically without a design review, and almost never with a documented architecture decision. The system works until it doesn't, and the failure modes are predictable.

Consistent bucketing is the most common missing feature. The natural implementation of "show this to 10% of users" is Math.random() < 0.1 — random on each evaluation. This is correct if the requirement is "show this feature on approximately 10% of page impressions" and incorrect if the requirement is "show this feature to a consistently assigned 10% of users across all their page impressions." The requirement is almost always the latter, but the implementation is almost always the former, because the correct implementation (deterministic hashing of a stable user identifier) is not the obvious first implementation and because nobody asks "is this rollout consistent?" until they observe the flickering behavior in production.

Propagation latency is the second failure mode. The homegrown flag system reads from the database, which is fast but creates a database query on every flag evaluation for every request. The obvious optimization is to add a cache. The cache has a TTL. The TTL determines the propagation latency. The TTL is set once and then tuned for performance rather than for rollback requirements, because the propagation latency requirement is not documented anywhere. The TTL in production is typically whatever it was set to the last time someone noticed database load from flag evaluation, which may be 60 seconds or 300 seconds or 10 minutes depending on which performance incident prompted the adjustment.
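A small sketch makes the coupling between TTL and propagation latency concrete. The class name `TtlFlagCache`, the loader callback, and the injectable clock are all assumptions made for illustration. A real deployment would more likely sit in front of Redis or a database client.

```typescript
// Caches a single flag value and reloads it only after the TTL has elapsed.
class TtlFlagCache {
  private readonly loadFromDb: () => boolean;
  private readonly ttlMs: number;
  private readonly now: () => number;
  private cached: { value: boolean; fetchedAtMs: number } | undefined;

  constructor(loadFromDb: () => boolean, ttlMs: number, now: () => number = () => Date.now()) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(): boolean {
    const t = this.now();
    if (this.cached === undefined || t - this.cached.fetchedAtMs >= this.ttlMs) {
      this.cached = { value: this.loadFromDb(), fetchedAtMs: t };
    }
    return this.cached.value;
  }
}

// Usage with a fake clock: the flag is disabled shortly after t = 0,
// but callers keep seeing the old value until the cached entry expires.
let dbValue = true;
let nowMs = 0;
const cache = new TtlFlagCache(() => dbValue, 300 * 1000, () => nowMs);
cache.get();                // loads true at t = 0
dbValue = false;            // operator disables the flag
nowMs = 299 * 1000;
const stale = cache.get();  // still true: the cached entry has not expired
nowMs = 300 * 1000;
const fresh = cache.get();  // false: the entry expired and the value was reloaded
```

With a 300-second TTL, the worst-case delay before a kill switch takes effect on this instance is just under 300 seconds, regardless of how fast the database write was.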

Audit logging is the third missing feature. A production flag change — changing a rollout from 100% back to 0% during an incident — should produce an audit record: who made the change, when, what the previous value was, and what the new value is. Homegrown systems typically have no audit log for flag changes, only for application events. The post-incident question "when exactly did we change the flag?" is answered by checking git commits (if flag values are in code) or by querying the database updated_at timestamp (if flag values are in a database table). Neither answer is complete: the git commit time may not match the deployment time, and the database timestamp records only the most recent change, not the history of all changes.
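The gap between an `updated_at` column and a real audit trail is that the former overwrites and the latter appends. A minimal in-memory sketch, with the hypothetical names `AuditedFlagStore` and `FlagChange`, might look like this. A real system would persist the history in an append-only table.

```typescript
interface FlagChange {
  flagKey: string;
  actor: string;                 // who made the change
  changedAtMs: number;           // when
  previousPercent: number | null; // null when the flag did not exist before
  nextPercent: number;
}

class AuditedFlagStore {
  private readonly percents = new Map<string, number>();
  private readonly changes: FlagChange[] = [];

  // Records the change before applying it, so every write leaves a history entry.
  setRolloutPercent(flagKey: string, percent: number, actor: string, changedAtMs: number = Date.now()): void {
    const previous = this.percents.get(flagKey);
    this.changes.push({
      flagKey,
      actor,
      changedAtMs,
      previousPercent: previous === undefined ? null : previous,
      nextPercent: percent,
    });
    this.percents.set(flagKey, percent);
  }

  getRolloutPercent(flagKey: string): number | undefined {
    return this.percents.get(flagKey);
  }

  // Full change history for one flag, oldest first.
  historyFor(flagKey: string): FlagChange[] {
    return this.changes.filter((c) => c.flagKey === flagKey);
  }
}
```

With this shape, the post-incident question "when exactly did we change the flag?" is answered by reading `historyFor`, which keeps every change rather than only the latest one.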

The case for keeping a homegrown system: it is justified only when the flag surface area is very small (fewer than five flags, none of which are used for A/B testing), the rollout scenarios are simple (all-or-nothing per environment, no user targeting), and the team has sufficient context in the codebase to evaluate whether any given flag change is safe without an external platform. The moment the system needs consistent user targeting for an A/B test or a rapid kill switch for a production incident, the homegrown implementation's undocumented constraints become load-bearing: the rollback latency, the bucketing model, and the absence of an audit log all matter, and fixing them requires a design session that should have happened before the first flag was evaluated in production.

The A/B testing layer and flag service interaction

Feature flags and A/B testing share the same technical substrate — consistent user bucketing into variant groups — but have different analytical requirements. A feature flag rollout cares only that the same user consistently sees the same variant. An A/B test additionally requires that each variant assignment is logged (impression tracking), that the assignment can be joined to outcome events (conversion tracking), and that the statistical analysis accounts for the experiment design (two-sided test vs one-sided, Bayesian vs frequentist, sequential testing with early stopping vs fixed-horizon).

The feature flag evaluation pattern decision record covers how flag evaluation mechanics work — consistent hashing, rollout percentages, variant assignment. The flag service selection decision record covers a separate question: which platform provides the experiment analysis layer, and what does your A/B testing workflow look like end-to-end?

With LaunchDarkly Experimentation: flag assignments are automatically logged as impression events; metric definitions (conversion events, numeric metrics) are created in the LaunchDarkly UI and linked to experiments; the analysis runs server-side in LaunchDarkly using Bayesian sequential analysis, which allows checking results before the experiment's end date without inflating the false positive rate. The workflow is self-contained: create flag, create metric, run experiment, read results in the LaunchDarkly dashboard.

With Unleash (self-hosted): flag assignments are tracked if the getVariant() method is used with impression event emission enabled. The impression events must be forwarded to an analytics sink — typically the data warehouse — where they can be joined to business outcome events. The statistical analysis runs in the data warehouse, a BI tool, or a purpose-built experimentation platform (GrowthBook, Eppo, Statsig). The workflow requires two systems: Unleash for flag management and variant assignment, a separate analytics platform for experiment analysis.

With a homegrown system: impression tracking is typically absent (nobody thought to add it when building the first version), outcome attribution is done by querying the application database for users in the treatment group (defined as "had flag evaluations that returned true"), and statistical analysis is done in a spreadsheet or Jupyter notebook. The result is frequently a p-value calculation on non-independent samples (a user who visited the site 20 times contributes 20 events rather than 1 user-level outcome), producing apparent statistical significance that evaporates when the analysis is corrected for repeated measures.
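The repeated-measures problem can be shown with a small sketch that computes a conversion rate two ways: per request event and per user. The types and function names (`RequestEvent`, `toUserOutcomes`, `perEventRate`, `perUserRate`) are hypothetical, and the rule that a user counts as converted if any of their requests converted is an assumption made for illustration.

```typescript
type Variant = "control" | "treatment";

interface RequestEvent { userId: string; variant: Variant; converted: boolean; }
interface UserOutcome { variant: Variant; converted: boolean; mixedExposure: boolean; }

// Collapses request-level events into one outcome per user.
// Users who saw both variants are flagged so they can be excluded.
function toUserOutcomes(events: RequestEvent[]): Map<string, UserOutcome> {
  const byUser = new Map<string, UserOutcome>();
  for (const e of events) {
    const existing = byUser.get(e.userId);
    if (existing === undefined) {
      byUser.set(e.userId, { variant: e.variant, converted: e.converted, mixedExposure: false });
    } else {
      existing.converted = existing.converted || e.converted;
      if (existing.variant !== e.variant) existing.mixedExposure = true;
    }
  }
  return byUser;
}

// Naive rate: every request counts as an independent sample.
function perEventRate(events: RequestEvent[], variant: Variant): number {
  const inVariant = events.filter((e) => e.variant === variant);
  return inVariant.filter((e) => e.converted).length / inVariant.length;
}

// User-level rate: each user contributes exactly one outcome.
function perUserRate(outcomes: Map<string, UserOutcome>, variant: Variant): number {
  let users = 0;
  let converted = 0;
  outcomes.forEach((o) => {
    if (o.variant !== variant || o.mixedExposure) return;
    users++;
    if (o.converted) converted++;
  });
  return converted / users;
}

// One heavy user with 20 converting requests and two light users who did not convert.
const events: RequestEvent[] = [
  ...Array.from({ length: 20 }, (): RequestEvent => ({ userId: "heavy", variant: "treatment", converted: true })),
  { userId: "light-1", variant: "treatment", converted: false },
  { userId: "light-2", variant: "treatment", converted: false },
];
const naive = perEventRate(events, "treatment");                  // 20 of 22 events
const corrected = perUserRate(toUserOutcomes(events), "treatment"); // 1 of 3 users
```

In this toy data the per-event rate is about 91% while the per-user rate is about 33%. The gap comes entirely from one user being counted 20 times.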

The flag service selection decision is also the implicit experiment platform selection decision. If the team uses LaunchDarkly and wants integrated experiment analysis, they adopt LaunchDarkly Experimentation. If they use Unleash and want experiment analysis, they build the data pipeline to a separate analytics platform. If they want to decouple flag service from experiment analysis (so they can swap either independently), they build the impression tracking layer against the OpenFeature hooks interface and route impressions to a dedicated experimentation platform regardless of which flag service backs the evaluation. Each of these is a valid architecture, but each has different operational dependencies and costs — and the choice is made implicitly when the flag service is selected, unless the ADR makes it explicit.
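The decoupled option can be sketched as a thin client that sits between the application and whichever provider backs evaluation. Everything here is a hypothetical interface written for illustration (`VariantProvider`, `ImpressionSink`, `TrackedFlagClient`). It is loosely modeled on the idea of running a hook after each evaluation and is not the OpenFeature SDK's actual API.

```typescript
interface EvaluationContext { targetingKey: string; }

// Hypothetical provider contract: an adapter for any flag service would implement it.
interface VariantProvider {
  resolveVariant(flagKey: string, context: EvaluationContext): string;
}

interface Impression {
  flagKey: string;
  targetingKey: string;
  variant: string;
  timestampMs: number;
}

// Hypothetical sink: in practice this would forward to an experimentation platform.
type ImpressionSink = (impression: Impression) => void;

class TrackedFlagClient {
  private readonly provider: VariantProvider;
  private readonly sink: ImpressionSink;
  private readonly now: () => number;

  constructor(provider: VariantProvider, sink: ImpressionSink, now: () => number = () => Date.now()) {
    this.provider = provider;
    this.sink = sink;
    this.now = now;
  }

  // Evaluates through the provider, then records the impression for every evaluation.
  getVariant(flagKey: string, context: EvaluationContext): string {
    const variant = this.provider.resolveVariant(flagKey, context);
    this.sink({ flagKey, targetingKey: context.targetingKey, variant, timestampMs: this.now() });
    return variant;
  }
}

// Two stand-in providers: swapping one for the other leaves call sites and the sink untouched.
const providerA: VariantProvider = { resolveVariant: () => "control" };
const providerB: VariantProvider = { resolveVariant: () => "treatment" };
const recorded: Impression[] = [];
const client = new TrackedFlagClient(providerA, (i) => { recorded.push(i); });
client.getVariant("new-checkout", { targetingKey: "user-42" });
```

Application code depends only on `TrackedFlagClient`, so replacing `providerA` with `providerB` changes the flag backend without touching impression tracking, and the sink can be replaced without touching the backend.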

The four AI chat session types that create undocumented flag service decisions

Flag service infrastructure decisions appear in ChatGPT and Claude sessions in a predictable pattern. The sessions are short, incremental, and solution-focused — each one answers a specific implementation question without stepping back to document the architectural choice being made. The consequences compound as the system grows.

The initial implementation session. "We want to dark-launch a feature to 10% of our users. What's the simplest way to implement feature flags in Node.js?" The response describes a database-backed implementation with percentage rollouts. The team implements it. The session closes. Nobody documented whether the implementation uses consistent user bucketing or per-request sampling — that question was not asked, and the natural implementation from the example code is per-request. The first A/B test run on this system will produce invalid results.

The performance optimization session. "Our feature flag database queries are adding 50ms to every request. How do we cache the flag values?" The response describes Redis caching with a configurable TTL. The team adds the cache and sets the TTL to 300 seconds to minimize cache churn. The session closes. Nobody documented that the cache TTL is now the worst-case propagation latency: a flag change can take up to 5 minutes to reach all application instances. The next time a bad rollout needs a kill switch, the 5-minute delay is a surprise.

The vendor evaluation session. "We've been using LaunchDarkly but it's getting expensive. Can we migrate to Unleash?" The response describes the Unleash setup, the migration steps, and the API differences. The team evaluates the effort and decides it's too large: there are 150 LaunchDarkly SDK calls across 12 services and rewriting them all is a multi-week project. The session closes. The team stays on LaunchDarkly because of the migration cost, not because it is the better technical fit. The architectural constraint that produced the migration difficulty (using the vendor SDK directly rather than behind an abstraction) is not documented. The next engineer to evaluate the cost impact of LaunchDarkly pricing will discover the same migration complexity from scratch.

The A/B test debugging session. "Our A/B test shows 15% conversion improvement but when I look at the data more carefully the numbers don't add up. What could cause this?" The response discusses possible causes: sample ratio mismatch, novelty effect, peeking at p-values early, non-independent samples. The engineer investigates, discovers the per-request bucketing problem, and fixes it. The session closes. The bucketing fix is deployed. But the root cause (that the flag system was using random sampling rather than consistent hashing from the beginning) is not documented, and the A/B tests run before the fix produced data that was never marked as invalid. Future analysts who look at historical experiment results will see the period before the bucketing fix and the period after and may attribute the apparent variance to external factors rather than a measurement artifact.

Each of these sessions is a fragment of the flag service ADR. Together they document the flag service history — but only if they are extracted from the closed chat windows and assembled into a coherent decision record. The decisions that never get written down are not the big vendor selections — they are the implementation choices made under deadline pressure, during performance incidents, and during debugging sessions, where the "why" behind each choice evaporates the moment the session closes.

What the flag service infrastructure ADR must contain

An architecture decision record for flag service infrastructure covers the platform selection and the operational model. Unlike the flag evaluation pattern ADR (which covers how flags should be structured and evaluated within your code), the flag service infrastructure ADR covers which external service or self-hosted platform backs that evaluation and what the operational properties of that backing service are.

Section 1: Current flag usage and classification

What flags exist, what type each one is (operational kill switch, gradual rollout, experiment, permission gate), and what the current evaluation mechanism is. The type classification matters because different flag types have different propagation latency requirements: a kill switch for a production incident needs sub-second propagation; a permission gate for a beta feature can tolerate 30-second propagation. The ADR must specify the propagation requirement for each flag type, because this requirement drives the SDK configuration and caching model that is appropriate.

Also in this section: who changes flags. If flag changes are made only by engineers during deployments, a polling interval of 15 seconds may be acceptable — the flag change happens alongside a deployment and the polling window is a minor delay. If flag changes are made by product managers, customer success teams, or automated systems in response to real-time signals (circuit breakers, error rate monitors), the propagation latency requirement is tighter and the evaluation model (streaming vs polling) is more consequential.
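One way to make the per-type requirement checkable is to write it down as data and compare it against the SDK's propagation model. The sketch below uses hypothetical names throughout. The kill switch and permission gate limits come from the figures above. The limits for gradual rollouts and experiments are placeholder assumptions that each team would replace with its own numbers.

```typescript
type FlagType = "kill_switch" | "gradual_rollout" | "experiment" | "permission_gate";

// Maximum tolerable propagation delay per flag type, in milliseconds.
const maxPropagationMs: Record<FlagType, number> = {
  kill_switch: 1000,       // "sub-second", taken here as at most 1000 ms
  permission_gate: 30000,  // 30 seconds is tolerable
  gradual_rollout: 15000,  // placeholder assumption, not from the text
  experiment: 30000,       // placeholder assumption, not from the text
};

type PropagationModel =
  | { kind: "polling"; intervalMs: number }
  | { kind: "streaming"; measuredPushLatencyMs: number };

// Approximates the worst case: a full polling interval, or the measured push latency.
// Network and processing time on top of the polling interval are ignored.
function worstCaseLatencyMs(model: PropagationModel): number {
  return model.kind === "polling" ? model.intervalMs : model.measuredPushLatencyMs;
}

// Returns the flag types whose requirement the given model cannot meet.
function unmetRequirements(flagTypesInUse: FlagType[], model: PropagationModel): FlagType[] {
  const worst = worstCaseLatencyMs(model);
  return flagTypesInUse.filter((t) => worst > maxPropagationMs[t]);
}
```

For example, `unmetRequirements(["kill_switch", "permission_gate"], { kind: "polling", intervalMs: 15000 })` reports only the kill switch, which is the case the ADR needs to call out explicitly.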

Section 2: Evaluation model and SDK selection

Which flag service is used (LaunchDarkly, Unleash, Flipt, homegrown, or OpenFeature with a named provider), which SDKs are deployed (server-side, client-side, mobile), and what the propagation model is for each SDK (streaming via SSE, polling at what interval, on-demand evaluation). This section must include the measured propagation latency — not the theoretical minimum from the SDK documentation, but the observed latency from a test: change a flag in the service, measure the time until all connected SDK instances have picked up the change. The measured latency is the rollback guarantee that the incident commander has in a production emergency.
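The measurement itself reduces to simple arithmetic once each instance reports when it first saw the new value. The sketch below assumes a hypothetical test harness has already collected those timestamps. The names `propagationReport` and `PropagationReport` are illustrative, and an instance that never observed the change is represented as `null`.

```typescript
interface PropagationReport {
  worstMs: number | null;      // slowest observed propagation, null if nobody saw the change
  pendingInstances: string[];  // instances that had not picked up the change when the test ended
}

// changedAtMs: when the flag was changed in the service.
// observedAtMs: per instance, when it first evaluated the new value (null if never).
function propagationReport(
  changedAtMs: number,
  observedAtMs: Record<string, number | null>,
): PropagationReport {
  let worstMs: number | null = null;
  const pendingInstances: string[] = [];
  for (const instance of Object.keys(observedAtMs)) {
    const seenAt = observedAtMs[instance];
    if (seenAt === null || seenAt === undefined) {
      pendingInstances.push(instance);
      continue;
    }
    const latency = seenAt - changedAtMs;
    if (worstMs === null || latency > worstMs) worstMs = latency;
  }
  return { worstMs, pendingInstances };
}
```

The number to record in the ADR is `worstMs` from a run where `pendingInstances` is empty. A run with pending instances shows only that the real figure is longer than the test window.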

The OpenFeature decision belongs in this section: is the application instrumented against the OpenFeature SDK or the vendor SDK directly? If the vendor SDK is used directly, what is the estimated migration cost (number of evaluation call sites across all services), and is there a plan to migrate to an OpenFeature abstraction? If the application already uses OpenFeature, which provider is active and what is the procedure for switching providers if the flag service must change?

Section 3: Bucketing model and A/B testing validity

How user-to-variant assignment works: the hash function used, the identifier used as the hash input (user ID, session ID, device ID), the seed strategy for preventing variant correlation across simultaneous experiments, and whether the bucketing has been validated for uniform distribution. The validation should be an empirical check, not assumed: bucket 1,000 test user IDs for a 50% rollout and verify that roughly 460–540 fall into the treatment group. With a standard deviation of about 16 users, that range contains about 99% of outcomes from a correctly uniform bucketing function. If the distribution is significantly non-uniform, the configured rollout percentage does not match the actual traffic allocation.
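The empirical check can be sketched as follows. It uses a normal approximation to the binomial distribution, which is an assumption of this sketch, and takes the bucketing function as a parameter so that any implementation can be tested. The names `acceptanceRange`, `countInTreatment`, and `looksUniform` are hypothetical. Under this approximation, the 99% range for 1,000 IDs at a 50% rollout works out to 460 through 540.

```typescript
// Range of treatment counts expected from a uniform bucketing function.
// z = 2.576 corresponds to a two-sided 99% interval under the normal approximation.
function acceptanceRange(
  sampleSize: number,
  rolloutFraction: number,
  z: number = 2.576,
): { low: number; high: number } {
  const mean = sampleSize * rolloutFraction;
  const sd = Math.sqrt(sampleSize * rolloutFraction * (1 - rolloutFraction));
  return { low: Math.ceil(mean - z * sd), high: Math.floor(mean + z * sd) };
}

// Counts how many of the given IDs the bucketing function assigns to treatment.
function countInTreatment(userIds: string[], inTreatment: (userId: string) => boolean): number {
  return userIds.filter(inTreatment).length;
}

// True when the observed treatment count falls inside the acceptance range.
function looksUniform(
  userIds: string[],
  rolloutFraction: number,
  inTreatment: (userId: string) => boolean,
): boolean {
  const { low, high } = acceptanceRange(userIds.length, rolloutFraction);
  const observed = countInTreatment(userIds, inTreatment);
  return observed >= low && observed <= high;
}
```

A correct implementation will still fail this check about 1% of the time by chance, so a single failure is a prompt to rerun with a different set of IDs rather than proof of a bug.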

The A/B testing workflow: where impression events are logged, where outcome events are captured, where the join between impression and outcome is computed, and which statistical framework is used for the significance analysis (frequentist with a fixed sample size, sequential Bayesian, or CUPED-adjusted for pre-experiment variance reduction). The ADR must specify whether historical experiment results before a specific date are valid — if the bucketing implementation changed, tests before the change date should be marked as using an invalidated assignment model.

Section 4: Self-hosted vs managed tradeoff rationale

If self-hosting: the infrastructure required (database, service instances, monitoring, backup procedures), the team responsible for operations, and the cost comparison that justified self-hosting over the managed offering. The cost comparison must account for the fully loaded cost of the operational work — database administration time, incident response for the flag service, upgrade cycles — not only the licensing fee delta. A team that self-hosts Unleash to save $150/month on LaunchDarkly but spends two engineer-hours per month on Unleash operations at $100/hour is not saving money — they have only moved the cost from the infrastructure budget to the engineering budget where it is less visible.

If using a managed service: the cost model (per-context, per-seat, flat rate), the context counting strategy (what constitutes a unique context in the billing model and whether the application's user model maps efficiently to that billing unit), and the trigger conditions for re-evaluating the vendor decision (what volume of monthly active users would make the managed service cost unjustifiable, and what migration path would be executed at that point). The build-vs-buy decision record framework applies here: the flag service infrastructure decision is a recurring buy-vs-self-operate decision that should be revisited as usage scales.

Section 5: Flag lifecycle policy

How flags are created, reviewed, and removed. The lifecycle policy is the organizational guardrail that prevents the 200-flag graveyard. Every flag should have: a defined type (operational, experiment, permission), an expected lifetime (permanent for permissions, bounded for experiments, deployment-scoped for operational flags), an owner (the team or engineer responsible for removing the flag after its purpose is served), and a review trigger (when does the flag show up in a cleanup review?). Without a lifecycle policy, flags accumulate because removing a flag requires confidence that nothing depends on it, and that confidence requires knowing when the flag was last changed, what it controls, and whether any A/B test results depend on it remaining in its current state — information that is typically not available without the ADR.

The lifecycle policy is where the CI/CD pipeline decision record intersects the flag service: flag cleanup should be part of the deployment checklist for features that have been fully launched. The engineer who deploys the last phase of a gradual rollout should, as part of that deployment, remove the flag from the code and the flag service. Treating flag cleanup as a separate task that is deferred until "later" is the habit that produces the 200-flag graveyard.
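The lifecycle fields listed above map naturally onto a small record type and a review query. This is a sketch under stated assumptions: the field names, the `FlagRecord` type, and the use of age since creation as the review trigger are all illustrative choices. A real policy might trigger on last evaluation time or rollout completion instead.

```typescript
type LifecycleType = "operational" | "experiment" | "permission";

interface FlagRecord {
  key: string;
  type: LifecycleType;
  owner: string;                  // team or engineer responsible for removal
  createdAtMs: number;
  maxLifetimeDays: number | null; // null means permanent, for example a permission flag
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the flags that have outlived their expected lifetime and should appear in a cleanup review.
function dueForCleanupReview(flags: FlagRecord[], nowMs: number): FlagRecord[] {
  return flags.filter(
    (f) => f.maxLifetimeDays !== null && nowMs - f.createdAtMs > f.maxLifetimeDays * MS_PER_DAY,
  );
}
```

Running a query like this on a schedule, and routing each result to the flag's `owner`, gives the review trigger a concrete form.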

Section 6: Emergency flag change procedure

This is the section that is read at 2am during a production incident. For each flag type, what is the procedure for rapidly changing a flag to stop bad behavior? Who has permission to change the flag (is there an approval workflow, or can any engineer change any flag at any time?), where is the flag management interface, what is the expected propagation latency after the change is made, and how is the propagation verified (is there a monitoring view that shows which SDK instances have received the updated flag state)?

The propagation latency from section 2 belongs in this section as a reference: "after changing a flag value in the Unleash dashboard, expect the new value to be active within 15 seconds on all server-side SDK instances. If the value has not propagated after 60 seconds, check the Unleash server health endpoint and verify that the application instances' SDK connections to Unleash are active." This is what the incident commander needs, and it should be in the ADR before an incident forces the team to figure it out under pressure.

The propagation latency requirement is always clearer in hindsight

Every team that has used a feature flag for a kill switch during a production incident has the same retrospective insight: the propagation latency was not a design consideration when the flag system was built, but it was the most operationally significant property of the flag system when the incident arrived.

The e-commerce startup's five-minute rollback window during Black Friday was not the result of a bad engineering decision. It was the result of no engineering decision — the cache TTL was set for performance reasons without reference to a propagation latency requirement that had never been stated. If the ADR had included section 2 (evaluation model and propagation latency) and section 6 (emergency flag change procedure), the five-minute TTL would have been a visible choice rather than an invisible default.

The performance optimization decision record covers a related dynamic: performance improvements made without documenting the tradeoffs create operational surprises when the tradeoffs materialize. A cache TTL that eliminates database load is a performance optimization; the same cache TTL that delays a kill-switch propagation is an operational constraint. Both are true simultaneously, and the ADR is where both truths live.

The flag service infrastructure decision is also the A/B testing infrastructure decision. The authentication strategy decision record establishes the user identity model that the flag service's targeting rules operate against — the user attributes available for targeting (plan tier, region, account age) are the output of the authentication and session system, and the consistency of those attributes across requests is what enables consistent user bucketing. A flag service that cannot see stable user identifiers cannot provide consistent bucketing; the session model and the authentication strategy are prerequisites for a well-functioning gradual rollout system.

The WhyChose decision extractor was built for the sessions that the e-commerce startup had lost: the performance optimization session where the cache TTL was changed without documenting the propagation latency consequence, the A/B testing session where the bucketing model was discussed without producing a decision record. Those sessions contained the ADR. The extractor recovers them — not by retrieving closed ChatGPT windows, but by ensuring that the next engineering team that makes the same decisions under the same pressures has a place to put the reasoning before the session closes.

Further reading on related architectural decision records:

The feature flag evaluation pattern decision record — the evaluation mechanism, flag lifecycle policy, and consistent hashing implementation that this post's vendor selection decision must align with.
The build-vs-buy decision record — the framework for deciding when to self-host a flag service versus pay for a managed offering, and how to account for fully-loaded operational costs in the comparison.
The performance optimization decision record — how caching decisions that optimize for throughput create operational constraints that are only visible during incidents.
The authentication strategy decision record — the user identity model and session attributes that flag targeting rules operate against; consistent user bucketing requires stable user identifiers that the auth system must provide.
The CI/CD pipeline decision record — where flag-based deployments intersect the deployment pipeline; flag cleanup should be part of the deploy checklist for fully launched features.
Decisions never written down — the pattern of incremental implementation decisions that together define a system's architecture without any single session being a visible architectural choice.
How to document architecture decisions — the ADR format and conventions used across all decision records in this series.
WhyChose decision extractor — recover the flag service infrastructure decisions buried in your AI chat history.