2026-07-02 · ~23 min read

The feature store decision record: why the feature computation and serving model you chose determines your training-serving skew surface and your feature freshness ceiling

Feature engineering decisions are made in a Jupyter notebook by a data scientist who has no model deployment responsibility. The features that worked in offline training are then reimplemented for the serving path — by a backend engineer, in a different language, against a different data store, without a shared contract specifying that both implementations must produce identical values. The moment the normalization function uses a different base, the null imputation uses a different strategy, or the timestamp interpretation uses a different timezone, offline evaluation metrics cease to predict online performance. The feature store architecture you choose determines whether training and serving share a single transformation definition or accumulate silent divergences that appear as unexplained model degradation three months after deployment.

A 14-person e-commerce startup building a product recommendation system created their feature engineering pipeline in the way most ML teams do: a data scientist opened a Jupyter notebook, wrote Python and Pandas code to compute user features from a PostgreSQL database snapshot, trained a gradient-boosted tree model on a six-month event history, and evaluated it against a held-out validation set. The offline evaluation was strong — 18% click-through rate improvement over the baseline random recommendations. The model was handed to an ML engineer for deployment.

The ML engineer built a serving API in Go that computed the same user features at inference time against a Redis hash containing the latest user event counts. The feature set was straightforward on its face: session count in the last 7 days, average session length in the last 7 days, top category affinity score (the proportion of click events attributed to the user's most-clicked product category), and a recency score based on days since last purchase. The ML engineer read the feature engineering notebook, understood the intent of each feature, and implemented the same four features in Go. The category affinity score in the notebook used a log-normalized proportion: the raw proportion was passed through a natural log transformation to compress high-affinity users and spread the lower end of the distribution. The ML engineer's Go implementation used log base 10 instead. Go's math.Log computes the natural log, but the engineer's implementation notes referenced a log10 example from a different context, and the implementation followed the notes. Both transformations return zero or a negative value for any proportion between 0 and 1, both vary monotonically with click proportion, and both are numerically plausible. The difference is invisible at a glance.

The recommendation model was deployed. In the first week, the measured click-through rate improvement was 11.2%, materially lower than the 18% improvement the offline evaluation had predicted but not dramatically so. The team attributed the gap to distribution shift: the production user base was slightly different from the training data population, which was expected. The model ran in production for three months. The improvement remained around 11-12%, never approaching the offline evaluation figure. The team retrained the model on a more recent three-month window, reasoning that fresher training data would reduce the distribution shift. The retrained model's offline evaluation showed a 19% improvement. Its production improvement was 10.8%, slightly worse than the original deployed model. The retraining had moved offline metrics in the right direction and production metrics in the wrong direction simultaneously.

The investigation that followed took three weeks. A senior ML engineer compared feature values computed by the training notebook against feature values returned by the serving API for the same user entity at the same timestamp. For session count, average session length, and recency score, the values matched. For category affinity score, they differed: a user with a 0.60 raw click proportion received a category affinity score of -0.511 (natural log of 0.60) in the training notebook and a score of -0.222 (log base 10 of 0.60) in the serving API. The model's learned weights for the category affinity feature were calibrated to the natural log distribution. The serving path was feeding it log base 10 values. Every user's category affinity score was wrong in production: numerically plausible, in range, and systematically miscalibrated. Retraining made the skew worse, not better. The retrained model was fit to the notebook's natural log values and then served log base 10 values, which left a larger gap between training and serving than the original model had, because the original model's weights had been partially adapted by three months of online feedback.
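The size of the gap follows from the change-of-base identity: log10(x) = ln(x) / ln(10), so the serving path returned roughly 43% of the value the model had been trained on, for every user. A minimal sketch in Python of that comparison; the function names affinity_training and affinity_serving are hypothetical stand-ins for the notebook and Go implementations, not code from either:

```python
import math

def affinity_training(click_proportion: float) -> float:
    """Natural-log normalization, as the training notebook computed it."""
    return math.log(click_proportion)

def affinity_serving(click_proportion: float) -> float:
    """Base-10 normalization, as the divergent serving path computed it."""
    return math.log10(click_proportion)

# Illustrative proportions; 0.60 is the example from the investigation.
for p in (0.20, 0.60, 0.90):
    trained_on = affinity_training(p)
    served = affinity_serving(p)
    print(f"p={p:.2f}  training={trained_on:.3f}  serving={served:.3f}  "
          f"ratio={served / trained_on:.3f}")
```

Because the two outputs differ only by a constant factor, a range check or a monotonicity check passes on both. Only a value-level comparison for the same input exposes the mismatch.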

Fixing the skew required changing the Go serving implementation to match the natural log normalization, redeploying the serving API, and monitoring click-through rate for two weeks to confirm recovery. The feature engineering notebook was never shared in a form the Go engineer could execute — it was a training artifact, not a shared computation contract. The divergence had existed since day one of deployment and had been invisible because both implementations produced numerically valid outputs with no errors, no alerts, and no explicit comparison between the two paths.

An 18-person fintech startup building a buy-now-pay-later platform implemented their fraud detection model with real-time velocity features computed directly inside the payment service. When a payment was initiated, the payment service queried Redis for the user's recent payment velocity — the count of payments from the same device in the last 60 minutes, the count of payments to the same merchant in the last 24 hours, and the total payment amount from the same IP in the last 10 minutes. These values were looked up from sorted sets in Redis that were updated by the payment service on every transaction. The fraud model received these three velocity features along with static user attributes (account age, KYC verification tier, linked bank account count) and scored the payment as low, medium, or high risk.

The model had been trained on a dataset assembled by the founding data scientist, who ran a bulk export of six months of payment events from the application database, computed the same three velocity features from the raw event history using a Python script, joined the computed features to each training payment event, and trained the model on the assembled dataset. The Python script computed velocity features by sorting events by timestamp, grouping by device ID and merchant ID, and counting events within a sliding window using Pandas rolling operations. The production computation used Redis sorted sets with score as Unix timestamp and ZRANGEBYSCORE to count events within the time window.

The fraud model performed adequately and the team moved on to building other features. Eight months later, the team wanted to retrain the fraud model on the full eight months of historical data to incorporate new fraud patterns that had emerged from a cohort of compromised accounts. The data scientist wrote a new training script that would recompute the velocity features from the raw event history in the database. The recomputation was necessary because the fraud model in production read from Redis, and Redis had no record of what velocity values had been served eight months ago — it only had the current values computed from the most recent sliding windows. The historical feature values that the original model was actually trained on were not stored anywhere; the original training script had produced them once and they had not been archived.

The data scientist recomputed the velocity features from the database event history using the same Pandas rolling window approach as the original training script. The retrained model's offline evaluation metrics were strong. The retrained model was deployed. Fraud detection recall dropped by 7 percentage points in the first week. The team initially attributed the drop to the new fraud patterns being different from the historical patterns the model had trained on. Closer investigation revealed a more fundamental problem: the device ID velocity feature in the recomputed training data disagreed with the Redis-computed value for the same payment events. The recomputed training script counted payments from the same device within 60 minutes using event timestamp as stored in the database. The Redis computation counted payments within 60 minutes using the timestamp at which the event was written to Redis — which was different from the event timestamp in cases where the payment processing pipeline had retry delays, timezone inconsistencies between the mobile SDK's clock and the server clock, or event ordering differences from concurrent writes. The two computations were semantically equivalent by intent but produced different numeric values in edge cases that occurred in approximately 12% of training examples. The model retrained on recomputed features had learned weights calibrated to a feature distribution that had never existed in production. The original model's weights, calibrated to Redis-computed values from the founding training run, were more accurate than the retrained model's weights — but the original values were gone and could not be recovered.
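The divergence between event time and write time can be reproduced with three events. The sketch below is illustrative only: the timestamps are invented, and plain Python lists stand in for both the Pandas rolling window and the Redis sorted set.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=60)

# Hypothetical events for one device: (event_time, redis_write_time).
# The second payment was retried and reached Redis 9 minutes late.
events = [
    (datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 1, 9, 5)),
    (datetime(2024, 1, 1, 9, 55), datetime(2024, 1, 1, 10, 4)),
    (datetime(2024, 1, 1, 10, 20), datetime(2024, 1, 1, 10, 20)),
]

def count_in_window(timestamps, as_of):
    """Count timestamps inside the half-open window (as_of - WINDOW, as_of]."""
    return sum(1 for ts in timestamps if as_of - WINDOW < ts <= as_of)

scoring_time = datetime(2024, 1, 1, 10, 0)

# Training-style recomputation keys on when the payment happened.
by_event_time = count_in_window([e for e, _ in events], scoring_time)
# Serving-style lookup keys on when the payment was written to the store.
by_write_time = count_in_window([w for _, w in events], scoring_time)

print(by_event_time, by_write_time)
```

Both counts answer "how many payments from this device in the last 60 minutes", yet they disagree for the same scoring moment because the delayed write falls outside the serving-side window.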

The remediation required adding a feature logging layer to the payment service that recorded the exact Redis-computed velocity values for each payment alongside the payment event, archiving those values to the data warehouse at the time of computation rather than recomputing them from raw events after the fact. Until sufficient logged data accumulated to retrain the model on real historical feature values, the fraud model was frozen at the original weights. The team ran without the improved fraud patterns for four additional months while the logging layer accumulated a new eight-month training window of properly archived feature values. The architectural decision that caused this was not the choice to use Redis for velocity features — it was the absence of a historical feature store that archived computed feature values at the time of computation, making it impossible to recover what the serving path had actually computed when training the model's replacement.
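A feature logging layer of this kind can be small. The following is a sketch under stated assumptions: the record layout, the feature names, and the payment identifier are hypothetical, and an in-memory io.StringIO stands in for whatever stream is shipped to the warehouse.

```python
import io
import json
from datetime import datetime, timezone

def log_served_features(sink, payment_id, features, served_at):
    """Append one JSON line recording exactly what the model was given."""
    record = {
        "payment_id": payment_id,
        "served_at": served_at.isoformat(),
        "features": features,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record

# io.StringIO is a stand-in for the real log destination (assumption).
sink = io.StringIO()
log_served_features(
    sink,
    payment_id="pay_0001",  # hypothetical identifier
    features={
        "device_payments_60m": 3,
        "merchant_payments_24h": 1,
        "ip_amount_10m": 249.90,
    },
    served_at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
)

# A later training job reads the archived values instead of recomputing them.
archived = [json.loads(line) for line in sink.getvalue().splitlines()]
print(archived[0]["features"])
```

The point of the design is that the archived record is written at computation time, so the training set is built from the values the model actually saw rather than from a reconstruction.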

Structural properties set by the feature store architecture decision

Four structural properties are determined when a team chooses how to compute and serve ML features. None appear explicitly in the founding session that wrote the first feature engineering notebook or wired the first Redis velocity lookup — they are the operational consequences of a design choice made under the pressure of getting a model into production as quickly as possible.

Property 1: Training-serving skew and the feature computation contract. A model's learned weights are calibrated to the distributions of the features it was trained on. When the serving path computes the same named features using different logic (different normalization, different null imputation, different aggregation window boundary handling, different timezone interpretation), the model receives a different distribution at inference time than it was calibrated for. The degradation grows with the semantic distance between the training-path and serving-path computations, and it compounds across features: a 5% distributional shift in each of 4 independently skewed features can produce a larger compounded degradation than a 20% shift in a single feature, because the model's joint prediction depends on all features simultaneously.

The root cause of training-serving skew is almost always that the feature transformation is defined twice: once in a training context (Python, Pandas, SQL) and once in a serving context (Go, Java, a database stored procedure, a lookup against a real-time data store). Both definitions are intended to implement the same logic, but there is no mechanism to verify that they do. The fix is not to audit both implementations and make them match — that produces a verified snapshot that diverges again the next time either implementation is changed. The fix is to define the feature transformation once, in a form that is executable in both contexts, and to execute the same artifact in both training and serving. A feature store achieves this by defining features as versioned transformation functions — a Python function, a SQL expression, or a streaming aggregation — that the feature platform executes against the offline store (for training) and against the online store (for serving) using the same transformation code. If the feature transformation must be implemented in different runtimes (the serving path requires sub-millisecond latency that Python cannot deliver), the feature store provides a skew detection mechanism: it logs feature values computed in the serving path and compares them against what the offline store would have computed for the same entity at the same timestamp, alerting when the difference exceeds a configured threshold. The data pipeline decision record documents the batch and streaming pipeline that feeds the offline feature store; the feature store ADR must document the computation ownership model — which team defines each transformation, in what form, and which mechanism guarantees that training and serving execute the same logic rather than trusting that two independent implementations happened to agree.
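The skew detection mechanism described above reduces to a keyed comparison plus an alert threshold. A minimal sketch, assuming served and offline values have already been collected into dictionaries keyed by (entity, feature, timestamp); the function names, the tolerance, and the 1% alert rate are illustrative choices, not the behavior of any particular feature store product:

```python
def find_skew(served, offline, tolerance=1e-6):
    """Compare served feature values against offline values for the same keys.

    Both arguments map (entity_id, feature_name, timestamp) -> float.
    Returns the keys whose values are missing offline or differ by more
    than `tolerance`.
    """
    skewed = []
    for key, served_value in served.items():
        offline_value = offline.get(key)
        if offline_value is None or abs(served_value - offline_value) > tolerance:
            skewed.append(key)
    return skewed

def should_alert(served, offline, max_skew_rate=0.01, tolerance=1e-6):
    """Alert when the share of skewed values exceeds the configured rate."""
    if not served:
        return False
    skew_rate = len(find_skew(served, offline, tolerance)) / len(served)
    return skew_rate > max_skew_rate

# Hypothetical logged values for one user at one timestamp.
served = {
    ("user_1", "session_count_7d", 1700000000): 4.0,
    ("user_1", "category_affinity", 1700000000): -0.222,
}
offline = {
    ("user_1", "session_count_7d", 1700000000): 4.0,
    ("user_1", "category_affinity", 1700000000): -0.511,
}

print(find_skew(served, offline))
print(should_alert(served, offline))
```

Run against the recommendation example, a check like this would have flagged the category affinity feature on the first comparison, without anyone needing to read either implementation.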

The computation contract must cover null handling explicitly. Null values arise when an event has not occurred (a new user with no purchase history has a null 30-day purchase count), when data is delayed (a payment event exists in the source but has not yet been processed by the pipeline feeding the feature store), or when the entity type is unexpected (a merchant being scored by a model trained only on user entities). Training-path null handling is often implicit: a Pandas fillna(0) is added when a value unexpectedly becomes NaN during model training without a conscious decision that zero is the correct imputation. The serving path implements its own null handling, often with a different default, because the backend engineer is solving a different problem (preventing a null pointer exception) rather than matching a specific training-time imputation strategy. The feature store ADR must document the null handling policy for each feature as an explicit part of the transformation definition, not as a detail left to the individual implementation.
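One way to make the null policy part of the transformation definition is to declare it as data next to the feature name and route every read through it. The sketch below is an assumption about how that could look; NullPolicy, NULL_POLICIES, resolve, and the two feature entries are hypothetical, and the chosen defaults are illustrative rather than recommended.

```python
from dataclasses import dataclass
from typing import Optional

class MissingFeatureError(ValueError):
    """Raised when a feature is null and its policy forbids imputation."""

@dataclass(frozen=True)
class NullPolicy:
    impute: bool
    default: Optional[float] = None

# Hypothetical policies, declared once and shared by training and serving.
NULL_POLICIES = {
    "purchase_count_30d": NullPolicy(impute=True, default=0.0),
    "days_since_last_purchase": NullPolicy(impute=False),
}

def resolve(feature_name: str, raw_value: Optional[float]) -> float:
    """Apply the documented null policy to a raw feature value."""
    if raw_value is not None:
        return raw_value
    policy = NULL_POLICIES[feature_name]
    if not policy.impute:
        raise MissingFeatureError(
            f"{feature_name} is null and must not be imputed"
        )
    return policy.default

print(resolve("purchase_count_30d", None))
```

The useful property is that a missing value either receives the one documented default or fails loudly; it never receives whatever default an individual implementation happened to pick.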

Property 2: Point-in-time correctness and temporal leakage. When a model is trained on historical data, each training example consists of an entity (a user, a payment, a product) at a specific time and a label (did this user churn, was this payment fraudulent, was this product clicked). The feature values joined to each training example must reflect what the feature store would have served had the model been running at the time of that training example. If the training pipeline joins current feature values — the user's profile as it exists today — to historical training examples, it creates temporal leakage: the model sees information that was not available at prediction time. A user's 30-day purchase count today reflects purchases made in the months after the historical training event; a model trained on these leaked features learns weights that depend on future information and performs significantly worse when deployed, because the future information is no longer available.

Point-in-time correct feature lookup requires the offline store to support time-travel queries: "what was the value of this feature for this entity at timestamp T?" This is not the same as querying the current feature value. It requires either storing a snapshot of each feature's value at each computation time (a history table with effective-from and effective-to timestamps per feature value), or storing the raw events from which features are computed and recomputing the feature at the target timestamp on demand. The history table approach is efficient at query time but expensive at storage time: a feature updated every hour for every user in a million-user dataset produces 24 million rows per day. The recomputation approach is expensive at query time (recomputing a 30-day aggregation at an arbitrary historical timestamp requires scanning 30 days of raw events per entity) but has no storage overhead beyond the raw events. Feature stores choose one of three strategies: (1) materialize point-in-time snapshots at the training data granularity (pre-compute feature values at each training label timestamp and store them alongside the raw events, which requires the label set to be known before feature computation begins); (2) maintain a versioned history table for each feature with efficient time-travel queries; (3) use the raw event history with on-demand recomputation at training time, accepting the query cost in exchange for not needing to predict in advance which timestamps will be needed for training. The data warehouse decision record documents the query compute layer where historical feature data lives; the feature store ADR must document whether the offline store supports time-travel queries, the storage model for historical feature values, and the maximum look-back window available for historical training data, because if the feature history table only retains 90 days and a model needs to be retrained on 18 months of data, the time-travel guarantee does not cover the full retraining window.
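The history-table strategy amounts to an as-of lookup: take the latest value whose effective-from timestamp is not after the label timestamp. A minimal sketch using only the standard library; the history values are invented, and treating a value as in effect at its own timestamp (inclusive) is a choice made here that a real store would need to document.

```python
from bisect import bisect_right

def value_as_of(history, ts):
    """Return the feature value in effect at `ts`, or None if none existed yet.

    `history` is a list of (effective_from, value) pairs sorted by
    effective_from. A value takes effect at its own timestamp (inclusive).
    """
    effective_times = [t for t, _ in history]
    idx = bisect_right(effective_times, ts)
    return history[idx - 1][1] if idx else None

# Hypothetical history of one user's 30-day purchase count (epoch seconds).
history = [(1_000, 2), (2_000, 5), (3_000, 9)]

print(value_as_of(history, 2_500))  # value in effect at the label timestamp
print(value_as_of(history, 500))    # before any value was computed
print(history[-1][1])               # the "current" value, which leaks the future

# Storage cost of the history-table approach from the text:
rows_per_day = 1_000_000 * 24  # one row per user per hourly update
print(rows_per_day)
```

Joining the last element of the history to a label at timestamp 2,500 is the temporal leak: the model would be trained on a purchase count that did not exist until timestamp 3,000.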

Temporal leakage is not always from future events. Label leakage occurs when a feature directly encodes the label or is causally downstream of the label: using the user's "account suspended" flag as a feature for a churn prediction model, when account suspension is itself caused by the churn event, is a label leak. Feature leakage occurs when a feature is computed at training time using data that is available to the feature computation pipeline but was not available to the serving path at the time the historical training event occurred (a batch feature that is computed once daily but the training label corresponds to a mid-day event when the batch had not yet run). The feature store ADR must document the temporal validity window for each feature group — the time period during which a computed feature value is valid at serving time — and the procedure for verifying that training examples use features from within their validity window rather than from a computation that had not yet run at the training event time.

Property 3: Feature freshness SLA and the offline-online serving gap. Different features have different freshness requirements. A user's historical purchase category affinity, computed from 12 months of purchase events, is unlikely to change meaningfully in a single day — a daily batch computation is adequate. A user's payment velocity in the last 60 minutes changes with every payment event and must be computed in near-real-time for a fraud model. A merchant's average transaction size over the last 30 days needs daily freshness for a risk scoring model. A session-level feature (number of product pages viewed in the current session) needs freshness measured in seconds for a real-time recommendation model.

A feature store typically separates features into two serving surfaces: an offline store (a data warehouse or object store containing batch-computed feature values, typically updated on a daily or hourly schedule, used for training and for low-freshness serving use cases) and an online store (a low-latency key-value store such as Redis, DynamoDB, or Bigtable, containing the latest feature values, pre-computed and loaded from the offline store or from a streaming pipeline, used for serving with sub-100-millisecond latency requirements). The gap between the offline store and the online store creates a freshness ceiling for the online serving path: if the batch computation for a feature group runs nightly, the online store's values can be up to 24 hours stale. If the model serving latency budget requires the full prediction request (feature lookup plus scoring) to complete in under 20 milliseconds, the online store must answer key lookups in under 15 milliseconds (leaving 5 milliseconds for the model scoring computation), which limits the feature store's choice of online backend to in-memory stores or very-low-latency key-value databases. The caching strategy decision record documents the caching architecture for low-latency data access; the online feature store is architecturally a specialized cache, so the feature store ADR must document the serving latency SLA per feature group at p99, the cache invalidation model (how the online store is updated when the offline batch computation produces new values), and the serving latency degradation budget (what happens to model serving if the online store is unavailable: does serving fall back to a degraded mode with older feature values, use default values, or fail the prediction request).

Real-time features — those computed from a streaming event source rather than a batch computation — bypass the offline store for the online path. A payment velocity feature cannot wait for a nightly batch; it must be computed from the event stream in near-real-time and written to the online store as each event arrives. The message broker decision record documents the event streaming infrastructure; the feature store ADR must document which features are computed from the streaming path versus the batch path, the streaming computation framework (Flink, Kafka Streams, a stateful function in the serving layer), and the freshness SLA that the streaming computation must deliver. For features computed from both paths — a 30-day historical aggregation (batch) plus a real-time session count (streaming) — the feature store must compose values from both sources in a single serving response, and the ADR must document the composition model (are batch and streaming features served from a unified online store that the streaming pipeline writes to, or are they joined in the serving layer from separate stores?).

The freshness SLA also governs the feature computation failure mode: if the nightly batch computation fails and the online store contains 48-hour-old values, is the model degraded gracefully (serving with stale features, logged as a freshness SLA breach) or is it failed explicitly (returning an error to callers until the computation recovers)? The feature store ADR must document the staleness tolerance per feature group — the maximum age of a feature value at serving time before the serving layer must fail the prediction or substitute a default — and the monitoring alerting threshold that detects when the online store's feature values exceed the staleness tolerance.
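The three failure modes (serve stale, substitute a default, fail explicitly) can be written as a per-feature-group policy that the serving layer applies on every read. A sketch under stated assumptions: OnStale, StaleFeatureError, and read_with_tolerance are hypothetical names, and the 24-hour tolerance and feature value are illustrative.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

class OnStale(Enum):
    SERVE_STALE = "serve_stale"
    USE_DEFAULT = "use_default"
    FAIL = "fail"

class StaleFeatureError(RuntimeError):
    """Raised when a feature is too old and the policy is to fail."""

def read_with_tolerance(value, computed_at, now, max_age, on_stale, default=None):
    """Return (value_to_serve, breached) according to the staleness policy."""
    age = now - computed_at
    if age <= max_age:
        return value, False
    if on_stale is OnStale.SERVE_STALE:
        return value, True  # caller should log a freshness SLA breach
    if on_stale is OnStale.USE_DEFAULT:
        return default, True
    raise StaleFeatureError(f"feature is {age} old, limit is {max_age}")

# The nightly batch failed, so the stored value is 48 hours old.
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
computed_at = now - timedelta(hours=48)
tolerance = timedelta(hours=24)

result = read_with_tolerance(
    0.37, computed_at, now, tolerance, OnStale.USE_DEFAULT, default=0.0
)
print(result)
```

Returning the breach flag alongside the value lets the same call feed both the prediction and the monitoring threshold, so a silent fallback cannot happen without leaving a record.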

Property 4: Feature reuse and the cross-team coordination surface. In organizations where multiple ML teams independently build models, each team tends to independently implement the features their model needs. A fraud team computes "user payment volume in the last 30 days" from the payments database. A credit risk team computes "user total spend in the last 30 days" from the same payments database. A recommendation team computes "user purchase frequency in the last 30 days" from the same source. Each team's implementation uses slightly different field names, different null handling strategies, and different edge case behavior — a user with no payments in 30 days is null in one team's implementation, zero in another's, and absent from the feature vector in the third's. Each implementation runs as a separate batch job against the same source data, tripling the compute load, creating three independent maintenance surfaces for the same logical feature, and producing three different numeric values for the same entity at the same time.

The cross-team coordination failure surfaces in two ways. First, when two models that both use a "30-day payment volume" feature produce different risk scores for the same user — one flags a transaction as high risk, another approves it — and an investigator discovers that the two models used different values for the same named feature, trust in both models erodes. The discrepancy is not attributable to model architecture or training data; it is an implementation inconsistency invisible to anyone who looked only at the model artifacts. Second, when a feature is deprecated or its computation is updated — the payments database schema changes, a currency normalization is added, a new data source replaces the old one — each independent implementation must be updated separately, and the updates will be deployed at different times, producing a window during which the same feature is computed differently in different teams' pipelines without any mechanism to detect or alert on the divergence. The background job infrastructure decision record documents the batch job execution framework; the feature store ADR must document the feature registry — the canonical list of feature definitions shared across teams — and the governance model for adding, updating, and deprecating features: who approves a new feature definition, what documentation is required (computation logic, null handling, data source, freshness cadence, owner), how existing consumers are notified when a feature computation is changed, and what the compatibility guarantee is when a computation is updated (are the old values preserved in a versioned history for models trained on the previous computation?).
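A feature registry with the documentation fields listed above can start as a versioned, append-only catalog. The sketch below is one possible shape, not a reference design: FeatureDefinition, FeatureRegistry, the team name, and the two example definitions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    owner: str
    data_source: str
    computation: str
    null_handling: str
    freshness_cadence: str

class FeatureRegistry:
    def __init__(self):
        self._definitions = {}  # name -> {version -> FeatureDefinition}

    def register(self, definition: FeatureDefinition) -> None:
        """Add a definition; existing versions are never changed in place."""
        versions = self._definitions.setdefault(definition.name, {})
        if definition.version in versions:
            raise ValueError(
                f"{definition.name} v{definition.version} already exists; "
                "publish a new version instead of changing it in place"
            )
        versions[definition.version] = definition

    def get(self, name: str, version=None) -> FeatureDefinition:
        """Return a specific version, or the latest when none is given."""
        versions = self._definitions[name]
        return versions[max(versions) if version is None else version]

registry = FeatureRegistry()
v1 = FeatureDefinition(
    "payment_volume_30d", 1, "fraud-team", "payments",
    "sum of amount over 30 days", "zero when no payments", "daily",
)
v2 = FeatureDefinition(
    "payment_volume_30d", 2, "fraud-team", "payments",
    "sum of currency-normalized amount over 30 days",
    "zero when no payments", "daily",
)
registry.register(v1)
registry.register(v2)
print(registry.get("payment_volume_30d").version)
```

Keeping version 1 retrievable after version 2 is published is what lets a model trained on the previous computation keep reading the values it was calibrated to.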

What the founding session records and what it omits

The feature engineering decision is almost always made in the founding sprint, in the same week the first model is trained. The session produces the first feature engineering notebook, the first serving-path lookup against a data store, and the first training pipeline. It records the feature names and the general intent. It does not record the computation contract between training and serving, the historical feature archival strategy, the null handling policy, or the cross-team coordination mechanism.

Four types of AI chat sessions generate these gaps:

The "how do we build features for our ML model?" session. The data scientist needs to engineer features for the first model. They ask what features to compute and how to compute them from the available data. The session explains feature engineering concepts, recommends Pandas for tabular data, produces a notebook with the first set of feature transformations, and describes how to join features to training labels. The session does not ask: how will these feature transformations be executed in the serving path, and what mechanism will verify that the serving implementation produces identical values to this notebook? What is the null handling strategy for each feature, and is that strategy explicitly documented so a backend engineer implementing the same feature in Go knows to use the same strategy? Will this model need to be retrained on historical data, and if so, how will the feature values from this training run be archived so that historical feature values are available for future retraining? The session answers the question asked ("how do we compute features for training") rather than the question that determines operational correctness ("how do we ensure that training and serving compute features identically and that historical feature values are preserved for model retraining"). The resulting notebook is correct for training and implicitly incorrect for serving — the serving implementation that a backend engineer writes independently from the notebook will diverge in ways that neither engineer can detect without an explicit comparison mechanism. The data pipeline decision record documents the data transformation layer; the feature store ADR must document the computation ownership model as part of the founding feature engineering decision, not as a retrofit when the first skew investigation reveals that training and serving have been diverging for months.

The "how do we serve our model's features in production?" session. The ML engineer responsible for deployment asks how to serve the features at inference time. The session explains the options: batch precompute and store in a key-value database, compute on demand from the raw data store, or use a managed feature serving platform. The session produces a design for computing features at serving time: querying Redis for recent velocity counts, looking up user attributes from a PostgreSQL read replica, and joining values from multiple sources in the serving layer. The session does not ask: where is the canonical definition of each feature's computation, and how does the backend engineer verify that this serving implementation matches the data scientist's training-path computation? What happens to this serving implementation when the data scientist updates the feature computation: does the backend engineer receive a notification, and what is the process for updating the serving implementation in sync with the training-path change? What is the serving latency SLA for each feature group, and does the chosen data store architecture satisfy that SLA at p99 at the expected query rate? The session produces a serving implementation that is architecturally reasonable and computation-siloed from the training implementation. The two implementations are never explicitly compared, and the computation contract between them (produce identical values for the same entity at the same time) is not documented anywhere. The database vendor decision record documents the database selection for the application layer; the feature store ADR must document the online feature store selection (the data store used for low-latency feature serving) alongside the computation contract that the serving implementation must satisfy, so that serving path changes can be validated against the training-path definition rather than against an intent inferred from deployment session notes.

The "how do we retrain our model on historical data?" session. The team wants to retrain the model on a larger or more recent dataset. The engineer asks how to run the feature computation over the historical event data. The session explains how to write a batch feature computation that reads from the data warehouse, joins event history to training labels, and computes features for each training example. The session does not ask: are the features being recomputed for training consistent with the features that were computed at serving time in production, and is there a mechanism to verify this? Does the data warehouse retain the complete event history needed to recompute features for the full training window, or has some data been pruned per the retention policy? For features that were computed from ephemeral state (Redis velocity counts), is there an archived record of what values were served in production that can be used for retraining, or must those features be recomputed from raw events — and if so, does the recomputed value match what Redis computed in production? The session produces a historical recomputation pipeline that is semantically equivalent in intent to the production computation but implementationally divergent in the edge cases that matter for model accuracy. The retrained model is calibrated to a feature distribution that has never existed in production. The data retention decision record documents the data lifecycle and retention policies; the feature store ADR must document the historical feature archival strategy: which features are archived at computation time to enable accurate model retraining, the storage format and retention window for archived feature values, and the recomputation fallback for features where archival was not implemented at the time the original model was trained. 
Without explicit archival, every model retraining cycle introduces the risk that the recomputed training features diverge from the original serving-path computation — a risk that is invisible until a retrained model performs worse than the model it was meant to replace.
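The archival-with-fallback strategy can be sketched as a lookup that prefers the value that was actually served and tags anything recomputed. This is a minimal illustration, not a prescribed design: `ArchiveKey`, `TrainingValue`, and `resolve_training_value` are hypothetical names, and the in-memory dict stands in for whatever archive store the team chooses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical key: (entity_id, feature_name, ISO-8601 prediction timestamp).
ArchiveKey = Tuple[str, str, str]


@dataclass(frozen=True)
class TrainingValue:
    value: float
    source: str  # "archived" or "recomputed"


def resolve_training_value(
    key: ArchiveKey,
    archive: Dict[ArchiveKey, float],
    recompute: Optional[Callable[[ArchiveKey], float]] = None,
) -> TrainingValue:
    """Prefer the value that was served in production; otherwise recompute.

    The source tag keeps the known inconsistency visible in the training set
    instead of silently mixing archived and recomputed values.
    """
    if key in archive:
        return TrainingValue(archive[key], "archived")
    if recompute is None:
        raise LookupError(f"no archived value and no recomputation fallback for {key}")
    return TrainingValue(recompute(key), "recomputed")


# Illustrative data only.
archive = {("user-1", "sessions_7d", "2026-01-10T12:00:00Z"): 3.0}
served = resolve_training_value(("user-1", "sessions_7d", "2026-01-10T12:00:00Z"), archive)
fallback = resolve_training_value(
    ("user-2", "sessions_7d", "2026-01-10T12:00:00Z"), archive, recompute=lambda key: 4.0
)
```

Counting the rows tagged "recomputed" per feature gives a rough measure of how much of a retraining set rests on values that production never served.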

The "how do we add features from another team's data?" session. The ML team wants to incorporate features from a different domain — the recommendation team wants payment history features from the fraud team's data pipeline, or the risk team wants engagement signals from the product analytics team's database. The engineer asks how to access the other team's data and compute the needed features. The session explains the data access patterns (database read replicas, shared data warehouse tables, API endpoints, a shared Kafka topic) and produces a feature computation that reads from the other team's data source. The session does not ask: does the other team already compute a feature equivalent to what this team needs, and if so, is there a shared feature definition that both teams can use to guarantee they are computing the same value? What is the ownership model for features computed from cross-team data — who updates the computation when the source data schema changes, and who is responsible for ensuring that the update is synchronized across all teams consuming the feature? What is the agreement between teams about the freshness and availability SLA for the data source being shared? The session produces a new independent computation of a feature that another team may already be computing with different logic, against a data source whose schema may change without advance notice, with no ownership contract for the cross-team maintenance surface. 
The API schema design decision record documents the schema versioning contract for internal APIs; the feature store ADR must document an equivalent contract for shared feature data sources: the notification protocol when a source schema changes, the compatibility guarantee for existing feature consumers, and the process by which a feature that one team computes independently is merged into the shared feature registry when a second team needs the same feature — to prevent two teams from computing divergent values for the same logical feature and discovering the divergence when both teams' models produce inconsistent scores for the same entity.
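One way to picture the shared registry is as a store that refuses a second registration of the same logical feature under different logic. The sketch below is an assumption-laden illustration: `FeatureRegistry`, `FeatureRegistration`, and the SQL strings are invented for the example, and comparing definitions as plain strings is a simplification of whatever equivalence check a real registry would use.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FeatureRegistration:
    name: str
    owner_team: str
    definition: str  # canonical computation, here a SQL expression (illustrative)


class FeatureConflictError(Exception):
    """Raised when a second team registers the same name with different logic."""


class FeatureRegistry:
    def __init__(self) -> None:
        self._features: Dict[str, FeatureRegistration] = {}

    def register(self, reg: FeatureRegistration) -> FeatureRegistration:
        existing = self._features.get(reg.name)
        if existing is None:
            self._features[reg.name] = reg
            return reg
        if existing.definition != reg.definition:
            raise FeatureConflictError(
                f"{reg.name} is owned by {existing.owner_team} with a different definition"
            )
        # Same definition: the second team reuses the owner's registration.
        return existing


registry = FeatureRegistry()
registry.register(FeatureRegistration(
    "purchase_volume_30d", "fraud",
    "SUM(amount) FILTER (WHERE paid_at >= now() - interval '30 days')",
))

try:
    registry.register(FeatureRegistration(
        "purchase_volume_30d", "recommendations",
        "COUNT(*) FILTER (WHERE paid_at >= now() - interval '30 days')",
    ))
    conflict_detected = False
except FeatureConflictError:
    conflict_detected = True
```

The conflict surfaces at registration time, before either team's model depends on its own variant.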

The WhyChose extractor surfaces the founding feature engineering session, the first serving deployment session, the first model retraining session, and the first cross-team feature sharing session from AI chat history. The feature store ADR converts the implicit decisions embedded in those sessions — feature transformation logic, null handling defaults, serving architecture, retraining strategy — into a documented computation contract, a point-in-time correctness guarantee, a freshness SLA per feature group, and a cross-team coordination mechanism, so the next engineer who asks "why is our model worse after retraining?" has the feature archival contract documented, not discovered.

The five sections of a feature store ADR

Section 1: Feature computation architecture and transform ownership. Document the feature computation model: where feature transformations are defined (a feature platform's DSL, versioned Python functions in a shared repository, SQL expressions in a dbt project, a streaming aggregation in a Flink job), how the same transformation definition is executed in both the training path (reading from the offline store, which is typically the data warehouse or object storage) and the serving path (reading from the online store, which is a low-latency key-value database populated from the offline store or a streaming pipeline), and the mechanism that verifies that both paths produce identical outputs for the same entity at the same time. Document the null handling policy for each feature group as part of the transformation definition — not as an implementation detail left to each path's engineer — covering the expected null rate for each feature, the imputation strategy (zero, population mean, a sentinel value indicating missingness), and whether null values are passed to the model or filtered from the prediction request. Document the ownership model: for each feature, which team is the canonical owner, who reviews changes to the transformation definition, what documentation is required for a new feature to be added to the shared registry (computation logic, data sources, null handling, freshness cadence, downstream model consumers), and how downstream consumers are notified when a feature transformation is updated. 
The data pipeline decision record documents the ETL and ELT layer that feeds the offline store; the feature store ADR must document the lineage from source events through the pipeline to the offline store and from the offline store to the online store, because a pipeline failure that stops feature values from being refreshed in the online store is a serving-side freshness SLA failure, not just a data engineering incident, and the on-call escalation path differs accordingly.
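As a rough sketch of "one transformation definition, two execution paths", assume the option of versioned Python functions in a shared repository. Everything named here (`UserEvents`, `compute_features`, the module layout) is hypothetical. The exact affinity formula is also an assumption: the article states only that the natural log is used, so `log1p` of the top-category share is an illustrative stand-in. The recency default of 30 for users with no purchase history follows the article.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional

RECENCY_DEFAULT_DAYS = 30.0  # null policy from the article: 30, not null


@dataclass(frozen=True)
class UserEvents:
    """Raw inputs that both paths must supply (hypothetical shape)."""
    clicks_by_category: Dict[str, int]
    days_since_last_purchase: Optional[float]


def category_affinity(events: UserEvents) -> float:
    """Natural-log-normalized share of clicks in the user's top category.

    ASSUMPTION: log1p(share) is an illustrative formula, not the article's.
    """
    total = sum(events.clicks_by_category.values())
    if total == 0:
        return 0.0  # assumed null policy: no clicks means zero affinity
    top_share = max(events.clicks_by_category.values()) / total
    return math.log1p(top_share)


def recency_score(events: UserEvents) -> float:
    if events.days_since_last_purchase is None:
        return RECENCY_DEFAULT_DAYS
    return float(events.days_since_last_purchase)


def compute_features(events: UserEvents) -> Dict[str, float]:
    """Single entry point imported by both the training job and the serving path."""
    return {
        "category_affinity": category_affinity(events),
        "recency_score": recency_score(events),
    }


new_user = compute_features(UserEvents(clicks_by_category={}, days_since_last_purchase=None))
loyal_user = compute_features(UserEvents({"shoes": 8, "hats": 2}, days_since_last_purchase=3))
```

If the serving path is written in another language, as in the Go example earlier, a function like this cannot be imported directly. In that case the same definition would have to populate the online store ahead of time, or be checked against the serving implementation with shared test vectors.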

Section 2: Point-in-time correctness and temporal leakage prevention. Document whether the offline store supports point-in-time queries — returning the feature value that the online store would have served for a given entity at a given historical timestamp — and the storage model that enables this: versioned history tables with effective-from and effective-to timestamps per feature value, raw event archives with on-demand feature recomputation, or pre-materialized snapshots at the granularity of the training label timestamps. Document the temporal validity window for each feature group — the period during which a computed feature value is valid, reflecting the freshness cadence of the computation — and the procedure for training pipeline engineers to verify that each training example uses a feature value from within the validity window rather than from a computation that had not yet run at the training event time. Document the label leakage review process: for each feature in a model's feature set, the review must confirm that the feature value at training time did not encode information causally downstream of the label, that the feature value was available at the time of the training event (not a future batch that had not yet run), and that the feature source's schema has not changed in a way that makes the historical feature values non-comparable to the current computation. 
Document the historical feature archival requirement per feature group: which features must be archived at computation time for model retraining (all features with ephemeral serving-path computations that cannot be accurately reconstructed from raw events, such as Redis velocity counts computed from Kafka events with limited retention), the archive storage format and retention window, and the fallback procedure when a feature lacks archival and a model must be retrained on historical data — either accept the recomputation divergence with explicit documentation of the known inconsistency, or defer retraining until sufficient archived data has accumulated. The data warehouse decision record documents the query compute layer; the feature store ADR must document the time-travel query pattern for the offline store — whether the data warehouse supports efficient historical point-in-time lookups at the entity and timestamp granularity needed for training data assembly, or whether a separate feature history store is required.
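The point-in-time lookup described in this section is a backward as-of join: for each training label, take the latest feature value computed at or before the label's event time, and reject it if it is older than the validity window. The sketch below uses only the standard library; `value_as_of` and the sample history are hypothetical. pandas offers the same semantics through `merge_asof` with `direction="backward"` and a `tolerance`.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Per-entity history of (computed_at, value), sorted by computed_at.
FeatureHistory = Dict[str, List[Tuple[datetime, float]]]


def value_as_of(
    history: FeatureHistory,
    entity_id: str,
    event_time: datetime,
    validity: timedelta,
) -> Optional[float]:
    """Return the value the online store would have held at event_time."""
    rows = history.get(entity_id, [])
    computed_times = [computed_at for computed_at, _ in rows]
    # Index of the last row with computed_at <= event_time; never a future row.
    idx = bisect_right(computed_times, event_time) - 1
    if idx < 0:
        return None  # nothing had been computed yet at event_time
    computed_at, value = rows[idx]
    if event_time - computed_at > validity:
        return None  # value existed but was outside the validity window
    return value


# Illustrative history for a hypothetical sessions_7d feature.
sessions_7d: FeatureHistory = {
    "user-1": [(datetime(2026, 1, 10, 0, 0), 3.0), (datetime(2026, 1, 12, 10, 0), 9.0)],
    "user-2": [(datetime(2026, 1, 5, 0, 0), 4.0)],
}
window = timedelta(days=3)

# The 9.0 computed at 10:00 is in the future of a 09:00 label, so it is ignored.
no_leak = value_as_of(sessions_7d, "user-1", datetime(2026, 1, 12, 9, 0), window)
# The only value for user-2 is more than three days old at label time.
too_old = value_as_of(sessions_7d, "user-2", datetime(2026, 1, 11, 8, 0), window)
```

Returning `None` rather than the nearest value forces the training pipeline to apply the documented null policy instead of quietly borrowing from the future.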

Section 3: Feature freshness SLA and the offline-online serving split. Document the freshness SLA for each feature group: the maximum acceptable staleness of a feature value at serving time, expressed as a specific duration (feature X must be no more than 1 hour stale when served to the model; feature Y must be no more than 15 minutes stale; feature Z, a real-time session count, must be no more than 30 seconds stale). Document the computation cadence for each feature group that satisfies its freshness SLA: daily batch for features with 24-hour freshness tolerance, hourly batch for features with 1-hour tolerance, streaming computation for features with sub-minute freshness requirements. Document the offline-online split: which features are served from the offline store (acceptable for training data assembly and low-freshness serving use cases), which features require an online store for low-latency serving, and how online store values are populated (a synchronization job that copies offline store values to the online store on each batch completion, or a streaming pipeline that writes to the online store directly from the event source). Document the serving latency budget per feature group at p99: the maximum time a serving layer is allowed to spend looking up features for a single prediction request, and how that budget is allocated across feature groups when a prediction requires features from multiple sources with different serving latency characteristics. 
The caching strategy decision record documents the caching layer for low-latency reads; the online feature store is architecturally equivalent to a specialized cache — the feature store ADR must document the eviction policy (LRU, TTL-based, never-expire with explicit update-on-compute), the cache warming strategy when the online store is rebuilt after a failure or a schema migration, and the degraded serving behavior when the online store is unavailable or returns a stale value that exceeds the staleness tolerance. The real-time architecture decision record documents the low-latency serving infrastructure; the feature store ADR must cross-reference it for features computed from streaming sources, documenting the streaming computation framework, the watermark policy for time-windowed aggregations, and the failure mode when the streaming computation falls behind the event source.
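A freshness SLA becomes enforceable once the serving layer compares each value's age against its documented limit and applies a defined degraded behavior. The sketch below assumes the three example limits from this section (1 hour, 15 minutes, 30 seconds). The `serve_or_degrade` function and its fallback-value policy are hypothetical, and substituting a fallback is only one of several possible degraded behaviors.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

# Maximum staleness per feature, mirroring the examples in this section.
MAX_STALENESS: Dict[str, timedelta] = {
    "feature_x": timedelta(hours=1),
    "feature_y": timedelta(minutes=15),
    "feature_z": timedelta(seconds=30),
}


def serve_or_degrade(
    feature: str,
    value: float,
    computed_at: datetime,
    now: datetime,
    fallback: float,
) -> Tuple[float, timedelta, bool]:
    """Return (value to serve, age of the stored value, whether the SLA held)."""
    age = now - computed_at
    within_sla = age <= MAX_STALENESS[feature]
    return (value if within_sla else fallback, age, within_sla)


now = datetime(2026, 7, 2, 12, 0, tzinfo=timezone.utc)

# 12 seconds old against a 30-second limit: served as stored.
fresh = serve_or_degrade("feature_z", 7.0, now - timedelta(seconds=12), now, fallback=0.0)
# 20 minutes old against a 15-minute limit: the fallback is served instead.
stale = serve_or_degrade("feature_y", 0.8, now - timedelta(minutes=20), now, fallback=0.5)
```

Returning the age alongside the value lets the caller log it as the staleness indicator for each prediction.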

Section 4: Feature versioning and backward compatibility for model handoffs. Document the feature versioning model: how a feature definition is versioned when its computation is changed, whether both the old and new versions are retained in the offline and online stores simultaneously (enabling a model trained on version N to continue serving while a model trained on version N+1 is being validated in shadow mode), and the compatibility classification for feature changes (backward-compatible changes: adding a new feature group, adding a new feature to an existing group with a default value for historical rows; non-backward-compatible changes: renaming a feature, changing a feature's computation logic, changing a feature's null handling, removing a feature). Document the model handoff procedure when a new model version trained on a different feature version is promoted to production: the verification step that confirms the new model version is consuming features from the correct feature version, the traffic ramp protocol (shadow serving, canary deployment, full rollout), the rollback criterion (the new model's online metrics have degraded below the rollback threshold within the monitoring window, or the feature skew detection alerts for the new version's feature set), and the cleanup procedure for retiring the old feature version after all models trained on it have been replaced. The API schema design decision record documents the external API schema versioning contract; the feature store ADR must document an equivalent versioning contract for features, because a feature change that breaks a model in production has the same impact as an API breaking change that breaks a downstream consumer, and must be managed with the same level of deliberateness. 
Document the feature deprecation timeline: when a feature is scheduled for deprecation, how far in advance consumers are notified, what the migration path is for models that consume the deprecated feature (retrain on the replacement feature, use a compatibility shim that maps the deprecated computation to the new computation for a defined period), and when the deprecated feature version is removed from the offline and online stores.
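The compatibility classification above can be checked mechanically by diffing two versions of a feature group. This is a simplified sketch: `FeatureDef`, `breaking_changes`, and the hash strings are hypothetical, and a rename is indistinguishable from a removal plus an addition. Treating a new feature without a historical default as breaking is an assumption that goes slightly beyond what the section states.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FeatureDef:
    logic_hash: str                       # hash of the transformation source
    null_policy: str                      # e.g. "zero", "population_mean", "default_30"
    historical_default: Optional[float] = None


def breaking_changes(old: Dict[str, FeatureDef], new: Dict[str, FeatureDef]) -> List[str]:
    """List the non-backward-compatible differences between two group versions."""
    problems: List[str] = []
    for name, old_def in old.items():
        new_def = new.get(name)
        if new_def is None:
            problems.append(f"{name}: removed or renamed")
        elif new_def.logic_hash != old_def.logic_hash:
            problems.append(f"{name}: computation logic changed")
        elif new_def.null_policy != old_def.null_policy:
            problems.append(f"{name}: null handling changed")
    for name in sorted(set(new) - set(old)):
        # ASSUMPTION: an added feature is compatible only if it ships a default.
        if new[name].historical_default is None:
            problems.append(f"{name}: added without a default for historical rows")
    return problems


v1 = {
    "sessions_7d": FeatureDef("a1", "zero"),
    "recency_score": FeatureDef("b1", "default_30"),
}
v2 = {
    "sessions_7d": FeatureDef("a1", "zero"),
    "recency_score": FeatureDef("b1", "population_mean"),
    "avg_session_len_7d": FeatureDef("c1", "zero", historical_default=0.0),
}
report = breaking_changes(v1, v2)
```

An empty report means the change can be released under the same feature version. Any entry means a new version is needed and the model handoff procedure applies.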

Section 5: Feature monitoring — distribution drift, skew detection, and serving latency SLA. Document the feature monitoring strategy as three distinct monitoring surfaces. The first is training-serving skew detection: for each feature that is computed independently in the training and serving paths (rather than sharing a single computation definition), a continuous comparison must run that logs feature values computed at serving time and compares them against what the offline store would have computed for the same entity at the same timestamp, alerting when the difference exceeds the configured threshold for that feature. The skew detection alert is the primary signal that the training-path and serving-path implementations have diverged; it must fire before the divergence produces a measurable degradation in the model's online metrics, because by the time online metrics degrade the model has been miscalibrated for long enough that a significant fraction of predictions have been incorrect. The observability strategy decision record documents the metrics infrastructure; feature skew metrics require a different collection mechanism from application metrics — they require storing per-entity feature values at serving time, which is typically done via a separate logging sink rather than through the standard metrics pipeline, because the cardinality (one record per prediction per entity) exceeds what standard metrics aggregation can accommodate. The second monitoring surface is feature distribution drift: tracking the statistical distribution of each feature's values over time in the online store and alerting when the distribution shifts significantly from the training distribution (mean, standard deviation, null rate, value range). 
Feature distribution drift is distinct from training-serving skew — drift means the real-world data distribution has changed (user behavior has shifted, a data source has changed its encoding, a pipeline bug has introduced systematic errors in feature computation), while skew means the training and serving paths compute differently. Both produce model degradation; only drift-monitoring can detect distribution shift from external causes. The logging strategy decision record documents the structured logging contract; feature value logging at serving time must produce structured records per prediction with entity ID, feature group version, feature name, feature value, timestamp, and a staleness indicator (the age of the feature value in the online store at the time of the prediction request), so that skew detection and distribution drift analysis can be run as SQL queries against the log data. The third monitoring surface is serving latency SLA compliance: for each feature group, the p99 latency of online store lookups must be below the serving latency budget documented in section 3. A feature store whose online store p99 latency exceeds the budget degrades the model serving latency SLA for every prediction, and the root cause must be traced to a specific feature group lookup rather than attributed generically to "model serving latency." The API rate limiting decision record documents the rate limiting architecture for serving endpoints; the feature store serving latency SLA must be factored into the total serving budget alongside the model scoring latency, the serving API overhead, and the network round-trip to the caller, so that the system-level SLA is achievable given the sum of its components.
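The skew comparison can be sketched as a pass over the serving-time log records described above, checking each logged value against what the offline store returns for the same entity and timestamp. The record fields follow the list in this section; `ServedFeatureLog`, `skew_rates`, the tolerances, and the sample values are all hypothetical. A production version would more likely run as a SQL query over the log sink.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class ServedFeatureLog:
    entity_id: str
    feature_group_version: str
    feature_name: str
    value: float
    timestamp: str            # prediction time, ISO-8601
    staleness_seconds: float  # age of the online value at prediction time


def skew_rates(
    logs: Iterable[ServedFeatureLog],
    offline_value: Callable[[str, str, str], float],
    tolerance: Dict[str, float],
) -> Dict[str, float]:
    """Per feature, the fraction of served values that differ from the offline value."""
    checked: Dict[str, int] = defaultdict(int)
    mismatched: Dict[str, int] = defaultdict(int)
    for rec in logs:
        expected = offline_value(rec.entity_id, rec.feature_name, rec.timestamp)
        checked[rec.feature_name] += 1
        if abs(rec.value - expected) > tolerance[rec.feature_name]:
            mismatched[rec.feature_name] += 1
    return {name: mismatched[name] / checked[name] for name in checked}


# Illustrative offline values keyed by (entity, feature, timestamp).
offline = {
    ("u1", "category_affinity", "2026-07-02T12:00:00Z"): 0.405,
    ("u2", "category_affinity", "2026-07-02T12:00:05Z"): 0.405,
    ("u1", "sessions_7d", "2026-07-02T12:00:00Z"): 3.0,
}
logs = [
    ServedFeatureLog("u1", "v1", "category_affinity", 0.405, "2026-07-02T12:00:00Z", 4.0),
    ServedFeatureLog("u2", "v1", "category_affinity", 0.176, "2026-07-02T12:00:05Z", 2.0),
    ServedFeatureLog("u1", "v1", "sessions_7d", 3.0, "2026-07-02T12:00:00Z", 4.0),
]
rates = skew_rates(
    logs,
    offline_value=lambda entity, feature, ts: offline[(entity, feature, ts)],
    tolerance={"category_affinity": 1e-6, "sessions_7d": 0.0},
)
```

An alert would fire when a feature's rate exceeds its configured threshold. Because the comparison works on feature values, it does not have to wait for online model metrics to degrade.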

None of these five sections appear in the founding sprint session that wrote the first feature engineering notebook. The session records that user session count, average session length, category affinity score, and recency are the four features, that Pandas is used for the computation, and that the model is a gradient-boosted tree. It does not document that the category affinity score uses natural log normalization (so that the backend engineer reimplementing it knows not to use log base 10), that new users with no purchase history should receive a recency score of 30 (not null), that historical feature values must be archived for model retraining (so that the fraud team's Redis velocity counts are not lost when the model is updated), or that cross-team features must be registered before independent implementations proliferate (so that the recommendation team and the credit team do not compute divergent "30-day purchase volume" features from the same source database). The feature store ADR is the document that converts the implicit architectural choice — notebook computation, Redis lookup, independent team implementations — into the operational parameters that determine whether the data science team discovers training-serving skew by logging feature values at serving time and alerting on divergence, or by investigating an unexplained click-through rate gap for three weeks while the model continues to serve miscalibrated predictions to every user who touches the recommendation system. 
The WhyChose extractor recovers the founding feature engineering session, the first model deployment session, the first retraining session, and the first cross-team feature sharing session from AI chat history; the feature store ADR extracts the durable architectural constraints from those sessions and documents them where engineers encounter them: next to the feature transformation definitions and the online store configuration, not in a Slack thread from the founding month and a Jupyter notebook that nobody has opened since the model was deployed.

FAQs

When does a team need a feature store and when can they use ad-hoc feature engineering?

Ad-hoc feature engineering — computing features in a training notebook and reimplementing them for serving — is acceptable only when all of the following hold: a single engineer owns both training and serving (and can verify both compute identically), the feature set is small and computation is simple (under 10 features, no log normalization, no windowed aggregations, straightforward null handling), the model is trained once and not retrained on historical data, and no other team uses the same features.

When any condition is violated, training-serving skew tends to accumulate as feature count and engineering team size grow. The observable signals that a team has outgrown ad-hoc engineering: offline evaluation metrics consistently exceed online metrics by more than a few percent; model retraining produces degraded production performance even as training metrics improve; a data scientist and a backend engineer cannot independently produce the same feature value for the same entity at the same timestamp; two models use a feature with the same name but different computation logic. A feature store addresses all four failure modes by defining each transformation once, executing it from both paths, supporting point-in-time historical queries, and providing a shared registry that surfaces computation conflicts before two implementations diverge.

What is training-serving skew and why is it difficult to detect?

Training-serving skew is a divergence between the feature values a model was trained on and the feature values it receives at inference time in production. The model's learned weights are calibrated to the training feature distributions. When serving computes differently (a different normalization base, a different null imputation, a different timezone interpretation), the model receives a different distribution than it was calibrated for. Predictive performance degrades in a way that is disconnected from any observable infrastructure property: the model is not broken, features are not missing, the serving API is healthy.

Detection is hard for three reasons. First, its symptoms are indistinguishable from those of distribution shift: both produce offline-exceeds-online metrics, but the remediations are different (fix the computation divergence vs retrain on recent data). Retraining a model while a serving skew exists can worsen the gap. Second, the divergence is in subtle implementation details: natural log vs log base 10, zero vs population-mean null imputation, UTC vs local timezone. All are numerically plausible, and none raises an error. Third, skew across multiple features compounds: three features with small independent errors can produce a larger joint degradation than one feature with a large error. Detecting skew requires logging feature values at serving time and comparing them against what the offline store would compute for the same entity at the same timestamp, continuously, not just at deployment.
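The natural log vs log base 10 case shows why such a divergence raises no error. The sketch below assumes an illustrative transform, the log of one plus the click share, because the exact formula is not given here; `compare_bases` is a hypothetical helper. Under that assumption both variants stay between 0 and 1 and both increase with the share, yet they differ by a constant factor of 1/ln(10), roughly 0.434.

```python
import math
from typing import List, Tuple


def compare_bases(shares: List[float]) -> List[Tuple[float, float, float]]:
    """For each click share, return (share, natural-log value, base-10 value)."""
    rows = []
    for share in shares:
        natural = math.log1p(share)       # training path (assumed formula)
        base10 = math.log10(1.0 + share)  # divergent serving path
        rows.append((share, natural, base10))
    return rows


rows = compare_bases([0.1, 0.5, 0.9])
# Every base-10 value is the natural-log value scaled by 1 / ln(10).
ratios = [base10 / natural for _, natural, base10 in rows]
```

A range check or a monotonicity check passes for both columns. Only a value-by-value comparison against the training-path output exposes the difference.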

What should a feature store ADR document that a general ML pipeline decision does not?

A general ML pipeline decision records which training framework is used, how models are versioned, and where predictions are served. A feature store ADR must document the structural properties the feature computation architecture establishes: (1) the computation ownership model — who defines each transformation, in what form, and how training and serving execute the same definition rather than trusting two independent implementations to agree; (2) point-in-time correctness — whether the offline store supports time-travel queries, how temporal leakage is prevented when joining features to historical training labels, and the archival requirement for ephemeral features like Redis velocity counts; (3) the feature freshness SLA per feature group — the maximum acceptable staleness, the computation cadence, and the serving latency budget at p99; (4) versioning and backward compatibility — how feature definition changes are versioned, the compatibility classification for each change type, and the model handoff procedure when a new model trained on a new feature version is promoted; (5) monitoring — continuous skew detection between training and serving paths, distribution drift alerting, and serving latency SLA compliance per feature group.

None of these appear in the session that wrote the first training notebook. All of them determine whether the team discovers skew by comparing feature logs in an automated daily report or by investigating an unexplained CTR gap three months into production.