2026-07-02 · ~22 min read

The data pipeline decision record: why the batch vs stream processing model you chose determines your data freshness ceiling and your late-arriving event handling complexity

Batch and streaming processing decisions are made in the founding sprint by reaching for a tool the team already knows — Airflow because someone used it at their last job, Kafka Streams because a blog post called it the modern approach, dbt because a conference talk made it sound like the obvious choice. The freshness requirement for the data products that will read the pipeline output is not documented. The arrival latency of the event sources feeding the pipeline is not measured. The reprocessing window needed for backfills and incident recovery is not estimated. The processing model you choose sets a freshness ceiling that every downstream dashboard, ML model, and API endpoint runs against — and in streaming systems, sets a watermark policy that silently drops every event that arrives later than the allowance, without an error, without an alert, and without a count of what was missed.

A 16-person analytics SaaS team building a B2B customer success platform made their pipeline decision in a single afternoon in the founding month. They had product events flowing from their web application into a PostgreSQL database, and they needed to compute engagement metrics for their customer health dashboards. The senior data engineer had used Apache Airflow at his previous company. He set up Airflow on a small EC2 instance, wrote a set of dbt models to transform the raw event tables into engagement scores, and scheduled the DAG to run nightly at 2am. The pipeline processed the previous day's events, updated the summary tables in Redshift, and the dashboards showed data that was current as of midnight. The team's product manager signed off on the first demo, the health scores looked correct, and the nightly batch cadence was never written down as a design constraint — it was simply the operational cadence the pipeline happened to use.

The problem arrived four months after launch. The company's CS team was growing and wanted a real-time customer health view to use during live customer calls. They needed to see events from the last four hours — recent support tickets opened, login failures in the current session, feature adoption events from the current day. The nightly batch updated the health scores once a day. A customer could open three support tickets at 9am and the CS rep reviewing the account at 2pm would see yesterday's score, with zero indication that the morning had been a disaster for that customer. The product manager requested a freshness of "no more than four hours stale" for the real-time view.

The data engineering team investigated what it would take to make the nightly pipeline produce four-hour freshness. Running the DAG every four hours was the obvious first attempt. It failed immediately: the dbt models were written as full-refresh transforms that rebuilt the entire engagement score table from scratch on each run. A full refresh on six months of event history took 47 minutes against Redshift. Six runs a day would spend roughly 4.7 hours of warehouse time rebuilding the same history, each run's 47 minutes would count against the four-hour staleness budget, and the run time would keep growing with the event history. The engineers rewrote the three core dbt models as incremental transforms that process only events newer than the last run's high-water mark. The incremental rewrite required partitioning the six months of existing event history, verifying that the incremental logic produced identical results to the full-refresh logic for all historical date ranges, and adding a recovery mechanism for runs that failed mid-execution and left the high-water mark in an inconsistent state. The incremental migration was a two-month project that produced no new customer-visible features.
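The high-water-mark pattern is easier to reason about in code. The following is a minimal sketch in plain Python, not the team's dbt models: `run_incremental`, the in-memory `state` dict, and the event tuples are hypothetical names chosen for illustration. It assumes event timestamps are strictly increasing across runs, which real sources rarely guarantee.

```python
from collections import defaultdict

def run_incremental(events, totals, state):
    """Process only events newer than the stored high-water mark.

    events: iterable of (event_id, timestamp, account, points)
    totals: dict account -> engagement points (stands in for the summary table)
    state:  dict holding the 'high_water_mark' of the last successful run
    """
    mark = state.get("high_water_mark", 0)
    new_events = [e for e in events if e[1] > mark]
    if not new_events:
        return 0

    # Aggregate in memory first; nothing is written if this step fails.
    staged = defaultdict(int)
    for _, _, account, points in new_events:
        staged[account] += points

    for account, points in staged.items():
        totals[account] = totals.get(account, 0) + points

    # Advance the mark last, only after the totals have been updated.
    state["high_water_mark"] = max(e[1] for e in new_events)
    return len(new_events)
```

Running the function twice over the same input changes nothing the second time, which is the property the team had to verify against the full-refresh output. In a real warehouse the totals update and the mark update would need to share a transaction; this sketch does not model that.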

At the end of the project the pipeline ran every four hours and the CS dashboard showed data that was at most four hours stale. A month later the product team asked whether the pipeline could be made to run every 15 minutes, to support a real-time alert when a customer's health score dropped below a threshold. The data engineer's answer was that 15-minute batch runs would require moving to a streaming architecture — batch overhead at that interval would make the pipeline spend more time on startup and teardown than on processing. The team had built a customer success platform whose core value proposition was real-time customer intelligence, and they had chosen a nightly batch architecture without documenting the freshness requirements for the data products it would serve. The 15-minute alert feature was deferred to the following quarter while a streaming migration was evaluated. The architectural ceiling had been set in an afternoon four months earlier, in a session that produced an Airflow DAG but not a freshness SLA.

A 22-person fintech startup building a buy-now-pay-later platform chose Apache Flink for real-time fraud scoring after their CTO read an engineering blog post from a payments company describing how they used Flink to score transactions in under 200 milliseconds. The CTO concluded that Flink was the right tool for any payment pipeline requiring low latency. The team set up a Flink cluster on Kubernetes, configured a Kafka topic as the event source for payment initiation events, and wrote a Flink job that computed a fraud risk score for each payment event using a set of velocity rules — number of payments in the last 5 minutes from the same device, number of payments to the same merchant in the last hour, cumulative payment amount in the last 24 hours.

The velocity rules required aggregating events over time windows. The team implemented 5-minute, 1-hour, and 24-hour tumbling windows using Flink's event-time windowing with a watermark allowance copied from a Flink tutorial: 10 seconds. The tutorial example used server-generated events with sub-second delivery latency; the watermark choice was correct for that example. The fintech's payment events came from three sources: a web checkout flow with sub-second event delivery, a mobile SDK that batched events locally and flushed on network reconnect, and a partner integration that sent events via a webhook with a retry queue that could hold events for up to 90 minutes on partner-side failures. The team was aware of the mobile SDK's batching behavior but had not measured how late its events arrived at the Kafka topic. They were not aware that the partner integration's retry queue created a 90-minute event delivery gap.

The fraud scoring pipeline ran for two months before a data analyst noticed that the fraud model's transaction coverage — the percentage of payment events that received a fraud score — was 71%, not the expected 100%. The analyst expected that every payment initiation event would produce a fraud score; the coverage gap implied that 29% of payment events were being scored without their full velocity context, or not scored at all. Investigation revealed that the Flink job was emitting fraud scores for incomplete windows: a 5-minute window that should have contained 14 events was closing with 9 events because 5 events arrived after the 10-second watermark had passed and were directed to a configured side output that the team had not wired to any downstream sink — the side output events were being silently discarded.

The team measured the actual event arrival latency for each source. The web checkout events arrived with a median latency of 340 milliseconds and a P99 of 2.1 seconds. The mobile SDK events arrived with a median latency of 18 seconds and a P99 of 4.2 hours — mobile users who completed a checkout while offline and reconnected hours later flushed their buffered events with their original event timestamps, arriving in the Kafka topic 4.2 hours after the transaction occurred. The partner integration events arrived with a median latency of 800 milliseconds and a P99 of 94 minutes during a simulated partner-side failure. A 10-second watermark was correct for web checkout events and catastrophically wrong for mobile and partner events. Every mobile payment completed offline and every payment on the partner integration during a retry window had been scored with incomplete velocity context for two months. The fraud model had been making risk decisions on incomplete data without any alert, any error count, or any coverage metric in the pipeline's monitoring dashboard.

Extending the watermark to accommodate mobile events at P99 would have required a 4.2-hour watermark — meaning every velocity window result would be delayed by 4.2 hours, eliminating the real-time fraud scoring use case entirely for the window-based rules. The team's resolution was to split the pipeline by source: web checkout events retained the 10-second watermark and drove the real-time scoring path; mobile and partner events were processed through a separate batch pipeline that computed velocity features with a 6-hour look-back window and updated a feature store that the fraud scoring model read at inference time. The source-specific processing split was not in the original architecture; it was the outcome of two months of silent data loss and a two-week investigation to diagnose it. The watermark had been chosen to satisfy the tutorial example, not to satisfy the measured arrival latency of the actual event sources.

Structural properties set by the pipeline processing model decision

Four structural properties are determined when a data pipeline's processing model is chosen. None appear explicitly in the founding session that set up the first DAG or Flink job — they are the operational consequences of a design choice made in response to team familiarity, a blog post recommendation, or an assumed equivalence between the team's freshness needs and the freshness delivered by a particular tool.

Property 1: Data freshness ceiling and the reprocessing contract. The processing model sets the best achievable data freshness for every downstream data product the pipeline serves. A nightly batch pipeline that runs at 2am over the previous calendar day produces data that is already about 2 hours stale when the run starts, and about 26 hours stale just before the next run finishes: an event recorded shortly after midnight is not visible until the following night's run. A pipeline running every 15 minutes produces data that is at most 15 minutes stale, plus the run duration. A streaming pipeline with near-continuous processing produces data that is seconds to low-single-digit minutes stale, bounded by processing latency and consumer commit intervals. The freshness ceiling is not a configuration option within a given processing model; it is set by the model itself. Moving from nightly batch to 4-hour batch requires rewriting the transform logic to be incremental. Moving from 4-hour batch to streaming requires replacing the batch scheduler with a stream processing runtime and rewriting every transform as a stateful stream operation. Both transitions are multi-week projects that produce no new features.

The processing model also sets the reprocessing contract: the mechanism for re-running the pipeline over a historical date range when a bug is discovered, a schema migration corrupts downstream data, or an upstream source is backfilled with corrected events. Batch pipelines are naturally suited to reprocessing: an Airflow DAG can be triggered for a specific date range (backfill), dbt can be run with a date parameter, and each run processes a well-defined partition of data. If the pipeline is idempotent — running it twice over the same date range produces identical output — reprocessing is safe and complete. The task scheduling decision record documents the scheduling mechanism for batch jobs; the data pipeline ADR must document the reprocessing trigger (which operator can initiate a backfill, what parameters are required, and what the expected run duration is for a 30-day backfill), the idempotency contract (whether each pipeline step overwrites or appends output, and what constitutes a safe re-run), and the maximum historical window available for reprocessing given the source data retention policy. If the source database only retains 90 days of event history and a data corruption is discovered after 120 days, the pipeline cannot be reprocessed to correct the earlier data — the source is gone. The data retention decision record documents the data lifecycle and compliance requirements; the pipeline ADR must cross-reference it to establish whether the source retention period is longer than the maximum reprocessing window needed for incident recovery — if not, the pipeline must maintain its own event archive to enable reprocessing beyond the source retention window.
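The idempotency contract and the retention check can be made concrete. This is a minimal sketch under stated assumptions: `transform`, `backfill`, and `reprocessable` are hypothetical names, the source and output are plain dicts keyed by date, and the transform is a stand-in that counts events per account.

```python
from datetime import date, timedelta

def transform(events_for_day):
    """Hypothetical transform: count events per account for one day."""
    counts = {}
    for account in events_for_day:
        counts[account] = counts.get(account, 0) + 1
    return counts

def backfill(source, output, start, end):
    """Re-run the transform for each day in [start, end], overwriting output."""
    day = start
    while day <= end:
        # Overwrite the whole partition: a second run replaces, never appends.
        output[day] = transform(source.get(day, []))
        day += timedelta(days=1)

def reprocessable(corruption_age_days, source_retention_days):
    """True if the source still holds the data needed to correct the output."""
    return corruption_age_days <= source_retention_days
```

Because each run replaces a whole date partition, triggering the same backfill twice yields the same output. With the figures from the paragraph above, `reprocessable(120, 90)` is `False`: the corruption is older than what the source retains.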

Streaming pipelines require a different reprocessing model. A Flink or Kafka Streams job processes events from a Kafka topic by maintaining a consumer group offset, a pointer to the last event processed. To reprocess historical data, the offset is reset to an earlier point (a specific timestamp or the beginning of the topic) and the job replays events from that position. This requires that the Kafka topic retains events for the full reprocessing window needed: if Kafka is configured with a 7-day retention and a data bug is discovered on day 10, events before day 3 are gone and cannot be replayed. It also requires that the streaming job's output is idempotent, meaning that processing the same event twice leaves the downstream store in the same state as processing it once, because a reprocessing run will re-process events that were already processed correctly. Without idempotency, reprocessing produces duplicate records in the downstream data store. The message broker decision record documents the Kafka topic retention policy; the pipeline ADR must document the reprocessing replay window and verify that the topic retention policy covers it, with a plan for what happens if the retention window is shorter than a bug's detection latency.
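Both requirements, retention coverage and idempotent output, fit in a few lines. This is a minimal sketch with hypothetical helper names and in-memory stand-ins for the downstream store; it uses no Kafka client API.

```python
def earliest_replayable_day(today, retention_days):
    """Oldest day whose events are still retained in the topic."""
    return today - retention_days

def write_upsert(sink, events):
    """Idempotent sink: keyed by event_id, so a replay overwrites itself."""
    for event_id, value in events:
        sink[event_id] = value

def write_append(sink, events):
    """Non-idempotent sink: every replay adds the same rows again."""
    for event_id, value in events:
        sink.append((event_id, value))
```

`earliest_replayable_day(10, 7)` returns `3`, matching the retention example above. Replaying the same events through `write_upsert` leaves the store unchanged, while `write_append` doubles the row count on the second pass.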

Property 2: Late-arriving event handling complexity and the watermark policy. In a streaming pipeline, every time-windowed aggregation requires a policy for what to do when an event arrives after the time window it belongs to has already been emitted. This policy is expressed through the watermark: the pipeline's running estimate of how far event time has progressed, typically the latest event timestamp seen minus a configured lateness allowance. A 10-minute allowance means a window is emitted only once the pipeline has seen an event timestamped at least 10 minutes past the window's end. Any event for that window that arrives afterwards is either dropped, directed to a side output, or triggers a correction to the already-emitted result.
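The mechanics are worth seeing in isolation. The following is a minimal simulation in plain Python, not Flink code: `run_windows` is a hypothetical function that assigns events to tumbling event-time windows, advances a watermark as the largest event time seen minus the allowance, and collects events whose window has already fired. Windows still open when the input ends are not emitted.

```python
def run_windows(events, window_size, allowance):
    """Simulate tumbling event-time windows with a bounded-lateness watermark.

    events: list of (event_time, payload) in ARRIVAL order.
    Returns (emitted, late): emitted maps window_start -> payloads of each
    fired window; late holds events that arrived after their window fired.
    """
    open_windows, emitted, late = {}, {}, []
    watermark = float("-inf")
    for event_time, payload in events:
        start = event_time - (event_time % window_size)
        if start + window_size <= watermark:
            late.append((event_time, payload))  # window already fired
        else:
            open_windows.setdefault(start, []).append(payload)
        # Watermark = largest event time seen so far minus the allowance.
        watermark = max(watermark, event_time - allowance)
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                emitted[s] = open_windows.pop(s)
    return emitted, late
```

With 300-second windows and a 10-second allowance, an event timestamped 250 that arrives after an event timestamped 320 lands in `late`, and its window is emitted without it. Raising the allowance to 100 seconds keeps the window open long enough to include it, at the cost of emitting the result later.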

The watermark value must be derived from the measured P99 arrival latency of each event source, not chosen from a tutorial default. Different event sources in the same pipeline can have radically different arrival latency distributions: server-generated events from a web application may arrive with P99 latency under 2 seconds; mobile SDK events that buffer locally and flush on network reconnect may arrive with P99 latency of 4 hours; partner webhook integrations with retry queues may have P99 latency of 90 minutes during partner-side failures. A single watermark value applied across all sources is correct for the most predictable source and wrong for the others. The resolution is source-specific processing: separate Kafka topics per source, each with a watermark calibrated to that source's measured arrival latency, merged downstream after the time-windowed aggregation is complete. The queue and messaging decision record documents the Kafka topic structure and delivery guarantees; the pipeline ADR must document the watermark value per source, the measurement basis (P99 of measured arrival latency on production traffic, not the P50 or an estimate), and the re-measurement cadence — mobile app behavior changes with new SDK versions and connectivity patterns, so the watermark should be re-validated when the mobile SDK changes significantly.
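Deriving the allowance from measurements is a small calculation once arrival latency (arrival time minus event time) is recorded per source. This is a minimal sketch using the nearest-rank percentile; `percentile` and `watermark_per_source` are hypothetical names, and any sample values passed in are the caller's own measurements.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def watermark_per_source(latencies_by_source, pct=99):
    """Map each source to the latency percentile its watermark must cover."""
    return {
        source: percentile(latencies, pct)
        for source, latencies in latencies_by_source.items()
    }
```

With only a handful of samples the nearest-rank P99 is simply the maximum, so the measurement needs enough production traffic per source for the percentile to be meaningful.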

Late events that exceed the watermark must be handled by an explicit policy, not silently discarded. Three patterns exist: (1) Drop and count — events past the watermark are discarded, but a metric is incremented per dropped event per source; the metric feeds a dashboard and an alert that fires when the drop rate exceeds a threshold indicating the watermark is misconfigured. (2) Side output — late events are directed to a separate Kafka topic or database table; a separate batch process reads the side output, computes the late contribution, and emits a correction to the downstream data store; the downstream data product must be designed to accept corrections and merge them with the original window result. (3) Allowed lateness — Flink and Kafka Streams allow a secondary lateness tolerance beyond the watermark, during which late events trigger additional firings of the already-emitted window; the window output is updated in place with each new late event; downstream consumers must handle multiple versions of the same window result. Each pattern has a different impact on the downstream data product's consistency model: drop-and-count produces results that are permanently incomplete but predictably bounded; side-output corrections produce eventual consistency with a correction latency; allowed lateness produces real-time correction but requires idempotent downstream writes. The pipeline ADR must document which pattern is used for each pipeline and for each downstream data product, because a consumer that expects a window result to be final will behave incorrectly when a side-output correction overwrites it.
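The side-output pattern implies a correction step that downstream consumers can detect. This is a minimal sketch, assuming window results are simple counts stored with a version number; `apply_corrections` and the result layout are hypothetical, and the batch process is assumed to read each late event from the side output exactly once.

```python
def apply_corrections(window_results, late_events, window_size):
    """Fold late events from a side output into already-emitted window counts.

    window_results: dict window_start -> {"count": int, "version": int}
    late_events: iterable of event_time values read from the side output
    Returns the set of window starts that were corrected.
    """
    corrected = set()
    for event_time in late_events:
        start = event_time - (event_time % window_size)
        result = window_results.setdefault(start, {"count": 0, "version": 0})
        result["count"] += 1
        corrected.add(start)
    for start in corrected:
        # Bump the version once per batch so consumers can see the revision.
        window_results[start]["version"] += 1
    return corrected
```

A window emitted with 9 events that later receives 5 late events ends at a count of 14 with its version incremented. A consumer that treats the first version as final never sees the corrected value, which is why the policy has to be documented per data product.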

Property 3: Schema evolution and pipeline topology management. Data pipelines read from upstream sources whose schemas change over time — new columns are added to database tables, event payloads gain or lose fields, enum values are extended or renamed. The pipeline's behavior when an upstream schema changes is determined by the processing model and the schema handling contract established at pipeline creation time, not at the time the schema change occurs.

In a batch pipeline, a schema change in the source takes effect at the next pipeline run. If a column is removed from a source table and the pipeline's dbt model references that column, the next run fails with a database error when the model's SQL executes, before that model's output is replaced. The failure is visible, produces no corrupted output, and is fixed by updating the model. The cost of a batch schema evolution failure is one missed run and an engineer-hour to fix the model. The API schema design decision record documents the external API schema versioning contract; the pipeline ADR must document an equivalent internal contract for upstream data sources: which field changes in the source are breaking (column removed, column renamed, enum value removed, column type changed in an incompatible direction) versus non-breaking (column added with a default, enum value added), and the notification protocol, meaning who must inform the data engineering team before a breaking schema change is deployed so the pipeline model can be updated in the same release window.
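The breaking versus non-breaking distinction can be checked mechanically before a source deployment. This is a minimal sketch, assuming schemas are represented as plain dicts of column name to type string. `classify_schema_change` is a hypothetical helper, it conservatively treats every type change as breaking, and a rename shows up as a removal plus an addition.

```python
def classify_schema_change(old, new):
    """Compare two schemas (dict: column name -> type string).

    Returns (breaking, non_breaking) lists of human-readable changes.
    """
    breaking, non_breaking = [], []
    for column, column_type in old.items():
        if column not in new:
            breaking.append(f"removed: {column}")
        elif new[column] != column_type:
            breaking.append(f"type changed: {column} {column_type} -> {new[column]}")
    for column in new:
        if column not in old:
            non_breaking.append(f"added: {column}")
    return breaking, non_breaking
```

A check like this could run in the source team's release process so that a non-empty `breaking` list triggers the notification protocol. It does not cover enum values or column defaults, which would need their own comparison.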

In a streaming pipeline, a schema change in the source takes effect as soon as new events begin arriving with the new schema — not at a discrete run boundary. If events with the old schema and events with the new schema are both present in the Kafka topic at the same time (which is the normal case during a source deployment), the streaming job processes a mixed stream. A job written to extract a field by name using a strongly-typed schema (Avro or Protobuf with generated classes) will fail to deserialize events with the new schema if the schema version is not backward-compatible. A job using a weakly-typed schema (JSON with runtime field access) will silently produce incorrect results if it attempts to read a renamed or removed field — null propagates through the computation rather than failing loudly. Streaming pipeline topology upgrades require blue/green deployment: a new version of the job is deployed alongside the old version, both consuming from the same Kafka topic; the old version is drained and shut down once the new version has confirmed it is processing events correctly and its state (accumulated window aggregates) has been rebuilt from the event log. Without this protocol, a mid-stream upgrade during a high-traffic period produces a split-brain window: half the events in the window were processed by the old topology and half by the new, and the aggregation is incorrect because the two halves used different field extraction logic. The database migration strategy decision record documents the schema migration tooling for the application database; the pipeline ADR must document the equivalent topology migration protocol for streaming jobs — the blue/green procedure, the state rebuild verification step, and the rollback criterion if the new topology produces incorrect output.
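The difference between failing loudly and propagating null is visible even in a toy example. This is a minimal sketch with a hypothetical `amount` field and two hypothetical accessors. It stands in for the weakly-typed JSON case only, not for Avro or Protobuf deserialization.

```python
import json

def lenient_amount(raw):
    """Weakly-typed access: a renamed or removed field silently becomes None."""
    return json.loads(raw).get("amount")

def strict_amount(raw):
    """Fail loudly when the expected field is missing."""
    event = json.loads(raw)
    if "amount" not in event:
        raise KeyError("event is missing required field 'amount'")
    return event["amount"]
```

If the source renames `amount` to `amount_cents`, `lenient_amount` returns `None` for every new event and the aggregation continues with nulls, while `strict_amount` raises on the first one.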

Property 4: Pipeline observability and the silent failure surface. Batch pipelines fail loudly: a dbt model that references a nonexistent column fails with a database error when it runs; an Airflow DAG that times out fails the task and sends an alert; a Spark job that runs out of memory raises an exception and writes no output. The failure is a positive signal: something happened that the monitoring system can detect and alert on. Batch pipeline monitoring is primarily about detecting the absence of a successful run or the presence of a failed task.

Streaming pipelines fail silently in a class of failure modes that has no equivalent in batch processing. A streaming pipeline that drops late-arriving events produces no error — it processes the events it receives and emits results for them. A Flink job whose consumer group lag is growing because the upstream Kafka topic's throughput has exceeded the job's processing capacity continues to run, emit results, and report healthy — but the results are increasingly stale. A streaming join that misses events because the join's state time-to-live has been exceeded before the matching event arrived silently produces partial results. A watermark that has been misconfigured since deployment has been silently dropping late events for months. None of these failures produce errors. They produce incorrect output at a rate that is often indistinguishable from correct output without a reference dataset to compare against. The observability strategy decision record documents the metrics and tracing infrastructure; streaming pipeline observability requires a separate set of metrics from application observability: consumer group lag per topic partition (the gap between the latest offset and the committed offset, measured in event count and in time), window output event count versus input event count per source (a drop indicates events being discarded by the watermark or a join miss), late-event discard rate per source, and freshness watermark — the event timestamp of the most recently processed event, compared to wall clock time to compute actual data staleness.
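Each of these metrics is a simple subtraction or ratio once the inputs are collected. This is a minimal sketch with hypothetical function names. How the offsets, counts, and timestamps are obtained depends on the runtime and is not shown.

```python
def partition_lag(latest_offsets, committed_offsets):
    """Events not yet processed, per partition (latest minus committed offset)."""
    return {
        partition: latest - committed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }

def staleness_seconds(last_processed_event_time, now):
    """Actual data staleness: wall clock minus the newest processed event time."""
    return max(0.0, now - last_processed_event_time)

def discard_rate(input_count, output_count):
    """Fraction of input events missing from window output for one source."""
    if input_count == 0:
        return 0.0
    return (input_count - output_count) / input_count
```

For a window that should have held 14 events and closed with 9, `discard_rate(14, 9)` is about 0.36, a figure that would have been visible on the first day if it had been tracked per source.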

These metrics must be alarmed, not just dashboarded. A consumer lag alert fires when the lag exceeds the acceptable staleness budget: if the freshness SLA for a dashboard is 15 minutes and the consumer lag grows above 15 minutes, the SLA is breached even if the job is running and emitting results. A late-event discard rate alert fires when the discard rate from any source exceeds a threshold — 1% might be acceptable noise from brief network disruptions; 30% indicates the watermark is misconfigured for that source's actual arrival latency distribution. The logging strategy decision record documents the structured logging contract; pipeline jobs must emit structured logs per processing step with event counts, window boundaries, discard counts, and latency percentiles — not free-form strings that require regex parsing to extract operational metrics. Without structured pipeline metrics, the only way to detect silent streaming failures is to compare pipeline output against the source system's count — a manual audit that happens after a customer complaint, not as a continuous operational signal.
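The alert rules and the structured log line can be sketched together. This is a minimal illustration, assuming a hypothetical 1% default discard threshold and hypothetical alert and field names. It is not tied to any particular alerting or logging system.

```python
import json

def evaluate_alerts(staleness_s, sla_s, discard_rates, max_discard_rate=0.01):
    """Return alert names for a breached freshness SLA or high late-event discards.

    discard_rates: dict source -> fraction of events discarded as late.
    """
    alerts = []
    if staleness_s > sla_s:
        alerts.append("freshness_sla_breached")
    for source in sorted(discard_rates):
        if discard_rates[source] > max_discard_rate:
            alerts.append(f"late_discard_rate_high:{source}")
    return alerts

def log_line(step, **fields):
    """One structured log record per processing step, as a JSON string."""
    return json.dumps({"step": step, **fields}, sort_keys=True)
```

A call such as `log_line("window_emit", source="mobile", input_count=14, output_count=9, discarded=5)` produces a record that a metrics system can parse without regex, and the same counts feed `evaluate_alerts`.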

What the founding session records and what it omits

The data pipeline decision is almost always made in the founding month as an extension of the analytics setup: "we need to get our events into the data warehouse, what should we use?" The session produces the first DAG or the first Kafka consumer, the first dbt project structure or the first Flink job skeleton. It records the tool choice. It does not record the freshness SLA, the reprocessing window, the late-event policy, or the schema evolution contract.

Four types of AI chat sessions generate these gaps:

The "how do we build a data pipeline for our analytics?" session. The team needs to move events from the application database to the data warehouse. The engineer asks how to build the pipeline. The session explains the ETL vs ELT distinction, recommends dbt for transforms, suggests Airflow for orchestration, and produces a first DAG skeleton that extracts from PostgreSQL, loads to Redshift, and runs a dbt model to build an engagement metrics table. The session does not ask: what freshness SLA do your dashboards require, and is a nightly batch cadence compatible with that SLA? What is the maximum acceptable data loss if the pipeline fails and the last successful run was 48 hours ago? How will the pipeline be reprocessed if a bug in the transform logic is discovered six months from now? The answers to these questions would change the architecture: a team whose core product value is real-time customer intelligence should not start with a nightly batch pipeline, regardless of how quickly it can be set up. The session answers the question asked ("how do we build a pipeline") rather than the question that should have been asked ("what pipeline model satisfies our data freshness requirements at our current event volume and team operational capacity"). The resulting pipeline is technically correct and architecturally misaligned with the product requirements discovered four months later. The background job infrastructure decision record documents the batch job execution model; the pipeline ADR must document the freshness SLA per data product alongside the processing model, so that the alignment between the two is verified at pipeline design time rather than during a feature scoping conversation when a product manager requests a 15-minute freshness that the existing pipeline cannot deliver.

The "how do we make our pipeline process data in real-time?" session. The team has a batch pipeline and needs lower latency. The engineer asks how to move to real-time processing. The session explains event streaming, recommends Kafka as the message broker, introduces Flink or Kafka Streams for stream processing, and produces a first streaming job that consumes from a Kafka topic and emits aggregations. The session does not ask: what is the P99 arrival latency of each event source feeding this pipeline, and what watermark value does that measurement imply? Which of your current batch transforms are time-windowed aggregations, and what happens to events that arrive after the window closes? How will you reprocess historical data if a streaming job bug is discovered — and does your Kafka topic retain events for the full window needed for that reprocessing? The watermark value in the produced code is the tutorial default — 10 seconds, or sometimes 1 minute — calibrated to the example's event source, not the team's actual event sources. The session produces a streaming job that is correct for the happy path (web events with sub-second delivery latency) and silently incorrect for the unhappy paths (mobile events, partner retry queues, IoT devices with intermittent connectivity). The data quality issue from the misconfigured watermark is discovered weeks or months later, after incorrect aggregations have been serving downstream data products. The message broker decision record documents the Kafka topic structure and retention policy; the pipeline ADR must document the watermark configuration per source with the measurement basis, because the watermark is the most consequential configuration value in a streaming pipeline and the one most likely to be wrong when copied from an example.

The "how do we handle pipeline failures and alerts?" session. A pipeline run has failed or produced incorrect output and the team wants to set up proper failure handling. The engineer asks how to configure alerting and retry logic. The session produces Airflow task failure alerts (email or Slack notification on task failure), retry configuration (retry 3 times with exponential backoff before marking failed), and a general recommendation to monitor the data warehouse for row count anomalies. The session does not ask: if the pipeline fails and the last successful run was 36 hours ago, what is the recovery procedure — do you re-run the failed run, run a backfill from the last known-good state, or both? Is each pipeline step idempotent so that a re-run produces the same output as the original run, or do some steps append data that would be duplicated by a re-run? For streaming pipelines: if the Flink job restarts from its last checkpoint after a failure, and the checkpoint was 20 minutes old when the failure occurred, what happens to the events processed in those 20 minutes — are they replayed from Kafka and processed again, and if so, does the downstream output handle duplicate writes? The disaster recovery decision record documents the recovery procedures for the application layer; the pipeline ADR must document an equivalent recovery runbook for pipeline failures: the procedure for each failure mode (job failure, source unavailability, downstream write failure), the idempotency contract per pipeline step, and the maximum tolerated data gap if the recovery requires a full reprocessing backfill from a date before the failure.

The "how do we add a new data source to the pipeline?" session. The team wants to add a new event source — a new microservice, a partner feed, a third-party SaaS tool — to the existing pipeline. The engineer asks how to integrate it. The session produces a new Kafka topic configuration or a new Airflow extract task, a schema mapping for the new source's data model, and a new dbt model or Flink job branch that processes the new source's events. The session does not ask: how does the new source's schema evolution process work — who owns it, how are schema changes communicated, and what is the pipeline's behavior when an undocumented schema change is deployed? For streaming sources: what is the P99 arrival latency of this new source's events, and is it compatible with the existing watermark configuration, or does this source require a separate processing path? The API schema design decision record documents the external API schema versioning contract; the pipeline ADR must document an equivalent notification protocol for each upstream data source — the mechanism by which schema changes are communicated to the pipeline team before deployment, the pipeline team's responsibility to update models before the schema change takes effect, and the rollback plan if a schema change is deployed without advance notice and causes a pipeline failure. Without this protocol, each upstream schema change is a surprise discovered when the next pipeline run fails or, worse, when the pipeline silently produces incorrect output because the source added a new enum value that the pipeline's filter logic was not expecting.

The WhyChose extractor surfaces the founding pipeline setup session, the first freshness requirement conversation, the first failure incident session, and the first schema evolution incident from AI chat history. The data pipeline ADR converts the implicit decisions embedded in those sessions — tool choice, DAG cadence, watermark value, schema assumptions — into a documented freshness SLA per data product, a watermark policy per source, a reprocessing contract, and a schema evolution notification protocol, so the next engineer who asks "can we make this dashboard more real-time?" has the architectural ceiling documented, not discovered.

The five sections of a data pipeline ADR

Section 1: Processing model selection and freshness SLA per data product. Document the processing model chosen — batch with a specific cadence, micro-batch (e.g., Spark Streaming or dbt on a 15-minute interval), or continuous streaming — with the specific rationale. List every downstream data product that reads pipeline output: dashboards, ML model feature stores, API endpoints, operational reports, alerting systems. For each, document the freshness SLA — the maximum acceptable staleness — and verify that the selected processing model can satisfy it at the team's current event volume and operational capacity. If any data product's freshness SLA is not satisfiable by the selected model, document it explicitly as a known constraint with the estimated cost and timeline for the architectural change that would satisfy it. Do not leave it as an implicit limitation discovered when the product manager requests the real-time dashboard. The data warehouse decision record documents the query compute model where batch pipeline output lands; the pipeline ADR must document the round-trip latency from event generation to queryable result in the data warehouse, accounting for pipeline processing time plus the warehouse's data ingestion latency, to determine the real freshness SLA achievable for each data product.
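The SLA verification in Section 1 can be reduced to simple arithmetic. The following is a minimal sketch, assuming a batch pipeline whose worst-case staleness is one full cadence plus run time plus warehouse load time. The product names, the 40-minute run duration, and the 5-minute load latency are made-up inputs, not measurements.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataProduct:
    name: str
    freshness_sla: timedelta  # maximum acceptable staleness


def worst_case_staleness(cadence, run_duration, warehouse_load):
    """Oldest data a reader can see just before the next run's output lands.

    Until run k+1 finishes loading, readers still see data as of run k's
    cutoff, which by then is one cadence plus run time plus load time old.
    """
    return cadence + run_duration + warehouse_load


def unsatisfied(products, staleness):
    """Names of data products whose SLA is tighter than the achievable staleness."""
    return [p.name for p in products if p.freshness_sla < staleness]


products = [
    DataProduct("weekly_cohort_report", timedelta(days=2)),
    DataProduct("health_dashboard", timedelta(minutes=15)),
]
staleness = worst_case_staleness(
    cadence=timedelta(hours=24),          # nightly run
    run_duration=timedelta(minutes=40),   # assumed
    warehouse_load=timedelta(minutes=5),  # assumed
)
gaps = unsatisfied(products, staleness)
```

Any name returned in `gaps` is a data product whose SLA the current model cannot meet, which is exactly the entry that belongs in the ADR as a known constraint.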

Section 2: Late-arriving event policy with watermark measurement and correction strategy. For streaming pipelines: document the P99 event arrival latency for each source, measured on production traffic. Document the watermark configuration derived from each measurement: the watermark allowance for a source should cover at least the P99 arrival latency for that source, with explicit acknowledgment of the tradeoff between completeness and latency, since a larger allowance captures more late events but delays every final window result by the same amount. Document the handling policy for events that exceed the watermark: drop with metric increment, side output with a correction batch process, or allowed lateness with additional firings. For each policy, document the downstream impact: which data products may receive late corrections and how they handle them (overwrite, merge, reject). For batch pipelines: document the maximum acceptable source delay, meaning how long after the batch window closes the source may take to contain every event from that window before the batch can be considered complete. Document the trigger for re-running a batch when source data arrives late: manual operator intervention, an automated completeness check before the next run, or tolerance of partial-window data with a correction in the following day's run. Document the measurement cadence for re-validating watermark assumptions: mobile SDK behavior changes with new app versions, partner integrations change their retry configurations, and a watermark calibrated to last quarter's traffic distribution may be wrong for this quarter's.
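The P99 measurement and the cost of a given allowance can both be computed from a sample of arrival delays. This is a minimal sketch using a nearest-rank percentile. The `delays` list is synthetic data chosen for illustration, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def late_fraction(delays, allowance):
    """Share of events whose arrival delay exceeds the watermark allowance."""
    if not delays:
        raise ValueError("no samples")
    return sum(1 for d in delays if d > allowance) / len(delays)


# Arrival delay = arrival time - event time, in seconds (synthetic sample).
delays = [1, 2, 2, 3, 5, 8, 40, 90, 600, 15000]
p99 = percentile(delays, 99)
dropped_share = late_fraction(delays, allowance=10)
```

Running the same two functions per source on production samples gives both numbers the section asks for: the P99 that the allowance should cover, and the share of events a candidate allowance would treat as late.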

Section 3: Schema evolution contract and topology migration protocol. Document the upstream schema change notification protocol for each data source: who is responsible for notifying the data engineering team before a breaking schema change is deployed, the minimum lead time required (typically one sprint), and what constitutes a breaking change (column removal, column rename, enum contraction, required field added to an event schema). Document the pipeline's behavior when a breaking schema change is deployed without advance notice: which pipeline step fails, what output is produced in the window between the schema change and the fix, and whether downstream data products consume incorrect data or no data during that window. Document the schema versioning approach for event sources: Avro schema registry with compatibility mode (BACKWARD, FORWARD, FULL), Protobuf with reserved field numbers, JSON Schema with explicit version enumeration, or unversioned JSON with runtime field access and explicit null-handling. For streaming pipelines: document the topology migration protocol for deploying a new version of a streaming job — the blue/green procedure, the checkpoint compatibility assessment, the state rebuild verification (does the new topology produce identical aggregations to the old topology when replayed against the same Kafka offset range?), and the rollback criterion. The database migration strategy decision record documents the application database schema migration tooling; the pipeline ADR must document a comparable migration review gate for pipeline topology changes — a topology change that alters windowing logic, join semantics, or output schema requires the same level of review as a database migration, because incorrect streaming output may not be detectable until a customer reports a data discrepancy.
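The definition of a breaking change above can be turned into a pre-deployment check. The sketch below compares two hand-written schema descriptions. The dictionary layout (`required`, `enum`) and the function name are assumptions made for illustration; a schema registry's own compatibility check would replace this where one is in use.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List breaking differences between two schema descriptions.

    Each schema maps field name -> {"required": bool, "enum": optional list}.
    A rename shows up as a removal, because the old name no longer exists.
    """
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed or renamed: {name}")
            continue
        old_enum = spec.get("enum")
        new_enum = new[name].get("enum")
        if old_enum is not None and new_enum is not None:
            dropped = set(old_enum) - set(new_enum)
            if dropped:
                problems.append(f"enum contracted on {name}: {sorted(dropped)}")
    for name, spec in new.items():
        if name not in old and spec.get("required", False):
            problems.append(f"required field added: {name}")
    return problems


old_schema = {
    "account_id": {"required": True},
    "status": {"required": True, "enum": ["active", "paused", "churned"]},
    "seats": {"required": False},
}
new_schema = {
    "account_id": {"required": True},
    "status": {"required": True, "enum": ["active", "churned"]},
    "seat_count": {"required": False},
    "region": {"required": True},
}
changes = breaking_changes(old_schema, new_schema)
```

An empty result means the change adds only optional fields or widens an enum. A non-empty result is the signal that the notification lead time applies.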

Section 4: Reprocessing and backfill contract with idempotency verification. Document the reprocessing mechanism — how a pipeline operator initiates a reprocessing run for a specific date range, which parameters are required (start date, end date, which data products to rebuild, whether to overwrite or append output), and the expected duration for a 30-day reprocessing run at current event volume. Document the idempotency contract for each pipeline step: whether writing the same events twice produces duplicate records in the downstream data store (requiring deduplication) or whether the write is idempotent by key (an upsert or overwrite that is safe to repeat). Document the maximum reprocessing window available given source data retention: if the application database retains 90 days of event history and the Kafka topics retain 30 days, the pipeline can only reprocess within the shorter of the two retention windows. If the reprocessing window needed for incident recovery (based on the team's historical bug detection latency) exceeds the source retention, document the supplementary archive — a separate long-term event store that preserves a copy of the raw events beyond the source's retention period for reprocessing purposes. Document the reprocessing trigger authority: who can initiate a reprocessing run, what approval is required for a backfill that will overwrite more than N days of downstream data, and how downstream consumers are notified that a reprocessing run is in progress so they can treat output as potentially incomplete during the reprocessing window. The secrets management decision record documents the credential management approach; pipeline reprocessing jobs that read from production sources and write to production data stores require the same credential management as the primary pipeline, not ad-hoc credentials that bypass the standard rotation policy.
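Two parts of this contract are easy to illustrate: the retention bound on the reprocessing window, and the difference between an append and a keyed upsert when a run is repeated. This is a minimal sketch with in-memory stand-ins for the downstream store. The function names, the `event_id` key, and the sample rows are hypothetical.

```python
from datetime import date, timedelta


def earliest_reprocessable(today, retention_days_by_source):
    """Earliest date that can be rebuilt when every listed source is required."""
    return today - timedelta(days=min(retention_days_by_source.values()))


def append_write(table: list, rows):
    """Non-idempotent write: repeating the run duplicates rows."""
    table.extend(rows)


def upsert_write(table: dict, rows, key="event_id"):
    """Idempotent write: repeating the run overwrites rows with the same key."""
    for row in rows:
        table[row[key]] = row


rows = [{"event_id": "a1", "score": 10}, {"event_id": "b2", "score": 7}]
appended, upserted = [], {}
for _ in range(2):  # the same reprocessing run executed twice
    append_write(appended, rows)
    upsert_write(upserted, rows)

start = earliest_reprocessable(date(2026, 7, 2), {"app_db": 90, "kafka": 30})
```

After two runs the appended table holds four rows and the upserted table holds two, which is the difference the idempotency contract needs to record for each step.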

Section 5: Pipeline observability SLA and dead letter escalation. Document the metrics that constitute the authoritative signal that the pipeline is producing correct output, not just that it is running. For batch pipelines: successful run completion per schedule, row count in the output table within an expected range per batch window, and data freshness watermark (the maximum event timestamp in the output table, compared to wall clock time). For streaming pipelines: consumer group lag per topic partition (in event count and in time), output event count per input event count per source per window, late-event discard rate per source, and window result finalization latency (time between the window's end timestamp and the emission of the final window result). Document the alerting thresholds for each metric: consumer lag above the freshness SLA budget, discard rate above the acceptable noise floor, output row count below the expected floor for a given event volume. Document the on-call escalation procedure for each alert: who receives the initial page, what the first investigation steps are (check consumer group lag trend, check the late-event discard rate, check for upstream schema changes in recent deployments), and the runbook for each failure class (consumer lag growing due to throughput overload, discards spiking due to watermark misconfiguration, output row count anomaly due to upstream schema change). Document the dead letter queue for both batch and streaming pipelines: events that cannot be processed due to parse errors, type mismatches, or unexpected schema are written to a dead letter store with the original event bytes and the processing error; the dead letter queue depth is monitored and alerted; events in the dead letter queue are reviewed and either replayed after a fix or explicitly discarded after investigation. Without a dead letter queue, parse errors and schema mismatches produce silent gaps in the output: events that were received by the pipeline but never appeared in the downstream data product, with no record of what happened to them. The observability strategy decision record documents the metrics infrastructure; the pipeline ADR must document pipeline-specific metrics as a separate category from application metrics, because pipeline lag and event discard rates require different instrumentation from HTTP request rates and error counts, and the alerting thresholds are calibrated to the pipeline's freshness SLA rather than to generic latency or error rate budgets.
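The dead letter behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: events are JSON bytes with hypothetical `user_id` and `ts` fields, and the dead letter store is a plain list standing in for a durable table or topic.

```python
import json


def process_batch(raw_events):
    """Parse raw event bytes; route failures to a dead letter list instead of dropping them."""
    parsed, dead_letters = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            parsed.append({"user_id": str(event["user_id"]), "ts": float(event["ts"])})
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original bytes and the error so the event can be replayed after a fix.
            dead_letters.append({"raw": raw, "error": f"{type(exc).__name__}: {exc}"})
    return parsed, dead_letters


raw_events = [
    b'{"user_id": 7, "ts": 1751414400.5}',
    b"not json",
    b'{"ts": 1751414401}',  # missing user_id
]
parsed, dead_letters = process_batch(raw_events)
dead_letter_rate = len(dead_letters) / len(raw_events)
```

Every input ends up in exactly one of the two outputs, so the counts always reconcile, and `dead_letter_rate` is the kind of pipeline-specific metric that an alert threshold can be attached to.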

None of these five sections appear in the founding sprint session that set up the first Airflow DAG with a nightly cadence or the first Flink job with a tutorial watermark. The session records that Airflow and dbt are being used, that the first transform produces an engagement score table, and that the job runs at 2am. It does not document that the company's core product will require 15-minute freshness in six months, that 31% of the mobile SDK's events arrive more than 10 minutes late, that the first batch run on corrupted data will require a 90-day reprocessing backfill that exceeds the Kafka retention window, or that the streaming job's output was incorrect for two months before anyone noticed because there was no late-event discard rate metric in the monitoring dashboard. The data pipeline ADR is the document that converts the implicit architectural choice — tool familiarity, tutorial default, first DAG skeleton — into the operational parameters that determine whether the data team discovers a freshness ceiling during a product planning meeting or during a customer complaint call about a health dashboard that showed "healthy" for a customer who opened three support tickets that morning. The WhyChose extractor recovers the founding pipeline session, the first freshness incident session, the first late-event investigation session, and the first schema evolution failure from AI chat history; the data pipeline ADR extracts the durable architectural constraints from those sessions and documents them where engineers encounter them: next to the DAG definitions and the Flink job configurations, not in a Slack thread from the founding month.

FAQs

When should I choose batch processing vs stream processing for a new data pipeline?

Choose batch processing when the freshness SLA for every downstream data product is measured in hours or days: daily sales reports, weekly cohort analytics, and monthly churn models all have batch-compatible freshness requirements. Choose batch when the source data is naturally partitioned by time and complete at batch time (transactional databases with well-defined extract windows, daily log archives, end-of-day reconciliation files), or when the pipeline reads a large historical dataset rather than an ongoing event stream. Choose batch when the team does not have the operational expertise to run streaming jobs: Flink and Kafka Streams applications run continuously and require checkpoint or state-store management and topology monitoring.

Choose stream processing when at least one downstream data product has a freshness requirement measured in seconds or low-single-digit minutes (fraud scoring, real-time inventory, live dashboards, SLA breach alerting), or when the source generates continuous events rather than batch-extracted snapshots (Kafka topics, Kinesis streams, CDC event feeds from a database write-ahead log). The most common mistake is choosing streaming because it sounds architecturally modern rather than because a specific data product has a freshness requirement that batch cannot satisfy. A team that adopts Flink for a use case with a 4-hour freshness SLA pays the operational cost of streaming without receiving any benefit that a 4-hour batch cadence would not provide.

What are late-arriving events and why do they cause more problems in streaming pipelines than in batch pipelines?

A late-arriving event is an event that reaches the processing system well after its event timestamp (the time at which it occurred), late enough that the system may already have processed the time period it belongs to. Late-arriving events are common whenever there is a delay between event generation and event delivery: mobile apps that buffer events while offline and flush on reconnect, IoT devices with intermittent connectivity, distributed systems where message delivery is delayed by network partitions, and retry queues where events are re-delivered hours after the original delivery attempt.

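In a batch pipeline, late-arriving events are largely invisible as a category. A batch job processes all events present in the source up to the cutoff time of the batch window. If a mobile app's buffered events arrive two hours after they were generated, they land in the event log with their original timestamps. As long as they arrive before the run that covers their time period, or the job recomputes recent partitions on each run, the batch includes them in the correct time-period partition by filtering on event timestamp. The batch does not care that the events arrived late; it misses only events that arrive after the last run that rebuilds their partition, which is why a batch pipeline still needs a maximum tolerated source delay and a re-run trigger.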
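A batch job can only place late rows in the right partition if it knows which partitions received rows since its previous run. The sketch below shows one way to find them, assuming each row stores both an event timestamp and an arrival timestamp. The field names `event_time` and `arrived_at` and the sample rows are hypothetical.

```python
from datetime import datetime


def partitions_to_rebuild(events, last_run_at):
    """Event-date partitions that received at least one row after the previous run."""
    return sorted(
        {e["event_time"].date() for e in events if e["arrived_at"] > last_run_at}
    )


last_run_at = datetime(2026, 7, 1, 2, 0)  # previous nightly run
events = [
    # On time: arrived before the previous run, already processed.
    {"event_time": datetime(2026, 6, 30, 21, 15), "arrived_at": datetime(2026, 6, 30, 21, 15, 2)},
    # Late: belongs to June 30 but arrived after that day's run.
    {"event_time": datetime(2026, 6, 30, 23, 50), "arrived_at": datetime(2026, 7, 1, 3, 10)},
    # New day's event, picked up by the next regular run.
    {"event_time": datetime(2026, 7, 1, 9, 0), "arrived_at": datetime(2026, 7, 1, 9, 0, 1)},
]
stale = partitions_to_rebuild(events, last_run_at)
```

The June 30 partition appears in `stale` even though its regular run has already completed, so the next run knows to rebuild it as well as the current day.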

In a streaming pipeline, late-arriving events are a first-class architectural problem. A streaming system uses watermarks to decide when a time window is complete. The watermark represents confidence that no more events with timestamps before a given point will arrive. With the common bounded-out-of-orderness strategy, a 10-minute watermark allowance means the watermark trails the latest event timestamp seen by 10 minutes, so a window closes only once the system has seen an event timestamped at least 10 minutes after the window's end. Events that arrive after the watermark has passed their window are either dropped silently, directed to a side output, or trigger a correction to the already-emitted result. The watermark configuration must be calibrated against the P99 arrival latency of the actual event sources, not copied from a tutorial example that assumed low-latency server-side event generation. A 10-second tutorial watermark applied to a mobile SDK with P99 arrival latency of 4.2 hours silently drops every mobile event delayed beyond that allowance, without an error, an alert, or a count of what was missed.
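The effect of the allowance is easiest to see in a small simulation. The sketch below is a simplified model, not Flink or Kafka Streams code: it assumes tumbling windows, a watermark equal to the largest event timestamp seen minus the allowance, and a watermark that advances on every event rather than periodically. The function name and the sample events are made up.

```python
def simulate(events, window, allowance):
    """Count accepted and dropped events under a bounded-out-of-orderness watermark.

    events: (event_time, arrival_time) pairs in seconds.
    An event is dropped when its window's end is at or before the current watermark.
    """
    max_event_time = float("-inf")
    accepted = dropped = 0
    for event_time, _arrival in sorted(events, key=lambda e: e[1]):  # arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowance
        window_end = (event_time // window + 1) * window
        if window_end <= watermark:
            dropped += 1
        else:
            accepted += 1
    return accepted, dropped


events = [(5, 6), (50, 51), (75, 76), (20, 200)]  # last event is buffered and arrives late
tutorial = simulate(events, window=60, allowance=10)
calibrated = simulate(events, window=60, allowance=300)
```

With the 10-second allowance the buffered event is dropped because its window closed when the event at second 75 arrived. With a 300-second allowance all four events are accepted, at the cost of holding each window open longer.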

What should a data pipeline ADR document that a general infrastructure decision does not?

A general infrastructure decision records that the pipeline uses Airflow and dbt (or Flink) and the motivation for choosing them. A data pipeline ADR must document the structural properties that the processing model establishes, in the same order as the five sections above: (1) the freshness SLA per data product and verification that the chosen model satisfies every SLA; (2) the late-arriving event policy (for streaming, the measured P99 arrival latency per source, the watermark derived from it, and the handling policy for late events; for batch, the maximum tolerated source delay and the re-run trigger); (3) the schema evolution contract, covering the notification protocol for upstream breaking changes and the pipeline's behavior when a change is deployed without notice; (4) the reprocessing and backfill contract, covering the mechanism, the idempotency guarantees per step, and the maximum historical window available given source retention; (5) the pipeline observability SLA, covering the metrics that confirm correct output (not just running status), the alerting thresholds calibrated to the freshness SLA, and the dead letter queue for parse failures and schema mismatches.

None of these appear in the founding session that set up the first DAG. All of them determine whether the data team discovers a freshness ceiling during a product planning meeting or during a customer complaint about a dashboard that showed "healthy" while three support tickets were open.