2026-06-25 · ~22 min read

The data warehouse decision record: why the warehouse you chose determines your query latency floor and your data freshness model

Data warehouse selection looks like a one-afternoon BigQuery setup until the BI tool's dashboard refresh schedule produces a $1,200 scan bill on month one, or a Redshift cluster runs 20 concurrent analyst queries during a nightly ETL load and every query degrades to 30–90 seconds because the WLM queues all on shared cluster compute. The warehouse you chose at analytics setup time sets your query compute model, your data freshness architecture, and your cost isolation boundary — none of which were visible when the first SELECT COUNT(*) FROM events returned results in under a second.

A 22-person B2B SaaS product set up their analytics stack in 2023. The head of growth had used BigQuery at a previous company and advocated for it in an afternoon Slack thread. The engineering team stood up a BigQuery dataset, connected their application's PostgreSQL instance via Fivetran, and gave the growth and product teams access to Looker, configured against the BigQuery dataset. The first dashboards were built within a week. Dashboard refresh was set to five minutes — the Looker default, which the growth team preferred because they were running A/B tests and wanted near-real-time signal. No one asked what "every five minutes" meant in terms of BigQuery API calls or bytes scanned.

The product dashboard had 40 tiles. Each tile was a separate Looker Explore query, and each query scanned the events table — the primary fact table, which at that point held 14 months of behavioral event data across 850 GB in BigQuery. Looker's query result caching reduces redundant BigQuery executions when the same query runs twice with the same parameters within the cache TTL window. The Looker instance was configured with a 5-minute cache TTL — matching the refresh interval — meaning every five-minute refresh triggered fresh BigQuery executions on nearly every tile. At 40 queries × 800 GB average scan per query × $5 per TB × 12 refreshes per hour × 24 hours × 30 days: the theoretical monthly scan cost on that one dashboard was above $34,000. Looker's caching reduced the actual execution count by roughly two-thirds, bringing the realized cost to approximately $1,200 in the first full month the dashboard was live at full refresh frequency.

The team had expected BigQuery to be cheap — "you pay for what you query" was the framing the head of growth used. The first Google Cloud billing statement arrived three days after month close. The $1,200 line for BigQuery Data Analysis was a surprise. The engineering team spent two days auditing the query patterns: filtering Looker's query history by bytes-billed, identifying the ten most expensive tiles, adding partition filters to the underlying LookML views so that queries pruned to the current 30-day partition rather than scanning the full 14-month dataset, and extending the Looker cache TTL to 60 minutes for tiles that did not require sub-hour freshness. The query cost for the following month dropped to $89. The BigQuery pricing model was not the problem — the problem was that the BigQuery pricing model was never documented against the BI tool refresh configuration at setup time. See the pattern in decisions never written down.

The second incident involved not cost but performance. A 16-person data-infrastructure team was running Redshift. Their cluster was a two-node ra3.4xlarge — a configuration that Redshift's documentation describes as appropriate for 4–8 TB of managed storage with moderate query concurrency. Their analytics workload fit this profile at setup time: four analysts running queries during business hours, a nightly dbt run at 2am loading and transforming the previous day's data, and a Tableau Server instance that refreshed extract workbooks once per morning. For fourteen months, this configuration worked. The first symptom of a problem appeared after they hired three more analysts and began running a new near-real-time reporting dashboard in Tableau that refreshed every 15 minutes.

The Redshift cluster's WLM (Workload Management) configuration was at its factory default: five slots per queue, meaning up to five queries could run concurrently, with the remaining queries held in a wait queue. When the dbt nightly run was executing at 2am, its transformation models consumed three of the five WLM slots simultaneously. When an analyst in a different time zone — or an automated dashboard refresh — submitted a query during the nightly run window, that query entered the wait queue and held until a dbt model completed and freed a slot. Analysts reporting that "Redshift was slow at night" were correct — their queries were not slow, they were waiting. The WLM queue filled during the dbt window. When the team hired additional analysts and added the 15-minute Tableau refresh cadence, the symptom extended into business hours: at peak mid-morning concurrency with 20 analyst queries submitted within a 10-minute window, the WLM slot exhaustion produced 30–90 second wait times before any query began executing.

The fix — configuring a separate WLM queue for the dbt service user, allocating it a fixed memory percentage, and raising the analyst queue concurrency from five to eight slots — took four hours. The four hours were not spent fixing a hard technical problem; they were spent understanding what WLM was, how slot allocation worked, and why the default configuration had been appropriate at setup time but not at current concurrency. The information needed to set the WLM policy correctly had been available in Redshift's documentation the day the cluster was launched. No one had read it, because the default configuration worked, and defaults always work until concurrency crosses a threshold. The WLM configuration — how many queues, what concurrency per queue, what memory allocation, which database users route to which queue — was not in the decision record. It was not in any document. It was in no one's head. It was the context embedded in the setup session with the engineer who had built the cluster fourteen months earlier and had since left the company. See the new CTO onboarding problem for the same structural pattern in a different infrastructure domain.

The three structural properties that data warehouse selection determines

When teams choose a data warehouse, the evaluation focuses on SQL compatibility (does it support window functions? CTEs?), integration ease (does Fivetran or Airbyte have a connector?), and the credentials needed to connect a BI tool. These are legitimate criteria and all major warehouses satisfy them. The structural properties that determine whether the selection ages well — whether the query compute model creates cost surprises when BI tool refresh is configured, whether analyst and ETL workloads degrade each other at scale, and whether the freshness model matches the data products the business wants to build — are set at selection time and are almost never revisited until one of them produces an incident.

Query compute model and latency floor. The four major warehouses differ fundamentally in how they map queries to compute resources. BigQuery is serverless: it allocates slots (units of compute, roughly 0.5 vCPU each) from a shared pool to execute each query, scales automatically to the query's complexity, and releases the slots when the query completes. There is no warehouse to provision, no connection pool to configure, and no idle-compute cost when no queries are running. The tradeoff is cold slot allocation: when slot demand is high across all BigQuery tenants, queries may enter a queue before slots are assigned. BigQuery's SLA for on-demand customers does not guarantee slot availability — a query may wait 1–10 seconds before execution begins. For low-latency applications where sub-second query startup is a product requirement, BigQuery's on-demand model is not the right tool. BigQuery's flat-rate capacity commitments (purchased slots that are reserved for your project) solve the cold-start problem but change the cost model from per-query to fixed-capacity.

Snowflake uses Virtual Warehouses — named compute clusters of user-specified sizes (XS, S, M, L, XL, 2XL, 3XL, 4XL, 5XL, 6XL; each size doubles the compute of the previous). A Snowflake query executes on a specific Virtual Warehouse, and the warehouse is billed per second of compute time while it is running. Warehouses auto-suspend after a configurable idle period (default five minutes) and auto-resume when a query arrives, with a 1–3 second resume latency. The architectural implication is workload isolation by design: a Snowflake customer can create a separate Virtual Warehouse for dbt transformation runs, a separate one for Tableau, and a separate one for ad-hoc analyst queries. ETL running on the TRANSFORM_WH warehouse does not contend with analyst queries on the ANALYST_WH warehouse because they are separate compute clusters that do not share slots. This is the primary structural difference between Snowflake and Redshift: Redshift's WLM queues share the same cluster compute, so ETL and analyst workloads can degrade each other through slot exhaustion. Snowflake warehouses are isolated by construction. The cost consequence is that multi-warehouse configurations bill each warehouse independently — a customer running three warehouses simultaneously pays for three warehouses simultaneously, even if each is lightly loaded.

Redshift operates as a managed cluster of EC2 nodes (RA3 managed storage nodes or DC2 dense-compute nodes). The cluster is always running at a fixed hourly cost regardless of query volume. Redshift's WLM determines how many queries can run concurrently within the cluster, how memory is allocated among queues, and whether ETL and analyst workloads share compute. Redshift's query latency on large scans is typically competitive with Snowflake and BigQuery for column-oriented workloads, and Redshift's columnar compression and sort keys can produce excellent performance for workloads that are well-tuned. The cost model is always-on: an ra3.4xlarge two-node cluster costs approximately $1.86/hour per node ($2,678/month for two nodes), regardless of whether 20 queries run that day or 20,000. For consistent high-volume workloads, the always-on model is cost-competitive with Snowflake's per-second billing. For bursty or batch workloads with significant idle time, Snowflake's auto-suspend model is more cost-efficient because idle Snowflake warehouses cost nothing.

ClickHouse stores data in a columnar format optimized for high-cardinality aggregation queries over hundreds of millions or billions of rows. A ClickHouse query that counts distinct user IDs grouped by country across 500 million rows can complete in under one second on a modestly sized cluster, where the same query on Snowflake or BigQuery might take 10–30 seconds. ClickHouse's vectorized query execution processes data in large blocks using SIMD CPU instructions, achieving raw analytical throughput that the managed warehouse vendors do not match on compute-equivalent hardware. The latency advantage compounds at high query concurrency: ClickHouse is designed for high concurrent read workloads (100+ concurrent queries) in a way that Snowflake, Redshift, and BigQuery are not. The tradeoffs: ClickHouse is either self-hosted (requiring infrastructure management, cluster configuration, replication setup, and schema optimization work) or managed via ClickHouse Cloud at consumption pricing. ClickHouse's SQL dialect differs from ANSI SQL in ways that affect query portability — functions for time-series operations, aggregation combiners, and sampling syntax are ClickHouse-specific. A team migrating from ClickHouse to BigQuery or Snowflake faces query rewriting work that a team migrating between BigQuery and Snowflake does not. The build-versus-buy framing applies here: ClickHouse Cloud provides the latency advantages with reduced operational burden, but the SQL dialect specificity is a migration cost regardless of deployment model.

Data freshness model. The freshness of data in the warehouse — how recently the data reflects the source system state — is determined at warehouse selection time because different warehouses have different native streaming capabilities, and the freshness SLA shapes the entire ETL or ELT architecture downstream. Batch loading, where data is extracted from source systems on a schedule (hourly, every four hours, nightly), is the common pattern and is equally supported by all four warehouses: dbt scheduled runs, Fivetran sync intervals, Airbyte scheduled connections, and Glue/Databricks jobs all produce batch loads into the warehouse on user-configured schedules. The minimum freshness achievable with pure batch loading is the batch interval — hourly batch loading produces at most 60-minute-stale data, with the actual staleness depending on extraction time plus load time. For most analytical workloads, T+1 hour or T+4 hours freshness is acceptable. For operational analytics (dashboards that sales teams refresh during calls to see current pipeline value), sub-15-minute freshness may be a product requirement.

Streaming ingestion — where individual events or row changes are delivered to the warehouse in near-real-time rather than in batch loads — is supported by all major warehouses but with different mechanisms and cost implications. BigQuery streaming inserts deliver individual rows to BigQuery in seconds via the BigQuery Storage Write API, at a cost of $0.01 per 200 MB of streaming data inserted (roughly $50 per terabyte inserted via streaming, compared to $0 for batch loading). BigQuery streaming inserts are available immediately for querying after insertion but are not immediately available for export — a consequence of BigQuery's internal streaming buffer that drains into columnar storage. Snowpipe is Snowflake's micro-batch streaming ingestion mechanism: files land in S3 or GCS, an event notification triggers Snowpipe, and the data appears in Snowflake within one to five minutes. Snowpipe is billed per credit consumed during the micro-batch loading, separate from the Virtual Warehouse compute cost — a streaming-heavy workload can add significant Snowpipe cost that is not visible in the warehouse compute billing. Kinesis Data Firehose can deliver streaming data to Redshift via a buffered write mechanism, with a minimum buffer interval of 60 seconds and a maximum of 15 minutes — effectively adding a configurable streaming delivery latency on top of the Firehose buffer. The freshness model — what freshness SLA is required per data source, which sources need streaming versus batch, and what the cost model is for streaming delivery — should be documented at warehouse selection time, because switching from a batch ETL architecture to a streaming architecture after the warehouse is built requires restructuring the ETL pipeline, the dbt model dependency graph, and potentially the warehouse cost model. The infrastructure-as-code strategy applies to the ETL pipeline definition as much as to the warehouse cluster itself — the streaming connectors and trigger configurations are infrastructure that should be version-controlled and reviewable, not manual configurations in a Fivetran dashboard.

Cost model and compute isolation. The cost structure of each warehouse creates different operational incentives and different cost failure modes. BigQuery's on-demand per-query cost ($5/TB scanned) creates the risk that a single poorly-written query — a full table scan without a partition filter on a large table — produces a large cost. The cost is proportional to bytes read, not to query execution time or result rows, so a query that scans 10 TB and returns one row costs $50. This risk is mitigated through partition and cluster key design (ensuring that every analytical query against a large table includes a WHERE clause on the partition key, pruning the scan to the relevant partition), through BigQuery's per-query cost estimator (the dry-run API or the query validator in the BigQuery console, which reports bytes-to-be-scanned before execution), and through project-level or user-level spending limits that cap the bytes billed per query. BigQuery's flat-rate slot commitments eliminate per-query cost risk by converting to a fixed monthly capacity cost — the right choice when monthly on-demand costs consistently exceed the commitment tier price.

Snowflake's cost model charges per second of Virtual Warehouse runtime. The cost driver is not query count or bytes scanned but warehouse uptime: a warehouse that runs continuously at a Large size (8 credits per hour at the standard credit price of $3/credit = $24/hour) costs $17,280 per month if it never suspends. The auto-suspend feature is the primary cost control: a warehouse configured to auto-suspend after five minutes of idle time costs only for the time it is actually executing queries. A multi-warehouse Snowflake configuration — three Virtual Warehouses for ETL, BI, and ad-hoc — each at a Medium size, each auto-suspending after five minutes of idle, and running at 40% utilization (busy 9.6 hours per day) costs approximately $4 × 4 credits/hour × 9.6 hours × 3 warehouses × 30 days = $13,824 per month at $3/credit. The cost is predictable within a given warehouse configuration but sensitive to workload growth — adding a fourth warehouse, increasing warehouse size, or reducing the auto-suspend idle time all increase cost proportionally. The cost model requires monitoring warehouse credit consumption per warehouse per day, which Snowflake's account usage views provide directly.

Redshift's always-on cluster cost is independent of query volume. A two-node ra3.4xlarge cluster at $0.93/hour per node costs $1,339/month per node, $2,678/month for the two-node cluster — regardless of whether the cluster is idle for 12 hours per day or running at full concurrency. This model favors workloads with high utilization: if the cluster is actively running queries for 20 hours per day, the effective cost per query-hour is lower than Snowflake's per-second model at equivalent compute. For intermittent workloads — a startup whose analysts only work weekdays, with a nightly batch ETL and weekend idle — Redshift's always-on cost is paid for idle time that Snowflake would not charge for. Redshift pause and resume (available for the serverless tier and for RA3 clusters via the Redshift API) addresses this by suspending the cluster during idle windows, but pause-resume requires engineering a mechanism to trigger the pause/resume operation, and the resume latency (typically 3–6 minutes for an RA3 cluster) is unsuitable for interactive analyst query use cases. The cost isolation decision — whether ETL and analyst workloads are isolated (Snowflake separate warehouses) or co-resident (Redshift shared cluster, BigQuery shared slot pool) — determines whether cost spikes in one workload affect the cost and performance of another. The observability strategy framing applies to warehouse cost monitoring: a cost metric that isn't tracked by workload and by query before it causes a budget surprise is a metric that will be discovered after the bill arrives.

Data warehouse options and their structural properties

BigQuery is the correct default for teams on Google Cloud Platform and for teams that want zero infrastructure management and a pay-as-you-go cost model that is $0 when no queries run. The operational overhead of BigQuery is minimal: there are no clusters to provision, no connection pools to configure, no warehouse sizes to choose, and no upgrade windows to schedule. Fivetran, Airbyte, Stitch, and every major ELT connector have first-class BigQuery support. dbt has native BigQuery support with partition-aware model configuration. Looker, Tableau, Metabase, Redash, Superset, and every major BI tool connect to BigQuery via the standard BigQuery ODBC/JDBC connector. The critical configuration decision at setup time is table partitioning and clustering: every high-volume fact table should be partitioned on a date column (typically an event timestamp column, partitioned by day or hour) and clustered on the dimensions most commonly used in WHERE clauses (user_id, product_id, event_type). A correctly partitioned BigQuery table turns a multi-TB scan into a single-day-partition scan for most analytical queries, reducing cost by two to three orders of magnitude. A BigQuery table without partition keys has no cost floor on scans — every full-table query costs $5/TB on the entire table. This is the configuration decision that was not made in the $1,200-bill incident. The ADR for BigQuery must explicitly document the partition and cluster key for every table added to the warehouse, because this decision cannot be made retroactively on a populated table without recreating it.

Snowflake is the correct default when workload isolation is a first-class requirement — when ETL and analyst workloads must not compete for compute, when multiple teams with different query patterns need independent compute scaling, or when the team wants per-workload cost attribution and per-workload performance SLAs. Snowflake's Virtual Warehouse model makes workload isolation a configuration option rather than an architectural investment: creating a second warehouse for ETL is a one-line Snowflake command. Snowflake's Time Travel (30-day historical data snapshots available for query at no additional storage cost) is a unique capability for data debugging — querying what a table contained at a specific timestamp without maintaining a separate snapshot table. Snowflake's Zero-Copy Cloning (creating a writable copy of a table without duplicating underlying storage) enables per-analyst development environments where each analyst has their own copy of the data to experiment with, without incurring the storage cost of a full table copy. The cost structure requires active management: Snowflake's default Virtual Warehouse configuration does not have a budget ceiling, and a warehouse that is not configured with auto-suspend can run indefinitely at full compute cost while idle. The Snowflake spend monitoring and alerting setup — account usage queries to track credit consumption by warehouse per day, Snowflake budget objects for project-level spending limits, or third-party cost management tools — is part of the setup work that belongs in the ADR, not an afterthought when the first unexpected credit consumption invoice arrives. The startup decision log context: Snowflake's per-second warehouse billing is a cost model that requires active awareness in a way that Redshift's fixed monthly cluster cost does not — the risk of cost surprise is different in character, not in magnitude.

Amazon Redshift is the correct default for teams running the majority of their infrastructure on AWS and for workloads where predictable, fixed monthly cost is preferred over variable per-query or per-second billing. Redshift RA3 nodes with Redshift Managed Storage decouple compute from storage: the cluster size determines query throughput and concurrency, while data volume can grow independently without requiring a compute resize. Redshift's columnar compression (ENCODE keys on column definitions) and sort keys (SORTKEY on the table definition) enable very low storage overhead and efficient range-scan queries when the data distribution and query patterns are well-understood and correctly modeled at table creation time. The operational investment for Redshift is front-loaded: correctly designing DISTKEY (data distribution key) and SORTKEY for each table requires understanding the query access patterns before the data is loaded. A DISTKEY that produces data skew — where one node holds 80% of the rows for a high-cardinality dimension — degrades all queries that join on that key, because the join requires data redistribution across nodes. The VACUUM and ANALYZE operations (periodically required to reclaim deleted-row storage space and update query planner statistics) are maintenance operations that Redshift does not always schedule automatically — a team that never runs VACUUM on a high-delete table will see query planner degradation over time as stale statistics cause poor join order selection. These operational details are the configuration decisions that belong in the ADR — not in the setup session that ends when the first dbt run completes successfully. See the database migration strategy decision record for the parallel discussion of migration discipline in the OLTP tier; the same documentation discipline applies to the analytical tier, where dbt model changes and Redshift schema changes interact in ways that can silently break downstream BI queries.

ClickHouse is the correct choice when sub-second query latency at high cardinality is a product requirement — when the analytical queries power a user-facing product feature (a real-time dashboard served to end users, a search interface over behavioral data, a fraud detection alert system that must query billions of events in milliseconds) rather than an internal analyst workstation. ClickHouse's columnar, vectorized query engine achieves 10–1,000x lower query latency than managed warehouses on count-distinct and aggregation queries over high-cardinality datasets, because ClickHouse is optimized specifically for this workload while BigQuery, Snowflake, and Redshift are general-purpose analytical platforms. The operational tradeoff is significant: ClickHouse's performance advantage comes from explicit schema design choices (choosing the right PRIMARY KEY expression in the MergeTree table engine, which doubles as the physical sort key and determines which queries benefit from primary key-based pruning), from careful partition management (TTL expressions for automatic data expiration, partition detach for archival), and from understanding ClickHouse's merge behavior (background merges that consolidate inserted parts, which can cause write amplification and CPU spikes that affect concurrent read queries). Self-hosted ClickHouse requires infrastructure management that BigQuery, Snowflake, and Redshift do not. ClickHouse Cloud provides the managed deployment model with consumption-based pricing, but the SQL dialect commitment remains: ClickHouse-specific functions (arrayFlatten, quantileTDigest, argMax, groupArrayMovingSum, and hundreds of others) do not have direct equivalents in BigQuery or Snowflake SQL, so a migration off ClickHouse requires query rewriting work beyond what a schema export covers. For teams that need ClickHouse performance, the correct approach is to treat it as the right tool for a specific product requirement — not as a general-purpose analytical warehouse replacement — and to document that scope explicitly in the ADR so future engineers understand why ClickHouse was chosen for one data product and BigQuery for another.

AI chat session types and what each one misses

The data warehouse setup follows a predictable pattern of AI chat sessions. The WhyChose extractor surfaces these sessions from chat export files, and the structural decisions they omit are consistent across the decision records we've reviewed. The setup session choices close when the first query returns results — not when the structural properties of the warehouse selection have been considered and documented.

The initial warehouse setup session covers: creating the warehouse project or cluster (BigQuery dataset and project; Snowflake account and database; Redshift cluster with VPC and security group; ClickHouse Cloud service), configuring IAM roles or database users with read/write access, connecting the first data source via Fivetran or Airbyte, and verifying that a test query against the loaded data returns results. The session is complete when the analyst can log into the BI tool and build a dashboard. What the session does not cover: whether tables are partitioned and clustered (BigQuery) or have correct DISTKEY and SORTKEY (Redshift) for the expected query patterns; whether a WLM configuration separates ETL from analyst workloads (Redshift); whether separate Virtual Warehouses are created for different workload types (Snowflake); what the BI tool refresh interval means in terms of BigQuery query cost or Snowflake warehouse uptime; and what the freshness SLA for each data source is and whether the batch sync interval matches it.

The dbt project setup session covers: installing dbt Core or signing up for dbt Cloud, configuring the dbt profile to connect to the warehouse, creating the first staging models that clean and rename source tables, building the first mart models that join and aggregate for analyst use, and running dbt build to confirm the models compile and execute. What the session does not cover: whether dbt model materializations (table vs view vs incremental vs snapshot) are chosen to match query freshness requirements — an incremental model that appends only new rows to an existing table runs in seconds rather than minutes for a full rebuild, but only if the underlying table has the correct sort order and index for the incremental filter; whether the dbt project includes generic tests on every staging model (not_null, unique, accepted_values) that catch upstream data quality issues before they propagate into marts; whether schema changes in source systems are handled with dbt's source freshness checks and alerting; and whether the dbt run schedule (hourly, every four hours, nightly) matches the freshness SLA documented for each data source. The dbt model definition is the schema specification for the analytical layer — the same documentation discipline that applies to OLTP schema migrations (as covered in the database migration strategy decision record) applies to dbt model changes, where a breaking change to a staging model's column names silently breaks every downstream mart model and BI dashboard query that references the renamed column.

The BI tool connection session covers: connecting Looker, Tableau, or Metabase to the warehouse credentials, building the first data model or LookML view, creating the first dashboard, and setting the dashboard refresh interval. What it misses: for BigQuery, whether the LookML views specify a required partition filter (Looker's required_access_filter LookML parameter, which prevents full-table scans by requiring users to select a date range before a query executes); for Snowflake, which Virtual Warehouse the BI tool connection routes to and whether that warehouse's auto-suspend idle period is appropriate for the dashboard refresh frequency (a BI connection with a five-minute refresh and a Snowflake warehouse with a five-minute auto-suspend will incur continuous warehouse runtime because each resume-query-suspend cycle completes in about four minutes); for Redshift, whether the BI tool's connection pool holds multiple simultaneous connections that consume WLM slots even when no queries are active. The BI tool session ends when the dashboard displays data. The cost and performance implications of the refresh configuration against the underlying warehouse compute model are not visible until the billing statement arrives or the query wait times accumulate.

The performance optimization session covers: a specific query that is running slowly, identifying the query's execution plan (BigQuery query plan via the Execution Details tab; Snowflake query profile; Redshift EXPLAIN; ClickHouse EXPLAIN PIPELINE), and applying a fix to the query — adding a filter, rewriting a subquery as a CTE, or adding a materialized intermediate table. What it misses: whether the slow query is symptomatic of a structural warehouse configuration issue (wrong partition key, wrong sort key, wrong WLM configuration, undersized Virtual Warehouse) versus a query-specific issue; whether the fix applied to this query will prevent the same class of issue on future queries; and whether the performance problem is co-occurring with a cost problem (a BigQuery query that takes 120 seconds because it has no partition filter is also scanning the full table and billing $5/TB on every execution). The performance optimization decision record pattern applies here: a single slow query is a symptom, and treating the symptom without diagnosing the structural cause means the symptom recurs on the next query against the same table or the next time concurrency increases past the WLM slot limit.

The pattern across all four session types is consistent with what the decisions never written down post describes: the setup session's success criterion is "does it work?" — and it always works, because the first query always returns results. The structural decisions that determine whether it scales — partition strategy, workload isolation, cost governance, freshness architecture — are invisible at the moment the success criterion is met. They become visible only when scale or cost reveal them.

Five ADR sections for data warehouse selection

A data warehouse ADR that prevents the incidents described in this post covers five sections that teams consistently skip.

First, the warehouse selection with alternatives and rejection reasons, and re-evaluation triggers. The ADR records which warehouse was chosen, which alternatives were evaluated, the rejection reasons for each, and the specific conditions under which the selection should be re-evaluated. "BigQuery chosen over Snowflake because: team is GCP-primary, expected query frequency is low (5–15 queries per analyst per day), and the per-query cost model provides zero idle-compute cost at current stage. Snowflake evaluated and rejected: workload isolation advantage not yet material at two-analyst team size; per-second warehouse billing introduces idle cost risk that per-query billing avoids. Re-evaluate Snowflake when: analyst team exceeds eight people, ETL and analyst query concurrency regularly exhaust BigQuery slot capacity during peak hours, or BI tool dashboard count exceeds 20 and per-query cost exceeds $500/month consistently." The rejection reasons prevent the re-evaluation session from starting at zero, and the re-evaluation triggers make the switch criteria explicit rather than "whenever someone advocates for a different warehouse." This mirrors the approach documented in the database vendor decision record — scaling triggers should be specific and measurable, not feelings.

Second, the compute model and cost isolation policy. For BigQuery: which tables are partitioned, on which column, with what partition type (by day, by hour, by ingestion time), and what clustering columns are applied; whether any table is intentionally left unpartitioned and why; the policy for new table creation (all tables above N GB must have a partition key before data is loaded); and whether on-demand or flat-rate slot commitments are used, at what monthly spend threshold the switch to flat-rate is triggered. For Snowflake: which Virtual Warehouses exist, what size each is, what workloads route to each, what the auto-suspend interval is per warehouse, and the monthly credit budget ceiling per warehouse with the alerting mechanism that fires when a warehouse exceeds its budget. For Redshift: the WLM configuration — number of queues, concurrency per queue, memory allocation per queue, and which IAM roles or database users route to which queue — and the policy for adding a new queue when a new workload type is introduced. For ClickHouse: the table engine (MergeTree, ReplacingMergeTree, SummingMergeTree), the PRIMARY KEY expression per table, and the partition expression, with the reasoning for each. None of these are visible in the warehouse connection credentials or the dbt project. They are the analytical infrastructure decisions that every engineer who ever writes a new table or modifies a query depends on, and they are the section of the ADR that was missing in both incidents described at the start of this post. The ADR format guidance for the Consequences section applies here: state the specific cost ceiling and the specific performance degradation that the configuration is designed to prevent, not just the configuration itself.

Third, the data freshness model per data source. The ADR documents each data source, its freshness SLA, and the mechanism by which that SLA is met. "Orders data (from PostgreSQL application database via Fivetran): freshness SLA T+15 minutes; mechanism: Fivetran CDC connector with 15-minute sync interval; alert fires in PagerDuty when last successful sync exceeds 30 minutes. User events (from application via Segment): freshness SLA T+4 hours; mechanism: Segment → S3 → Airbyte batch sync every four hours; alert fires in Slack #data-alerts when sync job fails. Third-party CRM data (from Salesforce via Fivetran): freshness SLA T+24 hours; mechanism: Fivetran scheduled sync at 2am daily; acceptable for weekly reporting use cases; no intraday alert." The freshness SLA is a commitment to the analyst team about what they can rely on when reading a dashboard mid-day. Without it, analysts query data of unknown freshness and either over-trust stale numbers or under-trust fresh ones. Without documented alerting, freshness failures are discovered by analysts noticing that yesterday's numbers haven't changed in six hours — not by an automated system catching the failure within minutes. The freshness model is the data infrastructure equivalent of the deployment pipeline's change detection model in the CI/CD pipeline decision record: it determines how quickly a failure propagates to user-visible impact.

Fourth, the query governance model with cost ceiling and escalation path. The ADR documents who can run ad-hoc queries against the warehouse, whether a query cost review is required before expensive queries are executed, and what the monthly cost ceiling is for the analytical tier. For BigQuery: the per-query dry-run policy (for any query expected to scan above 100 GB, run the dry-run API call first to confirm bytes-to-be-scanned before executing); the project-level spending limit (set at the Google Cloud project level to cap total BigQuery spend per month); and the cost attribution approach (which cost label or project maps to which team or product area). For Snowflake: the resource monitor configuration (a Snowflake object that triggers a warning notification and optionally suspends a warehouse when credit consumption exceeds a defined threshold per time period); the cost attribution by warehouse (Snowflake's account usage WAREHOUSE_METERING_HISTORY view provides per-warehouse credit consumption per hour); and the approval process for creating a new warehouse or resizing an existing one. For Redshift: the WLM concurrency scaling policy (whether Redshift Concurrency Scaling, which adds transient cluster capacity during queue depth spikes at additional cost, is enabled and at what queue depth threshold). The query governance model is not bureaucracy — it is the mechanism that prevents the $1,200 BigQuery bill from happening on month two after the first month's incident triggered a fix for only the most expensive tiles. A governance model documented in the ADR is a governance model that new engineers inherit and apply. A governance model that exists only in the memory of the engineer who fixed the first incident is a governance model that disappears when that engineer leaves.

Fifth, the schema evolution and model promotion model. The ADR documents how new dbt models and schema changes are reviewed and promoted from development to production, what the testing requirement is before a dbt model change is deployed, and what the rollback procedure is when a dbt model change breaks downstream queries. "New dbt models and changes to existing models require: (1) a dbt test suite that includes not_null and unique tests on all primary keys and relationship tests for all foreign key joins; (2) a pull request review by one data engineer before merging to main; (3) a dbt Cloud CI job that runs dbt build --select state:modified+ against a dev schema before merge approval; rollback: revert the git commit and re-run dbt build for the previous model version — Redshift/BigQuery schema changes to add columns are forward-only and require a separate migration if column removal is needed." Schema evolution in the analytical tier breaks downstream BI dashboards in a different way than OLTP schema migration breaks application code: a renamed dbt model output column silently returns NULL or errors in every Tableau calculated field, LookML dimension, or Metabase question that references the old column name — and the failure may not be noticed until an analyst or executive runs a report and sees a chart with no data. The model promotion model is the governance mechanism that catches this failure before it reaches production. The database migration strategy decision record covers the parallel governance for OLTP schema changes; the analytical tier needs the same discipline applied to dbt model changes, with the additional complexity that dbt model lineage means a breaking change to a staging model propagates to every mart model and dashboard that depends on it through the dbt DAG.

None of these five sections are in the dbt project, the Fivetran connector configuration, or the BigQuery/Snowflake/Redshift console settings. They represent the analytical infrastructure reasoning that every data analyst, analytics engineer, and data engineer who joins the team after setup depends on — and that surfaces as an onboarding gap when the new hire's first query produces a 45-second wait time, an unexplained BI dashboard showing stale data, or a $1,200 line on a cloud billing statement. The WhyChose extractor surfaces the warehouse setup and ETL pipeline sessions from AI chat history; the ADR is what takes the reasoning from those sessions and makes it legible to the team that inherits the decisions. The multi-region deployment decision record includes a data residency section for the analytical tier — where data is stored and whether that storage location satisfies GDPR data residency requirements for EU customer data in the warehouse. That residency decision is also made at warehouse setup time, and like the partition key, the WLM policy, and the freshness SLA, it is invisible once the warehouse is running and only discovered under audit pressure.

FAQs

Does BigQuery actually charge per query and how can teams control the cost?

BigQuery's on-demand pricing charges $5 per terabyte of data scanned per query — applied to the amount of data read from storage, not to the query's execution time, complexity, or row count returned. A query that scans a 500 GB table costs $2.50 whether it returns one row or one million rows, and whether it runs in two seconds or two minutes. This pricing model has a structural implication for BI tools: most BI tools execute queries on a schedule to refresh dashboard tiles, often every five minutes for operational dashboards. Each scheduled refresh executes one SQL query per tile against the underlying BigQuery table. If a dashboard has 40 tiles that each scan 800 GB without a partition filter, each five-minute refresh cycle costs approximately $160. At a five-minute interval, 12 times per hour, that is $115,200 per month for a single dashboard — an extreme case, but the $1,200 incident described at the start of this post was the realistic version at 200 GB average scan per query, with Looker caching reducing actual BigQuery executions by two-thirds.

The two mechanisms for controlling BigQuery query costs are partitioning and flat-rate slot commitments. Partitioning a BigQuery table on a date column (using PARTITION BY DATE(event_timestamp) in the table DDL or via Fivetran's BigQuery connector partition setting) means that a query with a WHERE event_timestamp BETWEEN '2026-06-01' AND '2026-06-30' condition reads only June's partition — roughly 1/14th of a 14-month table, reducing the scan from 800 GB to approximately 60 GB, reducing the query cost from $4.00 to $0.30. Adding a CLUSTER BY user_id, event_type clause further prunes the scan within the partition when the query filters on those columns. BigQuery's dry-run API (available in the console query validator and via the BigQuery SDK's job_config.dry_run = True parameter) estimates bytes-to-be-scanned before executing a query — the correct practice for any query expected to scan above 100 GB. Flat-rate slot commitments ($2,000/month for 100 committed slots as of 2024 pricing) convert the per-query cost to a fixed capacity cost — the correct choice when monthly on-demand costs consistently exceed $2,000 and when the workload can be shaped to fit within the committed slot capacity. The BigQuery table partition key is the cost control decision made at table creation time; it cannot be added retroactively to a populated table without recreating it.

When should a startup choose Snowflake vs Redshift vs BigQuery for their first data warehouse?

The choice depends on three factors: existing cloud provider, expected query concurrency pattern, and tolerance for always-on infrastructure cost. BigQuery is the correct default when the team is GCP-primary and expects infrequent analytical queries — the per-query cost model means zero idle cost, and the serverless model means zero cluster management. The risk is BI tool scan cost if partition strategy is not configured at setup. Redshift is the correct default when the team is AWS-primary and expects a consistent, predictable analytical workload — the always-on cluster cost is independent of query volume, which means predictable monthly cost but idle compute spend during nights and weekends. Snowflake is the correct choice when workload isolation between ETL and analyst queries is a first-class requirement — Snowflake's Virtual Warehouse model provides per-workload compute isolation that Redshift's shared-cluster WLM and BigQuery's shared slot pool do not match. ClickHouse is the correct choice only when sub-second query latency over billions of rows is a product requirement, because no managed warehouse matches ClickHouse's raw columnar query throughput.

For most early-stage B2B SaaS companies with fewer than ten analysts, BigQuery is the correct default on GCP and Redshift is the correct default on AWS — not because they are superior to Snowflake in all dimensions, but because they align with the existing cloud provider relationship and because the workload isolation advantage of Snowflake's multi-warehouse model is not material at two to four analysts. The time to switch to Snowflake is when ETL and analyst workloads are regularly producing performance interference on a shared Redshift cluster, or when the team needs per-workload cost attribution across four or more distinct workload types. The ADR should document this re-evaluation trigger explicitly — "evaluate Snowflake when analyst team exceeds eight people or when ETL window regularly degrades analyst query p99 above 10 seconds" — so the re-evaluation happens at the right time rather than when an advocate happens to push for it in a planning session.

What is the difference between a data warehouse and an OLTP database for architecture decisions?

An OLTP database (PostgreSQL, MySQL, PlanetScale) is optimized for high-frequency writes and low-latency single-row reads — it is the database your application writes to when a user creates an account or places an order. An analytical data warehouse (BigQuery, Snowflake, Redshift, ClickHouse) is optimized for full-table-scan queries that aggregate millions or billions of rows — it is the database your analysts query to answer cohort analysis, funnel conversion, or revenue attribution questions. The architectural distinction is columnar storage: an OLTP database stores rows together on disk (row-oriented), so reading one column across a million rows requires reading all columns of all rows; a columnar warehouse stores each column's data together, so reading one column across a billion rows reads only that column's storage pages, achieving 10–100x lower I/O for aggregation queries.

The decision record for a data warehouse is a separate document from the database vendor decision record for the OLTP database, because the structural properties differ entirely: the OLTP record covers referential integrity enforcement, schema migration locking risk, connection pool model, and horizontal scaling architecture. The warehouse record covers query compute model, data freshness SLA, cost isolation policy, and schema evolution in the dbt transformation layer. A startup can correctly choose PostgreSQL for OLTP and BigQuery for analytics, and neither decision constrains the other. The OLTP database is the system of record that the application writes to; the warehouse is the analytical replica that the analysts read from. Moving data between them — via change data capture, scheduled ETL, or event streaming — is the pipeline architecture decision that connects the two, and it belongs in neither the OLTP record nor the warehouse record but in a third document covering the ETL/ELT strategy.

What should a data warehouse ADR document that teams typically skip?

Teams typically document the warehouse name, the connection credentials, and the dbt project structure. The sections that prevent the incidents described in this post are: first, the compute model and cost isolation policy — for BigQuery, the partition and cluster key for every table and the policy for new table creation; for Snowflake, the Virtual Warehouse roster with size, auto-suspend interval, and workload assignment per warehouse; for Redshift, the WLM queue configuration with concurrency and memory allocation per queue; the monthly cost ceiling and the alerting mechanism that fires before the ceiling is reached. Second, the workload isolation design — whether ETL and analyst workloads share compute (Redshift shared cluster; BigQuery shared slot pool) or are isolated (Snowflake separate warehouses), what the policy is when both workloads compete at peak, and how concurrency limits are enforced. Third, the data freshness SLA per source — the freshness commitment to analysts for each data source, the mechanism by which it is met (batch sync interval, streaming pipeline, or CDC connector), and the alerting that fires when a freshness SLA is breached.

Fourth, the schema evolution model — how new dbt models and schema changes are reviewed, tested, and promoted to production, what tests are required before a model change is merged, and what the rollback procedure is when a model change breaks downstream BI queries. Fifth, the query governance model — who can run ad-hoc queries, whether a dry-run cost estimate is required before expensive queries are executed, and what the escalation path is when a query or dashboard is discovered to be consuming disproportionate cost. None of these five sections are visible from the dbt project or the warehouse console settings. They are the analytical infrastructure reasoning that every data analyst, analytics engineer, and data engineer who joins the team after setup depends on — and that is missing when the WhyChose extractor surfaces the warehouse setup and ETL configuration sessions from AI chat history without a corresponding decision record to explain the reasoning behind the choices.