The database migration strategy decision record: why the schema migration approach you chose determines how safely you can evolve your data model at production traffic
Database migration tooling is chosen when the project starts, usually in the same hour the ORM is configured, and rarely revisited as a decision. Two years later, the migration approach you chose determines whether adding a column to a 200-million-row table takes 3 seconds or 40 minutes, whether a failed deploy can be rolled back without data loss, and whether the team can apply schema changes during a rolling deploy without breaking the version that is still running. None of these consequences were visible at the beginning. None of them are written down.
An engineering team deploys a new feature on a Tuesday afternoon. The release contains a migration that renames a column in the users table from email_verified to is_email_verified. The migration runs in 800 milliseconds on the local development database. It runs in 2 seconds on staging. On production, with 4.1 million rows and active query load, it acquires an ACCESS EXCLUSIVE lock that blocks every query touching the users table for 47 seconds.
During those 47 seconds, login fails for every user who attempts to authenticate. The API returns 500 errors. The error monitoring dashboard spikes. The on-call engineer receives a PagerDuty alert. They check the deployment, see the migration in progress, wait for it to complete. The login errors stop when the migration commits.
The post-mortem identifies the root cause: renaming a column in PostgreSQL requires an ACCESS EXCLUSIVE lock on the table, which blocks all concurrent readers and writers until the lock is acquired and the DDL operation completes. At production load, the lock acquisition waited 39 seconds for in-flight queries to complete, then held exclusively for 8 seconds to execute the rename — 47 seconds of total blocking time. The fix identified in the post-mortem: never rename a column in a single migration; use the expand-contract pattern across two deploys.
Like most foundational infrastructure decisions, the database migration strategy is visible as a tool — "we use Flyway" — but invisible as a decision with structural consequences. The tool choice determines what migration operations are available. The migration lifecycle policy determines whether someone reviews a migration before it runs in production. The zero-downtime constraint policy determines whether the expand-contract pattern is a team convention or a surprise discovered during an incident. Without a documented strategy, the ACCESS EXCLUSIVE lock incident is the mechanism by which the team learns the constraint. With it, the engineer writing the column rename migration reads the ADR, recognizes the pattern, and writes two migrations instead of one before the deploy ever happens.
Why database migrations are architectural decisions, not just scripts
A database migration file is easy to treat as a routine artifact — a SQL script that runs once and is forgotten. The schema history table in Flyway or Liquibase records that the migration ran successfully, and the migration is never referenced again. This framing misses what the migration represents: a permanent, hard-to-reverse change to the data model that every application version from this point forward must account for.
The properties that make migrations architectural decisions rather than routine scripts:
Migrations are effectively irreversible at scale. A migration that adds a column is trivially reversible in development: drop the column, and the schema is back to its prior state. A migration that adds a column to a table with 50 million rows and then backfills a default value across all existing rows is not trivially reversible in production: the backfill may have taken 8 hours, and rolling it back means another 8-hour operation. A migration that renames a column is not safely reversible at all while the application is running: rolling back the rename requires running the rename again in the other direction, under the same lock conditions, while the rolled-back application version expects the old column name. The irreversibility of migrations at scale is not a Flyway property or a PostgreSQL property — it is a consequence of the combination of production data volume and the application's dependency on the current schema state.
Migrations carry DDL locking behavior specific to the database engine. MySQL and PostgreSQL handle DDL locking differently, and the safe-versus-unsafe migration operations differ between engines. In PostgreSQL, most DDL operations require an ACCESS EXCLUSIVE lock — the most restrictive lock mode, which conflicts with every other lock mode including reads. ALTER TABLE ... ADD COLUMN with a non-nullable default value in PostgreSQL before version 11 required rewriting the entire table (safe columns: nullable, or not-null with a volatile default added separately). PostgreSQL 11+ handles ADD COLUMN ... NOT NULL DEFAULT 'value' as a metadata operation when the default is a constant, without rewriting the table — but this behavior is engine-version-specific. Creating an index in PostgreSQL requires either a full table lock (standard CREATE INDEX) or a lock-free background build (CREATE INDEX CONCURRENTLY) that takes longer but does not block reads or writes. In MySQL with InnoDB, the online DDL introduced in MySQL 5.6 handles many DDL operations with a concurrent DML lock rather than a full table lock — but the behavior depends on the specific operation and the MySQL version. Platform teams who manage the database engine version and the upgrade policy own the constraint on which DDL operations are safe — and those constraints must be documented somewhere accessible to every developer writing a migration, not discovered through incidents.
Migrations interact with rolling deploys in ways that create application errors. A rolling deploy replaces instances of the application incrementally rather than all at once. During the transition window, some pods run version N of the application and some run version N+1. If the schema migration for version N+1 has already been applied, the database contains schema that version N+1 expects — but version N's queries may fail against the new schema if the change is backward-incompatible. A column that was nullable in version N and is now NOT NULL after the migration can cause INSERT failures from version N pods. A column that was renamed breaks SELECT queries from version N pods the moment the migration is applied. An index that was dropped for version N+1 (because the query that used it was removed) can cause query plan degradation or failures for version N pods that still run that query. These interactions are migration-design constraints: a migration that is backward-incompatible with the current application version must be deployed differently than one that is forward-compatible.
Migration tooling families and the version tracking model
Database migration tooling falls into three families with different version tracking models, different conflict resolution behaviors in branching workflows, and different constraints on what migrations can contain.
Versioned migration tools (Flyway, Liquibase) manage migrations as a linear or branching sequence of version-numbered files. Flyway uses a flyway_schema_history table to record which migrations have been applied, identified by a version number extracted from the migration filename — V20240315__add_user_preferences_table.sql carries version 20240315. Flyway's default behavior is to fail the migration run if any migration in the version sequence has been applied out of order (the outOfOrder flag controls this). In a team with multiple developers working in parallel feature branches, each developer writes a migration with a version timestamp. If developer A's migration (V20240315_083000__add_index.sql) is merged before developer B's migration (V20240315_091000__add_column.sql), Flyway applies A's migration first, B's migration second — the correct order by timestamp. The conflict scenario arises when B's branch was cut before A's migration existed, B develops and tests locally with a schema that doesn't include A's index, and then B's branch is merged. Flyway sees B's migration as applied out of order relative to A's — a failure in strict mode, or a silent concern in outOfOrder mode. The migration file naming convention (whether to use integer sequences, timestamps, or date-hour-minute-second strings) determines how often these conflicts occur and whether they can be resolved automatically.
Liquibase uses a changeset model rather than a filename version model: each changeset has an id and an author attribute, and the execution order is determined by the order of changesets in the changelog XML or YAML file rather than by a filename sort. Liquibase's model is more explicit about ordering (the order in the changelog file is authoritative) but more verbose and more dependent on a correct manual merge of the changelog file when two branches both add changesets. Liquibase's rollback support is more first-class than Flyway's: changesets can include a <rollback> block that defines the inverse operation, which Liquibase executes when a rollback is requested. Whether rollback blocks are worth maintaining is a migration strategy decision — automated rollback is valuable for simple DDL changes but difficult to write correctly for data migrations that transform row values.
ORM-managed migrations (Rails ActiveRecord, Django, SQLAlchemy/Alembic, Prisma Migrate) generate migration files automatically by diffing the current model definitions against the current schema state. A developer adds a field to a Django model and runs python manage.py makemigrations; Django generates a migration file that adds the corresponding column. The convenience of auto-generation comes with a trade-off: the generated migration reflects what the ORM infers the change requires, not necessarily what is safe to run in production. Rails' db:migrate on a table with 50 million rows does not warn before running a full-table ALTER. Django's makemigrations generates an AddField operation that maps to a straightforward ALTER TABLE without considering the CONCURRENTLY keyword for index creation. Teams using ORM-managed migrations typically develop conventions for overriding the generated migration with a production-safe equivalent — replacing a generated index creation with a CONCURRENTLY index creation, or splitting a generated column rename into two migrations. These conventions are decisions that belong in the migration strategy ADR, because a new team member will follow the framework's defaults unless explicitly informed that the team's conventions override them.
State-based migration tools (Atlas, Sqitch, DbDiff) manage the schema as a desired end state rather than as a sequence of migration operations. The developer declares what the schema should look like; the tool diffs the current schema against the desired state and generates the migration operations to move from current to desired. Atlas, for example, takes a schema definition file (HCL or SQL) and generates the migration SQL by comparing the definition against the live database schema. State-based tools are appealing for their ability to always produce a minimal migration — if a column was renamed and then renamed back, a state-based tool sees net zero change and generates no migration, while a versioned tool applies both renames. The trade-off is that state-based tools have less control over the migration execution plan: the tool decides how to achieve the desired state, which may not match the production-safe approach. A state-based tool that decides to achieve a column rename by dropping the old column and adding a new column (rather than ALTER TABLE ... RENAME COLUMN) would cause data loss. Atlas and similar tools have safety annotations that prevent data-loss operations without confirmation, but the safety model must be understood and documented rather than assumed.
DDL locking behavior and zero-downtime migration constraints
The most consequential migration strategy decision for production safety is the classification of DDL operations into safe-by-default and requires-special-handling categories, based on the database engine's locking behavior. This classification is engine-specific and version-specific, and it is the primary thing a developer needs to know before writing a migration against a production database.
For PostgreSQL (the most common engine for teams who have hit this constraint in production):
Operations that are safe to run without special handling on a live production table (they acquire brief locks that do not block concurrent reads for meaningful durations): adding a nullable column with no default value (PostgreSQL adds the column as a metadata operation, no table rewrite required, lock held for milliseconds); adding a NOT NULL column with a constant default value in PostgreSQL 11+ (stored as a metadata operation with a default recorded in pg_attribute, applied to new rows and old rows on access, no table rewrite); dropping a column (marks the column as dropped in pg_attribute, no rewrite, brief lock); adding a foreign key constraint with NOT VALID (creates the constraint in a state that validates only new rows, does not scan existing rows — the expensive validation step is deferred to a separate VALIDATE CONSTRAINT statement, which uses a weaker lock).
Operations that require special handling to avoid blocking production traffic: CREATE INDEX (acquires a SHARE lock that blocks writes for the duration of the index build — use CREATE INDEX CONCURRENTLY instead, which builds the index using a background scan with weaker locking, takes longer but does not block writes); ALTER TABLE ... ADD CONSTRAINT ... CHECK (validates the constraint against all existing rows under an ACCESS EXCLUSIVE lock — use ADD CONSTRAINT ... NOT VALID followed by VALIDATE CONSTRAINT in a separate transaction or migration step, since VALIDATE CONSTRAINT uses a weaker SHARE UPDATE EXCLUSIVE lock); ALTER TABLE ... RENAME COLUMN (brief ACCESS EXCLUSIVE lock for the metadata update, but backward-incompatible with the current application version — any pod running code that references the old column name fails after the rename); ALTER TABLE ... DROP COLUMN of a column that is still referenced by the current application version (the drop succeeds at the database level, then the application code fails on queries referencing the dropped column); ALTER TABLE ... ALTER COLUMN ... TYPE with an incompatible type (requires a full table rewrite under ACCESS EXCLUSIVE lock).
The online schema change tools address the large table problem for operations that cannot be made safe with PostgreSQL's native CONCURRENTLY approach. pg_repack is a PostgreSQL extension that rebuilds a table and its indexes without holding long locks — it creates a new copy of the table, applies changes to the copy while triggering log the original's changes, and then swaps the old table for the new one under a brief final lock. gh-ost (GitHub's online schema change for MySQL) works similarly for MySQL InnoDB tables: it creates a ghost table with the new schema, copies rows in batches while tailing the binary log for concurrent changes, and then does a final cutover swap. These tools are necessary when the operation requires a table rewrite that would be unsafe at the table's current size — a 300-million-row table rewrite that takes 45 minutes under a full table lock is not acceptable in a zero-downtime production environment.
Like the cache invalidation strategy, the migration safety classification requires explicit policy rather than implicit awareness. A developer who has not encountered a production locking incident does not know which operations are safe and which are not. A developer who has encountered one incident knows the specific constraint they hit but may not know the full classification. The migration strategy ADR records the full classification once, in a place where it can be consulted before any migration is written, rather than accumulated through a series of production incidents each teaching one constraint at a time.
The expand-contract pattern for rolling deploy compatibility
The expand-contract pattern is the structured approach for applying backward-incompatible schema changes without taking downtime or causing application errors during a rolling deploy. It is the migration pattern most frequently discovered through incidents rather than through deliberate design — teams learn it the hard way when a column rename causes 5xx errors for the duration of a rolling deploy.
The pattern has three phases:
Phase 1: Expand. Add the new schema element alongside the old one. For a column rename from email_verified to is_email_verified: add the new column is_email_verified as a nullable column alongside the existing email_verified column. Update the application code to write to both columns on INSERT and UPDATE, and to read from the new column with a fallback to the old column (or to populate the new column from the old column for existing rows via a backfill). Deploy this version of the application with the expand migration applied. At the end of Phase 1, both the old and new column exist in the database, all new writes go to both, and all existing rows have been backfilled. Every running application version — both the version before Phase 1 and the version after — can read correctly from the database.
Phase 2: Contract. Remove the old schema element. For the column rename: add a migration that drops the old email_verified column. Update the application code to remove all references to email_verified — the write-to-both and read-with-fallback logic is removed, and the code references only is_email_verified. Deploy this version of the application with the contract migration applied. The contract migration is safe to apply because the Phase 2 application code has already been fully deployed before the contract migration runs — no running pods reference the old column anymore.
The deployment gap between phases. The expand migration and the contract migration must be separated by at least one full rolling deploy cycle — the time it takes for the Phase 1 application version to fully replace the Phase 0 version across all pods. The minimum deployment gap policy depends on the deploy duration: if a rolling deploy takes 8 minutes to cycle all pods, the deployment gap must be at least 8 minutes. If the team performs hourly deploys, the deployment gap is the time between the Phase 1 deploy and the Phase 2 deploy — at minimum one deploy cycle after Phase 1 is complete. Like the error handling strategy's policy on which errors are retried and which are not, the deployment gap policy is a specific operational commitment that must be documented — an implicit assumption that "the old version will be gone soon enough" is not a policy.
The expand-contract pattern applies beyond column renames. Any schema change that is not backward-compatible with the current application version requires it: removing a column that current code still reads, changing a column type in a way that current code cannot handle, removing a table that current code still queries, removing a unique constraint that current code relies on. The test for whether a migration requires expand-contract is the backward-compatibility test: if the migration is applied and the current running application version continues to function correctly, the migration is backward-compatible and can be applied directly. If the current running application version would produce errors after the migration is applied, the migration requires the expand-contract sequence.
One class of operations that often surprises teams: adding a NOT NULL constraint to an existing column. The column may already have no null values in existing rows, so the constraint addition seems like a metadata operation. In PostgreSQL, ALTER TABLE ... ADD CONSTRAINT ... NOT NULL requires an ACCESS EXCLUSIVE lock and a full table scan to validate the constraint against all existing rows — both potentially slow at scale. The NOT VALID + VALIDATE CONSTRAINT approach separates these: ADD CONSTRAINT ... NOT NULL NOT VALID adds the constraint for new rows under a brief lock without scanning existing rows; VALIDATE CONSTRAINT scans existing rows under a SHARE UPDATE EXCLUSIVE lock that permits concurrent reads and writes. Even with this split, the constraint addition is still a two-step operation with a brief period where the constraint is enforced for new rows but not yet validated for old rows — a behavioral state that the migration ADR should describe explicitly.
Large table migration policy: batch sizing and online schema change tools
Large table migrations are a distinct category that requires a separate decision framework from small-table DDL migrations. The threshold above which a table is considered "large" for migration purposes depends on the database engine, the server's I/O performance, and the team's acceptable migration duration — but a working threshold for most teams is approximately 10 million rows or 10 GB of table data on a table with active production load.
The large table migration decision has two sub-cases: DDL changes that restructure the table (add column, add index, change column type) and data migrations that transform existing row values (backfill a new column, normalize a column value, populate a foreign key from a joined table).
DDL changes on large tables. For index creation, PostgreSQL's CREATE INDEX CONCURRENTLY is the standard tool — it builds the index in a background pass that does not block reads or writes, at the cost of taking approximately 2–3× longer than a standard index build and requiring two table scans instead of one. For table restructuring that requires a table rewrite (changing a column type, adding a column with a non-constant default in PostgreSQL before version 11), the online schema change tools are necessary above the size threshold. pg_repack is the standard choice for PostgreSQL — it requires the extension to be installed on the database server, handles foreign key constraints, and produces a result equivalent to running VACUUM FULL plus the DDL change without the extended lock. For MySQL, gh-ost is the standard choice for teams that cannot use MySQL's native online DDL for a specific operation — gh-ost's binary log tailing approach means it works correctly even for operations where InnoDB's online DDL requires a rebuild of the table in a way that conflicts with concurrent DML.
Data migrations on large tables. A backfill migration that updates all existing rows — "populate the user_tier column based on the user's subscription records" — cannot be run as a single UPDATE statement on a table with 50 million rows. A single UPDATE on 50 million rows acquires a row-level lock on every row for the duration of the transaction, blocks concurrent updates to those rows, produces a transaction that is too large to roll back efficiently if something goes wrong, and generates a write-ahead log (WAL) spike that can slow replication lag to unacceptable levels. The standard approach is batched updates: divide the table into batches by primary key range and update each batch in a separate transaction, with a configurable delay between batches to allow replication to catch up and to avoid starving concurrent queries. The batch size needs to be calibrated against the table's row size, the write throughput of the database, and the replication lag budget — a common starting point is 1,000–10,000 rows per batch with a 10–100ms delay between batches, adjusted based on measured replication lag during a test run on staging. The progress tracking requirement follows from the execution time: a backfill that runs for 3 hours must report progress so the operating engineer can estimate completion time and detect stalls, rather than observing an opaque long-running process. A last_processed_id counter written to a migration state table or logged to stdout on each batch completion is the minimum viable progress tracking mechanism.
Like the data retention migration that purges old rows, the data migration's execution duration and its impact on the replication lag during execution are properties that must be measured on staging before the migration is scheduled for production. A backfill that takes 2 hours on staging at 20% of production row count may take 10 hours on production — a migration window that must be planned into the deployment schedule and documented in the migration runbook.
The rollback strategy: forward-only migrations and the deploy failure scenario
The database migration rollback strategy is one of the most commonly underdocumented aspects of the migration approach, and one of the most consequential in a production incident. The question "how do we roll back a deploy that included a migration?" has a different answer depending on whether the migration is reversible, whether the migration tooling supports undo operations, and whether the rollback procedure has been tested.
The two primary rollback postures:
Forward-only migrations. The team's policy is that schema migrations are never automatically reversed. A failed deploy is remediated by deploying a new version (the rolled-back application version plus a new migration that reverses the schema change if needed), not by running a migration tool's rollback command. This is the approach that most experienced engineering teams adopt in practice, for several reasons: most data migrations are not reversible without data loss (a migration that backfills a column from another column cannot be "un-backfilled" without discarding the backfilled values), the rollback migration is additional code that must be written, tested, and maintained but is only executed in failure scenarios, and the application rollback and the migration rollback are independent operations that must be coordinated carefully to avoid leaving the database in a state that neither the old nor the new application version can correctly read. The forward-only posture requires that the migration design is such that the rolled-back application version can tolerate the new schema: an additive migration (add column, add table, add index) leaves the schema in a state that the old application version can still read correctly, because the old code simply ignores the new elements it doesn't know about. A destructive migration (drop column, drop table) that was applied before the application rollback leaves the schema in a state that the rolled-back application version cannot read.
Reversible migrations with undo blocks. Flyway Pro and Liquibase support explicit rollback SQL — a separate SQL block that undoes the forward migration operation. A migration that creates a table includes a rollback block that drops the table. A migration that adds a column includes a rollback block that drops the column. The rollback is invoked when the migration tool's rollback command is run (Flyway's flyway undo, Liquibase's rollback command). Reversible migrations are valuable for simple DDL operations where the inverse is unambiguous and data-safe — creating and dropping a table is trivially reversible; adding and dropping a column is reversible if no data has been written to it. Reversible migrations are difficult or impossible to write correctly for data migrations: a migration that normalizes email addresses from uppercase to lowercase cannot be reversed without knowing the original values, which were overwritten. Teams that adopt reversible migrations typically restrict them to structural DDL (table and column creation) and exclude data migrations from the rollback requirement.
The deploy failure scenario that most frequently triggers the rollback question: a deploy is in progress, the migration has run and committed, and the new application version is failing to start due to an unrelated bug. The team wants to roll back the application to the previous version. If the migration is additive (added a column that the old version ignores), the old application version starts cleanly against the migrated schema. If the migration is destructive (dropped a column that the old version reads), the old application version fails immediately on startup with a missing column error. The migration strategy ADR documents which of these scenarios the team's deploy process is designed to handle — and therefore whether destructive migrations are permitted to go out in the same deploy as the code change that removes the last reference to the dropped element, or whether they require a separate deploy after the application code deploy is fully complete and stable.
Postmortem-driven ADR updates are particularly valuable for the rollback strategy: the specific migration failure scenario that the team encountered — the specific error, the specific remediation, the specific coordination required between application rollback and migration rollback — is the most concrete input for defining what the rollback policy needs to say. A migration strategy ADR written before any production incident will be general and abstract; one updated after a deploy rollback incident will be specific and operational.
Writing the database migration strategy decision record
The Nygard ADR format adapts for database migration strategy decisions with five sections that most engineering teams leave entirely undocumented, instead relying on each developer's prior experience and the institutional memory of whoever wrote the first migration.
The migration tooling decision. "Flyway 9.22 with versioned migrations (V prefix, UTC timestamp to the second as the version number: V20240315_143021__description.sql). Selected November 2023 when the project was initialized. Alternatives evaluated: (1) Liquibase — evaluated and rejected because the XML changeset format adds verbosity without adding capability for our use case (pure PostgreSQL, no multi-database support needed); Liquibase's rollback support is more complete than Flyway's, but our policy is forward-only migrations, eliminating that advantage; (2) Alembic (SQLAlchemy) — the project is written in Go, not Python; Alembic was not considered; (3) Atlas — evaluated as a state-based alternative; rejected because state-based migration management reduces control over the migration execution plan, which is critical for our zero-downtime constraint; Atlas generates the migration SQL from a schema diff and we cannot reliably verify that the generated SQL uses CONCURRENTLY for all index creation without modifying the generated output; (4) ActiveRecord — the project does not use Ruby on Rails; (5) Hand-written SQL with a custom version table — rejected because Flyway provides checksumming (detects modified migration files), out-of-order detection, and baseline support that would require non-trivial re-implementation. Migration file naming: UTC timestamp to the second avoids conflicts when two developers write migrations in the same day, at the cost of longer filenames. The convention is V[YYYYMMDD_HHMMSS]__[present_tense_verb]_[table]_[element].sql — example: V20240315_143021__add_is_email_verified_to_users.sql."
The migration lifecycle policy. "Every migration file must be peer-reviewed before merge — migrations are included in the standard code review process via the same pull request as the application code change that depends on them. Review checklist: (1) verify the migration does not contain any operations from the 'requires special handling' list below without the required safe variant (CONCURRENTLY for index creation, NOT VALID for CHECK constraint addition); (2) verify that the migration is backward-compatible with the current running application version, or that the PR includes the Phase 1 expand-only migration and schedules Phase 2 contract as a separate future deploy; (3) verify that data migrations include a batch loop with a configurable batch size and a delay parameter, not a single UPDATE across all rows. Migration testing: each migration must be run against a local PostgreSQL instance that matches the production version before the PR is marked ready for review. Staging runs migrations automatically as part of the CD pipeline. Production: migrations run as the first step of the deploy, before new application pods are started. The deploy pipeline aborts if any migration fails — a failed migration leaves the database in the state before the migration started (PostgreSQL DDL is transactional; a failed migration transaction is rolled back). Exception: CONCURRENTLY index creation cannot run inside a transaction in PostgreSQL; if a CONCURRENTLY migration fails midway, the partially-built index must be manually dropped before retrying. This is documented in the runbook for index migration failures."
The zero-downtime migration constraints and expand-contract policy. "Target database engine: PostgreSQL 15.4. Safe DDL operations (can run against live production table without downtime risk): (1) ADD COLUMN with no default or with a nullable column — metadata operation, lock held for <1ms; (2) ADD COLUMN NOT NULL DEFAULT 'constant' — metadata operation in PostgreSQL 11+, applies default on read for old rows; (3) DROP COLUMN — metadata operation; (4) CREATE INDEX CONCURRENTLY — background index build, no write blocking, requires CONCURRENTLY keyword; (5) ADD FOREIGN KEY ... NOT VALID — adds constraint without scanning existing rows; (6) VALIDATE CONSTRAINT — scans rows under SHARE UPDATE EXCLUSIVE lock (concurrent reads and writes permitted). DDL operations requiring special handling: (1) CREATE INDEX without CONCURRENTLY — blocks writes for the duration of the build; use CONCURRENTLY always for tables over 1,000 rows; (2) ADD CONSTRAINT ... CHECK without NOT VALID — scans all rows under ACCESS EXCLUSIVE; always use NOT VALID + VALIDATE CONSTRAINT pattern; (3) RENAME COLUMN — ACCESS EXCLUSIVE lock plus backward-incompatibility; always use expand-contract pattern; (4) ALTER COLUMN TYPE to an incompatible type — table rewrite under ACCESS EXCLUSIVE; use pg_repack for tables over 10 million rows; (5) ADD COLUMN NOT NULL DEFAULT (expression or function) — requires table rewrite in PostgreSQL; use add nullable column + backfill + add NOT NULL constraint instead. Expand-contract policy: any migration that is backward-incompatible with the current application version must be split into Phase 1 (expand: add new elements, deploy new code that writes to both old and new) and Phase 2 (contract: remove old elements, clean up dual-write code). The minimum deployment gap between Phase 1 and Phase 2 is one full rolling deploy cycle (currently approximately 12 minutes based on our pod count and rolling deploy rate). Phase 2 must not be deployed until all Phase 0 pods have been replaced by Phase 1 pods — verify in the deploy dashboard before proceeding."
The large table migration policy. "Large table threshold: 10 million rows or 10 GB of uncompressed table data. Tables currently above threshold: users (41M rows), events (310M rows), audit_log (88M rows), sessions (12M rows). For DDL operations on large tables: use CREATE INDEX CONCURRENTLY for all index creation; use pg_repack for table rewrites (install the extension in the production PostgreSQL instance; confirm availability before scheduling a migration that requires it). For data migrations (backfill, normalize, populate foreign keys) on large tables: write a batched loop that processes rows in a configurable batch size by primary key range. Default batch size: 5,000 rows. Default inter-batch delay: 50ms. Configurable via environment variables MIGRATION_BATCH_SIZE and MIGRATION_BATCH_DELAY_MS. Log progress to stdout after each batch: 'Processed rows [N] to [M] of estimated [TOTAL] (batch [K]). Estimated remaining: [HH:MM:SS].' Run the migration on the staging database (current row count approximately 15% of production) and record the duration before scheduling the production run. Large table migration scheduling: schedule outside of peak traffic hours (our peak is 09:00–21:00 UTC; prefer 01:00–06:00 UTC for large table operations). Notify the on-call engineer before starting a large table migration and share the estimated duration. Do not run a large table migration on the same day as a major feature deploy without explicit approval from the team lead."
The rollback strategy. "The team's migration policy is forward-only. Database schema migrations are not automatically reversed when an application rollback is performed. All migrations are designed to be backward-compatible with the prior application version (through the expand-contract pattern where necessary), so that an application rollback can be performed without needing to reverse the schema migration. When a migration must be reversed (due to a design error discovered in production), the reversal is implemented as a new forward migration in a new deploy, not as a rollback command. Undo migration files (Flyway's U prefix) are not used and are not maintained. Rationale: data migrations that have been applied cannot be safely reversed without risk of data loss or inconsistency; the undo migration code is additional surface area that must be tested and maintained but executed only in failure scenarios; the expand-contract pattern eliminates the need for migration rollback by ensuring all migrations are safe for the prior application version. Procedure for a failed deploy where the migration has already run: (1) assess whether the application rollback is safe given the schema state — if the migration is additive (added columns or tables the old version ignores), the application rollback is safe to proceed; if the migration dropped elements the old version references, the application rollback will cause application errors and a forward remediation migration is required before the rollback; (2) for additive migrations, roll back the application and schedule a contract migration as a new forward deploy when the design is corrected; (3) for destructive migrations applied before the rollback, open an incident and page the on-call engineer — the remediation requires restoring the dropped elements from the point-in-time backup or re-adding them via a new forward migration, depending on whether data loss occurred. CONCURRENTLY index failure procedure: if a CREATE INDEX CONCURRENTLY migration fails, the partially built index is left in an INVALID state in pg_indexes. Before retrying, run DROP INDEX CONCURRENTLY idx_name to remove the invalid index, then retry the CREATE INDEX CONCURRENTLY."
Why the migration strategy record prevents the two most common production incidents
It prevents the production lock incident from being the first time a developer learns the DDL locking behavior. A developer who has not previously caused a production lock incident does not know which DDL operations require special handling unless they have been explicitly told. The migration strategy ADR's "requires special handling" list is the document that prevents this: a developer writing their first index creation migration reads the ADR, sees that CREATE INDEX CONCURRENTLY is required, and writes the safe variant. Without the ADR, the developer writes CREATE INDEX, the migration runs safely in development and staging (both with small tables), and the first production run is the discovery event for the locking constraint. A technical leader inheriting a codebase who wants to add an index to a 200-million-row table without the strategy documented has no way to know from the codebase whether the team has a policy on CONCURRENTLY — they must find the person who wrote the original migration tooling setup and ask. The strategy ADR answers this question before the question is asked.
It prevents the rolling deploy backward-incompatibility incident from occurring with every column rename. The expand-contract pattern is not discoverable from the codebase — there is no code structure that reveals whether the team has adopted it as a convention. A developer writing their first column rename migration has no signal that a direct rename will cause 5xx errors during the rolling deploy unless they have been told about the expand-contract pattern explicitly. The migration strategy ADR documents both the fact that backward-incompatible migrations require expand-contract and the specific procedure for how to implement it, making the pattern accessible before the first incident rather than after.
Migration strategy ADRs are superseded when the team changes database engines, migrates to a managed database service with different DDL locking behavior, or adopts an online schema change tool that changes the procedure for large table operations. Each supersession is an opportunity to document what the previous strategy could not handle — what incident triggered the migration tooling change, what limitation of the previous approach caused the change, and what the new approach commits to that the old approach could not. The supersession chain is the institutional memory of the team's accumulated learning about safe database migration — a learning history that would otherwise be scattered across post-mortem documents and Slack threads.
Finding the database migration strategy in AI chat history
Database migration decisions are among the most reliably present in AI chat history, because the migration tooling setup and the first production locking incident are both events that engineers work through with AI assistance. Three months of AI chat history from a developer who owns the backend infrastructure will typically contain all four migration session types.
The initial setup session contains the migration tooling selection, the version numbering convention, and the CI/CD integration: "how do I add Flyway to a Spring Boot project", "should migration files use timestamps or sequential integers for version numbers", "how do I run Flyway migrations as part of a Kubernetes deployment", "should migrations run before or after the application starts". The tooling selection session is the highest-priority recovery target because it captures the alternatives evaluated and the rejection reasons — information that is never recorded in the migration files themselves and that determines which tooling constraints apply to every future migration.
The first production lock incident session is the event that teaches the DDL locking constraint to a specific engineer: "ALTER TABLE is blocking all reads in PostgreSQL", "how long does CREATE INDEX take on a table with 100 million rows", "what is ACCESS EXCLUSIVE lock in PostgreSQL", "how do I create an index without blocking writes". The incident session is valuable because it documents which specific operation caused the incident, on which table, at which row count, and what the resolution was — the origin event for the "requires special handling" list that should be in the migration strategy ADR. Without an ADR to update after the incident, the learning is local to the engineer who experienced it and is not transferred to the team as a policy.
The large table migration session contains the online schema change tool selection and the batch sizing decision: "how do I add a column to a 500 million row table in PostgreSQL without downtime", "what is pg_repack and how does it differ from VACUUM FULL", "what batch size should I use for a backfill migration on a 200 million row table", "how do I avoid replication lag spikes during a large data migration", "how do I track progress of a long-running migration". The batch sizing parameters and the replication lag budget that emerge from this session are the specific operational parameters that the migration strategy ADR's large table migration policy should record — calibrated to the team's specific database hardware and replication configuration rather than to generic best practice guidance.
The rollback incident session contains the most operationally valuable migration content: "we deployed a migration that dropped a column and now we need to roll back — what do we do", "how do we coordinate rolling back the application code when the schema migration already ran", "can Flyway undo a migration that dropped a table", "how do we restore a dropped column in PostgreSQL without losing data". The rollback incident is the event that forces the team to articulate its rollback policy — either by discovering that they have none, or by executing the procedure under pressure and documenting what they did. The postmortem from this incident is where the rollback strategy first exists in documented form; the migration strategy ADR is where it belongs permanently.
The WhyChose extractor surfaces these sessions because they contain the characteristic language of migration decisions under production pressure: specific table names and row counts, specific error messages (ACCESS EXCLUSIVE, lock timeout, replication lag), specific tool names (pg_repack, gh-ost, Flyway, CONCURRENTLY), and specific policy choices framed as questions that are being answered in real time. The extractor identifies these as decision sessions because they contain the decision structure — a question that was resolved, with specific alternatives considered and a specific resolution adopted — rather than as informational sessions that merely explain how migrations work in general.
Further reading
- Decisions that never get written down — the migration tooling setup session, the first production lock incident, and the first large table migration are each decision events that produce institutional knowledge without producing documentation; the institutional knowledge lives in the engineer who experienced the incident and is lost when they leave the team
- The caching strategy decision record — the cache invalidation policy and the migration DDL locking classification share the same structure: both are policies that enumerate which operations are safe-by-default and which require special handling, with specific consequences for violating the policy at production scale
- The data retention decision record — data retention migrations that purge old rows from large tables are a subcategory of large table data migrations; the batch sizing policy, the replication lag budget, and the progress tracking requirement apply equally to retention purges and to backfill migrations
- The observability strategy decision record — the metrics and tracing instrumentation installed during a database migration (migration duration metrics, replication lag monitoring, lock wait time) are observability strategy decisions; the ability to answer "how long did the migration take on each batch?" and "did replication lag spike during the backfill?" requires that the migration process emits the right metrics and that the metrics backend is configured to retain them
- The logging infrastructure decision record — migration progress logging (the per-batch progress output during a long-running data migration) depends on the log aggregation tool's indexing model; a backfill that runs for 3 hours across 600 batch iterations produces logs that must be queryable by migration ID and batch number to determine progress state during execution
- ADRs for platform teams: how infrastructure decisions become constraints for product teams — the migration strategy is a platform decision when the database is shared infrastructure managed by a platform team; the safe DDL operation classification, the large table threshold, and the online schema change tool availability are platform constraints that product team developers must follow; the migration strategy ADR is the document that communicates these constraints to product teams before they write their first migration against the platform database
- The new-CTO onboarding problem: when nobody can tell you why — a technical leader who inherits a codebase with 400 migration files and no migration strategy documentation cannot determine from the migration files alone whether the team has a zero-downtime convention, what the large table threshold is, or whether the rollback policy is forward-only or reversible; the answers are distributed across post-mortem documents, Slack threads, and the institutional memory of the team members who were present for each incident
- Postmortem ADRs: converting incident decisions into decision records — the production lock incident, the rolling deploy backward-incompatibility incident, and the migration rollback incident are each the natural source events for migration strategy ADR updates; the postmortem identifies the constraint, the ADR records the constraint as a permanent policy, and the next developer who would have hit the same constraint reads the policy instead of reproducing the incident
- ADR lifecycle: superseding and deprecating decisions — migration strategy ADRs are superseded when the team changes database engines, adopts an online schema change tool, changes the migration tooling, or changes the rollback policy; the supersession chain is the history of the team's accumulated production incidents and the policies that emerged from them
- Three months of AI chat history, undocumented — the initial migration tooling setup session, the first production lock incident session, the large table migration session, and the rollback incident session each produce decision content that is reliably present in AI chat history and reliably absent from any codebase artifact; the migration strategy ADR is the permanent home for this content
- The error handling strategy decision record — migration failures are a class of operational error with specific retry and remediation procedures; the CONCURRENTLY index failure procedure (drop the INVALID index before retrying) and the failed migration transaction behavior (automatic rollback for most DDL, no rollback for CONCURRENTLY) are migration-specific error handling policies that belong in the migration strategy ADR alongside the general error handling strategy
- The service mesh decision record — service mesh sidecar injection can interfere with database migration jobs that run as Kubernetes Jobs rather than long-lived Deployments; the mesh's connection draining behavior, the sidecar lifecycle, and the Job completion signal interact in ways that can cause migration Jobs to appear to hang after the migration SQL has committed; the service mesh decision record and the migration strategy ADR should document how migration Jobs are excluded from sidecar injection or how the sidecar is terminated on Job completion
- Nygard ADR template — the migration strategy ADR's five sections (tooling decision, lifecycle policy, zero-downtime constraints, large table policy, rollback strategy) each map to the Nygard format's Context, Decision, and Consequences structure; the zero-downtime constraint table (safe operations vs. requires special handling) is the Consequences section written as an operational reference rather than as a prose paragraph
- WhyChose extractor — database migration decisions are reliably present in AI chat history in four session types: the initial tooling setup session (migration tool selection, version numbering convention), the first production lock incident session (DDL locking constraint discovered under production load), the large table migration session (online schema change tool selection and batch sizing), and the rollback incident session (the rollback strategy articulated under production incident pressure); the extractor identifies these sessions by the characteristic language of migration decisions — specific table row counts, specific lock error messages, specific tool names — rather than by keyword matching on generic migration terminology