Why does a testing strategy need a decision record if it just reflects testing preferences?

A testing strategy is not a preference — it is an architectural decision with cascading consequences. The ratio of unit to integration to E2E tests determines what your CI pipeline costs and how long it takes, what class of bugs the test suite can and cannot catch, how safely engineers can refactor internal implementations, and what you can promise to your team and your customers about production behavior. When the test strategy is undocumented, new engineers extend it by copying the existing pattern — which may be the right pattern, or may be an accident of who set up the CI pipeline in year one. The decision record makes the strategy explicit, names the trade-offs it accepts, and gives a concrete revisitation condition rather than letting the strategy drift by accumulation.

What is the mock boundary decision and why is it the most consequential undocumented testing decision?

The mock boundary decision is the answer to: where in the stack do your tests stop using real implementations and start using fakes, stubs, or mocks? The most common version is: do you mock the database, or do tests hit a real database? This decision determines what your tests can tell you about production behavior. Tests that mock the database can verify the application logic in isolation but cannot catch bugs that arise from how the database actually behaves — query plan regressions, constraint violations, transaction isolation failures. Tests that hit a real database catch this class of bug but are slower, require test data management, and need the database to be in a known state before each test. The mock boundary cascades into CI infrastructure requirements, test data management strategy, and what refactoring is safe without updating tests. It is the most consequential undocumented testing decision because it determines the class of production bugs the test suite is structurally incapable of preventing.

How do you find test strategy decisions in AI chat history?

Test strategy decisions appear in AI chat in four recognizable session shapes: (1) setup sessions early in the project — 'how should we structure our tests for a Node.js API?', 'should we use Jest or Vitest?', 'how do we test database queries?' — where the first framework choice locked in the test strategy before it was consciously evaluated; (2) mock boundary debates — 'should we mock the database in our unit tests?', 'how do we test our service layer in isolation?', 'is it okay to test against a real Postgres instance?' — these sessions contain the explicit reasoning the team used to draw the mock boundary; (3) CI speed sessions — 'our tests are taking 40 minutes to run', 'how do we make our test suite faster?' — these are often the first time the team explicitly evaluates the current strategy against its costs and may contain a pivot decision; (4) production incident sessions — 'the bug passed all the tests, how?', 'why didn't our test suite catch this?' — these sessions explicitly name what the current test strategy cannot catch, and often trigger a strategy revision.

What should a test strategy ADR include that a standard ADR format doesn't cover?

A test strategy ADR needs four sections that standard ADR formats underspecify: (1) the pyramid ratio — the intended ratio of unit to integration to E2E tests, with the reasoning for the balance (fast feedback vs. coverage fidelity vs. CI cost); (2) the mock boundary — exactly where in the stack tests use real implementations vs. fakes, with the specific reasoning for each boundary; (3) the deployment confidence threshold — what the test suite must show before deployment is permitted, and what class of bugs is accepted as 'not covered by automated tests'; (4) the refactoring safety guarantee — what the test suite is intended to protect during refactoring (behavior at the API boundary? behavior at the database boundary? internal implementation details?) and what refactoring is not protected. Without these four sections, the ADR describes the test tooling chosen but not the strategy the tooling is intended to implement.

2026-06-15 · ~15 min read

The test strategy decision record: why the testing pyramid your team adopted looks like a preference but acts like a constraint

Every team has a testing pyramid. Most teams didn't choose it consciously. It emerged from whoever set up the CI pipeline first, from the framework the team happened to adopt, and from the testing philosophy of the engineer who wrote the first test suite. Six months later, the ratio of unit to integration to end-to-end tests constrains your refactoring safety, your deployment speed, your CI cost, and what you can honestly promise about production behavior. None of this was the product of a named decision; it is the accumulated consequence of a dozen unremarkable early choices.

The testing pyramid is usually introduced as a heuristic: many unit tests at the base, fewer integration tests in the middle, even fewer end-to-end tests at the top. The heuristic is useful as a starting point. It becomes a problem when it is treated as a natural law rather than a decision — when a team's specific pyramid ratio is treated as obviously correct rather than as a position that was implicitly taken and that carries concrete trade-offs.

The trade-offs are architectural. A team that has 90% unit tests and 5% integration tests has made a structural decision that unit-level behavior will be the primary signal for deployment confidence and that integration-level behavior is secondary. A team that has 50% integration tests hitting a real database has made a structural decision that the cost of real-infrastructure CI is worth the higher fidelity of the resulting signal. Neither position is wrong in the abstract — both are reasonable responses to specific constraints — but both are positions, and the teams that hold them almost never documented why.

This is the same problem as every undocumented architectural decision: the decision is invisible as a decision. A new engineer reads the test suite and extends it in the direction the existing tests point. They don't know whether the existing direction was chosen deliberately or whether it emerged from whoever wrote the first test and the subsequent engineers who copied the pattern. The test suite grows in the direction of its origin, whether or not the origin was the right direction for the current team, product, and infrastructure.

Why the test strategy reads as a preference instead of a constraint

Testing strategies invite the "preference" reading because the language around testing is heavily laden with preference vocabulary. Engineers say they "prefer" unit tests because they're fast, they "prefer" integration tests because they test real behavior, they "believe in" TDD, they "think" E2E tests are too brittle. This preference vocabulary describes a genuine diversity of engineering opinion — but it conceals the fact that the preference, once acted on at the infrastructure level, becomes a structural constraint.

Once a team has a test suite with 2,000 unit tests and 50 integration tests, the 2,000 unit tests are not a preference. They are a fact that the next engineer must extend or refactor against. The test infrastructure that was built to support them — the mock library, the factory functions, the test doubles for the database and the HTTP layer — is a technical dependency. The CI pipeline that runs them — configured for fast parallel unit test execution rather than for database provisioning — is infrastructure. The deployment gate that requires all unit tests to pass is a policy. None of these are preferences. They are decisions that have already been made and implemented, and they constrain what new development can look like.

The constraint is most visible when someone proposes to change the strategy. A team that has a mock-heavy unit test suite and wants to add real-database integration tests discovers that the existing application code is tightly coupled to the mock implementations: the service layer doesn't have a clean interface that can be pointed at a real database, because the tests were written assuming the database would always be mocked. The test suite has coupled the application architecture to the testing strategy. Changing the strategy requires changing the application code, not just the tests.

This is the moment when the undocumented testing strategy becomes expensive: when the team wants to change direction and finds that the existing infrastructure — code, tests, CI, deployment gates — all pull in the direction of the original implicit strategy, and there is no record that explains why the original strategy was the one that was adopted.

The three test strategy archetypes and what each one commits you to

The specific shape of a team's testing pyramid determines what the team has implicitly committed to — in CI cost, in refactoring safety, in the class of bugs the suite can and cannot catch, and in what deployment confidence the suite provides.

The unit-test-dominant pyramid. A wide base of unit tests, a narrow band of integration tests, and minimal end-to-end tests. Individual functions and classes are tested in isolation with mocked dependencies. The mock layer is the boundary between what is tested and what is assumed. This strategy produces fast, parallelizable CI — a 2,000-test unit suite can run in under 60 seconds on commodity hardware with no external infrastructure. It provides strong protection for internal implementation logic. It is the strategy that fits teams building algorithmic software, pure functions, or computation-heavy layers where the logic is the product.

What it commits you to: the mock layer must accurately reflect the behavior of the real implementations it replaces. When the real implementations change — when the database query changes, when the HTTP API response shape changes, when the third-party SDK updates — the mocks must be updated to reflect the change. If they are not, the tests continue passing while the production behavior has changed. The unit-dominant pyramid produces the failure mode where the test suite goes green and the production deploy fails because the mock of the payment provider's response shape was never updated when the provider changed their API.
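Here is a minimal sketch of how mock drift happens. It assumes a hypothetical payment provider that renamed a response field from `status` to `state`. The names `ChargeResponse`, `isChargeSucceeded`, and the payloads are illustrative only and do not come from any real SDK.

```typescript
// Response shape the team assumed when the mock was written (hypothetical).
interface ChargeResponse {
  status: string;
}

// Service logic under test: decides whether a charge succeeded.
function isChargeSucceeded(response: ChargeResponse): boolean {
  return response.status === "succeeded";
}

// Hand-written fake, frozen at the shape the provider returned when it was written.
const fakeProviderResponse: ChargeResponse = { status: "succeeded" };

// Hypothetical payload the real provider now returns after renaming the field.
// It arrives as untyped JSON, so the compiler cannot see the drift.
const realProviderPayload: unknown = JSON.parse('{"state":"succeeded"}');

// The unit test passes against the fake...
const unitTestPasses = isChargeSucceeded(fakeProviderResponse); // true
// ...while production treats every successful charge as a failure.
const productionResult = isChargeSucceeded(realProviderPayload as ChargeResponse); // false
```

A better unit test would not catch this. Only a check at the point where real payloads enter the system can catch it, and that is the gap integration and contract tests are meant to close.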

It also commits you to a specific refactoring safety guarantee: you can safely refactor internal implementations provided the mock interfaces remain stable. You cannot safely change the mock interfaces without updating the tests. If your refactoring changes the shape of the service layer that the mocks replicate, the tests must be rewritten as well as the code. This is not a weakness of the strategy — it is a structural property of it. Engineers need to know this when they design a refactor, so they know what will and will not be protected.
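The following minimal sketch makes the mock interface concrete. It assumes a hypothetical `OrderRepository` interface that the service depends on. Unit tests substitute an in-memory fake, so the service's internals can change freely as long as the interface stays the same.

```typescript
interface Order {
  id: string;
  customerId: string;
  totalCents: number;
}

// The mock boundary: the service depends only on this interface.
interface OrderRepository {
  findByCustomer(customerId: string): Order[];
}

// Service under test. Its internals can be refactored freely
// as long as it keeps using OrderRepository the same way.
class OrderService {
  private readonly repo: OrderRepository;

  constructor(repo: OrderRepository) {
    this.repo = repo;
  }

  lifetimeValueCents(customerId: string): number {
    return this.repo
      .findByCustomer(customerId)
      .reduce((sum, order) => sum + order.totalCents, 0);
  }
}

// In-memory fake that replaces the SQL-backed repository in unit tests.
class InMemoryOrderRepository implements OrderRepository {
  private readonly orders: Order[];

  constructor(orders: Order[]) {
    this.orders = orders;
  }

  findByCustomer(customerId: string): Order[] {
    return this.orders.filter((order) => order.customerId === customerId);
  }
}

const service = new OrderService(
  new InMemoryOrderRepository([
    { id: "o1", customerId: "c1", totalCents: 1500 },
    { id: "o2", customerId: "c1", totalCents: 2500 },
    { id: "o3", customerId: "c2", totalCents: 999 },
  ]),
);
const value = service.lifetimeValueCents("c1"); // 4000
```

Suppose a refactor changes `findByCustomer` itself, for example to paginate results. Then every fake and every test that configures one has to change along with it. That is the limit of the refactoring guarantee.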

The integration-test-dominant strategy. A significant proportion of tests hit real infrastructure — a real database in a test container, real HTTP endpoints, real message queue interactions. The test setup provisions infrastructure, loads test data, runs through real code paths, and asserts on real outputs. The mock layer is pushed to the edge of the system, at external third-party dependencies that genuinely cannot be provisioned in CI (payment providers, SMS gateways, external OAuth providers).

This strategy produces tests that catch the bugs that unit tests structurally cannot: query plan regressions when a migration changes an index, constraint violations that the application layer assumed the database would prevent, transaction isolation failures that only appear when two concurrent transactions run against the same real database, and serialization bugs that only surface when real JSON goes through a real HTTP layer and back. These bugs are real, and they happen in production on teams that have 2,000 unit tests and green CI.

What it commits you to: CI infrastructure. A test suite that provisions a Postgres container for each test run needs Docker available in CI, needs the container to be healthy before tests run, needs a migration step to bring the schema to its current state, and needs test data to be isolated between test cases. This infrastructure has a setup cost and an ongoing maintenance cost. CI run time is longer, often ten minutes where a unit-dominant suite would finish in sixty seconds. The team must decide how to handle test data: pre-seeded fixtures loaded before the suite runs, or per-test factory functions that create and clean up their own data. These are real operational decisions that emerge from the strategic choice to test against real infrastructure.
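Here is a sketch of the per-test factory option, under stated assumptions. `Table`, `UserFactory`, and `withIsolatedUsers` are hypothetical names. An in-memory `Map` stands in for the Postgres container so the sketch runs on its own. In a real suite, `insert` and `delete` would be SQL, and many teams wrap each test in a transaction that is rolled back instead.

```typescript
// Stand-in for a database table so the sketch runs without Postgres.
class Table<T extends { id: number }> {
  private readonly rows = new Map<number, T>();

  insert(row: T): void {
    if (this.rows.has(row.id)) throw new Error(`duplicate id ${row.id}`);
    this.rows.set(row.id, row);
  }

  delete(id: number): void {
    this.rows.delete(id);
  }

  count(): number {
    return this.rows.size;
  }
}

interface User {
  id: number;
  email: string;
}

// Shared counter so generated ids never collide across tests.
let nextUserId = 1;

// Hypothetical per-test factory: it records every row it creates
// so the test can remove exactly its own data afterwards.
class UserFactory {
  private readonly created: number[] = [];
  private readonly table: Table<User>;

  constructor(table: Table<User>) {
    this.table = table;
  }

  create(overrides: Partial<User> = {}): User {
    const id = nextUserId++;
    const user: User = { id, email: `user${id}@example.test`, ...overrides };
    this.table.insert(user);
    this.created.push(user.id);
    return user;
  }

  cleanup(): void {
    for (const id of this.created) this.table.delete(id);
    this.created.length = 0;
  }
}

// Runs a test body with a fresh factory and cleans up even if the body throws.
function withIsolatedUsers(table: Table<User>, body: (factory: UserFactory) => void): void {
  const factory = new UserFactory(table);
  try {
    body(factory);
  } finally {
    factory.cleanup();
  }
}

const users = new Table<User>();
let seenDuringTest = 0;
withIsolatedUsers(users, (factory) => {
  factory.create();
  factory.create({ email: "admin@example.test" });
  seenDuringTest = users.count(); // 2 while the test runs
});
const leftAfterTest = users.count(); // 0 once cleanup has run
```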

It also commits you to a different refactoring safety guarantee: you can safely refactor internal implementations because the tests assert on real behavior at the system boundary. You can change the ORM query from a Prisma call to raw SQL and the tests will tell you whether the behavior changed, because the tests hit the real database rather than a mock. This is a stronger safety guarantee for certain kinds of refactoring — but it requires the infrastructure overhead to maintain.

The end-to-end primary strategy. The primary test suite drives the product through a real browser or API client, from user action to observable system state. Unit tests may exist for specific algorithmic components, but the primary deployment gate is end-to-end test passage. This strategy is adopted by teams whose product behavior is the integration of many services — teams where the question "does this work?" can only be answered by simulating a real user action across the full stack.

What it commits you to: the slowest CI and the highest infrastructure cost. E2E tests that drive a real browser through a real application stack can run in 20-60 minutes for a mature product. They require full environment provisioning. They are sensitive to environment state — a flaky network call, a slow render, a timing-dependent assertion can produce non-deterministic failures that take significant engineering time to diagnose. The team that commits to E2E as the primary strategy commits to maintaining the test environment, managing flakiness, and accepting a long feedback loop between commit and deployment signal.

What it provides: the highest-fidelity deployment confidence of the three strategies. If the E2E tests pass, the user-visible behavior of the product is correct — at least for the user journeys the tests cover. For teams building products where user-visible correctness is the primary quality signal, this trade-off is often correct. For teams building internal infrastructure or developer tooling, it is often excessive.

The mock boundary: the most consequential undocumented decision

Within any test strategy, the specific location of the mock boundary is the most consequential undocumented decision. The mock boundary is the answer to a precise question: where in the stack does the test switch from using real implementations to using test doubles? The answer determines everything else about what the test suite can and cannot tell you.

The mock boundary question is most sharply posed by the database question: do tests mock the database, or do they hit a real database? This question is where teams have the most direct debates in AI chat — "should we mock the database in our unit tests?" is one of the most common software architecture questions in any team's chat history — and where the debate is most consequential.

A team that mocks the database in unit tests is asserting that the application layer and the database layer are separable concerns: the service calls a repository interface, and the tests mock that interface. The repository's real implementation — the SQL or ORM queries — is tested separately. This separation allows the service layer tests to be fast and to test the service logic in isolation. But it creates a gap: the service layer tests pass because the mock repository behaves as specified, not because the real database behaves as assumed. When those two diverge, the unit tests continue passing while production fails.
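One common way to narrow that gap is to run a single behavioral suite against both the fake and the real repository, so the fake cannot silently diverge. This is sketched here as one option, not a prescription. The `AccountRepository` interface and both classes are hypothetical. The "real" implementation is simulated in memory by enforcing uniqueness the way a Postgres `UNIQUE` constraint would, so the sketch runs without a database.

```typescript
interface AccountRepository {
  create(email: string): void;
  count(): number;
}

// Naive fake: accepts duplicate emails because nobody told it not to.
class NaiveFakeAccounts implements AccountRepository {
  private readonly emails: string[] = [];
  create(email: string): void {
    this.emails.push(email);
  }
  count(): number {
    return this.emails.length;
  }
}

// Simulates the real database, which has a UNIQUE constraint on email.
class ConstrainedAccounts implements AccountRepository {
  private readonly emails = new Set<string>();
  create(email: string): void {
    if (this.emails.has(email)) {
      throw new Error(`duplicate key value violates unique constraint: ${email}`);
    }
    this.emails.add(email);
  }
  count(): number {
    return this.emails.size;
  }
}

// One behavioral suite, run against every implementation.
// Returns the names of the checks that failed.
function accountRepositoryContract(make: () => AccountRepository): string[] {
  const failures: string[] = [];

  const repo = make();
  repo.create("a@example.test");
  if (repo.count() !== 1) failures.push("create stores one account");

  const dupRepo = make();
  dupRepo.create("b@example.test");
  let rejected = false;
  try {
    dupRepo.create("b@example.test");
  } catch {
    rejected = true;
  }
  if (!rejected) failures.push("duplicate email is rejected");

  return failures;
}

const realFailures = accountRepositoryContract(() => new ConstrainedAccounts()); // []
const fakeFailures = accountRepositoryContract(() => new NaiveFakeAccounts()); // ["duplicate email is rejected"]
```

In practice, the run against the real implementation would hit the test container, often in a slower CI tier, while the fake-backed service tests stay fast.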

A team that hits a real database is asserting that the application layer and the database layer are not separable for testing purposes: the behavior of the application is defined by the interaction between the application code and the real database, and testing the application code without the real database is testing a fiction. This is not a testing preference — it is a claim about the nature of the software. For software where the database is a first-class part of the system behavior (which is most CRUD applications), this claim is correct.

The mock boundary cascades upward from the database question. If tests mock the database, they can also run without an HTTP server — the test can call the service layer directly. If tests hit a real database, the natural next question is whether they should also hit a real HTTP server — and whether they should drive a real HTTP client rather than calling controllers directly. Each choice moves the mock boundary up or down the stack and changes what the test suite can and cannot catch.

The mock boundary also cascades into the application architecture. Code that is written knowing that the tests will mock the database tends to have clean repository interfaces with injectable dependencies. Code that is written knowing that tests will hit a real database tends to be less architecturally abstracted — the database calls can be inline in the service because there's no need to inject a mock. Both architectures are valid; they are shaped by the testing strategy. When a team changes its testing strategy, it often discovers that the application architecture was shaped by the old strategy and must be refactored before the new strategy can be applied.

The test strategy as a hiring constraint

An undocumented testing strategy creates a specific onboarding failure mode: engineers arrive with a testing philosophy from their prior team and extend the existing test suite in the direction their prior experience points, regardless of the direction the existing suite was going.

An engineer who comes from a TDD background finds a test suite with 200 tests for a mature product and reads the sparse coverage as a gap to close. They write unit tests for every function they touch, mock all dependencies, and push the pyramid toward the unit-dominant strategy. An engineer who comes from an integration-test background finds the same suite and finds the mock-heavy approach unsatisfying — they add real-database tests because they've been burned by mock-database tests passing while production fails. An engineer who comes from a high-coverage E2E background finds both orientations insufficient and starts writing Playwright tests for every feature they ship.

All three are doing the right thing as experienced engineers exercising judgment in the absence of constraints. None of them is extending the existing strategy — they're each importing their prior team's strategy onto the current codebase. After three engineers with different backgrounds have each contributed to the test suite in the way their experience suggested, the suite reflects three overlapping strategies rather than one coherent strategy. This is the testing equivalent of the de-standardization failure mode: the codebase ends up in a mixed state that looks like inconsistency rather than a deliberate architectural choice.

The test strategy record solves this by giving every new engineer a documented constraint to work within and argue against. An engineer who disagrees with the mock boundary decision can propose a change to the record with specific reasoning, rather than silently importing their preferred alternative. The record converts a silent drift into an explicit debate — and explicit debates produce better decisions than silent drift.

The constraint also interacts with the new technical leader onboarding problem. A new CTO or engineering manager who sees a test suite with long CI times may read the slow tests as a quality problem to fix rather than as the consequence of a deliberate integration-test strategy. Without a record, they may push for faster tests — which means mocking more, which means moving toward a unit-dominant strategy — without understanding that the slow tests are buying a class of production bug prevention that faster tests cannot provide. The decision record gives them the context to evaluate the trade-off honestly rather than treating CI speed as the primary signal.

The CI speed session as the forcing function

The most common trigger for an explicit test strategy debate is a CI speed problem. A test suite that takes 45 minutes to run surfaces the cost of the current strategy in a way that abstract arguments cannot: every engineer on the team waits 45 minutes for deployment feedback, every PR cycle includes a 45-minute delay, and the team's iteration speed is directly constrained by the test suite's runtime.

The CI speed session — "our tests are taking 45 minutes, how do we fix this?" — is typically the first session where the team explicitly evaluates the current test strategy against its costs. Before this session, the strategy was implicit and the costs were invisible. During this session, the team is forced to examine what the tests are doing that makes them slow, which means examining the architecture of the test suite: how many tests hit a real database, how many tests provision containers, how many tests drive a browser. The speed session surfaces the strategy by surfacing its costs.

These sessions are among the most valuable sources of test strategy decisions in AI chat. They contain the first explicit evaluation of the current strategy, the trade-offs the team was weighing, and the decision about whether to optimize the current strategy or change it. A team that chose to "just cache the Docker layer and add more parallelism" decided to invest in the integration-test strategy rather than pivot toward mocks. A team that chose to "write more unit tests and push the slow tests into a nightly run" decided to split the CI pipeline into a fast unit gate and a slower integration gate. Both are valid decisions. Most teams never document either one as a decision.
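The split-pipeline option can be expressed as a selection rule over tagged tests. This minimal sketch assumes each test carries a hypothetical `tier` tag. Real runners express the same idea through file naming, tags, or separate config files, and the mechanism varies by tool.

```typescript
type Tier = "unit" | "integration" | "e2e";
type Pipeline = "pull-request" | "nightly";

interface TestCase {
  name: string;
  tier: Tier;
}

// Which tiers run in each pipeline. The PR gate stays fast;
// the nightly run carries the slow, high-fidelity tiers.
const tiersByPipeline: Record<Pipeline, Tier[]> = {
  "pull-request": ["unit"],
  nightly: ["unit", "integration", "e2e"],
};

function selectTests(tests: TestCase[], pipeline: Pipeline): TestCase[] {
  const allowed = tiersByPipeline[pipeline];
  return tests.filter((test) => allowed.includes(test.tier));
}

const suite: TestCase[] = [
  { name: "parses invoice totals", tier: "unit" },
  { name: "persists invoice to Postgres", tier: "integration" },
  { name: "checkout happy path", tier: "e2e" },
];

const prTests = selectTests(suite, "pull-request").map((t) => t.name); // ["parses invoice totals"]
const nightlyTests = selectTests(suite, "nightly").length; // 3
```

Writing the mapping down makes the trade-off visible. Any tier missing from the pull-request list is a class of regression that can be merged and is caught only by the nightly run.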

The CI speed decision also connects to the performance optimization decision record in a specific way: CI performance optimizations — parallelism, layer caching, selective test execution, test splitting by file — are infrastructure investments made in the service of the current testing strategy. They are the right optimization if the current strategy is the right strategy; they are the wrong investment if the strategy should change. Without a documented strategy, the team optimizes CI without knowing whether they're optimizing toward the right target.
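Test splitting by file is a small scheduling problem. The sketch below shows one common heuristic: greedy longest-first assignment of files to shards, based on historical durations. The file names and timings are made up, and CI providers that offer splitting may use different algorithms.

```typescript
interface TimedFile {
  path: string;
  seconds: number;
}

// Greedy longest-processing-time-first: sort files by duration, descending,
// then always give the next file to the shard with the least total time.
function splitByDuration(files: TimedFile[], shardCount: number): TimedFile[][] {
  const shards: TimedFile[][] = Array.from({ length: shardCount }, () => [] as TimedFile[]);
  const totals: number[] = new Array(shardCount).fill(0);

  const sorted = [...files].sort((a, b) => b.seconds - a.seconds);
  for (const file of sorted) {
    let lightest = 0;
    for (let i = 1; i < shardCount; i++) {
      if (totals[i] < totals[lightest]) lightest = i;
    }
    shards[lightest].push(file);
    totals[lightest] += file.seconds;
  }
  return shards;
}

const shards = splitByDuration(
  [
    { path: "orders.int.test.ts", seconds: 300 },
    { path: "billing.int.test.ts", seconds: 240 },
    { path: "users.int.test.ts", seconds: 180 },
    { path: "search.int.test.ts", seconds: 120 },
    { path: "parser.unit.test.ts", seconds: 60 },
  ],
  2,
);
const shardTotals = shards.map((s) => s.reduce((sum, f) => sum + f.seconds, 0)); // [480, 420]
```

Like caching and parallelism, this shortens the current strategy's wall-clock time (480 seconds instead of 900 run serially in this example). It does not change what the suite can catch.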

Production bugs that the test suite couldn't catch

Every test strategy has a characteristic production failure mode — the class of bug that the strategy is structurally incapable of preventing. Documenting the testing strategy includes documenting this failure mode honestly, because the failure mode names the coverage gap that the team has accepted.

The unit-dominant strategy's characteristic failure mode is the mock drift bug: the real implementation changed, the mock was not updated, the tests passed, the production deployment failed. The specific version is the integration contract bug — the service's mock of the third-party API accepted a response shape that the real API no longer returns, and the bug only appeared in production when the real API was called. This is not a testing failure in the sense of an inadequate test — it is the structurally inevitable consequence of the unit-dominant strategy. The strategy accepts this risk in exchange for fast CI and isolation of internal implementation logic.

The integration-dominant strategy's characteristic failure mode is the environment parity bug: the test container behaves differently from the production database in a way that the integration tests didn't catch. A PostgreSQL version difference, a collation difference, a timezone configuration difference, a network timeout behavior that doesn't appear in a local container but appears in a geographically distributed production deployment — these bugs pass integration tests because the integration test environment isn't identical to production. The strategy accepts this risk in exchange for catching mock-drift bugs and testing real behavior at the database boundary.
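Some of these differences can be caught mechanically by comparing the test container's settings with production's before the suite runs. This minimal sketch assumes the settings have already been read from each environment; for Postgres, `SHOW server_version` and `SHOW TimeZone` return two of them. The function and the setting keys are hypothetical.

```typescript
type EnvSettings = Record<string, string>;

interface ParityMismatch {
  setting: string;
  test: string | undefined;
  production: string | undefined;
}

// Reports every setting whose value differs between the two environments,
// including settings present in only one of them.
function parityMismatches(test: EnvSettings, production: EnvSettings): ParityMismatch[] {
  const keys = Array.from(new Set([...Object.keys(test), ...Object.keys(production)]));
  const mismatches: ParityMismatch[] = [];
  for (const setting of keys) {
    if (test[setting] !== production[setting]) {
      mismatches.push({ setting, test: test[setting], production: production[setting] });
    }
  }
  return mismatches.sort((a, b) => a.setting.localeCompare(b.setting));
}

const mismatches = parityMismatches(
  { server_version_major: "16", timezone: "UTC", collation: "en_US.UTF-8" },
  { server_version_major: "15", timezone: "UTC", collation: "en_US.UTF-8" },
);
// [{ setting: "server_version_major", test: "16", production: "15" }]
```

Pinning the container image to the production major version removes this particular mismatch. The check exists so that the next time someone upgrades one side and not the other, the drift is noticed.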

The E2E-dominant strategy's characteristic failure mode is the flakiness tax: non-deterministic test failures that require engineering investigation to distinguish from real bugs, flaky tests that are eventually ignored rather than fixed, and a gradual erosion of the CI gate's credibility as the team learns that red CI doesn't always mean a real regression. This is the most corrosive failure mode because it affects not just the tests that fail but the team's trust in all test results.
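Separating flaky tests from real regressions usually starts with rerun history. This minimal sketch assumes a hypothetical record of pass/fail outcomes for the same test on the same unchanged commit. Mixed outcomes on unchanged code mark the test as flaky rather than failing.

```typescript
type Verdict = "passing" | "failing" | "flaky";

// Outcomes of repeated runs of one test against one unchanged commit.
function classify(outcomes: boolean[]): Verdict {
  if (outcomes.length === 0) throw new Error("no runs recorded");
  const passes = outcomes.filter((passed) => passed).length;
  if (passes === outcomes.length) return "passing";
  if (passes === 0) return "failing";
  // Same code, different results: the test is non-deterministic, not the product.
  return "flaky";
}

// Quarantine list: flaky tests are tracked explicitly instead of silently ignored.
function quarantine(history: Record<string, boolean[]>): string[] {
  return Object.keys(history).filter((testName) => classify(history[testName]) === "flaky");
}

const quarantined = quarantine({
  "checkout completes": [true, false, true, true],
  "login redirects": [true, true, true],
  "refund issues credit": [false, false, false],
});
// ["checkout completes"]
```

The design choice that matters is the explicit list. A quarantine list is visible and reviewable, whereas rerunning a test until it goes green is exactly how trust in the CI gate erodes.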

Documenting the characteristic failure mode of the chosen strategy is not a statement of inadequacy — it is intellectual honesty about what the strategy can and cannot prevent. Post-mortems that identify production bugs that escaped the test suite are the empirical input to this documentation: the pattern of escaping bugs reveals the actual failure mode of the current strategy, and the decision record should name this pattern explicitly rather than treating it as a series of isolated incidents.

Writing the test strategy decision record

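The Nygard ADR format adapts to test strategy decisions by adding the four sections that standard architecture decision records often underspecify (the pyramid ratio, the mock boundary, the refactoring safety guarantee, and the deployment confidence threshold). These sit between the usual context section and a closing set of revisitation conditions.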

Context. Name the trigger for establishing a testing strategy: the first significant test infrastructure decision (did the team start with Jest and unit tests, or with Supertest and real HTTP tests?), the product context (what is the software testing, and what class of bugs is most dangerous in production?), and the team context (what is the team's SQL fluency, what CI infrastructure is available, what is the acceptable CI run time?). "We are building a REST API backed by PostgreSQL. The team has three engineers with strong SQL fluency. Our primary deployment environment is a managed PostgreSQL instance on AWS RDS running version 15. Our CI budget per run is approximately 10 minutes. Our primary risk is data correctness bugs — incorrect query results returned to users — rather than UI regression bugs."

The pyramid ratio decision. Name the intended ratio of unit to integration to E2E tests with the reasoning for the balance. "We target approximately 60% integration tests (testing against a real Postgres container), 35% unit tests (testing pure algorithmic logic and parser functions that have no database dependency), and 5% E2E tests (smoke-testing the critical user paths through a real HTTP client). The integration-heavy ratio is chosen because our primary risk is data correctness, and mock-database tests cannot catch the query plan regressions and constraint interactions that are most likely to produce data correctness bugs in production."
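A stated ratio is only useful if someone can tell when the suite has drifted away from it. This minimal sketch assumes per-tier test counts can be gathered, for example by file-naming convention. The 10-point tolerance is an arbitrary illustration, not a recommendation.

```typescript
type Tier = "unit" | "integration" | "e2e";
type Counts = Record<Tier, number>;

const tiers: Tier[] = ["unit", "integration", "e2e"];

// Target shares from the example decision record, in percent.
const target: Counts = { unit: 35, integration: 60, e2e: 5 };

// Returns the tiers whose actual share is more than `tolerance`
// percentage points away from the target share.
function ratioDrift(actual: Counts, goal: Counts, tolerance: number): Tier[] {
  const total = tiers.reduce((sum, tier) => sum + actual[tier], 0);
  if (total === 0) return [];
  return tiers.filter((tier) => {
    const sharePercent = (actual[tier] / total) * 100;
    return Math.abs(sharePercent - goal[tier]) > tolerance;
  });
}

// 1,400 unit, 550 integration, 50 E2E: the suite is drifting toward unit-dominant.
const drifted = ratioDrift({ unit: 1400, integration: 550, e2e: 50 }, target, 10);
// ["unit", "integration"]
```

A drift report like this is a prompt for the conversation the record exists to host, not an automatic failure. The team either corrects course or amends the ratio.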

The mock boundary decision. Name exactly where in the stack tests switch from real implementations to test doubles, with the specific reasoning for each boundary. "Tests hit a real Postgres database in a Docker container. They do not mock the HTTP layer — tests use Supertest to hit the real Express application. They do mock external third-party dependencies (Stripe, SendGrid, the OAuth provider) because these cannot be provisioned in CI without network access to production systems. The mock boundary sits at the edge of the system, between our code and external services we do not control. Everything within our system boundary is tested with real implementations."
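The boundary statement can also be written in a form that test setup checks. This minimal sketch assumes a hypothetical registry that declares each dependency as inside or outside the system boundary. The check refuses a test double for anything inside the boundary. The dependency names mirror the example record and are illustrative.

```typescript
type Side = "inside-boundary" | "external";

// The mock boundary from the decision record, expressed as data.
const dependencies: Record<string, Side> = {
  postgres: "inside-boundary",
  "express-app": "inside-boundary",
  stripe: "external",
  sendgrid: "external",
  "oauth-provider": "external",
};

// Called by test setup before installing a test double.
function assertMockAllowed(dependency: string): void {
  const side: Side | undefined = dependencies[dependency];
  if (side === undefined) {
    throw new Error(`${dependency} is not listed in the mock boundary decision`);
  }
  if (side === "inside-boundary") {
    throw new Error(`${dependency} is inside the system boundary; use the real implementation`);
  }
}

assertMockAllowed("stripe"); // allowed: external service we do not control
// assertMockAllowed("postgres") would throw: it is inside the boundary
```

Treating an unlisted dependency as an error is deliberate. Adding a new dependency forces someone to decide which side of the boundary it sits on, which is the decision the record is meant to capture.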

The refactoring safety guarantee and deployment confidence threshold. Name what the test suite is designed to protect during refactoring and what it must show before deployment. "The integration test suite protects behavior at the HTTP API boundary: a refactoring that changes the SQL but not the HTTP response shape is safe without test updates. A refactoring that changes the HTTP response shape must be accompanied by updated integration tests before merging. Deployment requires 100% of integration and unit tests passing. We accept that flaky third-party mock behavior may occasionally produce false-positive CI failures; these are diagnosed by re-running the specific failing test in isolation."
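The threshold in that example is simple enough to express as a gate. This minimal sketch assumes a hypothetical per-tier summary of the CI run. The policy it encodes, that every unit and integration test must pass while other tiers do not block, is one reading of the example record and not a general rule.

```typescript
interface TierResult {
  tier: string;
  passed: number;
  failed: number;
}

// Tiers that must be fully green before deploy, per the example record.
const requiredTiers = ["unit", "integration"];

function canDeploy(results: TierResult[]): { allowed: boolean; reasons: string[] } {
  const reasons: string[] = [];
  for (const tier of requiredTiers) {
    const result = results.find((r) => r.tier === tier);
    if (result === undefined) {
      reasons.push(`${tier}: no results reported`);
    } else if (result.failed > 0) {
      reasons.push(`${tier}: ${result.failed} failing`);
    }
  }
  return { allowed: reasons.length === 0, reasons };
}

const decision = canDeploy([
  { tier: "unit", passed: 700, failed: 0 },
  { tier: "integration", passed: 1199, failed: 1 },
  { tier: "e2e", passed: 98, failed: 2 },
]);
// { allowed: false, reasons: ["integration: 1 failing"] }
```

Treating a missing tier as a failure is deliberate. A gate that passes when the integration job never ran is quietly enforcing a weaker strategy than the record describes.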

Revisitation conditions. Named triggers under which the decision should be re-evaluated. "Re-evaluate this strategy if: (1) CI run time exceeds 20 minutes — at that point, the integration-heavy approach is creating a deployment velocity constraint that requires either parallelism investment or strategy change; (2) the team's average SQL fluency decreases significantly — if new engineers consistently struggle with direct SQL, the productivity cost of avoiding an ORM abstraction may exceed the query plan control benefit; (3) the product adds a significant UI surface — the current strategy has no browser-level testing, which is appropriate for an API product; a consumer-facing web UI would require adding a browser-level test tier."
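Revisitation conditions are easier to act on when they are checked rather than remembered. This minimal sketch assumes hypothetical metrics collected from CI and from a periodic review. Only the first condition in the example record is numeric, so the other two are flags that a person sets during review.

```typescript
interface StrategyMetrics {
  medianCiMinutes: number;
  sqlFluencyConcern: boolean;
  hasConsumerWebUi: boolean;
}

// Each trigger from the example decision record, with the check that fires it.
const triggers: { name: string; fired: (m: StrategyMetrics) => boolean }[] = [
  { name: "CI run time exceeds 20 minutes", fired: (m) => m.medianCiMinutes > 20 },
  { name: "team SQL fluency has dropped", fired: (m) => m.sqlFluencyConcern },
  { name: "product has a consumer-facing web UI", fired: (m) => m.hasConsumerWebUi },
];

function firedTriggers(metrics: StrategyMetrics): string[] {
  return triggers.filter((t) => t.fired(metrics)).map((t) => t.name);
}

const toRevisit = firedTriggers({
  medianCiMinutes: 23,
  sqlFluencyConcern: false,
  hasConsumerWebUi: false,
});
// ["CI run time exceeds 20 minutes"]
```

Using the median run time rather than the slowest run is a choice. A single slow run should not reopen the decision, but a sustained trend should.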

The test strategy as a cross-team constraint

For teams that build infrastructure consumed by other teams, the test strategy is also a cross-team constraint. A platform team whose testing strategy doesn't include testing the contracts between their services and consumer services will produce integration failures that only appear when a consumer team deploys against the updated platform.

This is the infrastructure layer of platform team decision constraints: the platform team's testing strategy determines what guarantees they can make to consumer teams. A platform team that only unit-tests their service can tell consumer teams "our internal logic is correct." A platform team that integration-tests their service against a real database can tell consumer teams "our stored behavior is correct." A platform team that tests their service through a real API can tell consumer teams "our contract is correct." The test strategy determines which guarantee the platform team can honestly make.

Contract testing — testing that a provider service and its consumers agree on the shape of the API contract — is the tool that extends test coverage across team boundaries. A platform team that introduces contract testing is making a strategy decision that their testing scope extends to the interfaces consumed by other teams, not just their internal implementation. This is a significant architectural commitment — it requires the platform team to maintain consumer-facing test fixtures and to be notified when consumer teams change their usage patterns. It is a decision that should be documented as a decision, not adopted implicitly when a new engineer adds a Pact test because they came from a team that used Pact.
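The core idea of consumer-driven contract testing fits in a few lines. This is a deliberately simplified sketch, not Pact's API. The consumer records only the fields and types it actually reads, and the provider's build verifies a real response against every consumer's expectations. `ConsumerContract`, `verifyContract`, and the field names are hypothetical.

```typescript
type FieldType = "string" | "number" | "boolean";

// What one consumer team depends on: only the fields it reads.
interface ConsumerContract {
  consumer: string;
  expects: Record<string, FieldType>;
}

// Run in the provider's CI against a real response from the provider.
function verifyContract(contract: ConsumerContract, response: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(contract.expects)) {
    const expected = contract.expects[field];
    const actual = typeof response[field];
    if (actual !== expected) {
      problems.push(`${contract.consumer}: "${field}" expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}

const billingContract: ConsumerContract = {
  consumer: "billing-service",
  expects: { accountId: "string", balanceCents: "number" },
};

// The provider renamed balanceCents to balance in its latest build.
const providerResponse: Record<string, unknown> = { accountId: "acc_1", balance: 1200 };
const problems = verifyContract(billingContract, providerResponse);
// ['billing-service: "balanceCents" expected number, got undefined']
```

The direction is what matters. The provider's build fails before deploy, instead of the consumer team discovering the rename when it deploys against the updated platform.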

Finding test strategy decisions in AI chat

The WhyChose extractor surfaces test strategy decisions from four recognizable AI chat session shapes.

The setup session is the most important and the most commonly missed. In the first weeks of a project, engineers ask questions like "how should we structure our tests for this kind of API?" or "should we use Jest or Vitest for this?" or "how do we test database queries — should we use a test database or mock the queries?" These sessions contain the original strategy selection, often made before the team had enough context to make it deliberately. The answer to "how do we test database queries?" in week one is the decision that determines the test strategy for the next two years. It is almost never framed as a strategic decision at the time — it is framed as "how do I set up tests?" — and that framing is why it is rarely documented.

The mock debate session is the second most valuable type. These sessions are explicit: "should we mock the database in our unit tests, or use a real test database?" The team is often divided, with engineers importing different preferences from prior teams. The session contains the arguments on both sides and the conclusion. This is exactly the reasoning the decision record should preserve, and it is the reasoning that disappears when the engineers who had the debate leave the team.

The CI speed session surfaces the cost of the current strategy. "Our tests are taking 40 minutes to run" is an opening that leads to an explicit evaluation of the current test suite architecture. These sessions are valuable because they contain the first honest accounting of what the current strategy costs, which is the information the team needs to decide whether the cost is worth the benefit. The quarterly decision review is the systematic mechanism for surfacing these sessions: a pass over the last 90 days of chat history, looking for sessions that weighed the test suite's cost against its benefit. The review should also reach back to the setup sessions from the beginning of the project, which most teams have forgotten and which fall outside any 90-day window once the project is a few months old.

The escape hatch session is the most diagnostic. When a bug that the test suite didn't catch reaches production, the team's investigation of "why didn't our tests catch this?" explicitly names the test strategy's coverage gap. These sessions reveal the characteristic failure mode of the current strategy (mock drift, environment parity gaps, coverage holes) and often contain the team's decision about whether to change the strategy or accept the gap. The post-mortem-to-ADR pipeline is the natural path for surfacing these decisions. The post-mortem identifies what the test suite failed to catch, and the ADR documents the decision about whether and how to extend the strategy to close the gap.
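To make the four shapes and the review window concrete, here is a deliberately naive keyword sketch. It is not how the WhyChose extractor works. The cue phrases, the `Session` type, and `quarterly_review` are all hypothetical, and real sessions need much more than substring matching. The sketch keeps setup sessions regardless of age, because they hold the original strategy selection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical cue phrases per session shape, checked in this order.
SHAPE_CUES = {
    "escape_hatch": ("didn't our tests catch", "reached production", "post-mortem"),
    "ci_speed": ("tests are taking", "ci is slow", "minutes to run"),
    "mock_debate": ("should we mock", "mock the database", "real test database"),
    "setup": ("structure our tests", "jest or vitest", "set up tests"),
}


@dataclass
class Session:
    started: datetime
    text: str


def classify(text: str) -> Optional[str]:
    """Return the first shape whose cue appears in the text, or None."""
    lowered = text.lower()
    for shape, cues in SHAPE_CUES.items():
        if any(cue in lowered for cue in cues):
            return shape
    return None


def quarterly_review(
    sessions: List[Session], now: datetime, window_days: int = 90
) -> List[Tuple[str, Session]]:
    """Collect strategy-relevant sessions from the last `window_days` days.

    Setup sessions are kept whatever their age, because the window alone
    would drop the project's original strategy selection.
    """
    cutoff = now - timedelta(days=window_days)
    found = []
    for session in sessions:
        shape = classify(session.text)
        if shape is None:
            continue
        if session.started >= cutoff or shape == "setup":
            found.append((shape, session))
    return found
```

A keyword pass like this only finds candidates. Someone still has to read each session and write down the decision it contains.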

What the test strategy record protects

The test strategy decision record protects three things that are lost when the strategy stays implicit.

It protects the deployment confidence signal. A team whose CI is green because 2,000 unit tests passed has a specific kind of confidence: the internal logic is correct, provided the mocks still reflect how the real dependencies behave. A team whose CI is green because 800 integration tests passed against a real database has a different kind of confidence: the system behaves correctly against real infrastructure, for the paths those tests exercise. Neither kind is better in the abstract, but they are different, and the team should know which kind it has before deploying. The decision record makes this explicit. Without it, CI green is just CI green, and the team doesn't know what kind of coverage it is standing on.
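One way to keep that knowledge attached to the build is to hold the record's key fields as structured data and print them next to the test count. This is a minimal sketch. The `TestStrategyRecord` fields and `ci_summary` helper are hypothetical and do not come from any ADR tool. The example values describe a mocked-database strategy, with the gaps taken from the bug classes named earlier.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TestStrategyRecord:
    """Hypothetical structured form of a test strategy decision record."""

    mock_boundary: str            # where tests switch from real implementations to fakes
    green_ci_means: str           # the kind of confidence a passing build gives
    known_gaps: Tuple[str, ...]   # bug classes the suite cannot catch by construction
    revisit_when: str             # concrete condition that reopens the decision


def ci_summary(record: TestStrategyRecord, passed: int) -> str:
    """Render a passing build together with what 'green' actually covers."""
    gaps = "; ".join(record.known_gaps)
    return (
        f"{passed} tests passed. Green means: {record.green_ci_means}. "
        f"Not covered: {gaps}."
    )


MOCKED_DB = TestStrategyRecord(
    mock_boundary="database connection is mocked in all unit tests",
    green_ci_means="application logic is correct if the mocks match the real database",
    known_gaps=("query plan regressions", "constraint violations",
                "transaction isolation failures"),
    revisit_when="a production incident traces back to real database behavior",
)

if __name__ == "__main__":
    print(ci_summary(MOCKED_DB, passed=2000))
```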

It protects refactoring work from unexpected test rewrites. An engineer planning a refactor of the database access layer needs to know how the tests will respond to the change. If the tests mock the database, the mocks must be updated along with the code, and tests coupled to the old call pattern will break even when behavior is unchanged. If the tests hit a real database, they will report whether behavior changed, without requiring manual mock updates. Without the decision record, the engineer discovers this mid-refactor. Either everything passes because the mocks didn't change, or 50 tests break because they were testing the implementation rather than the behavior. The record gives them this information before the refactor starts, so they can plan accordingly.
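The difference shows up even in a toy example. This sketch uses only the standard library, with `unittest.mock` on one side of the boundary and an in-memory SQLite database on the other. The query functions and check names are hypothetical. Both implementations return the same result, but the refactored one issues different SQL. The mocked check fails on the refactor because it asserts on the exact call, while the real-database check passes because it asserts only on the result.

```python
import sqlite3
from unittest.mock import MagicMock

ORIGINAL_SQL = "SELECT COUNT(*) FROM users WHERE active = 1"


def count_active_users(conn) -> int:
    """Original implementation of the (hypothetical) data access function."""
    return conn.execute(ORIGINAL_SQL).fetchone()[0]


def count_active_users_refactored(conn) -> int:
    """Refactor with the same behavior but different SQL."""
    (count,) = conn.execute("SELECT COUNT(id) FROM users WHERE active <> 0").fetchone()
    return count


def mocked_check(impl) -> None:
    # Mock boundary at the connection: the check pins the exact SQL string,
    # so it is coupled to the implementation rather than the behavior.
    conn = MagicMock()
    conn.execute.return_value.fetchone.return_value = (3,)
    assert impl(conn) == 3
    conn.execute.assert_called_once_with(ORIGINAL_SQL)


def real_db_check(impl) -> None:
    # Real (in-memory SQLite) database: the check asserts only on the result,
    # so a behavior-preserving rewrite of the query still passes.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER NOT NULL)")
    conn.executemany("INSERT INTO users (active) VALUES (?)", [(1,), (1,), (0,), (1,)])
    try:
        assert impl(conn) == 3
    finally:
        conn.close()
```

SQLite in memory is a stand-in here. If production runs a different database engine, a real-database suite only catches engine-specific bugs when it runs against that engine.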

It protects the test strategy from implicit supersession by accumulation. The most common way a test strategy changes is not through a deliberate decision but through the gradual accumulation of exceptions. One engineer adds a real database test because they prefer it, another adds more mocks because the database tests are slow, and a third adds E2E tests because they want browser coverage. Over time, the suite reflects three strategies in an uneasy mixture rather than one coherent strategy. The decision record doesn't prevent this from happening, since engineers will still import their preferences, but it makes the drift visible. When the test suite no longer matches the documented strategy, the team can see the divergence and decide whether to update the record (to document the new direction) or refactor the tests (to bring the suite back into alignment). Without the record, the drift stays invisible until the inconsistency grows large enough to become a maintenance problem.