Blog · 2026-06-06 · ~11 min read

How many decisions should an engineering team make per quarter — and what does "too few" look like?

A four-person team extracted their first quarter of AI chat history and found three decisions. Not three dozen — three. The product manager asked whether the team was moving too slowly. The engineering lead said that couldn't be right, because they'd shipped two new integrations and a full authentication overhaul in the same quarter. The discrepancy was real, but the interpretation was wrong. The team hadn't made three decisions in that quarter. They'd documented three. The rest — the choice of OAuth library after evaluating four candidates, the session token format decision that touched the mobile API contract, the decision to defer multi-tenant support after a ninety-minute whiteboard session — had happened in AI chat and never made it to the log. The extraction gap, not the decision gap, was the problem.

TL;DR

For a team of 4–8 engineers in active development: 5–10 durable decisions per quarter is a realistic baseline. Very early teams (1–3 engineers): 3–6. Scaling teams (10–25 engineers): 8–15 per squad. "Too few" is a diagnostic signal, not a performance verdict — it almost always means the log is dormant, not the team. The distinction matters: a quiet quarter has low activity and low decisions; a dormant log has high activity and low decisions. High activity with a near-empty log is decision debt accumulating. The WhyChose extractor gives you an independent measure — how many decision-shaped conversations happened in AI chat vs. how many made it to the record. That gap is the number to watch.

First: what counts as a decision?

The volume question is ambiguous until you're precise about what you're counting. Not every technical choice is a decision in the record-worthy sense. Engineers make hundreds of small choices per week — variable names, test structure, how to phrase an error message. None of those belong in a decision log. The choices that belong in a log meet three criteria simultaneously:

Using date-fns instead of moment.js without deliberation is not a decision. Evaluating date-fns, moment.js, and native Date, choosing date-fns because moment is deprecated and the team has a stated policy against large bundle dependencies — that's a decision. The difference is whether the choice is recoverable from reading the code (most tactical choices are) versus requiring knowledge of what was rejected and why (most durable choices aren't).

The standard ADR template captures this with the Alternatives Considered section. The decision log template has a lighter-weight version for choices that don't warrant the full ADR ceremony. Either way, the minimum viable record includes: what was decided, what was rejected, and the constraint that drove the selection. Without all three, the record is a history note, not a decision record — it tells you what happened but not enough to prevent future re-litigation.
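
As a minimal sketch of that three-part minimum, assuming a hypothetical `DecisionRecord` type (the class and field names are illustrative, not the ADR template's or the extractor's actual schema), a completeness check could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical record shape; the field names are illustrative assumptions."""
    decided: str                                       # what was decided
    rejected: list[str] = field(default_factory=list)  # what was rejected
    constraint: str = ""                               # the constraint that drove the selection

    def is_decision_record(self) -> bool:
        # All three parts must be present; otherwise it is only a history note.
        return bool(self.decided.strip() and self.rejected and self.constraint.strip())


history_note = DecisionRecord(decided="Use date-fns")
full_record = DecisionRecord(
    decided="Use date-fns",
    rejected=["moment.js", "native Date"],
    constraint="moment is deprecated; team policy against large bundle dependencies",
)
```

Here `history_note` fails the check and `full_record` passes it, which mirrors the date-fns example above.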

Expected volume by team size and stage

These figures come from the 1,200-chats extraction walkthrough and from running the extractor on smaller, earlier-stage histories shared by engineers who tested the tool before launch. They're rough benchmarks — not targets. Real quarters vary based on product stage and what the team is actually working on.

Very early stage (1–3 engineers, pre-product-market-fit): 3–6 durable decisions per quarter. Early teams make decisions at high frequency — nearly every hour — but most are below the durability threshold. Technology selections dominate (language, framework, database, hosting, auth approach). These have long half-lives if the product survives, but the product's survival is also uncertain, so over-documenting early choices is a real cost. The extractable decisions in this range are the ones that will constrain the architecture for years: data model shape, API convention, authentication model. Everything else is likely to be replaced if the product finds traction.

Small team, post-launch (4–8 engineers, active iteration): 5–10 per quarter. The mix shifts. Technology selections continue, but process and workflow decisions start appearing — code review policy, deployment cadence, on-call rotation structure, how incidents get documented. Team growth creates decision occasions: the fourth engineer joins and the team has to decide whether pair programming is still the default. Each of these is record-worthy because a fifth engineer who joins three months later shouldn't have to re-ask or re-derive the answer.

Mid-size, scaling (10–25 engineers): 8–15 per squad per quarter, across two to four squads. At this scale, architecture decisions multiply because the product surface area has expanded and different squads are making adjacent choices that need coordination. Service boundary decisions, API contract decisions, shared data model decisions — each is a decision occasion where the record prevents inter-squad re-litigation. Org-level decisions (team structure, hiring criteria, performance review process) are fewer per quarter but higher-stakes, and they're often made entirely in management AI chat sessions that never surface to engineering records.

At scale (25+ engineers): Hard to give a meaningful team-level figure. Per-squad numbers look similar to mid-size. The variable is how many cross-squad architectural decisions happen — in a scaling quarter, those can spike significantly. The decision log at this stage functions less as a per-engineer reference and more as an institutional audit trail for engineering leadership and new hires doing onboarding architecture reviews.
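
For teams that want these ranges in a script, here is a rough lookup sketch. The function name `baseline_range` is hypothetical, and two choices are assumptions rather than figures from the benchmarks: a team of exactly 25 is treated as mid-size, and sizes the ranges above don't cover (9 engineers, or more than 25) return `None` instead of a guessed number.

```python
from typing import Optional, Tuple


def baseline_range(engineers: int) -> Optional[Tuple[int, int]]:
    """Return the (floor, ceiling) of durable decisions per quarter quoted above.

    For 10-25 engineers the range is per squad. Returns None where the post
    gives no figure (9 engineers, or more than 25).
    """
    if engineers < 1:
        raise ValueError("team size must be at least 1")
    if engineers <= 3:
        return (3, 6)    # very early stage
    if 4 <= engineers <= 8:
        return (5, 10)   # small team, post-launch
    if 10 <= engineers <= 25:
        return (8, 15)   # mid-size, per squad
    return None
```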

The "too few" diagnostic

The signal threshold worth acting on is an extraction count 50% or more below the floor for your team size, for two consecutive quarters. A single quiet quarter isn't alarming; in most cases it has an obvious explanation. Two consecutive quarters of low extraction with normal shipping activity is a dormant log.
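
The threshold can be written down as a small check. This is a sketch under one assumption: "50% or more below the floor" is read as a count at or under half the floor. The function name `below_signal_threshold` is hypothetical.

```python
from typing import Sequence


def below_signal_threshold(quarterly_counts: Sequence[int], floor: int) -> bool:
    """True when the two most recent quarters are each 50% or more below the floor."""
    if len(quarterly_counts) < 2:
        return False  # a single quarter is not enough to call it
    cutoff = floor * 0.5
    return all(count <= cutoff for count in quarterly_counts[-2:])
```

With a floor of 5, the counts `[6, 2, 2]` trip the threshold and `[2, 6]` do not. The check says nothing about shipping activity, so it only flags a quarter pair for a closer look.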

The key distinction is the relationship between decision volume and activity level:

Quiet quarter: Low activity. The team was in maintenance mode, executing against a well-specified roadmap, or consolidating after a rapid-build phase. Few novel architectural questions arose because the team was implementing known solutions. Low decisions, low extracted decision-shaped conversations, relatively few shipped features. Everything is consistent — the log is accurate, it's just thin.

Dormant log: Normal to high activity. Features shipped, services refactored, team grew, a technology was replaced. But the decision log is near-empty. The decisions were made; they happened in AI chat, in Slack threads, in whiteboard sessions that were never transcribed. The extraction gap — how many decision-shaped conversations happened vs. how many made it to the log — is large. This is the dangerous case because it's invisible from the outside. The new-CTO scenario is almost always a dormant-log scenario: the team was active, they made real choices with real trade-offs, but six months later those choices look arbitrary because the log has nothing.

The most diagnostic check: look at what shipped and ask whether any of it required non-obvious choices. A feature that added external OAuth login without any OAuth library selection record is a dormant-log signal. A database migration with no record of why that migration approach was chosen over alternatives is a dormant-log signal. A new service with no API contract decision is a dormant-log signal. If you can find three or more of these in a single quarter, the extraction gap is your problem, not your decision frequency.
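
The shipped-list check can be sketched as a simple count. Everything named here (`ShippedItem`, its fields, the two functions) is hypothetical; judging whether an item needed a non-obvious choice remains a human call, and the code only tallies the result against the three-or-more rule above.

```python
from dataclasses import dataclass


@dataclass
class ShippedItem:
    name: str
    needed_nonobvious_choice: bool  # judged by a human reviewer
    has_decision_record: bool


def dormant_log_signals(shipped: list[ShippedItem]) -> list[str]:
    """Names of shipped items that needed a non-obvious choice but have no record."""
    return [s.name for s in shipped
            if s.needed_nonobvious_choice and not s.has_decision_record]


def extraction_gap_suspected(shipped: list[ShippedItem], threshold: int = 3) -> bool:
    # Three or more unrecorded choices in one quarter is the signal named above.
    return len(dormant_log_signals(shipped)) >= threshold


quarter = [
    ShippedItem("external OAuth login", True, False),
    ShippedItem("database migration", True, False),
    ShippedItem("new billing service", True, False),
    ShippedItem("copy tweak on settings page", False, False),
]
```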

What "rejection records" tell you about log completeness

One of the most reliable completeness checks is the ratio of "yes" decisions to "not building this" decisions. As explored in the essay on decisions that never get written down, the most valuable and most commonly missing category in a decision log is the deliberate rejection: the feature the team evaluated and chose not to build, the architecture approach that was proposed and turned down, the integration that was scoped and deferred.

A healthy log has roughly equal numbers of "chose X" and "didn't choose Y" records. If your quarterly extraction returns fifteen "chose X" records and zero rejection records, the triage is incomplete. Engineers document what they built; they rarely go back to document what they decided not to build. But AI chat captures rejection reasoning in full — engineers work through "should we build real-time sync?" more extensively in chat than in any other medium, precisely because the answer is non-obvious and the trade-offs need to be mapped out.

During the quarterly triage, the Dismiss bucket is where under-examined rejection records accumulate. If the Dismiss bucket is large and the Promote bucket has no rejection records, that's a calibration error — scan the Dismiss entries for conversations where the conclusion was "no" rather than "yes." The reversal markers that the extractor uses (language like "we decided against," "we're not going to," "the reason we chose not to") flag these conversations specifically, but the triage pass is where they get correctly classified as record-worthy rejections rather than noise.
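
To make the Dismiss-bucket scan concrete, here is a minimal sketch that searches dismissed conversations for the three phrases quoted above. It is not the extractor's actual implementation: the function name and the `dict` of conversation id to text are assumptions, and a real marker list would likely be longer and would need to handle curly apostrophes.

```python
import re

# The three phrases quoted above; matching is case-insensitive.
REVERSAL_MARKERS = (
    "we decided against",
    "we're not going to",
    "the reason we chose not to",
)
_MARKER_RE = re.compile("|".join(re.escape(m) for m in REVERSAL_MARKERS), re.IGNORECASE)


def find_rejection_candidates(dismissed: dict[str, str]) -> list[str]:
    """Return ids of dismissed conversations whose text contains a reversal marker."""
    return [cid for cid, text in dismissed.items() if _MARKER_RE.search(text)]


dismissed = {
    "c1": "After mapping out the trade-offs, we decided against real-time sync.",
    "c2": "How do I format a date in the settings page?",
    "c3": "We're not going to support multi-tenant accounts this year.",
}
```

The ids returned are only candidates; the triage pass still decides which of them are record-worthy rejections.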

The overcrowding failure mode

The volume question isn't only about floors. Over-promotion — logging everything, or logging choices below the durability threshold — is a failure mode too, and it's more visible than dormancy because it shows up as a large log that nobody uses.

The symptom of an overcrowded log: engineers re-litigate decisions that are already in the record, not because they don't know the record exists, but because the record is so full of low-signal entries that searching it returns too many results. The signal-to-noise ratio has fallen below the threshold where consulting the record is faster than reconstructing the answer from scratch. At that point, the log has the opposite of its intended effect — it adds a research step without subtracting the re-derivation step.

The governor is the durability filter, applied consistently during triage. A choice that would take a new engineer thirty seconds to reconstruct from reading the code doesn't belong in the decision log. A choice that would take a new engineer an hour to reconstruct — because it requires knowing what alternatives were evaluated and why the specific constraints at the time pointed toward this one — does.

The reason ADR practices go stale is often over-promotion followed by log abandonment: the team starts logging everything, the log becomes unwieldy, the ceremony cost exceeds the perceived benefit, and engineers stop maintaining it. The two-tier practice that actually holds — extracted records for the majority, hand-written ADRs only for the load-bearing minority — prevents over-promotion by design. Extracted records are descriptive; ADRs are normative. The distinction keeps the ADR tier lean.

Decision types and their natural frequencies

Not all decision categories appear at the same rate. Understanding the expected frequency by type helps with calibration — if one category is systematically missing, it points to a specific extraction or triage gap.

Technology selections (library choice, service choice, language for a new service) appear one or two times per major initiative. Not per sprint — per initiative. Most sprints don't introduce new technology; they use technology already selected. When a new initiative starts, technology selection decisions cluster. A quarter with two or three new initiatives might have four to six technology selection decisions; a consolidation quarter might have none.

Architecture decisions (service boundary, data model shape, API contract, authentication model) are the highest-stakes category. One to two per quarter for a team working on an established product; three to five during active scaling phases or major refactors. These are the decisions with the longest half-lives — architectural invariants often hold for eighteen months or more before being questioned. They warrant the full ADR treatment: complete Alternatives Considered section, explicit Consequences block, a revisit condition if the decision was made under constraints likely to change.

Process and workflow decisions (code review policy, deployment cadence, incident response protocol, how the team handles tech debt) appear one to two times per quarter, with clusters around team composition changes. A new hire who joins with strong opinions about review process creates a decision occasion. A team that grows past the size where informal protocols work creates several at once. These decisions have the shortest half-lives (four months on average before the first "should we revisit this?" conversation), which makes them a reliable staleness target in the quarterly obsolescence check.

Rejection records ("not building this" decisions) should appear in rough proportion to the positive decisions — perhaps slightly fewer, since some rejection reasoning is exploratory rather than settled. If they're near zero, the triage is missing them. If they're greater than the positive decisions, the team may be over-analyzing options without closing them, which is a different problem.
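
A rough sketch of that proportion check follows. The function name and the returned labels are hypothetical, and "near zero" is simplified to exactly zero, which is an assumption; a team could loosen it.

```python
def rejection_balance(chose: int, rejected: int) -> str:
    """Classify a quarter's mix of positive decisions and rejection records."""
    if chose < 0 or rejected < 0:
        raise ValueError("counts must be non-negative")
    if chose == 0 and rejected == 0:
        return "empty"
    if rejected == 0:
        return "triage is missing rejections"
    if rejected > chose:
        return "possible over-analysis"
    return "ok"
```

Fifteen "chose X" records with zero rejections comes back as "triage is missing rejections", matching the calibration error described earlier.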

Decision debt and its compound effect

Consecutive quarters with a dormant log, whether the decisions were never captured or the log simply wasn't maintained, create decision debt. Each undocumented choice is a future re-litigation risk. The debt compounds because decisions are often downstream of each other: the API contract decision constrains the client-caching decision, which constrains the mobile performance decision. If the API contract decision isn't recorded, the team that makes the client-caching decision later can't see the constraint, may make a choice that violates it, and discovers the conflict only when both decisions collide in production.

Decision debt is invisible until it surfaces. The moments when it becomes visible are recognizable: the post-mortem where nobody can explain why the system is designed the way it is, the architecture review where a proposed change is blocked by an undocumented invariant that only one engineer knows about, the new-engineer onboarding where "why did we pick Postgres over Mongo?" gets answered with a shrug instead of a link. Each of these moments costs time — reconstruction is slow and incomplete. But none of them appear in a sprint tracker or on a roadmap. Decision debt doesn't delay features the way technical debt does. It delays understanding, which delays confident decision-making, which delays features — but the causal chain is long enough that the connection is rarely made explicitly.

The practical implication: two consecutive quarters with a dormant log is worth a dedicated session to recover the gap. Run the extractor on the last six months of chat history rather than the standard quarter. The ChatGPT history export and the Claude conversation export both support arbitrary date ranges — there's no need to limit the extraction to the current quarter if you have a known gap. The resulting batch triage will be larger than a standard quarterly pass, but the recovery is worth the overhead. Undocumented decisions from six months ago are still recoverable today; they won't be recoverable in another six months when the context has faded further.

Using extraction data to calibrate the practice

The extractor is useful not just as a documentation tool but as a diagnostic. Run it quarterly, compare the output to your existing log, and measure the gap. If the gap is consistently small — extracted count minus logged count is near zero — your practice is healthy. If the gap is growing quarter over quarter, your documentation cadence isn't keeping up with your decision cadence.
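
The gap measure is plain subtraction, sketched here with hypothetical function names. One assumption is baked in: "growing quarter over quarter" is read strictly, so every quarter's gap must exceed the one before it.

```python
def extraction_gaps(extracted: list[int], logged: list[int]) -> list[int]:
    """Per-quarter gap: extracted count minus logged count."""
    if len(extracted) != len(logged):
        raise ValueError("need one extracted and one logged count per quarter")
    return [e - l for e, l in zip(extracted, logged)]


def gap_is_growing(gaps: list[int]) -> bool:
    """True when each quarter's gap is larger than the previous quarter's."""
    return len(gaps) >= 2 and all(later > earlier for earlier, later in zip(gaps, gaps[1:]))


gaps = extraction_gaps(extracted=[6, 9, 12], logged=[5, 5, 4])
```

In this made-up example the gaps are 1, 4, and 8, so the documentation cadence is falling further behind each quarter.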

Three specific calibration signals from extraction output:

Extraction count well above logged count. The decisions are happening; they're just not getting documented. The fix is triage discipline: add a standing item to the weekly planning meeting to flag decisions made this week that should be logged before they're forgotten. The window for effective logging closes quickly: within a week of a decision, the context that made it clear has started to fade from working memory.

Extraction count near zero in an active quarter. The decisions may have happened in non-AI channels. Slack threads where a library was evaluated and chosen, a whiteboard session that was photographed and filed in a drive folder that nobody checks, a Zoom call where the architecture was sketched and the decision was made without a follow-up write-up. AI chat isn't the only decision surface; it's just the one where the reasoning is most faithfully preserved. If your team uses AI chat extensively but extraction returns near-zero results, check the extractor's hit-rate output against the raw conversation count — the extractor's hit rate on raw exports is around 3.1%, which means 100 conversations should yield roughly 3 decision candidates. Zero from 100 conversations points to either a very atypical conversation set or an extraction issue.

Extraction count matches logged count, but both are low. This is the genuine quiet quarter. No intervention needed unless the team also shipped significant features — in which case it's worth asking whether the shipped features required zero novel choices, or whether the novel choices happened and were obvious enough that nobody thought to write them down. Often it's the latter: a choice that seemed obvious at the time becomes a source of confusion six months later when the constraint that made it obvious is no longer shared knowledge.
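
The three signals can be sketched as one classifier. The cut-offs (`near_zero`, `wide_gap`, `low`) are placeholder assumptions, since "well above", "near zero", and "low" have no numeric definition here; the function name and labels are hypothetical too.

```python
def calibration_signal(extracted: int, logged: int, active_quarter: bool, *,
                       near_zero: int = 1, wide_gap: int = 3, low: int = 3) -> str:
    """Map a quarter's counts onto the three calibration signals described above."""
    if active_quarter and extracted <= near_zero:
        # Decisions may live in Slack, whiteboards, or calls, or extraction failed.
        return "check other channels or the extractor"
    if extracted - logged >= wide_gap:
        return "decisions happening but not logged"
    if abs(extracted - logged) <= 1 and max(extracted, logged) <= low:
        return "quiet quarter"
    return "no signal"
```

A "quiet quarter" result in a quarter that also shipped significant features still deserves the follow-up question above about choices that seemed too obvious to write down.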

The WhyChose extractor runs locally on your ChatGPT export or Claude conversation export — nothing is sent to a server. The output gives you both the extracted candidates and a summary of the hit rate against the raw conversation count. That hit rate is your baseline for what "normal extraction from this chat history" looks like, making quarter-over-quarter comparison meaningful.
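
The hit-rate arithmetic is simple enough to check by hand or in a few lines. This sketch uses the roughly 3.1% figure quoted above as a default; the function names are hypothetical and are not part of the extractor's output format.

```python
TYPICAL_HIT_RATE = 0.031  # the roughly 3.1% figure quoted in this post


def hit_rate(candidates: int, conversations: int) -> float:
    """Fraction of raw conversations that produced a decision candidate."""
    if conversations <= 0:
        raise ValueError("conversation count must be positive")
    return candidates / conversations


def expected_candidates(conversations: int, rate: float = TYPICAL_HIT_RATE) -> float:
    """Rough number of decision candidates to expect from a raw export."""
    return conversations * rate
```

At that rate, 100 conversations give an expectation of about 3 candidates and 1,200 give about 37, so a result of zero from a large export is worth investigating.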

Practical calibration steps

If this quarter's extraction is below the floor for your team size:

  1. Check the shipped list first. List everything that shipped this quarter — features, migrations, process changes. For each item, ask: did this require any technology or architecture choice that isn't obvious from reading the code? If yes, that choice should be in the log. Each missing one is a recovery target.
  2. Look at the Dismiss bucket from the last triage pass. Dismissed extraction candidates that were "not actionable" sometimes contain decision records that were de-prioritized at triage time. Revisit them against the shipped list — a Dismissed conversation about a library choice that shipped this quarter was probably under-triaged.
  3. Set a decision-awareness prompt for the next sprint. Before closing any PR that introduces a non-obvious technology choice, the author notes whether the choice warrants a decision record. Thirty seconds at PR review is cheaper than thirty minutes of recovery three months later.
  4. If two consecutive quarters are below floor: run a six-month recovery extraction. Use the longer date range in the export to capture the full gap in one batch. The triage will be heavier, but the alternative is compounding debt.

If this quarter's extraction is above the ceiling and the log is growing unwieldy:

  1. Apply the durability filter strictly at the Park/Promote boundary. If the choice wouldn't confuse a new engineer who joined next month, it goes to Park or Dismiss, not Promote. Park is for "maybe later" — check Park entries again in two quarters and if none of the reasons to promote have materialized, Dismiss them.
  2. Check whether the ADR tier has grown faster than the decision-log tier. The ADR format is for load-bearing decisions. If architectural invariants are the only thing getting full ADR treatment, the ratio is healthy. If process decisions and library choices are getting ADR ceremony, the ceremony cost is too high and the durability signal is too low.
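
The Park review from step 1 can be sketched as a small aging pass. The names `ParkEntry` and `review_park` are hypothetical, quarters are represented as a running integer index, and the two-quarter limit is taken from the step above.

```python
from dataclasses import dataclass


@dataclass
class ParkEntry:
    title: str
    parked_in_quarter: int           # running quarter index: 0, 1, 2, ...
    reason_to_promote: bool = False  # set by a reviewer if a reason has materialized


def review_park(entries: list[ParkEntry], current_quarter: int, max_age: int = 2):
    """Split Park entries into (promote, dismiss, keep) lists of titles."""
    promote, dismiss, keep = [], [], []
    for entry in entries:
        if entry.reason_to_promote:
            promote.append(entry.title)
        elif current_quarter - entry.parked_in_quarter >= max_age:
            dismiss.append(entry.title)  # parked for two quarters with no reason to promote
        else:
            keep.append(entry.title)
    return promote, dismiss, keep


park = [
    ParkEntry("logging format", parked_in_quarter=0),
    ParkEntry("queue library", parked_in_quarter=0, reason_to_promote=True),
    ParkEntry("lint rules", parked_in_quarter=1),
]
```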

Volume calibration is a quarterly activity, not a one-time setup. The right number changes as the team grows, the product matures, and the decision landscape shifts. The quarterly triage workflow builds this calibration in automatically — finishing a triage pass with fewer than three promoted records triggers a diagnostic review, not just an acceptance that this was a light quarter.

The gap between logged count and extraction count is the metric that matters. Run the open-source WhyChose extractor on your ChatGPT or Claude history and compare the result to your existing log. The difference (decisions that happened but weren't documented) is the debt that accretes into the new-CTO problem six months from now. The extractor is MIT-licensed and runs locally, with nothing sent to a server. Join the waitlist for the hosted version with team sharing, multi-quarter comparison, and export to Notion and Linear.