Blog · 2026-06-05 · ~10 min read
I built an open-source tool to extract decisions from ChatGPT/Claude. Here's every regex I used — and every one I had to throw out.
The WhyChose extractor is about 500 lines of dependency-free Node. It takes a ChatGPT conversations.json or a Claude export and emits structured decision records — the "I chose X over Y because Z" moments buried in years of chat history. This is the story of building it: five heuristics that went in the bin, four that survived, and what each failure mode revealed about how engineers actually think with AI.
The short version
The extractor uses a two-pass architecture: a question pass that finds user messages matching decision-initiation patterns, then a commit pass that looks for a user commit phrase within the next 6 messages. Trade-off markers bump confidence; reversal markers disqualify the record. Five other approaches were tried first — sentence-length thresholds, named-entity recognition, first-person-verb matching, message-count filtering, and question-answer adjacency detection — and all five produced unacceptable false-positive rates before being discarded. The post explains each failure and why the two-pass shape is the right floor for v1.
What the extractor is actually doing
The job is simpler than it sounds. A decision-in-chat looks like this: the user asks a question that frames a choice between named options, the conversation explores the options, the user commits to one. The output for that thread should be a record with five fields: the question, the chosen option, the alternatives that were named, a snippet of the rationale, and a confidence level.
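As a rough illustration, a record with those five fields could look like the object below. The field names and the example values are assumptions made for this sketch; they are not taken from the extractor source.

```javascript
// Hypothetical record shape: field names are illustrative, not the extractor's actual schema.
const exampleRecord = {
  question: "Should I use Postgres or MongoDB for the events table?",
  chosen: "Postgres",
  alternatives: ["MongoDB"],
  rationale: "The team already knows it and the data is relational.",
  confidence: "medium", // one of "low" | "medium" | "high"
};

// Minimal shape check covering the five fields described above.
function isValidRecord(r) {
  return (
    typeof r.question === "string" &&
    typeof r.chosen === "string" &&
    Array.isArray(r.alternatives) &&
    typeof r.rationale === "string" &&
    ["low", "medium", "high"].includes(r.confidence)
  );
}
```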
That's the whole problem. The hard part isn't the data model — it's distinguishing decision threads from the sea of other things engineers use AI for: debugging, explaining concepts, writing boilerplate, reviewing code, asking for definitions. In a 1,200-conversation export, the real hit rate is around 3%. The remaining 97% needs to be discarded cleanly, without requiring the user to read a false-positive list longer than the real output.
I considered two approaches: train a classifier (ML, probably a small fine-tuned model), or write deterministic pattern rules (NLP lite, no dependencies). The classifier approach would have better recall — it would catch implicit decisions, non-English threads, and the slow-burn consensus that develops over 30 messages. But the product constraint ruled it out immediately: the extractor's core promise is that it runs fully offline with no network calls. Export your conversations.json. Run node bin/extractor.js conversations.json. Nothing leaves your machine. A classifier that requires a model download, inference time proportional to export size, and a GPU or significant CPU would break the "five-minute local run" promise. So: deterministic patterns, conservative by design, known misses documented openly in patterns.md.
The five approaches that failed
1. Sentence-length threshold
The first thing I tried was the simplest possible heuristic: count the tokens in a user message. Hypothesis: decision-framing messages are longer than average — an engineer asking "should I use Postgres or MongoDB" and explaining their constraints writes more than one asking "how do I sort a list."
I set the threshold at 15 tokens (rough equivalent: about two short sentences) and ran it against a sample of 60 conversations that I'd manually tagged for whether they contained real decisions.
The recall was fine: 87% of real decision threads had at least one user message above 15 tokens. But the precision was catastrophic. Every long debugging session, every "can you explain X in detail," every code review prompt: all above threshold, all feeding false positives into the output. In the 60-conversation sample, a 15-token filter produced 41 "candidate threads." Fourteen contained real decisions. Twenty-seven were noise. That means about 66% of the candidates (27 of 41) were false positives.
The deeper problem is that length and decision-intent are correlated only weakly and in one direction. Most long messages aren't decisions. Discarding this heuristic was the right call, and it happened after one afternoon of testing.
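A minimal sketch of that discarded heuristic, together with the arithmetic from the sample. Whitespace tokenization and the function names are assumptions; the post does not say how tokens were counted.

```javascript
// Assumption: tokens are whitespace-separated words.
function countTokens(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

const TOKEN_THRESHOLD = 15;

// A message is a "candidate" if it is longer than the threshold.
function isLengthCandidate(text) {
  return countTokens(text) > TOKEN_THRESHOLD;
}

// Figures from the 60-conversation sample described above.
const candidates = 41;
const realDecisions = 14;
const falsePositiveShare = (candidates - realDecisions) / candidates; // 27 / 41
```

The filter cannot tell a long debugging prompt from a long decision prompt, which is why 27 of the 41 candidates were noise.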
2. Named-entity recognition for technology terms
The second approach was more plausible: decisions tend to name technologies, frameworks, or services. "Should I use React or Vue" contains two named tech entities. "How do I sort a list" contains none.
I built a lexicon of ~300 technology names — databases, cloud services, languages, frameworks, libraries — and tagged messages that contained two or more names from the lexicon as candidate decision threads.
This worked better on the obvious cases: "Postgres vs MongoDB," "React vs Svelte," "Terraform vs Pulumi" all fired correctly. But the failure modes were worse than with length:
- Stack-listing false positives. "Can you help me with a React + Next.js app that uses Postgres on Supabase and deploys to Vercel?" contains five lexicon entries and zero decision intent. It's a project description, not a comparison.
- The lexicon maintenance problem. The tech ecosystem moves fast. Within two weeks of freezing the lexicon, I'd already missed Bun, Drizzle, Turso, and Astro as they became common in chat sessions. A lexicon-based approach requires constant updates or it silently degrades.
- Generic decisions aren't technology comparisons. "Should we hire a senior engineer or two mid-level ones?" is a real decision that doesn't contain a single tech name. Neither does "should I launch the beta now or wait until the auth is solid?" The lexicon approach has a structural blind spot on hiring, product, and timing decisions — which are often the highest-stakes ones.
I kept the tech-name list as a tag derivation layer (it's used in patterns.md to bucket decisions into tags like database, frontend, infra) but removed it from the detection path entirely. Tag inference after detection is fine. Detection gated on tech terms is too narrow.
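The sketch below shows both roles of the lexicon: as a detector (two or more hits) it fires on a stack listing and stays silent on a hiring decision, while as a tag source after detection it is harmless. The eight entries and their tag assignments are illustrative assumptions, not the real ~300-entry list.

```javascript
// Tiny illustrative lexicon; entries and tags are assumptions for this sketch.
const LEXICON = new Map([
  ["postgres", "database"],
  ["mongodb", "database"],
  ["supabase", "database"],
  ["react", "frontend"],
  ["next.js", "frontend"],
  ["vue", "frontend"],
  ["vercel", "infra"],
  ["terraform", "infra"],
]);

// Naive substring matching, good enough to show the failure modes.
function lexiconHits(text) {
  const lower = text.toLowerCase();
  return [...LEXICON.keys()].filter((name) => lower.includes(name));
}

// The discarded detector: two or more tech names means "candidate".
function isLexiconCandidate(text) {
  return lexiconHits(text).length >= 2;
}

// The surviving use: derive tags once a decision has been detected some other way.
function deriveTags(text) {
  return [...new Set(lexiconHits(text).map((name) => LEXICON.get(name)))];
}
```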
3. First-person verb matching in isolation
The third approach targeted the commit signal directly: find user messages containing first-person future-tense or present-progressive verbs followed by a technology or approach name. "I'll use Postgres." "I'm going with React." "We're adopting Terraform."
The pattern seemed tight. It isn't.
The failure mode I didn't anticipate: engineers routinely use this voice when explaining things to the AI, not when committing to a choice. "I'll use the value from the previous step here." "I'm going to iterate over the array and filter by date." "We're using server-side rendering in this app." None of these are decisions — they're code narration. The user is describing what they're doing in code, not announcing what they've chosen.
The false positive rate for "first-person commit verb without prior decision-framing question" was above 70% on a test set of 40 conversations. The phrase shape is ambiguous without context. It's not a reliable standalone signal.
This failure led directly to the two-pass architecture. Commit phrases are only meaningful after a question pass has established that a decision thread is already in progress. A commit phrase on its own means nothing. A commit phrase within 6 messages of a question shape is signal. The detection path had to be sequential, not parallel.
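To see why the phrase shape is ambiguous, here is a hypothetical commit-verb pattern run against the six sentences quoted above. The regex is an assumption written for this illustration, not one of the extractor's patterns.

```javascript
// Hypothetical first-person commit pattern, for illustration only.
const FIRST_PERSON_COMMIT = /\b(?:I'll use|I'm going (?:with|to)|we're (?:adopting|using))\b/i;

const realCommits = [
  "I'll use Postgres.",
  "I'm going with React.",
  "We're adopting Terraform.",
];

const codeNarration = [
  "I'll use the value from the previous step here.",
  "I'm going to iterate over the array and filter by date.",
  "We're using server-side rendering in this app.",
];

// Every sentence matches: the pattern cannot separate commits from narration.
const matched = [...realCommits, ...codeNarration].filter((s) =>
  FIRST_PERSON_COMMIT.test(s)
);
```

All six sentences match, so the verb shape alone carries no information about whether a decision thread is open.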
4. Message-count filtering
The fourth approach was based on thread length. Hypothesis: real decisions happen in longer threads — the engineer and AI are exploring options, which takes multiple turns. A thread shorter than 8 messages is likely a quick lookup, not a decision thread.
This was approximately right empirically (long threads do contain more decisions) but wrong as a filter. The message-count distribution for real decisions has a long left tail: some of the cleanest, most decisive commits happen in 4-message threads. The engineer asks the question. The AI presents options. The engineer commits in 8 words. Done.
These short, decisive threads are actually the highest-value records — the ones where the engineer had already done the thinking and just needed to say it out loud. Filtering them out would miss exactly the records that are easiest to export and most reliable to act on.
The count filter was dropped. Thread length is not a quality signal. Very short threads can contain sharp, clean decisions. Very long threads often contain decisions that get revisited and reversed. Neither length marker is directionally reliable.
5. Question-answer adjacency
The fifth attempt was the most obvious-seeming heuristic and the most disappointing: find a user question (ends with "?") in one message, look for a response in the next message that includes "I'd" or "I'd recommend" or "the better choice is," and emit a decision record.
This produces a catastrophic false positive rate because most AI chat is Q&A. "How does React reconciliation work?" → "React uses a virtual DOM..." is a Q&A pair. "What's the difference between async and sync?" is a Q&A pair. The AI giving a recommendation in response to a question is common — but the recommendation only becomes a decision when the user commits to it, and Q&A adjacency has no signal on the user's response.
More subtly: this approach would miss decisions where the AI presents options without recommending one ("here are the tradeoffs: [option A] does X, [option B] does Y — depends on your requirements"), which are actually the highest-quality decision threads because the user is making a genuine choice rather than rubber-stamping an AI recommendation. Adjacency detection optimises for the wrong signal entirely.
What survived: the four pattern groups
After discarding all five of those approaches, the architecture that held up is the two-pass model described in patterns.md:
Pass 1: Question shapes
Eight patterns that catch the user framing a decision, all requiring an explicit named choice or an explicit request for a recommendation. The core patterns cover the VS construction (\b\w+\s+vs\.?\s+\w+\b), deliberation verbs (\b(?:torn between|deciding between|choosing between|weighing)\b), and direct picks (\b(?:should I|should we|shall I)\s+(?:pick|use|choose|go with|adopt)\b).
The rejection filters on question shapes do most of the precision work: questions inside code blocks are discarded (someone is quoting a comment that says "should I use X here?"), questions where both options are the same word are discarded (typos), and questions shorter than 8 characters are discarded (too ambiguous). These three filters together cut false positives by roughly half on the test set.
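A sketch of the question pass using the three patterns quoted above plus the three rejection filters. The case-insensitive flag, the code-stripping approach, and the function names are assumptions; the other five patterns are not reproduced in this post, so they are left out.

```javascript
// The three patterns quoted above (the /i flag is an assumption).
const QUESTION_SHAPES = [
  /\b\w+\s+vs\.?\s+\w+\b/i,
  /\b(?:torn between|deciding between|choosing between|weighing)\b/i,
  /\b(?:should I|should we|shall I)\s+(?:pick|use|choose|go with|adopt)\b/i,
];

// Assumption: code is stripped before matching, as one way to discard
// questions that only appear inside code blocks.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
const INLINE_CODE = /`[^`]*`/g;

function stripCode(text) {
  return text.replace(FENCED_BLOCK, " ").replace(INLINE_CODE, " ");
}

function matchesQuestionShape(text) {
  const prose = stripCode(text).trim();
  if (prose.length < 8) return false; // too short to be unambiguous
  const vs = prose.match(/\b(\w+)\s+vs\.?\s+(\w+)\b/i);
  // Same word on both sides of "vs" is treated as a typo.
  if (vs && vs[1].toLowerCase() === vs[2].toLowerCase()) return false;
  return QUESTION_SHAPES.some((re) => re.test(prose));
}
```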
Pass 2: Commit phrases
Five patterns that catch the user closing the decision. The critical design decision here is user-only: commit phrases from the assistant role are explicitly ignored. An assistant saying "I'd go with Postgres" is a recommendation. The user saying "ok I'll go with Postgres" is a decision. These are not the same event and the extractor treats them differently.
The 6-message window is the other design choice. After the question shape fires, the extractor searches the next 6 messages for a user commit. If no commit appears within that window, the thread is logged at low confidence and hidden by default. This cuts the false-positive tail from long free-ranging conversations that start as decision discussions and drift into something else entirely — which is very common.
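The window and the user-only rule can be sketched as follows. The commit regex is a stand-in (the five real patterns are not listed in this post), and the message shape `{ role, text }` is an assumption about how a parsed export might look.

```javascript
// Stand-in commit pattern; the extractor's five real patterns are not shown here.
const COMMIT_PHRASE = /\b(?:I'll go with|I'm going with|let's go with|we'll go with)\b/i;
const WINDOW = 6;

// Returns the index of the first user commit within WINDOW messages
// after the question, or -1 if there is none.
function findUserCommit(messages, questionIndex) {
  const end = Math.min(messages.length, questionIndex + 1 + WINDOW);
  for (let i = questionIndex + 1; i < end; i++) {
    const m = messages[i];
    // Assistant messages count toward the window but can never commit.
    if (m.role === "user" && COMMIT_PHRASE.test(m.text)) return i;
  }
  return -1;
}

const thread = [
  { role: "user", text: "Should I use Postgres or MongoDB?" },
  { role: "assistant", text: "I'd go with Postgres for relational data." },
  { role: "user", text: "ok I'll go with Postgres" },
];
const commitIndex = findUserCommit(thread, 0);
```

Here `commitIndex` is 2: the assistant's recommendation at index 1 is skipped and the user's reply closes the decision.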
Trade-off markers: the confidence bump
A record with both a question-pass match and a commit-pass match is emitted at confidence: medium. If the thread also contains two or more trade-off markers ("on the one hand," "tradeoffs," "upside," "drawback"), confidence bumps to high. High-confidence records are almost never false positives in testing. Medium-confidence records have a false positive rate around 8%, which is acceptable for a default output list.
The trade-off markers also serve as a quality signal beyond binary detection. A high-confidence record is one where the engineer explicitly worked through competing considerations — the exact scenario where the rationale is most worth preserving. A future engineer asking "why did we pick Postgres" gets the most value from a record where the alternatives were named and the tradeoffs were articulated, not from a record where the answer was "yeah sure."
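The confidence rules can be sketched as a small function. Counting total marker occurrences across the thread (rather than distinct markers) is an assumption, as are the function names and the plural and hyphen variants in the regex.

```javascript
// The four markers named above; plural and hyphen variants are assumptions.
const TRADEOFF_MARKER = /\b(?:on the one hand|trade-?offs?|upsides?|drawbacks?)\b/gi;

// Assumption: count every occurrence across all messages in the thread.
function countTradeoffMarkers(messages) {
  return messages.reduce(
    (n, m) => n + (m.text.match(TRADEOFF_MARKER) || []).length,
    0
  );
}

function scoreConfidence({ hasQuestion, hasUserCommit, messages }) {
  if (!hasQuestion) return null;    // nothing to score
  if (!hasUserCommit) return "low"; // logged, hidden by default
  return countTradeoffMarkers(messages) >= 2 ? "high" : "medium";
}
```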
Reversal markers: the disqualifier
The reversal detection system was the last piece and the one I'm least satisfied with. If a commit phrase is followed within 2 messages by a reversal — "actually no," "scratch that," "changed my mind" — the record is dropped.
This catches the obvious case: the engineer commits, then immediately walks it back. What it misses is the less obvious case: the engineer commits in message 4, reverses in message 8 (outside the 2-message window), then re-commits to a different option in message 12. The current implementation would emit a record for the first commit (message 4) without the reversal context, which is misleading.
This is the known miss I think about most. The fix requires tracking the full decision lifecycle — not just the first commit — and that's a materially more complex state machine than v1 implements. It's on the v2 roadmap.
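A sketch of the 2-message reversal check, including the known miss. Restricting reversals to user messages and the `{ role, text }` message shape are assumptions made for this illustration.

```javascript
// The three reversal phrases named above.
const REVERSAL = /\b(?:actually no|scratch that|changed my mind)\b/i;
const REVERSAL_WINDOW = 2;

// True if the user walks the commit back within REVERSAL_WINDOW messages.
// Assumption: only user messages can reverse a user commit.
function isReversed(messages, commitIndex) {
  const end = Math.min(messages.length, commitIndex + 1 + REVERSAL_WINDOW);
  for (let i = commitIndex + 1; i < end; i++) {
    if (messages[i].role === "user" && REVERSAL.test(messages[i].text)) return true;
  }
  return false;
}

// Immediate walk-back: caught.
const quick = [
  { role: "user", text: "ok I'll go with Postgres" },
  { role: "assistant", text: "Great, here is a schema." },
  { role: "user", text: "actually no, scratch that" },
];

// Late walk-back, four messages after the commit: missed by this check.
const slow = [
  { role: "user", text: "ok I'll go with Postgres" },
  { role: "assistant", text: "Great, here is a schema." },
  { role: "user", text: "How do I add an index?" },
  { role: "assistant", text: "Use CREATE INDEX." },
  { role: "user", text: "Hmm, I changed my mind about Postgres." },
];
```

The second thread is the failure described above: the first commit would still be emitted because the reversal falls outside the window.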
What the failure modes reveal
Each of the five discarded approaches failed for the same structural reason: they tried to detect decision intent from a single message or a single signal, without context about the surrounding thread. Length, tech names, verb forms, thread length, Q&A adjacency — all of these are weak signals in isolation. None are reliable standalone classifiers of decision intent.
The two-pass architecture works because it treats a decision as an event with structure — a question that initiates a thread, followed by a user who settles it. Both halves must be present. Either half alone is noise. This mirrors how decisions actually happen in chat: the engineer doesn't just announce a choice out of nowhere; they ask, explore, and commit. The extractor is tracking the lifecycle, not just the outcome.
The deeper lesson is about what chat transcripts are not. They're not a sequence of propositions where each message is independently meaningful. They're a dialog with conversational structure — questions that reference context from earlier, responses that assume the question is still open, commits that close threads started several messages ago. Pattern matching on individual messages, without modeling the thread structure, will always underperform. The two-pass architecture is a minimal thread model. It's not sophisticated, but it's the right shape.
If you want to read the patterns end-to-end — the full regex list, the rejection filters, the tag derivation lexicon, the known misses — they're all in patterns.md in the extractor source. The source is MIT-licensed and about 500 lines. The operational guide for running it against a ChatGPT export has the step-by-step walkthrough if you want to try it on your own history.
The five things v1 doesn't catch
Documented openly in patterns.md, because hiding the misses would make the tool less trustworthy, not more:
- Multi-turn decisions that span 20+ messages. The commit-phrase window is 6 messages. Long deliberative threads, the kind where the engineer spends 15 messages working through a technical comparison, will be missed if the commit comes late. Raising the window to 20 increases false positives more than it increases true positive recall, which is why it stays at 6.
- Implicit decisions. "Ok let me scaffold this" contains no commit phrase. "Alright, let's go ahead" is too vague to capture with a named option. These are common in practice and completely invisible to the current patterns.
- Decisions phrased as statements. "I think we should use Postgres, the team already knows it" followed by "yeah good call" is a decision that never triggers a question shape. The engineer is expressing a preference, not asking a question. The two-pass architecture requires a question shape to anchor the thread.
- Non-English transcripts. All patterns are English-only. An engineer working in Spanish, German, Japanese, or a mixed-language session gets no coverage. This is a meaningful gap for international engineering teams and a known v2 target.
- Code-as-options decisions. When both options are shell commands or code fragments, the option extractor's noun-detection logic struggles. "Should I run docker compose up --build or docker build . && docker run?" is a real decision question where both options are syntactically identical to code expressions.
If your export has these patterns, the --sensitivity=high flag surfaces low-confidence records that might catch some of them. You'll need to triage more aggressively — at high sensitivity, roughly 30–40% of output is false positive — but it's better than nothing for archives where the high-value decisions live in long threads or implicit commits.
Why 3.1% is the right hit rate
The 1,200-conversation walkthrough produced 38 records at default sensitivity — a 3.1% hit rate. The reaction I've heard most often is "that seems low." It's not.
Most AI chat is not decision-making. A 1,200-conversation archive contains debugging sessions, concept explanations, code generation, writing drafts, recipe lookups, travel planning, joke requests. Even the engineering sessions are mostly not decisions — they're questions that get answered, problems that get solved, code that gets reviewed. Decisions are a specific subset of conversations with a specific structure: a choice between named options, worked through in dialog, committed to by a human. That's maybe 3–5% of the archive.
The right test isn't "did the extractor find 38 of how many decisions were really there?" — that requires a ground truth I don't have. The right test is "are the 38 records it found real decisions?" In the walkthrough post, three were discarded as false positives in the 15-minute triage pass. Thirty-five were real. That's a 92% precision rate on the default output. For a v1 with no ML and no cloud dependency, that's the floor I was aiming for.
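The precision figure follows directly from the triage counts; the short calculation below reproduces it. Variable names are illustrative.

```javascript
// Counts from the walkthrough triage described above.
const emitted = 38;
const falsePositives = 3;
const real = emitted - falsePositives; // 35

const precision = real / emitted;                    // 35 / 38
const falsePositiveShare = falsePositives / emitted; // 3 / 38

console.log((precision * 100).toFixed(1) + "% precision");
console.log((falsePositiveShare * 100).toFixed(1) + "% false positives");
```

This prints 92.1% precision and 7.9% false positives, which lines up with the roughly 8% false positive rate quoted for medium-confidence records.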
Precision over recall is a deliberate choice, not a limitation. The user experience of opening the extractor output and seeing 38 trustworthy records is categorically different from opening it and seeing 300 candidates with a coin-flip precision rate. A short list you can act on beats a long list you have to filter.
Try it on your own export. The open-source extractor is ~500 lines of dependency-free Node, MIT-licensed, no network calls. Read the source in your browser, install with curl, run against your ChatGPT export or Claude export. The patterns.md file in the source lists every active heuristic and every known miss. Or join the waitlist for the hosted version with team sharing, search, and Notion / Linear export.
Related questions
Why didn't you use an LLM to classify decisions instead of regex?
Cost and offline operation. An average ChatGPT export has 1,200–3,000 conversations with 8–30 messages each. If every message is classified with its own LLM API call, that's roughly 10,000–90,000 calls per export run at v1 pricing, which is neither free nor private. The extractor's core promise is that it runs fully locally with no network calls. You can grep the source for 'fetch' or 'http' and find nothing. An offline regex pass pursues the same classification objective at zero cost and zero data exposure. The tradeoff is recall: the regex pass misses implicit and non-English decisions that an LLM would catch. That's the honest version of precision-over-recall: it's not that LLMs would do a worse job, it's that the offline constraint makes them the wrong tool for v1.
How do you handle false positives — chat messages that look like decisions but aren't?
The two-pass architecture is the main false-positive filter. A question pass match alone, without a matching commit phrase from the user within 6 messages, produces a 'low confidence' record that is hidden by default. Only when both a question shape AND a user commit phrase match does the record surface at medium confidence. Trade-off markers bump it to high. This means a chat that says 'should I use Postgres or Mongo?' and then trails off into something unrelated doesn't produce a record in the default output, because the user never committed. The commit-from-user-only rule is the other major filter: the assistant saying 'I'd go with Postgres' is a recommendation, not a decision. The extractor only fires on the user's voice.
What is the 6-message window and why 6?
After detecting a question shape in a user message, the extractor looks at the next 6 messages (counting both user and assistant turns) for a user commit phrase. If no commit is found within that window, the question is logged as low-confidence and hidden by default. Six was chosen empirically from a sample of 40 real decision threads: the median distance from 'should I use X or Y' to 'ok I'll go with X' was 3 messages; the 95th percentile was 5. The 6-message window therefore covers at least 95% of the sampled cases while keeping the false-positive rate on long free-ranging threads acceptably low. Longer windows (10, 20 messages) captured more true decisions but also more decisions-that-got-reversed-and-then-re-decided, which confused the output.
Can I contribute patterns or report missed decisions?
Yes. The patterns.md file in the extractor source is the canonical list of active and known-miss patterns. If you run the extractor and find a genuine decision it missed, open an issue at whychose.com/extractor with a redacted snippet showing the decision shape. The most useful submissions include: the rough shape of the question (e.g., 'implicit decision phrased as a statement followed by agreement'), a redacted example, and whether you'd want this captured at --sensitivity=medium or only at --sensitivity=high. Precision-over-recall is a deliberate v1 choice — the goal is that every record in the default output is a real decision — but the sensitivity flag exists exactly for teams that prefer higher recall.
Further reading
- The open-source extractor — read the source in your browser, install with curl, run locally. patterns.md has the full pattern list and known misses.
- How to extract decisions from ChatGPT history — the operational step-by-step guide for running the extractor against a ChatGPT export.
- How to export your ChatGPT history — getting the conversations.json from the ChatGPT settings panel, with format notes.
- How to export Claude conversations — the Claude-side export workflow and the differences in the JSON schema.
- From 1,200 ChatGPT chats to 38 durable decisions — the dogfooding walkthrough: real numbers, real findings, honest false-negative discussion.
- ADR vs decision log vs RFC: when to use each — what to do with the extracted records once you have them, and which format fits which decision type.
- The MADR 4.0 spec in 15 minutes — the extraction output format that the extractor uses as its default Markdown output, and why MADR maps naturally onto AI chat reasoning structure.
- A worked ADR example — what a promoted extraction record looks like once it's been raised to full ADR format.