Blog · 2026-06-05 · ~10 min read

I built an open-source tool to extract decisions from ChatGPT/Claude. Here's every regex I used — and every one I had to throw out.

The WhyChose extractor is about 500 lines of dependency-free Node. It takes a ChatGPT conversations.json or a Claude export and emits structured decision records — the "I chose X over Y because Z" moments buried in years of chat history. This is the story of building it: five heuristics that went in the bin, four that survived, and what each failure mode revealed about how engineers actually think with AI.

The short version

The extractor uses a two-pass architecture: a question pass that finds user messages matching decision-initiation patterns, then a commit pass that looks for a user commit phrase within the next 6 messages. Trade-off markers bump confidence; reversal markers disqualify the record. Five other approaches were tried first (sentence-length thresholds, named-entity recognition, first-person-verb matching, message-count filtering, and question-answer adjacency detection), and all five were discarded, most for unacceptable false-positive rates and the rest for filtering out real decisions. The post explains each failure and why the two-pass shape is the right floor for v1.

What the extractor is actually doing

The job is simpler than it sounds. A decision-in-chat looks like this: the user asks a question that frames a choice between named options, the conversation explores the options, the user commits to one. The output for that thread should be a record with five fields: the question, the chosen option, the alternatives that were named, a snippet of the rationale, and a confidence level.
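As a rough illustration, a record with those five fields might look like the object below. The field names, the helper function, and the example values are assumptions made for this sketch; the extractor's real schema is whatever its source defines.

```javascript
// Hypothetical shape of one decision record. Field names are illustrative,
// not taken from the extractor source.
function makeDecisionRecord({ question, chosen, alternatives, rationale, confidence }) {
  const allowed = ["low", "medium", "high"];
  if (!allowed.includes(confidence)) {
    throw new Error(`confidence must be one of: ${allowed.join(", ")}`);
  }
  return {
    question,                        // the user message that framed the choice
    chosen,                          // the option the user committed to
    alternatives: [...alternatives], // the other options that were named
    rationale,                       // short snippet of the stated reasoning
    confidence,                      // "low" | "medium" | "high"
  };
}

const example = makeDecisionRecord({
  question: "Should I use Postgres or MongoDB for this?",
  chosen: "Postgres",
  alternatives: ["MongoDB"],
  rationale: "the team already knows it",
  confidence: "medium",
});
console.log(JSON.stringify(example, null, 2));
```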

That's the whole problem. The hard part isn't the data model — it's distinguishing decision threads from the sea of other things engineers use AI for: debugging, explaining concepts, writing boilerplate, reviewing code, asking for definitions. In a 1,200-conversation export, the real hit rate is around 3%. The remaining 97% needs to be discarded cleanly, without requiring the user to read a false-positive list longer than the real output.

I considered two approaches: train a classifier (ML, probably a small fine-tuned model), or write deterministic pattern rules (NLP lite, no dependencies). The classifier approach would have better recall — it would catch implicit decisions, non-English threads, and the slow-burn consensus that develops over 30 messages. But the product constraint ruled it out immediately: the extractor's core promise is that it runs fully offline with no network calls. Export your conversations.json. Run node bin/extractor.js conversations.json. Nothing leaves your machine. A classifier that requires a model download, inference time proportional to export size, and a GPU or significant CPU would break the "five-minute local run" promise. So: deterministic patterns, conservative by design, known misses documented openly in patterns.md.

The five approaches that failed

1. Sentence-length threshold

The first thing I tried was the simplest possible heuristic: count the tokens in a user message. Hypothesis: decision-framing messages are longer than average — an engineer asking "should I use Postgres or MongoDB" and explaining their constraints writes more than one asking "how do I sort a list."

I set the threshold at 15 tokens (roughly two short sentences) and ran it against a sample of 60 conversations, each of which I'd manually tagged as containing a real decision or not.

The recall was fine: 87% of real decision threads had at least one user message above 15 tokens. But the precision was catastrophic. Every long debugging session, every "can you explain X in detail," every code review prompt: all above threshold, all feeding false positives into the output. In the 60-conversation sample, the 15-token filter produced 41 "candidate threads." Fourteen contained real decisions. Twenty-seven were noise. That's roughly two false positives for every real hit, a precision of about 34%.
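A minimal sketch of the length heuristic and the arithmetic behind that result. Whitespace tokenisation and the function names are assumptions for illustration; the post does not say how tokens were counted. The 41 and 14 are the counts quoted above.

```javascript
// Assumption: a "token" is a whitespace-separated word.
function tokenCount(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical helper: true when a message is above the threshold.
function isLongMessage(text, minTokens = 15) {
  return tokenCount(text) > minTokens;
}

// A short lookup stays below the threshold...
console.log(isLongMessage("how do I sort a list")); // false

// ...but a long explanation request clears it with no decision in sight.
console.log(isLongMessage(
  "Can you explain in detail why my React component re-renders every time " +
  "the parent state changes even when the props are identical"
)); // true

// Precision on the 60-conversation sample, using the counts from the post.
const candidateThreads = 41;
const realDecisions = 14;
const precision = realDecisions / candidateThreads;
const noiseShare = 1 - precision;
console.log(precision.toFixed(2));  // 0.34
console.log(noiseShare.toFixed(2)); // 0.66
```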

The deeper problem is that length and decision intent are only weakly correlated, and only in one direction: decision-framing messages tend to be long, but most long messages aren't decisions. Discarding this heuristic was the right call, and it happened after one afternoon of testing.

2. Named-entity recognition for technology terms

The second approach was more plausible: decisions tend to name technologies, frameworks, or services. "Should I use React or Vue" contains two named tech entities. "How do I sort a list" contains none.

I built a lexicon of ~300 technology names — databases, cloud services, languages, frameworks, libraries — and tagged messages that contained two or more names from the lexicon as candidate decision threads.

This worked better on the obvious cases: "Postgres vs MongoDB," "React vs Svelte," "Terraform vs Pulumi" all fired correctly. But the failure modes were worse than with length:

Stack-listing false positives. "Can you help me with a React + Next.js app that uses Postgres on Supabase and deploys to Vercel?" contains five lexicon entries and zero decision intent. It's a project description, not a comparison.
The lexicon maintenance problem. The tech ecosystem moves fast. Within two weeks of freezing the lexicon, I'd already missed Bun, Drizzle, Turso, and Astro as they became common in chat sessions. A lexicon-based approach requires constant updates or it silently degrades.
Generic decisions aren't technology comparisons. "Should we hire a senior engineer or two mid-level ones?" is a real decision that doesn't contain a single tech name. Neither does "should I launch the beta now or wait until the auth is solid?" The lexicon approach has a structural blind spot on hiring, product, and timing decisions — which are often the highest-stakes ones.
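The first and third failure modes are easy to reproduce with a toy version of the detector. The ten-entry lexicon below is a stand-in for the ~300-entry list, and the function names are hypothetical.

```javascript
// A tiny stand-in for the real lexicon; these entries are examples only.
const LEXICON = [
  "react", "vue", "svelte", "next.js", "postgres",
  "mongodb", "supabase", "vercel", "terraform", "pulumi",
];

// Return every lexicon entry that appears as a whole term in the text.
function lexiconHits(text) {
  const lower = text.toLowerCase();
  return LEXICON.filter((name) => {
    const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Require non-alphanumeric boundaries so "react" does not match "reaction".
    return new RegExp(`(^|[^a-z0-9])${escaped}($|[^a-z0-9])`).test(lower);
  });
}

// The discarded rule: two or more tech names means "candidate decision".
function looksLikeDecisionByLexicon(text) {
  return lexiconHits(text).length >= 2;
}

// Fires correctly on an obvious comparison.
console.log(looksLikeDecisionByLexicon("Should I use React or Vue?")); // true

// Also fires on a project description with zero decision intent.
console.log(looksLikeDecisionByLexicon(
  "Can you help me with a React + Next.js app that uses Postgres on Supabase and deploys to Vercel?"
)); // true

// And misses a real, high-stakes decision that names no technology.
console.log(looksLikeDecisionByLexicon(
  "Should we hire a senior engineer or two mid-level ones?"
)); // false
```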

I kept the tech-name list as a tag derivation layer (it's used in patterns.md to bucket decisions into tags like database, frontend, infra) but removed it from the detection path entirely. Tag inference after detection is fine. Detection gated on tech terms is too narrow.

3. First-person verb matching in isolation

The third approach targeted the commit signal directly: find user messages containing first-person future-tense or present-progressive verbs followed by a technology or approach name. "I'll use Postgres." "I'm going with React." "We're adopting Terraform."

The pattern seemed tight. It wasn't.

The failure mode I didn't anticipate: engineers routinely use this voice when explaining things to the AI, not when committing to a choice. "I'll use the value from the previous step here." "I'm going to iterate over the array and filter by date." "We're using server-side rendering in this app." None of these are decisions — they're code narration. The user is describing what they're doing in code, not announcing what they've chosen.

The false positive rate for "first-person commit verb without prior decision-framing question" was above 70% on a test set of 40 conversations. The phrase shape is ambiguous without context. It's not a reliable standalone signal.
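The ambiguity shows up as soon as a commit-verb pattern is run without any surrounding context. The regex below is an illustrative stand-in, not one of the extractor's real patterns; the sample sentences are the ones quoted above.

```javascript
// Illustrative pattern only; the real commit regexes live in patterns.md.
const FIRST_PERSON_COMMIT =
  /\b(?:I'll use|I'm going (?:with|to)|we're (?:adopting|using))\b/i;

const samples = [
  "I'll use Postgres.",                                      // a real commit
  "I'll use the value from the previous step here.",         // code narration
  "I'm going to iterate over the array and filter by date.", // code narration
  "We're using server-side rendering in this app.",          // project description
];

for (const text of samples) {
  console.log(FIRST_PERSON_COMMIT.test(text), text);
}
// Every line prints true: the phrase shape alone cannot separate
// the one commit from the three pieces of narration.
```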

This failure led directly to the two-pass architecture. Commit phrases are only meaningful after a question pass has established that a decision thread is already in progress. A commit phrase on its own means nothing. A commit phrase within 6 messages of a question shape is signal. The detection path had to be sequential, not parallel.

4. Message-count filtering

The fourth approach was based on thread length. Hypothesis: real decisions happen in longer threads — the engineer and AI are exploring options, which takes multiple turns. A thread shorter than 8 messages is likely a quick lookup, not a decision thread.

This was approximately right empirically (long threads do contain more decisions) but wrong as a filter. The message-count distribution for real decisions has a long left tail: some of the cleanest, most decisive commits happen in 4-message threads. The engineer asks the question. The AI presents options. The engineer commits in 8 words. Done.

These short, decisive threads are actually the highest-value records — the ones where the engineer had already done the thinking and just needed to say it out loud. Filtering them out would miss exactly the records that are easiest to export and most reliable to act on.

The count filter was dropped. Thread length is not a quality signal. Very short threads can contain sharp, clean decisions. Very long threads often contain decisions that get revisited and reversed. Neither length marker is directionally reliable.

5. Question-answer adjacency

The fifth attempt was the most obvious-seeming heuristic and the most disappointing: find a user question (ends with "?") in one message, look for a response in the next message that includes "I'd" or "I'd recommend" or "the better choice is," and emit a decision record.

This produces a catastrophic false positive rate because most AI chat is Q&A. "How does React reconciliation work?" → "React uses a virtual DOM..." is a Q&A pair. "What's the difference between async and sync?" is a Q&A pair. The AI giving a recommendation in response to a question is common — but the recommendation only becomes a decision when the user commits to it, and Q&A adjacency has no signal on the user's response.
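A small sketch of the adjacency rule makes the gap visible: nothing in it ever looks at what the user says next. The message shape (`role`, `text`), the function name, and the assistant reply in the example are assumptions for illustration.

```javascript
// Illustrative recommendation markers, taken from the phrases quoted above.
const RECOMMENDATION = /\b(?:I'd|the better choice is)\b/i;

// Return the index of every user question that is immediately followed
// by an assistant message containing a recommendation marker.
function adjacencyCandidates(messages) {
  const hits = [];
  for (let i = 0; i + 1 < messages.length; i++) {
    const question = messages[i];
    const answer = messages[i + 1];
    if (
      question.role === "user" &&
      question.text.trim().endsWith("?") &&
      answer.role === "assistant" &&
      RECOMMENDATION.test(answer.text)
    ) {
      hits.push(i);
    }
  }
  return hits;
}

const qaThread = [
  { role: "user", text: "What's the difference between async and sync?" },
  { role: "assistant", text: "Sync code blocks until it finishes. For I/O, I'd reach for async." },
  { role: "user", text: "Got it, thanks." },
];

// The rule fires even though the user never chose anything.
console.log(adjacencyCandidates(qaThread)); // [ 0 ]
```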

More subtly: this approach would miss decisions where the AI presents options without recommending one ("here are the tradeoffs: [option A] does X, [option B] does Y — depends on your requirements"), which are actually the highest-quality decision threads because the user is making a genuine choice rather than rubber-stamping an AI recommendation. Adjacency detection optimises for the wrong signal entirely.

What survived: the four pattern groups

After discarding all five of those approaches, the architecture that held up is the two-pass model described in patterns.md:

Pass 1: Question shapes

The rejection filters on question shapes do most of the precision work: questions inside code blocks are discarded (someone is quoting a comment that says "should I use X here?"), questions where both options are the same word are discarded (typos), and questions shorter than 8 characters are discarded (too ambiguous). These three filters together cut false positives by roughly half on the test set.
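The three filters can be sketched as below. The question regex is a single hypothetical shape invented for this example; the real shapes are in patterns.md. Only the three rejection rules themselves come from the description above.

```javascript
// Build the triple-backtick fence programmatically so this example
// can itself sit inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
const INLINE_CODE = /`[^`\n]*`/g;

// Filter 1: drop code so questions quoted inside it are never considered.
function stripCode(text) {
  return text.replace(FENCED_BLOCK, " ").replace(INLINE_CODE, " ");
}

// Hypothetical question shape: "should I/we use X or Y".
const QUESTION_SHAPE =
  /\bshould (?:i|we) (?:use|pick|choose|go with) ([\w+#-]+(?:\.\w+)*) or ([\w+#-]+(?:\.\w+)*)/i;

function matchQuestion(text) {
  const prose = stripCode(text).trim();
  if (prose.length < 8) return null;                    // filter 3: too short to be meaningful
  const match = QUESTION_SHAPE.exec(prose);
  if (!match) return null;
  const [, first, second] = match;
  if (first.toLowerCase() === second.toLowerCase()) {   // filter 2: "X or X" is a typo
    return null;
  }
  return { options: [first, second] };
}

console.log(matchQuestion("Should I use Postgres or MongoDB?"));
// { options: [ 'Postgres', 'MongoDB' ] }

const quoted = "Here is my file:\n" + FENCE + "js\n// should I use map or filter here?\n" + FENCE;
console.log(matchQuestion(quoted));                         // null (question is inside code)
console.log(matchQuestion("Should I use Redis or redis?")); // null (same option twice)
console.log(matchQuestion("X or Y?"));                      // null (shorter than 8 characters)
```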

Pass 2: Commit phrases

Five patterns that catch the user closing the decision. The critical design decision here is user-only: commit phrases from the assistant role are explicitly ignored. An assistant saying "I'd go with Postgres" is a recommendation. The user saying "ok I'll go with Postgres" is a decision. These are not the same event and the extractor treats them differently.

The 6-message window is the other design choice. After the question shape fires, the extractor searches the next 6 messages for a user commit. If no commit appears within that window, the thread is logged at low confidence and hidden by default. This cuts the false-positive tail from long free-ranging conversations that start as decision discussions and drift into something else entirely — which is very common.
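Both design choices, user-only matching and the 6-message window, fit in a few lines. The commit regex here is an illustrative guess at what such patterns could look like, and the message shape (`role`, `text`) is an assumption.

```javascript
// Hypothetical commit phrases; the extractor's actual five patterns are in patterns.md.
const COMMIT =
  /\b(?:i'll go with|let's go with|going with|i've decided on|we'll use)\s+([\w+#-]+(?:\.\w+)*)/i;
const COMMIT_WINDOW = 6;

// Search the COMMIT_WINDOW messages after the question for a user commit.
function findCommit(messages, questionIndex) {
  const end = Math.min(messages.length, questionIndex + 1 + COMMIT_WINDOW);
  for (let i = questionIndex + 1; i < end; i++) {
    const message = messages[i];
    if (message.role !== "user") continue; // assistant phrasing is a recommendation, not a decision
    const match = COMMIT.exec(message.text);
    if (match) return { index: i, chosen: match[1] };
  }
  return null; // no commit in the window: low confidence, hidden by default
}

const thread = [
  { role: "user", text: "Should I use Postgres or MongoDB?" },
  { role: "assistant", text: "Going with Postgres is the safer default for relational data." },
  { role: "user", text: "ok I'll go with Postgres" },
];

// The assistant's "Going with Postgres" is skipped; the user's commit is found.
console.log(findCommit(thread, 0)); // { index: 2, chosen: 'Postgres' }
```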

Trade-off markers: the confidence bump

A record that contains both a question pass and a commit pass fires at confidence: medium. If the thread also contains two or more trade-off markers — "on the one hand," "tradeoffs," "upside," "drawback" — confidence bumps to high. High-confidence records are almost never false positives in testing. Medium-confidence records have a false positive rate around 8%, which is acceptable for a default output list.
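The scoring rule reduces to a small function. This sketch assumes "two or more markers" means two distinct markers anywhere in the thread, and it uses only the four markers quoted above; both points are assumptions rather than confirmed behaviour.

```javascript
// The four markers quoted in the post; the full list may be longer.
const TRADEOFF_MARKERS = ["on the one hand", "tradeoffs", "upside", "drawback"];

// Assumption: count distinct markers across the whole thread.
function countTradeoffMarkers(messages) {
  const text = messages.map((message) => message.text.toLowerCase()).join("\n");
  return TRADEOFF_MARKERS.filter((marker) => text.includes(marker)).length;
}

function scoreConfidence({ hasQuestion, hasCommit, messages }) {
  if (!hasQuestion) return null;  // no question shape, no record at all
  if (!hasCommit) return "low";   // logged, but hidden by default
  return countTradeoffMarkers(messages) >= 2 ? "high" : "medium";
}

const deliberated = [
  { role: "user", text: "Should I use Postgres or MongoDB?" },
  { role: "assistant", text: "The upside of Postgres is relational integrity; the main drawback is a stricter schema." },
  { role: "user", text: "ok I'll go with Postgres" },
];

console.log(scoreConfidence({ hasQuestion: true, hasCommit: true, messages: deliberated }));  // high
console.log(scoreConfidence({ hasQuestion: true, hasCommit: true, messages: [deliberated[0], deliberated[2]] })); // medium
console.log(scoreConfidence({ hasQuestion: true, hasCommit: false, messages: deliberated })); // low
```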

The trade-off markers also serve as a quality signal beyond binary detection. A high-confidence record is one where the engineer explicitly worked through competing considerations — the exact scenario where the rationale is most worth preserving. A future engineer asking "why did we pick Postgres" gets the most value from a record where the alternatives were named and the tradeoffs were articulated, not from a record where the answer was "yeah sure."

Reversal markers: the disqualifier

The reversal detection system was the last piece and the one I'm least satisfied with. If a commit phrase is followed within 2 messages by a reversal — "actually no," "scratch that," "changed my mind" — the record is dropped.

This catches the obvious case: the engineer commits, then immediately walks it back. What it misses is the less obvious case: the engineer commits in message 4, reverses in message 8 (outside the 2-message window), then re-commits to a different option in message 12. The current implementation would emit a record for the first commit (message 4) without the reversal context, which is misleading.
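Both the caught case and the known miss can be reproduced with a short sketch. The regex uses the three phrases quoted above; restricting the check to user messages is an assumption, as are the function name and the example threads.

```javascript
const REVERSAL = /\b(?:actually no|scratch that|changed my mind)\b/i;
const REVERSAL_WINDOW = 2;

// True when a reversal phrase appears in the REVERSAL_WINDOW messages after the commit.
// Assumption: only user messages count, mirroring the user-only commit rule.
function isReversed(messages, commitIndex) {
  const end = Math.min(messages.length, commitIndex + 1 + REVERSAL_WINDOW);
  for (let i = commitIndex + 1; i < end; i++) {
    if (messages[i].role === "user" && REVERSAL.test(messages[i].text)) return true;
  }
  return false;
}

// Immediate walk-back: caught, so the record is dropped.
const quickReversal = [
  { role: "user", text: "ok I'll go with Postgres" },
  { role: "assistant", text: "Sounds good." },
  { role: "user", text: "Actually no, scratch that." },
];
console.log(isReversed(quickReversal, 0)); // true

// Late walk-back: outside the window, so the stale commit would still be emitted.
const lateReversal = [
  { role: "user", text: "ok I'll go with Postgres" },
  { role: "assistant", text: "Sounds good." },
  { role: "user", text: "How do I set up migrations?" },
  { role: "assistant", text: "Start with a migrations tool." },
  { role: "user", text: "Hmm, changed my mind about Postgres." },
];
console.log(isReversed(lateReversal, 0)); // false
```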

This is the known miss I think about most. The fix requires tracking the full decision lifecycle — not just the first commit — and that's a materially more complex state machine than v1 implements. It's on the v2 roadmap.

What the failure modes reveal

Each of the five discarded approaches failed for the same structural reason: they tried to detect decision intent from a single message or a single signal, without context about the surrounding thread. Message length, tech names, verb forms, thread length, Q&A adjacency: all of these are weak signals in isolation. None is a reliable standalone classifier of decision intent.

The two-pass architecture works because it treats a decision as an event with structure — a question that initiates a thread, followed by a user who settles it. Both halves must be present. Either half alone is noise. This mirrors how decisions actually happen in chat: the engineer doesn't just announce a choice out of nowhere; they ask, explore, and commit. The extractor is tracking the lifecycle, not just the outcome.

The deeper lesson is about what chat transcripts are not. They're not a sequence of propositions where each message is independently meaningful. They're a dialog with conversational structure — questions that reference context from earlier, responses that assume the question is still open, commits that close threads started several messages ago. Pattern matching on individual messages, without modeling the thread structure, will always underperform. The two-pass architecture is a minimal thread model. It's not sophisticated, but it's the right shape.

If you want to read the patterns end-to-end — the full regex list, the rejection filters, the tag derivation lexicon, the known misses — they're all in patterns.md in the extractor source. The source is MIT-licensed and about 500 lines. The operational guide for running it against a ChatGPT export has the step-by-step walkthrough if you want to try it on your own history.

The five things v1 doesn't catch

Documented openly in patterns.md, because hiding the misses would make the tool less trustworthy, not more:

Multi-turn decisions that span 20+ messages. The commit-phrase window is 6 messages. Long deliberative threads, the kind where the engineer spends 15 messages working through a technical comparison, are missed if the commit comes late. Raising the window to 20 increases false positives more than it increases true-positive recall, which is why it stays at 6.
Implicit decisions. "Ok let me scaffold this" contains no commit phrase. "Alright, let's go ahead" is too vague to capture with a named option. These are common in practice and completely invisible to the current patterns.
Decisions phrased as statements. "I think we should use Postgres, the team already knows it" followed by "yeah good call" is a decision that never triggers a question shape. The engineer is expressing a preference, not asking a question. The two-pass architecture requires a question shape to anchor the thread.
Non-English transcripts. All patterns are English-only. An engineer working in Spanish, German, Japanese, or a mixed-language session gets no coverage. This is a meaningful gap for international engineering teams and a known v2 target.
Code-as-options decisions. When both options are shell commands or code fragments, the option extractor's noun-detection logic struggles. "Should I run docker compose up --build or docker build . && docker run?" is a real decision question in which both options are themselves code expressions rather than nouns.

If your export has these patterns, the --sensitivity=high flag surfaces low-confidence records that might catch some of them. You'll need to triage more aggressively — at high sensitivity, roughly 30–40% of output is false positive — but it's better than nothing for archives where the high-value decisions live in long threads or implicit commits.
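Conceptually, the flag only changes which confidence levels reach the output list. The sketch below is a guess at that behaviour based on the description in this post; the function name, the "default" value, and the record shape are hypothetical.

```javascript
// Hypothetical helper: which confidence levels are shown at each sensitivity.
function visibleRecords(records, sensitivity = "default") {
  const shown = sensitivity === "high"
    ? new Set(["low", "medium", "high"]) // surface low-confidence records too
    : new Set(["medium", "high"]);       // low confidence is hidden by default
  return records.filter((record) => shown.has(record.confidence));
}

const extracted = [
  { question: "Postgres or MongoDB?", confidence: "high" },
  { question: "React or Svelte?", confidence: "medium" },
  { question: "Launch the beta now or wait?", confidence: "low" },
];

console.log(visibleRecords(extracted).length);         // 2
console.log(visibleRecords(extracted, "high").length); // 3
```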

Why 3.2% is the right hit rate

The 1,200-conversation walkthrough produced 38 records at default sensitivity, a hit rate of about 3.2% (38 / 1,200). The reaction I've heard most often is "that seems low." It's not.

Most AI chat is not decision-making. A 1,200-conversation archive contains debugging sessions, concept explanations, code generation, writing drafts, recipe lookups, travel planning, joke requests. Even the engineering sessions are mostly not decisions — they're questions that get answered, problems that get solved, code that gets reviewed. Decisions are a specific subset of conversations with a specific structure: a choice between named options, worked through in dialog, committed to by a human. That's maybe 3–5% of the archive.

The right test isn't "how many of the decisions that were really there did the extractor find?" That requires a ground truth I don't have. The right test is "are the 38 records it found real decisions?" In the walkthrough post, three were discarded as false positives in the 15-minute triage pass. Thirty-five were real. That's a 92% precision rate on the default output. For a v1 with no ML and no cloud dependency, that's the floor I was aiming for.
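For reference, the arithmetic behind the two headline numbers, using only the counts quoted in this post (variable names are illustrative):

```javascript
const conversations = 1200;
const records = 38;
const falsePositives = 3;

const hitRate = records / conversations;
const triagePrecision = (records - falsePositives) / records;

console.log((hitRate * 100).toFixed(2) + "%");         // 3.17%
console.log((triagePrecision * 100).toFixed(1) + "%"); // 92.1%
```

Recall is absent on purpose: computing it would need the count of decisions the extractor missed, which is the ground truth the post says it doesn't have.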

Precision over recall is a deliberate choice, not a limitation. The user experience of opening the extractor output and seeing 38 trustworthy records is categorically different from opening it and seeing 300 candidates with a coin-flip precision rate. A short list you can act on beats a long list you have to filter.

Try it on your own export. The open-source extractor is ~500 lines of dependency-free Node, MIT-licensed, no network calls. Read the source in your browser, install with curl, run against your ChatGPT export or Claude export. The patterns.md file in the source lists every active heuristic and every known miss. Or join the waitlist for the hosted version with team sharing, search, and Notion / Linear export.