# whychose-extractor

> Turn a ChatGPT or Claude export into a structured decision log. **MIT, zero deps, runs anywhere Node does.**

This is the open-source core of [WhyChose](https://whychose.com) — the bit
that reads a raw chat transcript and surfaces the durable *decisions*
inside. The hosted product wraps this in a browser UI, a searchable log, and
teammate sharing. This CLI ships the extraction as standalone code you can
read, run, and audit locally before you ever upload a single byte.

**Why this exists:** every 6–12 months I become the new engineer who has to
answer "why did we pick Postgres over Mongo?" — and the reasoning is buried
somewhere across 300+ ChatGPT conversations. Cmd+F doesn't cut it. This
turns the chat history into a durable artifact.

---

## Install

```bash
curl -sL https://whychose.com/extractor/whychose-extractor-v1.0.0.tar.gz | tar -xz
cd whychose-extractor
node bin/extractor.js --help
```

Or browse the source directly at [whychose.com/extractor](https://whychose.com/extractor).

No `npm install`. No build step. Node 18+ is the only requirement.

---

## Quickstart

Run it against the bundled samples to see the output shape:

```bash
# ChatGPT sample (4 conversations, 2 decisions extracted)
node bin/extractor.js sample-chatgpt.json

# Claude sample (2 conversations, 2 decisions extracted)
node bin/extractor.js sample-claude.json --format=md
```

Run it against your own export:

```bash
# Export from ChatGPT: Settings → Data Controls → Export → unzip → conversations.json
node bin/extractor.js ~/Downloads/conversations.json > decisions.json

# Export from Claude: Settings → Account → Export data → unzip → conversations.json
node bin/extractor.js ~/Downloads/claude-export/conversations.json > decisions.json
```

---

## Flags

| Flag | Values | Default | What it does |
|---|---|---|---|
| `--sensitivity` | `normal`, `high` | `normal` | `high` includes `confidence: low` records (question-shape match with no explicit commit). More recall, more false positives. |
| `--format` | `json`, `jsonl`, `md` | `json` | Output shape. `jsonl` = one record per line (good for piping into `jq` or a SQLite importer). `md` = human-browsable markdown. |
| `-h`, `--help` | — | — | Show help. |
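
The `jsonl` format pairs well with `jq`. A minimal sketch using inline sample records in place of a real export (a real run would pipe `node bin/extractor.js conversations.json --format=jsonl` instead of `printf`; assumes `jq` is installed):

```shell
# Keep only high-confidence records and print their chat titles.
printf '%s\n' \
  '{"confidence":"high","chat_title":"Postgres vs Mongo"}' \
  '{"confidence":"low","chat_title":"CSS question"}' \
  | jq -r 'select(.confidence == "high") | .chat_title'
# prints: Postgres vs Mongo
```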

---

## Output — the `DecisionRecord` shape

Every record follows [schema.json](./schema.json). Minimal example:

```json
{
  "id": "chatgpt-20240627-da25e40f",
  "date": "2024-06-27",
  "source": "chatgpt",
  "chat_title": "Postgres vs Mongo for the new service",
  "question": "We're starting a new billing service",
  "chosen": "Postgres",
  "rejected": [],
  "trade_offs": [
    { "option": "option-1", "pro": "", "con": "Yet another DB for the ops team to babysit" }
  ],
  "confidence": "high",
  "snippet": "user: Yeah money data is the deciding factor. ... I'll go with Postgres.\nassistant: Good call. ...",
  "tags": ["database"]
}
```
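
Because every record follows the same schema, downstream scripts stay trivial. A sketch of consuming the extractor's output, grouping chosen options by tag (the inline records stand in for a real `decisions.json`; the commented-out `readFileSync` line shows how a real run would load it):

```javascript
// Inline stand-ins for extracted DecisionRecords (illustrative values only).
const records = [
  { chosen: 'Postgres', confidence: 'high', tags: ['database'] },
  { chosen: 'monorepo', confidence: 'medium', tags: ['architecture'] },
  { chosen: 'Mongo', confidence: 'low', tags: ['database'] },
];

// In a real run, load the extractor's output instead:
//   const records = JSON.parse(require('fs').readFileSync('decisions.json', 'utf8'));

// Group the chosen option of each record under every tag it carries.
const byTag = {};
for (const rec of records) {
  for (const tag of rec.tags) (byTag[tag] ??= []).push(rec.chosen);
}

console.log(byTag);
// → { database: [ 'Postgres', 'Mongo' ], architecture: [ 'monorepo' ] }
```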

---

## How the extraction works (honest version)

This is a **heuristic regex extractor**, not an LLM. The trade-off: it runs
in milliseconds on your laptop with zero API costs and zero data leaving
your machine, but it will miss decisions that don't fit its patterns. See
[patterns.md](./patterns.md) for the exact pattern library.

The algorithm, in one paragraph:

1. Walk every conversation. For each **user message**, test against the
   *question shapes* (e.g. `should I pick X or Y`, `X vs Y`, `torn between`).
2. On a match, scan the next ≤6 messages for a *commit phrase* from the
   user (e.g. `I'll go with X`, `decided on X`, `going with X`).
3. If the ≤6-message window also contains *trade-off markers*
   (`pros:`, `on the other hand`, etc.), mark confidence `high`. Otherwise
   `medium`. Question match alone → `low` (hidden unless `--sensitivity=high`).
4. Drop the record if the commit is walked back within 2 messages
   (`actually no`, `scratch that`).
5. Tag the record by keyword bucket (database, architecture, pricing, etc.).

That's it: ~500 lines of Node, no dependencies. Browse the source at
[bin/extractor.js](./bin/extractor.js).
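
The steps above can be sketched in a few lines. This is a simplified illustration, not the shipped code: the patterns here are a tiny subset of the real library in [patterns.md](./patterns.md), and `low`-confidence handling is elided:

```javascript
// Illustrative subset of the pattern library (the real one is larger).
const QUESTION = /\b[\w-]+\s+vs\.?\s+[\w-]+\b|\bshould i (pick|use|go with)\b/i;
const COMMIT = /\b(?:i'?ll go with|decided on|going with)\s+([\w-]+)/i;
const TRADEOFF = /\bpros:|\bcons:|\bon the other hand\b/i;
const WALKBACK = /\bactually no\b|\bscratch that\b/i;

function extract(messages) {
  const records = [];
  messages.forEach((msg, i) => {
    // Step 1: question shape in a user message.
    if (msg.role !== 'user' || !QUESTION.test(msg.text)) return;
    // Step 2: scan the next ≤6 messages for a commit phrase from the user.
    const window = messages.slice(i + 1, i + 7);
    const hit = window.findIndex((m) => m.role === 'user' && COMMIT.test(m.text));
    if (hit === -1) return; // question match alone → "low", elided here
    // Step 4: drop the record if the commit is walked back within 2 messages.
    const after = messages.slice(i + hit + 2, i + hit + 4);
    if (after.some((m) => WALKBACK.test(m.text))) return;
    // Step 3: trade-off markers in the window raise confidence to "high".
    const hasTradeoffs = window.some((m) => TRADEOFF.test(m.text));
    records.push({
      chosen: COMMIT.exec(window[hit].text)[1],
      confidence: hasTradeoffs ? 'high' : 'medium',
    });
  });
  return records;
}

// Example run on a toy transcript:
const msgs = [
  { role: 'user', text: 'Postgres vs Mongo for the billing service?' },
  { role: 'assistant', text: 'Pros: Postgres has real transactions. On the other hand, Mongo scales writes.' },
  { role: 'user', text: "Yeah, I'll go with Postgres." },
];
console.log(extract(msgs));
// → [ { chosen: 'Postgres', confidence: 'high' } ]
```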

---

## Privacy

This CLI runs **entirely locally**. The export JSON never leaves your
machine. `whychose-extractor` doesn't make a single network request — check
the source, `grep -n 'require\|fetch\|http' bin/extractor.js` shows only
`fs`, `path`, `crypto`. No telemetry, no phone-home.

If you pipe the output into [whychose.com](https://whychose.com), only the
extracted decision records are uploaded (5–50 short strings per quarterly
export), not the raw transcript. See [whychose.com/privacy](https://whychose.com/privacy)
for the full story.

---

## Known misses (v1)

Documented openly so you know what to expect and when to lean on
`--sensitivity=high`:

- **Long multi-turn decisions (20+ msgs between question and commit).**
  The commit-search window is 6 messages.
- **Implicit decisions** — "ok let me scaffold this" with no commit phrase.
- **Non-English transcripts.**
- **Decisions phrased as statements, not questions** — "I think we should
  probably use Postgres" followed by "yeah ok."

If your export has a common case we're missing, open an issue with a
redacted snippet — the pattern library gets tightened every time a real
miss shows up. See [whychose.com/extractor](https://whychose.com/extractor)
for the issue channel.

---

## Running the tests

```bash
npm test
```

This regenerates output for the bundled samples and diffs it against the
golden files (`sample-chatgpt-output.json`, `sample-claude-output.json`). No
test framework — just `diff`. If it fails, the extractor's behaviour has
changed.

---

## License

MIT — see [LICENSE](./LICENSE).

---

## Contributing

This repo is the open-source core of a hosted product, not a community
project looking for major features. PRs welcome for:

- New pattern entries in `patterns.md` + corresponding regex in `bin/extractor.js`
- New export-format adapters (Poe, Perplexity, Notion AI, Gemini)
- Bug reports with a redacted snippet that shows a real miss

Not in scope:

- LLM-powered extraction (we may build this as a separate opt-in module
  later; the point of the heuristic extractor is zero-cost, zero-network,
  instant)
- Full frontend UI (that's the hosted product at whychose.com)
