Blog · 2026-06-05 · ~9 min read

From 1,200 ChatGPT chats to 38 durable decisions: a real export walkthrough

Q: Should I run the extractor quarterly or on a longer cadence?

Quarterly is the right cadence for most engineers, for two reasons. First, it aligns with planning cycles: a quarterly decision review before sprint planning or a board meeting is a natural slot that already exists. Second, your review of the extractor's output becomes less reliable the more time passes between the conversation and the review. At 3 months you still remember enough context to know whether a record is real or noise; at 12 months you're guessing. The one exception: if you just joined a team or are starting a new project, run an immediate extraction on the last 3 months so you're not starting with an empty decision log while the decisions are fresh.

I've been using ChatGPT as a thinking partner for about 18 months — architecture trade-offs, pricing calls, hiring decisions, infrastructure choices. Last quarter I exported all of it, ran it through the extractor, and expected to surface several hundred decisions. I got 38. Here's what the actual numbers mean, what the 38 records look like, and the three findings that changed how we work.

TL;DR

1,214 conversations, 47.2MB, 3.1 minutes to extract. The extractor surfaced 154 candidate records in the raw pass; after deduplication and the durability filter, 38 survived. That's a 3.1% hit rate, which sounds low until you realise that the roughly 1,176 chats that produced zero decisions were genuinely not decisions: debugging sessions, code generation, explanations, and scratch brainstorming. The 38 that survived are all things a new team member would need to find. Three of them changed our workflow the moment we found them.

The setup

The export request took about 30 seconds to submit from Settings → Data Controls → Export in the ChatGPT web UI. The email with the download link arrived 21 hours later, close to the 24-hour estimate in the export guide. The ZIP contains several files; the one the extractor needs is conversations.json. Mine was 47.2MB, which is on the larger end but not unusual for 18 months of regular use. The file is a JSON array of conversation objects, and each conversation carries a mapping of message nodes linked by parent and child IDs. That tree stores every branch and regeneration, not just the rendered path, so heavy use of regeneration and branching in long conversations is what inflates the file size.
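To make the "every branch, not just the rendered path" point concrete, here is a minimal sketch of walking one conversation from its current leaf back to the root. The field names (mapping, current_node, parent, message.author.role, message.content.parts) are my assumptions about the export layout, and the sample conversation is hand-built, so treat this as an illustration rather than the extractor's actual code.

```javascript
// Sketch: recover the rendered path of one conversation from its message tree.
// Assumed layout: conversation.mapping is keyed by node ID, each node has a
// `parent` ID, and conversation.current_node points at the visible leaf.
function renderedPath(conversation) {
  const turns = [];
  const seen = new Set(); // guard against malformed parent cycles
  let nodeId = conversation.current_node;
  while (nodeId && !seen.has(nodeId)) {
    seen.add(nodeId);
    const node = conversation.mapping[nodeId];
    if (!node) break;
    const msg = node.message;
    if (msg && msg.content && Array.isArray(msg.content.parts)) {
      const text = msg.content.parts.filter((p) => typeof p === "string").join("\n");
      const role = msg.author ? msg.author.role : "unknown";
      if (text.trim() !== "") turns.push({ role, text });
    }
    nodeId = node.parent;
  }
  return turns.reverse(); // we walked leaf -> root, so flip to reading order
}

// Hand-built example: one user turn with two assistant answers (a regeneration).
const sample = {
  current_node: "a2",
  mapping: {
    root: { id: "root", message: null, parent: null, children: ["u1"] },
    u1: { id: "u1", parent: "root", children: ["a1", "a2"],
          message: { author: { role: "user" }, content: { parts: ["Postgres or CockroachDB?"] } } },
    a1: { id: "a1", parent: "u1", children: [],
          message: { author: { role: "assistant" }, content: { parts: ["First draft answer"] } } },
    a2: { id: "a2", parent: "u1", children: [],
          message: { author: { role: "assistant" }, content: { parts: ["Regenerated answer"] } } },
  },
};

const rendered = renderedPath(sample);
const storedNodes = Object.keys(sample.mapping).length;
console.log(rendered.map((m) => `${m.role}: ${m.text}`).join("\n"));
console.log(`${storedNodes} stored nodes, ${rendered.length} turns on the rendered path`);
```

The abandoned first answer (a1) is still in the file but never reaches the rendered path, which is the same effect that inflates a real export.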

I ran the extractor locally — it's about 500 lines of dependency-free Node that you can read in 20 minutes, install via a tarball download, and run without sending anything to a server. The command is straightforward:

node bin/extractor.js --input conversations.json --output ./decisions --format jsonl

It ran for 3.1 minutes on my M2 MacBook Pro. The progress output logged two numbers as it went: candidates (conversations that passed the initial keyword and structure filters) and survivors (candidates that also passed the durability check). I watched the ratio settle: about 12–15% of conversations produced a candidate, and about 25% of candidates survived to become records. The final count: 154 candidates → 38 survivors.
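For anyone who wants to check the funnel, the arithmetic behind those ratios is short. This sketch only uses the three counts reported above; the variable names are mine.

```javascript
// Funnel arithmetic using the counts reported in this post.
const conversations = 1214;
const candidates = 154;
const survivors = 38;

// Percentage of `part` within `whole`, rounded to one decimal place.
const pct = (part, whole) => ((part / whole) * 100).toFixed(1);

const candidateRate = pct(candidates, conversations); // chats that produced a candidate
const survivalRate = pct(survivors, candidates);      // candidates that passed the durability check
const hitRate = pct(survivors, conversations);        // durable decisions per chat overall

console.log(`candidate rate:   ${candidateRate}%`);
console.log(`survival rate:    ${survivalRate}%`);
console.log(`overall hit rate: ${hitRate}%`);
```

The candidate rate lands inside the 12–15% band I watched in the progress output, and the survival rate sits right at the "about 25%" mark.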

What "durable" means to the extractor

The extractor isn't counting decisions; it's counting durable decisions. The distinction matters for understanding the output. The algorithm looks for three properties together:

A named choice: the conversation explicitly names what was picked — not "we might go with X" but "we went with X" or "I'm going with X." Speculative conversations are excluded even if they discuss trade-offs at length.
Named alternatives: the conversation discussed at least one other option that was explicitly set aside. A "we decided to use Postgres" conversation without any mention of alternatives doesn't pass this filter — that's a statement, not a decision record. A decision record by definition captures the road not taken.
A durability signal: something in the conversation suggests the choice will constrain future options for at least a few months. The algorithm uses several proxies for this: mentions of "going forward," "from now on," explicit timelines, mentions of migration cost, or the presence of conditions that would trigger revisiting. Choices that are trivially reversible — a library for a one-off script, a prompt for a single batch job — tend not to pass this filter.
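The three properties above can be sketched as a simple pattern check. To be clear, this is not the extractor's code: the regular expressions and the checkDurability name are hypothetical, chosen only to show how "all three together" behaves on a few sentences.

```javascript
// Illustrative patterns only; the real extractor's rules may differ.
const NAMED_CHOICE = /\b(?:went with|i'm going with|we're going with|decided to use|decided on|settled on)\b/i;
const NAMED_ALTERNATIVE = /\b(?:over|instead of|rather than|ruled out|rejected)\b/i;
const DURABILITY_SIGNAL = /\b(?:going forward|from now on|migration cost|revisit (?:if|when))\b/i;

// Returns which of the three properties the text shows, plus an overall verdict.
function checkDurability(text) {
  const result = {
    namedChoice: NAMED_CHOICE.test(text),
    namedAlternative: NAMED_ALTERNATIVE.test(text),
    durabilitySignal: DURABILITY_SIGNAL.test(text),
  };
  // A record needs all three properties together, not any one of them.
  result.passes = result.namedChoice && result.namedAlternative && result.durabilitySignal;
  return result;
}

const committed = checkDurability(
  "We went with pgvector over a dedicated vector DB. Going forward, revisit if similarity query latency exceeds 80ms p95."
);
const speculative = checkDurability("We might go with Redis instead of Memcached from now on.");
const statementOnly = checkDurability("We decided to use Postgres.");

console.log({ committed, speculative, statementOnly });
```

The speculative sentence fails on the named choice and the bare statement fails on alternatives, which mirrors the two exclusions described above. Keyword matching like this would misfire on real chats (the word "over" alone is a weak signal), so it is a teaching aid, not a replacement for the real filter.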

This is the right filter for the use case. The goal of a decision log isn't to record every technical choice you made; it's to record the choices that will matter when someone else needs to understand the current state of the system. Most choices don't matter at that level. The 38 that passed all three filters are the ones that do.

The 38 decisions: breakdown by category

The output lands as JSONL — one JSON object per line, each with the extracted title, date, choice, rejected alternatives with rejection reasons, rationale, and a pointer back to the source conversation ID. Here's how the 38 break down:

Infrastructure decisions (14): The biggest category, which makes sense — infrastructure choices are exactly the kind of durable, alternatives-considered decisions that AI chat is good at working through. The standouts: a PostgreSQL-over-CockroachDB call for OLTP workloads where the conversation walked through operational complexity and lack of team CockroachDB experience; a pgvector-over-dedicated-vector-DB decision with a concrete revisit trigger (when similarity query latency exceeds 80ms p95); a choice to self-host with Caddy rather than sit behind a managed load balancer, with a cost breakeven calculation for the traffic levels expected in year one.

Tooling decisions (8): Eight tool choices that stuck — Cursor over Copilot (with a specific note that Copilot's inline suggestions were better for small edits but Cursor's chat context was the deciding factor for architecture work), Linear over Jira (with a revisit condition: if we add non-technical stakeholders who need board views, reconsider Jira), DataGrip over TablePlus for the team (DataGrip's schema diff tooling was the specific reason). Most of these I'd half-forgotten making.

Product and feature calls (7): This category surprised me the most, not because I didn't remember making the calls, but because the reasoning had faded. The freemium-over-paid-only launch decision was in here: a conversation from nine months ago that walked through the activation-rate implications of each model and landed on freemium, specifically because a paid-only launch would have masked whether the product worked on its own, without a sales motion to compensate. That reasoning is directly relevant now, and I'd completely forgotten it was written down somewhere.

Hiring and team decisions (5): Five calls about how the team would work — including one about async-first norms, one about "no PM for the first 18 months" with a specific trigger for revisiting (if more than 50% of engineering time is going to stakeholder communication that isn't producing product work), and one about who owns on-call rotation and why.

Operational decisions (4): Backup cadence, Cloudflare WAF rule rationale, a GDPR cookie-banner scoping decision (documented for if we ever get a compliance audit), and an incident severity tier classification. Four decisions I would never have written up as ADRs but that are genuinely useful to have on record.
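If you want to reproduce a breakdown like the one above from your own output, a few lines of Node are enough. This is a sketch under assumptions: the key names (title, choice, rejected, source_conversation_id, tags) are placeholders I picked for illustration, so check a real output file for the actual keys, and the three sample records are abbreviated stand-ins rather than real extractor output.

```javascript
// Sketch: parse JSONL text (one JSON object per line) and count records per tag.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "") // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line));
}

function countByTag(records) {
  const counts = {};
  for (const record of records) {
    const tags = Array.isArray(record.tags) && record.tags.length > 0 ? record.tags : ["untagged"];
    for (const tag of tags) {
      counts[tag] = (counts[tag] || 0) + 1;
    }
  }
  return counts;
}

// Abbreviated sample records with placeholder key names and IDs.
const sampleJsonl = [
  { title: "PostgreSQL over CockroachDB for OLTP", choice: "PostgreSQL",
    rejected: [{ option: "CockroachDB", reason: "operational complexity, no team experience" }],
    source_conversation_id: "example-conversation-1", tags: ["infrastructure"] },
  { title: "pgvector over a dedicated vector DB", choice: "pgvector",
    rejected: [{ option: "dedicated vector DB", reason: "(see source conversation)" }],
    source_conversation_id: "example-conversation-2", tags: ["infrastructure"] },
  { title: "Linear over Jira", choice: "Linear",
    rejected: [{ option: "Jira", reason: "(see source conversation)" }],
    source_conversation_id: "example-conversation-3", tags: ["tooling"] },
].map((record) => JSON.stringify(record)).join("\n");

const parsed = parseJsonl(sampleJsonl);
const counts = countByTag(parsed);
console.log(counts);
```

In practice you would read the text from the file the extractor wrote instead of building it inline.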

Three findings that changed how we work

Most of the 38 records were confirmatory — I remembered the decisions and the reasoning held up. But three records were genuinely surprising in ways that changed something.

Finding 1: A pricing decision that ended a re-litigation. The product and feature category includes a pricing conversation from 14 months ago — March 2025 — where I worked through $49 vs $29 vs $9 pricing for the Pro tier. The reasoning in the chat was specific: "$29 is correct if you're selling a feature add-on to an existing workflow; $49 is correct if you're selling a workflow replacement that competes with the engineer's own tooling habit." That reasoning applied directly to WhyChose. About three weeks before I ran the extraction, a team conversation started about whether $9/month was too low and whether we should move to $19 or $29. I presented the extracted record. The conversation ended in about ten minutes — not because the record was authoritative, but because it meant we had already done the analysis and the analysis was good. Without the extraction, we'd have spent another hour re-litigating something we'd already decided with more context than we currently had.

Finding 2: A "not building X" decision we'd forgotten existed. One of the most valuable things a decision record can preserve is a deliberate choice not to build something. I found a 60-message conversation from about eight months ago that worked through whether to build a real-time sync feature — connecting to the ChatGPT or Claude account directly and pulling new conversations automatically. The conversation was thorough: it documented that neither platform exposes a conversation-history API, that any workaround would require scraping (ToS-violating), that the engineering cost of maintaining a browser-automation sync path would be 3–5× the cost of the quarterly-batch alternative, and that quarterly batch actually matched the ICP's review cadence better than continuous sync anyway. The decision: don't build sync, make quarterly batch the design constraint.

About a month before the extraction, a new contributor opened an issue proposing exactly this feature — real-time sync. The proposal was thoughtful and the contributor didn't know the history. Without the extracted record, that conversation would have restarted from zero. With it, I could share the full analysis from eight months ago and the conversation was productive in a completely different way: "here's why we said no before, here's what would change that" rather than "here's why sync is hard."

Finding 3: A technical debt acceptance with three different memories. One of the infrastructure decisions in the extracted set was a choice to use SQLite for the decision storage layer, with a specific note: "revisit if we hit 50k stored decisions per user; above that, query planner performance degrades on the decision-search indexes we need." I had forgotten this threshold existed. When I mentioned it to two other contributors, one remembered the threshold as "100k," and one thought we'd never set a specific number at all. The actual record had "50k" with a brief explanation of the query pattern that would start struggling above that threshold. Three different working memories, one authoritative record.

What the extractor missed

Being honest about false negatives matters here. The extractor missed decisions in a few predictable categories.

Implicit decisions: Several conversations ended with "okay we'll go with that" without ever naming what "that" was — the context was clear in the chat but the extractor requires an explicit named choice. If you want to maximize extraction yield, name your choices explicitly in the conversation: "going with BullMQ over the homegrown approach" rather than "okay let's do that."

Very short decision threads: About a dozen conversations I checked manually were decisions but ran under 8 messages — quick sanity-checks where I knew what I wanted and the chat confirmed it. Those don't produce the alternatives-and-rationale signal the extractor needs, so they come out blank. For this category, the right solution is to capture the decision manually at the time rather than relying on extraction retroactively.

Decisions where I used o1 or o3 reasoning models: The o1/o3 conversations were in the export (the format is identical to GPT-4 conversations), but the reasoning that characterises o1/o3 deliberation appears in the final response rather than in a separate field. The extractor treats these like regular conversations, and extraction quality is actually good because o1/o3 final responses explicitly enumerate alternatives and state rejection reasons. The misses were a couple of conversations where I used o1 briefly for a quick architectural sense-check: they were too short to pass the filter even though the final response contained useful reasoning.

The 15-minute triage pass

The extractor output isn't the final record; it's the first draft. I spent about 15 minutes going through the 38 extracted records and doing four things:

Discard noise: Three records were false positives — planning conversations where I'd enumerated options and picked one speculatively, but no real commitment had been made. Those went to the bin.
Annotate the revisit conditions: About half the records had a revisit_if field populated by the extractor; the other half needed one added manually. This is the most important 30-second investment per record — a decision record without a revisit condition is one that you'll either follow forever or re-litigate from scratch. Naming the trigger makes it a living rule rather than a frozen artifact.
Tag by domain: I added tags to each record so the output is searchable by area — infrastructure, tooling, product, team, ops. The extractor guesses at tags but they needed cleanup.
Flag three for full ADR promotion: The SQLite-storage-threshold record, the sync-not-built record, and the pricing record were important enough to warrant promotion from the shorter extracted format to a full Nygard ADR in the doc/decisions/ repo. The other 32 that survived triage live in the decision log as extracted records: shorter, less formal, but findable.
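The revisit-condition step in the list above is the easiest one to script. Here is a small sketch that lists the records still missing a trigger. The revisit_if field is the one the extractor populates; the title key and the sample records are assumptions for illustration, not my real output.

```javascript
// Sketch: find records that still need a revisit condition added by hand.
function needsRevisitCondition(record) {
  // Treat a missing field, a non-string value, or a blank string as "not set".
  return typeof record.revisit_if !== "string" || record.revisit_if.trim() === "";
}

// Illustrative records only; which ones lack a trigger here is made up for the example.
const extracted = [
  { title: "pgvector over a dedicated vector DB", revisit_if: "similarity query latency exceeds 80ms p95" },
  { title: "SQLite for decision storage", revisit_if: "50k stored decisions per user" },
  { title: "DataGrip over TablePlus", revisit_if: "" },
  { title: "Cursor over Copilot" },
];

const todo = extracted.filter(needsRevisitCondition).map((record) => record.title);
console.log(`Records needing a revisit condition (${todo.length}):`);
for (const title of todo) console.log(`- ${title}`);
```

Running something like this first turns the annotation step into a short checklist instead of a read-through of every record.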

That's the two-tier practice in operation: extracted records for the roughly 90% that would otherwise vanish, full ADRs for the 3 out of 38 that warrant the ceremony. Both tiers serve the same function, answering "why did we do this?" when the original author is unavailable, but at very different cost per record.

The number to hold on to: 3.1%

The ratio of durable decisions to total chats is about 3% — 38 from 1,214. That sounds underwhelming, but consider the alternative: without extraction, the hit rate for documented durable decisions would be closer to 0.2–0.5% of the same corpus (based on the 60-day ADR staleness data — if most teams that adopt an ADR practice document 8–15 records per quarter, and an active engineer makes 20–40 decisions per quarter, roughly half to two-thirds of decisions are never written up even in teams with an active ADR practice). The extraction approach doesn't make decisions documentable; it just makes the documentation happen automatically against conversations that were already being recorded.

The 97% that didn't produce a decision record aren't lost — they're just not decision records. The debugging session where you fixed a tricky bug is in your chat history if you ever need it. But it's not a record that constrains your team's future options. The 38 that are in the log are the ones that do.

If you want to see what this looks like for your own history, the extraction walkthrough takes about ten minutes of hands-on time, not counting the wait for the export email (21 hours in my case). The extractor is open source: about 500 lines of Node, runs locally, no signup required. If the records it surfaces look useful, the quarterly cadence is worth building in. If they don't, you've spent ten minutes and learned the answer.

Try it on your own export. The open-source extractor takes a ChatGPT or Claude conversations.json and emits decision records to JSON, JSONL, or Markdown. ~500 lines of dependency-free Node, MIT-licensed, runs locally. Request your ChatGPT export and, once the download link arrives, you can have an initial run in under ten minutes. Or join the waitlist for the hosted version with team sharing, search, and Notion / Linear export.