How to Convert Your ChatGPT Export to Markdown
You exported the JSON. Now you want one readable .md file per conversation, in render order, without the regenerated branches and the hidden system boilerplate. Here's the 30-line script — and the four edge cases that break the naive version.
TL;DR
ChatGPT's conversations.json stores each chat as a DAG (a mapping object plus a current_node pointer), not a flat array. Walking parent-pointers from current_node back to the root gives you the visible thread; everything else is regenerated branches you don't want in the output. Filter out author.role == "system", concatenate content.parts[] for assistant turns, render with one heading per message, and write one file per conversation named YYYY-MM-DD-<slug>.md. The full script is below.
Why a one-line dump won't work
The first instinct — jq '.[] | .mapping[] | .message' — fails for three reasons that show up immediately. (1) Messages aren't sorted. The mapping object is keyed by uuid and iterates in insertion order, which is roughly creation order, but creation includes every regenerated assistant turn you discarded. A 40-message chat in the UI becomes 120 messages in the dump. (2) The visible thread is selected by current_node, not by recency. ChatGPT records every "Regenerate response" you clicked as a sibling node; only the path from root to current_node is the chat you actually had. (3) System messages are noise. The first node after the conversation root is almost always a system-role message containing safety boilerplate and your Custom Instructions; rendering it pollutes the file with content the user never wrote.
The shape of the export is documented in the ChatGPT conversations.json field reference. The script below assumes that schema and walks it correctly.
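A trimmed sketch of that shape, with illustrative values (field names follow the reference above; the node ids and contents here are made up):

```json
[
  {
    "id": "conv-uuid",
    "title": "Postgres vs MongoDB",
    "create_time": 1710000000.123,
    "current_node": "node-3b",
    "mapping": {
      "node-root": { "parent": null, "children": ["node-1"], "message": null },
      "node-1":    { "parent": "node-root", "children": ["node-2"],
                     "message": { "author": { "role": "system" },
                                  "content": { "parts": [""] } } },
      "node-2":    { "parent": "node-1", "children": ["node-3a", "node-3b"],
                     "message": { "author": { "role": "user" },
                                  "content": { "parts": ["Which database?"] } } },
      "node-3a":   { "parent": "node-2", "children": [],
                     "message": { "author": { "role": "assistant" },
                                  "content": { "parts": ["(discarded regen)"] } } },
      "node-3b":   { "parent": "node-2", "children": [],
                     "message": { "author": { "role": "assistant" },
                                  "content": { "parts": ["Postgres, because…"] } } }
    }
  }
]
```

Note that node-3a and node-3b are siblings: only node-3b is on the path from current_node back to the root, so only it belongs in the output.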
The 30-line script
Save as chatgpt-to-md.sh. Requires jq 1.7+. Reads conversations.json from the current directory and writes one file per conversation into ./out/:
#!/usr/bin/env bash
set -euo pipefail
mkdir -p out
jq -c '.[]' conversations.json | while read -r convo; do
  id=$(jq -r '.id' <<<"$convo")
  title=$(jq -r '.title // "untitled"' <<<"$convo")
  ctime=$(jq -r '.create_time // 0 | floor | gmtime | strftime("%Y-%m-%d")' <<<"$convo")
  slug=$(echo "$title" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\+/-/g' -e 's/^-\+//' -e 's/-\+$//' | cut -c1-60)
  out="out/${ctime}-${slug:-untitled}.md"
  # Walk current_node back to root, then reverse — gives render order.
  jq -r --arg ID "$id" --arg TITLE "$title" '
    . as $c                          # keep the conversation object in scope
    | def walk(node):                # shadows the jq builtin; intentional
        if node == null then []
        else [node] + walk($c.mapping[node].parent) end;
      (walk($c.current_node) | reverse) as $path
    | "# " + $TITLE + "\n\n> conversation_id: " + $ID + "\n"
    , ( $path[]
        | $c.mapping[.] as $n
        | $n.message
        | select(. != null)
        | select(.author.role != "system")
        | select((.content.parts // []) | length > 0)
        | "\n## " + (.create_time // 0 | floor | gmtime | strftime("%Y-%m-%d %H:%M UTC"))
          + " — " + .author.role + "\n\n"
          + ( (.content.parts // [])
              | map(if type == "string" then . else (.text // "") end)
              | join("\n\n") )
      )
  ' <<<"$convo" > "$out"
  echo "wrote $out"
done
Make it executable (chmod +x chatgpt-to-md.sh), then run ./chatgpt-to-md.sh from the directory holding conversations.json. Expect ~150 ms per conversation on a recent laptop; a 1,200-chat export takes ~3 minutes.
The four edge cases the naive version hits
- Orphan nodes. Some mapping entries have no parent and aren't reachable from current_node — typically tool-call placeholders or fragments from a chat that hit a backend error mid-turn. The walk above ignores them by construction (they're outside the parent chain), which is the right behavior. If you want to audit them, run a separate query: jq '.[].mapping | to_entries[] | select(.value.parent == null and .value.message != null)'.
- Hidden system messages. The root node's message is usually null, but its first child is a system-role boilerplate message ("You are ChatGPT, a large language model trained by OpenAI…"). The select(.author.role != "system") filter drops it. If you have Custom Instructions enabled, those land as a second system message — the same filter handles both.
- Multi-part assistant turns. When a turn involves DALL·E, the code interpreter, or any function call, content.parts is an array of {content_type, text} objects, not a list of strings. A naive parts[0] takes the first fragment (often a tool-call placeholder) and drops the actual reply. The map(if type == "string" then . else (.text // "") end) coalesces both shapes; join("\n\n") preserves the order. For tool-call diagnostics, replace the fallback with .text // (.image_url // "[non-text part]").
- Regenerated branches. "Regenerate response" creates a sibling node in the DAG, not a replacement. The walk(current_node) | reverse path naturally selects only the chosen branch. If you want to recover earlier regens for diff review, query jq '.[].mapping | to_entries[] | select(.value.children | length > 1)' — every entry there is a fork point, and .value.children is the list of sibling node ids you can walk individually.
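The fork-point audit from the last bullet can be run end-to-end against a toy export. The fixture below is hypothetical (two assistant siblings under one user turn), but it follows the mapping / parent / children field names the script relies on:

```shell
# Hypothetical two-branch fixture: node n1 has two children because one
# assistant reply was regenerated; only n2b is on the current_node path.
cat > /tmp/sample-conversations.json <<'EOF'
[{"id":"c1","title":"demo","current_node":"n2b",
  "mapping":{
    "root":{"parent":null,"children":["n1"],"message":null},
    "n1":{"parent":"root","children":["n2a","n2b"],
          "message":{"author":{"role":"user"},"content":{"parts":["hi"]}}},
    "n2a":{"parent":"n1","children":[],
           "message":{"author":{"role":"assistant"},"content":{"parts":["first try"]}}},
    "n2b":{"parent":"n1","children":[],
           "message":{"author":{"role":"assistant"},"content":{"parts":["regenerated"]}}}
  }}]
EOF

# Fork points: mapping entries with more than one child.
jq -r '.[].mapping | to_entries[] | select(.value.children | length > 1) | .key' \
  /tmp/sample-conversations.json    # prints: n1
```

From a fork point like n1, each entry in children is the head of a branch you can walk forward independently.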
Sanity-check the output before you trust it
Two cheap checks. (1) Compare conversation counts: jq 'length' conversations.json gives the input count, and out/ should contain the same number of .md files (minus any whose title was empty and whose create_time was missing — those collide on untitled.md; rerun with the conversation id in the filename instead of the title if you hit them). (2) Compare visible-message counts: open one chat in the ChatGPT UI, count the user and assistant turns, then grep -c '^## ' out/<that-file>.md. The numbers should match. If the markdown count is higher, you forgot to filter system messages; if lower, you're walking parent from the wrong node (use current_node, not the root).
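Both checks script cleanly. The fixture below is a throwaway stand-in (two stub conversations, two stub files) so the commands run as-is; point them at your real conversations.json and out/ instead:

```shell
# Throwaway fixture standing in for a real export and a real out/ directory.
tmp=$(mktemp -d)
printf '[{"id":"a"},{"id":"b"}]' > "$tmp/conversations.json"
mkdir "$tmp/out"
printf '# A\n\n## 2024-01-01 10:00 UTC — user\n\nhi\n' > "$tmp/out/a.md"
printf '# B\n\n## 2024-01-02 11:00 UTC — assistant\n\nyo\n' > "$tmp/out/b.md"

# Check 1: conversation count in vs. file count out.
convos=$(jq 'length' "$tmp/conversations.json")
files=$(ls "$tmp/out"/*.md | wc -l | tr -d ' ')
echo "conversations=$convos files=$files"    # prints: conversations=2 files=2

# Check 2: visible turns in one file (compare by hand against the UI).
grep -c '^## ' "$tmp/out/a.md"               # prints: 1
```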
Once you have markdown, what next?
Per-conversation Markdown is the right format for two follow-up workflows. (1) Archive. Drop the out/ directory into a private repo or a Notion Backups database; you now have grep-able, ripgrep-able, dated history that survives ChatGPT account changes. (2) Decision extraction. The conversion above gives you readable text per chat, but it doesn't surface which chats contained durable decisions vs which were scratch thinking. That's the layer above search — a different kind of question, with a different tool. Decision extraction walks the same DAG you just walked, applies heuristic patterns for "chose X over Y" / "decided to" / "going with" framings, and emits a structured record per match instead of raw markdown. WhyChose's open-source extractor ships the same DAG-walk this page describes — fork it, run it locally, audit every match.
How WhyChose helps
WhyChose treats your ChatGPT export as a source of decisions, not as a content archive. The conversion script above gets you to readable markdown; WhyChose's extractor goes one layer further and surfaces the architectural and product calls buried in those conversations. Same DAG-walk under the hood — the open-source extractor implements exactly the parent-pointer reversal shown above, then layers a decision classifier on top. The hosted product adds a teammate-shareable link, Notion / Linear export, and an audit trail. If you're already comfortable with jq, the extractor is the more honest path: it's MIT-licensed, runs locally, and you can read every regex in patterns.md.
Related questions
Why can't I just dump every message in conversations.json to Markdown?
Because conversations.json doesn't store messages in render order — each chat is a DAG (the mapping object) plus a current_node pointer, and only the path from root to current_node is the visible thread. A flat dump returns every regenerated branch, every hidden system prompt, and every orphan, in arbitrary order. A 40-message chat becomes 120 unordered messages. You have to walk the DAG.
Do I need to handle the parts array, or is the first element always the message?
You have to handle the array. content.parts is a list, not a singleton, and assistant turns frequently contain multiple parts (tool call + tool result + text reply). Taking parts[0] drops the actual reply for every chat that involved DALL·E, the code interpreter, or any function call. Concatenate parts[] with a separator, or filter on content_type before joining.
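A minimal demonstration of the coalescing filter the script uses, on a hypothetical mixed-shape parts array (one plain string, one object part):

```shell
# One string part and one {content_type, text} object part (made-up data).
echo '{"content":{"parts":["hello",{"content_type":"text","text":"world"}]}}' \
  | jq -r '.content.parts
           | map(if type == "string" then . else (.text // "") end)
           | join("\n\n")'
# prints "hello", a blank line, then "world"
```

A naive .content.parts[0] on the same input would print only "hello" and silently drop the second part.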
Should I include the system message in the output?
For a readable archive, no. The first node after the root is almost always a system-role message containing OpenAI's safety boilerplate and your Custom Instructions; rendering it pollutes the file with content the user never wrote. Filter author.role != "system" unless you're auditing prompt injection or studying how Custom Instructions shaped a chat.
What's the right way to name the output files?
Use a slug derived from the conversation title plus the create_time prefix in YYYY-MM-DD form, so files sort chronologically and remain unique even when two chats share a title. Example: 2026-03-12-postgres-vs-mongodb.md. Keep the conversation id as a comment in the file's frontmatter — it's canonical but unfriendly for filenames.
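The slug pipeline from the script, run standalone on a hypothetical title (GNU sed assumed, for the \+ repetition operator):

```shell
# 'Postgres vs. MongoDB?' -> 'postgres-vs-mongodb' (GNU sed assumed).
slug=$(echo 'Postgres vs. MongoDB?' | tr '[:upper:]' '[:lower:]' \
  | sed -e 's/[^a-z0-9]\+/-/g' -e 's/^-\+//' -e 's/-\+$//' | cut -c1-60)
echo "2026-03-12-${slug}.md"    # prints: 2026-03-12-postgres-vs-mongodb.md
```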
Further reading
- How to export your ChatGPT history — produce conversations.json in the first place.
- ChatGPT conversations.json format reference — the schema this script depends on.
- How to search your ChatGPT history — once you have markdown, ripgrep is the level-2 retrieval tool.
- ChatGPT export not working? — eight common failure modes if your export never landed.
- How to extract decisions from your ChatGPT chats — the level-4 step beyond search.
- The open-source extractor — MIT, runs locally, ships the same DAG-walk shown here.