
ChatGPT Voice Mode in the Data Export — Transcripts, What's Missing, and How to Process Them

When you use ChatGPT Voice Mode, OpenAI transcribes your speech to text in real time and discards the audio. The transcript appears in conversations.json as a regular text message — indistinguishable from a message you typed. There is no audio file in the export, no voice flag on the message, and no separate voice export path. This page covers exactly what is and is not stored, how to identify voice turns in the export, how Whisper transcription quality affects the text, and what changes (if anything) when you are extracting decisions from a voice-heavy session.

TL;DR

Voice conversations are in conversations.json as plain text. Audio is never stored. The Whisper transcript is all you have. Voice turns have author.role: "user" and content_type: "text" — identical to typed turns. No voice flag, no audio node, no special content type. Standard Voice Mode and Advanced Voice Mode export identically. Decision extraction tools work without modification on voice conversations.

What is in the export

A voice mode conversation that you export via Settings → Data Controls → Export data appears in conversations.json in exactly the same mapping DAG structure as any other conversation. Each turn in the conversation is a node in the mapping, each node has a message object, and the message object has the same fields regardless of whether the turn was spoken or typed:

{
  "id": "msg_abc123",
  "author": {
    "role": "user",
    "metadata": {}
  },
  "content": {
    "content_type": "text",
    "parts": [
      "so basically the question is whether we should use Redis here or just lean on Postgres for the job queue, because I know we already have a Postgres instance and adding Redis is another thing to operate"
    ]
  },
  "create_time": 1748745600.0,
  "metadata": {
    "request_id": "req_xyz789",
    "model_slug": "gpt-4o"
  }
}

This is a voice turn. The parts[0] string is the Whisper transcription of what you said. There is no audio_url, no voice_mode flag in metadata, and no distinct content_type value that identifies this as a voice-mode message. From the export's perspective, you typed it.
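As a rough illustration of working with this structure, the Python sketch below pulls the user-side text out of one conversation. It assumes only the field layout shown above (mapping, then node, then message); the function name user_text_turns is hypothetical and not part of any OpenAI tooling.

```python
from typing import Any


def user_text_turns(conversation: dict[str, Any]) -> list[str]:
    """Return the text of every user turn, ordered by create_time.

    Assumes the layout shown above: conversation["mapping"] maps node ids
    to nodes, and each node may carry a "message" (or None).
    """
    turns = []
    for node in conversation.get("mapping", {}).values():
        message = node.get("message")
        if not message:
            continue  # nodes without a message carry no text
        if message.get("author", {}).get("role") != "user":
            continue
        content = message.get("content", {})
        if content.get("content_type") != "text":
            continue
        parts = [p for p in content.get("parts", []) if isinstance(p, str)]
        text = " ".join(parts).strip()
        if text:
            turns.append((message.get("create_time") or 0.0, text))
    # Spoken and typed turns are indistinguishable here; both come back as text.
    return [text for _, text in sorted(turns)]
```

Sorting by create_time is a simplification that suits a linear voice session; a conversation with edited or regenerated branches would need a walk along the mapping's parent links instead.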

The assistant's response to a voice turn is equally standard, with author.role: "assistant", content_type: "text", and the same structure as any GPT-4o text response. In Standard Voice Mode, the assistant's text is converted to speech for playback, but it is the same text response that would appear if you had typed the prompt. In Advanced Voice Mode, the end-to-end audio model generates the response differently, but the export still contains only the text transcript of what it said.

What IS in the export from a voice conversation:

Whisper transcription of every user utterance (as a text message)
Full assistant text response for every turn
Message timestamps (create_time)
Model slug (e.g., gpt-4o or gpt-4o-realtime-preview for Advanced Voice Mode)
gizmo_id if the voice session used a Custom GPT
Conversation title (auto-generated or user-named)

What is not in the export

The following information from a voice session is not present in the export because none of it is retained once real-time processing has finished:

Audio files. OpenAI does not retain voice audio after real-time processing. The audio is transcribed by Whisper (Standard Voice) or processed directly by the audio model (Advanced Voice), and the audio data is discarded. No audio file appears in the export ZIP.
Tone, emphasis, and prosody. The emotional register of what you said — whether you were uncertain, confident, frustrated, or emphatic on a specific word — is not captured in the text transcription. A sentence spoken haltingly and a sentence spoken fluently produce the same exported string.
Disfluencies (mostly). Whisper strips most disfluencies — filler words like "uh", "um", "like" and mid-sentence repairs where you restart a sentence — from the transcription. Some pass through, particularly in longer utterances or when followed by a meaningful word that looks like a noun phrase to the transcription model.
Turn boundaries in overlapping speech. If you spoke over the assistant's audio playback or interrupted it, the transcription captures what it received but the turn boundary may not match the conversational intent. The export will show a user turn that starts mid-concept.
Visual context from the camera. If you used the Live Camera feature (pointing the phone camera at something while voice chatting), the visual frames are processed by GPT-4o's vision capability in real time but are not stored. The assistant's text response references what it saw, but the images themselves are absent from the export. You will see the assistant describing something ("I can see a whiteboard with three columns…") with no corresponding image node in the turn.
Speaker-turn attribution for multi-speaker sessions. If two people took turns speaking in the same voice session (common in in-person discussions), the export shows all user-side speech as a single author.role: "user" — there is no per-speaker attribution.

Whisper transcription quality and what it means for your export

The text in your voice conversation export is only as accurate as the Whisper transcription. For most general-purpose conversation this is high quality, but three categories regularly produce transcription errors:

Technical jargon and product names

Whisper is trained on general audio data and has strong coverage of common technical vocabulary, but it makes errors on proprietary product names, internal codenames, and domain-specific terms. "PostgreSQL" typically transcribes correctly; "ClickHouse" sometimes becomes "click house" (two words) or "Clickhouse" (one word, wrong casing); a specific library like "npryce/adr-tools" may become "in price adr tools" or similar. The error pattern is consistent: Whisper falls back to a phonetic spelling when it doesn't recognise a term as a word in its vocabulary.

This affects decision extraction: if the exported text says "so we went with in price adr tools" instead of "npryce/adr-tools", an extractor matching on tool names will miss it. The mitigation is to review candidate decision records from voice conversations and check any unusual spellings against the original conversational context.
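One way to support that review is a small alias table that maps mis-transcriptions you have already spotted back to the intended term before extraction runs. The sketch below is an assumption about how such a step could look; KNOWN_ALIASES and normalise_terms are hypothetical names, and the entries are only the examples mentioned above.

```python
import re

# Hypothetical alias table: mis-transcriptions seen in your own exports,
# mapped to the intended term. These entries are examples only.
KNOWN_ALIASES = {
    "in price adr tools": "npryce/adr-tools",
    "click house": "ClickHouse",
    "clickhouse": "ClickHouse",
}


def normalise_terms(text: str, aliases: dict[str, str] = KNOWN_ALIASES) -> tuple[str, list[str]]:
    """Replace known mis-transcriptions; return the new text and the terms restored."""
    restored: list[str] = []
    # Longest alias first so multi-word phrases win over shorter overlaps.
    for alias in sorted(aliases, key=len, reverse=True):
        canonical = aliases[alias]
        pattern = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)

        def fix(match: re.Match, canonical: str = canonical) -> str:
            # Only report a restoration when the text actually changed.
            if match.group(0) != canonical and canonical not in restored:
                restored.append(canonical)
            return canonical

        text = pattern.sub(fix, text)
    return text, restored
```

The returned list of restored terms gives a reviewer a short checklist of the places where the transcript was altered, so automatic substitutions still get a human look.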

Accented speech

Whisper handles a wide range of accents but has lower accuracy on accents that are underrepresented in its training data. The error distribution is non-uniform: some accented speakers get near-perfect transcription; others see systematic substitutions on specific phonemes that propagate across the entire conversation. If your voice conversation exports consistently show unexpected word substitutions, the most likely cause is accent-driven transcription errors rather than a storage or formatting problem.

Low-audio-quality conditions

Background noise, wind, phone-at-distance recordings, and Bluetooth headsets with compressed audio all reduce transcription accuracy. The errors from audio quality problems are typically more random and harder to recover from than accent-driven errors — they appear as incorrect common words rather than phonetic approximations of the correct term. Decision records derived from low-quality audio sessions should be treated as drafts requiring manual review.

Standard Voice Mode vs Advanced Voice Mode in the export

ChatGPT has two voice modes with different underlying architectures, but the same export behaviour:

Dimension	Standard Voice Mode	Advanced Voice Mode (AVM)
Available on	All plans	Plus, Team, Enterprise
Architecture	Whisper STT → GPT-4o → TTS	End-to-end GPT-4o audio model (native audio in and out)
Export format	Text in conversations.json	Text in conversations.json (identical)
Audio stored?	No	No
model_slug in export	gpt-4o	gpt-4o-realtime-preview or similar
Transcription fidelity	Whisper accuracy (generally high)	Native audio model (generally equal or higher for mid-sentence corrections)
Interrupted-turn handling	May truncate interrupted turns	Better handling of barge-in behaviour, still text-only export

The model_slug field in the per-message metadata is the only way to distinguish an AVM session from a Standard Voice session in the export. A jq one-liner that groups conversations by whether any turn has a gpt-4o-realtime model slug will partition your export into AVM and non-AVM sessions.
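The same partition can be sketched in Python. This is a minimal version that assumes the per-message metadata.model_slug field shown earlier and treats any slug starting with gpt-4o-realtime as Advanced Voice Mode; the function names are hypothetical.

```python
from typing import Any

REALTIME_PREFIX = "gpt-4o-realtime"  # slug prefix taken from the table above


def is_avm_conversation(conversation: dict[str, Any]) -> bool:
    """True if any message in the conversation carries a realtime model slug."""
    for node in conversation.get("mapping", {}).values():
        message = node.get("message") or {}
        slug = (message.get("metadata") or {}).get("model_slug") or ""
        if slug.startswith(REALTIME_PREFIX):
            return True
    return False


def partition_by_avm(conversations: list[dict[str, Any]]) -> tuple[list, list]:
    """Split an export into (AVM conversations, everything else)."""
    avm, other = [], []
    for conv in conversations:
        (avm if is_avm_conversation(conv) else other).append(conv)
    return avm, other
```

To run it against a real export, load conversations.json with json.load and pass the resulting list to partition_by_avm. The second list still mixes Standard Voice and typed sessions, since both carry the gpt-4o slug.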

How to identify voice conversations in the export

There is no reliable programmatic flag for voice conversations in conversations.json. The model_slug identifies AVM sessions but not Standard Voice sessions (which use the same gpt-4o slug as typed conversations). The following heuristics are the best available signals:

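model_slug gpt-4o-realtime-preview — confirms Advanced Voice Mode. To count the conversations that contain at least one such turn (each conversation counted once, however many turns match), use: jq '[.[] | select(any(.mapping[]?; .message.metadata.model_slug? == "gpt-4o-realtime-preview"))] | length' conversations.json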
Short median turn length in user messages — voice utterances are typically 5–30 words; typed messages are typically 20–150 words. A conversation where the median user turn is under 20 words is likely voice-heavy.
Conversational filler vocabulary — grep the user-turn text for "so basically", "you know", "I mean", "right so", "and then". These appear in speech transcription far more often than in typed messages.
Sentence-restart patterns — mid-sentence restarts ("I think we should — actually, let me rephrase that") are common in transcribed speech and uncommon in typed text.
Conversation topic memory — if you know you discussed a specific decision aloud, the conversation title (if auto-generated) often reflects the topic accurately enough to match against topic recall.
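The filler-vocabulary heuristic in the list above can be turned into a rough score. The sketch below counts the listed phrases per 100 words of user text; the phrase list comes from this page, while the function name filler_rate and the per-100-words scaling are assumptions made for illustration.

```python
import re

# Filler phrases listed above; extend with whatever shows up in your own speech.
FILLER_PHRASES = ["so basically", "you know", "i mean", "right so", "and then"]
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in FILLER_PHRASES) + r")\b",
    re.IGNORECASE,
)


def filler_rate(user_turns: list[str]) -> float:
    """Filler phrases per 100 words across all user turns (0.0 if there is no text)."""
    words = sum(len(turn.split()) for turn in user_turns)
    if words == 0:
        return 0.0
    hits = sum(len(_FILLER_RE.findall(turn)) for turn in user_turns)
    return 100.0 * hits / words
```

A score like this is only meaningful relative to your own typed conversations, so it is worth computing it for a few sessions you know were typed before picking a cut-off. Phrases split by punctuation in the transcript (for example "right, so") are not matched by this simple pattern.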

A jq recipe that extracts conversations whose user turns have a median word count below 15 (a rough voice heuristic, stricter than the 20-word guideline above; adjust the threshold to suit your own exports):

jq '
  map(
    # Word counts for every user text turn in this conversation
    [ .mapping[]?
      | select(.message != null)
      | select(.message.author.role == "user")
      | select(.message.content.content_type == "text")
      | (.message.content.parts[0] // "" | tostring | split(" ") | map(select(. != "")) | length)
    ] as $counts
    # Skip conversations with no user text turns
    | select(($counts | length) > 0)
    # Computed per conversation, so identically titled conversations are not merged
    | { title, median_user_words: ($counts | sort | .[length / 2 | floor]) }
  )
  | map(select(.median_user_words < 15))
' conversations.json

This produces a list of conversation titles with short median user turns — the candidate set for voice sessions. It will include some very short typed conversations (quick clarification exchanges), so treat it as a filter, not a definitive label.

Extracting decisions from voice conversations

Voice conversations contain engineering decisions as often as — and sometimes more honestly than — typed conversations. When engineers think aloud about architecture choices ("so the question is Redis or Postgres for this queue, and the reason I'm leaning Redis is because…"), the spoken reasoning is often less polished but more direct than what they would type. The trade-off consideration and the rejection rationale appear in natural language, unfiltered by the tendency to present typed messages as more considered than they are.

Decision extraction from voice conversations works identically to extraction from typed conversations: the jq recipes described on the "extract decisions from ChatGPT" page work without modification, because voice conversations are plain text in the same structure. The WhyChose extractor processes them with the same pipeline.

Three differences matter in practice:

1. Reasoning is more fragmented across turns

In a typed conversation, an engineer might write a single 150-word message explaining a trade-off. In a voice conversation, the same reasoning appears across 8–12 short turns — question, clarification, partial answer, reassertion, conclusion. The decision record's Context field, if populated from a single turn, captures only a fragment. Effective extraction from voice requires a wider context window (the several turns before and after the decision statement) rather than the single-message extraction that works for typed conversations.
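A wider context window can be as simple as slicing the ordered list of turns around the turn that states the decision. The sketch below assumes the turns are already in conversation order; the function name context_window and the default window sizes are illustrative choices, not values taken from any extractor.

```python
def context_window(turns: list[str], index: int, before: int = 4, after: int = 2) -> list[str]:
    """Return the turn at `index` plus up to `before`/`after` neighbouring turns."""
    if not 0 <= index < len(turns):
        raise IndexError(f"turn index {index} out of range")
    start = max(0, index - before)          # clamp at the start of the conversation
    end = min(len(turns), index + after + 1)  # slice end is exclusive
    return turns[start:end]
```

Taking more turns before the decision than after reflects the pattern described above, where the question and clarifications precede the conclusion; both sizes are worth tuning against your own sessions.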

2. Transcription errors may corrupt named entities

As noted above, library names and product names may be mis-transcribed. When reviewing extracted decision records from voice sessions, verify that all technical terms in the Decision and Context fields match your actual intent. The pattern "chose X over Y because Z" may have "X" mis-transcribed even if the reasoning around it is accurate.

3. Voice conversations tend toward exploration, not conclusion

A typed conversation often ends with an explicit decision: "OK, let's go with Postgres then." A voice conversation more often ends when the speaker runs out of things to say — the conclusion may be implicit in the last few exchanges rather than stated explicitly. Decision extraction heuristics that look for conclusion markers ("we'll go with", "the decision is", "we've decided") work well for typed conversations; voice conversations often require the "trade-off evaluation" pattern that extracts the decision from the comparison structure ("option A has X advantage but Y downside; option B has…") rather than from an explicit conclusion statement.
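The two-stage approach described above can be sketched as a marker search with a fallback signal. The marker list is the one quoted in the paragraph; the function names are hypothetical, and the trade-off evaluation itself is left out because this page does not specify how it works.

```python
import re

# Conclusion markers quoted in the paragraph above.
CONCLUSION_MARKERS = ["we'll go with", "the decision is", "we've decided"]
_MARKER_RE = re.compile(
    "|".join(re.escape(m) for m in CONCLUSION_MARKERS), re.IGNORECASE
)


def find_conclusion_turns(turns: list[str]) -> list[int]:
    """Indices of turns that contain an explicit conclusion marker."""
    # Transcripts may use a typographic apostrophe, so normalise before matching.
    return [
        i for i, turn in enumerate(turns)
        if _MARKER_RE.search(turn.replace("\u2019", "'"))
    ]


def needs_tradeoff_pass(turns: list[str]) -> bool:
    """True when no explicit conclusion was found and a comparison-based pass is needed."""
    return not find_conclusion_turns(turns)
```

For a voice session, needs_tradeoff_pass returning True is the expected case rather than a failure, so the fallback path deserves at least as much attention as the marker search.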

The WhyChose extractor handles both patterns and is not voice-specific in its extraction logic — it processes the text regardless of input modality.


Voice mode and ChatGPT memory

If you have ChatGPT Memory enabled, voice conversations can trigger memory saves just as typed conversations do — OpenAI may save a memory entry based on something you said in a voice session. In the memory.json file in the data export, memory entries from voice sessions are indistinguishable from entries from typed sessions: the same per-entry schema (id, content, enabled, created_at, updated_at) with no source-session pointer. As covered in the ChatGPT memory export reference, there is no link from a memory entry back to the conversation that created it — this is true for both typed and voice-triggered memories.