ChatGPT Voice Mode in the Data Export — Transcripts, What's Missing, and How to Process Them

When you use ChatGPT Voice Mode, OpenAI transcribes your speech to text in real time and discards the audio. The transcript appears in conversations.json as a regular text message — indistinguishable from a message you typed. There is no audio file in the export, no voice flag on the message, and no separate voice export path. This page covers exactly what is and is not stored, how to identify voice turns in the export, how Whisper transcription quality affects the text, and what changes (if anything) when you are extracting decisions from a voice-heavy session.

TL;DR

Voice conversations are in conversations.json as plain text. Audio is never stored. The Whisper transcript is all you have. Voice turns have author.role: "user" and content_type: "text", identical to typed turns. No voice flag, no audio node, no special content type. Standard Voice Mode and Advanced Voice Mode export in the same format; only the model_slug in the message metadata differs. Decision extraction tools work without modification on voice conversations.

What is in the export

A voice mode conversation that you export via Settings → Data Controls → Export data appears in conversations.json in exactly the same mapping DAG structure as any other conversation. Each turn in the conversation is a node in the mapping, each node has a message object, and the message object has the same fields regardless of whether the turn was spoken or typed:

{
  "id": "msg_abc123",
  "author": {
    "role": "user",
    "metadata": {}
  },
  "content": {
    "content_type": "text",
    "parts": [
      "so basically the question is whether we should use Redis here or just lean on Postgres for the job queue, because I know we already have a Postgres instance and adding Redis is another thing to operate"
    ]
  },
  "create_time": 1748745600.0,
  "metadata": {
    "request_id": "req_xyz789",
    "model_slug": "gpt-4o"
  }
}

This is a voice turn. The parts[0] string is the Whisper transcription of what you said. There is no audio_url, no voice_mode flag in metadata, and no distinct content_type value that identifies this as a voice-mode message. From the export's perspective, you typed it.

The assistant's response to a voice turn is equally standard: author.role: "assistant", content_type: "text", same structure as any GPT-4o text response. In Standard Voice Mode, the assistant's text is converted to speech for playback, but it is the same text response that would appear if you had typed the prompt. In Advanced Voice Mode, the end-to-end audio model generates the response differently, but only the text transcript is stored.
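Because spoken and typed turns share this structure, one loop can read both. The following is a minimal Python sketch, assuming the export has already been loaded with json.load and that the field names match the example message above. The function name iter_user_text_turns is hypothetical, not part of any OpenAI tooling.

```python
from typing import Dict, Iterator, List, Tuple


def iter_user_text_turns(conversations: List[Dict]) -> Iterator[Tuple[str, float, str]]:
    """Yield (conversation title, create_time, text) for every user text turn."""
    for conv in conversations:
        for node in (conv.get("mapping") or {}).values():
            msg = node.get("message")
            if not msg:
                continue  # some mapping nodes carry no message
            if (msg.get("author") or {}).get("role") != "user":
                continue
            content = msg.get("content") or {}
            if content.get("content_type") != "text":
                continue
            # Keep only string parts; anything else is skipped in this sketch.
            parts = [p for p in (content.get("parts") or []) if isinstance(p, str)]
            text = " ".join(parts).strip()
            if text:
                yield (conv.get("title") or "(untitled)",
                       msg.get("create_time") or 0.0,
                       text)
```

Turns come back in mapping order, which is not guaranteed to be chronological, so sort on create_time when order matters.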

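What IS in the export from a voice conversation: the transcribed text of each user turn, the text of each assistant response, a create_time timestamp per message, and per-message metadata such as request_id and model_slug.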

What is not in the export

A voice session leaves none of the following in the export: the audio itself, which is not retained once transcription completes; a voice-mode flag on the message; an audio node; or a voice-specific content type.

Whisper transcription quality and what it means for your export

The text in your voice conversation export is only as accurate as the Whisper transcription. For most general-purpose conversation this is high quality, but three categories regularly produce transcription errors:

Technical jargon and product names

Whisper is trained on general audio data and has strong coverage of common technical vocabulary, but it makes errors on proprietary product names, internal codenames, and domain-specific terms. "PostgreSQL" typically transcribes correctly; "ClickHouse" sometimes becomes "click house" (two words) or "Clickhouse" (one word, wrong casing); a specific library like "npryce/adr-tools" may become "in price adr tools" or similar. The error pattern is consistent: Whisper falls back to a phonetic spelling when it doesn't recognise a term as a word in its vocabulary.

This affects decision extraction: if the exported text says "so we went with in price adr tools" instead of "npryce/adr-tools", an extractor matching on tool names will miss it. The mitigation is to review candidate decision records from voice conversations and check any unusual spellings against the original conversational context.
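One way to reduce missed matches is to normalise known mis-transcriptions before running an extractor. The sketch below is an illustration, not a feature of any existing tool: the GLOSSARY entries are hypothetical examples taken from the error patterns described above, and normalize_terms is an assumed helper name. It does not replace manual review, since it only fixes spellings you have already seen.

```python
import re
from typing import Dict

# Hypothetical glossary: observed mis-transcription -> intended term.
# Build your own from the product and library names your team uses.
GLOSSARY: Dict[str, str] = {
    "click house": "ClickHouse",
    "clickhouse": "ClickHouse",
    "in price adr tools": "npryce/adr-tools",
}


def normalize_terms(text: str, glossary: Dict[str, str] = GLOSSARY) -> str:
    """Replace known mis-transcriptions with the intended spelling."""
    # Longest keys first so multi-word phrases are handled before shorter ones.
    for wrong in sorted(glossary, key=len, reverse=True):
        right = glossary[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash handling in the target string.
        text = pattern.sub(lambda _m: right, text)
    return text
```

Running the normaliser on a copy of the text and keeping the original transcript alongside makes it easier to audit any replacement that turns out to be wrong.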

Accented speech

Whisper handles a wide range of accents but has lower accuracy on accents that are underrepresented in its training data. The error distribution is non-uniform: some accented speakers get near-perfect transcription; others see systematic substitutions on specific phonemes that propagate across the entire conversation. If your voice conversation exports consistently show unexpected word substitutions, the most likely cause is accent-driven transcription errors rather than a storage or formatting problem.

Low-audio-quality conditions

Background noise, wind, phone-at-distance recordings, and Bluetooth headsets with compressed audio all reduce transcription accuracy. The errors from audio quality problems are typically more random and harder to recover from than accent-driven errors — they appear as incorrect common words rather than phonetic approximations of the correct term. Decision records derived from low-quality audio sessions should be treated as drafts requiring manual review.

Standard Voice Mode vs Advanced Voice Mode in the export

ChatGPT has two voice modes with different underlying architectures, but the same export behaviour:

| Dimension | Standard Voice Mode | Advanced Voice Mode (AVM) |
| --- | --- | --- |
| Available on | All plans | Plus, Team, Enterprise |
| Architecture | Whisper STT → GPT-4o → TTS | End-to-end GPT-4o audio model (native audio in and out) |
| Export format | Text in conversations.json | Text in conversations.json (identical) |
| Audio stored? | No | No |
| model_slug in export | gpt-4o | gpt-4o-realtime-preview or similar |
| Transcription fidelity | Whisper accuracy (generally high) | Native audio model (generally equal or higher for mid-sentence corrections) |
| Interrupted-turn handling | May truncate interrupted turns | Better handling of barge-in behaviour, still text-only export |

The model_slug field in the per-message metadata is the only way to distinguish an AVM session from a Standard Voice session in the export. A jq one-liner that groups conversations by whether any turn has a gpt-4o-realtime model slug will partition your export into AVM and non-AVM sessions.
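For readers who would rather script the partition than write it in jq, here is a minimal Python sketch. It assumes the AVM slug begins with gpt-4o-realtime, as the table above suggests; the exact slug in your export may differ, so check a known AVM conversation first. The names is_avm_conversation and partition_by_avm are hypothetical.

```python
from typing import Dict, List

# Assumption: AVM turns carry a model_slug starting with this prefix.
AVM_SLUG_PREFIX = "gpt-4o-realtime"


def is_avm_conversation(conv: Dict) -> bool:
    """True if any message in the conversation has an AVM-style model_slug."""
    for node in (conv.get("mapping") or {}).values():
        msg = node.get("message") or {}
        slug = (msg.get("metadata") or {}).get("model_slug") or ""
        if slug.startswith(AVM_SLUG_PREFIX):
            return True
    return False


def partition_by_avm(conversations: List[Dict]) -> Dict[str, List[str]]:
    """Group conversation titles into 'avm' and 'other'."""
    groups: Dict[str, List[str]] = {"avm": [], "other": []}
    for conv in conversations:
        key = "avm" if is_avm_conversation(conv) else "other"
        groups[key].append(conv.get("title") or "(untitled)")
    return groups
```

The "other" group still mixes Standard Voice sessions with typed conversations, because both use the same slug.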

How to identify voice conversations in the export

There is no reliable programmatic flag for voice conversations in conversations.json. The model_slug identifies AVM sessions but not Standard Voice sessions (which use the same gpt-4o slug as typed conversations). The following heuristics are the best available signals:

  1. model_slug gpt-4o-realtime-preview: confirms Advanced Voice Mode. To count the conversations that contain at least one such turn, use: jq '[.[] | select(any(.mapping[]; .message.metadata.model_slug? == "gpt-4o-realtime-preview"))] | length' conversations.json
  2. Short median turn length in user messages — voice utterances are typically 5–30 words; typed messages are typically 20–150 words. A conversation where the median user turn is under 20 words is likely voice-heavy.
  3. Conversational filler vocabulary — grep the user-turn text for "so basically", "you know", "I mean", "right so", "and then". These appear in speech transcription far more often than in typed messages.
  4. Sentence-restart patterns — mid-sentence restarts ("I think we should — actually, let me rephrase that") are common in transcribed speech and uncommon in typed text.
  5. Conversation topic memory — if you know you discussed a specific decision aloud, the conversation title (if auto-generated) often reflects the topic accurately enough to match against topic recall.

A jq recipe that lists conversations whose user turns have a median word count below 15 (a rough voice heuristic; this is tighter than the 20-word guideline above, so adjust the threshold to suit your data):

jq '
  map(
    # Word counts for every text turn the user sent in this conversation.
    [ .mapping[]
      | .message
      | select(. != null)
      | select(.author.role == "user")
      | select(.content.content_type == "text")
      | .content.parts[0]
      # Skip parts that are not plain strings instead of failing on them.
      | select(type == "string")
      | split(" ") | map(select(length > 0)) | length
    ] as $counts
    # Skip conversations with no user text turns.
    | select(($counts | length) > 0)
    | {
        title,
        median_user_words: ($counts | sort | .[(length / 2 | floor)])
      }
  )
  | map(select(.median_user_words < 15))
' conversations.json

This produces a list of conversation titles with short median user turns — the candidate set for voice sessions. It will include some very short typed conversations (quick clarification exchanges), so treat it as a filter, not a definitive label.
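The filler-vocabulary heuristic can be scored in the same spirit and used alongside turn length. The sketch below, in Python, counts the filler phrases listed above per 100 words of user text. The function name filler_rate is hypothetical, and no cut-off is suggested here because a sensible threshold depends on how you speak and type; compare scores across conversations you already know to be voice or typed.

```python
import re
from typing import List

# Filler phrases taken from the heuristics listed above.
FILLERS = ["so basically", "you know", "i mean", "right so", "and then"]
FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b",
    re.IGNORECASE,
)


def filler_rate(user_turns: List[str]) -> float:
    """Return filler-phrase occurrences per 100 words across the user turns."""
    words = sum(len(turn.split()) for turn in user_turns)
    if words == 0:
        return 0.0
    hits = sum(len(FILLER_RE.findall(turn)) for turn in user_turns)
    return 100.0 * hits / words
```

A conversation that scores high on filler rate and low on median turn length is a stronger voice candidate than one flagged by either signal alone.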

Extracting decisions from voice conversations

Voice conversations contain engineering decisions as often as — and sometimes more honestly than — typed conversations. When engineers think aloud about architecture choices ("so the question is Redis or Postgres for this queue, and the reason I'm leaning Redis is because…"), the spoken reasoning is often less polished but more direct than what they would type. The trade-off consideration and the rejection rationale appear in natural language, unfiltered by the tendency to present typed messages as more considered than they are.

Decision extraction from voice conversations works identically to extraction from typed conversations: the same jq recipes described on the extract decisions from ChatGPT page work without modification, because voice conversations are plain text in the same structure. The WhyChose extractor processes them with the same pipeline.

Three practical differences:

1. Reasoning is more fragmented across turns

In a typed conversation, an engineer might write a single 150-word message explaining a trade-off. In a voice conversation, the same reasoning appears across 8–12 short turns — question, clarification, partial answer, reassertion, conclusion. The decision record's Context field, if populated from a single turn, captures only a fragment. Effective extraction from voice requires a wider context window (the several turns before and after the decision statement) rather than the single-message extraction that works for typed conversations.
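A wider context window can be as simple as slicing the ordered list of turns around the candidate. This is a minimal Python sketch, assuming the turns are already in chronological order; the defaults of four turns before and two after are illustrative assumptions, not values taken from any extractor, and context_window is a hypothetical helper name.

```python
from typing import List


def context_window(turns: List[str], index: int,
                   before: int = 4, after: int = 2) -> List[str]:
    """Return the turns surrounding turns[index], clamped to the conversation."""
    if not 0 <= index < len(turns):
        raise IndexError("index is outside the conversation")
    start = max(0, index - before)
    end = min(len(turns), index + after + 1)
    return turns[start:end]
```

Joining the returned turns into one string gives a Context field that covers the question, the clarification, and the conclusion rather than a single fragment.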

2. Transcription errors may corrupt named entities

As noted above, library names and product names may be mis-transcribed. When reviewing extracted decision records from voice sessions, verify that all technical terms in the Decision and Context fields match your actual intent. The pattern "chose X over Y because Z" may have "X" mis-transcribed even if the reasoning around it is accurate.

3. Voice conversations tend toward exploration, not conclusion

A typed conversation often ends with an explicit decision: "OK, let's go with Postgres then." A voice conversation more often ends when the speaker runs out of things to say — the conclusion may be implicit in the last few exchanges rather than stated explicitly. Decision extraction heuristics that look for conclusion markers ("we'll go with", "the decision is", "we've decided") work well for typed conversations; voice conversations often require the "trade-off evaluation" pattern that extracts the decision from the comparison structure ("option A has X advantage but Y downside; option B has…") rather than from an explicit conclusion statement.
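The two patterns can be told apart with a pair of simple checks. The Python sketch below is a rough illustration and not how the WhyChose extractor works: the conclusion markers are the phrases quoted above, while the comparison cues and the requirement of at least two of them are assumptions chosen for the example. The name classify_candidate is hypothetical.

```python
import re
from typing import Optional

# Conclusion markers quoted in the text above.
CONCLUSION_RE = re.compile(
    r"\b(?:we'll go with|the decision is|we've decided|let's go with)\b",
    re.IGNORECASE,
)
# Hypothetical comparison cues for the trade-off pattern; tune for your data.
TRADEOFF_RE = re.compile(
    r"\b(?:but|whereas|on the other hand|downside|advantage|versus|vs)\b",
    re.IGNORECASE,
)


def classify_candidate(text: str) -> Optional[str]:
    """Label text as an 'explicit' decision, a 'trade-off' comparison, or None."""
    if CONCLUSION_RE.search(text):
        return "explicit"
    # Two or more comparison cues suggest an option-versus-option structure.
    if len(TRADEOFF_RE.findall(text)) >= 2:
        return "trade-off"
    return None
```

For voice sessions it usually makes sense to classify a joined window of several turns rather than a single turn, because the comparison is often spread across them. Transcripts may also contain typographic apostrophes, which these patterns do not match as written.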

The WhyChose extractor handles both patterns and is not voice-specific in its extraction logic — it processes the text regardless of input modality.

Voice mode and ChatGPT memory

If you have ChatGPT Memory enabled, voice conversations can trigger memory saves just as typed conversations do — OpenAI may save a memory entry based on something you said in a voice session. In the memory.json file in the data export, memory entries from voice sessions are indistinguishable from entries from typed sessions: the same per-entry schema (id, content, enabled, created_at, updated_at) with no source-session pointer. As covered in the ChatGPT memory export reference, there is no link from a memory entry back to the conversation that created it — this is true for both typed and voice-triggered memories.

Related questions

Are ChatGPT voice conversations included in the data export?

Yes. Voice mode conversations are included in the conversations.json data export as plain text. The audio is transcribed in real time by Whisper and stored as text — the audio itself is never retained by OpenAI after transcription. In the export, a voice conversation looks identical to a text conversation: the same mapping DAG structure, the same author.role: 'user' and 'assistant' turns, the same content_type: 'text' in the message content. There is no voice flag, no audio file, and no special voice content type. The only record of what you said is the Whisper transcription.

How do I identify which conversations were voice mode in the export?

There is no explicit voice flag in conversations.json. Voice conversations are not tagged differently from typed conversations in the export schema. Heuristics that suggest a conversation was a voice session: unusually short user turns (voice utterances are typically shorter and more fragmented than typed messages), conversational filler phrases ('um', 'uh', 'you know', 'so basically') that appear in the transcription, sentences that trail off or restart mid-thought, and — for Standard Voice Mode — assistant responses that begin with phrasing suited for spoken delivery rather than read delivery. The most reliable signal is context from the conversation topic: if a conversation covers a topic you remember discussing aloud rather than typing, the content itself confirms it.

What is the difference between Standard Voice Mode and Advanced Voice Mode in the export?

Both Standard Voice Mode and Advanced Voice Mode (AVM, the GPT-4o real-time audio model available on Plus, Team, and Enterprise plans) produce text-only exports. In both modes, the audio is processed in real time and only the text transcript is stored. The export schema is identical. The practical difference is transcription fidelity. Standard Voice Mode uses Whisper as a separate transcription step before the language model processes the text. AVM uses an end-to-end audio model that processes audio natively, which generally handles accented speech and mid-sentence corrections better. In both cases the export is the text-only result. Neither mode stores audio in OpenAI's systems after the response is generated.

Can I extract decisions from voice mode conversations with the same tools?

Yes. Because voice mode conversations export as standard text in conversations.json, the same jq extraction recipes, conversion scripts, and decision-extraction tools that work on typed conversations work identically on voice conversations. The WhyChose extractor processes them without modification. The practical difference is that voice conversations often have shorter individual turns and more fragmented reasoning — a decision rationale that a typed conversation captures in one long paragraph may be spread across five to ten short voice turns. Decision extraction from voice conversations therefore requires more context window around each candidate sentence to reconstruct the full reasoning chain.
