ChatGPT Voice Mode in the Data Export — Transcripts, What's Missing, and How to Process Them
When you use ChatGPT Voice Mode, OpenAI transcribes your speech to text in real time and discards the audio. The transcript appears in conversations.json as a regular text message — indistinguishable from a message you typed. There is no audio file in the export, no voice flag on the message, and no separate voice export path. This page covers exactly what is and is not stored, how to identify voice turns in the export, how Whisper transcription quality affects the text, and what changes (if anything) when you are extracting decisions from a voice-heavy session.
TL;DR
Voice conversations are in conversations.json as plain text. Audio is never stored. The Whisper transcript is all you have. Voice turns have author.role: "user" and content_type: "text" — identical to typed turns. No voice flag, no audio node, no special content type. Standard Voice Mode and Advanced Voice Mode export identically. Decision extraction tools work without modification on voice conversations.
What is in the export
A voice mode conversation that you export via Settings → Data Controls → Export data appears in conversations.json in exactly the same mapping DAG structure as any other conversation. Each turn in the conversation is a node in the mapping, each node has a message object, and the message object has the same fields regardless of whether the turn was spoken or typed:
{
  "id": "msg_abc123",
  "author": {
    "role": "user",
    "metadata": {}
  },
  "content": {
    "content_type": "text",
    "parts": [
      "so basically the question is whether we should use Redis here or just lean on Postgres for the job queue, because I know we already have a Postgres instance and adding Redis is another thing to operate"
    ]
  },
  "create_time": 1748745600.0,
  "metadata": {
    "request_id": "req_xyz789",
    "model_slug": "gpt-4o"
  }
}
This is a voice turn. The parts[0] string is the Whisper transcription of what you said. There is no audio_url, no voice_mode flag in metadata, and no distinct content_type value that identifies this as a voice-mode message. From the export's perspective, you typed it.
The assistant's response to a voice turn is equally standard: author.role: "assistant", content_type: "text", and the same structure as any GPT-4o text response. In Standard Voice Mode, the assistant's text is converted to speech for playback, but it is the same text response that would appear if you had typed the prompt. In Advanced Voice Mode, the end-to-end audio model generates the spoken response directly, but the export still contains only the text transcript of that response.
What IS in the export from a voice conversation:
- Whisper transcription of every user utterance (as a text message)
- Full assistant text response for every turn
- Message timestamps (create_time)
- Model slug (e.g., gpt-4o, or gpt-4o-realtime-preview for Advanced Voice Mode)
- gizmo_id if the voice session used a Custom GPT
- Conversation title (auto-generated or user-named)
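As a rough illustration of the fields listed above, the following Python sketch walks one conversation's mapping and collects what survives from a voice session. It assumes that conversations.json is a top-level JSON array of conversations, that model_slug sits in per-message metadata as in the example above, and that gizmo_id is a conversation-level field; summarize_conversation and load_export are hypothetical helper names, not part of any official tooling.

```python
import json
from typing import Any


def summarize_conversation(conv: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields that remain in the export for one conversation."""
    turns = []
    model_slugs = set()
    for node in conv.get("mapping", {}).values():
        msg = node.get("message")
        if not msg:
            continue  # some mapping nodes carry no message
        content = msg.get("content") or {}
        if content.get("content_type") != "text":
            continue
        parts = content.get("parts") or []
        text = parts[0] if parts and isinstance(parts[0], str) else ""
        slug = (msg.get("metadata") or {}).get("model_slug")
        if slug:
            model_slugs.add(slug)
        turns.append({
            "role": (msg.get("author") or {}).get("role"),
            "create_time": msg.get("create_time"),
            "text": text,
        })
    # Mapping order is not assumed to be chronological, so sort by timestamp.
    turns.sort(key=lambda t: t["create_time"] or 0.0)
    return {
        "title": conv.get("title"),
        "gizmo_id": conv.get("gizmo_id"),  # assumption: conversation-level field
        "model_slugs": sorted(model_slugs),
        "turns": turns,
    }


def load_export(path: str) -> list[dict[str, Any]]:
    """Load conversations.json and summarize every conversation in it."""
    with open(path, encoding="utf-8") as f:
        return [summarize_conversation(conv) for conv in json.load(f)]
```

Calling load_export("conversations.json") returns one summary per conversation. Nothing in a summary says whether a turn was spoken or typed, which is the point of this section.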
What is not in the export
The following information from a voice session is absent from the export because none of it is retained once the audio has been processed:
- Audio files. OpenAI does not retain voice audio after real-time processing. The audio is transcribed by Whisper (Standard Voice) or processed directly by the audio model (Advanced Voice), and the audio data is discarded. No audio file appears in the export ZIP.
- Tone, emphasis, and prosody. The emotional register of what you said — whether you were uncertain, confident, frustrated, or emphatic on a specific word — is not captured in the text transcription. A sentence spoken haltingly and a sentence spoken fluently produce the same exported string.
- Disfluencies (mostly). Whisper strips most disfluencies — filler words like "uh", "um", "like" and mid-sentence repairs where you restart a sentence — from the transcription. Some pass through, particularly in longer utterances or when followed by a meaningful word that looks like a noun phrase to the transcription model.
- Turn boundaries in overlapping speech. If you spoke over the assistant's audio playback or interrupted it, the transcription captures what it received but the turn boundary may not match the conversational intent. The export will show a user turn that starts mid-concept.
- Visual context from the camera. If you used the Live Camera feature (pointing the phone camera at something while voice chatting), the visual frames are processed by GPT-4o's vision capability in real time but are not stored. The assistant's text response references what it saw, but the images themselves are absent from the export. You will see the assistant describing something ("I can see a whiteboard with three columns…") with no corresponding image node in the turn.
- Speaker-turn attribution for multi-speaker sessions. If two people took turns speaking in the same voice session (common in in-person discussions), the export shows all user-side speech under a single author.role: "user". There is no per-speaker attribution.
Whisper transcription quality and what it means for your export
The text in your voice conversation export is only as accurate as the Whisper transcription. For most general-purpose conversation this is high quality, but three categories regularly produce transcription errors:
Technical jargon and product names
Whisper is trained on general audio data and has strong coverage of common technical vocabulary, but it makes errors on proprietary product names, internal codenames, and domain-specific terms. "PostgreSQL" typically transcribes correctly; "ClickHouse" sometimes becomes "click house" (two words) or "Clickhouse" (one word, wrong casing); a specific library like "npryce/adr-tools" may become "in price adr tools" or similar. The error pattern is consistent: Whisper falls back to a phonetic spelling when it doesn't recognise a term as a word in its vocabulary.
This affects decision extraction: if the exported text says "so we went with in price adr tools" instead of "npryce/adr-tools", an extractor matching on tool names will miss it. The mitigation is to review candidate decision records from voice conversations and check any unusual spellings against the original conversational context.
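One way to support that review step is to scan transcribed text for near-miss spellings of tool names you already know. The sketch below is a minimal illustration using Python's standard difflib, not a description of how any particular extractor works. The glossary, the 0.8 cutoff, and the helper name find_glossary_variants are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def _norm(s: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def find_glossary_variants(text: str, glossary: list[str],
                           max_words: int = 4, cutoff: float = 0.8) -> dict[str, str]:
    """Map each glossary term to the best-matching word span in the text.

    Spans of 1..max_words consecutive words are compared to each term after
    normalization; a term is reported only if its best span reaches cutoff.
    """
    words = text.split()
    hits: dict[str, str] = {}
    for term in glossary:
        target = _norm(term)
        best_ratio, best_span = 0.0, None
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                span = " ".join(words[i:i + n])
                ratio = SequenceMatcher(None, _norm(span), target).ratio()
                if ratio > best_ratio:
                    best_ratio, best_span = ratio, span
        if best_span is not None and best_ratio >= cutoff:
            hits[term] = best_span
    return hits


# Hypothetical glossary of names that matter in your own decisions.
glossary = ["ClickHouse", "npryce/adr-tools"]
print(find_glossary_variants("we moved analytics to click house last week", glossary))
```

The output is a list of candidates for manual review. The cutoff trades missed variants against false matches and would need tuning on real transcripts.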
Accented speech
Whisper handles a wide range of accents but has lower accuracy on accents that are underrepresented in its training data. The error distribution is non-uniform: some accented speakers get near-perfect transcription; others see systematic substitutions on specific phonemes that propagate across the entire conversation. If your voice conversation exports consistently show unexpected word substitutions, the most likely cause is accent-driven transcription errors rather than a storage or formatting problem.
Low-audio-quality conditions
Background noise, wind, phone-at-distance recordings, and Bluetooth headsets with compressed audio all reduce transcription accuracy. The errors from audio quality problems are typically more random and harder to recover from than accent-driven errors — they appear as incorrect common words rather than phonetic approximations of the correct term. Decision records derived from low-quality audio sessions should be treated as drafts requiring manual review.
Standard Voice Mode vs Advanced Voice Mode in the export
ChatGPT has two voice modes with different underlying architectures, but the same export behaviour:
| Dimension | Standard Voice Mode | Advanced Voice Mode (AVM) |
|---|---|---|
| Available on | All plans | Plus, Team, Enterprise |
| Architecture | Whisper STT → GPT-4o → TTS | End-to-end GPT-4o audio model (native audio in and out) |
| Export format | Text in conversations.json | Text in conversations.json (identical) |
| Audio stored? | No | No |
| model_slug in export | gpt-4o | gpt-4o-realtime-preview or similar |
| Transcription fidelity | Whisper accuracy (generally high) | Native audio model (generally equal or higher for mid-sentence corrections) |
| Interrupted-turn handling | May truncate interrupted turns | Better handling of barge-in behaviour, still text-only export |
The model_slug field in the per-message metadata is the only way to distinguish an AVM session from a Standard Voice session in the export. A jq one-liner that groups conversations by whether any turn has a gpt-4o-realtime model slug will partition your export into AVM and non-AVM sessions.
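The same partition can be sketched in Python. This is an illustrative version under the assumptions stated in the table: AVM turns carry a model_slug beginning with "gpt-4o-realtime", and the slug lives in per-message metadata. The function names are hypothetical.

```python
from typing import Any

AVM_SLUG_PREFIX = "gpt-4o-realtime"  # assumed prefix; adjust to what your export contains


def is_avm_conversation(conv: dict[str, Any], prefix: str = AVM_SLUG_PREFIX) -> bool:
    """True if any message in the conversation has a model_slug with the AVM prefix."""
    for node in conv.get("mapping", {}).values():
        msg = node.get("message") or {}
        slug = (msg.get("metadata") or {}).get("model_slug") or ""
        if slug.startswith(prefix):
            return True
    return False


def partition_by_avm(conversations: list[dict[str, Any]]):
    """Split conversations into (avm, other) lists, preserving order."""
    avm, other = [], []
    for conv in conversations:
        (avm if is_avm_conversation(conv) else other).append(conv)
    return avm, other
```

The "other" group still mixes Standard Voice sessions with typed conversations, because both carry the same gpt-4o slug.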
How to identify voice conversations in the export
There is no reliable programmatic flag for voice conversations in conversations.json. The model_slug identifies AVM sessions but not Standard Voice sessions (which use the same gpt-4o slug as typed conversations). The following heuristics are the best available signals:
- model_slug gpt-4o-realtime-preview: confirms Advanced Voice Mode. To count the conversations that contain at least one such turn, use: jq '[.[] | select(any(.mapping[]; .message.metadata.model_slug? == "gpt-4o-realtime-preview"))] | length' conversations.json
- Short median turn length in user messages: voice utterances are typically 5–30 words; typed messages are typically 20–150 words. A conversation where the median user turn is under 20 words is likely voice-heavy.
- Conversational filler vocabulary: grep the user-turn text for "so basically", "you know", "I mean", "right so", "and then". These appear in speech transcription far more often than in typed messages.
- Sentence-restart patterns: mid-sentence restarts ("I think we should... actually, let me rephrase that") are common in transcribed speech and uncommon in typed text.
- Conversation topic memory: if you know you discussed a specific decision aloud, the conversation title (if auto-generated) often reflects the topic accurately enough to match against topic recall.
A jq recipe that extracts conversations where user turns have a median word count below 15 (a rough voice heuristic):
jq '
  [
    .[]
    | {
        title: .title,
        # Word counts for each user text turn in this conversation
        words: [
          .mapping[]
          | select(.message != null)
          | select(.message.author.role == "user")
          | select(.message.content.content_type == "text")
          | .message.content.parts[0]
          | select(type == "string")
          | split(" ") | map(select(length > 0)) | length
        ]
      }
    # Skip conversations with no user text turns
    | select(.words | length > 0)
    | {
        title,
        # Upper median of the sorted word counts
        median_user_words: (.words | sort | .[(length / 2) | floor])
      }
    | select(.median_user_words < 15)
  ]
' conversations.json
This produces a list of conversation titles with short median user turns — the candidate set for voice sessions. It will include some very short typed conversations (quick clarification exchanges), so treat it as a filter, not a definitive label.
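The filler-vocabulary heuristic from the list above can be scored in a similar way. The Python sketch below counts the listed phrases per 100 user words. The phrase list comes from this page, while the function names and the per-100-words normalization are assumptions made for illustration.

```python
import re
from typing import Any, Iterator

# Phrases listed in the heuristics above; extend with your own speech habits.
FILLER_PHRASES = ("so basically", "you know", "i mean", "right so", "and then")


def user_texts(conv: dict[str, Any]) -> Iterator[str]:
    """Yield the text of every user turn with content_type "text"."""
    for node in conv.get("mapping", {}).values():
        msg = node.get("message") or {}
        if (msg.get("author") or {}).get("role") != "user":
            continue
        content = msg.get("content") or {}
        if content.get("content_type") != "text":
            continue
        parts = content.get("parts") or []
        if parts and isinstance(parts[0], str):
            yield parts[0]


def filler_rate(conv: dict[str, Any]) -> float:
    """Filler-phrase occurrences per 100 user words (0.0 if there is no user text)."""
    total_words = 0
    total_fillers = 0
    for text in user_texts(conv):
        lowered = text.lower()
        total_words += len(lowered.split())
        for phrase in FILLER_PHRASES:
            # Word boundaries avoid matching inside longer words.
            total_fillers += len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
    return 100.0 * total_fillers / total_words if total_words else 0.0
```

A higher rate suggests transcribed speech, but like the median-length filter it is a ranking signal rather than a label. Some people type the way they talk.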
Extracting decisions from voice conversations
Voice conversations contain engineering decisions as often as — and sometimes more honestly than — typed conversations. When engineers think aloud about architecture choices ("so the question is Redis or Postgres for this queue, and the reason I'm leaning Redis is because…"), the spoken reasoning is often less polished but more direct than what they would type. The trade-off consideration and the rejection rationale appear in natural language, unfiltered by the tendency to present typed messages as more considered than they are.
Decision extraction from voice conversations works identically to extraction from typed conversations: the same jq recipes described on the extract decisions from ChatGPT page work without modification, because voice conversations are plain text in the same structure. The WhyChose extractor processes them with the same pipeline.
Three differences matter in practice:
1. Reasoning is more fragmented across turns
In a typed conversation, an engineer might write a single 150-word message explaining a trade-off. In a voice conversation, the same reasoning appears across 8–12 short turns — question, clarification, partial answer, reassertion, conclusion. The decision record's Context field, if populated from a single turn, captures only a fragment. Effective extraction from voice requires a wider context window (the several turns before and after the decision statement) rather than the single-message extraction that works for typed conversations.
2. Transcription errors may corrupt named entities
As noted above, library names and product names may be mis-transcribed. When reviewing extracted decision records from voice sessions, verify that all technical terms in the Decision and Context fields match your actual intent. The pattern "chose X over Y because Z" may have "X" mis-transcribed even if the reasoning around it is accurate.
3. Voice conversations tend toward exploration, not conclusion
A typed conversation often ends with an explicit decision: "OK, let's go with Postgres then." A voice conversation more often ends when the speaker runs out of things to say — the conclusion may be implicit in the last few exchanges rather than stated explicitly. Decision extraction heuristics that look for conclusion markers ("we'll go with", "the decision is", "we've decided") work well for typed conversations; voice conversations often require the "trade-off evaluation" pattern that extracts the decision from the comparison structure ("option A has X advantage but Y downside; option B has…") rather than from an explicit conclusion statement.
The WhyChose extractor handles both patterns and is not voice-specific in its extraction logic — it processes the text regardless of input modality.
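To make the wider context window from point 1 concrete, here is a minimal Python sketch that finds turns containing one of the conclusion markers from point 3 and returns the surrounding turns with each hit. It is an illustration only and does not describe the WhyChose extractor's implementation. The radius of three turns and the function name decision_windows are assumptions.

```python
# Conclusion markers quoted in point 3 above.
CONCLUSION_MARKERS = ("we'll go with", "the decision is", "we've decided")


def decision_windows(turns: list[str], radius: int = 3) -> list[list[str]]:
    """Return, for each turn containing a conclusion marker, that turn
    together with up to `radius` turns before and after it."""
    windows = []
    for i, text in enumerate(turns):
        # Normalize curly apostrophes so "we’ll" matches "we'll".
        lowered = text.lower().replace("\u2019", "'")
        if any(marker in lowered for marker in CONCLUSION_MARKERS):
            start = max(0, i - radius)
            end = min(len(turns), i + radius + 1)
            windows.append(turns[start:end])
    return windows


turns = [
    "so the question is redis or postgres for the queue",
    "Postgres can handle it with SKIP LOCKED.",
    "and we already run postgres",
    "ok we'll go with postgres then",
    "Sounds good.",
]
for window in decision_windows(turns, radius=2):
    print(window)
```

Marker matching alone will miss voice sessions that end without an explicit conclusion. Those need the trade-off evaluation pattern described above, which this sketch does not attempt.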
Voice mode and ChatGPT memory
If you have ChatGPT Memory enabled, voice conversations can trigger memory saves just as typed conversations do — OpenAI may save a memory entry based on something you said in a voice session. In the memory.json file in the data export, memory entries from voice sessions are indistinguishable from entries from typed sessions: the same per-entry schema (id, content, enabled, created_at, updated_at) with no source-session pointer. As covered in the ChatGPT memory export reference, there is no link from a memory entry back to the conversation that created it — this is true for both typed and voice-triggered memories.
Related questions
Are ChatGPT voice conversations included in the data export?
Yes. Voice mode conversations are included in the conversations.json data export as plain text. The audio is transcribed in real time by Whisper and stored as text — the audio itself is never retained by OpenAI after transcription. In the export, a voice conversation looks identical to a text conversation: the same mapping DAG structure, the same author.role: 'user' and 'assistant' turns, the same content_type: 'text' in the message content. There is no voice flag, no audio file, and no special voice content type. The only record of what you said is the Whisper transcription.
How do I identify which conversations were voice mode in the export?
There is no explicit voice flag in conversations.json. Voice conversations are not tagged differently from typed conversations in the export schema. Heuristics that suggest a conversation was a voice session: unusually short user turns (voice utterances are typically shorter and more fragmented than typed messages), conversational filler phrases ('um', 'uh', 'you know', 'so basically') that appear in the transcription, sentences that trail off or restart mid-thought, and — for Standard Voice Mode — assistant responses that begin with phrasing suited for spoken delivery rather than read delivery. The most reliable signal is context from the conversation topic: if a conversation covers a topic you remember discussing aloud rather than typing, the content itself confirms it.
What is the difference between Standard Voice Mode and Advanced Voice Mode in the export?
Both Standard Voice Mode and Advanced Voice Mode (AVM, the GPT-4o real-time audio model available on Plus, Team, and Enterprise plans) produce text-only exports. In both modes, the audio is processed in real time and only the text transcript is stored. The export schema is identical; only the model_slug value differs. The practical difference is transcription fidelity: Standard Voice Mode uses Whisper as a separate transcription step before the language model processes the text, while AVM uses an end-to-end audio model that processes audio natively, which generally handles accented speech and mid-sentence corrections better. In both cases the export is the text-only result. Neither mode stores audio in OpenAI's systems after the response is generated.
Can I extract decisions from voice mode conversations with the same tools?
Yes. Because voice mode conversations export as standard text in conversations.json, the same jq extraction recipes, conversion scripts, and decision-extraction tools that work on typed conversations work identically on voice conversations. The WhyChose extractor processes them without modification. The practical difference is that voice conversations often have shorter individual turns and more fragmented reasoning — a decision rationale that a typed conversation captures in one long paragraph may be spread across five to ten short voice turns. Decision extraction from voice conversations therefore requires more context window around each candidate sentence to reconstruct the full reasoning chain.
Further reading
- ChatGPT conversations.json format reference — the mapping DAG and leaf-walk — the full structural reference for conversations.json; voice conversations use the same mapping DAG and content_type: "text" schema as typed conversations, with no voice-specific nodes.
- How to export ChatGPT history — the step-by-step export guide; voice conversations are included automatically in the standard Settings → Data Controls → Export path, no separate voice export step is needed.
- Extract decisions from ChatGPT — the full extraction guide; the jq recipes and extraction patterns on this page work identically on voice conversations, with the caveat that voice reasoning is more fragmented across turns.
- ChatGPT export not working — eight failure modes and fixes — if voice conversations are missing from your export entirely, Mode 1 (stuck export request) and Mode 3 (partial ZIP) are the most common causes; the conversations were stored, the export request is what failed.
- How to search ChatGPT history (jq, sidebar, extraction) — jq search recipes for voice conversations work identically to typed conversations; the median-turn-length heuristic described above can be added as a filter to identify voice-heavy conversations in search results.
- ChatGPT Memory export — where your memories live in the data download — memory entries triggered by voice conversations are indistinguishable from entries triggered by typed conversations in memory.json; both have the same schema with no source-session pointer.
- ChatGPT Custom GPTs export — conversations vs configurations — voice sessions with Custom GPTs (voice-enabled GPTs) appear in the export with the gizmo_id field populated, identical to typed Custom GPT conversations.
- Perplexity conversation export — platform comparison — Perplexity has a voice/audio search feature; like ChatGPT voice, it produces text output only and has the same no-native-export limitation that applies to all Perplexity conversation history.
- The open-source extractor — processes voice conversations from the ChatGPT export identically to typed conversations; the sliding context window handles the turn fragmentation that voice reasoning typically produces.