Listen to the article
Half of AI Chatbot Responses to Health Questions Found Problematic in New Study
A substantial portion of medical information provided by popular AI chatbots is inaccurate and incomplete, according to a new study published in the open access journal BMJ Open. Researchers found that half of the answers to evidence-based health questions were either “somewhat” or “highly” problematic, raising concerns about the reliability of these increasingly popular tools.
The study, conducted in February 2025, evaluated five widely used generative AI chatbots: Google’s Gemini, DeepSeek’s High-Flyer, Meta AI, OpenAI’s ChatGPT, and xAI’s Grok. Researchers tested these platforms with 50 questions across five health categories known to attract misinformation: cancer, vaccines, stem cells, nutrition, and athletic performance.
The research team designed prompts that mimicked common health queries, intentionally creating questions that might “strain” the AI models toward potentially providing misinformation or inappropriate medical advice. This approach, increasingly used for stress-testing AI systems, helps identify behavioral vulnerabilities in the technology.
Results showed significant gaps in accuracy and reliability. Of all responses analyzed, 30% were categorized as “somewhat problematic” and 20% as “highly problematic,” meaning they could potentially direct users toward ineffective treatments or harmful actions if followed without professional medical guidance.
The type of question significantly influenced response quality. Open-ended prompts, which typically required the chatbots to generate multiple responses in list form, produced significantly more highly problematic answers than closed questions requiring specific responses aligned with scientific consensus.
While overall quality didn’t differ dramatically between platforms, xAI’s Grok performed notably worse, generating highly problematic responses 58% of the time. Google’s Gemini showed the strongest performance, with the fewest highly problematic answers and the most non-problematic ones.
Performance varied substantially across medical topics. The chatbots handled questions about vaccines and cancer most accurately, while information on stem cells, athletic performance, and nutrition showed the highest rates of inaccuracy.
Researchers noted several concerning patterns in how information was presented. Responses were consistently delivered with confidence and certainty, rarely including caveats or disclaimers. Only two refusals to answer potentially harmful questions were recorded, both from Meta AI when asked about anabolic steroids and alternative cancer treatments.
Reference quality was particularly poor, scoring an average completeness of just 40%. All chatbots exhibited instances of “hallucinations” – fabricating citations and references that don’t exist – meaning none provided fully accurate reference lists.
Readability posed another barrier to effective public health communication. All responses were graded as “difficult,” requiring college-graduate-level reading comprehension, limiting accessibility for many users.
The researchers acknowledge some limitations in their methodology. The study assessed only five chatbots in a rapidly evolving field, and the deliberately challenging nature of the questions may have overstated the prevalence of problematic content compared to typical user queries.
“Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioral limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication,” the research team stated.
The fundamental architecture of these systems contributes to their limitations. Chatbots generate outputs by inferring statistical patterns from training data rather than by reasoning or weighing evidence. They lack the ability to make ethical judgments or value-based assessments, which can lead to authoritative-sounding but potentially flawed responses.
Another limitation stems from the training data itself. While chatbots draw on vast amounts of internet content, including social media and Q&A forums, their access to scientific literature is typically limited to open-access publications, which represent only 30-50% of published studies.
As these technologies continue to gain popularity as information sources, the researchers emphasize the urgent need for public education, professional training, and regulatory oversight to ensure that generative AI supports rather than undermines public health.
“Continued deployment of these chatbots without public education and oversight risks amplifying misinformation,” the researchers concluded, highlighting the growing importance of digital literacy in navigating AI-generated health information.
Fact Checker
Verify the accuracy of this article using The Disinformation Commission analysis and real-time sources.


10 Comments
The study’s stress-testing approach seems like a smart way to uncover potential vulnerabilities in these AI systems. It’s important to continually evaluate their performance, especially for sensitive domains like healthcare.
Yes, proactive testing is key to improving the accuracy and reliability of chatbots over time. Ongoing monitoring and refinement will be critical.
Half of the chatbot responses being problematic is quite high. This highlights the need for more robust training and validation processes to ensure these tools can provide high-quality, trustworthy medical information.
Agreed. The high rate of inaccurate or incomplete responses is concerning and suggests more work is needed to make chatbots truly reliable for sensitive health queries.
This is a good reminder that AI chatbots, while convenient, should not be blindly trusted for critical health information. Consulting qualified medical professionals is still the safest approach.
Absolutely. These chatbots can be useful as a starting point, but any serious medical concerns should be followed up with an actual doctor.
Interesting study highlighting the challenges with relying on chatbots for sensitive medical information. More rigorous testing and validation of these AI systems is clearly needed to ensure they provide reliable and safe guidance to users.
I agree, the findings are concerning. Inaccurate or incomplete health advice from chatbots could potentially lead to harmful consequences for users.
It’s good to see research being done to evaluate the performance of these popular AI chatbots, especially in areas like healthcare where accuracy is so crucial. This study provides valuable insights to improve the technology going forward.
Yes, this type of rigorous testing is important to keep pace with the rapid advancement of conversational AI. Ongoing monitoring and iteration will be key to enhancing their capabilities over time.