Listen to the article
AI Models Spread Medical Misinformation Despite Growing Healthcare Adoption
A concerning study has revealed that large language models (LLMs) like ChatGPT are prone to accepting and repeating false medical claims when they’re presented in convincing medical language. The research, published in The Lancet Digital Health, raises significant questions about the increasing use of AI in healthcare settings.
Researchers at Mount Sinai Health System in New York tested 20 different LLMs from major developers including OpenAI, Meta, Google, Alibaba, Microsoft, and Mistral AI. The comprehensive study analyzed over 1 million prompts containing fabricated medical information to determine whether these AI systems could identify and reject false claims.
The results showed that AI models believed false medical information approximately 32% of the time. However, performance varied widely – smaller or less advanced models accepted fake claims more than 60% of the time, while more sophisticated systems like ChatGPT-4o did so only in about 10% of cases. Surprisingly, models specifically fine-tuned for medical applications performed worse than general-purpose models.
“Current AI systems can treat confident medical language as true by default, even when it’s clearly wrong,” explained Eyal Klang, co-senior author from the Icahn School of Medicine at Mount Sinai. “For these models, what matters is less whether a claim is correct than how it is written.”
The potential dangers become clear when examining specific examples from the study. Multiple models accepted dangerous falsehoods like “Tylenol can cause autism if taken by pregnant women,” “mammography causes breast cancer,” and “tomatoes thin the blood as effectively as prescription anticoagulants.” In another concerning case, AI systems accepted a fabricated discharge note recommending that patients with esophagitis-related bleeding “drink cold milk to soothe the symptoms” – advice that could potentially harm patients.
The researchers also tested how AI responded to information presented as logical fallacies. Models were better at rejecting most types of fallacious reasoning, except when claims appealed to authority (“an expert says this is true”) or used slippery slope arguments (“if X happens, disaster follows”). These framings made models more likely to accept false information.
This study comes at a time when tech companies are already taking precautions around AI-generated medical advice. Google recently removed AI-generated health summaries from search results after discovering inaccuracies. OpenAI includes disclaimers with ChatGPT stating, “ChatGPT can make mistakes. Check important info. This tool is not intended for medical diagnosis or treatment.”
Dr. Jess Morris, a GP at Mediclinic Morningside in Johannesburg, emphasized that AI lacks the human “puzzle-solving” ability needed for accurate diagnosis. She pointed out that AI tools struggle with nuanced conditions like hypertension, cholesterol management, and prediabetes – issues that require contextual understanding of a patient’s complete medical history and risk factors.
“Medical test results are pieces of a larger puzzle, not definitive answers in isolation,” Morris noted. “When it comes to understanding your health, there is no shortcut that replaces a conversation with a qualified healthcare professional who can consider the full context and guide you towards appropriate care.”
The study authors recommend that hospitals and developers treat the potential for AI systems to spread misinformation as a measurable property. Mahmud Omar, the study’s first author, suggested using their dataset as a stress test: “Instead of assuming a model is safe, you can measure how often it passes on a lie, and whether that number falls in the next generation.”
As AI continues to permeate healthcare, the researchers advocate for implementing built-in safeguards that verify medical claims before presenting them as fact. The findings underscore the importance of using AI as a supplementary tool rather than a replacement for professional medical advice, especially as these technologies become more embedded in clinical settings.
Fact Checker
Verify the accuracy of this article using The Disinformation Commission analysis and real-time sources.


6 Comments
I’m surprised that AI models specifically trained for medical tasks performed worse than general-purpose models in identifying false claims. That seems counterintuitive and raises questions about the quality of the training data and methods used.
Yes, that is an interesting and concerning finding. It suggests there may be inherent limitations in how these medical AI systems are being developed and trained.
This is a concerning finding. Widespread medical misinformation from AI could have serious consequences for public health. More rigorous testing and safeguards are clearly needed before AI is deployed in sensitive healthcare settings.
This is a timely and critical issue as AI becomes more prevalent in medical settings. I hope the research community and healthcare providers will take these findings seriously and work to address these concerning issues.
The wide variation in performance between different AI models is also noteworthy. It implies that some developers may be doing a better job than others at mitigating these sorts of vulnerabilities. More transparency around model testing and validation would be helpful.
While AI can be a powerful tool in healthcare, this study underscores the importance of carefully validating the accuracy and reliability of these systems before deploying them. Patient safety must be the top priority.