ChatGPT’s Confidence Masks Concerning Inconsistencies, New Study Finds

A new study reveals a troubling pattern in ChatGPT’s reliability, suggesting that the AI system’s confident tone may obscure significant flaws in its reasoning abilities. Researchers at Washington State University (WSU) discovered that ChatGPT frequently delivers contradictory answers when the identical question is asked repeatedly, with no changes at all to the input prompt.

The research team, led by Mesut Cicek, conducted extensive testing by posing hundreds of hypotheses drawn from published scientific research papers to ChatGPT and asking the system to determine whether each claim was true or false. By running each identical prompt ten times, they uncovered a concerning inconsistency in the AI’s responses.
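A minimal sketch of that repeated-prompt test might look like the following, assuming OpenAI's Python client (openai >= 1.0). The model name, prompt wording, and example claim are illustrative placeholders, not details of the researchers' actual setup; only the ten-repetitions protocol comes from the study.

# Sketch of the repeated-prompt consistency test described above.
# Assumptions: model name, prompt wording, and example claim are
# placeholders, not taken from the study.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CLAIM = "Frequent price promotions increase long-term brand loyalty."  # hypothetical example

def ask_true_or_false(claim: str) -> str:
    """Ask the model to label one claim, returning its one-word verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice, for illustration only
        messages=[{
            "role": "user",
            "content": f"Is the following hypothesis true or false? Answer in one word.\n\n{claim}",
        }],
    )
    return response.choices[0].message.content.strip().lower()

# The study's protocol: the identical prompt, run ten times, answers tallied.
answers = [ask_true_or_false(CLAIM) for _ in range(10)]
print(Counter(answers))  # e.g. Counter({'true': 7, 'false': 3}) would expose inconsistency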

“We’re not just talking about accuracy, we’re talking about inconsistency because if you ask the same question again and again, you come up with different answers,” explained Cicek, highlighting a fundamental issue with the system’s reliability.

This inconsistency manifested starkly in some cases, with the AI oscillating between labeling the same claim as “true” in one instance and “false” in another, despite no changes to the input. Such reversals expose critical limitations in how large language models evaluate factual information.

At first glance, ChatGPT’s overall accuracy appeared reasonably strong, improving from 76.5 percent in 2024 to 80 percent in 2025. However, when researchers factored out the element of random chance inherent in true-or-false questions, the effective accuracy plummeted to around 60 percent—equivalent to a low D grade in academic terms.
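The article does not show the adjustment itself, but the standard correction-for-guessing formula (an assumption here, not necessarily the authors' exact method) reproduces the reported figure. With a 50 percent chance baseline on true-or-false questions:

\[
\text{corrected} = \frac{\text{raw} - \text{chance}}{1 - \text{chance}} = \frac{0.80 - 0.50}{1 - 0.50} = 0.60
\]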

The study identified specific weaknesses in ChatGPT’s reasoning capabilities. The AI performed particularly poorly when evaluating unsupported hypotheses, correctly identifying false claims just 16.4 percent of the time in 2025. This suggests a persistent bias toward agreement, with the system defaulting to “yes” responses because matching familiar language patterns is easier than detecting flawed ideas.

Furthermore, the research revealed that across ten repeated runs, only 72.9 percent of responses in 2025 remained consistently correct. This instability presents a significant challenge for users relying on these systems for decision-making, as a single response might appear reliable while repeated checks expose its fundamental fragility.

The AI demonstrated stronger performance on straightforward cause-and-effect relationships but struggled with context-dependent claims—precisely the kind of nuanced judgments central to everyday business decisions, from pricing strategies to policy tradeoffs.

These findings highlight the inherent limitations of large language models, which are trained on vast text datasets and function by predicting likely word sequences rather than verifying facts against reality. This design helps generate fluent, confident-sounding responses even when the system lacks a grounded way to assess their truthfulness.
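A toy illustration of why this design matters for consistency: when a model decodes with a temperature above zero, it samples from a probability distribution over next words, so the identical prompt can legitimately yield different answers. The distribution below is invented purely for illustration.

import random

# Invented next-word probabilities for a prompt like "This hypothesis is ..."
next_word_probs = {"true": 0.55, "false": 0.40, "unclear": 0.05}

def sample_answer() -> str:
    """Sample one continuation, the way sampling-based decoding does."""
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return random.choices(words, weights=weights)[0]

# Ten runs of the "same prompt": the sampled verdicts can and do differ.
print([sample_answer() for _ in range(10)])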

OpenAI, the creator of ChatGPT, acknowledges that the system can produce “hallucinations”—responses that appear certain but contain factual inaccuracies. This combination of confidence and unreliability makes the system particularly challenging to use effectively, as incorrect answers can seem convincing enough to trust.

For businesses, scientific teams, and other professional users, this weakness transforms what might be a useful productivity tool into a potential source of risk. While AI-generated summaries can accelerate planning processes, a single flawed judgment could misdirect product development, budgeting decisions, or marketing campaigns.

“They just memorize, and they can give you some insight, but they don’t understand what they’re talking about,” Cicek noted, emphasizing the gap between the system’s linguistic fluency and its actual comprehension.

The research suggests several practical approaches for more responsible AI use. Users should treat AI outputs as preliminary drafts rather than final decisions, run identical prompts multiple times to identify inconsistencies, verify information against established sources, and consider potential missing context.
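As a concrete version of the run-identical-prompts-multiple-times advice, a small helper like the one below can tally repeated answers and flag low agreement. The function name and the 0.8 threshold are arbitrary illustrative choices, not recommendations from the study.

from collections import Counter

def consistency_check(answers: list[str], threshold: float = 0.8):
    """Return (majority answer, agreement rate, passes-threshold flag)."""
    tally = Counter(a.strip().lower() for a in answers)
    label, count = tally.most_common(1)[0]
    agreement = count / len(answers)
    return label, agreement, agreement >= threshold

# Seven "true" and three "false" across ten runs: 0.7 agreement fails the bar.
print(consistency_check(["true"] * 7 + ["false"] * 3))  # ('true', 0.7, False)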

Though the study focused specifically on business hypotheses from open-access research, evaluated with ChatGPT, its findings carry broader implications for AI reliability. Despite appearing more polished in its 2025 version than in 2024, the system has not fundamentally evolved into a dependable reasoning tool.

As organizations increasingly integrate AI systems into their workflows, WSU’s research serves as a timely reminder: human expertise remains essential for evaluating AI-generated content, particularly when the stakes are high and the answers seem suspiciously straightforward.

The complete study has been published in Rutgers Business Review.


