In a concerning revelation for AI reliability, researchers have uncovered significant inconsistency in ChatGPT’s ability to evaluate scientific hypotheses, raising questions about overreliance on artificial intelligence in research and business settings.

Washington State University professor Mesut Cicek and colleagues tested ChatGPT against 719 hypotheses from business research papers, repeatedly asking the system whether each hypothesis was supported by research. The results revealed a troubling pattern of inconsistency despite the AI’s confident tone.

“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” said Cicek, an associate professor in WSU’s Carson College of Business Department of Marketing and International Business.

The study found that in mid-2024, the free version of ChatGPT, then running GPT-3.5, answered correctly 76.5% of the time. When the researchers repeated the experiment in mid-2025 using GPT-5 mini, accuracy rose modestly to 80%. While statistically significant, the improvement was small, and after adjusting for the 50% chance of guessing correctly on a true-false question, effective performance was substantially lower.
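The chance adjustment mentioned above can be illustrated with a standard rescaling in which 0% means performance at the level of random guessing and 100% means perfect accuracy. This is a sketch of one common correction (the article does not specify the authors' exact formula), applied to the reported figures:

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale raw accuracy so 0.0 = random guessing and 1.0 = perfect.

    For a true-false task, chance = 0.5.
    """
    return (accuracy - chance) / (1.0 - chance)

# Reported raw accuracies from the study
for label, acc in [("2024 (GPT-3.5)", 0.765), ("2025 (GPT-5 mini)", 0.80)]:
    print(f"{label}: raw {acc:.1%}, chance-adjusted {chance_adjusted(acc):.1%}")
```

Under this correction, 76.5% raw accuracy corresponds to only 53% above-chance performance, and 80% to 60%, which is why the article describes effective performance as substantially lower than the headline numbers.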

More alarming was the inconsistency when presented with identical prompts. The researchers tested each hypothesis statement ten times using exactly the same wording.

“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false,” Cicek explained.

Though prompt-level consistency improved from 80.2% in 2024 to 86.8% in 2025, the stricter measure showed only 66.3% of hypotheses in 2024 were answered correctly across all ten repeated prompts, rising to 72.9% in 2025. This means more than a quarter of cases still had at least one incorrect answer despite unchanged wording.
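The distinction between the two measures can be made concrete with toy data. Assuming (as the article suggests) that prompt-level accuracy counts each of the ten repeated answers individually, while the stricter measure requires all ten answers for a hypothesis to be correct, a minimal sketch looks like this; the data and metric definitions are illustrative, not the study's actual dataset:

```python
# Hypothetical responses: answers[h] holds 10 repeated True/False replies
# for hypothesis h; truth[h] is the ground-truth label.
answers = {
    "H1": [True] * 10,           # always correct
    "H2": [True] * 9 + [False],  # one slip out of ten
    "H3": [True, False] * 5,     # "five true, five false"
}
truth = {"H1": True, "H2": True, "H3": True}

total = sum(len(a) for a in answers.values())
correct = sum(ans == truth[h] for h, a in answers.items() for ans in a)
prompt_level = correct / total  # accuracy counting every repeat separately

# Stricter measure: fraction of hypotheses answered correctly on ALL ten repeats
all_ten = sum(all(ans == truth[h] for ans in a)
              for h, a in answers.items()) / len(answers)
```

Here the prompt-level accuracy is 80% (24 of 30 answers correct), yet only one of the three hypotheses passes the all-ten test, showing how a system can look reliable per answer while still failing the stricter consistency bar.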

The study also revealed a persistent weakness in identifying unsupported hypotheses. ChatGPT correctly identified false statements only 13.6% of the time in 2024 and 16.4% in 2025, suggesting a bias toward confirming whatever statement it evaluates.

The research team extracted their test cases from 127 open-access articles published since 2021 in nine marketing and management journals. Each hypothesis described a formal, testable relationship, such as a main effect, mediation, or moderation.

Performance varied significantly by hypothesis type. The AI performed best on mediation hypotheses, which follow a more linear chain of reasoning, and worst on moderation hypotheses requiring contextual or conditional thinking. This pattern suggests that while AI systems can reproduce the language of logic, they struggle with the actual reasoning.

“Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”

The researchers aren’t suggesting abandoning AI altogether but advocating for cautious implementation. They believe generative AI still has value for structured tasks with clear language and straightforward reasoning, such as A/B testing, experimental design, or campaign simulation.

However, they warn that managers, consultants, analysts, and researchers should exercise skepticism, especially in high-stakes situations involving context, conditions, or indirect effects. The polish and fluency of AI responses can create an illusion of expertise that masks fundamental reasoning flaws.

“Always be skeptical,” Cicek advised. “I’m not against AI. I’m using it. But you need to be very careful.”

For businesses and research institutions, the study suggests practical approaches to AI implementation. Organizations should consider repeating prompts, verifying results through other means, and training employees to question AI outputs regardless of how confident they appear.
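The repeat-the-prompt recommendation can be operationalized as a simple majority vote with an agreement score, so that low agreement flags outputs needing human review. This is a generic sketch, not code from the study; `ask` stands in for whatever model call your organization uses, and the flaky responder below is a toy stand-in:

```python
import itertools
from collections import Counter

def majority_vote(ask, prompt: str, n: int = 10):
    """Send the same prompt n times; return the most common answer and
    the fraction of runs that agreed with it. Low agreement signals that
    the output should not be trusted without further verification."""
    replies = [ask(prompt) for _ in range(n)]
    answer, count = Counter(replies).most_common(1)[0]
    return answer, count / n

# Toy responder that answers "true" twice for every "false", mimicking
# the kind of flip-flopping the study observed.
flaky = itertools.cycle(["true", "true", "false"]).__next__
ans, agreement = majority_vote(lambda p: flaky(), "Is H1 supported?", n=9)
# With this toy responder: ans == "true", agreement == 6/9
```

A practical deployment would set an agreement threshold (say, 0.9) below which the answer is routed to a human rather than used directly.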

The research underscores that current AI systems function better as assistants than as decision-makers. The modest year-over-year gain of 3.5 percentage points suggests that progress is incremental rather than transformative, reflecting improvements in text processing rather than fundamental leaps in conceptual understanding.

As AI continues to penetrate business and academic environments, this study serves as a timely reminder that technological sophistication should not be confused with genuine comprehension. Behind the fluent language of today’s AI systems lies a more limited understanding than their confident outputs might suggest.


© 2026 Disinformation Commission LLC. All rights reserved.