AI Companies’ Web Crawling Meets Resistance from Reputable News Sites

AI models have developed an insatiable appetite for data, requiring constant updates to provide users with current information. Major AI companies have responded with aggressive web crawling practices, but they’re increasingly facing pushback from website owners who are taking steps to protect their content.

A new study from Saarland University reveals a growing divide between how reputable news outlets and misinformation sites are responding to AI crawlers. Researcher Nicolas Steinacker-Olsztyn and colleagues analyzed more than 4,000 websites to determine how they used robots.txt files—technical instructions that tell web crawlers whether they’re permitted to access content.
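The mechanism itself is simple: a robots.txt file at a site's root names a crawler's user-agent token and the paths it is asked not to fetch. A minimal illustrative file (a common pattern, not taken from any specific outlet) might look like:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

Here the site asks OpenAI's and Google's AI crawlers to stay out entirely while leaving the site open to all other crawlers. Note that robots.txt is advisory: it relies on crawlers choosing to honor it.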

The findings show a stark contrast: approximately 60% of reputable news websites now block at least one AI crawler from accessing their information. These established outlets block an average of 15 different AI agents through their robots.txt files. By comparison, only 9.1% of sites classified as sources of misinformation employ similar restrictions.

“The biggest takeaway is that the reputable news websites keep well up-to-date with the evolving ecosystem as it pertains to these major AI developers and their practices,” Steinacker-Olsztyn explains.

The researchers tracked 63 different AI-related user agents, including GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and Google-Extended (Google), all used to collect data for AI training. To categorize websites as reputable or not, they used ratings from Media Bias/Fact Check, an organization that evaluates news sources on credibility and factual reporting.
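The kind of check the study performed can be reproduced with Python's standard-library robots.txt parser. In this sketch, the robots.txt body is invented for illustration, and `is_blocked` is a hypothetical helper; the user-agent tokens are the real ones named in the article:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt body following the pattern many news sites use:
# named AI crawlers are disallowed, everything else is allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

def is_blocked(robots_body: str, agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots.txt denies this user agent access to the URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(agent, url)

# User-agent tokens of the AI crawlers tracked in the study.
for agent in ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]:
    status = "blocked" if is_blocked(robots_txt, agent) else "allowed"
    print(f"{agent}: {status}")
```

With this sample file, GPTBot and ClaudeBot report as blocked while CCBot and Google-Extended fall through to the wildcard rule and remain allowed; running the same check against thousands of live robots.txt files is essentially what the Saarland analysis did at scale.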

The trend toward blocking AI crawlers has accelerated significantly among legitimate news outlets. Between September 2023 and May 2024, the proportion of reputable platforms restricting crawler access jumped from 23% to 60%, while misinformation sites maintained consistently low blocking rates.

This divergence creates a concerning dynamic: as AI systems continue to develop and require current information, they’re increasingly being blocked from accessing quality journalism while facing few barriers to absorbing questionable content.

“Increasingly these models are also being used simply for information retrieval, replacing traditionally used options such as search engines,” notes Steinacker-Olsztyn. This shift amplifies concerns about potential imbalances in AI training data.

The battle over content access has already spilled into courtrooms. The New York Times’ ongoing lawsuit against OpenAI highlights the tension between publishers protecting their intellectual property and AI companies seeking diverse training materials. Publishers contend that AI firms are illegally harvesting their content to build commercial products without compensation.

While blocking crawlers protects publishers’ interests, it may inadvertently create information quality problems. “If reputable news is increasingly making this information unavailable, then this gives reason to believe this can affect the reliability of these models,” Steinacker-Olsztyn warns. “Going forward, this is changing the percentage of legitimate data that they have access to.”

Not all experts are equally concerned, however. Felix Simon, a research fellow at the Reuters Institute for the Study of Journalism at the University of Oxford, suggests that AI developers employ filtering mechanisms to identify and potentially discount unreliable sources.

“AI developers filter and weigh data at various points of the system training process and at inference time,” Simon explains. “One would hope that by the same means by which the authors have been able to identify untrustworthy websites, AI developers would be able to filter out such data.”

The research highlights a fundamental tension in the AI era: legitimate publishers are increasingly asserting control over their content through technical and legal means, while less reputable sources remain widely available for AI consumption. As this dynamic evolves, questions persist about how it will ultimately shape the information landscape that AI systems present to users.


© 2026 Disinformation Commission LLC. All rights reserved.