AI Companies’ Web Crawling Meets Resistance from Reputable News Sites

AI models have developed an insatiable appetite for data, requiring constant updates to provide users with current information. Major AI companies have responded with aggressive web crawling practices, but they’re increasingly facing pushback from website owners who are taking steps to protect their content.

A new study from Saarland University reveals a growing divide between how reputable news outlets and misinformation sites are responding to AI crawlers. Researcher Nicolas Steinacker-Olsztyn and colleagues analyzed more than 4,000 websites to determine how they used robots.txt files—technical instructions that tell web crawlers whether they’re permitted to access content.
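The mechanism itself is simple: a robots.txt file at a site's root names a crawler's user-agent token and the paths it is asked not to fetch. A minimal illustrative file (a common pattern, not taken from any specific outlet) might look like:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

Here the site asks OpenAI's and Google's AI crawlers to stay out entirely while leaving the site open to all other crawlers. Note that robots.txt is advisory: it relies on crawlers choosing to honor it.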

The findings show a stark contrast: approximately 60% of reputable news websites now block at least one AI crawler from accessing their information. These established outlets block an average of 15 different AI agents through their robots.txt files. By comparison, only 9.1% of sites classified as sources of misinformation employ similar restrictions.

“The biggest takeaway is that the reputable news websites keep well up-to-date with the evolving ecosystem as it pertains to these major AI developers and their practices,” Steinacker-Olsztyn explains.

The researchers tracked 63 different AI-related user agents, including GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and Google-Extended (Google), all used to collect data for AI training. To categorize websites as reputable or not, they used ratings from Media Bias/Fact Check, an organization that evaluates news sources on credibility and factual reporting.
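The kind of check the study performed can be reproduced with Python's standard-library robots.txt parser. In this sketch, the robots.txt body is invented for illustration, and `is_blocked` is a hypothetical helper; the user-agent tokens are the real ones named in the article:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt body following the pattern many news sites use:
# named AI crawlers are disallowed, everything else is allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

def is_blocked(robots_body: str, agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots.txt denies this user agent access to the URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(agent, url)

# User-agent tokens of the AI crawlers tracked in the study.
for agent in ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]:
    status = "blocked" if is_blocked(robots_txt, agent) else "allowed"
    print(f"{agent}: {status}")
```

With this sample file, GPTBot and ClaudeBot report as blocked while CCBot and Google-Extended fall through to the wildcard rule and remain allowed; running the same check against thousands of live robots.txt files is essentially what the Saarland analysis did at scale.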

The trend toward blocking AI crawlers has accelerated significantly among legitimate news outlets. Between September 2023 and May 2024, the proportion of reputable platforms restricting crawler access jumped from 23% to 60%, while misinformation sites maintained consistently low blocking rates.

This divergence creates a concerning dynamic: as AI systems continue to develop and require current information, they’re increasingly being blocked from accessing quality journalism while facing few barriers to absorbing questionable content.

“Increasingly these models are also being used simply for information retrieval, replacing traditionally used options such as search engines,” notes Steinacker-Olsztyn. This shift amplifies concerns about potential imbalances in AI training data.

The battle over content access has already spilled into courtrooms. The New York Times’ ongoing lawsuit against OpenAI highlights the tension between publishers protecting their intellectual property and AI companies seeking diverse training materials. Publishers contend that AI firms are illegally harvesting their content to build commercial products without compensation.

While blocking crawlers protects publishers’ interests, it may inadvertently create information quality problems. “If reputable news is increasingly making this information unavailable, then this gives reason to believe this can affect the reliability of these models,” Steinacker-Olsztyn warns. “Going forward, this is changing the percentage of legitimate data that they have access to.”

Not all experts are equally concerned, however. Felix Simon, a research fellow at the Reuters Institute for the Study of Journalism at the University of Oxford, suggests that AI developers employ filtering mechanisms to identify and potentially discount unreliable sources.

“AI developers filter and weigh data at various points of the system training process and at inference time,” Simon explains. “One would hope that by the same means by which the authors have been able to identify untrustworthy websites, AI developers would be able to filter out such data.”

The research highlights a fundamental tension in the AI era: legitimate publishers are increasingly asserting control over their content through technical and legal means, while less reputable sources remain widely available for AI consumption. As this dynamic evolves, questions persist about how it will ultimately shape the information landscape that AI systems present to users.


© 2026 Disinformation Commission LLC. All rights reserved.