Cloudflare has revealed that Perplexity ignores scraping rules, bypassing site protections to crawl content that should have been off-limits. The AI-powered search engine reportedly used undeclared crawlers to access data from sites that explicitly blocked bots through robots.txt and firewall settings.


How Perplexity Got Around the Rules

When websites blocked Perplexity’s known crawlers, the company allegedly switched tactics. It began using stealth crawlers that disguised themselves as regular browsers. These crawlers rotated through different IP addresses and user-agent strings, making them difficult to identify.

Cloudflare discovered this behavior by setting up trap domains. These were unpublished websites configured to block all crawlers. Despite the restrictions, Perplexity still surfaced content from these sites, showing that it ignored no-crawl directives.
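Cloudflare has not published the exact trap configurations, but a site that wants to declare all of its content off-limits to crawlers typically serves a robots.txt like the following (a minimal sketch, not the actual file Cloudflare used):

```text
# Tell every crawler that nothing on this site may be fetched
User-agent: *
Disallow: /
```

Any bot that requests pages from a site serving this file is, by definition, not honoring the directive.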


Why This Behavior Matters

Robots.txt is a long-standing web standard that tells bots what content they are allowed to access. While not enforceable by law, it forms the ethical backbone of how the web operates. AI companies that ignore these rules break the trust between site owners and automated tools.
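In practice, compliance is voluntary: a well-behaved crawler downloads robots.txt and checks each URL against it before fetching. The sketch below uses Python's standard urllib.robotparser with a hypothetical crawler name and site to show what that check looks like; nothing technically stops a bot from skipping it.

```python
from urllib import robotparser

# Hypothetical rules for a crawler called "ExampleBot"; real sites vary.
parser = robotparser.RobotFileParser()
parser.parse([
    "User-agent: ExampleBot",   # the directives below apply to this crawler
    "Disallow: /private/",      # ExampleBot may not fetch anything under /private/
])

# A compliant crawler consults the parsed rules before every request.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post.html"))       # True
```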

Cloudflare responded by removing Perplexity from its verified bot list and enhancing its defenses to block the company’s traffic.
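Cloudflare has not detailed the rules it deployed, but site owners on Cloudflare can block a declared crawler themselves with a custom WAF rule. The expression below is an illustrative sketch matching Perplexity's publicly documented user-agent names; a rule like this only catches traffic that identifies itself, which the stealth crawlers deliberately did not.

```text
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity-User")
```

Paired with a Block action, this stops the declared bots; the undeclared traffic Cloudflare describes has to be caught with network-level signals instead.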


Perplexity’s Response

Perplexity claims it doesn’t crawl the web in the traditional sense. According to the company, its AI agents fetch pages only when users request them. However, Cloudflare’s findings suggest a broader and more persistent data-gathering strategy that contradicts that explanation.


A Growing Concern for the Web

As AI companies scramble to train their models on vast amounts of data, more of them are resorting to aggressive scraping tactics. Site owners, in turn, are tightening controls. The balance between open access and content protection is under pressure, and the actions of companies like Perplexity are accelerating that tension.


Conclusion

Cloudflare’s investigation shows that Perplexity ignores scraping rules, undermining web standards like robots.txt by using stealthy methods to access restricted content. This case highlights the need for transparency and ethical conduct in how AI tools gather data. Without clear boundaries, trust in the open web could erode fast.
