
Photo by Joshua Woroniecki on Unsplash
Cloudflare Researchers Claim Perplexity Is Scraping Websites Despite AI Bot Block
Researchers from internet infrastructure provider Cloudflare claim that the AI system Perplexity has been scraping content from websites without permission, even when publishers have implemented AI bot blocks.
In a rush? Here are the quick facts:
- Cloudflare claims that Perplexity has been scraping content from websites without permission.
- Researchers confirmed Perplexity’s “stealth crawling” behavior even when publishers implement AI bot blocks.
- A spokesperson from Perplexity called Cloudflare’s report a “publicity stunt.”
According to the report Cloudflare published on Monday, Perplexity crawls websites using its default user agent and, when it runs into AI bot blocks, switches its identity to bypass them. Cloudflare’s experts said they confirmed this “stealth crawling” behavior.
“We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files,” wrote the researchers.
Crawlers are expected to be transparent, state their purpose clearly, and respect websites’ preferences, but researchers claim Perplexity has not been following these trust principles. This conclusion was reached following an investigation prompted by customer complaints.
“We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User,” wrote the researchers. “These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked.”
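To illustrate why a rule keyed only to a crawler’s declared name can be sidestepped, here is a minimal Python sketch of user-agent-based blocking. It is not Cloudflare’s WAF syntax or implementation; the function names and the sample user-agent strings are hypothetical, and only the two crawler names (PerplexityBot and Perplexity-User) come from the report.

```python
# Names of Perplexity's declared crawlers, as cited in Cloudflare's report.
DECLARED_PERPLEXITY_AGENTS = ("perplexitybot", "perplexity-user")


def is_declared_perplexity_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header names a declared Perplexity crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_PERPLEXITY_AGENTS)


def should_block(user_agent: str) -> bool:
    """Hypothetical blocking rule: reject only requests that identify as Perplexity."""
    return is_declared_perplexity_crawler(user_agent)


if __name__ == "__main__":
    # Both strings are illustrative assumptions, not captured traffic.
    declared = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
    generic = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0 Safari/537.36"
    print(should_block(declared))
    print(should_block(generic))
```

A rule like this matches only what a request says about itself, so a crawler that presents a generic browser user agent would pass through it, which is the kind of behavior the customers described.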
Cloudflare researchers said they verified these claims by replicating the blocks and conducting multiple tests to observe the crawler’s behavior. In one test, they created new domains that had not yet been indexed and included robots.txt files to block “respectful bots.” Later, they queried Perplexity for specific information about the restricted domains and found that the AI-powered answer engine still provided detailed, accurate information about the sites.
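The robots.txt side of such a test can be sketched with Python’s standard-library urllib.robotparser. The rules, the example.com URL, and the helper name may_fetch below are illustrative assumptions, not Cloudflare’s actual test configuration; the sketch only shows how a crawler that honors robots.txt would decide whether it is allowed to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that disallows Perplexity's two declared crawlers
# and allows everyone else. Assumed for this sketch, not taken from the report.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    for agent in ("PerplexityBot", "Perplexity-User", "SomeOtherBot"):
        print(agent, may_fetch(agent, url))
```

Because robots.txt is advisory, a check like this only has an effect when the crawler actually runs it and identifies itself honestly, which is why the researchers paired it with queries to see whether restricted content still surfaced.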
“This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers,” added the researchers.
A Perplexity spokesperson, Jesse Dwyer, called the research a “publicity stunt” in a statement to The Verge. Dwyer added that there are “misunderstandings” in Cloudflare’s report.
Cloudflare has been developing multiple tools to help publishers prevent unauthorized AI crawling. In March, Cloudflare released “AI Labyrinth,” a tool that redirects unauthorized crawlers into AI-generated content mazes. Last month, it launched “Pay Per Crawl,” a system to charge AI bots for accessing publishers’ content.