Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt
from the perplexing dept
Perplexity is an up-and-coming AI company with broad ambitions to compete with Google in the search market by using AI to answer user queries directly.
They’ve been in the news because their news feature repurposed content from an investigative article published on the Forbes website, which severely annoyed the Forbes editorial staff and the broader media community (never a good idea) and led to accusations of willful copyright infringement from Forbes’ legal team. Now Wired is reporting that Perplexity’s web hosting provider, AWS, is investigating the company’s practices, focusing on whether it respects robots.txt, the standard governing the behavior of web crawlers. (Or is it all robots? More on that later.)
We don’t know everything about how Perplexity actually works under the hood, and I have no relationship to the company or special knowledge. The facts are still somewhat murky, and as with any dispute over the ethics or legality of digital copying, the technical details will matter. I worked on copyright policy for years at Google, and have seen this pattern play out enough times to not pass judgment too quickly.
Based on what we know today from press reports, it seems plausible to me that the issue at the root of all this, i.e. what is driving Perplexity to dig in its heels, and what much of the reporting cites as Perplexity’s fundamental ethical failing, is what counts as a “crawler” for the purposes of robots.txt.
This is an ambiguity that will likely need to be addressed in years to come regardless of Perplexity’s practices, so it seems worth unpacking a little bit. (In fact similar questions are floating around Quora’s chatbot Poe.)
Why do I think this is the core issue? This snippet from today’s Wired article was instructive (Platnick is a Perplexity spokesperson):
“When a user prompts with a specific URL, that doesn’t trigger crawling behavior,” Platnick says. “The agent acts on the user’s behalf to retrieve the URL. It works the same way as if the user went to a page themselves, copied the text of the article, and then pasted it into the system.”
This description of Perplexity’s functionality confirms WIRED’s findings that its chatbot is ignoring robots.txt in certain instances.
The phrase “ignoring robots.txt in certain instances” sounds bad. There is, of course, the ethical question of what Perplexity is doing with news content, which is likely to be an ongoing and vigorous debate. The claim is that Perplexity is ignoring the wishes of news publishers, as expressed in robots.txt.
But we tend to codify norms and ethics into rules, and a reasonable question is: What does the robots.txt standard have to say? When is a technical system expected to comply with it, or ignore it? Could this be rooted in different interpretations of the standard?
First, a very quick history of robots.txt: In the early 90s, it was a lot more expensive to run a web server, and servers tended to break under high loads. As companies began to crawl the web to build things like search engines (which requires accessing a large share of a site’s pages), stuff started to break, and the blessed nerds who kept the web working came up with an informal standard in the mid 90s that allowed webmasters to put up road signs directing crawlers away from certain areas. Most crawlers respected this relatively informal arrangement, and still do.
Thus, “crawlers” has for decades been understood to refer to systems that access URLs in bulk, picking which URLs to access next according to a predetermined method written in code (presumably why it’s described as “crawling”). And the motivating issue was mainly a coordination problem: how to enable useful services like search engines, which are good for everyone including web publishers, without breaking things.
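For readers who have never opened one, here is a minimal sketch of what those road signs look like. The crawler name and paths below are hypothetical; the directives (User-agent, Disallow, Allow) are the ones the standard actually defines.

```
# A hypothetical robots.txt served at https://example.com/robots.txt
# Each group starts with one or more User-agent lines; the rules below it
# apply to crawlers whose product token matches.

# A hypothetical search crawler: stay out of /drafts/, except its public part
User-agent: ExampleSearchBot
Disallow: /drafts/
Allow: /drafts/public/

# Everyone else: stay out of /private/
User-agent: *
Disallow: /private/
```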
It took nearly three decades, but robots.txt was eventually codified and adopted as the Robots Exclusion Protocol, or RFC 9309, by the Internet Engineering Task Force (IETF), part of the aforementioned blessed nerd community that maintains the technical standards of the internet.
RFC 9309 does not define “crawler” or “robot” in the way a lawyer might expect a contract or statute to define a term. It says simply that “crawlers are automatic clients” with the rest left up to context clues. Most of those context clues refer to issues posed by bulk access of URIs:
It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules […] that crawlers are requested to honor when accessing URIs.
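The standard leaves compliance entirely to the client: a well-behaved crawler fetches robots.txt and checks each URI against the matching group before requesting it. Here is a minimal sketch of that check using Python’s standard-library urllib.robotparser; the robots.txt contents, crawler name, and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so the example runs offline.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler performs this check before every fetch it schedules.
for url in ("https://example.com/articles/some-story",
            "https://example.com/drafts/work-in-progress"):
    allowed = parser.can_fetch("ExampleSearchBot", url)
    print(url, "->", "fetch" if allowed else "skip (disallowed)")
```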
Every year the web’s social footprint expands, and the pressure we put on robots.txt grows with it. The file is being asked to solve a broader set of challenges, beyond protecting webmasters from the technical inconveniences of bulk access. It increasingly arbitrates massive economic interests, and, more recently, the social and ethical questions AI has inspired. Google, whose staff are the listed authors of RFC 9309, has already started thinking about what’s next.
And the technology landscape is shifting. Automated systems are accessing web content with a broader set of underlying intentions. We’re seeing the emergence of AI agents that actually do things on behalf of users and at their direction, intermediated by AI companies using large language models. As OpenAI says, AI agents may “substantially expand the helpful uses of AI systems, and introduce a range of new technical and social challenges.”
Automatic clients will continue to access web content. The user-agent might even reasonably have “Bot” in the name. But is it a crawler? It won’t be accessing content for the same purpose as a search engine crawler, nor at the same scale and depth that search requires. The ethical, economic, technical, and legal landscape for automatic AI agents will look completely different than it does for crawlers.
It may very well be sensible to expand RFC 9309 to apply to things like AI agents directed by users, or any method of automated access of web content where the user-agent isn’t directly a user’s browser. And then we would think about the cascading implications of the robots.txt standard and its requirements. Or maybe we need a new set of norms and rules to govern that activity separate from RFC 9309.
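To make the ambiguity concrete, here is a purely hypothetical extension of the earlier sketch. The product tokens are invented and nothing here is part of any existing standard; it just shows the kinds of distinctions a publisher might want to draw between a classic search crawler, a bulk AI-training crawler, and a user-directed agent, the last of which RFC 9309 currently says nothing about.

```
# Hypothetical robots.txt: every token below is invented for illustration.

# A classic search crawler: welcome everywhere except drafts
User-agent: ExampleSearchBot
Disallow: /drafts/

# A bulk crawler gathering AI training data: not welcome at all
User-agent: ExampleTrainingBot
Disallow: /

# A user-directed AI agent fetching a single page at a user’s request:
# is it a “crawler” that must honor this group? RFC 9309 doesn’t say.
User-agent: ExampleUserDirectedAgent
Disallow: /articles/
```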
Either way, disputes like this are an opportunity to consider improving and updating the rules and standards that guide actors on the web. To the extent this disagreement really is about the interpretation of “crawler” in RFC 9309, i.e. what counts as a robot or crawler and therefore what must respect listed disallows in the robots.txt file, that seems like a reasonable place to start thinking about solutions.
Alex Kozak is a tech policy consultant with Proteus Strategies, formerly gov’t affairs and regulatory strategy at Google X, global copyright policy lead at Google, and open licensing advocate at Creative Commons.
Filed Under: agents, ai, bots, crawling, generative ai, robots.txt
Companies: perplexity


Comments on “Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt”
Nuance and Clarity
In my 24 years in SEO, I have read a lot about AI and how it impacts site owners’ control over their content. To me, if an AI “agent” is accessing a URL at the instruction of the person writing the prompt, that’s no different than if that person were to go directly to that URL in a browser. In no way is it acting like a traditional crawler doing bulk scraping.
Unless the prompt directs that kind of bulk action, in which case things get nasty. But do they?
I think the intent of the scraping is what matters, not the method. That’s what it comes down to. If you’re scraping for a use that doesn’t involve producing a new product whose main content is the scraped material, I don’t see a problem. The minute commercialization comes into play, publishers should have the right to block that scraping, again regardless of the scraping method.
Re:
Yep. If my dog bites you I can’t blame the dog.
The “robots.txt” file allows a basic set of rules about how search engines and other crawlers should request a website’s pages (for example, only during specific hours, to avoid burdening a popular site).
It can’t actually block any crawler or search engine, but most websites can block requests as coming from an unwanted or malicious bot if there are too many requests that don’t match the robots.txt rules.
The main difference from the wannabe search engines, social networks, and bots of past decades is that there is a lot of money being poured into the AI race, and bandwidth or request limits are mostly ignored because of the gigantic amount of data needed to reach even basic AI relevance.
We are now looking at a crawler market worth hundreds of billions of dollars.
I think it is disingenuous to distinguish between search engine “crawlers” and the AI-feeding “whatever you call them”s.
As the article says, Robots.txt was implemented to ameliorate the impact of automated bulk URL requests.
On the one hand: It doesn’t matter for what purpose the automation is requesting the bulk URLs. The system in place was designed to limit precisely that activity. The fact that only one use case was known when the standard was created is irrelevant to that purpose. Whether the site can (now) withstand requests in that quantity is also irrelevant.
On the other hand: Sites may want a finer control, based on just that purpose which is irrelevant to robots.txt. For that, a new standard would be appropriate.
Until then, if you’re making requests in bulk to a server and ignoring an explicit robots.txt, enjoy getting blocked.
Re:
As I understand it, the crux of the debate is that AI tools are not making bulk requests to servers. They’re making very limited requests to specific pages based on user actions. However, though limited in scope, this is still automated retrieval of web pages. The question whether robots.txt should (or does) apply to such requests is worth exploring.
Re: Re:
It would probably make a difference if the ML instance were local to the user, rather than a hosted service that is copying and pasting instead of just loading the site of origin.
There is case law
In 2006, the Field v. Google case clarified the legality of Google’s search engine cache. The judge found that the cache was legal, partly as fair use but also because it was easy to opt out using robots.txt. If crawlers are ignoring the robots file, the analysis changes, and not in a way that is good for the crawlers.
https://en.wikipedia.org/wiki/Field_v._Google%2C_Inc.
I was Google’s technical expert in the case.
Re:
I think the issue here is that robots.txt is used only to limit mass automated scraping. A website can have a robots.txt that blocks a mass crawler, but anyone can still visit the URL.
In other words, a site can block Google’s search crawler via robots.txt, but if an individual puts that URL in Chrome, it’s fine for it to retrieve it.
In this scenario, Perplexity appears to respect robots.txt for automated crawling, but if you give it a URL (like putting a URL into a browser bar) it will retrieve it for the user, and then analyze it.
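A rough sketch of the distinction described above, with the caveat that this is not based on Perplexity’s actual implementation; the fetcher and agent token are hypothetical. It simply separates autonomous crawling, which consults robots.txt, from retrieving a single URL the user supplied, which under this interpretation is treated like a browser visit.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAnswerBot"  # hypothetical product token


def fetch(url: str, user_directed: bool) -> bytes:
    """Fetch a page, consulting robots.txt only for autonomous crawling."""
    if not user_directed:
        # Crawling path: the system chose this URL itself, so check robots.txt.
        robots = RobotFileParser()
        robots.set_url(urljoin(url, "/robots.txt"))
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows {url} for {USER_AGENT}")
    # User-directed path: treated like the user loading the page in a browser.
    # (Simplified: a real client would set headers, timeouts, error handling.)
    return urlopen(url).read()
```

Whether that is the right place to draw the line is, of course, exactly what is being disputed.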
Re: Re: not much difference then
I do not see much difference between this fact pattern and what the web site owner probably expects:
* type URL into browser bar
* browser retrieves page
* copy content from retrieved page
* paste into analysis engine
Particularly from the web server’s point of view, there is the exact same amount of work involved: a single serving of the specified URL.
Re: Re:
The question then is: do they incorporate the page into their corpora of training materials, or do they immediately forget it?
If you really don’t want to be scraped, I’d advocate throwing everything you don’t want scraped behind a login and putting some legalese in the agreement saying the user isn’t a bot and won’t use the account for that purpose, maybe with something requiring them to excise all data and any subsequent training results if they are found in violation.
This reminds me of how several Japanese sites put their images behind a 403 Forbidden when accessed directly
Japanese sites like Skeb, parocolla, and especially Pixiv have gone very hard against bot access, to the point that direct links to their images return a 403 error if you load the URL directly in your browser (by entering it in the address bar, or loading any HTML file that hotlinks off-site) and refresh, rather than redirecting the way image hosts like eugh, Photobucket, and Imgur do.
(I say especially Pixiv because it went a step further and requires an account to view full-resolution images; Pixiv has been doing this since well before AI-generated art took off.)
Because of this, if artists want their works to appear on search engines, they have to crosspost to sites that do allow direct URL access, like Twitter. Otherwise it’s the equivalent of posting your work to file hosting sites like Mediafire, Google Drive, Mega, and others, where both the download page and the direct download link are excluded from search engines.
Most of the time, artists end up having most of their works only on social media and not on the “bot-blocked, link-only, private-by-default” sites.
56k modem chant