Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt
from the perplexing dept
Perplexity is an up-and-coming AI company with broad ambitions to compete with Google in the search market by using AI to answer user queries directly.
They’ve been in the news because their news feature repurposed content from an investigative article published on the Forbes website, which severely annoyed the Forbes editorial staff and the broader media community (never a good idea) and led to accusations of willful copyright infringement from Forbes’ legal team. Now Wired is reporting that Perplexity’s web hosting provider, AWS, is investigating the company’s practices, focusing on whether it respects robots.txt, the standard governing the behavior of web crawlers. (Or is it all robots? More on that later.)
We don’t know everything about how Perplexity actually works under the hood, and I have no relationship to the company or special knowledge. The facts are still somewhat murky, and as with any dispute over the ethics or legality of digital copying, the technical details will matter. I worked on copyright policy for years at Google, and have seen this pattern play out enough times to not pass judgment too quickly.
Based on what we know today from press reports, it seems plausible to me that the issue at the root of all this, i.e. what is driving Perplexity to dig in its heels, and what much of the reporting cites as Perplexity’s fundamental ethical failing, is what counts as a “crawler” for the purposes of robots.txt.
This is an ambiguity that will likely need to be addressed in years to come regardless of Perplexity’s practices, so it seems worth unpacking a little bit. (In fact similar questions are floating around Quora’s chatbot Poe.)
Why do I think this is the core issue? This snippet from today’s Wired article was instructive (Platnick is a Perplexity spokesperson):
“When a user prompts with a specific URL, that doesn’t trigger crawling behavior,” Platnick says. “The agent acts on the user’s behalf to retrieve the URL. It works the same way as if the user went to a page themselves, copied the text of the article, and then pasted it into the system.”
This description of Perplexity’s functionality confirms WIRED’s findings that its chatbot is ignoring robots.txt in certain instances.
The phrase “ignoring robots.txt in certain instances” sounds bad. There is, of course, the ethical question of what Perplexity is doing with news content, which is likely to be an ongoing and vigorous debate. The claim is that Perplexity is ignoring the wishes of news publishers, as expressed in robots.txt.
But we tend to codify norms and ethics into rules, and a reasonable question is: What does the robots.txt standard have to say? When is a technical system expected to comply with it, or ignore it? Could this be rooted in different interpretations of the standard?
First, a very quick history of robots.txt: In the early 90s, it was a lot more expensive to run a web server, and servers tended to break under high loads. As companies began to crawl the web to build things like search engines (which requires accessing a large share of a site’s pages), stuff started to break, and the blessed nerds who kept the web working came up with an informal standard in the mid 90s that allowed webmasters to put up road signs directing crawlers away from certain areas. Most crawlers respected this relatively informal arrangement, and still do.
Thus, “crawlers” has for decades been understood to refer to systems that access URLs in bulk, picking which URLs to access next according to a predetermined method written in code (presumably why it’s described as “crawling”). And the motivating issue was mainly a coordination problem: how to enable useful services like search engines, which are good for everyone including web publishers, without breaking things.
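For readers who have never opened one, here is a minimal sketch of what those road signs look like. The crawler name and paths below are hypothetical; the directives (User-agent, Disallow, Allow) are the ones the standard actually defines.

```
# A hypothetical robots.txt served at https://example.com/robots.txt
# Each group starts with one or more User-agent lines; the rules below it
# apply to crawlers whose product token matches.

# A hypothetical search crawler: stay out of /drafts/, except its public part
User-agent: ExampleSearchBot
Disallow: /drafts/
Allow: /drafts/public/

# Everyone else: stay out of /private/
User-agent: *
Disallow: /private/
```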
It took nearly three decades, but robots.txt was eventually codified and adopted as the Robots Exclusion Protocol, or RFC 9309, by the Internet Engineering Task Force (IETF), part of the aforementioned blessed nerd community that maintains the technical standards of the internet.
RFC 9309 does not define “crawler” or “robot” in the way a lawyer might expect a contract or statute to define a term. It says simply that “crawlers are automatic clients” with the rest left up to context clues. Most of those context clues refer to issues posed by bulk access of URIs:
It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules […] that crawlers are requested to honor when accessing URIs.
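The standard leaves compliance entirely to the client: a well-behaved crawler fetches robots.txt and checks each URI against the matching group before requesting it. Here is a minimal sketch of that check using Python’s standard-library urllib.robotparser; the robots.txt contents, crawler name, and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so the example runs offline.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler performs this check before every fetch it schedules.
for url in ("https://example.com/articles/some-story",
            "https://example.com/drafts/work-in-progress"):
    allowed = parser.can_fetch("ExampleSearchBot", url)
    print(url, "->", "fetch" if allowed else "skip (disallowed)")
```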
Every year the web’s social footprint expands, and the pressure we put on robots.txt grows with it. The file is being asked to solve a broader set of challenges, beyond protecting webmasters from the technical inconveniences of bulk access. It increasingly arbitrates massive economic interests, and, more recently, the social and ethical questions AI has inspired. Google, whose staff are the listed authors of RFC 9309, has already started thinking about what’s next.
And the technology landscape is shifting. Automated systems are accessing web content with a broader set of underlying intentions. We’re seeing the emergence of AI agents that actually do things on behalf of users and at their direction, intermediated by AI companies using large language models. As OpenAI says, AI agents may “substantially expand the helpful uses of AI systems, and introduce a range of new technical and social challenges.”
Automatic clients will continue to access web content. The user-agent might even reasonably have “Bot” in the name. But is it a crawler? It won’t be accessing content for the same purpose as a search engine crawler, nor at the same scale and depth that search requires. The ethical, economic, technical, and legal landscape for automatic AI agents will look completely different than it does for crawlers.
It may very well be sensible to expand RFC 9309 to apply to things like AI agents directed by users, or any method of automated access of web content where the user-agent isn’t directly a user’s browser. And then we would think about the cascading implications of the robots.txt standard and its requirements. Or maybe we need a new set of norms and rules to govern that activity separate from RFC 9309.
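To make the ambiguity concrete, here is a purely hypothetical extension of the earlier sketch. The product tokens are invented and nothing here is part of any existing standard; it just shows the kinds of distinctions a publisher might want to draw between a classic search crawler, a bulk AI-training crawler, and a user-directed agent, the last of which RFC 9309 currently says nothing about.

```
# Hypothetical robots.txt: every token below is invented for illustration.

# A classic search crawler: welcome everywhere except drafts
User-agent: ExampleSearchBot
Disallow: /drafts/

# A bulk crawler gathering AI training data: not welcome at all
User-agent: ExampleTrainingBot
Disallow: /

# A user-directed AI agent fetching a single page at a user’s request:
# is it a “crawler” that must honor this group? RFC 9309 doesn’t say.
User-agent: ExampleUserDirectedAgent
Disallow: /articles/
```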
Either way, disputes like this are an opportunity to consider improving and updating the rules and standards that guide actors on the web. To the extent this disagreement really is about the interpretation of “crawler” in RFC 9309, i.e. what counts as a robot or crawler and therefore what must respect listed disallows in the robots.txt file, that seems like a reasonable place to start thinking about solutions.
Alex Kozak is a tech policy consultant with Proteus Strategies, formerly gov’t affairs and regulatory strategy at Google X, global copyright policy lead at Google, and open licensing advocate at Creative Commons.
Filed Under: agents, ai, bots, crawling, generative ai, robots.txt
Companies: perplexity


Comments on “Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt”
Nuance and Clarity
In my 24 years in SEO, I have read a lot about AI and how it impacts site owners’ control over their content. To me, if an AI “agent” is accessing a URL at the instruction of the person writing the prompt, that’s no different than if that person were to go directly to that URL in a browser. In no way is it acting like a traditional crawler doing bulk scraping.
Unless the prompt directs that kind of bulk action, in which case things get nasty. But do they?
I think the intent of the scraping is what matters, not the method. That’s what it comes down to. If you’re scraping for a use that doesn’t involve producing a new product whose main content is the scraped material, I don’t see a problem. The minute commercialization comes into play, publishers should have the right to block that scraping, again regardless of the scraping method.
Re:
Yep. If my dog bites you I can’t blame the dog.
The “robots.txt” file allows a basic set of rules about how search engines and other crawlers should request a website’s pages (for example, only during specific hours, to avoid burdening a popular site).
It can’t actually block any crawler or search engine, but most websites can block requests as coming from an unwanted or malicious bot if there are too many requests that don’t match the robots.txt rules.
The main difference from the wannabe search engines, social networks, and bots of past decades is that there is a lot of money being poured into the AI race, and bandwidth or request limits are mostly ignored because of the gigantic amount of data needed to reach even basic AI relevance.
We are now looking at a crawler market worth hundreds of billions of dollars.
I think it is disingenuous to distinguish between search engine “crawlers” and the AI-feeding “whatever you call them”s.
As the article says, Robots.txt was implemented to ameliorate the impact of automated bulk URL requests.
On the one hand: It doesn’t matter for what purpose the automation is requesting the bulk URLs. The system in place was designed to limit precisely that activity. The fact that only one use case was known when the standard was created is irrelevant to that purpose. Whether the site can (now) withstand requests in that quantity is also irrelevant.
On the other hand: Sites may want a finer control, based on just that purpose which is irrelevant to robots.txt. For that, a new standard would be appropriate.
Until then, if you’re making requests in bulk to a server and ignoring an explicit robots.txt, enjoy getting blocked.
Re:
As I understand it, the crux of the debate is that AI tools are not making bulk requests to servers. They’re making very limited requests to specific pages based on user actions. However, though limited in scope, this is still automated retrieval of web pages. The question whether robots.txt should (or does) apply to such requests is worth exploring.
Re: Re:
It would probably make a difference if the ML instance were local to the user, rather than a hosted service that is copying and pasting instead of just loading the site of origin.
There is case law
In 2006, the Field v. Google case clarified the legality of Google’s search engine cache. The judge found that the cache was legal, partly as fair use but also because it was easy to opt out using robots.txt. If crawlers are ignoring the robots file, the analysis changes, and not in a way that is good for the crawlers.
https://en.wikipedia.org/wiki/Field_v._Google%2C_Inc.
I was Google’s technical expert in the case.
Re:
I think the issue here is that robots.txt is used only to limit mass automated scraping. A website can have a robots.txt that blocks a mass crawler, but anyone can still visit the URL.
In other words, a site can block Google’s search crawler via robots.txt, but if an individual puts that URL in Chrome, it’s fine for it to retrieve it.
In this scenario, Perplexity appears to respect robots.txt for automated crawling, but if you give it a URL (like putting a URL into a browser bar) it will retrieve it for the user, and then analyze it.
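A rough sketch of the distinction described above, with the caveat that this is not based on Perplexity’s actual implementation; the fetcher and agent token are hypothetical. It simply separates autonomous crawling, which consults robots.txt, from retrieving a single URL the user supplied, which under this interpretation is treated like a browser visit.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAnswerBot"  # hypothetical product token


def fetch(url: str, user_directed: bool) -> bytes:
    """Fetch a page, consulting robots.txt only for autonomous crawling."""
    if not user_directed:
        # Crawling path: the system chose this URL itself, so check robots.txt.
        robots = RobotFileParser()
        robots.set_url(urljoin(url, "/robots.txt"))
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows {url} for {USER_AGENT}")
    # User-directed path: treated like the user loading the page in a browser.
    # (Simplified: a real client would set headers, timeouts, error handling.)
    return urlopen(url).read()
```

Whether that is the right place to draw the line is, of course, exactly what is being disputed.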
Re: Re: not much difference then
I do not see much difference between this fact pattern and what the web site owner probably expects:
* type URL into browser bar
* browser retrieves page
* copy content from retrieved page
* paste into analysis engine
Particularly from the web server’s point of view, there is the exact same amount of work involved: a single serving of the specified URL.
Re: Re:
The question then is: do they incorporate the page into their corpora of training materials, or do they immediately forget it?
If you really don’t want to be scraped, I’d advocate throwing everything you don’t want scraped behind a login and putting some legalese in the agreement saying the user isn’t a bot and won’t use the account for that purpose, maybe with something requiring them to excise all data and any subsequent training results if they are found in violation.
This reminds me of how several Japanese sites put their images behind a 403 Forbidden when accessed directly
Japanese sites like Skeb, parocolla, and especially Pixiv have gone very hard against bot access, to the point that direct links to their images return a 403 error if you load the URL directly in your browser (by entering it in the address bar, or loading any HTML file that hotlinks off-site) and refresh, rather than redirecting the way image hosts like eugh, Photobucket, and Imgur do.
(I say especially Pixiv because it went a step further and requires an account to view full-resolution images; Pixiv has been doing this since well before AI-generated art took off.)
Because of this, if artists want their works to appear on search engines, they have to crosspost to sites that do allow direct URL access, like Twitter. Otherwise it’s the equivalent of posting your work to file hosting sites like Mediafire, Google Drive, Mega, and others, where both the download page and the direct download link are excluded from search engines.
Most of the time, artists end up having most of their works only on social media and not on the “bot-blocked, link-only, private-by-default” sites.
56k modem chant