AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk

from the externalizing-your-costs-directly-into-my-face dept

The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.

Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.

These are the open source projects that have a Web presence with a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew DeVault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:

These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

DeVault says that he knows many other Web sites are similarly affected:

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.

The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring them, and fine-tuning them requires time and energy from those running the sites, time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:

Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.

It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.

Follow me @glynmoody on Mastodon and on Bluesky.

Companies: sourcehut


Comments on “AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk”

24 Comments
Bilateralrope (profile) says:

Re:

It occurs to me that these regurgitation engines are doing two things:
– Increasing the costs on the organisations that are producing original content
– Reducing the traffic and income for those same organisations

The end result of that is that the creators disappear. Leaving nobody to feed the regurgitation engines.

I haven’t heard anyone talk about what happens next, but I doubt I’ll like it.

Anonymous Coward says:

This is why I’ve offloaded most of my Open Source content to GitHub. I figure Microsoft can bear the brunt of the crawls, since it’s supporting the AI data hoarding in the first place. Then I can host my project pages elsewhere on a lightweight framework that pulls resources from GitHub repositories.

I understand that that’s not for everyone though; some people don’t like depending on Microsoft to host (or arbitrarily block) their content.

But, if you have all the content checked out, you can host on GitHub and wire up your main server in such a way that you can redirect to pulling assets from a different source whenever you want. So you can even mirror all project activity on a local repo, and switch between them as needed. This lets MS handle the bulk of the scraping traffic, while contributors can work away uninhibited on the other repository with pushes making their way back to the public repository as necessary.

Stick the main page behind something like Cloudflare, and you make the scraping a Big Business problem there as well.
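The mirroring setup this commenter describes can be sketched with ordinary git remotes. In this minimal sketch the repository names are hypothetical, and the public “GitHub” mirror is simulated with a local bare repository:

```shell
set -e
# Hypothetical sketch of the setup described above: a local working repo
# plus a public mirror that absorbs clone/scrape traffic. In practice the
# mirror remote would be a github.com URL rather than a local path.
git init -q project
cd project
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "initial commit"
git init -q --bare ../public-mirror.git
git remote add mirror ../public-mirror.git
git push -q --mirror mirror   # push all branches and tags to the mirror
```

Contributors keep pushing to the working repository, and a cron job or post-receive hook re-runs `git push --mirror` so the public copy stays current; swapping the remote URL later redirects traffic without disturbing the working repo.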

mick says:

100% of library websites are affected

I run (among other things) a large image database, and we’ve been absolutely hammered by bots, starting about 6 months ago.

Finally we caved and are now paying Cloudflare to put a stop to it, which has worked pretty well.

AI companies – including Google – should be paying this bill, given that they don’t respect robots.txt.

Bloof (profile) says:

Given how the average techbro leans on the political spectrum and how Wikipedia has long been a target of the guillotinable class in general as it cannot be bought or gamed that easily, this bombardment is likely by design. If you can’t have PR people win edit wars, if you can’t write laws to remove articles on history that is inconvenient for the white nationalists in power, making the site unusable through brute force exploitation and draining resources is the next best thing.

Anonymous Coward says:

So what happens if they start hoovering Uncyclopedia content (or other encyclopaedia parodies in the same series, such as Brazil’s Desciclopédia)? These are relatively low-volume sites (Wikia/FANDOM tried to utterly destroy the project in 2019, but it limps along under independent hosting) and the Wikimedia Foundation made a very sleazy attempt this year to get Finland’s “sweat” or “hiki” encyclopedia shut down on trumped-up trademark grounds.

These sites are nonsense. They were created as parody. They look like encyclopaedias, but the joke is that they let any moron edit pages. Sound familiar?

There are only two inevitable outcomes, neither of them good. One is that the bots slow the sites (which are non-commercial and small) so badly that they’re unusable for their intended human audience. The other undesirable outcome is that the bots actually do manage to download an entire Uncyclopedia and the end result is not Artificial Intelligence but Artificial Stupidity.

I can see that one coming from a mile (1.609km) away.

GHB (profile) says:

This also explains the link-tax idea of charging to crawl

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out

What well-known search engine crawls the web? That’s right: Google, along with some other big tech companies. The EU, Australia (News Media Bargaining Code), and Canada (Online News Act) were desperate to ask for money. Ads pay less, not many users are subscribing, and the search engines aren’t necessarily doing the news sites a favor (such as leaving them stuck between being indexed and being used as AI training data, or not appearing in Google’s search index at all).
