Tackling The AI Bots That Threaten To Overwhelm The Open Web
from the overrunning-the-commons dept
It is a measure of how fast the field of AI has developed in the three years since Walled Culture the book (free digital versions available) was published that the issue of using copyright material for training AI systems, briefly mentioned in the book, has become one of the hottest topics in the copyright world, as numerous posts on this blog attest.
The current situation sees the copyright industry pitted against the generative AI companies. The former wants to limit how copyright material can be used, while the latter want a free-for-all. But that crude characterization does not mean that the AI companies can be regarded as on the side of the angels when it comes to broadening access to online material. They may want unfettered access for themselves, but it is becoming increasingly clear that as more companies rush to harvest key online resources for AI training purposes, they risk hobbling access for everyone else, and even threaten the very nature of the open Web.
The problem is particularly acute for non-commercial sites offering access to material for free, because they tend to be run on a shoestring, and are thus unable to cope easily with the extra demand placed on their servers by AI companies downloading holdings en masse. Even huge sites like the Wikimedia projects, which describe themselves as “the largest collection of open knowledge in the world”, are struggling with the rise of AI bots:
We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
Specifically:
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
A valuable new report from the GLAM-E Lab explores how widespread this problem is in the world of GLAMs – galleries, libraries, archives, and museums. Here’s the main result:
Bots are widespread, although not universal. Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic.
Although the sites that responded to the survey were generally keen for their holdings to be accessed, there comes a point where AI bots are degrading the service to human visitors. The question then becomes: what can be done about it?
There is already a tried and tested way to block bots: robots.txt, a tool that “allows websites to signal to bots which parts of the site the bots should not visit. Its most widely adopted use is to indicate which parts of sites should not be indexed by search engines,” as the report explains. However, there is no mechanism for enforcing the robots.txt rules, which often leads to problems:
Respondents reported that robots.txt is being ignored by many (although not necessarily all) AI scraping bots. This was widely viewed as breaking the norms of the internet, and not playing fair online.
Reports of these types of bots ignoring robots.txt are widespread, even beyond respondents. So widespread, in fact, that there are currently a number of efforts to develop new or updated robots.txt-style protocols to specifically govern AI-related bot behavior online.
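To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler honours robots.txt, using Python’s standard-library urllib.robotparser; the bot name “ExampleAIBot” and the rules shown are illustrative, not taken from the report:

```python
# Minimal sketch: how a well-behaved crawler honours robots.txt.
# Uses only the Python standard library; "ExampleAIBot" is an
# illustrative user-agent name, not one named in the report.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant scraper checks before every request:
print(parser.can_fetch("ExampleAIBot", "/collections/image1.jpg"))   # False
# Ordinary user agents are unaffected by the AI-specific rule:
print(parser.can_fetch("SomeOtherAgent", "/collections/image1.jpg")) # True
```

The catch, as the survey responses make clear, is that nothing forces a bot to run such a check: robots.txt is a polite convention, not an access control.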
One solution is to use a firewall to block traffic according to certain rules: for example, by IP address, by geography, or by particular domain. Another is to offload the task of blocking to a third party. The most popular among survey respondents is Cloudflare:
One [respondent] noted that, although they can still see the bot traffic spikes in their Cloudflare dashboard, since implementing protections, none of those spikes had managed to negatively impact the system. Others appreciated the effectiveness of Cloudflare but worried that an environment of persistent bot traffic would mean they would have to rely on Cloudflare in perpetuity.
And that means paying Cloudflare in perpetuity, which for many non-profit sites is a challenge, as is simply increasing server capacity or moving to a cloud-based system – other ways of coping with surges in demand. A radically different approach to tackling AI bots is to move collections behind a login. But for many in the GLAM world, there is a big problem with this kind of shift:
the larger objection to moving works behind a login screen was philosophical. Respondents expressed concern that moving work behind a login screen, even if creating an account was free, ran counter to their collection’s mission to make their collections broadly available online. Their goal was to create an accessible collection, and adding barriers made that collection less available.
More generally, this would be a terrible move for the open Web, which has frictionless access to knowledge at its heart. Locking things down simply to keep out the AI bots would go against that core philosophy completely. It would also bolster arguments frequently made by the copyright industry that access to everything online should by default require permission.
It seems unfair that groups working for the common good are forced by the onslaught of AI bots to carry out extra work constantly re-configuring firewalls, to pay for extra services, or to undermine the openness that lies at the heart of their missions. An article on the University of North Carolina Web site discussing how the university’s library tackled this problem of AI bots describes an interesting alternative approach that could offer a general solution. Faced with a changing pattern of access by huge numbers of AI bots, the library brought in local tech experts:
[Associate University Librarian for Digital Strategies & Information Technology] Shearer turned to the University’s Information Technology Services, which serves the entire campus. They had never encountered an attack quite like this either, and they readily brought their security and networking teams to the table. By mid-January a powerful AI-based firewall was in place, blocking the bots while permitting legitimate searches.
Stopping just the AI bots requires spotting patterns in access traffic that distinguish them from human visitors, in order to allow the latter to continue with their visits unimpeded. Finding patterns quickly in large quantities of data is something that modern AI is good at, so using it to filter out the constantly shifting patterns of AI bot access by tweaking the site’s firewall rules in real time is an effective solution. It’s also an apt one: it means that the problems AI is creating can be solved by AI itself.
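To make the general shape of that approach concrete, here is a deliberately simplified, non-AI sketch of the pipeline such a system automates: read an access log, build per-client features, flag outliers, emit block rules. The log format, thresholds, and nftables rule syntax are illustrative assumptions, not details of the UNC library’s actual system:

```python
# Simplified sketch of bot-detection-to-firewall pipeline.
# All thresholds and formats are illustrative assumptions.
import re
from collections import defaultdict

# Matches the start of a Common Log Format line, e.g.
# 203.0.113.7 - - [10/Jun/2025:12:00:00 +0000] "GET /item/123 HTTP/1.1" ...
LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<path>\S+)')

def suspicious_clients(log_lines, max_requests=1000, min_distinct_ratio=0.9):
    """Flag clients whose volume and crawl shape look bot-like: very many
    requests, almost never revisiting the same path (humans browse and
    revisit; scrapers enumerate)."""
    hits = defaultdict(list)
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            hits[m.group("ip")].append(m.group("path"))
    flagged = []
    for ip, paths in hits.items():
        distinct_ratio = len(set(paths)) / len(paths)
        if len(paths) > max_requests and distinct_ratio > min_distinct_ratio:
            flagged.append(ip)
    return flagged

def block_rules(ips):
    """Turn flagged addresses into firewall rules (nftables-style syntax,
    shown for illustration; assumes an existing "inet filter input" chain)."""
    return [f"nft add rule inet filter input ip saddr {ip} drop" for ip in ips]

if __name__ == "__main__":
    with open("access.log") as f:  # log path is an assumption
        for rule in block_rules(suspicious_clients(f)):
            print(rule)
```

A real system would replace the hand-tuned thresholds with a model that re-learns the traffic patterns as the bots change theirs; that continuous re-learning is the part the article is calling AI.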
Such an AI-driven firewall management system needs to be created and updated to keep ahead of the rapidly evolving AI bot landscape. It would make a great open source project that coders and non-profits around the world could work on together, since the latter face a common problem, and many have too few resources to tackle it on their own. Open source applications of the latest AI technologies are rather thin on the ground, even though most generative AI systems are built on open source code. An AI-driven firewall management system optimized for the GLAM sector would be a great place for the free software world to start remedying that.
Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.


Comments on “Tackling The AI Bots That Threaten To Overwhelm The Open Web”
"And that means paying Cloudflare in perpetuity"
Which is undesirable given that Cloudflare is one of the largest sources of abusive AI scans.
They’re very concerned about what gets into their operation, but they don’t care even a little bit about what comes out.
Cloudflare “protections” also DoS your own system[0]. They needlessly burn resources on visitors and, by mandating code execution on the visitor’s machine, make everyone more vulnerable to exploits. Furthermore, by terminating the TLS tunnel on Cloudflare’s servers, they literally man-in-the-middle all their customers. A cloud flare. For hostile parties, gaining access to Cloudflare’s servers is, or would be, a great single target from which to exploit vast swaths of the internet.
I’m sure the list could go on. Point is, Cloudflare has LOTS of unhealthy aspects to it.
[0] https://www.devever.net/~hl/cloudflare (see the second bullet under “Conclusions”; for the evidence and arguments, read the full article)
Really? It’s been pointed out many times, over several years, that the “free digital versions” link blocks many potential readers, including me—I might be a robot, you see. I suspect that I’m not, but, then, isn’t that what the artificial intelligences always think?
I guess it’s easy to talk up the open web from behind your culture-wall.
Extra irony: to gain access I’d have to enable cookies and solve a CAPTCHA made up of wavy letters and numbers. I’m told actual bots are better at such things than humans now.
What the hell happened to all the computing advancements of the last 25 years, anyway? Dan Kegel wrote “the C10k problem” in 1999, about the difficulty of serving 10,000 clients at a time over a gigabit link. At the time, one of the busiest sites was handling 3,600 clients over a 70 Mbit/s link, but the Linux and BSD people quickly solved the scalability problems, and by 2014 people were doing 10 million simultaneous clients. And then we had another decade of improvements to core count, speed, memory, and of course storage (M.2 was introduced in 2013). So even if there were thousands of bots hitting a site all the time, I don’t see why it should be such a huge problem; if it is, maybe fixing the underlying scalability limitations would be a better use of time than bot-detection.
Re:
It’s been pointed out before that the book is available elsewhere, such as Archive.org. For all your complaints, you can’t be bothered to search for it.
https://archive.org/details/walledculturehow0000mood
Re: Re:
You’re missing the point. The author is complaining about sites blocking people—this being against some vague “core philosophy”—while doing basically the same thing.
Yeah, the archive.org version’s been posted before, but Glyn keeps linking to the version that blocks some people, while talking about openness.
Re:
Honestly, I can’t read this and come out with a coherent meaning. I thought maybe it was complaining that the link requires… SOMETHING of people (JavaScript, a CAPTCHA, whatever) to get the free edition… so I literally just followed the link in question. It took me to a page with a “download pdf” button that I clicked… and downloaded a PDF. And by the way, I browse with JavaScript off[0].
Maybe try restating what the complaint is, because at least some of us are not getting anything coherent.
[0] Of course, it’s entirely possible that the site is protected by Cloudflare, and random people will randomly get denied access unless they capitulate to Cloudflare’s demands. I’ve had that happen to me (at which point I go away). Usually within a day to a few weeks Cloudflare will move on to harassing other people/connections. IMHO this is definitely a problem for the web.
Re: Re:
That’s basically the problem, except I think it’s not Cloudflare. It redirects me, through several steps, to a page that says:
If I enabled cookies, I might be able to get past that by solving the CAPTCHA. An actual bot would get past it more easily, I suspect. And of course some people don’t get hit with these screens, for various reasons, and don’t realize there’s any problem at all. The blocking page doesn’t even say why they need “to tell the difference between humans and bots”; Glyn’s post says a bit, but a 50% increase in traffic doesn’t seem much like a crisis to me.
But, to be clear, my main complaint is more about the hypocrisy than being unable to access the book. Glyn has been linking the site for years and it’s always given me trouble; “a terrible move for the open Web” that was happening long before anyone’s hand was “forced by the onslaught of AI bots”.
(I’ve posted maybe twice, before today, about the blocking. There have been other comments by at least one other person. Usually someone posts an archive.org link.)
Re: Re: Re:
Your main complaint is kind of stupid, because I don’t see you complaining about the need to buy a computer and internet access to get a free book.
That you have chosen to use the internet in a way that differs from how almost everyone else uses it is entirely a “you” problem.
What you also fail to understand is that Techdirt and its associated sites get DoSed on occasion, which necessitates mitigation measures.
So in the end, you are blaming TD for your own behavior and the behavior of bad actors. I guess you also think it is too difficult to spin up an incognito tab to download the book if you don’t like cookies, because that would mean the book isn’t free anymore.
Re: Re: Re:2
So… has the site hosting this book just happened to be under denial-of-service attacks on every “occasion” when I’ve tried to access it? That seems implausible.
Plus, Glyn is now blaming A.I. load rather than some targeted attack. I still don’t understand how giving me tasks, such as solving CAPTCHAs or running proof-of-work Javascript, helps with that. Those are things computers have become better than humans at. In particular, these A.I. companies have billions of dollars of computing power available to them; the “wastefulness” is a common complaint. And I don’t understand why modern servers that can handle millions of connections at once are having such trouble (except with huge files like videos; some HTML or a book shouldn’t be a problem).
What? I don’t generally have problems accessing Techdirt. I’m talking about the Walled Culture site, which I believe is Glyn’s. And Glyn’s the one complaining about how the sort of thing Glyn has been doing for years is harming the open web.
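For readers who haven’t met it: “proof-of-work JavaScript” asks the visitor’s browser to burn CPU finding a value whose hash has a particular shape, which the server can then verify cheaply. A minimal sketch of the idea (the difficulty value and token are illustrative):

```python
# Minimal sketch of a hashcash-style proof-of-work challenge, the kind of
# computation anti-bot JavaScript pushes onto a visitor's browser.
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int = 20) -> int:
    """Find a nonce whose SHA-256 digest, combined with the challenge,
    falls below a target: expensive to find, cheap to verify."""
    target = 1 << (256 - difficulty)
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))

nonce = solve("server-issued-token")         # the slow part, done client-side
assert verify("server-issued-token", nonce)  # the cheap part, done server-side
```

Which is the commenter’s point: this imposes a cost rather than telling humans and bots apart, and a well-funded scraping operation pays that cost easily.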
Re: Re: Re:3
From this I can only conclude you don’t understand how it works. I’ll just make an analogy that even you should understand: Some people never lock the door on their houses because they live in a “safe neighborhood” up until they get burglarized, at which point they usually invest in more locks and security.
You still don’t understand how it works. Processing power costs money when used, and sites employ mitigation techniques so they don’t have to pay increased costs due to increased load and traffic from bots and AI scrapers. Add DoS attacks on top of that, and running a site can get very expensive fast.
Either you haven’t read his book or you didn’t understand it at all. I guess it’s the latter based on your complaints.
Hopefully anti-AI AI detection systems are not the giant energy sinkhole most other AI is.
TBH, I think we may have to resort to the tactics that finally made progress against email spam: RBLs that block entire netblocks that repeatedly originate AI training-data scans. Yes, without regard to who else that impacts. The infrastructure companies who host the AI training-data operations won’t consider it a problem until it’s their own customers complaining to them about being blocked. It’s not very nice but, as with spam, everything else we try doesn’t seem to be working.
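For context, an RBL/DNSBL check is just a DNS lookup: the client’s IP address is reversed and queried under the blocklist’s zone, exactly as mail servers do against spam lists. A minimal sketch, with a hypothetical zone name since no AI-scraper RBL actually exists here:

```python
# Minimal sketch of a DNSBL/RBL lookup, the mechanism used against email
# spam. The zone "ai-scrapers.example.invalid" is hypothetical; real lists
# (e.g. the Spamhaus zones for mail) work the same way.
import socket

def is_listed(ip: str, zone: str = "ai-scrapers.example.invalid") -> bool:
    """An IPv4 address a.b.c.d is listed if d.c.b.a.<zone> resolves."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True   # any A record means "listed"
    except socket.gaierror:
        return False  # NXDOMAIN means "not listed"

if is_listed("203.0.113.7"):
    print("drop connection")
```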
Re:
Incidentally, Spamhaus is discussing this. I’ve noticed that my ISP (Altice/Optimum) is hosting LLM training, and all these countermeasures (including Cloudflare’s) appear to be using ASes as the grain for blocking, causing me to have to check those stupid boxes all. The. Time. This is not sustainable.
Common Crawl?
Surely this is exactly the problem that Common Crawl was meant to solve?
Has Common Crawl broken down, or are the current round of trainers just ignoring (or ignorant of) it? Because it seems that, at least for a little while, we had a solution to this problem in hand that closely aligned with the values of most GLAM institutions.
Re:
The problem is non-textual data. Images, for example, are not in Common Crawl.
Because you are much better protected against copyright and other infringement claims by hosting only URLs rather than the content itself, basically everybody hosts collections composed of lists of URLs, forcing every user to re-download everything. For example, COYO takes that approach, and many of the images are unreachable: https://github.com/kakaobrain/coyo-dataset/tree/main/download#missing-images
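That design choice is what multiplies the load: every consumer of a URL-list dataset has to fetch every image from its original host, and link rot only surfaces at download time. A sketch of the pattern (the manifest name and “url” column are assumptions, not COYO’s actual schema):

```python
# Sketch of why URL-list datasets hammer origin servers: each user of the
# dataset re-fetches every image themselves. The manifest file name and
# "url" column are illustrative, not COYO's actual schema.
import csv
import urllib.request

def fetch_all(manifest_path: str) -> None:
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                with urllib.request.urlopen(row["url"], timeout=10) as resp:
                    data = resp.read()  # yet another hit on the origin server
            except Exception:
                print("unreachable:", row["url"])  # link rot surfaces here

fetch_all("image_manifest.csv")  # hypothetical file
```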
Login wall's second problem
Regarding moving collections behind a login:
Not only are you saying goodbye to convenient access for users browsing a website (which could reduce your traffic, because users don’t like this; it’s why BugMeNot exists), it also costs you search indexing (if Google cannot find a sentence the user has searched for because it’s behind a login wall, the page won’t appear in the results). This is what robots.txt handled before AI wreaked havoc on the web.
News sites are already facing this problem: Google forces them either to allow AI training on their works or to not appear in search results at all, on top of the declining traffic they already risk.
Re:
I wasn’t aware of this, but then I haven’t used Google for around a decade. I used to use OneSearch until I discovered Brave Search, then I ditched that in favor of Startpage, because Brave Search seems to need cookies and still has a problem maintaining the appearance set by the user on its homepage, whereas Startpage manages to maintain the set appearance on its homepage as well as the results pages with just a generated link that contains all of one’s settings.
Re: Re:
You may also be interested in Mullvad Leta, which searches Brave or Google without their anti-user bullshit.
Re: Re: Re:
I can’t access that one because the library computer blocks it (I’m quite frankly astonished it doesn’t block any search engines that aren’t Google or Bing, TBH).
ISP/hosting company solutions
Surely (don’t call me Shirley!), given that AI training traffic is a widespread problem, it would be quite a selling point to offer protection against it, just like protection from DDoS attacks and from spam with email systems. AI traffic really does fall into that category. I imagine such offerings will appear in the next year or so.
I agree it would be great to have an open source system and RBLs to deal with this at the server level.
Proxies for Bots / Global Proxy Networks
Do a quick search for “proxies for bots” and a hypothesis develops: hundreds of millions of individual consumers/users get “free” internet access simply by willingly being an active part of a global proxy network. Not sure how Cloudflare or anyone else will be able to distinguish quasi-random proxy networks from naturally occurring web traffic. Definitely a race to the bottom, nevertheless.