Tackling The AI Bots That Threaten To Overwhelm The Open Web
from the overrunning-the-commons dept
It is a measure of how fast the field of AI has developed in the three years since Walled Culture the book (free digital versions available) was published that the issue of using copyright material for training AI systems, briefly mentioned in the book, has become one of the hottest topics in the copyright world, as numerous posts on this blog attest.
The current situation sees the copyright industry pitted against the generative AI companies. The former wants to limit how copyright material can be used, while the latter want a free-for-all. But that crude characterization does not mean that the AI companies can be regarded as on the side of the angels when it comes to broadening access to online material. They may want unfettered access for themselves, but it is becoming increasingly clear that as more companies rush to harvest key online resources for AI training purposes, they risk hobbling access for everyone else, and even threaten the very nature of the open Web.
The problem is particularly acute for non-commercial sites offering access to material for free, because they tend to be run on a shoestring, and are thus unable to cope easily with the extra demand placed on their servers by AI companies downloading holdings en masse. Even huge sites like the Wikimedia projects, which describe themselves as “the largest collection of open knowledge in the world”, are struggling with the rise of AI bots:
We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
Specifically:
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
A valuable new report from the GLAM-E Lab explores how widespread this problem is in the world of GLAMs – galleries, libraries, archives, and museums. Here’s the main result:
Bots are widespread, although not universal. Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic.
Although the sites that responded to the survey were generally keen for their holdings to be accessed, there comes a point where AI bots are degrading the service to human visitors. The question then becomes: what can be done about it?
There is already a tried and tested way to block bots: robots.txt, a tool that “allows websites to signal to bots which parts of the site the bots should not visit. Its most widely adopted use is to indicate which parts of sites should not be indexed by search engines,” as the report explains. However, there is no mechanism for enforcing the robots.txt rules, which often leads to problems:
Respondents reported that robots.txt is being ignored by many (although not necessarily all) AI scraping bots. This was widely viewed as breaking the norms of the internet, and not playing fair online.
Reports of these types of bots ignoring robots.txt are widespread, even beyond respondents. So widespread, in fact, that there are currently a number of efforts to develop new or updated robots.txt-style protocols to specifically govern AI-related bot behavior online.
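To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler honours robots.txt, using Python’s standard-library urllib.robotparser; the bot name “ExampleAIBot” and the rules shown are illustrative, not taken from the report:

```python
# Minimal sketch: how a well-behaved crawler honours robots.txt.
# Uses only the Python standard library; "ExampleAIBot" is an
# illustrative user-agent name, not one named in the report.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant scraper checks before every request:
print(parser.can_fetch("ExampleAIBot", "/collections/image1.jpg"))   # False
# Ordinary user agents are unaffected by the AI-specific rule:
print(parser.can_fetch("SomeOtherAgent", "/collections/image1.jpg")) # True
```

The catch, as the survey responses make clear, is that nothing forces a bot to run such a check: robots.txt is a polite convention, not an access control.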
One solution is to use a firewall to block traffic according to certain rules: for example, by IP address, by geography, or by particular domain. Another is to offload the task of blocking to a third party. The most popular among survey respondents is Cloudflare:
One [respondent] noted that, although they can still see the bot traffic spikes in their Cloudflare dashboard, since implementing protections, none of those spikes had managed to negatively impact the system. Others appreciated the effectiveness of Cloudflare but worried that an environment of persistent bot traffic would mean they would have to rely on Cloudflare in perpetuity.
And that means paying Cloudflare in perpetuity, which for many non-profit sites is a challenge, as is simply increasing server capacity or moving to a cloud-based system – other ways of coping with surges in demand. A radically different approach to tackling AI bots is to move collections behind a login. But for many in the GLAM world, there is a big problem with this kind of shift:
the larger objection to moving works behind a login screen was philosophical. Respondents expressed concern that moving work behind a login screen, even if creating an account was free, ran counter to their collection’s mission to make their collections broadly available online. Their goal was to create an accessible collection, and adding barriers made that collection less available.
More generally, this would be a terrible move for the open Web, which has frictionless access to knowledge at its heart. Locking things down simply to keep out the AI bots would go against that core philosophy completely. It would also bolster arguments frequently made by the copyright industry that access to everything online should by default require permission.
It seems unfair that groups working for the common good are forced by the onslaught of AI bots to carry out extra work constantly re-configuring firewalls, to pay for extra services, or to undermine the openness that lies at the heart of their missions. An article on the University of North Carolina Web site discussing how the university’s library tackled this problem of AI bots describes an interesting alternative approach that could offer a general solution. Faced with a changing pattern of access by huge numbers of AI bots, the library brought in local tech experts:
[Associate University Librarian for Digital Strategies & Information Technology] Shearer turned to the University’s Information Technology Services, which serves the entire campus. They had never encountered an attack quite like this either, and they readily brought their security and networking teams to the table. By mid-January a powerful AI-based firewall was in place, blocking the bots while permitting legitimate searches.
Stopping just the AI bots requires spotting patterns in access traffic that distinguish them from human visitors, in order to allow the latter to continue with their visits unimpeded. Finding patterns quickly in large quantities of data is something that modern AI is good at, so using it to filter out the constantly shifting patterns of AI bot access by tweaking the site’s firewall rules in real time is an effective solution. It’s also an apt one: it means that the problems AI is creating can be solved by AI itself.
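To make the general shape of that approach concrete, here is a deliberately simplified, non-AI sketch of the pipeline such a system automates: read an access log, build per-client features, flag outliers, emit block rules. The log format, thresholds, and nftables rule syntax are illustrative assumptions, not details of the UNC library’s actual system:

```python
# Simplified sketch of bot-detection-to-firewall pipeline.
# All thresholds and formats are illustrative assumptions.
import re
from collections import defaultdict

# Matches the start of a Common Log Format line, e.g.
# 203.0.113.7 - - [10/Jun/2025:12:00:00 +0000] "GET /item/123 HTTP/1.1" ...
LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<path>\S+)')

def suspicious_clients(log_lines, max_requests=1000, min_distinct_ratio=0.9):
    """Flag clients whose volume and crawl shape look bot-like: very many
    requests, almost never revisiting the same path (humans browse and
    revisit; scrapers enumerate)."""
    hits = defaultdict(list)
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            hits[m.group("ip")].append(m.group("path"))
    flagged = []
    for ip, paths in hits.items():
        distinct_ratio = len(set(paths)) / len(paths)
        if len(paths) > max_requests and distinct_ratio > min_distinct_ratio:
            flagged.append(ip)
    return flagged

def block_rules(ips):
    """Turn flagged addresses into firewall rules (nftables-style syntax,
    shown for illustration; assumes an existing "inet filter input" chain)."""
    return [f"nft add rule inet filter input ip saddr {ip} drop" for ip in ips]

if __name__ == "__main__":
    with open("access.log") as f:  # log path is an assumption
        for rule in block_rules(suspicious_clients(f)):
            print(rule)
```

A real system would replace the hand-tuned thresholds with a model that re-learns the traffic patterns as the bots change theirs; that continuous re-learning is the part the article is calling AI.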
Such an AI-driven firewall management system needs to be created and updated to keep ahead of the rapidly evolving AI bot landscape. It would make a great open source project that coders and non-profits around the world could work on together, since the latter face a common problem, and many have too few resources to tackle it on their own. Open source applications of the latest AI technologies are rather thin on the ground, even though most generative AI systems are built on open source code. An AI-driven firewall management system optimized for the GLAM sector would be a great place for the free software world to start remedying that.
Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.


Comments on “Tackling The AI Bots That Threaten To Overwhelm The Open Web”
"And that means paying Cloudflare in perpetuity"
Which is undesirable given that Cloudflare is one of the largest sources of abusive AI scans.
They’re very concerned about what gets into their operation, but they don’t care even a little bit about what comes out.
Cloudflare “protections” also DoS your own system[0]. They needlessly burn resources on visitors and, by mandating code execution on the visitor’s machine, make everyone more vulnerable to exploits. Furthermore, by terminating the TLS tunnel on Cloudflare’s servers, they literally man-in-the-middle all their customers. A cloud flare. For hostile parties, gaining access to Cloudflare’s servers is, or would be, a great single target from which to exploit vast swaths of the internet.
I’m sure the list could go on. Point is, Cloudflare has LOTS of unhealthy aspects to it.
[0] https://www.devever.net/~hl/cloudflare (see the second bullet under “Conclusions”; for the evidence and arguments, read the full article)
Really? It’s been pointed out many times, over several years, that the “free digital versions” link blocks many potential readers, including me—I might be a robot, you see. I suspect that I’m not, but, then, isn’t that what the artificial intelligences always think?
I guess it’s easy to talk up the open web from behind your culture-wall.
Extra irony: to gain access I’d have to enable cookies and solve a CAPTCHA made up of wavy letters and numbers. I’m told actual bots are better at such things than humans now.
What the hell happened to all the computing advancements of the last 25 years, anyway? Dan Kegel wrote “the C10k problem” in 1999, about the difficulty of serving 10,000 clients at a time over a gigabit link. At the time, one of the busiest sites was handling 3,600 clients over a 70 Mbit/s link, but the Linux and BSD people quickly solved the scalability problems, and by 2014 people were doing 10 million simultaneous clients. And then we had another decade of improvements to core count, speed, memory, and of course storage (M.2 was introduced in 2013). So even if there were thousands of bots hitting a site all the time, I don’t see why it should be such a huge problem; if it is, maybe fixing the underlying scalability limitations would be a better use of time than bot-detection.
Re:
It’s been pointed out before that the book is available elsewhere, such as Archive.org. For all your complaints, you can’t be bothered to search for it.
https://archive.org/details/walledculturehow0000mood
Re: Re:
You’re missing the point. The author is complaining about sites blocking people—this being against some vague “core philosophy”—while doing basically the same thing.
Yeah, the archive.org version’s been posted before, but Glyn keeps linking to the version that blocks some people, while talking about openness.
Re:
Honestly, I can’t read this and come out with a coherent meaning. I thought maybe it was complaining that the link requires… SOMETHING of people (JavaScript, a CAPTCHA, whatever) to get the free edition… so I literally just followed the link in question. It took me to a page with a “download pdf” button that I clicked… and downloaded a PDF. And by the way, I browse with JavaScript off[0].
Maybe try restating what the complaint is, because at least some of us are not getting anything coherent.
[0] Of course, it’s entirely possible that the site is protected by Cloudflare, and random people will randomly get denied access unless they capitulate to Cloudflare’s demands. I’ve had that happen to me (at which point I go away). Usually within a day to a few weeks Cloudflare will move on to harassing other people/connections. IMHO this is definitely a problem for the web.
Re: Re:
That’s basically the problem, except I think it’s not Cloudflare. It redirects me, through several steps, to a page that says:
If I enabled cookies, I might be able to get past that by solving the CAPTCHA. An actual bot would get past it more easily, I suspect. And of course some people don’t get hit with these screens, for various reasons, and don’t realize there’s any problem at all. The blocking page doesn’t even say why they need “to tell the difference between humans and bots”; Glyn’s post says a bit, but a 50% increase in traffic doesn’t seem much like a crisis to me.
But, to be clear, my main complaint is more about the hypocrisy than being unable to access the book. Glyn has been linking the site for years and it’s always given me trouble; “a terrible move for the open Web” that was happening long before anyone’s hand was “forced by the onslaught of AI bots”.
(I’ve posted maybe twice, before today, about the blocking. There have been other comments by at least one other person. Usually someone posts an archive.org link.)
Re: Re: Re:
Your main complaint is kind of stupid, because I don’t see you complaining about the need to buy a computer and internet access to get a free book.
That you have chosen to use the internet in a way that differs from how almost everyone else uses it is entirely a “you” problem.
What you also fail to understand is that Techdirt and its associated sites get DoSed on occasion, which necessitates mitigation measures.
So in the end, you are blaming TD for your own behavior and the behavior of bad actors. I guess you also think it is too difficult to spin up an incognito tab to download the book if you don’t like cookies, because that would mean the book isn’t free anymore.
Re: Re: Re:2
So… has the site hosting this book just happened to be under denial-of-service attacks on every “occasion” when I’ve tried to access it? That seems implausible.
Plus, Glyn is now blaming A.I. load rather than some targeted attack. I still don’t understand how giving me tasks, such as solving CAPTCHAs or running proof-of-work Javascript, helps with that. Those are things computers have become better than humans at. In particular, these A.I. companies have billions of dollars of computing power available to them; the “wastefulness” is a common complaint. And I don’t understand why modern servers that can handle millions of connections at once are having such trouble (except with huge files like videos; some HTML or a book shouldn’t be a problem).
What? I don’t generally have problems accessing Techdirt. I’m talking about the Walled Culture site, which I believe is Glyn’s. And Glyn’s the one complaining about how the sort of thing Glyn has been doing for years is harming the open web.
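For readers who haven’t met it: “proof-of-work JavaScript” asks the visitor’s browser to burn CPU finding a value whose hash has a particular shape, which the server can then verify cheaply. A minimal sketch of the idea (the difficulty value and token are illustrative):

```python
# Minimal sketch of a hashcash-style proof-of-work challenge, the kind of
# computation anti-bot JavaScript pushes onto a visitor's browser.
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int = 20) -> int:
    """Find a nonce whose SHA-256 digest, combined with the challenge,
    falls below a target: expensive to find, cheap to verify."""
    target = 1 << (256 - difficulty)
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))

nonce = solve("server-issued-token")         # the slow part, done client-side
assert verify("server-issued-token", nonce)  # the cheap part, done server-side
```

Which is the commenter’s point: this imposes a cost rather than telling humans and bots apart, and a well-funded scraping operation pays that cost easily.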
Re: Re: Re:3
From this I can only conclude you don’t understand how it works. I’ll just make an analogy that even you should understand: Some people never lock the door on their houses because they live in a “safe neighborhood” up until they get burglarized, at which point they usually invest in more locks and security.
You still don’t understand how it works. Processing power costs money when used, and sites employ mitigation techniques so they don’t have to pay increased costs due to increased load and traffic from bots and AI scrapers. Add DoS attacks on top of that, and running a site can get very expensive fast.
Either you haven’t read his book or you didn’t understand it at all. I guess it’s the latter based on your complaints.
Hopefully anti-AI AI detection systems are not the giant energy sinkhole most other AI is.
TBH, I think we may have to resort to the tactics that finally made progress against email spam: RBLs that block entire netblocks that repeatedly originate AI training-data scans. Yes, without regard to who else that impacts. The infrastructure companies who host the AI training-data operations won’t consider it a problem until it’s their own customers complaining to them about being blocked. It’s not very nice but, as with spam, everything else we try doesn’t seem to be working.
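For context, an RBL/DNSBL check is just a DNS lookup: the client’s IP address is reversed and queried under the blocklist’s zone, exactly as mail servers do against spam lists. A minimal sketch, with a hypothetical zone name since no AI-scraper RBL actually exists here:

```python
# Minimal sketch of a DNSBL/RBL lookup, the mechanism used against email
# spam. The zone "ai-scrapers.example.invalid" is hypothetical; real lists
# (e.g. the Spamhaus zones for mail) work the same way.
import socket

def is_listed(ip: str, zone: str = "ai-scrapers.example.invalid") -> bool:
    """An IPv4 address a.b.c.d is listed if d.c.b.a.<zone> resolves."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True   # any A record means "listed"
    except socket.gaierror:
        return False  # NXDOMAIN means "not listed"

if is_listed("203.0.113.7"):
    print("drop connection")
```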
Re:
Incidentally, Spamhaus is discussing this. I’ve noticed that my ISP (Altice/Optimum) is hosting LLM training, and all these countermeasures (including Cloudflare’s) appear to be using ASes as the grain for blocking, causing me to have to check those stupid boxes all. The. Time. This is not sustainable.
Common Crawl?
Surely this is exactly the problem that Common Crawl was meant to solve?
Has Common Crawl broken down, or are the current round of trainers just ignoring (or ignorant of) it? Because it seems that, at least for a little while, we had a solution to this problem in hand that closely aligned with the values of most GLAM institutions.
Re:
The problem is non-textual data. Images, for example, are not in Common Crawl.
Because you are much better protected against copyright and other infringement claims by hosting only URLs rather than the content itself, basically everybody hosts collections composed of lists of URLs, forcing every user to re-download everything. For example, COYO takes that approach, and many of the images are unreachable: https://github.com/kakaobrain/coyo-dataset/tree/main/download#missing-images
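That design choice is what multiplies the load: every consumer of a URL-list dataset has to fetch every image from its original host, and link rot only surfaces at download time. A sketch of the pattern (the manifest name and “url” column are assumptions, not COYO’s actual schema):

```python
# Sketch of why URL-list datasets hammer origin servers: each user of the
# dataset re-fetches every image themselves. The manifest file name and
# "url" column are illustrative, not COYO's actual schema.
import csv
import urllib.request

def fetch_all(manifest_path: str) -> None:
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                with urllib.request.urlopen(row["url"], timeout=10) as resp:
                    data = resp.read()  # yet another hit on the origin server
            except Exception:
                print("unreachable:", row["url"])  # link rot surfaces here

fetch_all("image_manifest.csv")  # hypothetical file
```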
Login wall's second problem
Regarding moving collections behind a login:
Not only are you saying goodbye to convenient access for users browsing a website (which could reduce your traffic, because users don’t like this; it’s why BugMeNot exists), it also costs you search indexing (if Google cannot find a sentence the user has searched for because it’s behind a login wall, the page won’t appear in the results). This is what robots.txt handled before AI wreaked havoc on the web.
News sites are already facing this problem: Google forces them either to allow AI training on their works or to not appear in search results at all, on top of the declining traffic they already risk.
Re:
I wasn’t aware of this, but then I haven’t used Google for around a decade. I used to use OneSearch until I discovered Brave Search, then I ditched that in favor of Startpage, because Brave Search seems to need cookies and still has a problem maintaining the appearance set by the user on its homepage, whereas Startpage manages to maintain the set appearance on its homepage as well as the results pages with just a generated link that contains all of one’s settings.
Re: Re:
You may also be interested in Mullvad Leta, which searches Brave or Google without their anti-user bullshit.
Re: Re: Re:
I can’t access that one because the library computer blocks it (I’m quite frankly astonished it doesn’t block any search engines that aren’t Google or Bing, TBH).
ISP/hosting company solutions
Surely (don’t call me Shirley!), given that AI training traffic is a widespread problem, it would be quite a selling point to offer protection against it, just like protection from DDoS attacks and from spam with email systems. AI traffic really does fall into that category. I imagine such offerings will appear in the next year or so.
I agree it would be great to have an open source system and RBLs to deal with this at the server level.
Proxies for Bots / Global Proxy Networks
Do a quick search for “proxies for bots” and a hypothesis develops: hundreds of millions of individual consumers/users get “free” internet access simply by willingly being an active part of a global proxy network. Not sure how Cloudflare or anyone else will be able to distinguish quasi-random proxy networks from naturally occurring web traffic. Definitely a race to the bottom, nevertheless.