AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk
from the externalizing-your-costs-directly-into-my-face dept
The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:
We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
Specifically:
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:
While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
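A toy simulation makes the economics of that concrete. Nothing below reflects Wikimedia’s actual architecture or numbers; it is just a generic LRU cache hit first by popularity-skewed “human” traffic and then by uniform “scraper” traffic:

```python
# Toy illustration (not Wikimedia's real setup) of why "bulk read" traffic is
# so much more expensive: a small cache absorbs human requests, which cluster
# on popular pages, but misses constantly when a bot walks the long tail,
# pushing those requests through to the origin infrastructure.
import random
from collections import OrderedDict

PAGES = 100_000       # hypothetical number of pages
CACHE_SIZE = 1_000    # hypothetical edge-cache capacity
REQUESTS = 50_000

def hit_rate(request_stream):
    cache = OrderedDict()   # simple LRU cache
    hits = 0
    for page in request_stream:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            cache[page] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)   # evict the least recently used page
    return hits / REQUESTS

# Human readers: requests heavily skewed towards a small set of popular pages.
humans = (int(random.paretovariate(1.2)) % PAGES for _ in range(REQUESTS))
# Scraper: visits pages more or less uniformly, including the obscure ones.
scraper = (random.randrange(PAGES) for _ in range(REQUESTS))

print(f"human-like traffic, cache hit rate:   {hit_rate(humans):.0%}")
print(f"scraper-like traffic, cache hit rate: {hit_rate(scraper):.0%}")
```

Even in this crude model the skewed traffic is served overwhelmingly from the cache, while the uniform crawl misses almost every time, and each of those misses is a request the core datacenter has to absorb.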
Wikimedia’s analysis shows that 65% of this most resource-intensive traffic comes from bots, even though bots account for only about 35% of total pageviews. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.
These are open source projects whose Web presence offers a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew DeVault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:
These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
DeVault says that he knows many other Web sites are similarly affected:
All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.
The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, then monitoring and fine-tuning them, requires time and energy from those running the sites — time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.
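To give a sense of what that setup work involves, here is a minimal sketch (not anything Wikimedia, SourceHut, or LibreNews actually describe, and with purely illustrative thresholds and user-agent strings) of the kind of per-IP rate limiting and bot filtering a small Python-backed site might try first:

```python
# Illustrative only: a crude WSGI middleware that throttles crawler-like
# traffic. The user-agent blocklist and rate limits are made-up examples,
# not a list any real site necessarily uses.
import time
from collections import defaultdict, deque

BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider")  # hypothetical blocklist
MAX_REQUESTS = 60        # per window, per IP (arbitrary threshold)
WINDOW_SECONDS = 60

class CrawlerThrottle:
    def __init__(self, app):
        self.app = app
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        ua = environ.get("HTTP_USER_AGENT", "")
        now = time.time()

        # Reject anything that self-identifies as a known scraper.
        if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling not permitted.\n"]

        # Sliding-window rate limit per client IP.
        window = self.hits[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        window.append(now)
        if len(window) > MAX_REQUESTS:
            start_response("429 Too Many Requests", [("Retry-After", "60")])
            return [b"Slow down.\n"]

        return self.app(environ, start_response)
```

As DeVault’s quote above makes clear, this is exactly the kind of filter the stealthier crawlers defeat: tens of thousands of residential IPs, each making a single request with a browser-like User-Agent, never trip either check, which is why site operators end up sinking ever more time into heavier countermeasures.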
An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:
Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.
It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.
Follow me @glynmoody on Mastodon and on Bluesky.
Filed Under: access to knowledge, ai, apis, bandwidth, bots, datacenter, drew devault, licensing, llms, open source, open web, publishers, scraping, sysadmins, training data, web crawlers, wikimedia
Companies: sourcehut


Comments on “AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk”
According to the proponents of regurgitation engines, it’s just like people going into a library to read, why you gotta be hatin’ man, don’t you know it’s all in the name of progress
Bah.
Re:
It occurs to me that these regurgitation engines are doing two things:
– Increasing the costs on the organisations that are producing original content
– Reducing the traffic and income for those same organisations
The end result of that is that the creators disappear. Leaving nobody to feed the regurgitation engines.
I haven’t heard anyone talk about what happens next, but I doubt I’ll like it.
Re:
I still remember and occasionally get angry about that one idiot TechDirt gave space to that tried to analogize ‘AI’ to ‘having an autistic person’.
@drew you can confirm? 😉
Re:
Well, in my tiny sample size of one, I can definitely confirm that both those things are happening.
I think the end result is more likely to be greater consolidation with the big players, as they can bear the costs.
This is why I’ve offloaded most of my Open Source content to GitHub. I figure Microsoft can bear the brunt of the crawls, since it’s supporting the AI data hoarding in the first place. Then I can host my project pages elsewhere on a lightweight framework that pulls resources from GitHub repositories.
I understand that that’s not for everyone though; some people don’t like depending on Microsoft to host (or arbitrarily block) their content.
But, if you have all the content checked out, you can host on GitHub and wire up your main server in such a way that you can redirect to pulling assets from a different source whenever you want. So you can even mirror all project activity on a local repo, and switch between them as needed. This lets MS handle the bulk of the scraping traffic, while contributors can work away uninhibited on the other repository with pushes making their way back to the public repository as necessary.
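For anyone curious, the sync itself can be as small as a scheduled mirror push. A rough sketch, with placeholder paths and URLs rather than my actual setup:

```python
# Rough sketch of mirroring a locally hosted repository to a GitHub remote,
# so GitHub absorbs the scraper traffic while contributors keep pushing to
# the local server. The path and URL below are placeholders.
import subprocess

LOCAL_REPO = "/srv/git/myproject.git"                    # hypothetical local bare repo
GITHUB_MIRROR = "git@github.com:example/myproject.git"   # hypothetical public mirror

def sync_mirror():
    # Push all refs (branches and tags) to the public mirror; run this from
    # cron or a post-receive hook so the mirror stays current.
    subprocess.run(
        ["git", "push", "--mirror", GITHUB_MIRROR],
        cwd=LOCAL_REPO,
        check=True,
    )

if __name__ == "__main__":
    sync_mirror()
```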
Stick the main page behind something like Cloudflare, and you make the scraping a Big Business problem there as well.
100% of library websites are affected
I run (among other things) a large image database, and we’ve been absolutely hammered by bots, starting about 6 months ago.
Finally we caved and are now paying Cloudflare to put a stop to it, which has worked pretty well.
AI companies – including Google – should be paying this bill, given that they don’t respect robots.txt.
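For contrast, honoring robots.txt takes almost no effort; here is a rough sketch using only the Python standard library (the crawler name and URLs are made-up placeholders):

```python
# What a well-behaved crawler is supposed to do before fetching anything:
# check robots.txt for its own user agent. The agent name and URLs here
# are illustrative placeholders.
from urllib import robotparser

AGENT = "ExampleTrainingBot"   # hypothetical crawler name

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()   # fetch and parse the site's robots.txt

for url in ("https://example.org/images/12345.jpg",
            "https://example.org/search?q=everything"):
    if rp.can_fetch(AGENT, url):
        print(f"allowed:  {url}")
    else:
        print(f"skipping: {url}")   # a polite crawler stops here
```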
Need a way to poison LLMs. A true HCF scenario would be a great bonus.
Given how the average techbro leans on the political spectrum and how Wikipedia has long been a target of the guillotinable class in general as it cannot be bought or gamed that easily, this bombardment is likely by design. If you can’t have PR people win edit wars, if you can’t write laws to remove articles on history that is inconvenient for the white nationalists in power, making the site unusable through brute force exploitation and draining resources is the next best thing.
Re:
I wouldn’t be surprised if there were some element of that, but it’s happening to a lot of sites. Popularity and visibility appear to be factors but it doesn’t seem especially targeted. They just want to gobble up everything they can, indiscriminately.
It’s an increasing problem with an audio engineering site I’m a member of. Site slowdowns and page load fails are happening nearly daily; two years ago they were very rare.
Unfortunately, restricting access to general knowledge seems to particularly benefit one party.
So even if congress doesn’t kill the open web, AI bros are gonna do it?
Marvelous, simply marvelous.
We just weren’t meant to have this kind of space I guess.
Re:
Cool cool. Now you can fuck off.
I believe Cloudflare has a solution that puts the crawler in a loop of BS…
Re:
This one: https://developers.cloudflare.com/bots/get-started/bot-fight-mode/
So what happens if they start hoovering Uncyclopedia content (or other encyclopaedia parodies in the same series, such as Brazil’s Desciclopédia)? These are relatively low-volume sites (Wikia/FANDOM tried to utterly destroy the project in 2019, but it limps along under independent hosting) and the Wikimedia Foundation made a very sleazy attempt this year to get Finland’s “sweat” or “hiki” encyclopedia shut down on trumped-up trademark grounds.
These sites are nonsense. They were created as parody. They look like encyclopaedias, but the joke is that they let any moron edit pages. Sound familiar?
There are only two plausible outcomes, neither of them good. One is that the bots slow the sites (which are non-commercial and small) so badly that they’re unusable for their intended human audience. The other undesirable outcome is that the bots actually do manage to download an entire Uncyclopedia and the end result is not Artificial Intelligence but Artificial Stupidity.
I can see that one coming from a mile (1.609km) away.
Re: Wikimedia trademarks
What?
This is why we can’t have nice things. Poo!
Re:
They probably already are. But while other sites are still producing content, those sites are a low priority for regurgitation.
Until they kill the other sites by increased costs and reduced visitors.
Is this why so many sites make me identify bicycles?
So many sites now insist I prove I am human by forcing me to do tasks that make me feel like I am a robot.
This also explains the link-tax angle of charging to crawl
What well-known search engine crawls the web? That’s right, Google, along with some other big tech companies. The EU, Australia (News Media Bargaining Code), and Canada (Online News Act) were desperate to extract money from them. Ads pay less, not many users are subscribing, and the search engines aren’t necessarily doing the news sites a favor either (sites are stuck choosing between being indexed, and thereby used for AI training, or not appearing in Google’s search results at all).