AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk

from the externalizing-your-costs-directly-into-my-face dept

The current rapid advances in generative AI are built on three things: computing power, some clever coding, and vast amounts of training data. Lots of money can buy you more of the first two, but finding the necessary training material is increasingly hard. Anyone seeking to bolster their competitive advantage through training needs to find fresh sources. This has led to the widespread deployment of AI crawlers, which scour the Internet for more data that can be downloaded and used to train AI systems. Some of the prime targets for these AI scraping bots are Wikimedia projects, which claim to be “the largest collection of open knowledge in the world”. This has now become a serious problem for them:

We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

Specifically:

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

AI crawlers seek to download as much material as possible, including the most obscure, so Wikimedia projects that are optimized for human use incur extra costs:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.

Wikimedia’s analysis shows that 65% of this resource-consuming traffic is coming from bots, whereas the overall pageviews from bots are about 35% of the total. As the Diff news story notes, this is becoming a widespread problem not just for Wikimedia, but across the Internet. Some companies are responding with lawsuits, but for another important class of sites this is not a practical option.

These are the open source projects that have a Web presence with a wide range of resources. Many of them are struggling under the impact of aggressive AI crawlers, as a post by Niccolò Venerandi on the LibreNews site details. For example, Drew DeVault, the founder of the open source development platform SourceHut, wrote a blog post last month with the title “Please stop externalizing your costs directly into my face”, in which he lamented:

These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

DeVault says that he knows many other Web sites are similarly affected:

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.

The LibreNews article discusses some of the technical approaches to excluding these AI crawlers. But setting them up, monitoring them, and fine-tuning them requires time and energy from those running the sites, time that could have been spent more fruitfully on managing the actual projects. Similarly, the unexpected extra bandwidth costs caused by massive bot downloads come out of the small and often stretched budgets of open source projects. There is a clear danger that these LLM bots will cause open source projects to struggle, and possibly shut down completely.

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out:

Crawlers from academic researchers, journalists, and non-AI applications may increasingly be denied open access. Unless we can nurture an ecosystem with different rules for different data uses, we may end up with strict borders across the web, exacting a price on openness and transparency.

It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.

Follow me @glynmoody on Mastodon and on Bluesky.

Companies: sourcehut


Comments on “AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk”

24 Comments
Bilateralrope (profile) says:

Re:

It occurs to me that these regurgitation engines are doing two things:
– Increasing the costs on the organisations that are producing original content
– Reducing the traffic and income for those same organisations

The end result of that is that the creators disappear. Leaving nobody to feed the regurgitation engines.

I haven’t heard anyone talk about what happens next, but I doubt I’ll like it.

Anonymous Coward says:

This is why I’ve offloaded most of my Open Source content to GitHub. I figure Microsoft can bear the brunt of the crawls, since it’s supporting the AI data hoarding in the first place. Then I can host my project pages elsewhere on a lightweight framework that pulls resources from GitHub repositories.

I understand that that’s not for everyone though; some people don’t like depending on Microsoft to host (or arbitrarily block) their content.

But, if you have all the content checked out, you can host on GitHub and wire up your main server in such a way that you can redirect to pulling assets from a different source whenever you want. So you can even mirror all project activity on a local repo, and switch between them as needed. This lets MS handle the bulk of the scraping traffic, while contributors can work away uninhibited on the other repository with pushes making their way back to the public repository as necessary.

Stick the main page behind something like Cloudflare, and you make the scraping a Big Business problem there as well.
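The mirroring setup this commenter describes can be sketched with ordinary git remotes. In this minimal sketch the repository names are hypothetical, and the public “GitHub” mirror is simulated with a local bare repository:

```shell
set -e
# Hypothetical sketch of the setup described above: a local working repo
# plus a public mirror that absorbs clone/scrape traffic. In practice the
# mirror remote would be a github.com URL rather than a local path.
git init -q project
cd project
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "initial commit"
git init -q --bare ../public-mirror.git
git remote add mirror ../public-mirror.git
git push -q --mirror mirror   # push all branches and tags to the mirror
```

Contributors keep pushing to the working repository, and a cron job or post-receive hook re-runs `git push --mirror` so the public copy stays current; swapping the remote URL later redirects traffic without disturbing the working repo.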

mick says:

100% of library websites are affected

I run (among other things) a large image database, and we’ve been absolutely hammered by bots, starting about 6 months ago.

Finally we caved and are now paying Cloudflare to put a stop to it, which has worked pretty well.

AI companies – including Google – should be paying this bill, given that they don’t respect robots.txt.

Bloof (profile) says:

Given how the average techbro leans on the political spectrum and how Wikipedia has long been a target of the guillotinable class in general as it cannot be bought or gamed that easily, this bombardment is likely by design. If you can’t have PR people win edit wars, if you can’t write laws to remove articles on history that is inconvenient for the white nationalists in power, making the site unusable through brute force exploitation and draining resources is the next best thing.

Anonymous Coward says:

So what happens if they start hoovering Uncyclopedia content (or other encyclopaedia parodies in the same series, such as Brazil’s Desciclopédia)? These are relatively low-volume sites (Wikia/FANDOM tried to utterly destroy the project in 2019, but it limps along under independent hosting) and the Wikimedia Foundation made a very sleazy attempt this year to get Finland’s “sweat” or “hiki” encyclopedia shut down on trumped-up trademark grounds.

These sites are nonsense. They were created as parody. They look like encyclopaedias, but the joke is that they let any moron edit pages. Sound familiar?

There are only two inevitable outcomes, neither of them good. One is that the bots slow the sites (which are non-commercial and small) so badly that they’re unusable for their intended human audience. The other undesirable outcome is that the bots actually do manage to download an entire Uncyclopedia and the end result is not Artificial Intelligence but Artificial Stupidity.

I can see that one coming from a mile (1.609km) away.

GHB (profile) says:

This also explains the link-tax idea of charging to crawl

An article in MIT Technology Review by Shayne Longpre warns that publishers may respond to this challenge in another way, by blocking all crawlers unless they are licensed. That may solve the problem for those sites, and allow deep-pocketed AI companies to train their systems on the licensed material, but many others will lose out

What well-known search engine crawls the web? That’s right: Google, along with some other big tech companies. The EU, Australia (News Media Bargaining Code), and Canada (Online News Act) were desperate to ask for money. Ads pay less, not many users are subscribing, and the search engines aren’t necessarily doing the news sites a favor (such as leaving them stuck between being indexed and being used as AI training data, or not appearing in Google’s search index at all).
