Search Engines, AI, And The Long Fight Over Fair Use
from the don't-throw-out-fair-use dept
Long before generative AI, copyright holders warned that new technologies for reading and analyzing information would destroy creativity. Internet search engines, they argued, were infringement machines—tools that copied copyrighted works at scale without permission. As they had with earlier information technologies like the photocopier and the VCR, copyright owners sued.
Courts disagreed. They recognized that copying works in order to understand, index, and locate information is a classic fair use—and a necessary condition for a free and open internet.
Today, the same argument is being recycled against AI. The underlying question is the same: should copyright owners be allowed to control how others analyze, reuse, and build on existing works?
Fair Use Protects Analysis—Even When It’s Automated
U.S. courts have long recognized that copying for purposes of analysis, indexing, and learning is a classic fair use. That principle didn’t originate with artificial intelligence. It doesn’t disappear just because the processes are performed by a machine.
Copying works in order to understand them, extract information from them, or make them searchable is transformative and lawful. That’s why search engines can index the web, libraries can build digital indexes, and researchers can analyze large collections of text and data without negotiating licenses from millions of rightsholders. These uses don’t substitute for the original works; they enable new forms of knowledge and expression.
Training AI models fits squarely within that tradition. An AI system learns by analyzing patterns across many works. The purpose of that copying is not to reproduce or replace the original texts, but to extract statistical relationships that allow the AI system to generate new outputs. That is the hallmark of a transformative use.
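To make the “statistical relationships” point concrete, here is a toy sketch in Python. It is emphatically not how modern models are trained (they learn billions of neural-network parameters, not word-pair counts), and the two-sentence corpus is invented for illustration, but it shows the basic idea the article describes: the training step stores statistics about the texts, not the texts themselves.

```python
from collections import defaultdict
import random

def build_bigram_counts(texts):
    """Tally how often each word follows another across the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=10):
    """Sample new text from the learned word-transition statistics."""
    word, output = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        choices, weights = zip(*followers.items())
        word = random.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)

# Invented two-line "corpus", purely for illustration.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = build_bigram_counts(corpus)
print(generate(model, "the"))  # e.g. "the dog sat on the mat ..."
```

What the model retains after training is the table of transition counts; the generated output is new text sampled from those statistics. Real systems learn far richer patterns, but the relationship between training data and what is extracted from it is the same in kind.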
Attacking AI training on copyright grounds misunderstands what’s at stake. If copyright law is expanded to require permission for analyzing or learning from existing works, the damage won’t be limited to generative AI tools. It could threaten long-standing practices in machine learning and text-and-data mining that underpin research in science, medicine, and technology.
Researchers already rely on fair use to analyze massive datasets such as scientific literature. Requiring licenses for these uses would often be impractical or impossible, and it would advantage only the largest companies with the money to negotiate blanket deals. Fair use exists to prevent copyright from becoming a barrier to understanding the world. The law has protected learning before. It should continue to do so now, even when that learning is automated.
A Road Forward For AI Training And Fair Use
One court has already shown how these cases should be analyzed. In Bartz v. Anthropic, the court found that using copyrighted works to train an AI model is a highly transformative use. Training is a way of studying how language works; it is not about reproducing or supplanting the original books. Any harm to the market for the original works was speculative.
The court in Bartz rejected the idea that an AI model might infringe because, in some abstract sense, its output competes with existing works. While EFF disagrees with other parts of the decision, the court’s ruling on AI training and fair use offers a good approach. Courts should focus on whether training is transformative and non-substitutive, not on fear-based speculation about how a new tool could affect someone’s market share.
AI Can Create Problems, But Expanding Copyright Is the Wrong Fix
Workers’ concerns about automation and displacement are real and should not be ignored. But copyright is the wrong tool to address them. Managing economic transitions and protecting workers during turbulent times may be core functions of government, but copyright law doesn’t help with that task in the slightest. Expanding copyright control over learning and analysis won’t stop new forms of worker automation—it never has. But it will distort copyright law and undermine free expression.
Broad licensing mandates may also do harm by entrenching today’s biggest incumbents. Only the largest tech firms can afford to negotiate massive licensing deals covering millions of works. Smaller developers, research teams, nonprofits, and open-source projects will all get locked out. Copyright expansion won’t restrain Big Tech; it will hand it a new advantage.
Fair Use Still Matters
Learning from prior work is foundational to free expression. Rightsholders cannot be allowed to control it. Courts have rejected that move before, and they should do so again.
Search, indexing, and analysis didn’t destroy creativity. Nor did the photocopier, nor the VCR. They expanded speech, access to knowledge, and participation in culture. Artificial intelligence raises hard new questions, but fair use remains the right starting point for thinking about training.
Republished from the EFF’s Deeplinks blog.


Comments on “Search Engines, AI, And The Long Fight Over Fair Use”
There are literally people actively using AI to do this. Both directly and indirectly.
That is the definition of supplanting or replacing the original works.
Historically, this has basically never happened. Can’t really blame workers for being skeptical, especially when you aren’t offering any concrete actionable plan. It’s just hollow platitudes.
And the fact that this is even conceded as necessary implies that it does in fact supplant existing works. If it didn’t, workers wouldn’t need that protection.
Re:
Yeah, agreed, especially on that last part. The EFF talking as if things that have never happened might actually happen this time around is quite galling.
Re:
Another AI booster looking down their nose at people. Tech industry has become so uninteresting there’s nothing left but people pumping AI. Sad.
I’d like to see the EFF and Techdirt discuss the externalities of AI crawlers/scrapers, the way that they clog up sites that can’t handle the sheer traffic. You can say it’s fair use all you want. But the way that those that operate the crawlers and scrapers go about things, genuinely straining people’s websites (including Wikipedia, which has said it’s been having issues with countless requests/pulls), does not feel anywhere near “fair”.
Re:
Masnick did kind of write that post. In a nutshell, he said that sitemasters need to stop trying to hedge out the bots and try to figure out a way to profit from being constantly DDoSed. Really made me realize why apologists so often talk about the scraper issue: There’s no way to philosophize your way out of it when it’s just public destruction for private profit.
Re: Re:
That is not what I’ve ever said, nor what I believe. The only point I have made is that I don’t think sites should rely on questionable legal arguments claiming that such scraping can be sued over. But I am 100% in favor of sites figuring out technological ways to deal with overwhelming scraping.
It’s so weird how people want to put beliefs on me that I do not share.
You should maybe not do that?
Re: Re: Re:
Your solution amounts to saying that people and organizations need to “nerd harder”. Website owners need to have legal recourse or legal/regulatory protection against the owners of AI crawlers/scrapers that place undue strain on their sites. It cannot be left up to a game of whac-a-mole where site owners figure out technical solutions to stop debilitating scraping/crawling, the owners of the AI tech find workarounds and keep doing it, the site owners have to figure out how to stop that, and so on.
We risk driving a lot of the vibrant, unique, smaller sites and creative endeavors and message boards off the web, leaving only the larger sites that can either pay for hosting all the AI scraper/crawler traffic or dedicate manpower and resources to the technical-solution whac-a-mole.
Re: Re: Re:2
No, it’s not. “Nerd harder” is something different: it’s when politicians assume the impossible is possible.
I’m not asking for the impossible. Blocking scrapers is well within any site’s ability. Hell, Cloudflare now does it by default for any site.
As someone who runs a smaller site, I will say, that’s absolute nonsense.
Re: Re: Re:3
If I remember correctly, there was a point made about how blocking scrapers runs the risk of blocking out reasonable uses of AI (say, summarising a bunch of reviews for a product) and also runs the risk of sites just not being found at all.
Re: Re: Re:4
That’s similar to the news sites complaining about aggregators. They fought to block these scrapers, both technologically and legally, and then when their traffic had dropped sharply as a result, they wanted the scrapers back.
And long before that, search engines such as Google were the devil for their crushing amounts of traffic—until people realized it was necessary to allow that, and started paying others to get them good Google rankings.
I think, rather than villainizing this or that group, it makes more sense to look at why sites fall over so easily. Remember when Microsoft claimed their web server was so much faster than Linux servers, and the Linux developers unleashed a flood of activity to definitively fix that? That’s where some of the “new” servers like TUX and nginx came from. (There’s still a lot of “magical thinking” in which static files are assumed to be fast and databases are assumed to be slow, even though the whole point of database software is to be fast, a file system is basically a type of database, and we have so much performance-profiling ability these days.)
There are various providers selling virtual private servers with like 10-2000 Mbit/s of unlimited traffic, under $10/month. I don’t know what the constant load of scraping is, but it’s probably feasible to throttle them instead of blocking them or just joining the current moral panic. There’s much less risk of blocking legitimate users that way (either by accusing them of being robots—it’s happened even to my grandmother, who couldn’t figure out what to do—or by blocking their automated agents).
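For what it’s worth, “throttle instead of block” is straightforward to sketch. Below is a minimal per-client token-bucket limiter in Python; the rate and capacity numbers are placeholders I picked for illustration, and in practice you’d more likely enforce this at the proxy layer (nginx’s limit_req does essentially this) than in application code.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow short bursts but cap the sustained request rate per client."""
    def __init__(self, rate=5.0, capacity=20.0):
        self.rate = rate          # tokens refilled per second (placeholder value)
        self.capacity = capacity  # maximum burst size (placeholder value)
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: slow the client down rather than ban it

# One bucket per client IP; defaultdict creates them lazily on first request.
buckets = defaultdict(TokenBucket)

def handle_request(client_ip):
    if buckets[client_ip].allow():
        return "200 OK"
    return "429 Too Many Requests"  # throttled, not blocked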
Re:
How is that any different than irresponsible scraping for any other purpose?
Oh, as well, this appears to be another pro-AI Friday article meant to be left up over the weekend to keep eyeballs on it in an attempt to manufacture more consent for the slop.
“Learning from prior work is foundational to free expression.”
You’re taking extremely broad license with both ‘learning’ and ‘expression’ in that sentence. A parrot isn’t learning to comprehend words and sounds; it’s repeating back certain noises for a reward, with no comprehension of content or ability to construct something new. Making that a billion times more complex didn’t change how it works; it just proved how expensive that dead end to AGI is.
Expression isn’t valued for being easy; if anything, we have far too much low-effort mediocrity to wade through as it is, and the outputs of generative AI seem to be solidly fixed there: a calculated, smoothed average of pure mid, devoid of meaning or value. It can produce a greeting-card-level image of a generic family, but it doesn’t comprehend relationships and can’t tell me why anyone is looking at each other a specific way; there’s no backstory or humanity. Artists and creatives often speak of the layers and drafts that go into their work, and of the learning process that comes from having the labor be part of their life and world; the work can’t help but interact and breathe with every sentence, brushstroke, and melody.
Yeah, don’t throw out fair use with the bath water. Not sure piracy counts as fair use, tho’. They check those works out from the library?
Another article restating the obvious, which the rabid AI haters in the comments will fling their shit at.
Hahahaha.
God damn, AI takes on Techdirt always go from journalism to the most biased idiocy out there.
AI is so great… that everyone who isn’t a corporate bootlicker hates it.
https://www.pcgamer.com/software/ai/darren-aronofsky-might-finally-kill-art-with-his-new-ai-generated-american-revolution-drama-series-presented-by-salesforce/
https://www.pcgamer.com/gaming-industry/more-than-half-of-game-developers-now-think-generative-ai-is-bad-for-the-industry-a-dramatic-increase-from-just-2-years-ago-id-rather-quit-the-industry-than-use-generative-ai/
“Researchers already rely on fair use to analyze massive datasets such as scientific literature. Requiring licenses for these uses would often be impractical or impossible, and it would advantage only the largest companies with the money to negotiate blanket deals.”
Are you fucking serious?
The only ones who can train AI are billion-dollar companies, you fucking imbecile! The level of stupidity is up there with $3 meal bitch.
The ones to PROFIT the most from licensing would be all the writers, artists, and small companies who produce content. By opposing licensing you are only helping the world’s elite at the cost of everyone else.
Re:
I mean, this is just factually untrue. You’re confusing frontier models with all AI training, and that’s wrong.
Just this week, Dave Willner had an article about the small, but very useful, AI his org has trained that is solving some serious trust & safety/content moderation issues: https://www.techpolicy.press/ai-is-removing-bottlenecks-to-effective-content-moderation-at-scale/
The current regime has said they don’t want any AI regulation, so it’s possible the same thing might happen again.
The fact that this misconception around AI model training is so pervasive makes me wonder if it’s part of an op by Big Copyright™ to discredit it.
All four factors, case by case
I’m a songwriter, previously signed to an indie, now a hobbyist, but with an aim of shifting to recording and production through semi-retirement.
I’ve also long been an advocate of copyright reform and a supporter of fair use.
So I’ve got skin in the game from both angles.
And after a while I’ve come to the conclusion that articles like this aren’t really adding much to the discussion anymore; they’re too general.
If you’re going to talk fair use then you have to talk all four factors and that means you have to talk about the specifics.
Could AI training on copyrighted works be fair use? Almost certainly, especially if you heavily weight the first factor.
Could it be infringing if the purpose of the training is to produce competing works in the same market? I would argue yes if you weight the fourth factor and consider the speed at which it can work.
The only real conclusion you can draw from generic arguments is, “maybe.”
Case law will firm up some guidelines in due course, but until then we have to look at all four factors on a case by case basis.
There are too many articles about fair use like this one that don’t mention copyleft. It’s not fair if the derivative work is privative.
The AI debate is in dire need of some nuance.
On one side the doomers who dismiss anything positive that is said about AI technology out of hand and continue spreading the narrative that AI will doom us all.
On the other side the glazers keen to dismiss genuine and legitimate concerns, choosing to lump them in with the doomer narrative.
Neither is doing either side’s cause any good, and the result will be a complete lack of any sensible regulations to rein in the worst aspects of AI technology. The truth is always somewhere between the two extremes, so there is a lot of growing up needed on both sides of the debate.
Stop trying to make fetch happen.