Judge Alsup: Training AI On Copyrighted Works? Fair Use. Building Pirate Libraries? Not So Much

from the right-to-read dept

While dozens of AI copyright lawsuits wind their way through courts nationwide, Judge William Alsup’s ruling this week in Bartz v. Anthropic stands out — not just because it’s from one of the most thoughtful tech judges on the federal bench, but because it charts a somewhat nuanced path through the copyright minefield that could define how AI companies operate going forward.

The ruling has sparked predictably divergent takes, with observers claiming it’s both a big win and a big loss for AI. But the real story is more interesting: Alsup has essentially created a roadmap that validates legitimate AI training while drawing clear lines around what crosses into infringement.

The bottom line: this may cost Anthropic some serious money, but it’s actually great news for generative AI development generally should it stand up.

In short, Judge Alsup found that training an AI system on unlicensed copyrighted works is easily transformative fair use. So too was buying physical books and scanning them into digital copies used for training. However, initially downloading a bunch of unlicensed works and storing them long-term as a kind of central library can be infringing.

To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library. Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.

The good (and I believe correct) part is that training is transformative fair use. Judge Alsup goes through the standard four-factor analysis, with the correct emphasis on the transformative nature of the use for generative AI training. Alsup notes that training generative AI tools on a corpus of information is the equivalent of how humans learn from works of the past, not to replace them, but to learn from them:

In short, the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.

The first factor favors fair use for the training copies.

He finds similarly (though for slightly different reasons) on the hard copy books Anthropic purchased to scan. The scanning, a la Google Books, was for transformative purposes and, a la the Sony Betamax case, to make the content more convenient:

Storage and searchability are not creative properties of the copyrighted work itself but physical properties of the frame around the work or informational properties about the work. See Texaco, 802 F. Supp. at 14 (physical), aff’d, 60 F.3d at 919; Google, 804 F.3d at 225 (informational); Sony Corp. of Am. v. Universal City Studios, Inc. (“Sony Betamax”), 464 U.S. 417, 447 (1984) (rightful interests). In Texaco, the court reasoned that if a purchased scientific journal article had been copied “onto microfilm to conserve space, this might [have been] a persuasive transformative use.” 802 F. Supp. at 14 (Judge Pierre Leval), aff’d, 60 F.3d at 919 (reducing “bulk[ ]” “might suffice to tilt the first fair use factor in favor of Texaco if these purposes were dominant”). In Google Books, the court reasoned that a print-to-digital change to expose information about the work was transformative. Google, 804 F.3d at 225 (Judge Pierre Leval). And, in Sony Betamax, the Supreme Court held that making a recording of a television show in order to instead watch it at a later time was copying but did not usurp any rightful interest of the copyright owner. 464 U.S. at 447, 455. Important to the Supreme Court’s reasoning was the expectation that most such copiers would not distribute the permanent copies of the work.

And since that was effectively the same as what Anthropic did here, it gets another vote towards fair use:

Here, every purchased print copy was copied in order to save storage space and to enable searchability as a digital copy. The print original was destroyed. One replaced the other. And, there is no evidence that the new, digital copy was shown, shared, or sold outside the company. This use was even more clearly transformative than those in Texaco, Google, and Sony Betamax (where the number of copies went up by at least one), and, of course, more transformative than those uses rejected in Napster (where the number went up by “millions” of copies shared for free with others).

Thankfully, Alsup flatly rejects the idea that it can’t be fair use because authors/publishers might have wished to license these works at a higher rate. That’s not how this works:

Yes, Authors also might have wished to charge Anthropic more for digital than for print copies. And, this order takes for granted that Authors could have succeeded if Anthropic had been barred from the format change. “But the Constitution’s language [in Clause 8] nowhere suggests that [the copyright owner’s] limited exclusive right should include a right to divide markets or a concomitant right to charge different purchasers different prices for the same book, [merely] say to increase or to maximize gain.” See Kirtsaeng v. John Wiley & Sons, Inc., 568 U.S. 519, 552 (2013); see also U.S. CONST. art. I., § 8, cl. 8. Nor does the Copyright Act itself. Section 106 sets out exclusive rights that fair uses under Section 107 abridge. Section 106(1) reserves to the copyright owner the right to make reproductions. But on our facts we face the unusual situation where one copy entirely replaced the other. And, Section 106(2) reserves to the copyright owner the right to make derivative works that add or subtract creative material — as occurs in a “translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, [or] condensation” of a book, 17 U.S.C. § 101 (definitions). For some “other modification[ ]” of a book to constitute a “derivative work,” it must itself “represent an original work of authorship.” Ibid. But on our facts the format was changed but no content was added or subtracted. See Mirage Editions, Inc. v. Albuquerque A.R.T. Co., 856 F.2d 1341, 1342, 1343–44 (9th Cir. 1988) (yes where elements added to create new decorative ceramic). Section 106(3) further reserves to the copyright owner the right to distribute copies. But again, the replacement copy here was kept in the central library, not distributed. Cf. Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169, 176–78 (2d Cir. 2018) (enabling searching for “information about the material” can be transformative use, even if some distribution results); Lewis Galoob Toys, Inc. v. Nintendo of Am., Inc., 964 F.2d 965, 968, 971 (9th Cir. 1992) (using nifty converter to “merely enhance[ ]” audiovisual displays emitted from purchased videogame cartridge was fair use of those displays partly because no surplus copies of cartridge or displays were ever created).

As a result, Anthropic’s format-change from print library copies to digital library copies was transformative under fair use factor one. Anthropic was entitled to retain a copy of these works in a print format. It retained them instead in a digital format, easing storage and searchability. And, the further copies made therefrom for purposes of training LLMs were themselves transformative for that further reason, as above.

My quibble with this is that, for the books that were legally purchased or licensed and then used for training, there is an argument that you shouldn’t need to get to the fair use question at all. If you buy a used book and read it and learn from it without directly paying the author or publisher, it’s not because of “fair use” that you can do it. It’s because reading and learning from the work doesn’t trigger copyright at all.

However, if we must go to fair use based on the fact that in this training process copies were made, having Alsup call it transformative fair use is a good outcome.

But then there’s the question of the non-licensed book collections (things like Books3 and LibGen) that Anthropic downloaded from the internet and then stored in the internal “digital library” it was using. And here, Alsup is not impressed and finds it difficult to see the fair use. Basically, in those cases, the company was clearly just downloading unlicensed copies to put into its own library.

This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

This feels close to reasonable. There are certainly plenty of cases on the books that show that simply downloading unlicensed content off the internet can be seen as infringing (though I’d still quibble that under the exact text of copyright law it only counts as a “copy” if it’s a “material object,” and purely digital content isn’t covered — but courts have long rejected that argument).

Where it still worries me a bit is that this feels pretty similar to things like “indexing the web.” Organizations like Google and the Internet Archive and many others copy all the content they can find online and store it in giant databases/indexes/libraries. And those have been found to be fair use in the past.

So what makes this different?

Judge Alsup tries to distinguish this from key cases regarding internet scanning, but this part feels weaker to me:

Nor were the initial copies made immediately transformed into a significantly altered form. In Perfect 10, images were copied by the search engine in thumbnail form only and deployed immediately into the transformative use of identifying the full-sized images and the pages from which they came. 508 F.3d at 1160, 1165, 1167. And, in Kelly v. Arriba Software Corp., images were copied at full size and then into thumbnails for immediate use in building a search engine, after which the full-sized copies were immediately deleted. 336 F.3d 811, 815 (9th Cir. 2003). Not here. The full-text copies of books were downloaded and maintained “forever.”

Nor does the initial copying here even resemble the full-text copying in the Google Books cases. There, libraries of authorized copies already had been assembled, and all copies therefrom were made for direct employment in a one-to-one further fair use — whether the transformative use of pointing to the works themselves, the use of providing the works in formats for print-disabled patrons, or the use of insuring against going out of print, getting lost, and becoming otherwise unavailable. HathiTrust, 755 F.3d at 97, 101, 103; Google, 804 F.3d at 206, 216–18, 228 (further distinguishing search and snippet uses, which “test[ed] the boundaries of fair use”). Not so here concerning the pirated copies. No authorized copies existed from which Anthropic made its first copies. No full-text copy therefrom was put immediately into use training LLMs. Not every copy was even necessary nor used for training LLMs. No initial copy was ever deleted, even if never used or no longer used. The university libraries and Google went to exceedingly great lengths to ensure that all copies were secured against unauthorized uses — both through technical measures and through legal agreements among all participants. Not so here. The library copies lacked internal controls limiting access and use.

This… feels like rationalization. Yes, the Perfect 10 and Arriba cases were about thumbnails, but search engines do more than turning content into thumbnails, and we generally consider that — even when it sweeps up infringing works on its own — to still be a fair use. So while I understand the logic of what Alsup is saying here, I do worry that it goes too far, and could wipe out other important and valuable uses.

Without going into too much detail on the other three factors (since they tend to matter less here), Alsup says the nature of the works cuts against fair use (but this factor rarely matters much in the final analysis), and while the copying took pretty much the entirety of the copyright-covered works, the amount factor still leans towards fair use because (as multiple other cases have shown over the years) the use involved the amount necessary to achieve the transformative purpose.

Copies selected for inclusion in training sets were selected because they were complete and because they contained rich protectible expression, or so this order accepts the record shows for Authors. Was all this copying reasonably necessary to the transformative use?

Yes.

“What matters [ ] is not so much ‘the amount and substantiality of the portion used’ in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public [in the purported secondary use] for which it may serve as a competing substitute [for the primary use].”

Then there’s the dreaded “effect of the use upon the market” factor, which I honestly think shouldn’t be a fair use factor at all. But in this case, Alsup splits the works into three classes, saying the training use again favors fair use, since it has no direct impact on the market. The use to build the library is mixed again: the purchased copies are seen as neutral, while the unlicensed downloaded copies cut against fair use (again).

So, in the end: fair use for training, fair use for buying used books and scanning them, not fair use for downloading Books3/LibGen and creating an internal library out of them:

This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason. But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies.

The win for AI is that the training aspect (and even the scanning aspect) are found to be fair use. But, the people who say this is a win for the authors aren’t entirely wrong, because the downloading of the unauthorized copies was done by almost all of the big foundation LLM companies (though it’s not clear all of them set up a similar “library” as Anthropic did).

The prediction is that this one part, on which Alsup says there should be a trial, will likely lead Anthropic to try to settle the case and pay up for that use. That wouldn’t surprise me, given the insane statutory damages rates (effectively starting at $750 per work infringed, but going all the way up to a potential $150k per work if found to be willful).
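To get a feel for why those per-work numbers push toward settlement, here is a rough back-of-the-envelope sketch. It assumes the statutory range in 17 U.S.C. § 504(c): $750 to $30,000 per work ordinarily (the $30,000 ordinary ceiling comes from the statute, not from the ruling), and up to $150,000 per work for willful infringement. The function name and the work count used below are purely hypothetical illustrations, not figures from the case.

```python
# Per-work statutory damages bounds, assumed from 17 U.S.C. § 504(c).
MIN_PER_WORK = 750                 # ordinary floor per work infringed
MAX_ORDINARY_PER_WORK = 30_000     # ordinary ceiling per work infringed
MAX_WILLFUL_PER_WORK = 150_000     # ceiling if infringement is found willful


def statutory_damages_range(num_works, willful=False):
    """Return (low, high) total exposure in dollars for num_works works.

    This is a hypothetical helper for illustration only; real awards are
    set per work by the court or jury anywhere inside the range.
    """
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    ceiling = MAX_WILLFUL_PER_WORK if willful else MAX_ORDINARY_PER_WORK
    return (num_works * MIN_PER_WORK, num_works * ceiling)


# Hypothetical example: 1,000 works (NOT a number from the ruling).
low, high = statutory_damages_range(1_000, willful=True)
print(f"${low:,} to ${high:,}")
```

Even at the floor, the total scales linearly with the number of works, which is why exposure gets large quickly when the works at issue number in the thousands or more.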

Though, it also strikes me that even if the authors win, the remedy here wouldn’t require the destruction of the LLMs themselves, since it’s not the tool that is infringing, but rather the separate storage as a library.

Also left open, to me, is the question of what would happen if a model figured out a way to train on those works like Books3/LibGen just by scanning them when found elsewhere online, and not creating the internal library. That could limit some of the usefulness of those collections but would, in theory, avoid some of the liability risk Alsup sees here.

The end result then is that this ruling favors LLM training, which is good for innovation and usefulness. It might, however, ding more sketchy ancillary practices of the big LLM creators. And maybe that’s the proper balance? Alsup has created a framework that distinguishes between legitimate, transformative innovation practices and what amounts to direct infringement with a corporate veneer.

This distinction matters because it gives other AI companies a clear playbook (one that may come too late for some): if you want to avoid Anthropic’s potential liability, don’t create permanent archives of questionably sourced content. The ruling essentially says you can learn from copyrighted works, but you can’t just wholesale copy them into your corporate library.

Some will argue that’s a distinction without a difference, but it’s actually how copyright is supposed to work — focusing on the nature of the use rather than blanket prohibitions on touching copyrighted content.

Of course, this is still just one district court ruling among many pending cases, and appeals are inevitable. But if this framework holds up, it could reshape how AI companies approach data collection — favoring more legally defensible practices over the pure “move fast and break things” approach that might prove to be more trouble than it was worth.

Companies: anthropic

Comments on “Judge Alsup: Training AI On Copyrighted Works? Fair Use. Building Pirate Libraries? Not So Much”

13 Comments
Anonymous Coward says:

I think the judge came to the right decision that Anthropic did do something wrong.

Taking from pirated sources should not be an option (for a legitimate company). However, I don’t think it should matter if the company copied the data or just scanned it.

On a personal note, I am more open to piracy by individuals because it is a more nuanced gray area than for companies.

Scott Craver says:

This is the same issue that befell MP3.com

MP3.com was taken to court for a “CD beaming” service that they argued was fair use (streaming CDs to people who verified that they owned the physical CD), but it likewise involved building a library of ripped CDs, and they were busted on that act of direct infringement.

I always suspected that could become a model for AI copyright lawsuits, where the computation and processing is too transformative and weird to be easily established as infringement, but the copying of online works into training sets could be the act targeted by a lawsuit.

Arianity (profile) says:

I think your summary is missing a few crucial points that are worth mentioning:

1) the authors didn’t argue that Claude regurgitated parts of the book(s). This lawsuit is specifically focused on inputs only, and that shapes the ruling a lot. The judge is also making a very big distinction between AI that writes new content itself (in contrast to the Reuters decision).

2) the authors conceded that training was similar to human learning: Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write

These make the fourth factor particularly weak. It’s not necessarily that Alsup is putting less emphasis on it.

And since that was effectively the same as what Anthropic did here, it gets another vote towards fair use:

You’re misreading that portion a bit. The authors explicitly separated the format shift as a separate element: Authors argue it was a distinguishable step requiring independent justification. (That said, this is actually a nice win for property rights in regards to format shifting. Although I’m a little worried about the Judge’s reasoning. As noted, 106 restricts reproduction, and it doesn’t say anything about one to one copies. So he’s freestyling a bit, there)

Organizations like Google and the Internet Archive and many others copy all the content they can find online and store it in giant databases/indexes/libraries. And those have been found to be fair use in the past. So what makes this different?

Those passed fair use because of other factors, not the copying part. The ruling (and past rulings) explicitly go into this.

though I’d still quibble that under the exact text of copyright law it only counts as a “copy” if it’s a “material object,” and purely digital content isn’t covered

Eh, if you want to get pedantic about it, every “purely digital” copy resides on some physical media, be it RAM or other forms of storage. It very much is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

Also left open, to me, is the question of what would happen if a model figured out a way to train on those works like Books3/LibGen just by scanning them when found elsewhere online, and not creating the internal library.

That’s still making a copy, as far as copyright goes. Although there’s still some wiggle room in precedent if it’s streamed quickly enough. But practically, it’s potentially much more wasteful, if companies start having to rescreen material. AI companies are already putting significant load on sites. If every new model had to regrab ephemeral data, that would get much worse. That would actually kind of suck, from a practical point of view.

And maybe that’s the proper balance? Alsup has created a framework that distinguishes between legitimate, transformative innovation practices and what amounts to direct infringement with a corporate veneer.

That seems like it might potentially lead to a bad equilibrium. If any sale of a book can be turned into an AI input, it’s going to have to be priced accordingly. I could also see this leading to more “licenses” or other workarounds, where you end up not actually owning the thing, similar to how software commonly works now. But maybe it’s a start of something workable.
