A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright

from the not-how-any-of-this-works dept

You may have seen some headlines recently about some authors filing lawsuits against OpenAI. The lawsuits (plural, though I’m confused why it’s separate attempts at filing a class action lawsuit, rather than a single one) began last week, when authors Paul Tremblay and Mona Awad sued OpenAI and various subsidiaries, claiming copyright infringement in how OpenAI trained its models. They got a lot more attention over the weekend when another class action lawsuit was filed against OpenAI with comedian Sarah Silverman as the lead plaintiff, along with Christopher Golden and Richard Kadrey. The same day the same three plaintiffs (though with Kadrey now listed as the top plaintiff) also sued Meta, though the complaint is basically the same.

All three cases were filed by Joseph Saveri, a plaintiffs’ class action lawyer who specializes in antitrust litigation. As with all too many class action lawyers, the goal is generally enriching the class action lawyers, rather than actually stopping any wrong. Saveri is not a copyright expert, and the lawsuits… show that. They rest on a ton of assumptions about how Saveri seems to think copyright law works, which are entirely inconsistent with how it actually works.

The complaints are basically all the same, and what it comes down to is the argument that AI systems were trained on copyright-covered material (duh), and that this somehow violates the authors’ copyrights.

Much of the material in OpenAI’s training datasets, however, comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI without consent, without credit, and without compensation

But… this is both wrong and not quite how copyright law works. Training an LLM does not require “copying” the work in question, but rather reading it. To some extent, this lawsuit is basically arguing that merely reading a copyright-covered work is, itself, copyright infringement.

Under this definition, all search engines would be copyright infringing, because they’re effectively doing the same thing: scanning web pages and learning from what they find to build an index. But we’ve already had courts say that’s not even remotely true. If courts have decided that search engines scanning content on the web to build an index is clearly transformative fair use, then so too is scanning internet content to train an LLM. Arguably, the latter is far more transformative.

And this is the way it should be, because otherwise, it would basically be saying that anyone reading a work by someone else, and then being inspired to create something new would be infringing on the works they were inspired by. I recognize that the Blurred Lines case sorta went in the opposite direction when it came to music, but more recent decisions have really chipped away at Blurred Lines, and even the recording industry (the recording industry!) is arguing that the Blurred Lines case extended copyright too far.

But, if you look at the details of these lawsuits, they’re not arguing any actual copying (which, you know, is kind of important for there to be copyright infringement), but just that the LLMs have learned from the works of the authors who are suing. The evidence there is, well… extraordinarily weak.

For example, in the Tremblay case, they asked ChatGPT to “summarize” his book “The Cabin at the End of the World,” and ChatGPT does so. They do the same in the Silverman case, with her book “The Bedwetter.” If those are infringing, so is every book report by every schoolchild ever. That’s just not how copyright law works.

The lawsuit tries one other tactic here to argue infringement, beyond just “the LLMs read our books.” It also claims that the corpus of data used to train the LLMs was itself infringing.

For instance, in its June 2018 paper introducing GPT-1 (called “Improving Language Understanding by Generative Pre-Training”), OpenAI revealed that it trained GPT-1 on BookCorpus, a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” OpenAI confirmed why a dataset of books was so valuable: “Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.” Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others.

BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for the purpose of training language models. They copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost. Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.

If that’s the case, then they could make the argument that BookCorpus itself is infringing on copyright (though, again, I’d argue there’s a very strong fair use claim under the Perfect 10 cases), but that’s separate from the question of whether or not training on that data is infringing.

And that’s also true of the other claims of secret pirated copies of books that the complaint insists OpenAI must have relied on:

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most likely sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Again, think of the implications if this is copyright infringement. If a musician were inspired to create music in a certain genre after hearing pirated songs in that genre, would that make the songs they created infringing? No one thinks that makes sense except the most extreme copyright maximalists. But that’s not how the law actually works.

This entire line of cases is just based on a total and complete misunderstanding of copyright law. I completely understand that many creative folks are worried and scared about AI, and in particular that it was trained on their works, and can often (if imperfectly) create works inspired by them. But… that’s also how human creativity works.

Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing. It’s not infringing to learn from and be inspired by the works of others. It’s not infringing to write a book report style summary of the works of others.

I understand the emotional appeal of these kinds of lawsuits, but the legal reality is that these cases seem doomed to fail, and possibly in a way that will leave the plaintiffs having to pay legal fees (since, in copyright cases, legal fee awards are much more common).

That said, if we’ve learned anything at all in the past two plus decades of lawsuits about copyright and the internet, courts will sometimes bend over backwards to rewrite copyright law to pretend it says what they want it to say, rather than what it does say. If that happens here, however, it would be a huge loss to human creativity.

Companies: meta, openai


Comments on “A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright”

87 Comments
This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

People are not machines and law is under no obligation to treat them identically for copyright consideration

That’s an interesting point, but moot for now: the lawsuit is based on existing laws, which do not differentiate between human and machine uses.

Anonymous Coward says:

Re: Re:

Fair use looks at
Factor 1: The Purpose and Character of the Use
Factor 2: The Nature of the Copyrighted Work
Factor 3: The Amount or Substantiality of the Portion Used
Factor 4: The Effect of the Use on the Potential Market for or Value of the Work

I could see a court looking at certain things (especially factor 4) differently for a human vs. how OpenAI uses their LLM when determining whether some uses are fair use, though I personally would still come down on the side of it being fair use.

Anonymous Coward says:

Re: Re: Re:

Not relevant to these cases, but you could probably argue that AIs that focus on an individual’s speech or art style have a negative “Effect of the Use on the Potential Market for or Value of the Work” if you were to treat the output as transformative of the original rather than a new creation.

Anonymous Coward says:

Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing.

They aren’t doing the same thing, Mike. Humans learning a technique and AI using its perfect recall to synthesize images and text and music and more are two different processes. We can conclude that a human who listened to a pirated song or read a pirated book and made something brand new from it is in an entirely different ballpark from scraping the Internet and whole books to make LLMs.

And deeper than that, we can legally distinguish the internet-scraping used to make a search engine from the internet-scraping used to make an LLM.

I’m sorry, but with:
-The one article about the writers’ strike from the NFT guy,
-The other one you wrote claiming that AI would give workers without technical skills a leg up and an entryway into the workforce, while ignoring that it’d just replace them later and remove that entryway for future low-skill workers,
-And now this where you trot out the incorrect “Humans and computers are the same” cliché and go into a moral panic that this case could lead to search engines and more being outlawed…

Techdirt is coming off as aloof and distant from the harms and externalities this tech is gonna place on society, and already has in many ways, while simultaneously only giving a ceremonial level of sympathy to the fears of artists and workers.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

The way you describe AI generation in your comment makes it sound like you think AI just makes a collage from a huge database of source material… which is not how it works at all. Fearmongering statements born of ignorance won’t help anyone.

Anonymous Coward says:

Re: Re:

Not that hyping up AI out of ignorance will help either, and the hype crowd are the ones funding the marketing, not just the research.

There is a valid concern that, without humans to keep making more art to riff off of, AI-generated art will start getting repetitive and less valuable. Therefore, you’d think it’d be dumb to let the people who provide the data and value go, but… I’m sure you read up on the news about Twitter and Reddit on this very website.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re: Re: Re:

I see you believe in the fallacious argument put forward by the copyright lobby, and that is that the creation of art depends on an ability to make money. More art is produced as a hobby than is ever published, and that includes self published works.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re: Re: Re:

There is a valid concern that, without humans to keep making more art to riff off of, AI-generated art will start getting repetitive and less valuable. Therefore, you’d think it’d be dumb to let the people who provide the data and value go

In which case human art becomes more valuable, and AI generated art – even the ones generated from the new human art – continue to become less valuable. Human artists then become more valuable as the AI operators’ time in the spotlight is over.

I don’t think this is how you wanted the argument to go, though.

BernardoVerda (profile) says:

Re: Re: Re:2 Of words, paintings, and pearls...

A necklace of “genuine” or “natural” pearls is worth a lot more in the market than one of “cultured” pearls produced by human intervention. A brooch or ring set with “genuine” sapphires fetches a much higher price than one set with “synthetic” gems. Paintings and prints by an actual artist are more expensive than a mass-produced graphic… And all this is true even if very few people can actually tell the difference.

People do still place a premium on “the real thing”.

Even now, book tours play a significant role. People want to see the author, hear their opinions and their answers to questions, get a feel for the actual person behind the art.

So “Harlequin Romance” products might — or might not (?) — end up being churned out by the next Large Language Model/Artificial Intelligence, but there will still be a special place for the next Maya Angelou, J.K. Rowling, Paolo Bacigalupi, Cixin Liu, Margaret Atwood…

Anonymous Coward says:

Re: Re: Re:

There is valid concern that […] AI-generated art will start getting repetitive and less valuable.

Some people believe Hollywood films have already gotten repetitive. They’ve been saying it about sequels and remakes, in particular, for decades. And while films do remain unprofitable, that’s due to the hard work of the accountants rather than any lack of revenue.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

They aren’t doing the same thing, Mike. Humans learning a technique and AI using its perfect recall to synthesize images and text and music and more are two different processes.

If its recall is so good, why have there been so many articles about how bad it is at writing factual articles? A marker of AI art is that it doesn’t know how many fingers a human has, or that a loop of hair only has one end attached to the scalp.

Anonymous Coward says:

Re: Re: Re:

Interesting. It could be easily seen the other way around.

What market do you suppose AI generated images are infringing upon?

The fact that AI-generated racist and homophobic images in the style of Sarah Anderson exist, trained by incels from 4chan, does not suddenly mean that Sarah Anderson has lost a market for her work.

AI-generated art could disappear tomorrow and it wouldn’t stop hateful idiots or anyone else trying to appropriate her style in that way.

Fleshbot says:

Re: Re: Re:2

Illustration for:
– Home decoration
– Graphic design products & Advertising in general
– Book covers & texts
– Pitches, moodboards, concept & final art in Games & Films

Voice Acting
Writing
Music

and probably more I can’t think of right now.

But I guess all these things are too worthless to be paid and respected, but fancy enough to mimic via parasitic software.

This comment has been deemed insightful by the community.
Strawb (profile) says:

Re:

We can conclude that a human who listened to a pirated song or read a pirated book and made something brand new from it is in an entirely different ballpark from scraping the Internet and whole books to make LLMs.

Can we? An AI uses existing material to copy, transform and combine (shoutout to Kirby Ferguson) it into new material based on prompts, or what you could practically call ideas, from a person interacting with it.
That’s how creativity works in a nutshell. It doesn’t matter if it has “perfect recall” (which it clearly doesn’t anyway).

And deeper than that, we can legally distinguish the internet-scraping used to make a search engine from the internet-scraping used to make an LLM.

Search engines use LLMs to facilitate searches and results. So what’s the legal distinction?

And now this where you trot out the incorrect “Humans and computers are the same” cliché

That’s a strawman and you know it.

Techdirt is coming off as aloof and distant from the harms and externalities this tech is gonna place on society, and already has in many ways,

Even if that were true, it certainly beats the panic that so many other people are exhibiting when it comes to AI.

Anonymous Coward says:

Re: Re: Re:

It’s a comparison that was not being made. Pointing out that the learning patterns of humans and computers are similar is not the same as saying they are equal or even alike.

There’s pointing out that a comparison is bullshit, and then there’s making assumptions about what the other person is saying because you think it’s an “I win because you’re dumb” button.

Anonymous Coward says:

Re: Re: Re:2

It’s a comparison that was not being made. Pointing out that the learning patterns of humans and computers are similar is not the same as saying they are equal or even alike.

Mike’s own words:

Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing.

JWise says:

Re: Re: Re:4

“But… that’s also how human creativity works.”

Corporate use of data is not the same as personal use of data. This is more analogous to corporations taking your data for advertising. By law they have to ask you if you want their cookies on your computer.

AI works at billions of operations per second. The damage it can do to an artist blows away anything a human being inspired by them could do: said artist now has a style that looks like AI slop, killing any chance of a career.

The courts are allowing AI companies to use artist’s own work to destroy their prospects because money is what ultimately decides these matters.

MathFox says:

Re:

How did the AI read the copyright work

One word at a time…

without possessing a copy

You are right that an AI cannot own property, but that does not mean that its “owners” don’t own (digital) copies of books.

And the way large language models are trained means they can be fed books in fragments, and their training is not spoiled if they miss a chapter of a book here and there, as long as the model learns which words are commonly used in which order.
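The fragment-feeding idea the comment describes is easy to picture in code. A toy sketch (the chunk size and sample sentence are invented for illustration; this is not any lab’s actual pipeline, and real systems chunk by tokens rather than words):

```python
# Toy sketch: a training corpus is split into fixed-size chunks, so
# the model never needs a whole book at once. Each chunk is an
# independent training example; a missing chapter just means a few
# missing examples, not a spoiled training run.

def chunk_text(text: str, chunk_size: int = 4) -> list[list[str]]:
    """Split text into word chunks of at most chunk_size words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

sample = "It was the best of times it was the worst of times"
chunks = chunk_text(sample)
print(chunks[0])    # ['It', 'was', 'the', 'best']
print(len(chunks))  # 3
```

Dropping any one chunk from the training set leaves the rest usable, which is the point the comment is making about fragments.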

migi (profile) says:

Re: Re:

Firstly, the lawsuit alleges that the AI ‘read’ an illegal copy on the web, not a legally bought copy from a bookshop. Maybe OpenAI will disprove this by pointing to the Kindle receipts or whatever, but somehow I doubt it; it’s not the techbro way.

Secondly, you’re implying that it’s OK to download the work one word/sentence/page at a time, read it, discard it, then do the same for the next one. We have a word for that, it’s called streaming.

We know it is legal to stream a movie from netflix/amazon/apple/whoever. But streaming an illegal copy is still illegal, so the underlying source of the material is important.

Thirdly, I would like a source for the assertion that AI models use streaming to ingest data. It seems to me that using streaming would be incredibly inefficient. Every time you tweak the model and want to see how it does, you’d need to scan the whole internet again. If you want to control what goes into the model, you need to be able to review the data. Data storage is cheap, and the only thing you need to store is text, which is highly compressible.

Anonymous Coward says:

Re: Re:

One word at a time…

To get real pedantic, it’s one bit at a time.

The implication here is that the grifters expended a fair amount of effort to transcribe a fragment of a book into a machine-readable form, be it through OCR or manually.

Regardless, the only thing that should be asked is if the rightsholders gave explicit permission for their works to be used in machine learning, and nothing more.

This comment has been deemed insightful by the community.
Bruce C. says:

The fun-house mirror...

These cases almost sound like parodies of the “it’s patentable because it’s… on a COMPUTER” cases from earlier in the millennium. Somehow a COMPUTER reading and gathering information from a copyrighted work is supposed to be significantly different from a person reading and learning from a reference book and then using that knowledge in their workplace.

This comment has been deemed insightful by the community.
BernardoVerda (profile) says:

Re:

We’re seeing the same fallacious nonsense, in the visual arts world, with people screaming that AI art is “stealing” or “infringing” their art, just because the AI is trained by looking at art.

Oddly enough, this whole “violating our copyrights” argument essentially ignores that that’s how human writers and artists learn their craft/trade, too.

(I’m sympathetic to how these technological developments are affecting artists and writers — especially those who make or wish to make a living from their work — but I don’t think they’ve thought this through, and wouldn’t like the world that this sort of copyright maximalism would inevitably lead to.)

This comment has been deemed insightful by the community.
BernardoVerda (profile) says:

Re: Re: Re:

Sort of true, but not actually relevant.

The machines ‘learned’ their ‘craft’ by examining the work of those previous authors. That the mechanisms of ‘learning’ (‘training’) are different is quite beside the point.

These “AIs” (I think the name is giving more credit than is due) are not simply reproducing artists’ works, but creating works in the style/a similar style of previous authors. ‘Style’ is not copyrightable — nor are plots, themes, motifs, etc.

This may in the foreseeable future be a threat to the “business model” of actual, human artists, who would like to make a living from their art — but it’s certainly not copyright infringement.

Fleshbot says:

Re: Re: Re:2

It IS still relevant. Just look at you. AI “just learns styles”. No, it processes data assets other humans prepped for you, that you wish to let the machine mimic for you. AI is a SOFTWARE PRODUCT.

Apparently licensing and consent systems are just a silly game to IT bros that they shouldn’t be part of BECAUSE their AI is just their brain prosthetic or virtual human-like child. AI bros see their freedoms and agendas as more important than those of the people they use as data cattle.

There is indeed a limit to how much humans can control how their traits and their creations can be used. For example you can’t prohibit a human from being simply inspired by your work and learning from it.

So, how handy it is to just postulate “My machine is just looking & learning like a person” – and hoping that clueless old politicians and legislators gulp this nonsense down. (The consumer masses already do.)


Anonymous Coward says:

With existing law, if the owner of AI software is made aware that specific training data used to train it was infringing, would the owner be required to remove any data derived from that infringing material from their data set, since such data presumably exists on a computer somewhere? This brings up indexing, but Google certainly removes infringing content from its index upon receiving DMCA notices.

Anonymous Coward says:

Re: Re: Re:3

Tough cookies, but maybe you should’ve tried getting your permissions in line first.

You mean like the time the RIAA used someone’s landscape photograph as a backdrop for a website without permission?

Or like Richard Liebowitz, who represented plenty of “content creators” without their permission and pocketed all the settlement money?

Copyright-types do like to bitch and moan about permissions but they seem perfectly fine with not getting them when it’s convenient.

coby says:

but even just possessing a copy can be illegal

I will happily defer to the more informed opinions around here if there is a substantive answer to this, but isn’t the mere possession of an illegal copy where the violation is? i.e. with the musician inspired by pirated music it does not matter whether what she writes is truly new and original, or if she writes anything at all. The copyright infringement happened with the illegal download of a pirated song, no?

So the lawsuit could have merit if these AI models used copies of books that were not legally obtained. Who cares how it was used, or even if it wasn’t used at all?

Please enlighten me.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

isn’t the mere possession of an illegal copy where the violation is?

I don’t think copyright law generally has a concept of “illegal copy”; rather, it deals with illegal copying—an act, not an object.

In most of the world, whoever is distributing the work would be the one (potentially) infringing copyright, whereas the receiver would not be. I think the USA, at the behest of the film companies, did make downloading also illegal if the uploader isn’t authorized to make the copy. Still, I’ve never heard of anyone being sued for mere possession.

This comment has been deemed insightful by the community.
That One Guy (profile) says:

What I’m hearing is that no one should read any books by these authors, because if a reader comes up with an idea or picks up a particular writing style after doing so, that’s copyright infringement, and since copyright infringement is the most heinous crime possible (just ask the people pushing more and more extreme copyright laws), it’s better to avoid even the possibility of that happening.

kgb99 says:

Maybe these guys don’t understand copyright, but this author clearly doesn’t even have a basic understanding of how computers work.

Computers literally CAN NOT READ. Full stop. They also can not “think” or “learn” in any human sense. They are computers.

As a technological fact, “reading” is objectively not what happens when an AI model is trained.

bhull242 (profile) says:

Re:

As a programmer, let me tell you that computers absolutely can read and learn, even if it’s not exactly the same way humans do. And yes, when an AI is trained on certain data, it has to read that data. (Seriously, a command common to just about every programming language is “read”.)

As for thinking, no one claims that it does.
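For what it’s worth, the “read” operation the comment refers to is about as ordinary as programming gets. A minimal sketch (the file contents are invented for the example, and a throwaway temp file is used so it is self-contained):

```python
import os
import tempfile

# Create a throwaway text file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("The quick brown fox jumps over the lazy dog.")

# The "read" step common to just about every language:
# the program ingests the text as a string.
with open(path) as f:
    text = f.read()

# The program can now compute over what it read, e.g. a word count.
words = text.split()
print(len(words))  # 9

os.remove(path)
```

After the file is deleted, the program still holds what it computed from the text, which is roughly the learning-versus-possessing distinction the thread is circling.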

terop (profile) says:

An easy way of checking whether a copyright infringement ruling is warranted is to check whether removing the infringed work from the world would damage the product the defendant sold. For example, if the chatgpt/AI lawsuit plaintiffs want to sue them, they need to claim that all of chatgpt’s training material was unlicensed. That’s enough. When the product pretty much disappears once the allegedly infringing content is removed, it’s a clear instance of copyright infringement and laws should prevent the horror.
