date 2023-07-11
A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright
You may have seen some headlines recently about some authors filing lawsuits against OpenAI. The lawsuits (plural, though I’m confused why it’s separate attempts at filing a class action lawsuit, rather than a single one) began last week, when authors Paul Tremblay and Mona Awad sued OpenAI and various subsidiaries, claiming copyright infringement in how OpenAI trained its models. They got a lot more attention over the weekend when another class action lawsuit was filed against OpenAI with comedian Sarah Silverman as the lead plaintiff, along with Christopher Golden and Richard Kadrey. The same day the same three plaintiffs (though with Kadrey now listed as the top plaintiff) also sued Meta, though the complaint is basically the same.
All three cases were filed by Joseph Saveri, a plaintiffs class action lawyer who specializes in antitrust litigation. As with all too many class action lawyers, the goal is generally enriching the class action lawyers, rather than actually stopping any actual wrong. Saveri is not a copyright expert, and the lawsuits… show that. There are a ton of assumptions about how Saveri seems to think copyright law works, which is entirely inconsistent with how it actually works.
The complaints are basically all the same, and what it comes down to is the argument that AI systems were trained on copyright-covered material (duh) and that somehow violates their copyrights.
Much of the material in OpenAI’s training datasets, however, comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI without consent, without credit, and without compensation
But… this is both wrong and not quite how copyright law works. Training an LLM does not require “copying” the work in question, but rather reading it. To some extent, this lawsuit is basically arguing that merely reading a copyright-covered work is, itself, copyright infringement.
Under this definition, all search engines would be copyright infringing, because effectively they’re doing the same thing: scanning web pages and learning from what they find to build an index. But we’ve already had courts say that’s not even remotely true. If the courts have decided that search engines scanning content on the web to build an index is clearly transformative fair use, so too would be scanning internet content for training an LLM. Arguably the latter case is way more transformative.
And this is the way it should be, because otherwise, it would basically be saying that anyone reading a work by someone else, and then being inspired to create something new would be infringing on the works they were inspired by. I recognize that the Blurred Lines case sorta went in the opposite direction when it came to music, but more recent decisions have really chipped away at Blurred Lines, and even the recording industry (the recording industry!) is arguing that the Blurred Lines case extended copyright too far.
But, if you look at the details of these lawsuits, they’re not arguing any actual copying (which, you know, is kind of important for there to be copyright infringement), but just that the LLMs have learned from the works of the authors who are suing. The evidence there is, well… extraordinarily weak.
For example, in the Tremblay case, they asked ChatGPT to “summarize” his book “The Cabin at the End of the World,” and ChatGPT does so. They do the same in the Silverman case, with her book “The Bedwetter.” If those are infringing, so is every book report by every schoolchild ever. That’s just not how copyright law works.
The lawsuit tries one other tactic here to argue infringement, beyond just “the LLMs read our books.” It also claims that the corpus of data used to train the LLMs was itself infringing.
For instance, in its June 2018 paper introducing GPT-1 (called “Improving Language Understanding by Generative Pre-Training”), OpenAI revealed that it trained GPT-1 on BookCorpus, a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” OpenAI confirmed why a dataset of books was so valuable: “Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.” Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others.
BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for the purpose of training language models. They copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost. Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.
If that’s the case, then they could make the argument that BookCorpus itself is infringing on copyright (though, again, I’d argue there’s a very strong fair use claim under the Perfect 10 cases), but that’s separate from the question of whether or not training on that data is infringing.
And that’s also true of the other claims of secret pirated copies of books that the complaint insists OpenAI must have relied on:
As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.
Again, think of the implications if this is copyright infringement. If a musician were inspired to create music in a certain genre after hearing pirated songs in that genre, would that make the songs they created infringing? No one thinks that makes sense except the most extreme copyright maximalists. But that’s not how the law actually works.
This entire line of cases is just based on a total and complete misunderstanding of copyright law. I completely understand that many creative folks are worried and scared about AI, and in particular that it was trained on their works, and can often (if imperfectly) create works inspired by them. But… that’s also how human creativity works.
Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing. It’s not infringing to learn from and be inspired by the works of others. It’s not infringing to write a book report style summary of the works of others.
I understand the emotional appeal of these kinds of lawsuits, but the legal reality is that these cases seem doomed to fail, and possibly in a way that will leave the plaintiffs having to pay legal fees (since in copyright cases, legal fee awards are much more common).
That said, if we’ve learned anything at all in the past two plus decades of lawsuits about copyright and the internet, courts will sometimes bend over backwards to rewrite copyright law to pretend it says what they want it to say, rather than what it does say. If that happens here, however, it would be a huge loss to human creativity.
Filed Under: ai, christopher golden, copyright, inspiration, joseph saveri, llms, mona awad, paul tremblay, richard kadrey, sarah silverman, search engines, training
Companies: meta, openai
People are not machines and law is under no obligation to treat them identically for copyright consideration any more than it would for, say, eligibility for office.
That’s an interesting point, but moot for now: the lawsuit is based in existing laws, which do not differentiate between human and machine uses.
Fair use looks at:
Factor 1: The Purpose and Character of the Use
Factor 2: The Nature of the Copyrighted Work
Factor 3: The Amount or Substantiality of the Portion Used
Factor 4: The Effect of the Use on the Potential Market for or Value of the Work
I could see a court looking at certain things (especially factor 4) differently for a human vs how OpenAI uses their LLM for determining if some uses are fair use, though I personally would still side on it being fair use.
Not relevant with these cases, but you could probably argue that those AIs that focus on an individual’s speech or art style have a negative “Effect of the Use on the Potential Market for or Value of the Work” if you were to treat the output as being transformative of the original rather than a new creation.
Yes they do, and not only do they do so, they differentiate between types of machine use, and not only do they do that, but Techdirt is well aware of it given “copyright doesn’t attach to things a human didn’t make” references are quite common here.
True! Only people can infringe. Machines cannot.
Good thing nobody is suing a machine, then.
And…?
And so “but this is just the same as how humans learn, and since that’s not restricted this isn’t either” is a specious argument.
I don’t think you know what ‘specious’ means.
If this case goes badly, it will be copyright infringement for humans to learn to read by reading about Spot running.
I give it 5 years before every physical book includes a prologue that’s actually just a 10-page EULA about the reader not being a robot, and agreeing to not discuss the events or descriptions of the story without the express written consent of the publisher.
This is way too close to the truth to be effective satire. Sports broadcasts already have legal text forbidding discussions and descriptions, and I’ve seen books with “EULAs” (actually references to online EULAs).
See Will
See Will sue
Sue, Will, sue
Fortunately, Dick and Jane is out of copyright
They aren’t doing the same thing, Mike. Humans learning a technique and AI using its perfect recall to synthesize images and text and music and more are two different processes. We can reach a conclusion that a human who listened to a pirated song or read a pirated book and made something brand new from it is an entirely different ballpark than scraping the Internet and whole books to make LLMs.
And deeper than that, we can legally distinguish between scraping the internet to make a search engine and scraping the internet to make an LLM.
I’m sorry, but with:
-The one article about the writers’ strike from the NFT guy,
-The other one you wrote claiming that AI would give workers without technical skills a leg up and an entryway into the workforce, while ignoring that it’d just replace them later and remove that entryway for future low-skill workers,
-And now this where you trot out the incorrect “Humans and computers are the same” cliché and go into a moral panic that this case could lead to search engines and more being outlawed…
Techdirt is coming off as aloof and distant from the harms and externalities this tech is gonna place on society, and already has in many ways, while simultaneously only giving a ceremonial level of sympathy to the fears of artists and workers.
The way you describe AI generation in your comment makes it sound like you think AI just makes a collage from a huge database of source material… which is not how it works at all. Fearmongering statements born of ignorance won’t help anyone.
Not that hyping up AI out of ignorance will help either, and those are the ones who are funding the marketing not just the research.
There is valid concern that, without humans to keep making more art to riff off of, AI-generated art will start getting repetitive and less valuable. Therefore, you’d think that it’d be dumb to let the people who provide the data and value go, but… I’m sure you read up on the news about Twitter and Reddit on this very website.
I see you believe in the fallacious argument put forward by the copyright lobby, and that is that the creation of art depends on an ability to make money. More art is produced as a hobby than is ever published, and that includes self published works.
That seems to be a problem that solves itself.
Not a good sign for the former NFT grifters running these “AI” companies.
In which case human art becomes more valuable, and AI-generated art – even the art generated from the new human art – continues to become less valuable. Human artists then become more valuable as the AI operators’ time in the spotlight is over.
I don’t think this is how you wanted the argument to go, though.
A necklace of “genuine” or “natural” pearls is worth a lot more in the market than one of “cultured” pearls produced by human intervention. A brooch or ring set with “genuine” sapphires fetches a much higher price than one set with “synthetic” gems. Paintings and prints by an actual artist are more expensive than a mass-produced graphic… And all this is true even if very few people can actually tell the difference.
People do still place a premium on “the real thing”.
Even now, book tours play a significant role. People want to see the author, hear their opinions and their answers to questions, get a feel for the actual person behind the art.
So “Harlequin Romance” products might — or might not (?) — end up being churned out by the next Large Language Model/Artificial Intelligence, but there will still be a special place for the next Maya Angelou, J K Rowling, Paolo Bacigalupi, Cixin Liu, Margaret Atwood…
Some people believe Hollywood films have already gotten repetitive. They’ve been saying it about sequels and remakes, in particular, for decades. And while films do remain unprofitable, that’s due to the hard work of the accountants rather than any lack of revenue.
If its recall is so good, why have there been so many articles about how bad it is at writing factual articles? A marker of AI art is that it doesn’t know how many fingers a human has, or that a loop of hair only has one end attached to the scalp.
More like some authors see an excuse to try and extract money from successful companies.
Interesting. It could be easily seen the other way around.
What market do you suppose AI generated images are infringing upon?
The fact that AI-generated racist and homophobic images in the style of Sarah Andersen exist, trained by incels from 4chan, does not suddenly mean that Sarah Andersen has lost a market for her work.
AI-generated art could disappear tomorrow and it wouldn’t stop hateful idiots or anyone else trying to appropriate her style in that way.
Can we? An AI uses existing material to copy, transform and combine (shoutout to Kirby Ferguson) it into new material based on prompts, or what you could practically call ideas, from a person interacting with it.
That’s how creativity works in a nutshell. It doesn’t matter if it has “perfect recall” (which it clearly doesn’t anyway).
Search engines use LLMs to facilitate searches and results. So what’s the legal distinction?
That’s a strawman and you know it.
Even if that were true, it certainly beats the panic that so many other people are exhibiting when it comes to AI.
Pointing out a bullshit comparison isn’t a strawman.
It’s a comparison that was not being made. Pointing out that the learning patterns of humans and computers are similar is not the same as saying they are equal or even alike.
There’s pointing out that a comparison is bullshit, and then there’s making assumptions about what the other person is saying because you think it’s an “I win because you’re dumb” button.
Mike’s own words:
Funny how you ignore the technical details that actually matter which are different between humans and AI but from a practical standpoint it’s the same thing – learning.
Yes? Humans and machines doing the same thing is not even remotely suggesting that the two of them are alike or the same. That’s like saying men and women both pee and that makes them exactly alike.
Hmm… perhaps what you should want is for copyright laws to change, rather than for Techdirt to play to your biases.
Not to mention fixing the sheer contempt required of a ruling class who think they can replace you with a machine, true or not.
How did the AI read the copyright work without possessing a copy?
Maybe it went to the library?
Or otherwise obtained a copy of the book legally. It does happen, you know.
Yeah, how much could 200,000 books cost? Probably a drop in the bucket compared to OpenAI’s funding. But we’re in jest, of course Sarah Silverman isn’t interested in payment for just 1 copy of her book for the AI to “read”, she expects a percentage cut of everything.
One word at a time…
You are right that an AI cannot own property, but that does not mean that its “owners” may not own (digital) copies of books.
And the way large language models are trained means they can be fed books in fragments, and their training is not spoiled if they miss a chapter of a book here and there, as long as the model knows which words are commonly used in which order.
Firstly, the lawsuit alleges that the AI ‘read’ an illegal copy on the web, not a legally bought copy from a bookshop. Maybe OpenAI will disprove this by pointing to the Kindle receipts or whatever, but somehow I doubt that; it’s not the techbro way.
Secondly, you’re implying that it’s OK to download the work one word/sentence/page at a time, read it, discard it, then do the same for the next one. We have a word for that, it’s called streaming.
We know it is legal to stream a movie from netflix/amazon/apple/whoever. But streaming an illegal copy is still illegal, so the underlying source of the material is important.
Thirdly, I would like a source for the assertion that AI models use streaming to ingest data. It seems to me that using streaming would be incredibly inefficient. Every time you tweak the model and want to see how it does, you’d need to scan the whole internet again. If you want to control what goes into the model, you need to be able to review the data. Data storage is cheap, and the only thing you need to store is text, which is highly compressible.
To get real pedantic, it’s one bit at a time.
The implication here is that the grifters expended a fair amount of effort to transcribe a fragment of a book into a machine-readable form, be it through OCR or manually.
Regardless, the only thing that should be asked is if the rightsholders gave explicit permission for their works to be used in machine learning, and nothing more.
How did you read this web-page, without possessing a copy?
By downloading the words, via a web browser.
I don’t know how things work for you, but for me a copy was transmitted by the rights-holder to my browser across the internet … which I now respectfully utilize in compliance with the accompanying terms of service.
The messed up thing about copyright is that we have no idea who the rights-holder on a work is until the subpoena rolls in.
And even then it may be a happy birthday subpoena, for a public domain work.
The fun-house mirror...
These cases almost sound like parodies of the “it’s patentable because it’s … on a COMPUTER.” cases earlier in the millennium. Somehow a COMPUTER reading and gathering information in a copyrighted work is supposed to be significantly different from a person reading and learning from a reference book and then using that knowledge in their workplace.
We’re seeing the same fallacious nonsense, in the visual arts world, with people screaming that AI art is “stealing” or “infringing” their art, just because the AI is trained by looking at art.
Oddly enough, this whole “violating our copyrights” argument essentially ignores that’s how human writers and artists learn their craft/trade, too.
(I’m sympathetic to how these technological developments are affecting artists and writers — especially those who make or wish to make a living from their work — but I don’t think they’ve thought this through, and wouldn’t like the world that this sort of copyright maximalism would inevitably lead to.)
If you think that machines learn the same way as humans do, and carry their years/decades of life experiences into their work like humans do, then it seems like you really haven’t thought this through.
Obviously, it’s not exact. No one is saying it is.
And that just illustrates the problem discussing technology with people who don’t understand it and who want simple answers/solutions. And when you explain it and the nuances they get upset because they think you are bullshitting them since they didn’t comprehend what you just said.
Sort of true, but not actually relevant.
That the machines ‘learned’ their ‘craft’ by examining the work of those previous authors. That the mechanisms of ‘learning’ (‘training’) are different is quite beside the point.
These “AIs” (I think the name is giving more credit than is due) are not simply reproducing artists’ works, but creating works in the style/a similar style of previous authors. ‘Style’ is not copyrightable — nor are plots, themes, motifs, etc.
This may in the foreseeable future be a threat to the “business model” of actual, human artists, who would like to make a living from their art — but it’s certainly not copyright infringement.
With existing law, if the owner of AI software is made aware that specific data used in training it was infringing, would the owner be required to remove any data acquired from such infringing material from their data set, since such data presumably exists on a computer somewhere? This brings up indexing, but Google certainly removes infringing content from what they index upon receiving DMCA notices.
idk, can one excise the parts of one’s brain affected by reading an infringing copy of a book? It’s about the same thing.
Data on a server seems closer to a Youtube video (that could be taken down if infringing) than memories in the brain.
Because we ain’t competing in writing ability here.
You mean like the time the RIAA used someone’s landscape photograph as a backdrop for a website without permission?
Or like Richard Liebowitz, who represented plenty of “content creators” without their permission and pocketed all the settlement money?
Copyright-types do like to bitch and moan about permissions but they seem perfectly fine with not getting them when it’s convenient.
Honestly, if they’d been a little more careful with how much collateral damage their antipiracy campaigns caused, they’d probably have more people willing to stick their necks out for them.
It is not the same thing at all.
but even just possessing a copy can be illegal
I will happily defer to the more informed opinions around here if there is a substantive answer to this, but isn’t the mere possession of an illegal copy where the violation is? i.e. with the musician inspired by pirated music it does not matter whether what she writes is truly new and original, or if she writes anything at all. The copyright infringement happened with the illegal download of a pirated song, no?
So the lawsuit could have merit if these AI models used copies of books that were not legally obtained. Who cares how it was used, or even if it wasn’t used at all?
Please enlighten me.
I don’t think copyright law generally has a concept of “illegal copy”; rather, it deals with illegal copying—an act, not an object.
In most of the world, whoever is distributing the work would be the one (potentially) infringing copyright, whereas the receiver would not be. I think the USA, at the behest of the film companies, did make downloading also illegal if the uploader isn’t authorized to make the copy. Still, I’ve never heard of anyone being sued for mere possession.
What I’m hearing is that no one should read any books by these authors, because if a reader comes up with an idea or picks up a particular writing style after doing so, that’s copyright infringement, and since copyright infringement is the most heinous crime possible (just ask the people pushing more and more extreme copyright laws), it’s better to avoid even the possibility of that happening.
Professional authors already act like being in the same time zone as someone who said “what if the curtains were green” means that the author is no longer allowed to use green curtains ever, and state potential copyright infringement as the reason for doing so.
Also don’t read any summaries of their work or you’ll be accused of having ingested the full work from a pirated source.
Maybe these guys don’t understand copyright, but this author clearly doesn’t even have a basic understanding of how computers work.
Computers literally CAN NOT READ. Full stop. They also can not “think” or “learn” in any human sense. They are computers.
As a technological fact, “reading” is objectively not what happens when an AI model is trained.
Computers can absolutely “read.” They are ingesting all sorts of works to train the LLM and it is the functional equivalent of reading.
I didn’t say they “think.” But training an LLM is effectively the same as having it read/watch/listen to the content it is training on.
As a programmer, let me tell you that computers absolutely can read and learn, even if it’s not exactly the same way humans do. And yes, when an AI is trained on certain data, it has to read that data. (Seriously, a command common to just about every programming language is “read”.)
As for thinking, no one claims that it does.
Except for Sega with its Dreamcast.
We will see
Whether training the AI constitutes ‘copying’ remains to be seen. There have been cases holding that copying copyrighted content into a computer’s RAM is enough to infringe copyright. It’ll be interesting to see whether AI training utilizes “copying,” for purposes of copyright law, and if so whether a Fair Use argument can prevail.
They’ll lose to implied license, which is why Google can exist.
headline hilarity
A colleague just noted that this headline is peak Masnick, “as if generated by an AI that had been trained on a corpus of Masnick headlines.”
