Judge: Just Because AI Trains On Your Publication, Doesn’t Mean It Infringes On Your Copyright
from the that's-not-how-any-of-this-works dept
I get that a lot of people don’t like the big AI companies and how they scrape the web. But these copyright lawsuits being filed against them are absolute garbage. And you want that to be the case, because if it goes the other way, it will do real damage to the open web by further entrenching the largest companies. If you don’t like the AI companies, find another path, because copyright is not the answer.
So far, we’ve seen that these cases aren’t doing all that well, though many are still ongoing.
Last week, a judge tossed out one of the early ones against OpenAI, brought by Raw Story and Alternet.
Part of the problem is that these lawsuits assume, incorrectly, that these AI services really are, as some people falsely call them, “plagiarism machines.” The assumption is that they’re just copying everything and then handing out snippets of it.
But that’s not how it works. It is much more akin to reading all these works and then being able to make suggestions based on an understanding of how similar things kinda look, though from memory, not from having access to the originals.
Some of this case focused on whether or not OpenAI removed copyright management information (CMI) from the works its models were trained on. This always felt like an extreme long shot, and the court finds Raw Story’s arguments wholly unconvincing, in part because the plaintiffs don’t show any work that OpenAI distributed without their copyright management info.
For one thing, Plaintiffs are wrong that Section 1202 “grant[s] the copyright owner the sole prerogative to decide how future iterations of the work may differ from the version the owner published.” Other provisions of the Copyright Act afford such protections, see 17 U.S.C. § 106, but not Section 1202. Section 1202 protects copyright owners from specified interferences with the integrity of a work’s CMI. In other words, Defendants may, absent permission, reproduce or even create derivatives of Plaintiffs’ works – without incurring liability under Section 1202 – as long as Defendants keep Plaintiffs’ CMI intact. Indeed, the legislative history of the DMCA indicates that the Act’s purpose was not to guard against property-based injury. Rather, it was to “ensure the integrity of the electronic marketplace by preventing fraud and misinformation,” and to bring the United States into compliance with its obligations to do so under the World Intellectual Property Organization (WIPO) Copyright Treaty, art. 12(1) (“Obligations concerning Rights Management Information”) and WIPO Performances and Phonograms Treaty….
Moreover, I am not convinced that the mere removal of identifying information from a copyrighted work – absent dissemination – has any historical or common-law analogue.
Then there’s the bigger point, which is that the judge, Colleen McMahon, has a better understanding of how ChatGPT works than the plaintiffs and notes that just because ChatGPT was trained on pretty much the entire internet, that doesn’t mean it’s going to infringe on Raw Story’s copyright:
Plaintiffs allege that ChatGPT has been trained on “a scrape of most of the internet,” Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.
Finally, the judge basically says, “Look, I get it, you’re upset that ChatGPT read your stuff, but you don’t have an actual legal claim here.”
Let us be clear about what is really at stake here. The alleged injury for which Plaintiffs truly seek redress is not the exclusion of CMI from Defendants’ training sets, but rather Defendants’ use of Plaintiffs’ articles to develop ChatGPT without compensation to Plaintiffs. See Compl. ¶ 57 (“The OpenAI Defendants have acknowledged that use of copyright-protected works to train ChatGPT requires a license to that content, and in some instances, have entered licensing agreements with large copyright owners … They are also in licensing talks with other copyright owners in the news industry, but have offered no compensation to Plaintiffs.”). Whether or not that type of injury satisfies the injury-in-fact requirement, it is not the type of harm that has been “elevated” by Section 1202(b)(1) of the DMCA. See Spokeo, 578 U.S. at 341 (Congress may “elevate to the status of legally cognizable injuries, de facto injuries that were previously inadequate in law.”). Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.
While the judge dismisses the case without prejudice and says they can try again, it would appear that she is skeptical they could do so with any reasonable chance of success:
In the event of dismissal Plaintiffs seek leave to file an amended complaint. I cannot ascertain whether amendment would be futile without seeing a proposed amended pleading. I am skeptical about Plaintiffs’ ability to allege a cognizable injury but, at least as to injunctive relief, I am prepared to consider an amended pleading.
I totally get why publishers are annoyed and why they keep suing. But copyright is the wrong tool for the job. Hopefully, more courts will make this clear and we can get past all of these lawsuits.
Filed Under: ai, cmi, copyright, dmca, generative ai, reading
Companies: alternet, openai, raw story


Comments on “Judge: Just Because AI Trains On Your Publication, Doesn’t Mean It Infringes On Your Copyright”
After decades of invoking theft equivalence (“you wouldn’t steal a car”), it doesn’t come as any surprise that artists are using amoral computers who can’t know right from wrong to infringe copyright. You can’t jail a computer so the leverage is out the window.
Re:
Thank you for telling me you haven’t read the post without telling me you haven’t read the post.
Re: thats not the point though
“You can’t jail a computer so the leverage is out the window.”
That’s irrelevant. This has never been about computers doing stuff. It’s about people doing stuff “using a computer”. Somehow that point keeps being missed!
Re: Re: What if they did not use a computer
They read lots of things, then they thought about things, then they wrote something.
Lock them up!!! How can they possibly write something that’s not a copy of my work after reading my work?
Re: Re: Re:
Like this. Here’s a sentence using all of the words you wrote but it’s not a copy of your work:
After reading lots of things, they thought about how they could possibly write something that was not a copy of my work, then wrote something and locked it up.
Re:
It’s literally not copyright infringement to train or use these tools. That’s the entire fucking point.
You can argue that it isn’t “art” or that it’s a bad practice or that it’s decreasing business for human artists, but pretending it’s copyright infringement just indicates you’re either ignorant or you have an agenda.
Re:
Thanks for telling everyone here that you didn’t read the article without saying you didn’t read the article.
The fuck?
All the companies scraping the open web are the massive companies.
OpenAI, Microsoft, Google, etc.
Sure, some of the companies suing over AI usage are also massive companies.
But none of the companies scraping the entire web to train AI are small companies.
Re:
FTFY. And even then, I am not convinced that ALL of them are massive, though I will concede the three examples you name. A trivial search yielded articles like “The Best 19 AI Website Scrapers You Haven’t Heard Of”. I’m pretty sure there aren’t 19 distinct massive companies making the scrapers, and I’m even more sure that the folks who make the tools aren’t the only ones using them.
This is NOT a ruling on whether training infringes copyright
The plaintiff’s ONLY theory was that removal of CMI violated the DMCA. The judge intimates no view about whether training on copyrighted works infringes copyright.
“Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.”
And “Other provisions of the Copyright Act afford such protections [against non-consented use], see 17 U.S.C. § 106, but not Section 1202.”
Those questions remain to be decided in other cases.
Re: If this was a case about humans…
Get it yet, maximalist shill?
Sure hope none of those suing learned their craft from anyone else...
‘It was trained on content that might or even did include my stuff, therefore its output from that point forward infringes upon my copyright(s)’ is just the digital equivalent of ‘They learned to read thanks to my books, therefore anything they wrote from that point onwards infringes upon my copyright(s)’.
Re:
For-profit machines owned by massive companies designed to scrape the entire Internet many, many times over to where it can have the same effect as a Denial of Service attack on some sites =/= human learning and inspiration.
Re: Re:
And if it’s not the same as human learning and inspiration, good luck trying to get it nailed under copyright law, chumley.
The point y’all trying to make is that a human can be found guilty under copyright infringement while trying to call it inspiration or influence. If inspiration can’t be done by a machine… this argument might not work out the way you want it to.
Re: Re: Re:
If human learning and inspiration is enough to get into legal trouble, then I can understand why everyone from large corps to artists would be interested in the future of AI to combat copyright. No one wants to get arrested or sued for making art, let’s get these computers involved who can’t be thrown in jail to do this dangerous work for us. Of course, it doesn’t necessarily mean that the people who use these machines are good folk, either.
Re: Re:
Whether or not a profit is being made does not change non-infringing activity to infringing activity, otherwise an author/artist would have to be very careful if they ever tried to charge for their works.
Whether or not a large company is doing something does not change non-infringing activity to infringing activity, because the company isn’t doing squat, the people running it are.
How much is being ‘scraped’ does not change non-infringing activity to infringing activity, otherwise again how many books an author read could be an influence as to whether or not their works were infringing.
If the process of ‘scraping’ a site is resource-intensive enough to cause actual problems to the site’s stability that’s not a copyright issue.
Re: Re:
TIL: Browsing the Internet with more than one tab open is the same as DDoS.
Re: Re: Re:
The Game UI Database, among other sites, has faced slowdowns because of the scrapers. Are you reloading a webpage 200 times a second?
https://www.reddit.com/r/gamernews/comments/1fcmq1g/this_was_essentially_a_twoweek_long_ddos_attack/
Any sort of rights (property or otherwise) are going to have similar issues. The underlying issue is companies having exploitative leverage to pay you pennies for that right. No tool is going to be able to do the job as long as the playing field is tilted towards large companies that have so much market power they can force you to accept underpayment. But the only way you’re really going to fix that is to address the power large companies have.
Really what this tells you is that property rights like copyright aren’t sufficient on their own. If you’re trying to plow a field full of rocks, the issue isn’t that the plow isn’t good at its job. It’s a singular puzzle piece in a bigger picture.
There’s an inherent tension between compensation and the group best able to pay said compensation, unless you’re going to hit them with anti-trust or something to keep them out of the sector. Not compensating lowers the barrier to entry, but is fundamentally pyrrhic. You can lower the barrier to entry to manufacturing by not paying your workers, too.
The thing is, even if we accept it as reading (which is playing a bit fast and loose with things like ‘understanding’), the underlying concerns by content creators are still there. It just means that existing copyright isn’t designed for the distinction. Which “You’re not covered under existing law, too bad” doesn’t really seem super tenable in either the short or long term. Not least because it further entrenches those market power problems, or that it’s likely to lead to extending copyright to reading in some fashion.
Re:
Agreed. I will never understand why people equate reading with scraping the Internet many times over in ways that are impossible for humans to do.
Re: Re: Scraping = Reading
“I will never understand why people equate reading with scraping“
Because it is?? The mechanisms are different but it’s the same fundamental activity. I don’t understand why (well…I kinda do) people seem so insistent on not understanding it.
Imagine if someone had a super power that allowed them to read and remember the entire library of Congress in a day; then that person wrote a bunch of books and essays influenced by the information they consumed.
Nobody would be accusing that person of “violating copyright” because that would be stupid. Yet people, like almost everyone commenting under this article, continue to claim that’s what would be happening. Preposterous
Re: Re: Re:
Yeah, I’d be fine with that and I’m sure a lot of other people would be fine with that because it would be a human applying his writing skills.
The learning models aren’t humans; they’re products meant to bring profit to corporations and pump up stock prices with lofty promises. The learning models don’t deserve the same protections as humans.
Please go watch this Jimquisition video about AI.
Re: Re: Re:2 "The learning models aren’t humans;" is irrelevant
Again, this is humans using computers. Whenever anyone implies that the computer is violating copyright they are wrong. Computers cannot violate the law. Only the humans using the computer can. So the question is whether it is a violation of copyright to use a computer to mass read the internet.
Re: Re: Re:2
And people using an LLM is a human applying a tool that a human made. It’s just a complicated tool. Do you think photographers are artists? A lot of people didn’t consider photography to be art when cameras became more popular and accessible. It was “cheating” to point and click to produce an image. But now we don’t bat an eye at photography as an art form.
Neither are cameras, paint brushes, printing presses, or typewriters. Would you suggest true artists only do finger paintings and true writers never write down their work or would you concede that tools can be used for legitimate purposes?
This right here is the problem with your approach. You have a bias that is tainting your understanding of the issue. At its core, your argument is actually against the abuses of technology and money by wealthy exploitative tech bros. The problem is that LLMs and image generators aren’t only used by those people, but you’re arguing against the tools when it’s the humans using them in ways you don’t like that are the problem. And exploitative, greedy tech bros are a problem regardless of what they’re doing or what tools they’re using. You’re likely intentionally blind to the benefits of LLMs because you only see them as tools of oppression, despite the fact that some people have found useful and productive and positive uses for them – including uses that benefit women, minorities, LGBTQ, and neurodivergent individuals.
This is actually true, but not the way you mean it. They don’t deserve protections, but the humans using these tools do.
That video has the same problem you have. It is an argument against bad people using technology in bad ways. It doesn’t actually provide any useful arguments against the technology itself. It also doesn’t provide any legal or technical critique. It just keeps calling it theft and violation of consent (where consent isn’t always legally required) and saying tech bros can’t do anything good. It’s also full of admittedly intentional hyperbole which makes the video just sound angry rather than well-reasoned.
But you and Sterling are either ignoring or ignorant of the fact that these tools have other uses and are being used by smaller companies and individuals who aren’t exploitative tech bros.
Re: Re: Re:2 profit motive
Yes, but some humans have the same intent. I read stuff and write based on what I have read, and you may be assured that I intend to obtain some advantage, such as money, from doing so. That advantage may come through my corporation, because people give money to the corp and then it gives money to me.
I think that LLMs are plagiarism machines. Not necessarily in the legal sense, definitely in the artistic sense.
I recommend this video
https://www.youtube.com/watch?v=5qoOYrTzOfM
(which probably should have been featured here)
and then suggest thinking of LLMs as a black box that gives you one of the levels of plagiarism, and you can’t know which. Also, this is why I think the outputs should not be copyrightable.
Re:
That video definitely shouldn’t have been featured here, Tom. And the outputs aren’t copyrightable if identified as such according to Copyright Law and the Copyright Office. Plagiarism is not a legal concept and has no relevance here.
I think the judge is going to run into some issues on appeal surrounding copying. Whether or not the training violates copyright, the AI system had to make a copy of the material on its servers before it could use that material for training. That particular thing comes up elsewhere, where you have to copy something to your system before you can use it and that copying needs permission from the copyright holder separate from any permission to use what you’re copying. While 1201 is supposed to cover that, there have been so many work-arounds established to enforce shrink-wrap licenses, DRM schemes, anti-cheat provisions and so on that I can’t imagine a competent lawyer not being able to establish a loophole big enough to drive a tractor-trailer through.
Though I’d like to see the appeals courts see reason and rule that yes, 1201 does make that copying legal and you can’t make it illegal again just by coming at it from a different angle.
Re:
Dear lord, you have no idea how the internet works, huh?
That’s how the internet works. ALL uses of the internet copy data into local storage. Courts have ruled in multiple cases that the copying necessary for the internet to work is legal. There is no difference between temporary transitory storage for a human to view content and temporary transitory storage for a computer to view content.
Re:
Okay, rereading, I better understand your claim, but you seem to think a lot more of the text of the internet is protected against copying with DRM than I’d imagine is really true.
Re:
As a savant, whenever I read a book, I copy it to my brain for years, which means it isn’t there transiently as it is for most people. With this fact in mind, please explain how people feeding input into LLMs are committing copyright infringement and I’m not? Go on, I’ll wait.
Re: eyeballs
The lens of my eye had to make a copy of the material which I read before I could use the material.
For people with vision impairments, there may need to be a camera converting images to tactile form. And if I read it on the computer, it is necessary for the computer to have a copy in memory in order to throw it up on the screen.
I give little weight to the argument that a copy must be made as part of the process of using material.
Re: Re:
As a blind person (not “visually impaired”, for the love of God), I’ve yet to come across a camera that can convert anything to Braille without extra hardware, so I instead have it convert the text to audio through TTS software.
A study in confirmation bias
These comments are absolutely fascinating. Almost none of them engage with Mike’s point that scraping/training doesn’t infringe. The one or two that do completely contort the words to fit their own biases.
I suppose this is the hallmark of our times though: have an opinion, then work backwards to make up facts that support it.
I'm happy that reading a copyrighted book doesn't make me a felon.
I’ve been arguing that if AI is only trained on non copyrighted information, it’s not going to learn much beyond the bible.
I’m happy to see a case of sanity in the AI debate.
Re:
TIL: The works of Shakespeare are under copyright.
Ha ha ha ha. If your AI model does reproduce enough or the whole of anything covered by copyright, good luck getting out of this in front of a proper court.