Judge: Just Because AI Trains On Your Publication, Doesn’t Mean It Infringes On Your Copyright

from the that's-not-how-any-of-this-works dept

I get that a lot of people don’t like the big AI companies and how they scrape the web. But these copyright lawsuits being filed against them are absolute garbage. And you want that to be the case, because if it goes the other way, it will do real damage to the open web by further entrenching the largest companies. If you don’t like the AI companies, find another path, because copyright is not the answer.

So far, we’ve seen that these cases aren’t doing all that well, though many are still ongoing.

Last week, a judge tossed out one of the early ones against OpenAI, brought by Raw Story and Alternet.

Part of the problem is that these lawsuits assume, incorrectly, that these AI services really are, as some people falsely call them, “plagiarism machines.” The assumption is that they’re just copying everything and then handing out snippets of it.

But that’s not how it works. It is much more akin to reading all these works and then being able to make suggestions based on an understanding of how similar things kinda look, though from memory, not from having access to the originals.

Some of this case focused on whether or not OpenAI removed copyright management information (CMI) from the works that they were being trained on. This always felt like an extreme long shot, and the court finds Raw Story’s arguments wholly unconvincing in part because they don’t show any work that OpenAI distributed without their copyright management info.

For one thing, Plaintiffs are wrong that Section 1202 “grant[s] the copyright owner the sole prerogative to decide how future iterations of the work may differ from the version the owner published.” Other provisions of the Copyright Act afford such protections, see 17 U.S.C. § 106, but not Section 1202. Section 1202 protects copyright owners from specified interferences with the integrity of a work’s CMI. In other words, Defendants may, absent permission, reproduce or even create derivatives of Plaintiffs’ works - without incurring liability under Section 1202 - as long as Defendants keep Plaintiffs’ CMI intact. Indeed, the legislative history of the DMCA indicates that the Act’s purpose was not to guard against property-based injury. Rather, it was to “ensure the integrity of the electronic marketplace by preventing fraud and misinformation,” and to bring the United States into compliance with its obligations to do so under the World Intellectual Property Organization (WIPO) Copyright Treaty, art. 12(1) (“Obligations concerning Rights Management Information”) and WIPO Performances and Phonograms Treaty….

Moreover, I am not convinced that the mere removal of identifying information from a copyrighted work - absent dissemination - has any historical or common-law analogue.

Then there’s the bigger point, which is that the judge, Colleen McMahon, has a better understanding of how ChatGPT works than the plaintiffs and notes that just because ChatGPT was trained on pretty much the entire internet, that doesn’t mean it’s going to infringe on Raw Story’s copyright:

Plaintiffs allege that ChatGPT has been trained on “a scrape of most of the internet,” Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.

Finally, the judge basically says, “Look, I get it, you’re upset that ChatGPT read your stuff, but you don’t have an actual legal claim here.”

Let us be clear about what is really at stake here. The alleged injury for which Plaintiffs truly seek redress is not the exclusion of CMI from Defendants’ training sets, but rather Defendants’ use of Plaintiffs’ articles to develop ChatGPT without compensation to Plaintiffs. See Compl. ¶ 57 (“The OpenAI Defendants have acknowledged that use of copyright-protected works to train ChatGPT requires a license to that content, and in some instances, have entered licensing agreements with large copyright owners … They are also in licensing talks with other copyright owners in the news industry, but have offered no compensation to Plaintiffs.”). Whether or not that type of injury satisfies the injury-in-fact requirement, it is not the type of harm that has been “elevated” by Section 1202(b)(i) of the DMCA. See Spokeo, 578 U.S. at 341 (Congress may “elevate to the status of legally cognizable injuries, de facto injuries that were previously inadequate in law.”). Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.

While the judge dismisses the case without prejudice and says they can try again, it would appear that she is skeptical they could do so with any reasonable chance of success:

In the event of dismissal Plaintiffs seek leave to file an amended complaint. I cannot ascertain whether amendment would be futile without seeing a proposed amended pleading. I am skeptical about Plaintiffs’ ability to allege a cognizable injury but, at least as to injunctive relief, I am prepared to consider an amended pleading.

I totally get why publishers are annoyed and why they keep suing. But copyright is the wrong tool for the job. Hopefully, more courts will make this clear and we can get past all of these lawsuits.

Companies: alternet, openai, raw story


Comments on “Judge: Just Because AI Trains On Your Publication, Doesn’t Mean It Infringes On Your Copyright”

38 Comments
This comment has been deemed insightful by the community.
MrWilson (profile) says:

Re:

It’s literally not copyright infringement to train or use these tools. That’s the entire fucking point.

You can argue that it isn’t “art” or that it’s a bad practice or that it’s decreasing business for human artists, but pretending it’s copyright infringement just indicates you’re either ignorant or you have an agenda.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

All the [AI] companies scraping the open web are the massive companies.

FTFY. And even then, I am not convinced that ALL of them are massive, though I will concede the three examples you name. A trivial search yielded articles like “The Best 19 AI Website Scrapers You Haven’t Heard Of”. I’m pretty sure there aren’t 19 distinct massive companies making the scrapers, and I’m even more sure that the folks who make the tools aren’t the only ones using them.

Paul Alan Levy (profile) says:

This is NOT a ruling on whether training infringes copyright

The plaintiff’s ONLY theory was that removal of CMI violated the DMCA. The judge intimates no view about whether training on copyrighted works infringes copyright.

“Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.”

And “Other provisions of the Copyright Act afford such protections [against non-consented use], see 17 U.S.C. § 106, but not Section 1202.”

Those questions remain to be decided in other cases

Anonymous Coward says:

Re: If this was a case about humans…

The plaintiff’s ONLY theory was that removal of CMI violated the DMCA. The judge intimates no view about whether reading copyrighted works for training purposes infringes copyright.

“Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.”

And “Other provisions of the Copyright Act afford such protections [against non-consented use], see 17 U.S.C. § 106, but not Section 1202.”

Those questions remain to be decided in other cases.

Get it yet, maximalist shill?

That One Guy (profile) says:

Sure hope none of those suing learned their craft from anyone else...

‘It was trained on content that might or even did include my stuff, therefore its output from that point forward infringes upon my copyright(s)’ is just the digital equivalent of ‘They learned to read thanks to my books, therefore anything they wrote from that point onwards infringes upon my copyright(s)’.

Anonymous Coward says:

Re: Re:

And if it’s not the same as human learning and inspiration, good luck trying to get it nailed under copyright law, chumley.

The point y’all trying to make is that a human can be found guilty under copyright infringement while trying to call it inspiration or influence. If inspiration can’t be done by a machine… this argument might not work out the way you want it to.

Crafty Coyote says:

Re: Re: Re:

If human learning and inspiration is enough to get into legal trouble, then I can understand why everyone from large corps to artists would be interested in the future of AI to combat copyright. No one wants to get arrested or sued for making art, let’s get these computers involved who can’t be thrown in jail to do this dangerous work for us. Of course, it doesn’t necessarily mean that the people who use these machines are good folk, either.

That One Guy (profile) says:

Re: Re:

Whether or not a profit is being made does not change non-infringing activity to infringing activity, otherwise an author/artist would have to be very careful if they ever tried to charge for their works.

Whether or not a large company is doing something does not change non-infringing activity to infringing activity, because the company isn’t doing squat, the people running it are.

How much is being ‘scraped’ does not change non-infringing activity to infringing activity, otherwise again how many books an author read could be an influence as to whether or not their works were infringing.

If the process of ‘scraping’ a site is resource-intensive enough to cause actual problems to the site’s stability that’s not a copyright issue.

Arianity says:

But copyright is the wrong tool for the job.

Any sort of rights (property or otherwise) are going to have similar issues. The underlying issue is companies having exploitative leverage to pay you pennies for that right. No tool is going to be able to do the job as long as the playing field is tilted towards large companies that have so much market power they can force you to accept underpayment. But the only way you’re really going to fix that is to address the power large companies have.

Really what this tells you is that property rights like copyright aren’t sufficient on their own. If you’re trying to plow a field full of rocks, the issue isn’t that the plow isn’t good at its job. It’s a singular puzzle piece in a bigger picture.

it will do real damage to the open web by further entrenching the largest companies.

There’s an inherent tension between compensation and the group best able to pay said compensation, unless you’re going to hit them with anti-trust or something to keep them out of the sector. Not compensating lowers the barrier to entry, but is fundamentally pyrrhic. You can lower the barrier to entry to manufacturing by not paying your workers, too.

It is much more akin to reading all these works and then being able to make suggestions based on an understanding of how similar things kinda look, though from memory, not from having access to the originals.

The thing is, even if we accept it as reading (which is playing a bit fast and loose with things like ‘understanding’), the underlying concerns by content creators are still there. It just means that existing copyright isn’t designed for the distinction. And “You’re not covered under existing law, too bad” doesn’t really seem super tenable in either the short or long term. Not least because it further entrenches those market power problems, or because it’s likely to lead to extending copyright to reading in some fashion.

Anonymous Coward says:

Re: Re: Scraping = Reading

“I will never understand why people equate reading with scraping”

Because it is?? The mechanisms are different but it’s the same fundamental activity. I don’t understand why (well…I kinda do) people seem so insistent on not understanding it.

Imagine if someone had a super power that allowed them to read and remember the entire library of Congress in a day; then that person wrote a bunch of books and essays influenced by the information they consumed.

Nobody would be accusing that person of “violating copyright” because that would be stupid. Yet people, like almost everyone commenting under this article, continue to claim that’s what would be happening. Preposterous.

Anonymous Coward says:

Re: Re: Re:

Imagine if someone had a super power that allowed them to read and remember the entire library of Congress in a day; then that person wrote a bunch of books and essays influenced by the information they consumed.

Yeah, I’d be fine with that and I’m sure a lot of other people would be fine with that because it would be a human applying his writing skills.

The learning models aren’t humans; they’re products meant to bring profit to corporations and pumping up stock prices with lofty promises. The learning models don’t deserve the same protections as humans.

Please go watch this Jimquisition video about AI.

Diogenes (profile) says:

Re: Re: Re:2 "The learning models aren’t humans;" is irrelevant

Again, this is humans using computers. Whenever anyone implies that the computer is violating copyright they are wrong. Computers cannot violate the law. Only the humans using the computer can. So the question is whether it is a violation of copyright to use a computer to mass read the internet.

MrWilson (profile) says:

Re: Re: Re:2

Yeah, I’d be fine with that and I’m sure a lot of other people would be fine with that because it would be a human applying his writing skills.

And people using an LLM is a human applying a tool that a human made. It’s just a complicated tool. Do you think photographers are artists? A lot of people didn’t consider photography to be art when cameras became more popular and accessible. It was “cheating” to point and click to produce an image. But now we don’t bat an eye at photography as an art form.

The learning models aren’t humans;

Neither are cameras, paint brushes, printing presses, or typewriters. Would you suggest true artists only do finger paintings and true writers never write down their work or would you concede that tools can be used for legitimate purposes?

they’re products meant to bring profit to corporations and pumping up stock prices with lofty promises.

This right here is the problem with your approach. You have a bias that is tainting your understanding of the issue. At its core, your argument is actually against the abuses of technology and money by wealthy exploitative tech bros. The problem is that LLMs and image generators aren’t only used by those people, but you’re arguing against the tools when it’s the humans using them in ways you don’t like that are the problem. And exploitative, greedy tech bros are a problem regardless of what they’re doing or what tools they’re using. You’re likely intentionally blind to the benefits of LLMs because you only see them as tools of oppression, despite the fact that some people have found useful and productive and positive uses for them – including uses that benefit women, minorities, LGBTQ, and neurodivergent individuals.

The learning models don’t deserve the same protections as humans.

This is actually true, but not the way you mean it. They don’t deserve protections, but the humans using these tools do.

Please go watch this Jimquisition video about AI.

That video has the same problem you have. It is an argument against bad people using technology in bad ways. It doesn’t actually provide any useful arguments against the technology itself. It also doesn’t provide any legal or technical critique. It just keeps calling it theft and violation of consent (where consent isn’t always legally required) and saying tech bros can’t do anything good. It’s also full of admittedly intentional hyperbole which makes the video just sound angry rather than well-reasoned.

But you and Sterling are either ignoring or ignorant of the fact that these tools have other uses and are being used by smaller companies and individuals who aren’t exploitative tech bros.

Tanner Andrews (profile) says:

Re: Re: Re:2 profit motive

The learning models aren’t humans; they’re products meant to bring profit to corporations

Yes, but some humans have the same intent. I read stuff and write based on what I have read, and you may be assured that I intend to obtain some advantage, such as money, from doing so. That advantage may come through my corporation, because people give money to the corp and then it gives money to me.

This comment has been flagged by the community.

Anonymous Coward says:

I think that LLMs are plagiarism machines. Not necessarily in the legal sense, definitely in the artistic sense.
I recommend this video
https://www.youtube.com/watch?v=5qoOYrTzOfM
(which probably should have been featured here)
and then suggest thinking of LLMs as a black box that gives you one of the levels of plagiarism, and you can’t know which. Also, this is why I think the outputs should not be copyrightable.

TKnarr (profile) says:

I think the judge is going to run into some issues on appeal surrounding copying. Whether or not the training violates copyright, the AI system had to make a copy of the material on its servers before it could use that material for training. That particular thing comes up elsewhere, where you have to copy something to your system before you can use it and that copying needs permission from the copyright holder separate from any permission to use what you’re copying. While 1201 is supposed to cover that, there have been so many work-arounds established to enforce shrink-wrap licenses, DRM schemes, anti-cheat provisions and so on that I can’t imagine a competent lawyer not being able to establish a loophole big enough to drive a tractor-trailer through.

Though I’d like to see the appeals courts see reason and rule that yes, 1201 does make that copying legal and you can’t make it illegal again just by coming at it from a different angle.

James Burkhardt (profile) says:

Re:

Dear lord, you have no idea how the internet works, huh?

That’s how the internet works. ALL uses of the internet copy data into local storage. Courts have ruled in multiple cases that the copying necessary for the internet to work is legal. There is no difference between temporary transitory storage for a human to view content and temporary transitory storage for a computer to view content.

Tanner Andrews (profile) says:

Re: eyeballs

the AI system had to make a copy of the material on its servers before it could use that material

The lens of my eye had to make a copy of the material which I read before I could use the material.

For people with vision impairments, there may need to be a camera converting images to tactile form. And if I read it on the computer, it is necessary for the computer to have a copy in memory in order to throw it up on the screen.

I give little weight to the argument that a copy must be made as part of the process of using material.

Anonymous Coward says:

A study in confirmation bias

These comments are absolutely fascinating. Almost none of them engage with Mike’s point that scraping/training doesn’t infringe. The one or two that do completely contort the words to fit their own biases.

I suppose this is the hallmark of our times though: have an opinion, then work backwards to make up facts that support it.

