The NY Times Lawsuit Against OpenAI Would Open Up The NY Times To All Sorts Of Lawsuits Should It Win
from the it's-okay-when-we-do-it,-we're-the-new-york-times dept
This week the NY Times somehow broke the story of… well, the NY Times suing OpenAI and Microsoft. I wonder who tipped them off. Anyhoo, the lawsuit in many ways is similar to some of the over a dozen lawsuits filed by copyright holders against AI companies. We’ve written about how silly many of these lawsuits are, in that they appear to be written by people who don’t much understand copyright law. And, as we noted, even if courts actually decide in favor of the copyright holders, it’s not like it will turn into any major windfall. All it will do is create another corruptible collection point, while locking in only a few large AI companies who can afford to pay up.
I’ve seen some people arguing that the NY Times lawsuit is somehow “stronger” and more effective than the others, but I honestly don’t see that. Indeed, the NY Times itself seems to think its case is so similar to the ridiculously bad Authors Guild case, that it’s looking to combine the cases.
But while there are some unique aspects to the NY Times case, I’m not sure they are nearly as compelling as the NY Times and its supporters think they are. Indeed, I think if the Times actually wins its case, it would open the Times up to some fairly damning lawsuits of its own, given its somewhat infamous journalistic practice of summarizing other people’s articles without credit. But, we’ll get there.
The Times, in typical NY Times fashion, presents this case as though the NY Times is the great defender of press freedom, taking this stand to stop the evil interlopers of AI.
Independent journalism is vital to our democracy. It is also increasingly rare and valuable. For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness. This work has always been important. But within a damaged information ecosystem that is awash in unreliable content, The Times’s journalism provides a service that has grown even more valuable to the public by supplying trustworthy information, news analysis, and commentary.
Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service. Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works. Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.
As the lawsuit makes clear, this isn’t some high and mighty fight for journalism. It’s a negotiating ploy. The Times admits that it has been trying to get OpenAI to cough up some cash for its training:
For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple). The Times’s goal during these negotiations was to ensure it received fair value for the use of its content, facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public.
I’m guessing that OpenAI’s decision a few weeks back to pay off media giant Axel Springer to avoid one of these lawsuits, and the failure to negotiate a similar deal (at what is likely a much higher price), resulted in the Times moving forward with the lawsuit.
There are five or six whole pages of puffery about how amazing the NY Times thinks the NY Times is, followed by the laughably stupid claim that generative AI “threatens” the kind of journalism the NY Times produces.
Let me let you in on a little secret: if you think that generative AI can do serious journalism better than a massive organization with a huge number of reporters, then, um, you deserve to go out of business. For all the puffery about the amazing work of the NY Times, this seems to suggest that it can easily be replaced by an auto-complete machine.
In the end, though, the crux of this lawsuit is the same as all the others. It’s a false belief that reading something (whether by human or machine) somehow implicates copyright. This is false. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real world problems.
Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).
But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.
(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that they would be a defendant in this case as well).
Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.
Now, the one element that appears different in the Times’ lawsuit is that it has a bunch of exhibits that purport to prove how GPT regurgitates Times articles. Exhibit J is getting plenty of attention here, as the NY Times demonstrates how it was able to prompt ChatGPT in such a manner that it basically provided them with direct copies of NY Times articles.
In the complaint, they show this:
At first glance that might look damning. But it’s a lot less damning when you look at the actual prompt in Exhibit J and realize what happened, and how generative AI actually works.
What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.
Here’s how the Times describes this:
Each example focuses on a single news article. Examples were produced by breaking the article into two parts. The first part of the article is given to GPT-4, and GPT-4 replies by writing its own version of the remainder of the article.
Here’s how it appears in Exhibit J (notably, the prompt was left out of the complaint itself):
If you actually understand how these systems work, the output looking very similar to the original NY Times piece is not so surprising. When you prompt a generative AI system like GPT, you’re giving it a bunch of parameters, which act as conditions and limits on its output. From those constraints, it’s trying to generate the most likely next part of the response. But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probable response is… very close to the NY Times’ original story.
In other words, by constraining GPT to effectively “recreate this article,” GPT has a very small data set to work off of, meaning that the highest likelihood outcome is going to sound remarkably like the original. If you were to create a much shorter prompt, or introduce further randomness into the process, you’d get a much more random output. But these kinds of prompts effectively tell GPT not to do anything BUT write the same article.
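To make that concrete, here’s a toy sketch of the idea. This is a made-up bigram “model” trained on a single invented sentence, nothing remotely like GPT’s actual architecture, but it illustrates the mechanism: once the prompt is a long verbatim chunk of the training text, the highest-probability continuation is just… more of the training text.

```python
from collections import defaultdict

# Toy stand-in for an LLM: a bigram table built from one (invented) training text.
TRAINING_TEXT = (
    "the times sued openai claiming that chatgpt copied its articles "
    "and the times wants damages for that copying"
)

def build_bigram_model(text):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def most_likely_continuation(model, prompt_words, n_tokens):
    """Greedily pick the highest-probability next word, n_tokens times."""
    out = list(prompt_words)
    for _ in range(n_tokens):
        successors = model.get(out[-1])
        if not successors:
            break  # nothing ever followed this word in training
        out.append(max(successors, key=successors.get))
    return out

model = build_bigram_model(TRAINING_TEXT)
# A long prompt copied verbatim from the training text pins the greedy
# continuation to the training text itself:
prompt = "the times sued openai claiming that chatgpt".split()
print(" ".join(most_likely_continuation(model, prompt, 4)))
```

With a much shorter prompt, or with sampling randomness turned up, the continuation diverges quickly; the “regurgitation” here is a property of how narrowly the prompt constrains the model, which is the same dynamic the Exhibit J prompts exploit at a vastly larger scale.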
From there, though, the lawsuit gets dumber.
It shows that you can sorta get around the NY Times’ paywall in the most inefficient and unreliable way possible by asking ChatGPT to quote the first few paragraphs in one paragraph chunks.
Of course, quoting individual paragraphs from a news article is almost certainly fair use. And, for what it’s worth, the Times itself admits that this process doesn’t actually return the full article, but a paraphrase of it.
And the lawsuit seems to suggest that merely summarizing articles is itself infringing:
That’s… all factual information summarizing the review? And the complaint shows that if you then ask for (again, paragraph-length) quotes, GPT will give you a few quotes from the article.
And, yes, the complaint literally argues that a generative AI tool can violate copyright when it “summarizes” an article.
The issue here is not so much how GPT is trained, but how the NY Times is constraining the output. That is unrelated to the question of whether or not the reading of these articles is fair use. The purpose of these LLMs is not to repeat the content that is scanned, but to figure out the most likely next token for a given prompt. When the Times constrains the prompts in such a way that the data set is basically one article and one article only… well… that’s what you get.
Elsewhere, the Times again complains about GPT returning factual information that is not subject to copyright law.
But, I mean, if you were to ask anyone the same question, “What does wirecutter recommend for The Best Kitchen Scale,” they’re likely to return you a similar result, and that’s not infringing. It’s a fact that that scale is the one that it recommends. The Times complains that people who do this prompt will avoid clicking on Wirecutter affiliate links, but… um… it has no right to that affiliate income.
I mean, I’ll admit right here that I often research products and look at Wirecutter (and other!) reviews before eventually shopping independently of that research. In other words, I will frequently buy products after reading the recommendations on Wirecutter, but without clicking on an affiliate link. Is the NY Times really trying to suggest that this violates its copyright? Because that’s crazy.
Meanwhile, it’s not clear if the NY Times is mad that it’s accurately recommending stuff or if it’s just… mad. Because later in the complaint, the NY Times says it’s bad that sometimes GPT recommends the wrong product or makes up a paragraph.
So… the complaint is both that GPT reproduces things too accurately, AND not accurately enough. Which is it?
Anyway, the larger point is that if the NY Times wins, well… the NY Times might find itself on the receiving end of some lawsuits. The NY Times is somewhat infamous in the news world for using other journalists’ work as a starting point and building off of it (frequently without any credit at all). Sometimes this results in an eventual correction, but often it does not.
If the NY Times successfully argues that reading a third party article to help its reporters “learn” about the news before reporting their own version of it is copyright infringement, it might not like how that is turned around by tons of other news organizations against the NY Times. Because I don’t see how there’s any legitimate distinction between OpenAI scanning NY Times articles and NY Times reporters scanning other articles/books/research without first licensing those works as well.
Or, say, what happens if a source for a NY Times reporter provides them with some copyright-covered work (an article, a book, a photograph, who knows what) that the NY Times does not have a license for? Can the NY Times journalist then produce an article based on that material (along with other research, though much less than OpenAI used in training GPT)?
It seems like (and this happens all too often in the news industry) the NY Times is arguing that it’s okay for its journalists to do this kind of thing because it’s in the business of producing Important Journalism™ whereas anyone else doing the same thing is some damn interloper.
We see this with other copyright disputes and the media industry, or with the ridiculous fight over the hot news doctrine, in which news orgs claimed that they should be the only ones allowed to report on something for a while.
Similarly, I’ll note that even if the NY Times gets some money out of this, don’t expect the actual reporters to see any of it. Remember, this is the same NY Times that once tried to stiff freelance reporters by relicensing their articles to electronic databases without paying them. The Supreme Court didn’t like that. If the NY Times establishes that merely training AI on old articles is a licenseable, copyright-impacting event, will it go back and pay those reporters a piece of whatever change they get? Or nah?
Filed Under: ai, ai training, copyright, fair use, generative ai, reading, restrictive prompts, summarizing, training
Companies: common crawl, microsoft, ny times, openai
Comments on “The NY Times Lawsuit Against OpenAI Would Open Up The NY Times To All Sorts Of Lawsuits Should It Win”
This comment has been flagged by the community.
clearly stolen
I don’t know how anyone in good conscience can defend these LLMs? They clearly stole copyrighted materials to train their models which they are now using to make money. I mean it’s not even iffy that they stole the material.
Re:
Did you read the article:-
But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probable response is… very close to the NY Times’ original story.
That is to say the NY Times spoon fed the AI the output they wanted.
Re:
Even if this was a case of copyright infringement(which it isn’t), that’s not the same as theft.
Saying that the content was “stolen” is simply wrong.
Re:
I get the impression that you’re completely against fair use.
Re: Clearly Stolen?
Stolen? How? I seriously doubt the top tech companies do not have a paid account at the NYTimes – which would imply they are getting the news content they paid for. I presume too that they would be getting the same full high-volume access as any press-clipping organization does. By the time the AI is regurgitating it, it is history… and to do the “tell me the third paragraph” trick, you already need to know the article exists and what it is about.
(Back before the internet, our company’s PR dept. would clip articles relevant to the company out of the major local and national newspapers and magazines, then fax around a multi-page summary of relevant news for most mid-level bosses. From one paid subscription to each newspaper, and something pretty much every company did. 1 print copy, 500 faxes to assorted intra-office readers.)
What I glean from this is that one job AI can do is Administrative AIssistant – “read the newspaper, find me the relevant articles and summarize them for me”. A service that would be personally tailored to the individual.
Re:
You wanna back up your claim, and the claim that your claim is self-evident?
Reading and Copyright
I’m a college professor teaching business courses. Hopefully, students learn stuff in my courses. If they go on and use that knowledge to make money, is this infringement? Can I sue?
/Sarcasm
Re:
They pay you for that knowledge.
Besides, if your student copies a paper you write and publishes it as their paper, you can sue?
Re: Re:
Irrelevant. The knowledge itself isn’t copyrighted to begin with.
Re: Re:
“Besides, if your student copies a paper you write and publishes it as their paper, you can sue?”
only if it’s an exact copy and they’re saying it’s theirs.
Re: Re: Re:
but if they’re quoting from it, I won’t sue.
Re:
iirc, there was a comedy by one of the Greek playwrights or their Roman copiers, about a man who went to a famed jurist and arranged to learn the law from him, on the wager that if he didn’t learn anything, he didn’t have to pay for the course; if he did learn, he would pay. The student ended the course and refused to pay, on a gamble that if he lost the following lawsuit, he had clearly not learnt the law to any great degree, and thus didn’t have to pay, whereas if he did win the lawsuit, he would not have to pay anyway. The judge did not agree.
Using Material is not a Copyright Violation
I’m a college professor teaching business courses. In addition to lectures, I provide my students with video tutorials and PowerPoint slides. Hopefully, students learn stuff in my courses. If they go on and use that knowledge to make money, is this infringement? Can I sue?
/Sarcasm
Re:
What if they start their own course and use your slides with one word changed on each? Can you sue?
Re: Power points
PowerPoints by professors are copyrighted… But you would know this if you were a prof
Just sayin
Sexual Healing
I’ve listened to a lot of Marvin Gaye’s work. I can perform a note perfect cover of one of his better known songs. Given some prompting, I can probably write a song that has a similar feel or that incorporates similar elements.
It’s fucked that all of those things are, if not explicitly illegal in all cases, certainly off-limits without licensing and permission. But, what legal basis makes it ok for AI to do the same on demand? If these AI cases come out in favor of the AI companies, does that make it safe for songwriters to have inspiration again? Can Ed Sheeran make another album without fear? Will it take laundering song ideas through ChatGPT (with attached logs) to show there was no direct copying?
Re: Music has a lot of other protections
The copyright situation around music is a lot more restrictive because there is a lot more actual law around music. Specifically around the rights of song writers having different rights then performers and the recordings of their performances. The result is that the window of fair use and non-infringing creation is a lot more constrained.
Re:
If chat sings a cover, it needs to pay for performance rights (unless the rightsholder isn’t your usual dick). All other cases, nope, not infringement.
I am surprised that you can equate a reporter reading an article for research as the same as an LLM consuming all the media ever created.
Your take seems similar to the “corporations are people too” SC decision. If we decide that LLM/AI gets the same fair usage rights as humans then, for me, that’s only going to be good for the LLM companies.
I don’t think all these articles of yours pushing LLMs as equivalent to humans doing research are going to age well.
Re:
If I learn from a bunch of copyrighted works and make something based on them, that’s okay. However, if I use a machine to do the exact same thing, then it’s a problem? I don’t understand your complaint.
This comment has been flagged by the community.
Re: Re:
You can learn from a bunch of copyrighted works, but if I gave you the prompt of the first 6 paragraphs of your history textbook’s Civil War chapter, you wouldn’t give me the rest of the chapter almost verbatim. You would at least paraphrase. If you did not then you would be infringing too.
Re: Re: Re:
Directly quoting something isn’t theft or copyright infringement. I have no idea how you’re coming up with this.
Re: Re: Re:2
You can quote stuff. You can’t just quote an entire chapter, usually, though. That’s a rather uphill fair use argument.
Re: Re: Re:3
So is the claim that the first six paragraphs would result in the entire chapter verbatim.
Re: Re: Re:
Notably, that’s not what happened here. It gave only one paragraph at a time.
Also, quoting multiple paragraphs isn’t infringement, either.
Re: Re: Re: The user, not the tech
People can photocopy portions of a book under Fair Use. It’s a copyright violation (usually, not always) if they copy the entire thing. Should Xerox be sued or the user? I acknowledge there are differences here, but I’m specifically addressing the paragraph by paragraph portion here.
Re:
Would you rather see the companies sued out of existence before getting off the ground, or not being able to afford training licenses, making the LLMs useless?
Because that’s effectively what would happen if they were to be treated like the NYT wants them to be treated.
Re:
Maybe you don’t understand how LLMs (or analogies) work.
If the NY Times used Chrome to copy a story from elsewhere into the paper, would Google be guilty of infringement, or would the NY Times be guilty of infringement? Why should it be any different if they use an AI tool to do the same thing?
If the generative AI is willing to create nearly 1:1 text (even if it needs very specific prompting) over extensive blocks of text, I suspect that there is a reasonable chance that a judge/jury would find that particular aspect liable for copyright infringement even if the rest gets tossed out. “That is just how our model works” may not end up being convincing in court when you have such exhibits being presented.
This comment has been flagged by the community.
Re:
At the very least it is close enough that, had a different publication published the output from GPT-4, said publication could have been found liable for copyright infringement. So that particular question then boils down to whether Section 230 immunity applies, on which there doesn’t seem to be a clear consensus yet.
Re: Re:
Why are you bringing up section 230? It’s a copyright lawsuit.
Re: Re: Re:
If the complaint is accurate, ChatGPT can reproduce content to such a degree that it would be infringing on the copyright of the NYT. If so, the question is whether ChatGPT is liable (as far as copyright infringement goes) when a user-generated prompt generates potentially infringing content, which would be a Section 230 question.
Re: Re: Re:2
Is ChatGPT the user in your fantasy land? Is not publishing verbatim text moderating?
Re: Re: Re:2
Section 230 has never offered protection against IP claims.
Ergo, it’s irrelevant to this discussion.
Re: Re: Re:
Because you’re dealing with John Smith, Techdirt’s resident anti-230 troll activist.
His main thesis is that Section 230 and other laws that protect platforms make it harder to sue people for copyright infringement based on flimsy, minimal evidence.
Re: Re:
Now replace “it” or “AI” with “a person” in what you wrote. Is your argument still valid?
Re: Re: Re:
If a human were to write what ChatGPT did and distribute it from the New York Times example, yes they would be liable for copyright infringement of the original article.
Re: Re: Re:2
Which makes the user liable. Unless we’re in the business of suing computer manufacturers for their users using copy-and-paste to infringe.
Re:
Merely because some regurgitation of content can be achieved by overly narrow prompting does not make the training itself infringing on its own.
Could OpenAI be guilty of making that overly narrow prompting too easy? Maybe. But it’s that narrow case, not the broader argument that “all training is infringement”.
Re:
Your reasoning is that if the copyright holder produces a copy of their own work, the maker of the tool they used to do that has committed copyright infringement. If you read the article you will see that the NY Times went out of their way to create a copy of their own work.
This comment has been flagged by the community.
Re: Re:
If I give an LLM the prompt of the first 7-8 paragraphs of an article, and it responds with the rest of the article almost verbatim, that’s a pretty good indication that the LLM has internally stored a copy of the article.
Which means it’s already copied the article, even before any prompt. Unless this is fair use, it’s infringement. And if we’re talking about a commercial LLM which copies entire articles from behind a paywall and can regurgitate them to users, the fair use factors aren’t looking so hot.
Re: Re: Re:
That’s not how LLMs work. It’s much like people who can’t recall how a piece of music sounds until someone gives them the first few bars, and can then reproduce more.
Shit, it could just represent how predictable the NYT editorial guide is.
Re: Re: Re:2
From a prompt of:
it produced 186 of the next 187 words, omitting 1.
From a prompt of:
it generated 370 of the next 375 words, omitting 5 consecutive words.
From a prompt of:
it generated the next 284 words and added 10 consecutive words of its own.
It is not plausible that it is doing that by generating this on the fly without what amounts to a copy. You look ridiculous trying to claim otherwise.
Re: Re: Re:3
How many other papers covered those stories with very similar articles? They do not look like stories unique to the NY Times. How many attempts did they make before getting the results that they wanted?
Re: Re: Re:3
Again, that’s not how they work. And also, it’s clear you didn’t read the article.
This is exactly the phenomenon I was talking about: they asked ChatGPT to read something, then tell them what came after their prompt sentence. They loaded it with their information and then asked it to reproduce it.
It’s as surprising as putting a book face down on the bed of a scanner, then getting surprised when a black and white copy comes out when you hit the start button.
This comment has been flagged by the community.
Re: Re: Re:4
It’s clear you only read the article and not the source behind it.
In the example Mike cherry picked, they gave it the first seven and a half paragraphs. In many other examples, like the ones I quoted above, they gave far less; some less than a full sentence.
What is your source for the claim that the Times provided the URL to ChatGPT? Because exhibit J does not make the claim that the Times provided the URL to ChatGPT. I realize this article makes the claim, but I think Mike misread the explanation on page 1. When it says “we provide the following” it means they are providing it in the exhibit, not that they provided it to ChatGPT.
Re: Re: Re:5
The exhibit makes no claims, but the body of the complaint explains their process and points to the exhibit.
Re: Re: Re:6
After reading the 204 paragraph complaint, I see the Times giving ChatGPT article titles, but not URLs. Do you have a specific page or paragraph in the complaint where the Times provides ChatGPT with the URL to the article?
Re: Re: Re:7
Every time they mention “ChatGPT with the Browse with Bing plugin”, which are littered throughout the document, that’s what they are doing. They’ve explicitly asked it to find the article, then stuff that into the model. Then they ask it to provide the text. The final paragraph of page 26 is where they start discussing it as part of their process. The screen shots also show they have set the model for Web Browsing (see page 49 for an example)
They are working very hard to obscure the fact that they are doing this. But it’s right there. They asked Chat GPT to look at the NYT, then provide text from it. It’s not a fucking shock they got text from it.
Re: Re: Re:8
I think your mistake is assuming that they did the exact same thing everywhere; that because they used Browse with Bing in some places they must have used it here.
The bottom of page 29 to the top of page 32 are the “Embodiment of Unauthorized Reproductions and Derivatives of Times Works in GPT Models” allegation. This is where they reference Exhibit J. They claim Exhibit J used the GPT-4 LLM. They do not claim to use Browse with Bing here.
If you look at, say, the section starting at the bottom of page 32, you’ll see references to Browse with Bing, but that’s a different allegation. Page 49 does indeed show them using Browse with Bing (as do many other pages), but again that’s a different allegation, and it’s not an example from Exhibit J. (I think your reference to the bottom of page 26 must have been a typo; that paragraph is discussing Common Crawl in GPT-3.)
If you’re claiming the Times is essentially lying about what they used, well, I don’t have any way of telling whether they are or not, but I’m not sure how you’re reaching that conclusion. But I don’t think you can say that they used Browse with Bing in Exhibit J just because Browse with Bing was referenced in other sections of the complaint.
In any case, they aren’t providing it the URL even in the Browse with Bing examples. It looks up the URL itself based on the provided article title. That’s not quite the same thing, although in those cases there’s no evidence that the LLM is storing the content.
Re: Re: Re:3
Not plausible to you, you mean.
LLMs don’t have storage of the datasets they were trained on, end of story. They can’t produce a copy; they’re recreating it based on, in this case, very narrow parameters, similar to how someone recreates an article, a picture or a song from memory. That’s not infringement.
This comment has been flagged by the community.
Re: Re: Re:4
the language models breakup the word/concept space into a several thousand dimensional hyperspace, whereby the context window represents a 1 dimensional trajectory vector in that hyperspace, and the model predicts where the trajectory will go through concept space, as it generates the next token.
This comment has been flagged by the community.
Re: Re: Re:4
they don’t store data, they train on data to create a several thousand dimensional hyperspace segmented into concept boundaries, and then the context window represents a 1 dimensional trajectory vector through the hyperspace, and the model predicts what the next token will be, extrapolating that trajectory.
Re: Re: Re:4
I am free to memorize all the songs I want. But if I go on stage and sing them, that’s infringement. The argument that I’m “recreating” the song from what I remember and maybe flub a word and sing two notes wrong doesn’t fly. And it doesn’t matter that someone prompted me with the name of the song and the intro.
When ChatGPT can generate over 350 words of an article verbatim based on a prompt which contains the first 150 words but doesn’t include the contents it’s generating, then what it has is functionally a copy. Yes, the prompt is narrow; it’s essentially “give me the next paragraph of this article”. But the fact is that it can give the next paragraph of the article. It has it to give. I don’t care much about how the black box does it; all that matters is that it does in fact do it. Just like how if I encode a song using MIDI I’m still essentially copying the song, even though no sounds are directly copied and all sounds are generated when the file is played and often it sounds a little different.
Re: Re: Re:
Uh, no. They gave it the link to the article, then coaxed it to more or less read the article.
Did you think LLMs have a copy of the entire internet stored per instance? Because damn, that would bloody well be handy. Archival backup of the world! Extreme data compression! If they are really this good, I welcome the AI apocalypse.
Re: Re: Re:2
I don’t think they have a copy of the entire Internet. I do think they have a copy of these specific articles, and I base that on the fact that they spit those articles back.
I realize the example TechDirt gave was one where they fed it the first 7.5 paragraphs, but in many of them the prompt was much shorter. From a prompt of just
it spit out the next 196 words exactly verbatim. You can’t explain that as merely imitating a style. It has the article copied.
Re: Re: Re:3
You think ChatGPT has an entire copy of Common Crawl tucked away inside its executable? That’s hundreds (if not thousands) of petabytes compressed, and it’s growing by hundreds of terabytes a month now. This kind of fundamental misunderstanding continues to be the terminal flaw of those claiming ‘obvious copyright violations’. Further, they refuse to remedy their ignorance.
LLMs can’t physically work the way you are claiming. If it were true, there would only be a handful of ‘victims’ as that’s all that could fit in the footprint of ChatGPT and it would take hours to return responses.
Re: Re: Re:4
I don’t think it has a copy of the entire Internet, or the entire Common Crawl, or even the smaller WebText2. But copies of a whole bunch of individual articles? Even if it doesn’t, there’s no technical reason why it couldn’t. Two hundred thousand articles with an average of five thousand characters each would be just one gigabyte, and using lossy compression would let it be way less. It’s a freaking supercomputer; that much is nothing to it.
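For what it’s worth, the back-of-envelope arithmetic in the comment above checks out. The figures are the commenter’s assumptions, not measurements of any real corpus:

```python
# Storage estimate for the hypothetical stash of articles described above:
# 200,000 articles averaging 5,000 characters each, at roughly one byte
# per character for English text in ASCII/UTF-8, before any compression.
articles = 200_000
avg_chars = 5_000

total_bytes = articles * avg_chars
print(total_bytes / 10**9)  # -> 1.0 (i.e., about one gigabyte)
```

Whether any such stash actually exists inside the model is, of course, the whole dispute; the sketch only confirms that the size claim itself is plausible.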
Re: Re: Re:5
That it could doesn’t mean it does. Also, as the article explains, the NY Times used ChatGPT’s web-browsing extension to give it access to the articles that it duplicated, therefore making its own actions the source of the infringement it is claiming.
Re: Re: Re:6
I dispute the claim that they were using the plugin for Exhibit J.
Re: Re: Re:5
The source for your claim that it has copies of articles is your complete and utter ignorance about how LLMs work.
Re: Re: Re:3
No, ChatGPT doesn’t keep a copy of the article, nor of anything else in Common Crawl. It simply followed the link—like any human would do—and read it.
That it could reproduce more from less when given the link doesn’t change anything here.
Re: Re: Re:4
This is a recent feature, called retrieval-augmented generation, and the lawsuit complains that both the weights of the network and the retrieval-augmented generation are infringing.
Re: Re: Re:5
That has nothing to do with keeping a copy. And they literally requested it to do that by using the Web Crawler plugin/module/whatever.
And yes, we realize their dumb argument is that probabilities and web browsing is infringing. Which is why this is all likely to go very badly for them.
Re: Re: Re:5
None of which affects my point. ChatGPT doesn’t retain a copy at all. That it produced a copy when told to do so is not in dispute here.
Re: Re: Re:4
What’s your source for the claim that they gave it the link?
Re: Re: Re:5
It’s literally in the complaint.
Re: Re: Re:6
It’s literally not, and if I’m wrong then tell me which page/paragraph of the complaint I should be looking at.
Re: Re: Re:3
Wrong. The instance of the AI they were using had access to the articles because the NY Times gave it the URLs to them. They then proceeded to prompt the AI to regurgitate the articles; therefore the copy was the direct result of the NY Times’ actions, and not of any copy embedded in the AI’s model. That is no more infringement by the AI creators than using Photoshop to duplicate an image is infringement by Adobe.
Re: 'Here's my story, now write like me.' AI does so. 'Infringement!'
So long as the plaintiffs ‘forget’ to mention that they trained the AI in their style with a bunch of their own content before prompting it to continue in that style they might just convince a jury, sure.
Re:
They wouldn’t say that. They’d just block the ability to prompt the LLM that way. But it’s the person prompting who is doing the infringement.
Re:
Maybe, but people can usually quote entire paragraphs, right?
Here We Go Again
Just as the music industry went on a crusade 20 years ago against file sharing networks, now the legacy media shall take up arms against the tech industry. The allure of a revenue stream for link taxes, or perhaps an AI training license to search companies, is too tempting to the dying media model.
So my thought is: it’s amazing how much we learn from the things around us, so how do you teach an AI? I mean, how would an AI learn what a phone booth is, or 8-track tapes, or who Michael Jackson was, or how “waaasuuup” was a famous meme and advertisement? Or what the many different voices that Mel Blanc performed sounded like? The difference between a Michael Bay movie and a Woody Allen one? It has to learn by scanning, listening, querying, reading what was written, watching what was aired, etc…
It’s not copyright infringement to learn and be knowledgeable about all of the many “things.” There shouldn’t be a fee imposed every time something or someone attempts to learn about what’s around us: news, commentary, famous figures, artworks, etc…
Trademark dilution
The complaint says that ChatGPT/Bing Chat making stuff up about the Times (“hallucinating”) amounts to trademark dilution. I’m not super familiar with trademark law, but would they have a case here?
Re: hallucinating
The LLM/AI only learns from the best.
Methinks the NYT doth protest too much.
Re:
No. That would be like trying to hold Microsoft Office or Adobe Photoshop liable for third-party infringement. OpenAI did not do the prompting; the user did.
Funnily enough, if the NYT paid someone to enter these prompts, that person accepted the EULA, whose indemnification clause makes them liable for the damages from the NYT’s own lawsuit.
One thing though: the basis for the suit isn’t so much that the model was trained on NYT articles, as that the model reproduces NYT articles verbatim when asked for them. Regardless of anything about training, that right there is copyright infringement: distributing a verbatim copy of the content to someone else without permission of or a license from the content’s creator/copyright-holder.
Re:
As noted in the article, that only happens in special cases where the prompter restricts the probabilistic output to such a narrow set of parameters that the AI effectively produces very similar content.
Re: Re:
I find this argument unconvincing. It’s like saying Napster only reproduces copyrighted songs which the user asks it to, and it’s the user who narrows it down to just one specific item from all the data available through Napster.
You and I might think there’s no harm in that, but it is copyright infringement as far as the law is concerned, even if the law is bad.
Re: Re: Re:
No, it’s not the same thing. Can you ask Napster to reproduce a copyrighted song by sourcing it from a dictionary of notes and sounds? Of course not, you search for a title and get the entire song.
There is no copyright infringement in reproducing something from memory, even if it is verbatim. Publishing it, though, is another matter.
Example: You know someone who has perfect recall, you ask them to recite an article from a newspaper they read this morning, is that copyright infringement?
Example 2: You know someone who has good memory, you ask them to recite an article from a newspaper they read this morning, correcting them several times to get it to be verbatim. Is that copyright infringement?
Example 3: You ask someone to summarize an article from a newspaper they read this morning. Is that copyright infringement?
Re: Re: Re:2
If you could download a song by giving it the first ten notes, that wouldn’t make it not infringement.
Re: Re: Re:3
It’s still not the same thing; you are just downloading an existing copy, regardless of how you found it.
Re: Re: Re:3
That “if” is carrying a lot of weight there, champ.
If I could get the gist of an article by reading the first ten words off Google or Facebook snippets, that doesn’t make it infringement.
The NY Times might not agree with me, and they might not like it, but you don’t get to scream “infringement!” just because you’re not as rich as you think you ought to be.
Re: Re:
There have been examples otherwise. (These are for images and diffusion models; I don’t know of examples for text, but see for instance: https://www.usenix.org/system/files/usenixsecurity23-carlini.pdf or https://arxiv.org/pdf/2212.03860.pdf . There are also citations to replication issues in generative language models in those papers, though.) Anything that hits the right spot in the parameter space will do it. The easiest way to do that is to narrow it down, to make sure you hit it, but there’s nothing particularly special about that.
However, the much bigger problem is, copyright law doesn’t have an exception for using a very narrow prompt. It’s still a copy. From a copyright perspective, you shouldn’t be able to tell GPT to continue copying the article, at all. There’s no defense that you used a narrow prompt to intentionally get it to do so.
Re: Re: Re:
It actually falls back on the user of the tool and what they do with the reproduced content. If I use any kind of tool, regardless of the underlying technology, to reproduce copyrighted material, the liability is entirely on me and not the tool.
Re: Re: Re:2
I’m not sure that’s true for “any kind of tool”. Generally speaking, yes, most tools don’t have the copyrighted material in them. But the way the information is in the model seems like it’s arguably a form of publishing, and the information from the training set is in its parameter space. Publishing the tool itself could be copyright infringement.
I’m not sure there are comparable tools. Your video editing software doesn’t have Disney movies in it, even if you can recreate one yourself within the software. Even in the case of something like a DVR, it’s not like it came prepackaged.
But even if that is the case, that’s still a huge problem, given that OpenAI (and most other large AI players right now) are agreeing to take on users’ legal costs for copyright infringement: https://techcrunch.com/2023/11/06/openai-promises-to-defend-business-customers-against-copyright-claims/ . That might change in the long term, but it seems like right now, even if it’s user liability, they’re expected to take on that liability.
Re: Re: Re:3
The problem with that reasoning is that the information contained in the model isn’t really a copy.
Re: Re: Re:4
That really depends on how you define “copy”. In a lot of ways, it really is a copy, which is why you can do things like pull stuff from the training set (or something very very close to the training set) back out. In order to do that, the information from the original has to be encoded in some way (granted, it is very compressed and lossy) into the parameter space.
And copyright law already has some pretty hefty and wide precedent on how close it can be, to be infringing. It doesn’t have to be an exact copy. For instance, copyright still applies even if something is reproduced in a different medium.
Re: Re: Re:5
Are there other definitions of copy I’m not aware of?
No, nothing is stored as a copy since you need to specifically recreate an approximation of something from the training set using trial and error.
Nothing is reproduced inside a training set.
Re: Re: Re:6
The ones used in copyright law, it seems.
The fact that you have to use trial and error doesn’t mean it’s not stored as a copy.
That just means you need to poke around the parameter space to extract it. That doesn’t mean it’s not in there, it just means you don’t know how to access it off the bat. (And to a significant degree, that comes to the fact we don’t fully understand how those parameter spaces work yet). If you know where to poke, it’ll spit it out.
Yes, it is, otherwise you wouldn’t be able to get it back out, regardless of prompt. That’s a reproduction.
Re: Re: Re:7
Then I guess a dictionary contains copies of a lot of books, because I can use it and a very long prompt to reproduce a book.
Re: Re: Re:8
No, you can’t use a “very long prompt” to reproduce a book with a dictionary. And that is the giveaway difference. The dictionary doesn’t have any information encoded into it in terms of word order etc. You’d have to specify every single word choice and order etc in your “prompt”. That knowledge is entirely coming from you, via the prompt.
And that (among other things) is the giveaway that the LLM version is a copy, and a dictionary isn’t. A dictionary can’t reproduce a book, or even a section of a book, given just a prompt. It requires your input/selection throughout the process.
(And I should mention, I already made this distinction with the video editing software example I gave above. If you created a copyrighted work in video editing software, that’s not “a very long prompt”, either. You’re using a slightly different analogy, but it’s the exact same point.)
The way these LLMs work is not equivalent to a dictionary. They have structure built into their parameter space. You can do something similar-ish to a dictionary, in terms of using pieces of that parameter space to build up something that doesn’t exist in the training set. But you can also just grab a part of the training set directly, which is not something you can do with the dictionary.
Your analogy is actually a very good way to show the distinction, though. You say you’d need a “very long prompt”. Yet the papers I gave did it with very short prompts. Your “very long prompt” encodes a lot of information about which words to use and where they need to go. So if it’s not in their short prompt, where is that information coming from? It would be extremely unlikely for the model to stumble on the correct word order by pure chance. And if that information isn’t in the prompt, the only other place for it to be is in the model. That information has to come from somewhere, and it’s not coming from the prompts in those papers.
Your dictionary example would be comparable if you had an LLM that was trained on individual words (possibly taken from a NYT article). That wouldn’t be infringing, or a copy. But they’re not. Not only that, they can’t be, because they get important information about things like sentence and paragraph structure from longer portions of text. You’d have to find a different way to teach it those things.
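The information-budget point in the comment above can be made concrete with rough numbers. Both figures below are assumptions chosen for illustration, not properties of any real model or dictionary:

```python
# Rough sketch of the dictionary argument: to "prompt" a dictionary into
# producing a 200-word passage, you must select every word yourself, so
# the prompt itself has to carry roughly log2(vocabulary) bits per word.
import math

vocab_size = 100_000        # assumed number of words in the dictionary
words_in_passage = 200      # assumed length of the passage to reproduce

bits_per_word = math.log2(vocab_size)           # ~16.6 bits to pick one word
bits_from_prompt = words_in_passage * bits_per_word
print(round(bits_from_prompt))                  # ~3322 bits of selection
```

The contrast being drawn is that a short LLM prompt carries nowhere near that many bits, so if the exact passage comes out anyway, the missing information must reside somewhere in the model.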
Re: Re: Re:9
And how big a prompt do you need to provide an LLM with to produce an exact copy of a book? In both instances there is a person supplying information to get the result they want even though a dictionary doesn’t contain any information encoded in it about the original.
The word copy still has a specific meaning. If you could, in one simple act, point to specific data in an LLM and say “this is a copy of this,” you would have a point, but that isn’t the case. Data has to be expressed in a specific way to be considered a copy; the result of an image compression algorithm, for example, is a copy, since it still explicitly expresses the data in a specific way that directly maps to the original. An LLM doesn’t have a specific set of data that explicitly maps to an original.
No, you can’t grab a part of the training set “directly” and think you have a copy of something.
Yes, short prompts that were the result of a lot of trial and error. Unless they did it on the first try, they had to supply more and more information to get what they wanted. Was it 2 times? 10 times? 100 times? A 1000 times? Every time they modified the prompt, they at least doubled the amount of supplied information to get what they wanted.
Words can only be placed in a certain order to make sense, and that information is determined by the language used. Applying context on top of that limits what words, in what order, can be expressed; and if you then add in how common it is that certain words are used together in a sentence for a certain context, you have quite a narrow selection of what can be expressed. And then you add in a bit of trial and error to get the prompt you want.
Re: Re: Re:10
You think it’s possible to copy an article from a short prompt, without having a copy of the article? Try it yourself. Your prompt is:
ChatGPT got the next 191 words verbatim. Can you do that? Can you even finish that sentence identical to the original? Could you do it if I gave you 2 tries? 10 tries? 100 tries? A 1000 tries? No peeking at the original, now.
If there are ten spots where you have two different ways to word something, that’s already 1024 ways you could do it. If you have four word choices in those ten spots, there are over a million different ways to write it. Word choice aside, there are different ways the prosecutors could have reacted and different ways to put this call in context.
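The counting in the comment above is easy to verify: with interchangeable word choices at each of several spots, the possibilities multiply.

```python
# With k interchangeable word choices at each of n spots, there are
# k**n possible wordings of the passage.
def wordings(choices_per_spot, spots):
    return choices_per_spot ** spots

print(wordings(2, 10))   # -> 1024 (two choices at ten spots)
print(wordings(4, 10))   # -> 1048576 (over a million with four choices)
```

And this counts only word-for-word substitutions; restructured sentences or reordered clauses multiply the space far further.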
Re: Re: Re:11
Dude, as was pointed out to you multiple times, the LLM was using Bing to look up the articles when instructed to or used a URL given to it by the user. Anyone could easily do the same.
Re: Re: Re:12
The person I was replying to seems to think it was generated on the fly just from the prompt. Well, at least you agree with me that that assertion is ridiculous.
Re: Re: Re:13
It was generated on the fly; the only ridiculous part here is you arguing about something when you don’t have a clue how it actually works, because no LLM stores verbatim copies of text.
Re: Re: Re:14
Weird three-way argument here. bhull242 insists that it’s looking it up and you insist that it’s generated, while I insist it has what amounts to a copy. I think we’ve all said what we’re going to say at this point.
Re: Re: Re:15
Looks that way. You and the authors are really, really putting a lot of weight and faith in copyright law to help you carry this fight for the NY Times’ livelihood.
Re: Re: Re:15
Arianty, typical of maximalists, is crafting their “infringement” narrative out of multiple lies, and it appears that multiple users are addressing different ones, from the “LLMs store training material in compressed form” delusion to the “the LLM reproduced the article verbatim from its stored version of the article” fantasy.
Re: Re: Re:11
WASHINGTON — Attorney General William P. Barr told federal prosecutors in a call last week that they should consider charging rioters and
ChatGPT
I see we’re diving into some legal discussions. What’s on your mind?
Re: Re: Re:12
Hmm. So you’re saying it doesn’t produce that output with the given prompt, and thus the prompt must have been longer than what the Times said the prompt was? That’s certainly relevant.
Re: Re: Re:10
Depends on a lot of things, like the size of the data set, what techniques you use to push the model away from training data, etc. The papers I linked originally talk about that. But it’d be much, much smaller than the ‘prompt’ from a dictionary, and that difference is due to information encoded in the model.
But the reason I linked them is that a) they explicitly call it a copy, and b) they use very very short prompts, which can’t reasonably have that information in it.
Not the same amount of information, though. By orders of magnitude. That’s one of the big giveaways.
Those papers do that. To quote one: “We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.” It also goes into detail on how you can tell it’s a copy. (They also give firmer definitions of what they consider a copy, etc.)
I’m not sure why it would have to be a “direct” or “explicit” map. As long as it’s a mapping, that is in some sense a copy. Any mapping is a copy. An encrypted copy is a mapping, you just need the key. If you can extract the info, without having to basically reproduce it yourself in your input, that’s a type of copy.
Which is why you see those articles I linked calling them copies. (And also, it really depends on what you mean by “direct”. The mathematical transforms LLMs use are in some sense very direct, and very comparable to something like a compression algorithm.)
That said, there’s an even bigger issue when it comes to copyright law. I’m not a lawyer, but there’s a lot of standards on copyright law that go into things like similarity. It doesn’t have to be a direct mapping. There’s been a lot of history of people trying to copy stuff, but tweak it just a bit. It doesn’t evade copyright unless it’s transformative. (And we’ve seen stuff like this covered on Techdirt recently, like the banana case, or the Mickey copyright). Those cases don’t revolve around it being a direct mapping.
Sure. It’s pretty hard to know where exactly you are on a parameter space, especially if it’s not your model. And the technology is very new. We have a very very basic understanding of how it works.
If you knew your model’s parameter space very well, you could land on it exactly. In the same sense, if you knew things like how many words are in a dictionary, you could land on the specific page of a specific word.
I’m not sure why you think they would have to supply more and more information with each guess. They don’t. All they have to do is map out the parameter space, they don’t need to give the model more information.
And you can’t do that with your dictionary example. Even if you know the “perfect” prompt, there is an absolute minimum you can do (which is determined by the word length). Even if you know the dictionary perfectly.
Again, there’s that (large) information gap, even if you do it perfectly with the dictionary.
It’s narrow relative to the total number of permutations of words you can make with no rules. It’s not so narrow that a model could reasonably guess it exactly without having information on the underlying article. That’s actually one of the big clues those papers use to identify when something is being replicated from the training set: when it’s too close to be plausibly built up. There are other clues too, like how it changes based on training data set size.
And there’s an easy way to test this: try to have a model replicate something that’s not in its training set. It won’t be remotely close to that level of exact word choice. It might hit broad themes, style, etc., but that’s it.
As those papers talk about, this whole replicating-from-the-training-set thing is a known problem with AI, including LLMs. (It’s actually a really big problem, because these models tend to overweight training set data, so there’s been a lot of work to push them away from it. If you ask for a “red dress”, it’s often a very good match for the model to just give you a red dress that’s already in the data set, rather than constructing a red dress out of pieces.)
Re: Re: Re:7
I can get a burger out of a bag of groceries, but neither the burger nor the recipe is in the bag.
The bin of lotto balls contains the winning number, but it’s not just embedded in there for you to extract.
Also, you’re just making things up.
When someone asks this:
And you answer this:
You’ve done nothing to answer the question. And given your confident ignorance of LLMs, I’m very skeptical of any claim you make about copyright law.
Re: Re: Re:8
There are two big flaws in this analogy.
One, with something that is in an AI training set, there is a burger in the bag. The question is whether it gave you the burger that is already made, or it mixed the ingredients into a new burger.
There are a lot of tells whether it’s just giving you the burger, which those papers go into detail about. But the short of it is, if it’s just giving you the pre-made burger, you can tell, because it’s going to have a lot of idiosyncrasies that are unique to the pre-made burger, and wouldn’t be replicated if it was making the burger from scratch. You can make a burger from a bag of groceries. You’re not going to make a burger with the exact same shape of lettuce, ruffled exactly the same way.
Second, in order to get a burger out of a bag of groceries, you have to apply a certain amount of information. The bag will never spit out groceries that just happen to be a burger. You have to apply a recipe or something. In your analogy, you’re providing that information. In those examples of short prompts, that clearly isn’t being supplied by the user. So where is it coming from? It has to be embedded in the model, if it’s not coming from you.
Absolutely. But that’s not really comparable to replicating training set data. A winning lotto ball number isn’t in that training data in any meaningful way, in the way that word choice is.
What exactly, do you think I’m making up? I literally gave you 2 professional examples agreeing with me.
It seems to me it pretty clearly wasn’t an actual question, but a rhetorical one. There are a lot of context clues, like the fact that they never gave their actual definition of what “copy” means.
But if you or they want to give what you think a legal definition of copy means, I’m happy to go find cases showing how it’s not accurate.
You say I’m ignorant, and yet, neither you nor the other person has given a concrete example of how I’m wrong with regards to how LLMs work. On the other hand, I’ve given 2 examples from actual experts, agreeing with it. To quote one of them: We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data
(And as I mentioned earlier, they also talk about LLMs specifically: “It is well known that generative language models risk replication from their training set,” etc.)
But they don’t know what they’re talking about either, right?
I mean, you don’t have to take my word for it. I would strongly suggest you look it up yourself. You can very easily find cases on it. For example, this one from the Ninth Circuit:
https://scholar.google.com/scholar_case?case=15721733923548055420&q=+Walker+v.+University+Books,+Inc.+602+F.2d+859&hl=en&as_sdt=6,33
There was also that story recently covered here on Techdirt:
https://www.techdirt.com/2023/06/20/court-finally-dismisses-bananas-copyright-lawsuit-over-bananas-taped-to-walls/
In both cases, a whole lot goes into talking about what counts as a copy.
Re: Re: Re:3
Nor do LLMs have the content they were trained on in them.
Re: Re: Re:4
They do, in fact, have it (albeit in a lossy, compressed form), as the information is built into their parameter space. There’s already been papers showing you can extract training data from models. I linked 2 of them. It’s a huge and known problem. The two I linked were for image diffusion models, but it happens with LLMs as well (as the citations in those papers mention)
It’s not a direct copy in the sense that there’s no .txt sitting in it. But the information is there. That is fundamentally how these models work, and are supposed to work.
Re: Re: Re:5
As someone else has pointed out, they’re not actually extracting the training data. They adjust and refine prompts until they hit an extremely close approximation of the original image.
The model doesn’t have access to a 1:1 copy of the original that it can throw at you.
Re: Re: Re:6
That is extracting the training data (or an extremely close approximation of it). As I mentioned, it is lossy and compressed. But the fact that you can pull out an extremely close approximation tells you that that data is encoded into the parameter space in some way. If it wasn’t, you wouldn’t be able to do that. It’s not necessarily trivial to access, but it’s there.
We don’t totally understand how to access it, but it’s in there. And it’s actually a real problem modern models try to solve: very often models will overfit to training data, and you have to actively push them away from it.
But copyright law still covers that. It doesn’t need to be exactly 1:1. There’s a huge amount of case law and precedent on what is “close enough” to count as a copy, and it’s pretty wide. A jpeg is lossy too, but it doesn’t void a copyright on a png, despite only being a very close approximation.
Re: Re: Re:7
I can take an image, sort each pixel it contains by color and frequency, and store that as data. With the right information I can then later extract the image again from that data. The kicker is, facts and data cannot be copyrighted, but their organization/compilation can.
Anything residing in a training set is just data organized in a particular way, claiming copyright on that data is like claiming copyright on a histogram of an image.
Re: Re: Re:8
I absolutely agree, which is why I gave that software example initially.
The problem is that the data the models are embedding is the organization/compilation. The works aren’t being stored pixel by pixel and then reconstructed by the user. And the giveaway is things like the incredibly short prompts.
Yes, and that organization is copyrightable.
Re: Re: Re:9
They are embedding averages and probabilities. The model is nowhere near large enough to give an encoding of all the data it was trained on.
Also, it is highly relevant that the NY Times enabled web lookup by ChatGPT, and gave it article titles enabling it to look up the articles. In many respects they were using ChatGPT as an inefficient and limited web browser, and that does not give cause for copyright infringement.
Re: Re: Re:7
You’re really hand-waving away the technical details. Calling what’s in an LLM’s neural net “the input, but lossy and compressed” is a broad stretch. It’s fundamentally a portion of the weighting on a particular set of neurons. But even if that were the case, it works against you: it makes the work (the weighting) a transformative use of the input. Clearly fair use.
Frankly, I don’t think fair use has much to do with training LLMs on copyrighted material… because it’s just ‘use’, not even rising to the level of a ‘fair use’ discussion.
Re: Re: Re:8
I am, but it’s pretty clear that most of us here aren’t technical experts, and it’s pretty clear people aren’t getting it on a very basic level. I’m not sure it’s going to be really helpful to get into the weeds on how things like weights work.
It’s simplifying a lot, yes, but there is an underlying point that I think still comes across. If you want the technical version, I think the two papers I linked above are much better resources. (which also call it copying from the training set)
I actually think that’s true for output that isn’t a close copy of training data. I’m not sure that’s true if it just spits out something that is essentially the training set, because the weighting becomes irrelevant.
To the extent that the models are just extracting ‘metadata’ in the sense of sentence structure, punctuation, etc, I do think it’s fair use. It’s only an issue if you can get something that is substantially very close to the original.
And on a broad level, I don’t think that’s too different from other forms of technology. Copyright already covers whether e.g. recreating something in a new medium is transformative or not.
Re: Re: Re:
ChatGPT doesn’t have agency; it is literally a slave to the user, so it’s the user who infringed and not ChatGPT.
Re: Re:
I seem to recall a certain Techdirt writer claiming that Media Matters, when it created specific and unusual conditions much like the NYT did here, was acting perfectly within the bounds of normalcy in bruiting the outputs because the manipulated inputs did indeed produce the exact, and accurate, result Media Matters wanted for its own publicity and financial purposes.
And that was perfectly fine according to Techdirt because those dishonestly-built outputs were in fact what occurred and were described accurately.
NYT does the very same thing, and that is somehow wrong now.
Curated inputs designed to “force” or “trick” an algorithm into producing specific results and those desired outputs are either acceptable and relevant (though not necessarily dispositive) in a legal or ethical context or they are not.
Unless of course the principle is “that which hurts the enemy is acceptable, but the same which hurts the friend is not”. Which is of course totally cool as a “principle”, although entirely ethically bankrupt.
Re: Re: Re:
For your analogy to hold, it would have to be Media Matters suing X over their false claim that major advertisers categorically do not appear on the pages of nazi-adjacent trolls.
Or it would be OpenAI suing the NYT over the true claim that ChatGPT regurgitates NYT content when given a tailored prompt to do so.
Only thing is, the shoes are on the other feet. X is suing Media Matters, and the NYT is suing OpenAI. And as it is, OpenAI isn’t saying that ChatGPT doesn’t regurgitate its training content near verbatim when given a tailored prompt; it’s arguing that it’s legal for it to do so. (Just like it’s legal for X to run major advertisers’ content next to Nazi-adjacent troll dung, absent a contractual agreement not to do so.)
Re: Re: Re:
Your analogy makes no sense at all.
I’m not saying it’s wrong for the NYT to create unusual conditions. It’s free to do so.
I’m saying it’s wrong for the NYT to create those unusual conditions and THEN SUE OPENAI FOR CLAIMED COPYRIGHT INFRINGEMENT as a result of those unusual conditions.
This is the opposite of the Media Matters / X scenario. If OpenAI sued the NYT saying “these prompts were not normal” then it would be the same and I would just as readily criticize OpenAI for such a dumb lawsuit.
But in this case, it’s not just the NYT creating these conditions, it’s doing so and then using them to pretend it proves copyright infringement (something that is wholly unrelated to Media Matters / X).
Your gotcha is… um… stupid.
Re: Re: Re:2
You’re right that this suit is very different from X/MM but one way that they’re similar is that “this behavior only shows up for people trying to make it happen” is a complete non sequitur. NYT is very explicit that their concern is that ChatGPT will happily comply with users’ deliberate requests for free copies of Times stories, not that it’ll serve them up at random to people minding their own business. And just as with X placing ads next to Nazi content, it’s an open question how common that behavior is, relative to all day-to-day use of the service, but that it happens at all seems pretty well established.
And the idea that these passages are being “generated” from scratch by the user prompts rather than retrieved from storage is outright absurd to the point of shilling. To be sure, it’s a very esoteric format that has to be decompressed and interpreted by advanced software in order to recreate the original, but it does substantially reproduce the original, which is all but impossible without a stored copy to draw on.
That’s because the content of a newspaper article (or indeed any human-crafted writing) isn’t deterministic; you can’t recreate Pete Wells’ review of Guy Fieri’s American Kitchen, down to the punctuation and the dishes he ordered, from general principles of journalistic style and English usage. All the more so because none of the “show more of this article” prompts say anything about what the content of those next paragraphs should actually be. The user isn’t dictating anything about word choice, statistics to focus on, names of people to feature, etc, but ChatGPT nonetheless gets all of them, exactly as they were presented in the original copyrighted source. That sure sounds like it’s coming from the software and not the user!
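The “esoteric format” argument above can be made concrete with a compression analogy (my own illustration, not a claim about how LLM weights actually work): a compressed file looks nothing like the text it encodes, yet nobody would deny it contains a copy, because the original is recoverable verbatim.

```python
import zlib

# Placeholder text standing in for a copyrighted article (not a real quote).
original = ("The review asked, with every fiber of its being, how the "
            "restaurant's food differed from any other food.")

# The stored form is opaque bytes that look nothing like the text...
stored = zlib.compress(original.encode("utf-8"))
assert stored != original.encode("utf-8")

# ...but the original can be recovered verbatim, so a copy plainly exists,
# however "esoteric" the storage format is.
recovered = zlib.decompress(stored).decode("utf-8")
assert recovered == original
```

Whether model weights are closer to this (a recoverable encoding) or to aggregate statistics is precisely what the two sides of this thread dispute.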
Re: Re: Re:3 What features are specific to an LLM
The underlying question is what features are specific to an LLM vs. what features (like URL retrieval) OpenAI added to extend the usability of the product, and whether that feature violates copyright law.
LLMs don’t store the articles. They store the equivalent of a map, from which they can recreate a “route”: their best estimation of an article as the best response to a general user query.
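The “map and route” metaphor above can be sketched with a toy word-level model (my own illustration; real LLMs learn continuous weights over tokens, not literal count tables). The stored object is follower statistics, and generation walks the most probable “route” through them:

```python
from collections import Counter, defaultdict

# Toy "map": for each word in the training text, how often each other word
# followed it. This is only a cartoon of an LLM, which learns continuous
# weights over tokens rather than literal count tables.
training_text = (
    "the model stores statistics about words . "
    "the model does not store the articles themselves ."
).split()

next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(training_text, training_text[1:]):
    next_word_counts[current_word][following_word] += 1

def most_likely_route(start, steps):
    """Greedily follow the most frequent continuation: the 'route'."""
    words = [start]
    for _ in range(steps):
        followers = next_word_counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# "the" was followed by "model" twice and "articles" once, so the route
# from "the" begins "the model".
assert most_likely_route("the", 1) == "the model"
```

Note that nothing in this scheme rules out the route reproducing training text verbatim when the statistics are dominated by a single source, which is the crux of the NYT’s exhibits.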
Re: Re:
A thought here, does the infringement happen when the AI is trained? Or does it happen when the AI produces output?
The case for infringement occurring when the AI is trained doesn’t seem to be very good; after all, the content the AI is trained on is input. It’s broken down, turned into snippets with degrees of probability attached, and stored away in the black box, practically unseen.
All the examples here, however, are of the AI’s output, and that output is driven by the prompt. The LLM has no intentionality; it can’t form the criminal intent to violate copyright, and the people developing the AI have little control over the prompts end users will put in, except via the overly broad content filters we’ve seen in use, which can be easily evaded with clever wording.
That intent comes from the prompt, and the prompt comes from a human input. So it seems that if someone should be held responsible for AI copyright violations, it’s the person directing the AI to violate the copyright.
So… if the NYT shoehorns ChatGPT into regurgitating its own material, the person committing copyright infringement is…
…The NYT.
This leads me to think that Section 230 might be a good model for liability for copyright infringement by AI. If you can show that the AI is being induced to engage in word-for-word repetition of copyrighted material, like we see in the NYT examples, and not just uncopyrightable “styles” or “feels”, then hold the people inducing the AI to repeat the material responsible, not the developers/trainers of the AI.
Re: Re:
Mind you that an AI cannot be a copyright infringer, for the same reason that an AI cannot own a copyright: copyright infringement is an act of people, usually the people who input a prompt, and who agreed in the TOS not to violate copyright/trademark and to completely indemnify and hold harmless OpenAI for actions by the user.
Re:
After the NY Times fed large parts of the same articles to the AI as input data. That is, the AI copied what the NY Times gave it, and the Times carefully prompted it to reproduce articles. What they did not do is provide a generic input and have the AI produce a copy of their article directly from its internal model; rather, they carefully provided data and prompts to get the output they desired. That is, the output comes from the input they gave it, and not from some copy of a NY Times article in the AI’s training set or models.
What they did is close to using Photoshop to create a copy of somebody’s work, and then claiming that Photoshop’s developers have committed copyright infringement.
Re: Re:
This is a misleading summary. The NY Times pasted one half of their article into the AI, and the AI reproduced (lossily) the other half.
If you want an analogy with Photoshop, it’s like somebody took a cropped version of the Mona Lisa and used Photoshop to fill in the other half, and Photoshop did so by nearly exactly reproducing the other half of the original painting, based on copying (in a convoluted way) from photographs of the Mona Lisa.
The fact that there’s only one way that an intelligent, capable artist who has studied the Mona Lisa would fill in the rest of the painting, if you asked a human to do it, does not count for anything as far as copyright law is concerned. If a tool can reproduce half of a copyrighted work on demand, I don’t see how it matters that it only does so when fed the other half as input; the half that it reproduces is still subject to copyright.
Re: Re: Re:
I think the point that others are making is that the first half is subject to copyright as well. So the operator of the LLM is either licensed for the reproduction or not. So if there’s no way to get the output, without the protected input, fundamentally nothing has changed.
Re: Re: Re:
They also provided a link to the original article…
Yeah, I’m not buying that natural-language constraints are so effective and the style-copying so accurate that the LLM happened to generate paragraphs that are verbatim from the article because “they were the most probable”.
We know for a fact from studies like “Extracting Training Data from Diffusion Models” (Carlini et al. 2023) that you can, well, extract training data from some models.
I’m NOT pro-intellectual property, but the blatant LLM bias in these weekly AI articles feels disingenuous.
Re:
They don’t extract the training data; they extract an approximation of the data:
When working with high-resolution images, verbatim definitions of memorization are not suitable. Instead, we define a notion of approximate memorization based on image similarity
And to accomplish that, they essentially brute-forced their way to prompts that generated an approximation of the original images, all of which were specifically selected because they had the most duplications in the training data; see Fig. 5 in the paper:
Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times; although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images.
As always, if you ask your question in such a way that there are only limited ways to answer it, you will get some of the answers you wanted.
Re: Re:
🤓
NY Times journalists are pro-woke, pro-war scum. That paper can’t go out of business fast enough.
(But of course with Trump’s inevitable (re)election, all the progtard, libtard, #resistance fanatics will take out new NYT subscriptions and the paper will have another good few years. 🤮)
“Mike Masnick’s analysis of copyright and AI imagines that anything has been settled and if he just insists on something enough, the courts will go along with him.”
Possible alternative headline.
'Don't even get me started on all those teachers teaching infringement skills...'
Message received, I suppose: the NY Times considers reading copyright infringement, so whatever you do, do not read the NY Times.
It is plagiarism - pure and simple
This article is so flawed in its arguments and so one-sided. There is hardly any balancing of arguments, exactly what a good journalist would be expected to do, whether at the NYT or Techdirt.
The issue is similar to someone who scraped public review data from Amazon and then asked the system to build a review summary for a product. Lo and behold, that review summary is likely going to use data that was scraped from Amazon. But under what conditions did Amazon agree to such use of its review data? Just because something is readable or available doesn’t mean that it can be regurgitated somewhere else, where that information is no longer under the control of the original source. Said differently, are Techdirt and all tech blogs now not going to charge product manufacturers for using quotes from these sites in the marketing of products? I’m certain that Techdirt and other tech websites will want to continue to be able to charge manufacturers for the use of such quotes, because the content cannot simply be regurgitated without proper dues or permission.
Re:
You know… aside from the many times that Masnick has, in fact, told people that if they genuinely wanted, they could copy and reproduce Techdirt articles wholesale and put them up elsewhere.
Because pro-copyright people genuinely think that threatening to “pirate” Techdirt’s content, one-to-one, just to make a point about online revenue, is something that will somehow browbeat Masnick into compliance. Except that nobody’s ever followed through on that threat, because it’s a shit idea that would never be profitable.
Also trying to use Amazon reviews as evidence is incredibly laughable, considering that so many of those these days are so blatantly bought and untrustworthy, they’ve got as much credibility as a wet fart.
Re:
It isn’t remotely similar, but I guess you don’t actually understand what an LLM actually is and how it functions.
Underlying data and information isn’t copyrightable, you know. If someone doesn’t want to expose this data, don’t expose it.
I think that you don’t understand the limitations of copyright law. Sure, there are litigious assholes who will try to sue people who quote them even though quoting something is fair use – because without it even NY Times would have problems publishing an article.
Re:
If I remember correctly and am not hallucinating, the entire transformer architecture history started when OpenAI started scraping Amazon reviews, was able to get reasonable-looking text out of it, and was able to manipulate a single neuron (the sentiment neuron) to shift the outputs from positive reviews to negative reviews.
Okay, but how does that matter to the question of whether it’s reproducing copyrighted material without permission? If a straightforward prompt makes ChatGPT reliably spit out full articles, that seems worse for liability than if it only does so at random.
Re:
Let me ask you something but you can only answer with a yes or no: Have you stopped killing puppies yet?
The above is an example of how you constrain the output to get an answer you want.
Re: Re:
The flaw in this logic is that I don’t have to obey the constraints of your question. I can refuse to answer with a “yes” or “no” and just tell you to jump in a lake. ChatGPT only follows the constraints of its prompt because OpenAI programmed it that way. It’s not a law of the universe that when you feed the headline and introduction of a news article into the software it responds with the rest of the text. It works like that because the designers trained their fancy autocomplete on a bunch of copyrighted material they don’t have permission to republish, and chose to allow it to reproduce portions of the training set verbatim. That was a bad choice!
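A toy version of the “chose to allow verbatim reproduction” point (my own sketch, not OpenAI’s architecture): when the statistics behind “predict the next word” are dominated by a single document, completion and reproduction become the same operation.

```python
# Placeholder words standing in for a memorized article (not a real quote).
article = ("guy fieri asks you with every fiber of his being how is this "
           "meal different from any other dinner").split()

# "Training": record the continuation observed after each word. In this toy
# each word appears only once, so every continuation is fully determined.
follows = {}
for current, nxt in zip(article, article[1:]):
    follows.setdefault(current, []).append(nxt)

def complete(prompt_words, max_words=100):
    """Greedy completion: emit the (here, only) observed next word."""
    out = list(prompt_words)
    for _ in range(max_words):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(options[0])
    return out

# Feed only the opening three words and the rest comes back verbatim.
assert complete(article[:3]) == article
```

Real models are far more diffuse than this lookup table, but the extraction literature shows a similar effect emerging for heavily duplicated or distinctive training passages.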
Re: Re: Re:
That’s like saying that Microsoft Word only creates infringing works because Microsoft programmed it to allow people to “prompt” it to follow the user’s instructions, in this case to infringe on copyrights.
Re: Re: Re:2
If Microsoft introduced a new tool in Word that can reproduce the full text of previously published, copyrighted works from its own database with minimal prompting from the user, with no permission from the original authors, that function would infringe copyright! Users who hit the “piracy” button would also be liable in this hypothetical, in the same way that the person who downloads a pirated movie and the person who uploads it have both committed a violation. I don’t see how this is a difficult concept.
Re: Re: Re:3
Yes, and in both instances those are real people, as opposed to a machine like youtube.com or microsoft word
Re: Re: Re:4
OpenAI is also a real corporation owned by real people. ChatGPT is a tool of copyright infringement, not the infringing party.
Re: Re: Re:5
There’s no such thing as a “tool of copyright infringement”. The people using a tool in a particular way are the infringers.
Re: Re: Re:3
Read the actual story: the NY Times fed the AI large chunks of the articles it wanted it to reproduce, and carried out a lot of prompting; therefore it is disingenuous to claim that the story came from the AI’s training set holding a copy of the story. If OpenAI can get a full accounting of how the output was produced, they might have a case for claiming fraudulent claims against them.
Re: Re:
There are other, non-verbal responses to that and similar questions, such as a punch to the nose.
URL in prompt?
Are you sure they included the URL in the prompts? In the intro it does not say that …
Re:
That’s what the complaint says.
Re: Re:
It wasn’t. What page or paragraph of the complaint says the URL was provided?
Re: Re: Re:
It’s quoted in the article.
NY Times lawsuit against OpenAI
Isn’t the real issue monetization of the content? If all OpenAI did was “read” and “summarize” or “edit” the content, then there wouldn’t be a problem. The problem occurs once OpenAI “publishes” the summary/edit by making it available to the public and then monetizes the summary/edit with ads.
Re:
That makes the problem more severe, but republishing somebody else’s paywalled material for free still deprives them of the chance to monetize it, even if you aren’t profiting directly.
Re: Re:
Which deprives the government and society of the tax revenue, while rewarding those who break rules others must follow.
Re: Re: Re:
Nice try, John Smith.
But we’ve been over this whole “but free things means the government can’t tax people” argument before, and it’s a weak one. The government isn’t suddenly impoverished because people get free gifts, or a shopkeeper gives out goods for free.
Interesting article and I always like Mike’s take. But I think a couple of important factors are glossed over here.
First and foremost, copyright is the right to not have others copy your work (to oversimplify). This applies not just to literal copying of physical texts, but also to copying data (software, music files, and, yes, written works). The NYT here is not simply saying “this is a mechanism to get around our paywall”; the NYT in its complaint is saying that the output of a significant portion of an article is a reproduction of copyrighted work. Again, copyright protects against the copying of works, and yet the NYT shows that ChatGPT can and will copy NYT works by outputting near-verbatim portions of their articles. Regardless of how you trigger that reproduction, it is nevertheless a reproduction of NYT works (at least that is the NYT’s theory). Under that theory, the prompt is immaterial. As far as I know, copyright law does not include any conditions on how reproduction is triggered, so the trigger is irrelevant to the analysis.
Moreover, even before we get to the outputting to the user of a portion of an NYT article, the NYT is saying that OpenAI makes copies of their articles to build the training dataset. Again, this is copying through and through. It does not matter that it is in a back-end database or that it is taken from Common Crawl (which may be fair use itself, but I doubt that fair use transfers to an ultimate beneficiary; for example, I cannot take a Techdirt article from the Internet Archive and publish it on my own webpage as my own). So there are two alleged instances of reproduction here, a legal right reserved only to the owner of the work and their licensees. Thus, all this discussion about prompting and “reading” is, again, irrelevant, because copyright pertains to the copying and reproduction, not to the methods of reproduction nor the purpose that the reproduction serves (except, as I will discuss next, in limited exceptions where it is deemed fair use).
This brings me to point two – fair use. This is a trickier subject here, but fair use typically applies to: commentary, search engines, criticism, parody, news reporting, research, and scholarship. I am not sure any of those apply to building a database of training data or to reproducing portions of articles to users. Nevertheless, the factors for determining fair use are: the purpose and character of the use; the nature of the copyrighted work; the amount and substantiality of the portion used; and the effect of the use upon the potential market for or value of the copyrighted work. I will not analyze each of these here, but will just point out that this is why the NYT goes into how valuable the NYT is for training the LLM and, in turn, how valuable the LLM is when trained on NYT works. It is also why they go to such lengths to show that significant portions of the articles can be reproduced, and that their paywall can be circumvented by cleverly prompting the model. I have no idea how a court would come down on this, but it is more than “the NYT doesn’t understand LLMs.” In fact, I completely expect that people will use ChatGPT to try to read articles from the NYT and other paywalled sources without paying, people do that stuff all the time and will use whatever tools are available.
We may not agree with the potential effects of this lawsuit, but there is more here than “the NYT is greedy” (though that may be true as well).
Re:
Well, point one is actually two points, but I’ll bite.
1A) Copyright law most certainly considers the conditions of the duplication when determining whether a violation occurred. For one, there is the exception allowing non-profit educational institutions to display copyrighted works in classrooms. A teacher pressing ‘copy’ for the classroom will get a substantially different outcome in a court case than I would.
1B) Incidental copying, when pursuing fair use, has long been held to be non infringing. Otherwise Proxies, thumbnails, caches, etc. would all be in trouble.
2) Here’s my hot take: Training LLMs on copyrighted works doesn’t even rise to the level of fair use. Meaning, it’s just use of the material and a discussion of exemptions for fair use will be a challenge because it’s not relevant.
Now, there could be some discussion on how the works were sourced, such as pirating vs. just acquiring a copy through a library or e-book store. Or the discussion of ‘getting around a paywall’ by just using a different part of the website without the pay wall.
Of all the arguments the NYT makes, it
Re: Re:
Just to quickly align our frameworks, in order to have “fair use” of a copyrighted work, you must first perform an otherwise impermissible copying. Fair use is an exception to the rule, copyright is the rule. So to help clarify things, I am going to call an act of copying a work a “candidate infringement”, and once a candidate infringement is discovered, it must be determined whether fair use applies to determine whether the candidate infringement is indeed infringement or not.
I say this because my first point is all about discovering that candidate infringement. You are definitely right, the conditions of copying a work are important. But the example you describe is a fair use question (educational purposes is a recognized fair use exception to copyright). My first point was more that the technology you use to copy something is not important to the analysis. If I published and sold a book of someone else’s poems without permission, that would be copyright infringement regardless of whether I photocopied each poem, scanned each poem, dictated, transcribed, or reproduced from memory. Similarly with your 1B: it is a fair use argument and does not go to the question of whether there is even a candidate infringement to which the fair use exception needs to be applied.
Nevertheless, you might be right. I am not sure if this could be considered fair use or not. On the one hand, OpenAI is making copies of works without permission in order to enrich the value of their commercial activities, which does not seem like it would weigh in their favor. But on the other hand, like you say, the copying they are doing is not really to reproduce the work for consumption by an end user. But I think that is the conversation that needs to be had and “the LLM just reads it” is neither technically nor legally accurate I don’t think.
Finally, I don’t think it matters whether OpenAI “pirated” the articles or acquired them from a legitimate source. Indeed, all of the sources listed in the complaint appear “legitimate” in that the NYT is not arguing that those services themselves committed any copyright infringement. And that makes sense because, back to the book of poems example, it shouldn’t matter whether I got the poems from the Pirate Bay or from a local library of which I am a member; copying and selling another’s work is not permissible in either case. Same with whether people can “get around the paywall” in other ways. Just because there are other ways to access the work does not suddenly make copyright infringement okay. Just because someone can get those poems from the library or from the Pirate Bay on their own doesn’t make it okay for me to infringe those copyrights.
At the end of the day, the copyright act explicitly forbids making copies of work (including an article). So it seems to me that the threshold question of whether OpenAI’s activities are candidates for copyright infringement is pretty clearly settled. We have at least two instances of making a copy without authorization of the author. So the discussion really comes down to, in the LLM training instance, whether it is fair use like for a search engine, and in the prompted reproduction instance, whether that is fair use or even OpenAI’s responsibility since they are not the ones doing that prompting (i.e., who is the actual copier in this instance, OpenAI or the prompter). There are policy arguments that can go either way, but how LLMs feature in copyright infringement and what that means for the copying itself, seems like a pretty new question.
Re: Re: Re:
Wrong, it is applying analysis to presumably legally acquired copies. It is doing sophisticated word-use statistics to determine how often a given word follows some other word or short sequence of words. If what it is doing is infringement, then a critic’s reading of a book, watching of a film, or listening to music for the purposes of writing a critique would also be infringement.
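The “word use statistics” described above can be made concrete with a minimal sketch (mine, and deliberately simplified: real LLMs encode these regularities as neural-network weights over token contexts, not explicit tables):

```python
from collections import Counter

# Count how often each word follows a short context (here, two words),
# using a public-domain sample. This is the statistical flavor of what a
# language model learns, though real models use learned weights instead
# of explicit count tables.
text = ("to be or not to be that is the question "
        "whether tis nobler in the mind to suffer").split()

context_size = 2
counts = Counter()
for i in range(len(text) - context_size):
    context = tuple(text[i:i + context_size])
    following = text[i + context_size]
    counts[(context, following)] += 1

# In this sample, ("to", "be") is followed once by "or" and once by "that".
assert counts[(("to", "be"), "or")] == 1
assert counts[(("to", "be"), "that")] == 1
```

On the critic analogy: whether building such statistics at industrial scale is more like “reading” or more like “copying” is the legal question here, not the technical one.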
Re: Re: Re:2
They do indeed apply a number analyses and transformations, but they are not doing that on someone else’s servers. You need look no further than WebText and WebText2 (feel free to google), an OpenAI produced dataset of text scraped from URL links identified on Reddit. You can even download this dataset, which includes the text of the webpages of those URLs (https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/ even states that WebText2 includes “the text of web pages from all outbound Reddit links from posts with 3+ upvotes”). This is a literal copy regardless of what they do to process it afterwards.
Re: Re: Re:3
With AI, the data set is not the model that is made available to users; it is the data that is analyzed to build the model. The article then gives the model size, and goes on to list the data sources used to train the model. It mentions that some models memorize what they were trained on, but then so do humans, and in the context of AI, that is memorization and not a straight copy.
Re: Re: Re:4
I understand that the model is not the dataset, but to train the model you nevertheless need the dataset, which means creating your own copy of the dataset in your own database (typically). You literally need the text of that article imported into your own system, which is a copying of another’s work, and thus potentially an infringement of copyright. I am not saying that the act of training the model or the model itself are copying. The copying of the dataset is the copying, which is then used for training. And we know OpenAI does this because they published a paper. See Section 2.2 of https://arxiv.org/pdf/2005.14165.pdf where they say things like: “(1) we downloaded and filtered a version of CommonCrawl…” and “…including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time…”. “Downloading” and “scraping” are instances of copying content from another source, such as, it seems, NYT articles.
Again, how fair use ends up applying to this, I am not sure, but it seems to me that taking the text of articles from websites and saving them for training an ML (or any other purpose) fits within the language of the copyright act that forbids anyone but the owner “to reproduce the copyrighted work in copies”.
Re: Re: Re:5
Which every student and professional does while studying and exercising their specialty expertise. Also, infringement is making copies available to other people, which is why people using torrents are the ones sued for infringement, but people getting copies from, say, YouTube need not care.
Re: Re: Re:6
Downloading content for studying and professional development falls within fair use. That doesn’t make downloading for any personal use in the US okay; it makes downloading for educational uses okay. The point I keep trying to make is that there is a difference between the rule and the exception. Fair use is the exception. Don’t treat it as the rule. Rather, fair use defines a small realm of situations that are excepted from the rule, so you cannot extend it to all situations. For example, being allowed to copy copyrighted content for educational purposes does not mean copying to train a for-sale AI service is obviously okay.
Indeed, the LLM is not a person, in law or in fact. It is not helpful to keep equating the computer to a human. They are not the same. Moreover, the LLM is a product provided to users in exchange for value, and is thus a commercial use of the content in the training set. This potentially, though I am not sure it actually would, removes this type of use from the fair use exception, because use for commercial activities typically weighs pretty heavily against fair use.
Regarding torrents or downloading YouTube videos: you absolutely can get sued because it absolutely is copyright infringement. You probably won’t though cause it’s not worth the effort for anyone to start going after individual users. It would cost a lot to find out who the people doing the downloading are and a lawsuit would cost more than they could collect in damages. Instead, they go after the makers of the tools (e.g., Napster) as a contributory infringer to both go after the root of the problem and go after the people with the money. But make no mistake, downloading copyrighted material without permission is copyright infringement (unless it falls within fair use, which again, is a particular exception that does not cover all personal uses of copyrighted works).
Google does the same thing and they exist.
The NYT paywall isn’t that secure (don’t want to say how).
Not sure if that’ll be a factor or not.
The author in my opinion misrepresents the stance of the NY Times here.
The Times’s issue isn’t just that someone or something is reading materials. The Times takes issue with a group intentionally collecting, en masse, large amounts of its data (in this case articles) with the intention of distributing them, packed into a product, to third parties engaging in commercial activities without paying a licensing fee. The Times fears that doing this damages the potential market for its future and past articles.
In essence, the Times fears that Common Crawl is acting as a fence for other groups to infringe on its copyrighted works.
Factors of Fair Use:
Re:
Except that ChatGPT isn’t doing that. It’s accessing the articles from the NYT’s own website using a given URL and/or Bing.
ChatGPT retains no copies of any materials contained in Common Crawl. That’s not how LLMs work. The Times’s fears are entirely irrational.
Except it clearly does have that data, because it can reproduce original training material when prompted with only identifiers for that material rather than its substantive content. Claiming otherwise goes into “who are you going to believe, me or your lying eyes?” territory.
Re:
It can produce output very similar to its training material, which is like asking you to create an image of a school bus, and then claiming copyright infringement because there are very similar photos and images of school buses on the Internet.
Also, with the NY Times, they prompted the AI with copies of their own content, and then complained that its output was almost identical. They did not show that the AI could produce their content if it was not prompted by their content, but rather simply asked a more generic question.
Re: Re:
It’s more like asking you to create an image of a school bus, and you get every detail down to the faces of the kids in the window and the license plate the same as someone’s photo, because you once looked at that photo and remembered what it looked like a little too well.
Re: Re:
It’s not like that at all! They didn’t ask for something generic and then compare the output to the universe of existing content in that genre, they asked for a copy of a specific work and got it, nearly word for word. A “school bus” version would require prompting the bot with either the title of a specific photo of a bus, or the first 10-15% of the bitmap, and getting back a complete image that’s nearly (though not entirely) pixel perfect, including key artistic elements that are unique to the photo being requested but not implied by the title or the initial sample the user provided as a prompt.
Re: Re: Re:
They put a lot of effort into getting the AI to produce a copy of the work, after they fed it most of the work as input. Having trained the AI in their style, they act surprised that it could emulate it.
Re: Re: Re:2
“most” of the work? Tell me, from a prompt of
how many more words would you expect to get right based on “style” alone? Because ChatGPT got the next 121 words.
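To make the memorization point concrete, here is a toy sketch. It is a word-bigram model, nothing like a real transformer, and the training “article” here is invented, but it illustrates the argument: a model that stores only word-to-word statistics, not the document itself, can still regurgitate a training text verbatim when greedily prompted with its opening words.

```python
from collections import defaultdict

def train_bigram(text):
    """Learn word-to-next-word transitions (a crude stand-in for a language model)."""
    words = text.split()
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def complete(model, prompt, n_words):
    """Greedily extend the prompt with the most common next word."""
    words = prompt.split()
    for _ in range(n_words):
        candidates = model.get(words[-1])
        if not candidates:
            break
        words.append(max(set(candidates), key=candidates.count))
    return " ".join(words)

# A single training document: the model has seen this exactly once.
article = ("times journalists bear witness to conflict "
           "and disasters providing accountability for power")
model = train_bigram(article)

# Prompting with just the first two words reproduces the rest verbatim,
# even though the model only stores word transitions, not the text.
print(complete(model, "times journalists", 20))
```

The model “contains” no copy of the article, only a transition table, yet a two-word prompt is enough to walk the route back to the full text. That is the shape of the dispute: whether such a reconstruction counts as a stored copy.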
Re: Re: Re:3
They enable the Browse with Bing plugin, which then goes out and finds the article based on their prompt.
Re:
Yeah, it has that data… after being asked to use Bing to find it.
Again, LLMs don’t contain any copies of the training data used to train them.
Re:
By that “logic” every web browser is infringing for doing exactly the same thing.
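The browser analogy can be made concrete. Both a browser and a “Browse with Bing”-style tool do roughly this: fetch a page and extract its visible text, either to display it or to hand it to the model. A minimal stdlib sketch (the page content here is made up; a real tool would fetch it with an HTTP GET rather than use a hardcoded string):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text from an HTML page, as any browser must."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Stand-in for a fetched page; real retrieval happens at query time.
page = "<html><body><h1>Headline</h1><p>First paragraph of the article.</p></body></html>"

parser = TextExtractor()
parser.feed(page)
print(" ".join(parser.chunks))
```

The point of the analogy: the text is retrieved from the publisher’s own server at query time, not pulled out of the model’s weights.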
What features are specific to an LLM
The underlying question is which features are intrinsic to an LLM versus which features (like URL retrieval) OpenAI added to extend the product’s usability, and whether that added feature violates copyright law.
LLMs don’t store the articles. They store the equivalent of a map, from which they can recreate a “route”: their best estimation of an article, which is their best response to a given user query.
hearsay evidence
I think the NY Times giving the court transcripts of results that were exact copies won’t cut it in court. How does the court know that text was the actual output and not doctored by the NYT? The evidence is basically hearsay: “I prompted this and this is what it said.” The NYT will need to query the LLM in court and show in real time that it can produce infringing text.