The NY Times Lawsuit Against OpenAI Would Open Up The NY Times To All Sorts Of Lawsuits Should It Win

from the it's-okay-when-we-do-it,-we're-the-new-york-times dept

This week the NY Times somehow broke the story of… well, the NY Times suing OpenAI and Microsoft. I wonder who tipped them off. Anyhoo, the lawsuit in many ways is similar to some of the over a dozen lawsuits filed by copyright holders against AI companies. We’ve written about how silly many of these lawsuits are, in that they appear to be written by people who don’t much understand copyright law. And, as we noted, even if courts actually decide in favor of the copyright holders, it’s not like it will turn into any major windfall. All it will do is create another corruptible collection point, while locking in only a few large AI companies who can afford to pay up.

I’ve seen some people arguing that the NY Times lawsuit is somehow “stronger” and more effective than the others, but I honestly don’t see that. Indeed, the NY Times itself seems to think its case is so similar to the ridiculously bad Authors Guild case that it’s looking to combine the two.

But while there are some unique aspects to the NY Times case, I’m not sure they are nearly as compelling as the NY Times and its supporters think they are. Indeed, I think if the Times actually wins its case, it would open the Times up to some fairly damning lawsuits of its own, given its somewhat infamous journalistic practice of summarizing other people’s articles without credit. But, we’ll get there.

The Times, in typical NY Times fashion, presents this case as though the NY Times is the great defender of press freedom, taking this stand to stop the evil interlopers of AI.

Independent journalism is vital to our democracy. It is also increasingly rare and valuable. For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness. This work has always been important. But within a damaged information ecosystem that is awash in unreliable content, The Times’s journalism provides a service that has grown even more valuable to the public by supplying trustworthy information, news analysis, and commentary.

Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service. Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works. Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.

As the lawsuit makes clear, this isn’t some high and mighty fight for journalism. It’s a negotiating ploy. The Times admits that it has been trying to get OpenAI to cough up some cash for its training:

For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple). The Times’s goal during these negotiations was to ensure it received fair value for the use of its content, facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public.

I’m guessing that OpenAI’s decision a few weeks back to pay off media giant Axel Springer to avoid one of these lawsuits, and the failure to negotiate a similar deal (at what is likely a much higher price), resulted in the Times moving forward with the lawsuit.

There are five or six whole pages of puffery about how amazing the NY Times thinks the NY Times is, followed by the laughably stupid claim that generative AI “threatens” the kind of journalism the NY Times produces.

Let me let you in on a little secret: if you think that generative AI can do serious journalism better than a massive organization with a huge number of reporters, then, um, you deserve to go out of business. For all the puffery about the amazing work of the NY Times, this seems to suggest that it can easily be replaced by an auto-complete machine.

In the end, though, the crux of this lawsuit is the same as all the others. It’s a false belief that reading something (whether by human or machine) somehow implicates copyright. It doesn’t. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real-world problems.

Part of the Times complaint is that OpenAI’s GPT LLM was trained in part on Common Crawl data. Common Crawl is an incredibly useful and important resource, run by some great people, that is now coming under attack in this lawsuit. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators.
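To get a sense of just how open that repository is: anyone can query Common Crawl’s public CDX index to see what it has captured. Here’s a rough sketch in Python (the index endpoint is real, but the crawl ID below is just an example; IDs rotate with each monthly crawl):

    # Sketch: querying Common Crawl's public CDX index for captures of a site.
    # "CC-MAIN-2023-50" is an example crawl ID; a new one is published for
    # each crawl, so substitute a current one.
    import json
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
        params={"url": "techdirt.com/*", "output": "json", "limit": "5"},
        timeout=30,
    )
    for line in resp.text.strip().splitlines():
        record = json.loads(line)  # one JSON object per captured page
        print(record["timestamp"], record["url"])

No login, no special access: just an open index of the public web that researchers and innovators can build on.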

But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.

(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise, I imagine that Common Crawl would be a defendant in this case as well.)

Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not an act that copyright restricts. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are rights that simply do not exist.

Now, the one element that appears different in the Times’ lawsuit is that it has a bunch of exhibits that purport to prove how GPT regurgitates Times articles. Exhibit J is getting plenty of attention here, as the NY Times demonstrates how it was able to prompt ChatGPT in such a manner that it basically provided them with direct copies of NY Times articles.

In the complaint, they show this:

[Image: excerpt from the complaint showing GPT-4’s output alongside the original Times article]

At first glance that might look damning. But it’s a lot less damning when you look at the actual prompt in Exhibit J and realize what happened, and how generative AI actually works.

What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.

Here’s how the Times describes this:

Each example focuses on a single news article. Examples were produced by breaking the article into two parts. The first part of the article is given to GPT-4, and GPT-4 replies by writing its own version of the remainder of the article.

Here’s how it appears in Exhibit J (notably, the prompt was left out of the complaint itself):

[Image: excerpt from Exhibit J, showing the prompt given to GPT-4 and its continuation]

If you actually understand how these systems work, the output looking very similar to the original NY Times piece is not so surprising. When you prompt a generative AI system like GPT, you’re giving it a bunch of parameters, which act as conditions and limits on its output. From those constraints, it’s trying to generate the most likely next part of the response. But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probable response is… very close to the NY Times’ original story.

In other words, by constraining GPT to effectively “recreate this article,” GPT has a very small data set to work off of, meaning that the highest likelihood outcome is going to sound remarkably like the original. If you were to create a much shorter prompt, or introduce further randomness into the process, you’d get a much more random output. But these kinds of prompts effectively tell GPT not to do anything BUT write the same article.
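To see that mechanism for yourself, here’s a minimal sketch using the small, openly downloadable GPT-2 model via Hugging Face’s transformers library. This is a stand-in, of course (OpenAI’s production models aren’t publicly downloadable), but the decoding mechanics are the same: with sampling disabled, the continuation is fully determined by the prompt, so a long verbatim prefix of a heavily duplicated text funnels the model toward a single continuation.

    # Minimal sketch of greedy next-token decoding with GPT-2 (a stand-in for
    # the much larger models at issue; the mechanism is the same).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def continue_text(prompt: str, max_new_tokens: int = 60) -> str:
        # Greedy decoding: at each step, emit the single most probable next token.
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # no randomness: output is fully determined by the prompt
        )
        return tokenizer.decode(output[0], skip_special_tokens=True)

    # A short prompt leaves many plausible continuations open...
    print(continue_text("The best kitchen scale is"))
    # ...while paragraphs of a specific, memorized text leave very few: the most
    # probable continuation of the first half of an article is its second half.

Turn do_sample back on and raise the temperature, and the same prompt wanders somewhere different each time; that’s the “further randomness” described above.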

From there, though, the lawsuit gets dumber.

It shows that you can sorta get around the NY Times’ paywall in the most inefficient and unreliable way possible by asking ChatGPT to quote the first few paragraphs in one-paragraph chunks.

[Image: excerpt from the complaint showing ChatGPT quoting an article a paragraph at a time]

Of course, quoting individual paragraphs from a news article is almost certainly fair use. And, for what it’s worth, the Times itself admits that this process doesn’t actually return the full article, but a paraphrase of it.

And the lawsuit seems to suggest that merely summarizing articles is itself infringing:

[Image: excerpt from the complaint showing ChatGPT summarizing a Times review]

That’s… all factual information summarizing the review? And the complaint shows that if you then ask for (again, paragraph-length) quotes, GPT will give you a few quotes from the article.

And, yes, the complaint literally argues that a generative AI tool can violate copyright when it “summarizes” an article.

The issue here is not so much how GPT is trained, but how the NY Times is constraining the output. That is unrelated to the question of whether the reading of these articles is fair use. The purpose of these LLMs is not to repeat the content that is scanned, but to figure out the most probable next token for a given prompt. When the Times constrains the prompts in such a way that the data set is basically one article and one article only… well… that’s what you get.

Elsewhere, the Times again complains about GPT returning factual information that is not subject to copyright law.

[Image: excerpt from the complaint showing ChatGPT relaying a Wirecutter recommendation]

But, I mean, if you were to ask anyone the same question, “What does wirecutter recommend for The Best Kitchen Scale,” they’re likely to give you a similar answer, and that’s not infringing. It’s a fact that that scale is the one that it recommends. The Times complains that people who do this prompt will avoid clicking on Wirecutter affiliate links, but… um… it has no right to that affiliate income.

I mean, I’ll admit right here that I often research products and look at Wirecutter (and other!) reviews before eventually shopping independently of that research. In other words, I will frequently buy products after reading the recommendations on Wirecutter, but without clicking on an affiliate link. Is the NY Times really trying to suggest that this violates its copyright? Because that’s crazy.

Meanwhile, it’s not clear if the NY Times is mad that it’s accurately recommending stuff or if it’s just… mad. Because later in the complaint, the NY Times says it’s bad that sometimes GPT recommends the wrong product or makes up a paragraph.

So… the complaint is both that GPT reproduces things too accurately, AND not accurately enough. Which is it?

Anyway, the larger point is that if the NY Times wins, well… the NY Times might find itself on the receiving end of some lawsuits. The NY Times is somewhat infamous in the news world for using other journalists’ work as a starting point and building off of it (frequently without any credit at all). Sometimes this results in an eventual correction, but often it does not.

If the NY Times successfully argues that reading a third party article to help its reporters “learn” about the news before reporting their own version of it is copyright infringement, it might not like how that is turned around by tons of other news organizations against the NY Times. Because I don’t see how there’s any legitimate distinction between OpenAI scanning NY Times articles and NY Times reporters scanning other articles/books/research without first licensing those works as well.

Or, say, what happens if a source for a NY Times reporter provides them with some copyright-covered work (an article, a book, a photograph, who knows what) that the NY Times does not have a license for? Can the NY Times journalist then produce an article based on that material (along with other research, though much less than OpenAI used in training GPT)?

It seems like (and this happens all too often in the news industry) the NY Times is arguing that it’s okay for its journalists to do this kind of thing because it’s in the business of producing Important Journalism™ whereas anyone else doing the same thing is some damn interloper.

We see this in other copyright disputes involving the media industry, like the ridiculous fight over the hot news doctrine, in which news orgs claimed that they should be the only ones allowed to report on something for a while.

Similarly, I’ll note that even if the NY Times gets some money out of this, don’t expect the actual reporters to see any of it. Remember, this is the same NY Times that once tried to stiff freelance reporters by relicensing their articles to electronic databases without paying them. The Supreme Court didn’t like that. If the NY Times establishes that merely training AI on old articles is a licensable, copyright-impacting event, will it go back and pay those reporters a piece of whatever change it gets? Or nah?

Companies: common crawl, microsoft, ny times, openai


Comments on “The NY Times Lawsuit Against OpenAI Would Open Up The NY Times To All Sorts Of Lawsuits Should It Win”



This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

Did you read the article?

But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probable response is… very close to the NY Times’ original story.

That is to say, the NY Times spoon-fed the AI the output they wanted.

Anon says:

Re: Clearly Stolen?

Stolen? How? I seriously doubt the top tech companies do not have a paid account at the NY Times – which would imply they are getting the news content they paid for. I presume, too, that they would be getting the same full high-volume access as any press-clipping organization does. By the time the AI is regurgitating it, it is history… and to do the “tell me the third paragraph” trick, you already need to know the article exists and what it is about.

(Back before the internet, our company’s PR dept. would clip articles relevant to the company out of the major local and national newspapers and magazines, then fax around a multi-page summary of relevant news to most mid-level bosses. From one paid subscription to each newspaper, something pretty much every company did. One print copy, 500 faxes to assorted intra-office readers.)

What I glean from this is that one job AI can do is Administrative AIssistant: “read the newspaper, find me the relevant articles and summarize them for me.” A service that would be personally tailored to the individual.

This comment has been deemed insightful by the community.
This comment has been deemed funny by the community.
Professor Ronny says:

Reading and Copyright

It’s a false belief that reading something (whether by human or machine) somehow implicates copyright.

I’m a college professor teaching business courses. Hopefully, students learn stuff in my courses. If they go on and use that knowledge to make money, is this infringement? Can I sue?

/Sarcasm

Arijirija says:

Re:

IIRC, there was a comedy by one of the Greek playwrights or their Roman copiers, about a man who went to a famed jurist and arranged to learn the law from him, on the wager that if he didn’t learn anything, he didn’t have to pay for the course; if he did learn, he would pay. The student ended the course and refused to pay, on the gamble that if he lost the ensuing lawsuit, he had clearly not learnt the law to any great degree, and thus didn’t have to pay, whereas if he won the lawsuit, he would not have to pay anyway. The judge did not agree.

Professor Ronny says:

Using Material is not a Copyright Violation

It’s a false belief that reading something (whether by human or machine) somehow implicates copyright.

I’m a college professor teaching business courses. In addition to lectures, I provide my students with video tutorials and PowerPoint slides. Hopefully, students learn stuff in my courses. If they go on and use that knowledge to make money, is this infringement? Can I sue?

/Sarcasm

Tom says:

Sexual Healing

I’ve listened to a lot of Marvin Gaye’s work. I can perform a note perfect cover of one of his better known songs. Given some prompting, I can probably write a song that has a similar feel or that incorporates similar elements.

It’s fucked that all of those things are, if not explicitly illegal in all cases, certainly off-limits without licensing and permission. But, what legal basis makes it ok for AI to do the same on demand? If these AI cases come out in favor of the AI companies, does that make it safe for songwriters to have inspiration again? Can Ed Sheeran make another album without fear? Will it take laundering song ideas through ChatGPT (with attached logs) to show there was no direct copying?

yokem55 (profile) says:

Re: Music has a lot of other protections

The copyright situation around music is a lot more restrictive because there is a lot more actual law around music, specifically around songwriters having different rights than performers and the recordings of their performances. The result is that the window of fair use and non-infringing creation is a lot more constrained.

Anonymous Coward says:

I am surprised that you can equate a reporter reading an article for research with an LLM consuming all the media ever created.
Your take seems similar to the “corporations are people too” SC decision. If we decide that LLM/AI gets the same fair usage rights as humans then, for me, that’s only going to be good for the LLM companies.

I don’t think all these articles of yours pushing LLMs as equivalent to humans doing research are going to age well.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re:

You can learn from a bunch of copyrighted works, but if I gave you the prompt of the first 6 paragraphs of your history textbook’s Civil War chapter, you wouldn’t give me the rest of the chapter almost verbatim. You would at least paraphrase. If you did not, then you would be infringing too.

Strawb (profile) says:

Re:

If we decide that LLM/AI gets the same fair usage rights as humans then, for me, that’s only going to be good for the LLM companies.

Would you rather see the companies sued out of existence before getting off the ground, or unable to afford training licenses, making the LLMs useless?

Because that’s effectively what would happen if they were to be treated like the NYT wants them to be treated.

Anonymous Coward says:

If the generative AI is willing to create nearly 1:1 text (even if it needs very specific prompting) over extensive blocks of text, I suspect that there is a reasonable chance that a judge/jury would find that particular aspect liable for copyright infringement even if the rest gets tossed out. “That is just how our model works” may not end up being convincing in court when you have such exhibits being presented.

This comment has been flagged by the community.

Anonymous Coward says:

Re:

At the very least it is close enough that, had a different publication published the output from the GPT-4 version, said publication could have been found liable for copyright infringement. So that particular question then boils down to whether Section 230 immunity applies, on which there doesn’t seem to be a clear consensus yet.

Anonymous Coward says:

Re: Re: Re:

If the complaint is accurate, ChatGPT can reproduce content to such a degree that it would be infringing on the copyright of the NYT, and if so, the question is then whether ChatGPT is liable (as far as copyright infringement goes) when a user-generated prompt generates potentially infringing content, which would be a Section 230 question.

This comment has been deemed insightful by the community.
yokem55 (profile) says:

Re:

Merely because some regurgitation of content can be achieved by overly narrow prompting does not make the training itself infringing on its own.

Could OpenAI be guilty of making that overly narrow prompting too easy? Maybe. But that’s the narrow case, not the broader argument that “all training is infringement”.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

Your reasoning is that if the copyright holder produces a copy of their own work, the maker of the tool they used to do it has committed copyright infringement. If you read the article you will see that the NY Times went out of their way to create a copy of their own work.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re:

If I give an LLM the prompt of the first 7-8 paragraphs of an article, and it responds with the rest of the article almost verbatim, that’s a pretty good indication that the LLM has internally stored a copy of the article.

Which means it’s already copied the article, even before any prompt. Unless this is fair use, it’s infringement. And if we’re talking about a commercial LLM which copies entire articles from behind a paywall and can regurgitate them to users, the fair use factors aren’t looking so hot.

Anonymous Coward says:

Re: Re: Re:2

From a prompt of:

WASHINGTON — American intelligence officials have concluded that a Russian military intelligence unit secretly offered bounties to Taliban-linked militants for killing coalition

it produced 186 of the next 187 words, omitting 1.

From a prompt of:

Until recently, Hoan Ton-That’s greatest hits included

it generated 370 of the next 375 words, omitting 5 consecutive words.

From a prompt of:

If the United States had begun imposing social

it generated the next 284 words and added 10 consecutive words of its own.

It is not plausible that it is doing that by generating this on the fly without what amounts to a copy. You look ridiculous trying to claim otherwise.

This comment has been deemed insightful by the community.
Mamba (profile) says:

Re: Re: Re:3

Again, that’s not how they work. And also, it’s clear you didn’t read the article.

What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.

This is exactly the phenomenon I was talking about: they asked ChatGPT to read something, then tell them what came after their prompt sentence. They loaded it with their information and then asked it to reproduce it.

It’s like putting a book face down on the bed of a scanner, then being surprised when a black-and-white copy comes out when you hit the start button.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re: Re:4

And also, it’s clear you didn’t read the article.

It’s clear you only read the article and not the source behind it.

What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.

In the example Mike cherry-picked, they gave it the first seven and a half paragraphs. In many other examples, like the ones I quoted above, they gave far less; some less than a full sentence.

What is your source for the claim that the Times provided the URL to ChatGPT? Because exhibit J does not make the claim that the Times provided the URL to ChatGPT. I realize this article makes the claim, but I think Mike misread the explanation on page 1. When it says “we provide the following” it means they are providing it in the exhibit, not that they provided it to ChatGPT.

Mamba (profile) says:

Re: Re: Re:7

Every time they mention “ChatGPT with the Browse with Bing plugin” (and those mentions are littered throughout the document), that’s what they are doing. They’ve explicitly asked it to find the article, then stuff that into the model. Then they ask it to provide the text. The final paragraph of page 26 is where they start discussing it as part of their process. The screenshots also show they have set the model for Web Browsing (see page 49 for an example).

They are working very hard to obscure the fact that they are doing this. But it’s right there. They asked ChatGPT to look at the NYT, then provide text from it. It’s not a fucking shock they got text from it.

Anonymous Coward says:

Re: Re: Re:8

I think your mistake is assuming that they did the exact same thing everywhere; that because they used Browse with Bing in some places they must have used it here.

The bottom of page 29 to the top of page 32 are the “Embodiment of Unauthorized Reproductions and Derivatives of Times Works in GPT Models” allegation. This is where they reference Exhibit J. They claim Exhibit J used the GPT-4 LLM. They do not claim to use Browse with Bing here.

If you look at, say, the section starting at the bottom of page 32, you’ll see references to Browse with Bing, but that’s a different allegation. Page 49 does indeed show them using Browse with Bing (as do many other pages), but again that’s a different allegation, and it’s not an example from Exhibit J. (I think your reference to the bottom of page 26 must have been a typo; that paragraph is discussing Common Crawl in GPT-3.)

If you’re claiming the Times is essentially lying about what they used, well, I don’t have any way of telling whether they are or not, but I’m not sure how you’re reaching that conclusion. But I don’t think you can say that they used Browse with Bing in Exhibit J just because Browse with Bing was referenced in other sections of the complaint.

In any case, they aren’t providing it the URL even in the Browse with Bing examples. It looks up the URL itself based on the provided article title. That’s not quite the same thing, although in those cases there’s no evidence that the LLM is storing the content.

Strawb (profile) says:

Re: Re: Re:3

It is not plausible that it is doing that by generating this on the fly without what amounts to a copy.

Not plausible to you, you mean.

LLMs don’t store the datasets they were trained on, end of story. They can’t produce a copy; they’re recreating it based on, in this case, very narrow parameters, similar to how someone recreates an article, a picture, or a song from memory. That’s not infringement.

This comment has been flagged by the community.

Benjamin Jay Barber says:

Re: Re: Re:4

they don’t store data, they train on data to create a several thousand dimensional hyperspace segmented into concept boundaries, and then the context window represents a 1 dimensional trajectory vector through the hyperspace, and the model predicts what the next token will be, extrapolating that trajectory.

Anonymous Coward says:

Re: Re: Re:4

I am free to memorize all the songs I want. But if I go on stage and sing them, that’s infringement. The argument that I’m “recreating” the song from what I remember and maybe flub a word and sing two notes wrong doesn’t fly. And it doesn’t matter that someone prompted me with the name of the song and the intro.

When ChatGPT can generate over 350 words of an article verbatim based on a prompt which contains the first 150 words but doesn’t include the contents it’s generating, then what it has is functionally a copy. Yes, the prompt is narrow; it’s essentially “give me the next paragraph of this article”. But the fact is that it can give the next paragraph of the article. It has it to give. I don’t care much about how the black box does it; all that matters is that it does in fact do it. Just like how if I encode a song using MIDI I’m still essentially copying the song, even though no sounds are directly copied and all sounds are generated when the file is played and often it sounds a little different.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re: Re: Re:

Uh, no. They gave it the link to the article, then coaxed it to more or less read the article.

Did you think LLMs have a copy of the entire internet stored per instance? Because damn, that would bloody well be handy. Archival backup of the world! Extreme data compression! If they are really this good, I welcome the AI apocalypse.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re: Re:2

I don’t think they have a copy of the entire Internet. I do think they have a copy of these specific articles, and I base that on the fact that they spit those articles back.

I realize the example TechDirt gave was one where they fed it the first 7.5 paragraphs, but in many of them the prompt was much shorter. From a prompt of just

President Biden wants to forge an “alliance of democracies.” China wants to make clear

it spit out the next 196 words exactly verbatim. You can’t explain that as merely imitating a style. It has the article copied.

This comment has been deemed insightful by the community.
Mamba (profile) says:

Re: Re: Re:3

You think ChatGPT has an entire copy of Common Crawl tucked away inside its executable? That’s hundreds (if not thousands) of petabytes compressed, and growing at hundreds of terabytes a month now. This kind of fundamental error of understanding continues to be the terminal flaw of those claiming “obvious copyright violations”. Further, they refuse to solve their ignorance.

LLMs can’t physically work the way you are claiming. If it were true, there would only be a handful of ‘victims’ as that’s all that could fit in the footprint of ChatGPT and it would take hours to return responses.

Anonymous Coward says:

Re: Re: Re:4

I don’t think it has a copy of the entire Internet, or the entire Common Crawl, or even the smaller WebText2. But copies of a whole bunch of individual articles? Even if it doesn’t, there’s no technical reason why it couldn’t. Two hundred thousand articles with an average of five thousand characters each would be just one gigabyte, and using lossy compression would let it be way less. It’s a freaking supercomputer; that much is nothing to it.
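Back-of-envelope, with those same hypothetical numbers:

    # Back-of-envelope with the hypothetical figures above (not numbers from
    # the complaint): raw text is cheap to store.
    articles = 200_000
    avg_chars = 5_000                  # roughly one byte per character of plain text
    print(articles * avg_chars / 1e9)  # -> 1.0 GB, before any compression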

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re: Re:3

I do think they have a copy of these specific articles,

Wrong. The instance of the AI they were using had access to the articles because the NY Times gave it the URLs to them. They then proceeded to prompt the AI to regurgitate the articles; therefore the copy was the direct result of the NY Times’ actions, and not of any copy embedded in the AI’s model. That is no more infringement by the AI creators than using Photoshop to duplicate an image is infringement by Adobe.

This comment has been flagged by the community.

Crashoverride says:

So my thought is: it’s amazing how much we learn from the things around us, so how do you teach an AI all of that? I mean, how would AI learn what a phone booth is, or 8-track tapes, or who Michael Jackson was, or how “waaasuuup” was a famous meme and advertisement? Or what the many different voices sounded like that Mel Blanc performed? Or the difference between a Michael Bay movie and a Woody Allen movie? It has to learn from scanning, listening, querying, and reading what was written, watching what was aired, etc…

It’s not copyright infringement to learn and be knowledgeable about all of the many “things.” There shouldn’t be a fee imposed every time something or someone attempts to learn about what’s around us: news, commentary, famous figures, artworks, etc…

This comment has been flagged by the community.

Benjamin Jay Barber says:

Re:

No. That would be like trying to hold Microsoft Office or Adobe Photoshop liable for third-party infringement. OpenAI did not do the prompting; the user did.

Funny enough, if the NYT paid someone to enter these prompts, that person accepted the EULA, which states, in an indemnification clause, that they are liable for the damages from the NYT lawsuit.

TKnarr (profile) says:

One thing though: the basis for the suit isn’t so much that the model was trained on NYT articles, as that the model reproduces NYT articles verbatim when asked for them. Regardless of anything about training, that right there is copyright infringement: distributing a verbatim copy of the content to someone else without permission of or a license from the content’s creator/copyright-holder.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

as that the model reproduces NYT articles verbatim when asked for them.

Only after the NY Times fed large parts of the same articles to the AI as input data. That is, the AI copied what the NY Times gave it, having been carefully prompted to reproduce the articles. What they did not do is provide a generic input and have the AI produce a copy of their article directly from its internal model; rather, they carefully provided data and prompts to get the output they desired. That is, the output comes from the input they gave it, and not from some copy of NY Times articles in the AI’s training set or model.

What they did is close to using Photoshop to create a copy of somebody’s work, and then claiming that Photoshop’s developers have committed copyright infringement.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re:

This is a misleading summary. The NY Times pasted one half of their article into the AI, and the AI reproduced (lossily) the other half.

If you want an analogy with Photoshop, it’s like somebody took a cropped version of the Mona Lisa and used Photoshop to fill in the other half, and Photoshop did so by nearly exactly reproducing the other half of the original painting, based on copying (in a convoluted way) from photographs of the Mona Lisa.

The fact that there’s only one way an intelligent, capable artist who has studied the Mona Lisa would fill in the rest of the painting does not count for anything as far as copyright law is concerned. If a tool can reproduce half of a copyrighted work on demand, I don’t see how it matters that it only does so when fed the other half as input; the half that it reproduces is still subject to copyright.

tati says:

yeah im not buying that natural-language constraints are so effective and the style-copying so accurate that the LLM happened to generate paragraphs that are verbatim from the article because “they were the most probable”.

we know for a fact from studies like “Extracting Training Data from Diffusion Models” (Carlini et al. 2023) that you can, well, extract training data from some models.

im NOT pro-intellectual property, but the blatant LLM bias in these weekly AI articles feels disingenuous.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

we know for a fact from studies like “Extracting Training Data from Diffusion Models” (Carlini et al. 2023) that you can, well, extract training data from some models.

They didn’t extract the training data; they extracted an approximation of the data:
When working with high-resolution images, verbatim definitions of memorization are not suitable. Instead, we define a notion of approximate memorization based on image similarity

And to accomplish that, they essentially brute-forced their way to prompts that generated an approximation of the original images, all of which were specifically selected because they had the most duplications in the training data; see Fig. 5 in the paper:
Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times; although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images.

As always, if you ask your question in such a way that there are only limited ways to answer it, you will get some of the answers you wanted.

This comment has been flagged by the community.

K R says:

It is plagiarism - pure and simple

This article is so flawed in its arguments and so one-sided. There is hardly any balancing of arguments – exactly what a good journalist would be expected to do, whether at NYT or Tech Dirt.

The issue is similar to someone who scraped public review data from Amazon and then asked the system to build a review summary for a product. Lo and behold, that review summary is likely going to use data that it scraped from Amazon. But under what conditions did Amazon agree to such use of its review data? Just because something is readable or available doesn’t mean that it can be regurgitated somewhere else where that information is no longer under the control of the original source. Said differently, are Tech Dirt and all tech blogs now not going to charge product manufacturers for using quotes from these sites in the marketing of products? I’m certain that Tech Dirt and other tech websites will want to continue to be able to charge manufacturers for the use of such quotes, because the content cannot simply be regurgitated without proper dues or permission.

Anonymous Coward says:

Re:

You know… aside from the many times that Masnick has, in fact, told people that if they genuinely wanted, they could copy and reproduce Techdirt articles wholesale and put them up elsewhere.

Because pro-copyright people genuinely think that threatening to “pirate” Techdirt’s content, one-to-one, just to make a point about online revenue, is something that will somehow browbeat Masnick into compliance. Except that nobody’s ever followed through on that threat, because it’s a shit idea that would never be profitable.

Also trying to use Amazon reviews as evidence is incredibly laughable, considering that so many of those these days are so blatantly bought and untrustworthy, they’ve got as much credibility as a wet fart.

This comment has been deemed insightful by the community.
Rocky says:

Re:

The issue is similar to someone who scraped public reviews data from Amazon. Then asked the system to build a review summary for a product. Lo and behold that review summary is likely going to use data that it scraped from Amazon.

It isn’t remotely similar, but I guess you don’t actually understand what an LLM actually is and how it functions.

But under what conditions did Amazon agree to such use of its reviews data? Just because something is readable or available, doesn’t mean that it can be regurgitated somewhere else where that information is no longer under the control of the original source.

Underlying data and information isn’t copyrightable, you know. If someone doesn’t want to expose this data, don’t expose it.

Said differently, is Tech Dirt and all tech blogs now not going to charge product manufacturers for using quotes from these sites in the marketing of products?

I think that you don’t understand the limitations of copyright law. Sure, there are litigious assholes who will try to sue people who quote them even though quoting something is fair use – because without it even NY Times would have problems publishing an article.

This comment has been flagged by the community.

Benjamin Jay Barber says:

Re:

If I remember correctly and am not hallucinating, the entire transformer architecture history started when OpenAI started scraping Amazon reviews and was able to get reasonable-looking text out of it, and was able to manipulate a single neuron (the “sentiment neuron”) to shift the outputs from positive reviews to negative reviews.

David says:

The issue here is not so much how GPT is trained, but how the NY Times is constraining the output. That is unrelated to the question of whether the reading of these articles is fair use. The purpose of these LLMs is not to repeat the content that is scanned, but to figure out the most probable next token for a given prompt. When the Times constrains the prompts in such a way that the data set is basically one article and one article only… well… that’s what you get.

Okay, but how does that matter to the question of whether it’s reproducing copyrighted material without permission? If a straightforward prompt makes ChatGPT reliably spit out full articles, that seems worse for liability than if it only does so at random.

This comment has been deemed insightful by the community.
Anonymous Coward says:

Re:

If a straightforward prompt makes ChatGPT reliably spit out full articles, that seems worse for liability than if it only does so at random.

Let me ask you something but you can only answer with a yes or no: Have you stopped killing puppies yet?

The above is an example of how you constrain the output to get an answer you want.

David LaRoss says:

Re: Re:

The flaw in this logic is that I don’t have to obey the constraints of your question. I can refuse to answer with a “yes” or “no” and just tell you to jump in a lake. ChatGPT only follows the constraints of its prompt because OpenAI programmed it that way. It’s not a law of the universe that when you feed the headline and introduction of a news article into the software it responds with the rest of the text. It works like that because the designers trained their fancy autocomplete on a bunch of copyrighted material they don’t have permission to republish, and chose to allow it to reproduce portions of the training set verbatim. That was a bad choice!

This comment has been flagged by the community.

David says:

Re: Re: Re:2

If Microsoft introduced a new tool in Word that can reproduce the full text of previously published, copyrighted works from its own database with minimal prompting from the user, with no permission from the original authors, that function would infringe copyright! Users who hit the “piracy” button would also be liable in this hypothetical, in the same way that the person who downloads a pirated movie and the person who uploads it have both committed a violation. I don’t see how this is a difficult concept.

This comment has been flagged by the community.

Anonymous Coward says:

Re: Re: Re:3

Read the actual story. The NY Times fed the AI large chunks of the articles it wanted reproduced, and carried out a lot of prompting; therefore it is disingenuous to claim that the story came from a copy of it held in the AI’s training set. If OpenAI can get a full accounting of how the output was produced, they might have a case that the claims against them are fraudulent.

Ronald Davenport says:

NY Times lawsuit against OpenAI

Isn’t the real issue monetization of the content? If all OpenAI did was “read” and “summarize” or “edit” the content, then there wouldn’t be a problem. The problem occurs once OpenAI “publishes” the summary/edit by making it available to the public and then monetizes the summary/edit with ads.

This comment has been flagged by the community.

Dister (profile) says:

Interesting article and I always like Mike’s take. But I think a couple of important factors are glossed over here.

First and foremost, copyright is the right to not have others copy your work (to oversimplify). This applies not just to literal copying of physical texts, but also to copying data (software, music files, and, yes, written works). The NYT here is not simply saying “this is a mechanism to get around our paywall”; the NYT in its complaint is saying that the output of a significant portion of an article is a reproduction of copyrighted work. Again, copyright protects against the copying of works, and yet the NYT shows that ChatGPT can and will copy NYT works by outputting near-verbatim portions of their articles. Regardless of how you trigger that reproduction, it is nevertheless a reproduction of NYT works (at least that is NYT’s theory). Under that theory, the prompt is immaterial. As far as I know, copyright law does not include any conditions on how reproduction is triggered, so the trigger is irrelevant to the analysis. Moreover, even before we get to the outputting to the user of a portion of an NYT article, the NYT is saying that OpenAI makes copies of their articles to build the training dataset. Again, this is copying through and through. It does not matter that it is in a back-end database or that it is taken from Common Crawl (which may be fair use itself, but I doubt that fair use transfers to an ultimate beneficiary; for example, I cannot take a TechDirt article from the Internet Archive and publish it on my own webpage as my own). So there are two alleged instances of reproduction here, and reproduction is a legal right reserved only to the owner of the work and their licensees. Thus, all this discussion about prompting and “reading” is, again, irrelevant, because copyright pertains to the copying and reproduction, not to the methods of reproduction nor the purpose that the reproduction serves (except, as I will discuss next, in limited exceptions where it is deemed fair use).

This brings me to point two – fair use. This is a trickier subject here, but fair use typically applies to: commentary, search engines, criticism, parody, news reporting, research, and scholarship. I am not sure any of those apply to building a database of training data or to reproducing portions of articles to users. Nevertheless, the factors for determining fair use are: the purpose and character of the use; the nature of the copyrighted work; the amount and substantiality of the portion used; and the effect of the use upon the potential market for or value of the copyrighted work. I will not analyze each of these here, but will just point out that this is why the NYT goes into how valuable the NYT is for training the LLM and, in turn, how valuable the LLM is when trained on NYT works. It is also why they go to such lengths to show that significant portions of the articles can be reproduced, and that their paywall can be circumvented by cleverly prompting the model. I have no idea how a court would come down on this, but it is more than “the NYT doesn’t understand LLMs.” In fact, I completely expect that people will use ChatGPT to try to read articles from the NYT and other paywalled sources without paying, people do that stuff all the time and will use whatever tools are available.

We may not agree with the potential effects of this lawsuit, but there is more here than “the NYT is greedy” (though that may be true as well).

This comment has been deemed insightful by the community.
Mamba (profile) says:

Re:

Well, point one is actually two points, but I’ll bite.

1A) Copyright law most certainly considers the conditions of the duplication when determining whether a violation occurred. For one, there’s the exception allowing non-profit educational institutions to display copyrighted works in classrooms. A teacher pressing ‘copy’ for the classroom will get a substantially different outcome than I would in a court case.

1B) Incidental copying, when pursuing fair use, has long been held to be non-infringing. Otherwise proxies, thumbnails, caches, etc. would all be in trouble.

2) Here’s my hot take: Training LLMs on copyrighted works doesn’t even rise to the level of fair use. Meaning, it’s just use of the material and a discussion of exemptions for fair use will be a challenge because it’s not relevant.

Now, there could be some discussion on how the works were sourced, such as pirating vs. just acquiring a copy through a library or e-book store. Or the discussion of ‘getting around a paywall’ by just using a different part of the website without the pay wall.

Of all the arguments the NYT makes, it

Dister (profile) says:

Re: Re:

Just to quickly align our frameworks, in order to have “fair use” of a copyrighted work, you must first perform an otherwise impermissible copying. Fair use is an exception to the rule, copyright is the rule. So to help clarify things, I am going to call an act of copying a work a “candidate infringement”, and once a candidate infringement is discovered, it must be determined whether fair use applies to determine whether the candidate infringement is indeed infringement or not.

I say this because my first point is all about discovering that candidate infringement. You are definitely right, the conditions of copying a work are important. But the example you describe is a fair use question (educational purposes is a recognized fair use exception to copyright). But my first point was more saying that the technology you use to copy something is not important to the analysis. If I published and sold a book of someone else’s poems without permission, that would be copyright infringement regardless of whether I photocopied each poem, scanned each poem, dictated, transcribed, or reproduced from memory. Similarly with your 1B, it is a fair use argument and does not go to the question of whether there is even a candidate infringement to which the fair use exception needs to be applied.

Nevertheless, you might be right. I am not sure if this could be considered fair use or not. On the one hand, OpenAI is making copies of works without permission in order to enrich the value of their commercial activities, which does not seem like it would weigh in their favor. But on the other hand, like you say, the copying they are doing is not really to reproduce the work for consumption by an end user. But I think that is the conversation that needs to be had and “the LLM just reads it” is neither technically nor legally accurate I don’t think.

Finally, I don’t think it matters whether OpenAI “pirated” the articles or acquired them from a legitimate source. Indeed, all of the sources listed in the complaint appear “legitimate” in that the NYT is not arguing that those services themselves committed any copyright infringement. And that makes sense because, back to the book of poems example, it shouldn’t matter whether I got the poems from the Pirate Bay or from a local library of which I am a member; copying and selling another’s work is not permissible in either case. Same with whether people can “get around the paywall” in other ways. Just because there are other ways to access the work does not suddenly make copyright infringement ok. Just because someone can get those poems from the library or from the Pirate Bay on their own doesn’t make it ok for me to infringe those copyrights.

At the end of the day, the copyright act explicitly forbids making copies of work (including an article). So it seems to me that the threshold question of whether OpenAI’s activities are candidates for copyright infringement is pretty clearly settled. We have at least two instances of making a copy without authorization of the author. So the discussion really comes down to, in the LLM training instance, whether it is fair use like for a search engine, and in the prompted reproduction instance, whether that is fair use or even OpenAI’s responsibility since they are not the ones doing that prompting (i.e., who is the actual copier in this instance, OpenAI or the prompter). There are policy arguments that can go either way, but how LLMs feature in copyright infringement and what that means for the copying itself, seems like a pretty new question.

Anonymous Coward says:

Re: Re: Re:

On the one hand, OpenAI is making copies of works without permission in order to enrich the value of their commercial activities, which does not seem like it would weigh in their favor.

Wrong, it is applying analysis to presumably legally acquired copies. It is doing sophisticated word-use statistics to determine how often a given word follows some other word or short sequence of words. If what it is doing is infringement, then a critic’s reading of a book, watching of a film, or listening to music for the purposes of writing a critique would also be infringement.

Dister (profile) says:

Re: Re: Re:2

They do indeed apply a number of analyses and transformations, but they are not doing that on someone else’s servers. You need look no further than WebText and WebText2 (feel free to google), an OpenAI-produced dataset of text scraped from URL links identified on Reddit. You can even download this dataset, which includes the text of the webpages of those URLs (https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/ even states that WebText2 includes “the text of web pages from all outbound Reddit links from posts with 3+ upvotes”). This is a literal copy regardless of what they do to process it afterwards.

Anonymous Coward says:

Re: Re: Re:3

With AI, the data set is not the model that is made available to users; it is the data that is analyzed to build the model. The article then gives the model size, and goes on to list the data sources used to train the model. It mentions that some models memorize what they were trained on, but then so do humans, and in the context of AI, that is memorization and not a straight copy.

Dister (profile) says:

Re: Re: Re:4

I understand that the model is not the dataset, but to train the model you nevertheless need the dataset, which means creating your own copy of the dataset on your own database (typically). You literally need the text of that article imported into your own system, which is a copying of another’s work, and thus potentially an infringement of copyright. I am not saying that the act of training the model or the model itself are copying. The copying of the dataset is the copying, which is then used for training. And we know OpenAI does this because they published a paper. See Section 2.2 of https://arxiv.org/pdf/2005.14165.pdf, where they say things like: “(1) we downloaded and filtered a version of CommonCrawl…” and “…including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time…”. “Downloading” and “scraping” are instances of copying content from another source, such as, it seems, NYT articles.

Again, how fair use ends up applying to this, I am not sure, but it seems to me that taking the text of articles from websites and saving them for training an ML (or any other purpose) fits within the language of the copyright act that forbids anyone but the owner “to reproduce the copyrighted work in copies”.

Anonymous Coward says:

Re: Re: Re:5

which means creating your own copy of the dataset on your own database

Which every student and professional does while studying and exercising their specialty expertise. Also, infringement is making copies available to other people, which is why people using torrents are the ones sued for infringement, but people getting copies from, say, YouTube are not.

Dister (profile) says:

Re: Re: Re:6

Downloading content for studying and professional development falls within fair use. That doesn’t make downloading for any personal use in the US ok; it makes downloading for educational uses ok. The point I keep trying to make is that there is a difference between the rule and the exception. Fair use is the exception. Don’t treat it as the rule. Rather, fair use defines a small realm of situations that are excepted from the rule, so you cannot extend it to all situations. For example, being allowed to copy copyrighted content for education purposes does not mean copying to train a for-sale AI service is obviously ok.

Indeed, the LLM is not a person, in law or in fact. It is not helpful to keep equating the computer to a human. They are not the same. Moreover, the LLM is a product provided to users in exchange for value, and is thus a commercial use of the content in the training set. This potentially (though I am not sure if it actually would) removes this type of use from the fair use exception, because use for commercial activities typically weighs pretty heavily against fair use.

Regarding torrents or downloading YouTube videos: you absolutely can get sued because it absolutely is copyright infringement. You probably won’t though cause it’s not worth the effort for anyone to start going after individual users. It would cost a lot to find out who the people doing the downloading are and a lawsuit would cost more than they could collect in damages. Instead, they go after the makers of the tools (e.g., Napster) as a contributory infringer to both go after the root of the problem and go after the people with the money. But make no mistake, downloading copyrighted material without permission is copyright infringement (unless it falls within fair use, which again, is a particular exception that does not cover all personal uses of copyrighted works).

Firetower says:

The author in my opinion misrepresents the stance of the NY Times here.

It’s a false belief that reading something (whether by human or machine) somehow implicates copyright.

The Times’ issue isn’t just that someone or something is reading materials. The Times takes issue with a group intentionally collecting, en masse, large amounts of their data (in this case, articles) with the intention of distributing them, packed into a product, to third parties engaging in commercial activities without paying a licensing fee. The Times fears that this damages the potential market for its future and past articles.

In essence, the Times fears that Common Crawl is acting as a fence for other groups to infringe on its copyrighted works.

Factors of Fair Use:

  1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
  2. The nature of the copyrighted work.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
  4. The effect of the use upon the potential market for or value of the copyrighted work.
bhull242 (profile) says:

Re:

The Times takes issue with a group intentionally collecting, en masse, large amounts of their data (in this case, articles) with the intention of distributing them, packed into a product, to third parties engaging in commercial activities without paying a licensing fee.

Except that ChatGPT isn’t doing that. It’s accessing the articles from the NYT’s own website using a given URL and/or Bing.

In essence, the Times fears that Common Crawl is acting as a fence for other groups to infringe on its copyrighted works.

ChatGPT retains no copies of any materials contained in Common Crawl. That’s not how LLMs work. The Times’s fears are entirely irrational.

David says:

An LLM doesn’t have a specific set of data that explicitly maps to an original.

Except it clearly does have that data, because it can reproduce original training material when prompted with only identifiers for that material rather than its substantive content. Claiming otherwise goes into “who are you going to believe, me or your lying eyes?” territory.

Anonymous Coward says:

Re:

Except it clearly does have that data, because it can reproduce original training material when prompted with only identifiers for that material rather than its substantive content.

It can produce output very similar to its training material, which is like asking you to create an image of a school bus, and then claiming copyright infringement because there are very similar photos and images of school buses on the Internet.

Also, with the NY Times, they prompted the AI with copies of their own content, and then complained that its output was almost identical. They did not show that the AI could produce their content if it was not prompted by their content, but rather simply asked a more generic question.

David says:

Re: Re:

It can produce output very similar to its training material, which is like asking you to create an image of a school bus, and then claiming copyright infringement because there are very similar photos and images of school buses on the Internet.

It’s not like that at all! They didn’t ask for something generic and then compare the output to the universe of existing content in that genre, they asked for a copy of a specific work and got it, nearly word for word. A “school bus” version would require prompting the bot with either the title of a specific photo of a bus, or the first 10-15% of the bitmap, and getting back a complete image that’s nearly (though not entirely) pixel perfect, including key artistic elements that are unique to the photo being requested but not implied by the title or the initial sample the user provided as a prompt.

Mark Cuban (user link) says:

What features are specific to an LLM

The underlying question is what features are specific to an LLM vs. what features (like URL retrieval) OpenAI added to extend the usability of the product, and whether that feature violates copyright law.

LLMs don’t store the articles. They do store the equivalent of a map, from which they can recreate the “route” that displays their best estimation of an article as the best response to a general user query.

Diogenes (profile) says:

hearsay evidence

I think the NY Times giving the court transcripts of results that were exact copies won’t cut it in court. How does the court know whether that text was the actual result or whether it was doctored by the NYT? The evidence is basically hearsay: “I prompted this and this is what it said.” The NYT will need to query the LLM in court and show in real time that it can produce infringing text.
