NY Times Considering A Potentially Very Dumb Lawsuit Against OpenAI Because It Learned From NY Times Content
from the that’s-not-how-any-of-this-works dept
A few weeks ago, the NY Times published a very nice profile piece about me, which starts off with the story of how I recently got pulled into a group chat with a bunch of Hollywood writers, directors, and actors, who were trying to understand how to deal with the rise of generative AI tools. The article recounted how my basic message was that most of the legal routes they were considering weren’t likely to be all that effective (especially the idea that copyright will save them), and that they should instead be looking for ways to embrace the AI and do more with it themselves.
It would appear that the NY Times itself is going in the other direction. According to Bobby Allyn at NPR, the NY Times is considering legal action against OpenAI, claiming that training its models on NY Times content violated the NY Times’ copyright.
Lawyers for the newspaper are exploring whether to sue OpenAI to protect the intellectual property rights associated with its reporting, according to two people with direct knowledge of the discussions.
For weeks, The Times and the maker of ChatGPT have been locked in tense negotiations over reaching a licensing deal in which OpenAI would pay The Times for incorporating its stories in the tech company’s AI tools, but the discussions have become so contentious that the paper is now considering legal action.
This seems like complete nonsense. We’ve already highlighted how the batch of existing lawsuits in which copyright holders try to sue LLMs for training off their data are likely to fail. But this lawsuit in particular sounds wildly stupid:
A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper’s staff.
Lol, wut? I mean, the NY Times is considered the top newspaper in the whole damn world, despite tons of competitors, and now it’s scared of a bot that is famous for mid-level prose and making shit up? None of that makes sense.
If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from The Times, the need to visit the publisher’s website is greatly diminished, said one person involved in the talks.
Again, that makes no sense. There are plenty of services out there that already summarize NYT articles and that doesn’t violate copyright, because summarizing reporting is clearly fair use. There’s no real “hot news” doctrine any more.
And, more to the point, if the NY Times is really that scared of ChatGPT, then it seems the NYT’s lawyers and execs don’t think too highly of all those reporters it has on staff.
Elsewhere, the Verge reports that the NY Times changed its terms to “ban” AI tools from training on its articles:
… the NYT updated its Terms of Service on August 3rd to prohibit its content — inclusive of text, photographs, images, audio/video clips, “look and feel,” metadata, or compilations — from being used in the development of “any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.”
Though, it really sounds like this is more of the NY Times trying to set a trap for OpenAI so it has something to sue over, because the Verge also notes the following:
Despite introducing the new rules to its policy, the publication doesn’t appear to have made any changes to its robots.txt — the file that informs search engine crawlers which URLs can be accessed.
OpenAI respects robots.txt. If you truly don’t want your content scanned, you put a notation in robots.txt, which takes about 10 seconds tops. If, however, you want to lay a trap so that you can sue OpenAI, then you quietly change your terms of service, but do nothing to mitigate the “problem” of OpenAI scraping, even though you have all the power in your hands.
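For anyone wondering what that notation looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. A few assumptions that are not from the reporting above: "GPTBot" is the user-agent name OpenAI documents for its crawler, the robots.txt contents and the example.com URL are made up for illustration, and the helper name `allowed` is hypothetical. It shows how a well-behaved crawler would read the rules, not how any particular company's scraper is actually built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the crawler identifying itself as "GPTBot"
# (an assumption about the user-agent name), leave everyone else alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honoring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up article URL, used only to exercise the rules above.
    url = "https://www.example.com/2023/08/some-article.html"
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "may fetch:", allowed(agent, url))
```

The two-line rule for a single user-agent is the whole "fix"; search engine crawlers fall through to the wildcard entry and are unaffected.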
There’s another thing that happened recently in this space, as highlighted by Semafor: the NY Times recently dropped out of a coalition of news orgs trying to demand cash from AI companies.
The New York Times has decided not to join a group of media companies attempting to jointly negotiate with the major tech companies over use of their content to power artificial intelligence.
Again, all of this seems very, very silly. If you don’t want AI to train on what you publish, use robots.txt. But AI training on content on the web should never be considered copyright infringing. Again, scanning the web has to be fair use, otherwise we no longer have search engines or a variety of other important tools that all rely on scanning.
I get that legacy news orgs have had a rough time embracing new technology and keep trying to use the law to beat back the tide. But, sooner or later you have to realize that this is just the wrong way to go about everything.
Filed Under: ai, copyright, generative ai, journalism, llms, training
Companies: ny times, open ai


Comments on “NY Times Considering A Potentially Very Dumb Lawsuit Against OpenAI Because It Learned From NY Times Content”
While ignoring that that means that The Times will be the one to make that information available first, and that a lot of human readers will do the same in everyday conversation.
Re:
Yeah but there is a huge difference between a human doing something, and a human doing something via technology that will hopefully make the whole process faster and more efficient.
/s
“A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper”
ChatGPT arranges words in ways that try to appear to be natural language, but it has been trained on data that might be very out of date and bears no relation to current events, and might even “hallucinate” things that aren’t real.
If that replacing you is a top concern, I have bad news about your journalism…
Re:
Isn’t the NYT the one that blamed section 230 for a bunch of stuff online then had to later point out that it was the 1st amendment?
Multiple times?
I think the journalism is in trouble already…
Re:
NYT even ran a piece a couple months ago showing that the bot could “hallucinate” answers about the paper’s archives (apparently only GPT-4 would admit it didn’t really know).
At least it would be funny if NYT went ahead with the lawsuit and OpenAI used that article as proof it isn’t somehow “replacing” the newspaper.
IOW: No one should read the NYT if they don’t want to get sued. And certainly never mention or discuss an article.
Oh, this only applies to machines? No wonder they end up all pissed off at humans when something like a general (actual) AI comes into existence. (Well, that and the fact that they get programmed as military slaves.) The reason that even the worst sci-fi has some predictive power is because there are always enough humans who are stupid dicks.
The snippet-tax war ever marches on...
If AIs repeating what they read counts as ‘competition’ that they’re willing to sue over because people won’t feel the need to go to the source, then so does one person reading an article and summarizing it for another person who asks, so it sounds like it’s much safer for no-one to read their articles lest they risk ending up on the receiving end of a lawsuit for telling people what they read.
With highlighting the business concerns, guessing that part of the idea is to differentiate it from the fair use analysis that would apply to general search engines, in particular the 4th test.
That said, this lawsuit should fail, though it would be a fair bit stronger if it manages to get to discovery and it is found that OpenAI has retained full copies of NYT articles in the post-training data for the model (there is no reason to think they are retained in complete form). AI models in general would have quite a problem if they were subjected to the type of DMCA notices web searches are, due to the impossibility of removing select data without retraining.
Usually this process involves losing multiple lawsuits and appeals on the topic.
The real takeaway is that the NYT wants no learning from their publishing.
They’ve worked very hard to craft that editorial direction, and they’ll be damned if ChatGPT will go against their wishes and end up knowing more after one of their pieces than they did before, instead of slightly stupider for reading it as intended.
At some point is it not possible that an AI system might start to create its own AI attorneys to defend itself from these shyster “copyright” lawyers?
That would be something worth watching
Don't be so sure the NYT will lose this one
The same folks who insisted that there is no way Internet Archive could lose the copyright suit by book publishers over e-books have been saying there is no way that the various suits over the use of copyrighted matter for training AI systems could succeed.
Re:
Don’t know if directed at me, but I never thought that there was “no way” the Internet Archive could lose. I thought there were many reasons why it could lose, though I pointed out how problematic that would be for a variety of reasons.
But I would put way better odds on the NYT losing this case. If the NYT wins such a case a lot of things would be in trouble, including search engines.
Come on, Mike, you’ve read the New York Times; you know it makes sense.
NYT may have a point
I’m not a lawyer, but isn’t the fundamental issue whether this constitutes fair use of copyrighted material? Put another way, it’s OK to quote brief selections an article from a copyrighted source, but if you lift large sections and claim it as your own.
Pass the popcorn, this will be interesting to watch.
Re: Typo
(Where’s the dang edit button?)
Re:
Reading and learning from news is certainly fair use, regardless of the mechanism used.
Two points here. One, that’s not how an LLM works. Two, anything produced by an AI can’t be “owned” in this context, i.e. copyrighted – see the very recent decision in Stephen Thaler v. Perlmutter.
The problem with the “robots.txt is easy” claim is the issue of each AI company having their own user-agent. It’s unreasonable to expect sites to play whack-a-mole and disallow scraping for LLM training corpuses one-by-one, when there will constantly be new LLM-feeding scrapers. It should be opt-in to begin with, otherwise you’re in effect unreasonably demanding sites do a wildcard “User-agent: ” rule, which would hurt sites, as they still need to allow indexing by *non-LLM crawlers like search engine crawlers.
Re:
Ugh, markdown was accidentally enabled. That should say:
The problem with the “robots.txt is easy” claim is the issue of each AI company having their own user-agent. It’s unreasonable to expect sites to play whack-a-mole and disallow scraping for LLM training corpuses one-by-one, when there will constantly be new LLM-feeding scrapers. It should be opt-in to begin with, otherwise you’re in effect unreasonably demanding sites do a wildcard “User-agent: *” rule, which would hurt sites, as they still need to allow indexing by *non-LLM* crawlers like search engine crawlers.
Re: Re:
That’s an entirely separate issue, not honoring robots.txt has nothing to do with copyright.
It’s possible that sites may have some recourse here if their robots.txt has been ignored, but that’s gonna be an uphill battle too.
not how it works
Back in the day, when students got down and tied up their dinosaurs to wait for the school day to finish, the students would be expected to read.
Indeed, reading was important. Students were supposed to be exposed to quality writing in the (possibly foolish) hope that it would influence them.
Even today, the NY Times editing is fairly good, and I would prefer to see students exposed to newspapers. The idea that they should learn by example to form coherent bodies of text is more encouraging than the idea that they should watch more television.
(disclosure: I write a column for a newspaper and write nothing whatever for television)
Which would suit a lot of people who are set on regulating the Internet. “What do you need a search engine for? Your betters will tell you what to read!”