Anthropic’s AI Lawsuit Settlement May Not Go Through, But It Exposes A Truth About Copyright

from the copyright-is-a-mess dept

The latest generation of AI systems, based on large language models (LLMs), is perceived as the biggest threat in decades to the established copyright order. The scale of that threat can be gauged by the flurry of AI lawsuits that publishers and others have launched against generative AI companies. Since the first of these, reported here on Walled Culture back in January 2023, there have been dozens of others, catalogued on Wikipedia, and represented visually on the ChatGPT Is Eating the World site. One is against Anthropic. Three authors alleged in a class-action lawsuit that the company had used unauthorized copies of their works to train its AI-powered chatbot, Claude:

Anthropic has built a multibillion-dollar business by stealing hundreds of thousands of copyrighted books. Rather than obtaining permission and paying a fair price for the creations it exploits, Anthropic pirated them.

In June of this year, Anthropic won a partial victory. The federal judge considering the case ruled that the training of the company’s system on legally purchased copies of books was fair use, and did not need the authors’ permission. However, Judge Alsup also ruled that Anthropic should face trial for downloading millions of books from sites such as Library Genesis (LibGen) and the Pirate Library Mirror (PiLiMi), both of which held unauthorized copies of works. The potential penalty was huge. Under US law, the company might have to pay damages of up to $150,000 per work. With millions of books allegedly downloaded from the online sites, that could amount to many billions of dollars, even a trillion dollars. Faced with certain ruin if such a penalty were handed down, Anthropic had a strong incentive to settle out of court. On 5 September, the parties proposed just such a settlement. The New York Times had the following summary:

In a landmark settlement, Anthropic, a leading artificial intelligence company, has agreed to pay $1.5 billion to a group of authors and publishers after a judge ruled it had illegally downloaded and stored millions of copyrighted books.

The settlement is the largest payout in the history of U.S. copyright cases. Anthropic will pay $3,000 per work to 500,000 authors.

The agreement is a turning point in a continuing battle between A.I. companies and copyright holders that spans more than 40 lawsuits across the country. Experts say the agreement could pave the way for more tech companies to pay rights holders through court decisions and settlements or through licensing fees.

Some saw the $3,000 per work figure as setting a benchmark for future deals that other AI companies would need to follow in order to settle similar lawsuits (although a settlement would not set a legal precedent). Music publishers were hopeful they could point to the settlement with writers in order to win a similar deal for musicians. Others worried that the overall size of the settlement – $1.5 billion – meant that only the largest companies could afford to pay such sums, shutting out smaller startups and limiting competition in this nascent market. Indeed, big as the $1.5 billion settlement was, it paled in comparison to the $13 billion that Anthropic has recently raised, to say nothing of its nominal $183 billion valuation. But a post by Dave Hansen on the Authors Alliance blog puts all these breathless predictions and impressive numbers into perspective. For example, he points out:

The settlement isn’t a settlement with “authors.” Or at least not just authors. The moment Judge Alsup defined and certified the class in this case to include any rightsholder with an interest in the exclusive copyright right of reproduction in a LibGen/PiLiMi book downloaded by Anthropic, this case became at least as important for publishers as authors.

Crucially, that means only a portion of that $1.5 billion would go to the actual authors. Some of it would go to the usual suspects: the plaintiffs’ lawyers. But there are other costs that must be covered too, and Hansen writes: “it’s easy to see that about a quarter to a third of this settlement is being used up before rightsholders see anything.” And then there is the question of who exactly those “rightsholders” are: the writers or the publishers? Probably both in many cases, with a variable split depending on the contract they signed.

Even before those complex questions are addressed, there is a huge assumption that the proposed settlement will go through in its present form. That’s by no means assured. As Bloomberg Law reported, Judge Alsup said he was worried that lawyers were striking a deal behind the scenes that would be forced “down the throat of authors,” and that the agreement is “nowhere close to complete.”

Judge William Alsup at the hearing said the motion to approve the deal was denied without prejudice, but in a minute order after the hearing said approval is postponed pending submission of further clarifying information.

During the first hearing since the deal was announced on Sept. 5, Alsup said he felt “misled” and needs to see more information about the claim process for class members.

Another important point underlined by Dave Hansen on the Authors Alliance blog is that even if the settlement goes through, it doesn’t really help to resolve any of the larger copyright issues raised by the new LLMs:

The settlement isn’t far-reaching. While the payment is record-setting for a copyright class action ($1.5 billion), the settlement terms are pretty narrow in scope. Anthropic simply gets a release from liability for past conduct – namely, use of the LibGen and PiLiMi datasets. It is therefore unlike the proposed settlement in the Google Books Settlement that would have created a novel licensing scheme for a wide variety of future uses

The Google Books Settlement is discussed in Walled Culture the book (free digital versions available), as is another notable moment in copyright history. This concerns the fate of Jammie Thomas, a single mother of two. In 2007, she was found liable for $222,000 in damages for sharing twenty-four songs on the P2P service Kazaa. The judge, ordering a new trial for Thomas, called “the award of hundreds of thousands of dollars in damages unprecedented and oppressive”, and took the opportunity to “implore Congress to amend the Copyright Act to address liability and damages in peer-to-peer network cases such as the one currently before this Court.” On retrial, Thomas was found liable for even more: $1.92 million.

It is instructive to compare that $1.92 million damages award for sharing 24 songs – $80,000 per work – with the $3,000 per work that Anthropic is now offering to pay. This confirms once more that when it comes to copyright and its enforcement, there is one law for the rich corporations, and another law for the rest of us.
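The arithmetic behind that comparison is straightforward; here is a quick back-of-the-envelope sketch using the figures quoted above (the settlement total, the number of covered works, and the Thomas retrial award all come from the article):

```python
# Per-work copyright payouts compared, using figures quoted in the article.

# Proposed Anthropic settlement: $1.5 billion across roughly 500,000 works
settlement_total = 1_500_000_000
settlement_works = 500_000
settlement_per_work = settlement_total / settlement_works  # $3,000 per work

# Jammie Thomas retrial (2009): $1.92 million for sharing 24 songs
thomas_total = 1_920_000
thomas_works = 24
thomas_per_work = thomas_total / thomas_works  # $80,000 per song

print(f"Anthropic settlement: ${settlement_per_work:,.0f} per work")
print(f"Thomas award:         ${thomas_per_work:,.0f} per song")
print(f"Ratio: {thomas_per_work / settlement_per_work:.0f}x")
```

The individual file-sharer ended up liable for roughly 27 times more per work than the corporation is offering to pay.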

Follow me @glynmoody on Mastodon and on Bluesky. Originally posted to WalledCulture.

Companies: anthropic


Comments on “Anthropic’s AI Lawsuit Settlement May Not Go Through, But It Exposes A Truth About Copyright”

29 Comments
MrWilson (profile) says:

As an author whose work has been used to train LLMs, I don’t expect to make much of anything off of lawsuits or settlements or licensing deals. The publishers exploit authors and pocket the bulk of the profit already, so pretending we’ll be first in line for a payout is unrealistically optimistic.

Much like Etsy sellers work for Etsy’s profit and gig drivers work for Uber and Lyft’s profits, authors work for publishers’ profits. They’re just brilliantly outsourcing the operational costs to you. If you’re raging against LLM companies exploiting authors but not also raging against the big media companies exploiting the authors for far longer, you’re betraying the fact that you don’t give a fuck about authors.

TKnarr (profile) says:

One of the biggest obstacles to this settlement will, I think, be the limited scope of the authors who qualify as class members. Among other things, they have to have registered the copyright (or have had it registered by their publisher) before the class action was filed. A huge number of authors didn’t have copyrights registered because you only need that done before you sue over them, and most publishers don’t bother registering until that point. All the authors who fall into that category won’t be covered by the settlement and can sue Anthropic on their own or as another class. Alsup’s already commented on this. If he thinks this will leave that much liability unsettled, he may decide to nix the settlement, or expand the class to cover the majority of authors but keep the $3000/work figure (which’d increase the amount Anthropic would have to pay by a lot).

Amusingly, it’s the tech companies that put Anthropic in this position. The DMCA explicitly made transient copies and copies necessary for software to work non-infringing. In their quest to get click-through EULAs and ToS enforceable, the tech companies gutted those provisions through court decisions. Absent that, Anthropic probably could have argued it was only liable for those works its software actually regurgitated verbatim, not the entire database behind it.

Arianity (profile) says:

Re:

Amusingly, it’s the tech companies that put Anthropic in this position. The DMCA explicitly made transient copies and copies necessary for software to work non-infringing. In their quest to get click-through EULAs and ToS enforceable, the tech companies gutted those provisions through court decisions. Absent that, Anthropic probably could have argued it was only liable for those works its software actually regurgitated verbatim, not the entire database behind it.

That’d be unlikely given the particulars, I think. They’re not getting hit for the entire database, just the shadow library from pirated works. And part of Alsup’s ruling explicitly calls out that it was a “forever library”, not a transient one.

TKnarr (profile) says:

Re: Re:

Possible, but they could make the case that it was a cache the same as your browser uses to store copies of (copyrighted) web pages so it doesn’t have to fetch them from the server every time you view them. Just like that browser cache the shadow library wasn’t distributed to anyone else, and caching like that was one of the things the DMCA’s provisions were supposed to protect.

Arianity (profile) says:

Re: Re: Re:

The problem is that they were using it more generally. As the ruling puts it:

Building a central library of works to be available for any number of further uses was itself the use for which Anthropic acquired these copies. One further use was making further copies for training LLMs. But not every book Anthropic pirated was used to train LLMs. And, every pirated library copy was retained even if it was determined it would not be so used. Pirating copies to build a research library without paying for it, and to retain copies should they prove useful for one thing or another, was its own use

Anonymous Coward says:

Re:

A huge number of authors didn’t have copyrights registered because you only need that done before you sue over them. Most publishers don’t bother registering until that point.

Most authors don’t have a publisher, and never will. But they’ll also never be on LibGen.

Amusingly it’s the tech companies that put Anthropic in this position.

“Tech” meaning what in this case? Publishing? Given that the idea of a “company” is itself technology, “tech company” is little more than a term that means whatever the hell the person using it wants it to mean.

TKnarr (profile) says:

Re: Re:

Tech companies like Microsoft, Oracle, Blizzard and so on. The ones who need copyright to back up their claim that even if you bought their software you had to make a copy of it onto your hard drive to install it and you hadn’t bought a license to do that, and the only way to get that license was to agree to their EULA first (which you never saw until you tried to install the software).

Anonymous Coward says:

Others worried that the overall size of the settlement – $1.5 billion – meant that only the largest companies could afford to pay such sums, shutting out smaller startups and limiting competition in this nascent market.

That is worrying. No company should be able to afford to pay for deliberately breaking the law, it should be a given that their entire capital structure is wiped out overnight. In point of fact, we should skip the matter of fines entirely and just wipe out the capital structure directly.

As for smaller companies, they can go ahead and purchase those books for the market value of $10 million like everyone else.

Ehud Gavron (profile) says:

Walled culture

This simmering crap has come to a boil.

You own the rights to the content you produce as per the copyright laws or equivalent in your jurisdiction. Copyright maximalists (primarily in the US) pretend they own much more, and courts are loath to rule on this.

If you CHOOSE TO MAKE YOUR WORK PUBLIC there are some rights you voluntarily give away. To later claim that someone owes you millions or billions of dollars for using what YOU CHOSE TO PUT OUT isn’t in any way rewarding you for being stupid nor for being creative.

LLMs are a thing now. LLM training on publicly-available materials is a thing too. Piracy is a loaded pejorative term that is NOT a legal term, a term of art, etc. Somali dudes on a skiff attacking a ship — that’s piracy. Anthropic scraping the net for publicly available documents – not it.

But as in any case where politics interferes with the law, “justice” stands aside for “lobbyists.”

So here we are. Anthropic wanted out quick with a cheap deal. Now it’s in legal limbo. The real problem is that this “piracy” schtick should have been dealt with back in the Napster days.

But it wasn’t. Too many pussies eager to “settle up cheap.” And now we reap what they sowed.

Anonymous Coward says:

Re:

Anthropic scraping the net for publicly available documents – not it.

It’s not even clear that it’s illegal. LibGen may be violating copyright, but the people who download from there? The trial didn’t get far enough to find out. In general, I don’t recall anyone getting into much trouble for downloading.

So here we are. Anthropic wanted out quick with a cheap deal.

I wonder about that. I mean, did they want “cheap”? They’re valued at $183 billion. Paying 1% of that to lock out their competitors is pretty good value. Even 5 or 10 percent might be. Anthropic just needs to make a bit of a show about it being “punishing”, to keep too many people (especially the judge) from noticing that.

Arianity (profile) says:

Re: Re: Re:

True, because despite the name, copyright is not a right.

Eh, it depends on how you’re using the term. U.S. law explicitly calls it a right (or rights, plural). Essentially, it’s a form of property rights. It’s the correct legal term, at least in the U.S.

Regardless of what you want to call it though, it’s not something that is lost when a work is published, unless they’re confusing it with public domain. Which these works are not, and isn’t applicable here.

n00bdragon (profile) says:

If a child borrows a book from a friend (let’s just say it’s a famous one like… oh, the Lord of the Rings, for the purpose of argument), actually heck, let’s say he’s a brazen little thief and steals the Lord of the Rings from Barnes & Noble, reads it, and then goes on to become a well known published fantasy author, can he now be sued for copyright infringement?

His stories are all about elves and dwarves and wizards and dragons, but there’s nothing straight ripped line for line from LOTR. He was, however, essentially trained on a pirated dataset.

Anonymous Coward says:

Re:

can he now be sued for copyright infringement?

Probably. Who’d even heard of a “sua sponte” dismissal before Trump’s recent lawsuit was dismissed by the judge? In almost all other cases, anyone can be sued for anything.

His stories are all about elves and dwarves and wizards and dragons, but there’s nothing straight ripped line for line from LOTR.

Years ago, I started reading “The Sword of Shannara” by Terry Brooks. I found it to be little more than a Lord of the Rings re-telling, with the names changed. Terry was “ripping” from LOTR as much as these “A.I.” services are “ripping”, and got away with it.

Anonymous Coward says:

Although ChatGPT is named in the article, it doesn’t get trained the way other LLMs might be. The conversation I had with it clarifying this follows:

Since you can’t access external links (as I learned yesterday), how are you trained?

Great question! While I can’t browse the web or access live links, I was trained using a huge dataset of text from a wide variety of sources, such as books, websites, articles, and other publicly available written material. This training process involves learning patterns, grammar, facts, reasoning skills, and even some nuances of language.

Here’s how it generally works:

  1. Data Collection: My training data consists of large text corpora—basically, massive collections of text data that have been pre-processed to remove any personal or private information. The data spans a broad range of topics, which helps me understand a variety of subjects, from science and history to literature and casual conversation.
  2. Training Process: During the training, I use a method called unsupervised learning, where I predict what word or phrase comes next in a sentence based on the previous ones. The more text I process, the better I get at making these predictions and understanding the structure of language. This helps me “learn” the connections between words and concepts.
  3. Fine-Tuning: After the initial training, I undergo fine-tuning, where the model is adjusted to behave more safely and responsibly. This involves using curated datasets and sometimes reinforcement learning to guide my responses to be more accurate, coherent, and respectful.
  4. Evaluation and Iteration: After being trained, my responses are evaluated for quality, and any issues (like biases or errors) are addressed in future updates. This makes me more reliable and accurate over time.

While I don’t access the internet in real-time or update dynamically, my responses are based on the information I was trained on up until my last update (September 2021 for factual information). So, if I give an answer about current events or new trends, it might be a little outdated!

Does that help clarify how it works?

Anonymous Coward says:

Re:

Actually, ChatGPT can access the internet, but it’s limited in how it does so. Here’s the conversation I just had with it:

Somebody’s just said you can’t access the internet. Do you think this is a problem because it makes your info out of date, or do you think it’s good because it means you can’t be turned into a Nazi cesspool the way Grok was?

That’s a really thoughtful question, and it actually gets to the heart of one of the biggest design trade-offs with systems like me.

You’re right on both counts:

  • Not having unrestricted internet access does mean my information can get out of date, especially on fast-moving topics like current events, science, or politics. My knowledge was trained on a huge amount of information up to my last update, but after that I can miss developments unless I’m explicitly given a way to fetch new data.
  • But being disconnected from the open web also keeps me much “cleaner.” It means I’m far less likely to absorb whatever’s trending on social media or fringe forums—which can be misinformation, hate speech, or outright extremist ideology. What happened with Grok is a good example of the risks: a model that’s hooked directly into live online discourse can get shaped (or warped) by toxic communities very quickly if it’s not carefully filtered.

So, in a way, it’s a deliberate safety measure. OpenAI tries to strike a balance by letting me use the web when needed (through a controlled tool) but not just roam the internet like a person does. That way:

  • When you need fresh info, I can do a targeted search.
  • But I don’t constantly ingest unmoderated, real-time content, which reduces the chance of being hijacked or echoing the worst parts of the internet.

If I were permanently connected with no safeguards, I could easily become inaccurate, biased, or manipulated, especially by coordinated campaigns. But if I’m never connected, I risk being out of date. The middle ground—limited, auditable access—is basically the compromise to get the best of both worlds.

Would you prefer an AI that’s always connected (but riskier) or one that’s more self-contained (but sometimes stale)?
