hear hear

Copyright infringement has only ever been claimed against two things:
1. If someone’s final product is so like the copyrighted product that it is clearly copied from it.
2. If someone makes copies and distributes them without permission.
Training isn’t either of these. Only the final product can be infringement, and only if you can show a side-by-side comparison of sameness.

Anonymous Coward

November 3, 2023 at 2:33 pm

Can someone please address why it is legal to use, e.g., a copy of Stephen King’s IT found on the internet to train an LLM?

Anonymous Coward

November 3, 2023 at 2:36 pm

Re:

See above article

Anonymous Coward

November 3, 2023 at 3:31 pm

Re: Re:

You download a database of text. It contains the hypothetical copy of Stephen King’s IT. You use that database to train the LLM. You violated the copyright by downloading the book, just as you would have if you borrowed a copy from the library and scanned each page.

Your scraping program browses deviantart. It downloads all the pictures. You feed the pictures into an image AI. You violated copyright by downloading the pictures.

Again, can someone explain why copyright has no role in a system built on mass copyright infringement?

Anonymous Coward

November 3, 2023 at 3:47 pm

Re: Re: Re:

A large language model does not copy a book into its database; it analyses the book’s use of language to build and modify that database. The training is done under the same rules as you or I reading content on the Internet, or downloading copies where that is allowed.

Anonymous Coward

November 3, 2023 at 3:56 pm

Re: Re: Re:

Infringement is infringement. If you downloaded an infringing copy of a novel, that’s infringement. Doesn’t matter if it is for an LLM training dataset or not.

No one looking at what is displayed on DeviantArt is infringing. It’s on display.

Next?

Darkness Of Course (profile)

November 3, 2023 at 5:05 pm

Re: Re: Re:² Define infringement

If infringement is your concern, then non-infringement is by definition not something that would be of your concern.

If reading a book at a library isn’t infringing, which it isn’t, then a machine “reading” the same book cannot be infringing. That the machine can read all the books in the library in less time than a normal reader can read one book, is also, not infringing.

Clearly you are operating under a miasma of ignorance, or with notions that are based on invalid emotional arguments. They don’t keep the works, they train with them. That action is not significantly different than a human reading the book. Maintaining a database of everything they read is intractable, and unnecessary.

Seriously, how could you be so ignorant as to believe that the entire freaking web would be stored in individual databases for all the different AI projects? The cost alone would be prohibitive.

Anonymous Coward

November 5, 2023 at 8:23 pm

Re: Re: Re:³

If reading a book at a library isn’t infringing, which it isn’t, then a machine “reading” the same book cannot be infringing.

The “reading” of the book isn’t infringement, but is it then storing some of that book verbatim? I bet it is.

An “AI” is not a human or an intelligent being; it’s a mere program. A human is allowed to memorize what they read. If a computer program does it, that’s called “making a copy” and can well be infringing.

John

November 6, 2023 at 1:30 am

Re: Re: Re:⁴ Memorize?

You might want to consider the numbers. Last I heard of any numbers, ChatGPT used a training set of 45 terabytes and the resulting database is 800 gigabytes. Simple math indicates that the data ChatGPT uses is only about 1.8% of the size of the training data. Now, let’s look at how compressible English text is. The best program on the Large Text Compression benchmark was only able to achieve a compression result of 10.8%. That implies that if the AI were retaining copies of training inputs, it would be exceeding the best known compression by at least a factor of 6. Not exactly probable.
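For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The three figures are simply the ones quoted in this comment (not independently verified), decimal units (1 TB = 1000 GB) are an assumption, and the function names are made up for illustration.

```python
# Figures as quoted in the comment; they are the commenter's numbers, not verified here.
TRAINING_SET_GB = 45 * 1000      # 45 terabytes, assuming decimal units (1 TB = 1000 GB)
MODEL_SIZE_GB = 800              # size of the resulting "database"
BEST_COMPRESSION_RATIO = 0.108   # benchmark result: output is 10.8% of the input size


def size_ratio(model_gb: float, training_gb: float) -> float:
    """Fraction of the training-set size that the trained model occupies."""
    return model_gb / training_gb


def required_improvement(best_ratio: float, observed_ratio: float) -> float:
    """How many times better than the benchmark verbatim storage would have to be."""
    return best_ratio / observed_ratio


ratio = size_ratio(MODEL_SIZE_GB, TRAINING_SET_GB)
factor = required_improvement(BEST_COMPRESSION_RATIO, ratio)

print(f"model is {ratio:.1%} of the training set")
print(f"verbatim storage would need to beat the benchmark by about {factor:.1f}x")
```

This only shows that the model cannot be holding all of its training data verbatim under those figures; it says nothing about whether some individual passages are retained.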

Anonymous Coward

November 6, 2023 at 3:23 am

Re: Re: Re:⁵

I’ve heard people say things like they were able to get the entire first page of a book (small enough to be way less than 10% but enough to potentially be infringing) out of an AI. Maybe they were lying, or maybe the AI looked up a current version of a page to do that instead of its training data.

Rocky

November 7, 2023 at 1:46 pm

Re: Re: Re:⁶

I’ve heard people say things like they were able to get the entire first page of a book (small enough to be way less than 10% but enough to potentially be infringing) out of an AI. Maybe they were lying, or maybe the AI looked up a current version of a page to do that instead of its training data.

“were able”, i.e., they spent time crafting a prompt to get the result they wanted. Given enough prompting, even a human can produce the first page of any random book they’ve read.

If someone asks me I can quote the first page of my favorite book verbatim, does that mean my brain infringes the author’s copyright?

There’s examples of people who can recite a book verbatim after reading it once.

Mamba (profile)

November 3, 2023 at 5:15 pm

Re: Re: Re:

The premise of your first point is based on the assumption that they pirated a book, which isn’t required. Digital Libraries exist.

Your second point has been hashed out multiple multiple times. Mostly because that’s how web browsers work.

Anonymous Coward

November 3, 2023 at 6:08 pm

Re: Re: Re:

You copyright fanatics have claimed every system is a system built on copyright infringement, which is why we have shit like game devs being allowed to use the DMCA to take down criticisms of their product – even when the review itself doesn’t contain footage of their game.

Perhaps someone should refer you to the story “The Boy Who Cried Wolf”, or is that also copyright infringement to you?

Anonymous Coward

November 3, 2023 at 7:14 pm

Re: Re: Re:

In the first case, it’s irrelevant. You could fully shoplift a book and it wouldn’t be infringement to read it.

For the second, there is no possible infringement since it’s the same as accessing the website.

Anonymous Coward

November 4, 2023 at 4:07 am

Re: Re: Re:

“You violated copyright by downloading the pictures.”

Including the CC-licensed and CC0 images? I don’t think so.

Diogenes (profile)

November 4, 2023 at 7:48 am

Re: Re: Re: evidence not on record

If you can show it illegally downloaded a copyrighted work, then it’s infringement, just the same as if you got caught downloading a song from a pirate website. But you haven’t shown that, and afaik it hasn’t been shown in court.

Toom1275 (profile)

November 8, 2023 at 8:18 am

Re: Re: Re:

Because “built on mass copyright infringement” is a delusion that springs from utter ignorance, and such things have no place in determining legal response.

Anonymous Coward

November 3, 2023 at 3:30 pm

Re:

For the same reason that colleges have libraries for their students to use.

PaulT (profile)

November 4, 2023 at 2:21 pm

Re:

Same reason it’s legal for me to read it before I write something.

There’s many concerns, but every author has built on what they read before, including King (The Dark Tower’s opening is openly based on a poem, Salem’s Lot is openly based on Dracula, etc.).

There’s problems here, but even the people you’re defending will admit they used others’ work, at least subliminally.

woof (profile)

November 3, 2023 at 2:35 pm

What about "training" students?

How is teaching (aka training) machines with an author’s works ANY DIFFERENT than training (aka teaching) students with her works?

It isn’t…and shouldn’t be treated differently.

Copyright isn’t a factor when a student reads a book. It shouldn’t be a factor when a machine “reads” a book.

Anonymous Coward

November 5, 2023 at 8:31 pm

Re:

It’s not a factor when the machine or the student “reads” a book. It’s absolutely a factor if either one, after reading, “takes notes” that include, say, a chapter of the book verbatim.

So exactly how much copying is being done by the AI?

Rocky

November 7, 2023 at 1:59 pm

Re: Re:

None. The simplest (and very bad) analogy for how an LLM actually stores what it has been trained on is that it stores the relationships between words and the likelihood that they appear together in a given context.
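To make that analogy concrete, here is a toy sketch in Python. It is a plain word-pair (bigram) counter, which is an assumption chosen for illustration and not how a real LLM works: real models learn neural network weights rather than an explicit table. The function names and the sample sentence are made up.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count how often each word follows another.

    Only the pair counts are kept, not the ordered sentence itself.
    """
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return dict(counts)


def next_word_probabilities(counts: dict, word: str) -> dict:
    """Turn the follower counts for one word into likelihoods."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

With a corpus this tiny the table gives away most of the sentence, so the sketch illustrates the kind of thing being stored, not how much of a given work can or cannot be recovered from a large model.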

Ethin Probst (profile)

November 3, 2023 at 2:44 pm

As someone who’s blind, Copyright has negatively impacted me a lot. DRM in books is a particular problem. Oh, companies like Amazon have (tried) to make their books accessible but it doesn’t always go well. IMO section 1201 should either be repealed, or be modified so that it is not a violation of that section (or of any other section of title 17) if the tampering, infringement, or deactivation of mechanisms is for accessibility purposes.

Anonymous Coward

November 3, 2023 at 3:13 pm

Maybe Fair Use, However AI Output Is Publishing

AI can take in everything just like human consumption; however, any output is inherent publishing. So if copyrighted material comes out of its virtual mouth, that is a violation, and the owner/operator must pay or “kill” it, as prison is useless to the undead.

Anonymous Coward

November 3, 2023 at 3:33 pm

Re:

Private use of AI is no more publishing than anything anybody you know shows or tells you.

Anonymous Coward

November 3, 2023 at 3:57 pm

Re:

AI can’t have copyright. Missed that bit.

Anonymous Coward

November 3, 2023 at 5:27 pm

Re:

AI can take in everything just like human consumption,

I’ll grant it.

any output is inherent publishing.

I’ll grant it.

So if copyrighted material comes out of its virtual mouth …

So, okay, sure. That would be plagiarism, as if a human had done so.

But…

paraphrased material is not copyright infringement
declaration of facts, or lists of facts, are not copyright infringement.
numbers are not infringement.
new works that fail “substantial similarity” are not infringement
anything qualifying under Fair Use is not infringement

And note that my list above is not comprehensive.

“Copyrighted work goes in, something comes out” is not a valid basis to declare that copyright infringement has occurred.

Everything else in your comment is incoherent. Please try again.

Anonymous Coward

November 3, 2023 at 7:16 pm

Re:

any output is inherent publishing.

jesse what the fuck are you talking about dot jaypeg

Anonymous Coward

November 4, 2023 at 1:10 am

Re: Re:

Copyright fanatics are desperate to find some way to run an end run around copyright law penalties and to shape the narrative so that they don’t look like complete assholes to the public.

Since copyright infringement and theft have very different definitions legally, they’ve been trying to claim that the act of “making available” counts as publishing, or some other word that constitutes distributing paid content in a way that they disagree with. i.e., it’s not the downloading that’s the crime, but the action that made it possible for someone else to access content they didn’t directly compensate the copyright holder for. In the context of the RIAA, talking about distribution instead of downloading also takes some of the heat off of them for suing end users like they did with children and grandmothers, because they could then argue that they weren’t going after people downloading a CD’s worth of songs; they were going after people costing them thousands of millions of dollars in revenue by providing a free option.

(Side note, this is also why John Smith loves claiming “contributory infringement” or “distributor liability”, because anything that makes it easier for copyright holders to sue is entirely his end goal.)

Of course, that strategy eventually fell apart. Comparing someone who has an unsecured WiFi connection with a cartel that actually manufactures bootleg disks is not remotely in the same ballpark, and when judges started asking for proof that infringement actually happened, copyright holders started running like hell because they had absolutely nothing aside from vague moralistic claims. It’s simply not possible for anyone to tell whether a full file got downloaded or how much each IP address contributed to a torrent swarm.

But let’s say we actually went with the “every output is publishing” argument – how is AI generated output different from a mixtape? What role did the copyright holder actually have in creating the new remixed image? (Sure, to be fair, copyright holders have long since loathed the right to remix, but I don’t see them winning that fight any time soon.)

Anonymous Coward

November 4, 2023 at 4:13 am

Re: Re: Re:

“…the act of “making available” counts as publishing…”

So every book sold by Amazon is published by them? 🐱

Anonymous Coward

November 4, 2023 at 4:41 am

Re: Re: Re:²

So every book sold by Amazon is published by them?

If it meant that copyright holders could demand more money by making Amazon responsible for some lost revenue, or allowing them to force a judge into making a ruling? Sure, copyright fans would be more than happy to make that argument.

It’s why John Smith has been pushing hard for the “distributor liability” angle for defamation, too. According to him, just mentioning that someone was brought to court just for being accused of a heinous crime constituted reputational damage for which jail time and severe fines should be a thing, whether it’s a person recounting facts or a news site hosting an article.

Anonymous Coward

November 4, 2023 at 9:07 am

Re: Re: Re:

I see, thanks for explaining this. Intellectual monopolists are so dishonest.

Anonymous Coward

November 3, 2023 at 3:55 pm

While I totally agree with this approach to copyright, we have to remember that the DMCA exists, and part of its intent is to control what people (and machines) can do to consume copyrighted content.

So the Copyright Office is not above another similar carve-out for anything involving machine learning.

Anonymous Coward

November 3, 2023 at 5:14 pm

I think there are three areas that need to be addressed, and the article only deals with the second of the three:

1) How the books/media are acquired to be used for training. Were they acquired “legally”? Given the quantity involved, I highly doubt it – likely neither purchased nor borrowed (from a library). More likely just hoovered from shady repositories. But that’s not really a matter for copyright law, since copyright is about distribution/publication, not acquisition. There was probably a violation of law involved, but it was theft, not copyright. (The only copyright violation here would be on the part of whoever aggregated the texts that the training software used.)

2) The aspect described by the article: the method of consumption. Here, my feeling is that the article is correct. There is no violation by the act of training.

3) The output based on that training. Here the question becomes whether the new art is transformative or not. Historically, there has been a surprisingly low bar for what’s considered transformative, and AI generated text/images clear that bar without breaking a sweat.

Mamba (profile)

November 3, 2023 at 5:16 pm

Re:

Theft? Get real.

Anonymous Coward

November 3, 2023 at 7:18 pm

Re:

Anyone who thinks “theft” has anything to do with this has ou

Anonymous Coward

November 3, 2023 at 7:19 pm

Re:

Anyone who says “theft” has anything to do with this has outed themselves as not remotely informed enough to talk about this cogently.

Anonymous Coward

November 4, 2023 at 4:43 am

Re: Re:

Anyone mentioning “theft” in the context of “copyright infringement” has long since indicated that they’re not interested in honest discussions on the subject. They mean to use emotional, manipulative, table-banging arguments to get their point across.

Diogenes (profile)

November 4, 2023 at 7:54 am

It’s a tool

Keep in mind that AI training is just a tool humans use to statistically analyze works freely available on the internet. If there was any infringing, it’s not the AI doing it – it’s the human using the AI. Once past that, you need to decide if the human is in fact violating any copyright laws in his use of the internet.

The Phule

November 4, 2023 at 7:57 am

No copyright for AI

Absolutely nothing published by AI should have any sort of copyright.

Diogenes (profile)

November 4, 2023 at 8:44 am

Re: AI isn’t the publisher

Of course AI can’t publish anything. It’s the owner of the AI that would be publishing content created by its AI tool.

The Phule

November 4, 2023 at 7:59 am

Protection for artists

I do think that copyright has a role in regulating the output of an AI.

If I demand, say, the complete Animorphs series as a stage play from an AI, it shouldn’t be capable of complying. Those characters are copyrighted and that setting is copyrighted.

AI would only be capable of producing generics.

Diogenes (profile)

November 4, 2023 at 8:15 am

Re: already covered

It’s already infringement for a human to copy the Animorphs series, so a human using AI to copy it would also be infringement.

Anonymous Coward

November 4, 2023 at 9:45 am

Re:

You appear to be making a false assumption, namely that the AI has an internal copy of the series, and of other works, in its database. It does not. Indeed, to achieve what you are suggesting, the user would have to have a lot of detailed knowledge about the series.

Also, “think of the poor artists” is a dumb card to play, as most artists are poor and create their art without much hope of payment. That is, the artists who make nothing but continue to create far outnumber those who are capable of making a living from their art. Indeed, copyright mainly benefits the publishers and not the artists, and the publishers have a good reason to limit the competition to the works they purchase, in that it is far easier to make a profit when there are few works being made available. Long copyrights, stopping derivative works and fair use, and keeping works off the market are all aimed at limiting people’s choices and the overheads of keeping works on the market so as to maximize profits.

PaulT (profile)

November 4, 2023 at 2:27 pm

Re:

“If I demand, say, the complete Animorphs series as a stage play from an AI, it shouldn’t be capable of complying. Those characters are copyrighted and that setting is copyrighted.”

OK. But then there’s another pass and all the references to copyrighted characters are removed. Is the result still infringing? If so, how does that affect human-written works like Fifty Shades of Grey, which started as Twilight fan fiction? Where are the lines drawn?

Anonymous Coward

November 5, 2023 at 9:17 pm

It would be an absurd result – and one inconsistent with what the Progress Clause of the Constitution enables copyright law to do – if copyright law could prevent the public from getting to consume the works that copyright law has incentivized the creation of. Such barriers would also conflict with the right to read found in the First Amendment (or, stated more broadly, the right to receive information and ideas).

You’re wrong. Copyright law already does this. For example, it prohibits unauthorized translations as derivative works. It prohibits the unauthorized publication of unpublished works. It allows Disney to put a movie in the “vault” for a couple of decades.

If people can direct their screen reader to read one work, they should be able to direct their screen reader to read many works.

AI isn’t a screen reader. And your screen reader presumably reads what’s on your screen. It doesn’t store a bunch of copyrighted articles just in case you want to read them later (and I don’t think it would be legal for it to do so.)

The Internet can be a bit weird when it comes to copyrighted stuff. Your browser gets a bunch of HTML and tries to assemble it into something readable. Different browsers or different hardware will make the same website look different, and there are also various plugins available. I think there’s an implied license when a site sends you HTML that you’re allowed to use various software to process and view it, including things like screen readers (not to mention potential ADA problems if things like screen readers aren’t allowed.) I don’t think that extends to using it to train AI. That’s not the same category of software.

And this is especially a different category when this AI is commercial software and the people giving it the training data are not the same people who will be using it to get information. You can tell your screen reader to read your screen to you; you can’t tell your screen reader to broadcast it to a room full of people paying you for the privilege.

Anonymous Coward

November 7, 2023 at 2:02 pm

Re:

It doesn’t store a bunch of copyrighted articles just in case you want to read them later

Neither does an AI.

Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training

from the copyright-free-zone dept

Comments on “Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training”
