Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training

from the copyright-free-zone dept

These days everyone seems to be talking about AI, and the Copyright Office is no exception. It may make sense for the Office to weigh in, because people keep trying to invoke copyright as a concept implicated by various aspects of AI, including, and perhaps especially, the “training” of AI systems. So the Copyright Office recently launched a study to get feedback on the role copyright has, or should be changed to have, in shaping any law that bears on AI, and earlier this week the Copia Institute filed an initial comment in that study.

In our comment we made several points, but the main one was that, at least when it comes to AI training, copyright law needs to butt out. It has no role to play now, nor could it constitutionally be changed to have one. And regardless of the legitimacy of any concerns about how AI may be used, allowing copyright to become an obstructing force in order to prevent AI systems from being developed will only have damaging effects, not just deterring whatever benefits the innovation might provide but also undermining the expressive freedoms we depend on.

In explaining our conclusion we first observed that one overarching problem poisoning any policy discussion of AI is that “artificial intelligence” is a terrible term that obscures what we are actually talking about. Not only do we tend to conflate the way we develop it (or “train” it) with the way we use it, each of which presents its own promises and potential perils, but in general we all too often regard it as some new form of powerful magic that can either miraculously solve all sorts of previously intractable problems or threaten the survival of humanity. “AI” can certainly inspire naïve enthusiasm prone to deploying it in damaging ways, as well as equally unfounded moral panics preventing it from being used beneficially. It can also prompt genuine concerns as well as genuine excitement. Any policy discussion addressing it must therefore cut through the emotion and tease out exactly which aspect of AI we are talking about when we address those effects. We cannot afford to take analytical shortcuts, especially if they would lead us to inject copyright into an area of policy where it does not belong and where its presence would instead cause its own harm.

AI is not in fact magic; in reality it is simply a sophisticated software tool that helps us process the information and ideas around us. And copyright law exists to make sure there are information and ideas for the public to engage with. It does so by bestowing on the copyright owner certain exclusive rights, in the hope that this exclusivity makes it economically viable to create the works containing those ideas and information. But these exclusive rights all necessarily focus on the creation and performance of works. None of them limits how the public can then consume those works once they exist, because, indeed, the whole point of helping ensure they could exist is so that the public can consume them. Copyright law wouldn’t make sense, and probably wouldn’t be constitutional per the Progress Clause, if the way it worked constrained that consumption and thus the public’s engagement with those ideas and information.

It would also offend the First Amendment, because the right of free expression inherently includes what is often referred to as the right to read (or, more broadly, the right to receive information and ideas). That is a big reason why book bans are so constitutionally odious: they explicitly and deliberately attack that right. But people don’t just have the right to consume information and ideas directly through their own eyes and ears. They have the right to use tools to help them do it, including technological ones. As we explained in our comment, the ability to use tools to receive and perceive created works is often integral to facilitating that consumption – after all, how could the public listen to a record without a record player, or consume digital media without a computer? No law could prevent the use of such tools without seriously impinging on the underlying right to consume the works at all. The United States is also a signatory to the Marrakesh Treaty, which addresses the particular need of those with visual and auditory impairments to use tools such as screen readers to help them consume the works they would otherwise be entitled to perceive. Of course, it is not only those with such impairments who may need such tools, and the right to format shift should allow anyone to use a screen reader to help them consume works if it will help them glean those ideas effectively.

What too often gets lost in discussions of AI is that, because we are not talking about some exceptional form of magic but rather just fancy software, AI training must be understood as simply an extension of these same principles that allow the public to use tools, including software tools, to help them consume works. After all, if people can direct their screen reader to read one work, they should be able to direct their screen reader to read many works. Conversely, if they cannot use a tool to read many works, it undermines their ability to use a tool to help them read any. Thus it is critically important that copyright law not interfere with AI training, lest it interfere with the public’s right to consume works as they currently should be able to do.

So at minimum such AI training needs to be considered a fair use, but the better approach is to recognize that there is no role for copyright to play in AI training at all. To say it is allowed as a fair use inflates the power of the copyright holder beyond what the statute or Constitution should allow, because it suggests that using tools to consume works could ever be an infringement that merely happens to be excused in this context. But copyright law is not supposed to give copyright owners such power over the consumption of their works, power we would then depend on fair use to temper. It should never apply to limit the consumption of works in any context, and we should not let concerns about AI generally, or its uses or outputs specifically, open the door to copyright law ever becoming an obstacle to that consumption.



Comments on “Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training”

This comment has been deemed insightful by the community.
Diogenes (profile) says:

hear hear

Copyright infringement has only ever been claimed against two things:
1. if someone’s final product is so like the copyrighted product that it is clearly copied from it.
2. if someone makes copies and distributes them without permission.
Training isn’t either of these. Only the final product can be infringement, and only if you can show a side-by-side comparison of sameness.

Anonymous Coward says:

Re: Re:

You download a database of text. It contains the hypothetical copy of Stephen King’s IT. You use that database to train the LLM. You violated the copyright by downloading the book, just as you would have if you borrowed a copy from the library and scanned each page.

Your scraping program browses deviantart. It downloads all the pictures. You feed the pictures into an image AI. You violated copyright by downloading the pictures.

Again, can someone explain why copyright has no role in a system built on mass copyright infringement?

Darkness Of Course (profile) says:

Re: Re: Re:2 Define infringement

If infringement is your concern, then non-infringement is by definition not something that would be of your concern.

If reading a book at a library isn’t infringing, which it isn’t, then a machine “reading” the same book cannot be infringing. That the machine can read all the books in the library in less time than a normal reader can read one book is also not infringing.

Clearly you are operating under a miasma of ignorance, or with notions that are based on invalid emotional arguments. They don’t keep the works, they train with them. That action is not significantly different than a human reading the book. Maintaining a database of everything they read is intractable, and unnecessary.

Seriously, how could you be so ignorant as to believe that the entire freaking web would be stored in individual databases for all the different AI projects? The cost alone would be prohibitive.

Anonymous Coward says:

Re: Re: Re:3

If reading a book at a library isn’t infringing, which it isn’t, then a machine “reading” the same book cannot be infringing.

The “reading” of the book isn’t infringement, but is it then storing some of that book verbatim? I bet it is.

An “AI” is not a human or an intelligent being; it’s a mere program. A human is allowed to memorize what they read. If a computer program does it, that’s called “making a copy” and can well be infringing.

John says:

Re: Re: Re:4 Memorize?

You might want to consider the numbers. Last I heard of any numbers, ChatGPT used a training set of 45 terabytes and the resulting database is 800 gigabytes. Simple math indicates that the data ChatGPT uses is only about 1.8% of the size of the training data. Now, let’s look at how compressible English text is. The best program on the Large Text Compression Benchmark was only able to achieve a compression result of 10.8%. That implies that if the AI were retaining copies of training inputs, it would be exceeding the best known compression by at least a factor of 6. Not exactly probable.
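For anyone who wants to sanity-check that arithmetic, a few lines of Python reproduce the ratios. (The 45 TB training set, 800 GB model, and 10.8% benchmark figures are the ones cited above, taken as given here rather than independently verified.)

    # Back-of-the-envelope check of the ratios cited above. The input
    # figures (45 TB training set, 800 GB model, 10.8% best result on the
    # Large Text Compression Benchmark) come from the comment and are
    # assumptions, not independently verified values.
    training_gb = 45_000        # 45 terabytes, expressed in gigabytes
    model_gb = 800              # size of the resulting model
    best_compression = 0.108    # best known lossless ratio for English text

    model_fraction = model_gb / training_gb              # ~0.018, i.e. ~1.8%
    implied_factor = best_compression / model_fraction   # ~6.1

    print(f"Model is {model_fraction:.1%} of the training data")
    print(f"Verbatim retention would need ~{implied_factor:.1f}x the best known compression")

Of course, this only speaks to whether wholesale verbatim retention of the training set is plausible; it doesn’t rule out that fragments of specific works can sometimes be reproduced, as the replies below discuss.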

Anonymous Coward says:

Re: Re: Re:5

I’ve heard people say things like they were able to get the entire first page of a book (small enough to be way less than 10% but enough to potentially be infringing) out of an AI. Maybe they were lying, or maybe the AI looked up a current version of a page to do that instead of its training data.

Rocky says:

Re: Re: Re:6

I’ve heard people say things like they were able to get the entire first page of a book (small enough to be way less than 10% but enough to potentially be infringing) out of an AI. Maybe they were lying, or maybe the AI looked up a current version of a page to do that instead of its training data.

“were able”, i.e. they spent time crafting a prompt to get the result they wanted. Given enough prompting, even a human can produce the first page of any random book they’ve read.

If someone asks me I can quote the first page of my favorite book verbatim, does that mean my brain infringes the author’s copyright?

There are examples of people who can recite a book verbatim after reading it once.

Anonymous Coward says:

Re: Re: Re:

You copyright fanatics have claimed every system is a system built on copyright infringement, which is why we have shit like game devs being allowed to use the DMCA to take down criticisms of their product – even when the review itself doesn’t contain footage of their game.

Perhaps someone should refer you to the story “The Boy Who Cried Wolf”, or is that also copyright infringement to you?

PaulT (profile) says:

Re:

Same reason it’s legal for me to read it before I write something.

There are many concerns, but every author has built on what they read before, including King (The Dark Tower’s opening is openly based on a poem, Salem’s Lot is openly based on Dracula, etc.).

There’s problems here, but even the people you’re defending will admit they used others’ work, at least subliminally.

This comment has been deemed insightful by the community.
woof (profile) says:

What about "training" students?

How is teaching (aka training) machines with an author’s works ANY DIFFERENT than training (aka teaching) students with her works?

It isn’t…and shouldn’t be treated differently.

Copyright isn’t a factor when a student reads a book. It shouldn’t be a factor when a machine “reads” a book.

This comment has been deemed insightful by the community.
Ethin Probst (profile) says:

As someone who’s blind, I’ve been negatively impacted by copyright a lot. DRM in books is a particular problem. Oh, companies like Amazon have (tried) to make their books accessible, but it doesn’t always go well. IMO section 1201 should either be repealed or be modified so that it is not a violation of that section (or of any other section of title 17) if the tampering, infringement, or deactivation of mechanisms is for accessibility purposes.

Anonymous Coward says:

Re:

AI can take in everything just like human consumption,

I’ll grant it.

any output is inherent publishing.

I’ll grant it.

So if copyrighted material comes out of it’s virtual mouth …

So, okay, sure. That would be plagiarism, as if a human had done so.

But…

  • paraphrased material is not copyright infringement
  • declarations of facts, or lists of facts, are not copyright infringement.
  • numbers are not infringement.
  • new works that fail “substantial similarity” are not infringement
  • anything qualifying under Fair Use is not infringement

And note that my list above is not comprehensive.

“Copyrighted work goes in, something comes out” is not a valid basis to declare that copyright infringement has occurred.

Everything else in your comment is incoherent. Please try again.

Anonymous Coward says:

Re: Re:

Copyright fanatics are desperate to find some way to do an end run around copyright law penalties and shape the narrative to make themselves not look like complete assholes to the public.

Since copyright infringement and theft have very different definitions legally, they’ve been trying to claim that the act of “making available” counts as publishing, or some other term for distributing paid content in a way they disagree with. That is, it’s not the downloading that’s the crime, but the action that made it possible for someone else to access content they didn’t directly compensate the copyright holder for. In the context of the RIAA, talking about distribution instead of downloading also takes some of the heat off them for suing end users like they did with children and grandmothers, because they could then argue that they weren’t going after people downloading a CD’s worth of songs; they were going after people costing them thousands of millions of dollars in revenue by providing a free option.

(Side note, this is also why John Smith loves claiming “contributory infringement” or “distributor liability”, because anything that makes it easier for copyright holders to sue is entirely his end goal.)

Of course, that strategy eventually fell apart. Someone with an unsecured WiFi connection and a cartel that actually manufactures bootleg discs are not remotely in the same ballpark, and when judges started asking for proof that infringement actually happened, copyright holders started running like hell because they had absolutely nothing aside from vague moralistic claims. It’s simply not possible for anyone to tell whether a full file got downloaded or how much each IP address contributed to a torrent swarm.

But let’s say we actually went with the “every output is publishing” argument – how is AI-generated output different from a mixtape? What role did the copyright holder actually have in creating the new remixed image? (Sure, to be fair, copyright holders have long loathed the right to remix, but I don’t see them winning that fight any time soon.)

Anonymous Coward says:

Re: Re: Re:2

So every book sold by Amazon is published by them?

If it meant that copyright holders could demand more money by making Amazon responsible for some lost revenue, or allowing them to force a judge into making a ruling? Sure, copyright fans would be more than happy to make that argument.

It’s why John Smith has been pushing hard for the “distributor liability” angle for defamation, too. According to him, just mentioning that someone was brought to court just for being accused of a heinous crime constituted reputational damage for which jail time and severe fines should be a thing, whether it’s a person recounting facts or a news site hosting an article.

Anonymous Coward says:

I think there are three areas that need to be addressed, and the article only deals with the second of the three:

1) How the books/media are acquired to be used for training. Were they acquired “legally”? Given the quantity involved, I highly doubt it – likely neither purchased nor borrowed (from a library). More likely just hoovered from shady repositories. But that’s not really a matter for copyright law, since copyright is about distribution/publication, not acquisition. There was probably a violation of law involved, but it was theft, not copyright infringement. (The only copyright violation here would be on the part of whoever aggregated the texts that the training software used.)

2) The aspect described by the article: the method of consumption. Here, my feeling is that the article is correct. There is no violation by the act of training.

3) The output based on that training. Here the question becomes whether the new art is transformative or not. Historically, there has been a surprisingly low bar for what’s considered transformative, and AI-generated text/images clear that bar without breaking a sweat.

The Phule says:

Protection for artists

I do think that copyright has a role in regulating the output of an AI.

If I demand, say, the complete Animorphs series as a stage play from an AI, it shouldn’t be capable of complying. Those characters are copyrighted and that setting is copyrighted.

AI would only be capable of producing generics.

Anonymous Coward says:

Re:

You appear to be making a false assumption, namely that the AI has an internal copy of the series, and other works, in its database; it does not. Indeed, to achieve what you are suggesting, the user would have to have a lot of detailed knowledge about the series.

Also, “think of the poor artists” is a dumb card to play, as most artists are poor and create their art without much hope of payment. That is, the artists who make nothing but continue to create far outnumber those who are capable of making a living from their art. Indeed, copyright mainly benefits the publishers and not the artists, and the publishers have good reason to limit the competition to the works they purchase, in that it is far easier to make a profit when few works are being made available. Long copyrights, stopping derivative works and fair use, and keeping works off the market are all aimed at limiting people’s choices and the overheads of keeping works on the market, so as to maximize profits.

PaulT (profile) says:

Re:

“If I demand, say, the complete Animorphs series as a stage play from an AI, it shouldn’t be capable of complying. Those characters are copyrighted and that setting is copyrighted.”

OK. But then there’s another pass and all the references to copyrighted characters are removed. Are they still infringing? If so, how does that affect human-written works like Fifty Shades of Grey, which started as Twilight fan fiction? Where are the lines drawn?

Anonymous Coward says:

It would be an absurd result – and one inconsistent with what the Progress Clause of the Constitution enables copyright law to do – if copyright law could prevent the public from getting to consume the works that copyright law has incentivized the creation of. Such barriers would also conflict with the right to read found in the First Amendment (or, stated more broadly, the right to receive information and ideas).

You’re wrong. Copyright law already does this. For example, it prohibits unauthorized translations as derivative works. It prohibits the unauthorized publication of unpublished works. It allows Disney to put a movie in the “vault” for a couple of decades.

If people can direct their screen reader to read one work, they should be able to direct their screen reader to read many works.

AI isn’t a screen reader. And your screen reader presumably reads what’s on your screen. It doesn’t store a bunch of copyrighted articles just in case you want to read them later (and I don’t think it would be legal for it to do so.)

The Internet can be a bit weird when it comes to copyrighted stuff. Your browser gets a bunch of HTML and tries to assemble it into something readable. Different browsers or different hardware will make the same website look different, and there are also various plugins available. I think there’s an implied license when a site sends you HTML that you’re allowed to use various software to process and view it, including things like screen readers (not to mention potential ADA problems if things like screen readers aren’t allowed). I don’t think that extends to using it to train AI. That’s not the same category of software.

And this is especially a different category when this AI is commercial software and the people giving it the training data are not the same people who will be using it to get information. You can tell your screen reader to read your screen to you; you can’t tell your screen reader to broadcast it to a room full of people paying you for the privilege.
