Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training
from the copyright-free-zone dept
These days everyone seems to be talking about AI, and the Copyright Office is no exception, although it may make sense for it to speak here because people keep trying to invoke copyright as a concept implicated by various aspects of AI, including, and perhaps especially, with regard to “training” AI systems. So the Copyright Office recently launched a study to get feedback on the role copyright has, or should be changed to have, in shaping any law that bears on AI, and earlier this week the Copia Institute filed an initial comment in that study.
In our comment we made several points, but the main one was that, at least when it comes to AI training, copyright law needs to butt out. It has no role to play now, nor could it constitutionally be changed to have one. And regardless of the legitimacy of any concerns about how AI may be used, allowing copyright to be an obstructing force in order to prevent AI systems from being developed will only have damaging effects, not just deterring any benefits that the innovation might be able to provide but also undermining the expressive freedoms we depend on.
In explaining our conclusion we first observed that one overarching problem poisoning any policy discussion on AI is that “artificial intelligence” is a terrible term that obscures what we are actually talking about. Not only do we tend to conflate the ways we develop it (or “train” it), with the way we use it, which presents its own promises and potential perils, but in general we all too often regard it as some new form of powerful magic that can either miraculously solve all sorts of previously intractable problems or threaten the survival of humanity. “AI” can certainly inspire both naïve enthusiasm prone to deploying it in damaging ways, and also equally unfounded moral panics preventing it from being used beneficially. It also can prompt genuine concerns as well as genuine excitement. Any policy discussion addressing it must therefore be able to cut through the emotion and tease out exactly what aspect of AI we are talking about when we are addressing those effects. We cannot afford to take analytical shortcuts, especially if it would lead us to inject copyright into an area of policy where it does not belong and its presence would instead cause its own harm.
Because AI is not in fact magic; in reality it is simply a sophisticated software tool that helps us process information and ideas around us. And copyright law exists to make sure that there are information and ideas for the public to engage with. It does so by bestowing on the copyright owner certain exclusive rights in the hopes that this exclusivity makes it economically viable for them to create the works containing those ideas and information. But these exclusive rights necessarily all focus on the creation and performance of their works. None of the rights limit how the public can then consume those works once they exist, because, indeed, the whole point of helping ensure they could exist is so that the public can consume them. Copyright law wouldn’t make sense, and probably would not be constitutional per the Progress Clause, if the way it worked constrained that consumption and thus the public’s engagement with those ideas and information.
It also would offend the First Amendment because the right of free expression inherently includes what is often referred to as the right to read (or, more broadly, the right to receive information and ideas). Which is a big reason why book bans are so constitutionally odious, because they explicitly and deliberately attack that right. But people don’t just have the right to consume information and ideas directly through their own eyes and ears. They have the right to use tools to help them do it, including technological ones. As we explained in our comment, the ability to use tools to receive and perceive created works is often integral to facilitating that consumption – after all, how could the public listen to a record without a record player, or consume digital media without a computer? No law could prevent the use of tools without seriously impinging upon the inherent right to consume the works entirely. The United States is also a signatory to the Marrakesh Treaty, which addresses the unique need of those with visual and audio impairments to use tools such as screen readers to help them consume the works they would otherwise be entitled to perceive. Of course, it is not only those with such impairments who may have need to use such tools, and the right to format shift should allow anyone to use a screen reader to help them consume works if such tools will help them glean those ideas effectively.
What too often gets lost in the discussion of AI is that because we are not talking about some exceptional form of magic but rather just fancy software, AI training must be understood as simply being an extension of these same principles that allow the public to use tools, including software tools, to help them consume works. After all, if people can direct their screen reader to read one work, they should be able to direct their screen reader to read many works. Conversely, if they cannot use a tool to read many works, then it undermines their ability to use a tool to help them read any. Thus it is critically important that copyright law not interfere with AI training in order not to interfere with the public’s right to consume works as they currently should be able to do.
So at minimum such AI training needs to be considered a fair use, but the better practice is to recognize that there is no role for copyright to play when it comes to AI training at all. To say it is allowed as a fair use is to inflate the power of a copyright holder beyond what the statute or Constitution should allow because it suggests that using tools to consume works could ever potentially be an infringement, which only happens to be excused in this context. But copyright law is not supposed to give copyright owners such power over the consumption of their works, which we would then need to be dependent on fair use to temper. It should never apply to limit the consumption of works in any context, and we should not let concerns about AI generally, or its uses or outputs specifically, open the door to copyright law ever becoming an obstacle to that consumption.
Filed Under: 1st amendment, ai, copyright, free speech, right to read, screen readers, us copyright office
Comments on “Wherein The Copia Institute Tells The Copyright Office There’s No Place For Copyright Law In AI Training”
hear hear
Copyright infringement has only ever been claimed against two things:
1. if someone’s Final product is so like the copyrighted product that it is clearly copied from it.
2. if someone makes copies and distributes them without permission.
Training isn’t either of these. Only the final product can be infringement, and only if you can show a side-by-side comparison of sameness.
Please can someone address why it is legal to use e.g. a copy of Stephen King’s IT that they found on the internet, to train a LLM.
Re:
See above article
Re: Re:
You download a database of text. It contains the hypothetical copy of Stephen King’s IT. You use that database to train the LLM. You violated the copyright by downloading the book, just as you would have if you borrowed a copy from the library and scanned each page.
Your scraping program browses deviantart. It downloads all the pictures. You feed the pictures into an image AI. You violated copyright by downloading the pictures.
Again, can someone explain why copyright has no role in a system built on mass copyright infringement.
Re: Re: Re:
A large language model does not copy a book into its database, but analyses the book’s use of language to build and modify its database. The training is done under the same rules as you or I reading content on the Internet, or downloading copies where that is allowed.
Re: Re: Re:
Infringement is infringement. If you downloaded an infringing copy of a novel, that’s infringement. Doesn’t matter if it is for an LLM training dataset or not.
No one looking at what is displayed on DeviantArt is infringing. It’s on display.
Next?
Re: Re: Re:2 Define infringement
If infringement is your concern, then non-infringement is by definition not something that would be of your concern.
If reading a book at a library isn’t infringing, which it isn’t, then a machine “reading” the same book cannot be infringing. That the machine can read all the books in the library in less time than a normal reader can read one book, is also, not infringing.
Clearly you are operating under a miasma of ignorance, or with notions that are based on invalid emotional arguments. They don’t keep the works, they train with them. That action is not significantly different than a human reading the book. Maintaining a database of everything they read is intractable, and unnecessary.
Seriously, how could you be so ignorant as to believe that the entire freaking web would be stored in individual databases for all the different AI projects? The cost alone would be prohibitive.
Re: Re: Re:3
The “reading” of the book isn’t infringement, but is it then storing some of that book verbatim? I bet it is.
An “AI” is not a human or an intelligent being; it’s a mere program. A human is allowed to memorize what they read. If a computer program does it, that’s called “making a copy” and can well be infringing.
Re: Re: Re:4 Memorize?
You might want to consider the numbers. Last I heard of any numbers, ChatGPT used a training set of 45 terabytes and the resulting database is 800 gigabytes. Simple math indicates that the data ChatGPT uses is only about 1.8% of the size of the training data. Now, let’s look at how compressible English text is. The best program on the Large Text Compression benchmark was only able to achieve a compression result of 10.8%. That implies that if the AI was retaining copies of training inputs, it is exceeding the best known compression by at least a factor of 6. Not exactly probable.
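For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The figures are simply the ones quoted above and are not independently verified; the variable and function names are illustrative only, and 1 terabyte is assumed to be 1000 gigabytes.

```python
# Figures as quoted in the comment above (assumptions, not verified).
TRAINING_SET_GB = 45_000        # 45 terabytes, assuming 1 TB = 1000 GB
MODEL_GB = 800                  # quoted size of the resulting "database"
BEST_COMPRESSION_RATIO = 0.108  # quoted best benchmark result, 10.8%


def size_ratio(model_gb: float, training_gb: float) -> float:
    """Fraction of the training set's size that the model occupies."""
    return model_gb / training_gb


def compression_gap(model_ratio: float, best_ratio: float) -> float:
    """How many times smaller the model is than the best quoted compression."""
    return best_ratio / model_ratio


ratio = size_ratio(MODEL_GB, TRAINING_SET_GB)
gap = compression_gap(ratio, BEST_COMPRESSION_RATIO)

print(f"Model size is {ratio:.1%} of the training data")
print(f"That is about {gap:.1f}x smaller than the best quoted compression")
```

The comparison only says something about verbatim storage of the whole training set; it does not by itself rule out a model retaining particular passages.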
Re: Re: Re:5
I’ve heard people say things like they were able to get the entire first page of a book (small enough to be way less than 10% but enough to potentially be infringing) out of an AI. Maybe they were lying, or maybe the AI looked up a current version of a page to do that instead of its training data.
Re: Re: Re:6
“were able”, ie they spent time crafting a prompt to get the result they wanted. Given enough prompting, even a human can produce the first page of any random book they’ve read.
If someone asks me I can quote the first page of my favorite book verbatim, does that mean my brain infringes the author’s copyright?
There’s examples of people who can recite a book verbatim after reading it once.
Re: Re: Re:
The premise of your first point is based on the assumption that they pirated a book, which isn’t required. Digital Libraries exist.
Your second point has been hashed out multiple multiple times. Mostly because that’s how web browsers work.
Re: Re: Re:
You copyright fanatics have claimed every system is a system built on copyright infringement, which is why we have shit like game devs being allowed to use the DMCA to take down criticisms of their product – even when the review itself doesn’t contain footage of their game.
Perhaps someone should refer you to the story “The Boy Who Cried Wolf”, or is that also copyright infringement to you?
Re: Re: Re:
In the first case, it’s irrelevant. You could fully shoplift a book and it wouldn’t be infringement to read it.
For the second, there is no possible infringement since it’s the same as accessing the website.
Re: Re: Re:
“You violated copyright by downloading the pictures.”
Including the CC-licensed and CC-0 images? Ion think so.
Re: Re: Re: evidence not on record
If you can show it illegally downloaded a copyrighted work then it’s infringement, just the same as if you got caught downloading a song from a pirate website. But you haven’t shown that, and afaik it hasn’t been shown in court.
Re: Re: Re:
Because “built on mass copyright infringement” is a delusion that springs from utter ignorance, and such things have no place in determining legal response.
Re:
For the same reason that colleges have libraries for their students to use.
Re:
Same reason it’s legal for me to read it before I write something.
There’s many concerns, but every author had built on what they read before, including King (The Dark Tower’s opening is openly based on a poem, Salem’s Lot is openly based on Dracula, etc.).
There’s problems here, but even the people you’re defending will admit they used others’ work, at least subliminally.
What about "training" students?
How is teaching (aka training) machines with an author’s works ANY DIFFERENT than training (aka teaching) students with her works?
It isn’t…and shouldn’t be treated differently.
Copyright isn’t a factor when a student reads a book. It shouldn’t be a factor when a machine “reads” a book.
Re:
It’s not a factor when the machine or the student “reads” a book. It’s absolutely a factor if either one, after reading, “takes notes” that include, say, a chapter of the book verbatim.
So exactly how much copying is being done by the AI?
Re: Re:
None. The simplest (and very bad) analogy for how an LLM actually stores the things it has been trained on is that it stores the relationships between words and the likelihood that they appear together in a given context.
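To make that analogy concrete, here is a toy sketch in Python of a word-pair ("bigram") counter. This is only an illustration of the analogy, not how any real LLM works: real models hold learned numeric parameters rather than a table of counts, and the function names and sample sentence below are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev_word, next_word in zip(words, words[1:]):
        counts[prev_word][next_word] += 1
    return counts


def next_word_probabilities(counts: dict, word: str) -> dict:
    """Turn the raw follow-on counts for one word into probabilities."""
    following = counts.get(word, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in following.items()}


# Hypothetical sample text, chosen only for the illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_counts(corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

What the table keeps is which words tend to follow which, not the sentence itself, although on a corpus this tiny the original text is obviously easy to reconstruct from the counts.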
As someone who’s blind, Copyright has negatively impacted me a lot. DRM in books is a particular problem. Oh, companies like Amazon have (tried) to make their books accessible but it doesn’t always go well. IMO section 1201 should either be repealed, or be modified so that it is not a violation of that section (or of any other section of title 17) if the tampering, infringement, or deactivation of mechanisms is for accessibility purposes.
Maybe Fair Use, However AI Output Is Publishing
AI can take in everything just like human consumption; however, any output is inherently publishing. So if copyrighted material comes out of its virtual mouth, that is a violation, and the owner/operator must pay or “kill” it, as prison is useless to the undead.
Re:
Private use of AI is no more publishing than anything anybody you know shows or tells you.
Re:
AI can’t have copyright. Missed that bit.
Re:
I’ll grant it.
I’ll grant it.
So, okay, sure. That would be plagiarism, as if a human had done so.
But…
And note that my list above is not comprehensive.
“Copyrighted work goes in, something comes out” is not a valid basis to declare that copyright infringement has occurred.
Everything else in your comment is incoherent. Please try again.
Re:
jesse what the fuck are you talking about dot jaypeg
Re: Re:
Copyright fanatics are desperate to look for some way to run an end around copyright law penalties and shaping the narrative to make themselves not look like complete assholes to the public.
Since copyright infringement and theft have very different definitions legally, they’ve been trying to claim that the act of “making available” counts as publishing, or some other word that constitutes distributing paid content in a way that they disagree with. i.e., it’s not the downloading that’s the crime, but the action that made it possible for someone else to access content they didn’t directly compensate the copyright holder for. In the context of the RIAA, talking about distribution instead of downloading also takes some of the heat off of them for suing end users like they did with children and grandmothers, because they could then argue that they weren’t going after people downloading a CD’s worth of songs; they were going after people costing them thousands of millions of dollars in revenue by providing a free option.
(Side note, this is also why John Smith loves claiming “contributory infringement” or “distributor liability”, because anything that makes it easier for copyright holders to sue is entirely his end goal.)
Of course, that strategy eventually fell apart. Comparing someone who has an unsecured WiFi connection with a cartel that actually manufactures bootleg disks is not remotely in the same ballpark, and when judges started asking for proof that infringement actually happened, copyright holders started running like hell because they had absolutely nothing aside from vague moralistic claims. It’s simply not possible for anyone to tell whether a full file got downloaded or how much each IP address contributed to a torrent swarm.
But let’s say we actually went with the “every output is publishing” argument – how is AI generated output different from a mixtape? What role did the copyright holder actually have in creating the new remixed image? (Sure, to be fair, copyright holders have long since loathed the right to remix, but I don’t see them winning that fight any time soon.)
Re: Re: Re:
“…the act of “making available” counts as publishing…”
So every book sold by Amazon is published by them? 🐱
Re: Re: Re:2
If it meant that copyright holders could demand more money by making Amazon responsible for some lost revenue, or allowing them to force a judge into making a ruling? Sure, copyright fans would be more than happy to make that argument.
It’s why John Smith has been pushing hard for the “distributor liability” angle for defamation, too. According to him, just mentioning that someone was brought to court just for being accused of a heinous crime constituted reputational damage for which jail time and severe fines should be a thing, whether it’s a person recounting facts or a news site hosting an article.
Re: Re: Re:
I see, thanks for explaining this. Intellectual monopolists are so dishonest.
While I totally agree with this approach to copyright, we have to remember that the DMCA exists, and part of its intent is to control what people (and machines) can do to consume copyrighted content.
So the Copyright Office is not above another similar carve-out for anything involving machine learning.
I think there are three areas that need to be addressed, and the article only deals with the second of the three:
1) How the books/media are acquired to be used for training. Were they acquired “legally”? Given the quantity involved, I highly doubt it – likely neither purchased nor borrowed (from a library). More likely just hoovered from shady repositories. But that’s not really a matter for copyright law, since copyright is about distribution/publication, not acquisition. There was probably a violation of law involved, but it was theft, not copyright. (The only copyright violation here would be on the part of whoever aggregated the texts that the training software used.)
2) The aspect described by the article: the method of consumption. Here, my feeling is that the article is correct. There is no violation by the act of training.
3) The output based on that training. Here the question becomes whether the new art is transformative or not. Historically, there has been a surprisingly low bar for what’s considered transformative, and AI generated text/images clear that bar without breaking a sweat.
Re:
Theft? Get real.
Re:
Anyone who says “theft” has anything to do with this has outed themselves as not remotely informed enough to talk about this cogently.
Re: Re:
Anyone mentioning “theft” in the context of “copyright infringement” has long since indicated that they’re not interested in honest discussions on the subject. They mean to use emotional, manipulative, table-banging arguments to get their point across.
It’s a tool
Keep in mind that AI training is just a tool humans use to statistically analyze works freely available on the internet. If there was any infringing, it’s not the AI doing it – it’s the human using the AI. Once past that, you need to decide if the human is in fact violating any copyright laws in his use of the internet.
No copyright for AI
Absolutely nothing published by AI should have any sort of copyright.
Re: AI isn’t the publisher
Of course AI can’t publish anything. It’s the owner of the AI that would be publishing content created by its AI tool.
Protection for artists
I do think that copyright has a role in regulating the output of an AI.
If I demand, say, the complete Animorphs series as a stage play from an AI, it shouldn’t be capable of complying. Those characters are copyrighted and that setting is copyrighted.
AI would only be capable of producing generics.
Re: already covered
It’s already infringement for a human to copy the Animorphs series, so a human using AI to copy it would also be infringement.
Re:
You appear to be making a false assumption, namely that the AI has an internal copy of the series, and of other works, in its database. It does not. Indeed, to achieve what you are suggesting, the user would have to have a lot of detailed knowledge about the series.
Also, “think of the poor artists” is a dumb card to play, as most artists are poor and create their art without much hope of payment. That is, the artists who make nothing but continue to create far outnumber those who are capable of making a living from their art. Indeed, copyright mainly benefits the publishers and not the artists, and the publishers have a good reason to limit the competition to the works they purchase, in that it is far easier to make a profit when there are few works being made available. Long copyrights, stopping derivative works and fair use, and keeping works off the market are all aimed at limiting people’s choices and the overheads of keeping works on the market so as to maximize profits.
Re:
OK. But then there’s another pass and all the references to copyrighted characters are removed. Are they still infringing? If so, how does that affect human-written works like Fifty Shades of Grey, which started as Twilight fan fiction? Where are the lines drawn?
You’re wrong. Copyright law already does this. For example, it prohibits unauthorized translations as derivative works. It prohibits the unauthorized publication of unpublished works. It allows Disney to put a movie in the “vault” for a couple of decades.
AI isn’t a screen reader. And your screen reader presumably reads what’s on your screen. It doesn’t store a bunch of copyrighted articles just in case you want to read them later (and I don’t think it would be legal for it to do so.)
The Internet can be a bit weird when it comes to copyrighted stuff. Your browser gets a bunch of HTML and tries to assemble it into something readable. Different browsers or different hardware will make the same website look different, and there are also various plugins available. I think there’s an implied license when a site sends you HTML that you’re allowed to use various software to process and view it, including things like screen readers (not to mention potential ADA problems if things like screen readers aren’t allowed.) I don’t think that extends to using it to train AI. That’s not the same category of software.
And this is especially a different category when this AI is commercial software and the people giving it the training data are not the same people who will be using it to get information. You can tell your screen reader to read your screen to you; you can’t tell your screen reader to broadcast it to a room full of people paying you for the privilege.
Re:
Neither does an AI.