
When Copyright Enters the AI Conversation

from the reading-by-robots dept

This series of posts explores how we can rethink the intersection of AI, creativity, and policy. From examining outdated regulatory metaphors to questioning copyright norms and highlighting the risks of stifling innovation, each post addresses a different piece of the AI puzzle. Together, they advocate for a more balanced, forward-thinking approach that acknowledges the potential of technological evolution while safeguarding the rights of creators and ensuring AI’s development serves the broader interests of society. You can read the first, second, third, fourth, and fifth posts in the series.

Whenever content is involved, copyright enters the conversation. And when we talk about AI, we’re talking about systems that absorb petabytes of content to meet their training needs. So naturally, copyright issues are at the forefront of the debate.

Interestingly, copyright usually only becomes an issue when there’s the perception that someone or something is successful—and that copyright holders are missing out on potential control or revenues. For decades, “reading by robots” has been a part of our digital lives. Just think of search engines crawling billions of pages to index them. These robots read far more content than any human ever could. But it wasn’t until AI began learning from this content—and, more crucially, producing content that appeared successful—that the rules inspired by the Statute of Anne of 1710 came into play.

The Input Side: Potential Innovation and the Garbage In, Garbage Out Principle

On the input side, generative AI relies heavily on the data it consumes, but under EU law, its access is carefully regulated. The 2019 EU Directive on Copyright in the Digital Single Market (DCDSM) sets the framework for text and data mining (TDM). Article 3 of the Directive permits TDM for scientific research only, while Article 4 allows it more broadly—provided the rightsholder hasn’t expressly reserved their rights.
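
Article 4 lets rightsholders reserve their rights in machine-readable form; in practice, robots.txt rules aimed at known AI crawlers are one widely used signal, though whether that fully satisfies the Directive is still debated. A minimal sketch of checking for such a reservation before mining a site (the site URL and crawler names are illustrative assumptions, not anything the Directive specifies):

```python
# Minimal sketch: check a site's robots.txt for an AI-crawler opt-out before
# text-and-data mining it. Site URL and user-agent names are illustrative.
from urllib import robotparser

SITE = "https://example.com"            # hypothetical site
AI_USER_AGENTS = ["GPTBot", "CCBot"]    # illustrative AI crawler names

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for agent in AI_USER_AGENTS:
    allowed = rp.can_fetch(agent, f"{SITE}/articles/")
    status = "mining permitted" if allowed else "rights reserved (opt-out)"
    print(f"{agent}: {status}")
```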

With the AI Act adopted in 2024 referring to these provisions, we’re left with a raft of questions about the future of AI models. One of the key concerns is the potential for a data winter—a scenario where AI models face limited access to the data they need to evolve and improve.

This brings us to a fundamental concept in AI—Garbage In, Garbage Out. AI models are only as good as the data they are trained on. If access to high-quality, diverse datasets is restricted by rigid copyright rules, AI systems will end up training on lower-quality data. Poor-quality data leads to unreliable, biased, or outright inaccurate AI outputs. Just as a chef can only make a great dish with fresh ingredients, AI needs high-quality input to deliver reliable, innovative, and useful results. Restricting access due to copyright concerns risks leading AI into a “data winter” where innovation freezes, limited by the garbage fed into the system.
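
As a toy illustration of the principle (a sketch using synthetic data, assuming NumPy and scikit-learn are available; it makes no claim about any particular model), the same simple classifier trained once on clean labels and once on partly corrupted labels scores noticeably worse in the second case:

```python
# Toy "Garbage In, Garbage Out" demo: identical model, but 40% of the training
# labels are flipped at random in the second run. All data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clean = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.4               # corrupt 40% of the labels
noisy = LogisticRegression(max_iter=1000).fit(X_tr, np.where(flip, 1 - y_tr, y_tr))

print(f"accuracy, clean labels:     {clean.score(X_te, y_te):.2f}")
print(f"accuracy, corrupted labels: {noisy.score(X_te, y_te):.2f}")
```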

A data winter not only stifles technological advancement but also risks widening the gap between regions that enforce stricter copyright policies and those that embrace more flexible rules. Ultimately, Europe’s global competitiveness in AI hinges on whether it can provide an environment where AI can access the data it needs without unnecessary restrictions.

But access to diverse data is also important from a cultural perspective: if AI is trained predominantly on Anglo-Saxon or non-European content, it naturally reflects those cultures in its outputs. This could mean that European creativity becomes increasingly marginalised, with AI-generated content lacking in cultural relevance and failing to reflect the diversity of Europe. AI should be a tool that amplifies the diversity of human expression, not one that homogenises it.

Challenges on the Output Side: Copyright Protection for AI-Generated Content

Now let’s look at the output side of generative AI. The assumption that creative works, like movies, video games, or books, are automatically protected by copyright may not apply to AI-generated content. The traditional protection of creative expression hinges on human authorship, and while creative elements like prompt choices could be considered for copyright, the level of protection will likely be much lower than expected. This could mean that parts of a work—such as AI-generated backgrounds in video games or movies—could be freely copied by others.

This uncertainty could lead to increased pressure from creative industries to modify copyright law, pushing for more familiar levels of protection that might extend copyright to currently unprotected AI-generated content. If such changes happen, we could end up in a spiral where access to knowledge becomes more restricted, stifling creativity and innovation. We’ve seen similar debates before—most notably during the advent of photography, when early courts struggled to determine whether machine-created works could be protected.

The path forward requires a careful balancing act: we need copyright laws that protect human creativity and labour without hampering access to the data that AI—and society—need to innovate and grow. By avoiding a data winter and ensuring AI systems have access to diverse, quality inputs, we can harness AI’s potential to drive the creative industries forward, rather than allow outdated copyright rules to drag progress backward.

Caroline De Cock is a communications and policy expert, author, and entrepreneur. She serves as Managing Director of N-square Consulting and Square-up Agency, and Head of Research at Information Labs. Caroline specializes in digital rights, policy advocacy, and strategic innovation, driven by her commitment to fostering global connectivity and positive change.


Comments on “When Copyright Enters the AI Conversation”

26 Comments
Anonymous Coward says:

With the AI Act adopted in 2024 referring to these provisions, we’re left with a raft of questions about the future of AI models. One of the key concerns is the potential for a data winter—a scenario where AI models face limited access to the data they need to evolve and improve.

Well, I’m not allowed to view that link, but I suspect a bot could get around it. Supposedly these “A.I.” services are already defeating CAPTCHAs; somebody got one to explain its steps, and got a result like “Now I’m clicking the ‘I’m human’ box to prove I’m not a robot.”

It’s not all that surprising, really, given that the operators of some of these CAPTCHA services have openly stated that the results are used to train bots (for optical character recognition, to identify school buses and fire hydrants, and so on). So I’m skeptical about a “data winter” being caused by bot-blocking—or by copyright rules, which the systems are already ignoring.

I suspect the actual limit is going to be finding data that’s not auto-generated slop. Already, these bots must be accidentally training, in large part, on the output of other bots. It’s not all that different from humans, though: humans are also learning from the badly-written text of other humans (and bots), and repeating their mistakes. Even books by major authors are often not well copy-edited (or perhaps I should say that happens especially once they become popular); see, for example, the ubiquitous comma-splicing in the later Harry Potter books.

I see two reasonably reliable ways to get non-computer-generated data. Companies like Facebook can probably guess which of their users are human, and can use that data under their terms of service; except, as just noted, most of those humans can’t write worth a damn. The other way is to use old data, especially out-of-copyright data. That has two problems: one is simply that it’s old, such that your “A.I.” might end up talking like a grizzled 1890s prospector (never mind the biases noted by Caroline already). The other is that it’s subject to errors, such as JBIG2 character substitutions by scanners and just bad optical character recognition.

Anonymous Coward says:

What an absolutely pathetic opinion from an industry shill. Even sadder if you aren’t even scamming the scammers out of their plagiarism money to defend their ruination of our planet and ability to think and create as humans.

1) What innovation? What wondrous new contributions to human knowledge and society has an LLM ever actually made? They hallucinate presidents and tell us to eat rocks.

2) LLMs _do_not_learn_. If they did, they could count the number of Rs in strawberry by now. Something that can learn doesn’t need hundreds of billions of dollars, countless billions of gallons of water, and grid-crushing amounts of electricity to figure out something every human on earth learned in elementary school.

3) A human manipulating a tool to produce a creative work based on their individual, human perceptions and choices bears no resemblance to a mouth breathing prompt jockey acting like a Problem Client continually yelling at a machine to copy more ideas from others more better more faster because they think learning a trade or skill is too boring or too hard. Why should we cater our entire Human Endeavor to the laziest and dumbest among us?

4) The path forward involves abandoning these resource monopolizing, planet destroying, brain atrophying monstrosities and never again taking seriously the opinions of anyone who ever defended them.

Anonymous Coward says:

The real issue here is “Intellectual Property”, the infinite supply of human thoughts and ideas.

Do humans inherently ‘own’ their thoughts as an economic ‘Property Right’ that should be enforced by government ??

Until one can rationally answer this question, productive intellectual analysis of Copyright, Patents, Trademarks, etc is impossible.

Stephen T. Stone (profile) says:

Re:

Do humans inherently ‘own’ their thoughts as an economic ‘Property Right’ that should be enforced by government ??

No. Ideas can’t be copyrighted⁠—only fixed expressions of ideas. For example: “Man travels through a post-apocalyptic world” is a broad idea, which means the expression thereof can range from the Mad Max franchise to the MST3K-worthy Warrior of the Lost World.

Anonymous Coward says:

Re: Re:

No. Ideas can’t be copyrighted⁠

According to the laws as written, sure. In practice, courts do uphold copyright on ideas. For example, someone using any attribute of Sherlock Holmes or Mickey Mouse from after the public-domain stuff can expect to get sued. The “unofficial sequel” to Catcher in the Rye was banned by a U.S. court on copyright grounds, despite copying none of the fixed expressions.

Stephen T. Stone (profile) says:

Re: Re: Re:

In practice, courts do uphold copyright on ideas. For example, someone using any attribute of Sherlock Holmes or Mickey Mouse from after the public-domain stuff can expect to get sued.

The argument in such cases is that the elements of a character not yet present in public domain works are part of the fixed expression of that character and are therefore off-limits so long as those elements remain in works covered by copyright. I don’t think it’s the best argument, but it’s the one that’s held up by law.

The “unofficial sequel” to Catcher in the Rye was banned by a U.S. court on copyright grounds, despite copying none of the fixed expressions.

The creation of derivative works based on existing copyrighted material generally requires the permission of the copyright holder. Exceptions do exist⁠—parody, for example⁠—and sometimes derivative works are ignored because they help generate interest in the original work (e.g., fan art and fanfiction). But if you’re trying to publish an “unofficial sequel” to a book still covered by copyright, you’d do well to file off all the serial numbers before you pull the trigger. I mean, Fifty Shades of Grey started off as Twilight fanfiction before E. L. James rewrote it as an original piece.

Anonymous Coward says:

Re: Re: Re:2

The argument in such cases is that the elements of a character not yet present in public domain works are part of the fixed expression of that character

That’s a bit like when software patents are not allowed, so people patent a general-purpose computer running software that implements their idea. Which in practice is the exact same thing, but lawyers bullshit the courts into believing there’s a distinction.

The creation of derivative works based on existing copyrighted material generally requires the permission of the copyright holder.

U.S. law says:

A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.

A “sequel” is usually based only on the pure ideas of a copyrighted work, using none or almost none of the content. It’s therefore not a “derived work”, or at least isn’t supposed to be. But, again, lawyers are good bullshitters…

Stephen T. Stone (profile) says:

Re: Re: Re:3

lawyers bullshit the courts into believing there’s a distinction

Like I said: I don’t think it’s the best argument, but it’s the one that’s held up by law.

A “sequel” is usually based only on the pure ideas of a copyrighted work, using none or almost none of the content. It’s therefore not a “derived work”, or at least isn’t supposed to be.

If a given work is a sequel to an existing work, chances are it’s using at least one of the characters from the existing work. Mad Max: Fury Road has no other characters from the other Mad Max films besides the titular Road Warrior himself, but it’s still a sequel to the other films. A sequel to Catcher in the Rye would necessarily have to involve the use of at least one character from that book, and chances are the sequel would reference events from that book. That would absolutely make it a direct derivative work because it is referencing a direct expression of ideas. Remember, copyright is about the expression of ideas, not the ideas themselves⁠—and if you’re going to copy or build upon someone else’s expression of ideas, you should either have permission or have an argument for Fair Use (e.g., parody) ready to deploy in case you need it in a court of law.

CCBEB says:

Re: Re:

no, owned PROPERTY can only be something physical.

but politicians instead decided that some types of intangible human ideas were especially valuable to society … and could readily be encouraged by an artificial legal ‘grant’ of private-property status … in defiance of basic economics.

It’s been a huge mess ever since, loaded with inefficiency, corruption & severe negative consequences to society.

eMike (profile) says:

Interestingly, copyright usually only becomes an issue when there’s the perception that someone or something is successful—and that copyright holders are missing out on potential control or revenues.

Does this mean that when a corporation infringes enough copyright to bankrupt the entire country with a DMCA style lawsuit, that it’s ok because they successfully torrented all that data?

eMike (profile) says:

Re: Re:

Fair. Let me try again.

Saying that it only matters when copyright holders think they’re owed money means that the AI companies think that they should be allowed to have free access to all data in the world. They’re wasting the bandwidth of everybody else to get access to all these petabytes of data that they feel like they shouldn’t have to pay for, but somebody else should give them even though it has real costs for the providers.

Meta torrented tons of data and will likely see no repercussions, while people have been bankrupted by the RIAA for much, much less.

Lines like this show that AI sympathizers seem to think that because it’s AI they should be allowed to break the laws the rest of us have to live by. “You only want money. We only downloaded a little data but we’re doing something good with it, so it’s ok. We promise.”

drew (profile) says:

Anything in, garbage out.

The old Garbage In, Garbage Out maxim is not a useful expression here because it implies a converse (Truth In, Truth Out) that is not correct. Even fed 100% accurate content, LLMs are capable of hallucinations because they are just probability models and have no concept of either truth or garbage.
It’s also worth noting that the UK is ploughing its own furrow on this and is allowing copyright on machine-created work. Which is spectacularly dumb but it will be interesting to watch as a case study.

Arianity (profile) says:

(Your regular reminder that the blurb at the bottom is a euphemism for lobbying, evaluate accordingly)

Interestingly, copyright usually only becomes an issue when there’s the perception that someone or something is successful—and that copyright holders are missing out on potential control or revenues

I’m not sure that’s true at all? There’s plenty of artists whose only art is posted on something like deviantart, who get annoyed at their art being stolen. Some are just thrilled to be noticed, but plenty will stand on principle. Especially if it’s uncredited. Financial success definitely gives more of an incentive and means (and perhaps most importantly, garners attention. No one notices when a small artist makes a fuss), sure, but creatives often have pretty strong feelings on how their work is used, one way or another.

(Similarly, people did try to use copyright to kill crawling- they just failed. Not for lack of trying.)

we need copyright laws that protect human creativity and labour without hampering access to the data that AI

Can’t say this series has been terribly reassuring on this front. So far the main argument seems to be not to worry, because the human element is irreplaceable. Honestly not even clear what’s left to protect via the law, if that hope were true.

n00bdragon (profile) says:

Whenever content is involved, copyright enters the conversation. And when we talk about AI, we’re talking about systems that absorb petabytes of content to meet their training needs. So naturally, copyright issues are at the forefront of the debate.

This is a non-sequitur. The amount of data consumed has no relevance to copyright law. Libraries do not become infringement centers if you read “too many” books out of them. Everything that follows this is absurd. AI has no interaction with Copyright law. It doesn’t make creative choices (it follows a repeatable algorithm) and it isn’t a person. It has all the copyright status of a road striper.
