When Copyright Enters the AI Conversation
from the reading-by-robots dept
This series of posts explores how we can rethink the intersection of AI, creativity, and policy. From examining outdated regulatory metaphors to questioning copyright norms and highlighting the risks of stifling innovation, each post addresses a different piece of the AI puzzle. Together, they advocate for a more balanced, forward-thinking approach that acknowledges the potential of technological evolution while safeguarding the rights of creators and ensuring AI’s development serves the broader interests of society. You can read the first, second, third, fourth, and fifth posts in the series.
Whenever content is involved, copyright enters the conversation. And when we talk about AI, we’re talking about systems that absorb petabytes of content to meet their training needs. So naturally, copyright issues are at the forefront of the debate.
Interestingly, copyright usually only becomes an issue when there’s the perception that someone or something is successful—and that copyright holders are missing out on potential control or revenues. For decades, “reading by robots” has been a part of our digital lives. Just think of search engines crawling billions of pages to index them. These robots read far more content than any human ever could. But it wasn’t until AI began learning from this content—and, more crucially, producing content that appeared successful—that the rules inspired by the Statute of Anne of 1710 came into play.
The Input Side: Potential Innovation and the Garbage In, Garbage Out Principle
On the input side, generative AI relies heavily on the data it consumes, but under EU law, its access is carefully regulated. The 2019 EU Directive on Copyright in the Digital Single Market (DCDSM) sets the framework for text and data mining (TDM). Article 3 of the Directive permits TDM for scientific research only, while Article 4 allows it more broadly—provided the rightsholder hasn’t expressly reserved their rights.
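To make the Article 4 reservation more concrete: in practice, many publishers express their opt-out through machine-readable signals such as robots.txt rules aimed at training crawlers, and a crawler that wants to respect the reservation has to check those signals before mining a page. The snippet below, built on Python’s standard urllib.robotparser, is a minimal illustrative sketch: it assumes that robots.txt counts as an “appropriate machine-readable means” (a point that is still contested), and “GPTBot” and example.com are used only as stand-in names.

# Illustrative check of a robots.txt-based TDM reservation. Assumption: the site
# expresses its Article 4 opt-out via robots.txt; "GPTBot" is a stand-in user agent.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()  # fetch and parse the robots.txt file

page = "https://example.com/articles/some-page.html"
if rp.can_fetch("GPTBot", page):
    print("No reservation found for this crawler; mining could proceed under Article 4.")
else:
    print("Rights reservation detected; exclude this page from training data.")

Whether a signal like this is legally sufficient to trigger Article 4, and whether training crawlers actually honour it, remains an open question.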
With the AI Act adopted in 2024 referring to these provisions, we’re left with a raft of questions about the future of AI models. One of the key concerns is the potential for a data winter—a scenario where AI models face limited access to the data they need to evolve and improve.
This brings us to a fundamental concept in AI—Garbage In, Garbage Out. AI models are only as good as the data they are trained on. If access to high-quality, diverse datasets is restricted by rigid copyright rules, AI systems will end up training on lower-quality data. Poor-quality data leads to unreliable, biased, or outright inaccurate AI outputs. Just as a chef can only make a great dish with fresh ingredients, AI needs high-quality input to deliver reliable, innovative, and useful results. Restricting access due to copyright concerns risks leading AI into a “data winter” where innovation freezes, limited by the garbage fed into the system.
A data winter not only stifles technological advancement but also risks widening the gap between regions that enforce stricter copyright policies and those that embrace more flexible rules. Ultimately, Europe’s global competitiveness in AI hinges on whether it can provide an environment where AI can access the data it needs without unnecessary restrictions.
But access to diverse data is also important from a cultural perspective: if AI is trained predominantly on Anglo-Saxon or non-European content, it naturally reflects those cultures in its outputs. This could mean that European creativity becomes increasingly marginalised, with AI-generated content lacking in cultural relevance and failing to reflect the diversity of Europe. AI should be a tool that amplifies the diversity of human expression, not one that homogenises it.
Challenges on the Output Side: Copyright Protection for AI-Generated Content
Now let’s look at the output side of generative AI. The assumption that creative works, like movies, video games, or books, are automatically protected by copyright may not apply to AI-generated content. The traditional protection of creative expression hinges on human authorship, and while creative elements like prompt choices could be considered for copyright, the level of protection will likely be much lower than expected. This could mean that parts of a work—such as AI-generated backgrounds in video games or movies—could be freely copied by others.
This uncertainty could lead to increased pressure from creative industries to modify copyright law, pushing for more familiar levels of protection that might extend copyright to currently unprotected AI-generated content. If such changes happen, we could end up in a spiral where access to knowledge becomes more restricted, stifling creativity and innovation. We’ve seen similar debates before—most notably during the advent of photography, when early courts struggled to determine whether machine-created works could be protected.
The path forward requires a careful balancing act: we need copyright laws that protect human creativity and labour without hampering access to the data that AI—and society—need to innovate and grow. By avoiding a data winter and ensuring AI systems have access to diverse, quality inputs, we can harness AI’s potential to drive the creative industries forward, rather than allow outdated copyright rules to drag progress backward.
Caroline De Cock is a communications and policy expert, author, and entrepreneur. She serves as Managing Director of N-square Consulting and Square-up Agency, and Head of Research at Information Labs. Caroline specializes in digital rights, policy advocacy, and strategic innovation, driven by her commitment to fostering global connectivity and positive change.
Filed Under: access to data, ai, copyright, creativity, creativity and ai, incentives, right to read
Comments on “When Copyright Enters the AI Conversation”
Well, I’m not allowed to view that link, but I suspect a bot could get around it. Supposedly these “A.I.” services are already defeating CAPTCHAs; somebody got one to explain its steps, and got a result like “Now I’m clicking the ‘I’m human’ box to prove I’m not a robot.”
It’s not all that surprising, really, given that the operators of some of these CAPTCHA services have openly stated that the results are used to train bots (for optical character recognition, to identify school buses and fire hydrants, and so on). So I’m skeptical about a “data winter” being caused by bot-blocking—or by copyright rules, which the systems are already ignoring.
I suspect the actual limit is going to be finding data that’s not auto-generated slop. Already, these bots must be accidentally training, in large part, on the output of other bots. It’s not all that different from humans, though: humans are also learning from the badly-written text of other humans (and bots), and repeating their mistakes. Even books by major authors are often not well copy-edited (or perhaps I should say that happens especially once they become popular); see, for example, the ubiquitous comma-splicing in the later Harry Potter books.
I see two reasonably reliable ways to get non-computer-generated data. Companies like Facebook can probably guess which of their users are human, and can use that data under their terms of service; except, as just noted, most of those humans can’t write worth a damn. The other way is to use old data, especially out-of-copyright data. That has two problems: one is simply that it’s old, such that your “A.I.” might end up talking like a grizzled 1890s prospector (nevermind the biases noted by Caroline already). The other is that it’s subject to errors, such as JBIG2 character substitutions by scanners and just bad optical character recognition.
What an absolutely pathetic opinion from an industry shill. Even sadder if you aren’t even scamming the scammers out of their plagiarism money to defend their ruination of our planet and ability to think and create as humans.
1) What innovation? What wondrous new contributions to human knowledge and society has an LLM ever actually made? They hallucinate presidents and tell us to eat rocks.
2) LLMs _do_not_learn_. If they did, they could count the number of Rs in strawberry by now. Something that can learn doesn’t need hundreds of billions of dollars, countless billions of gallons of water, and grid-crushing amounts of electricity to figure out something every human on earth learned in elementary school.
3) A human manipulating a tool to produce a creative work based on their individual, human perceptions and choices bears no resemblance to a mouth breathing prompt jockey acting like a Problem Client continually yelling at a machine to copy more ideas from others more better more faster because they think learning a trade or skill is too boring or too hard. Why should we cater our entire Human Endeavor to the laziest and dumbest among us?
4) The path forward involves abandoning these resource monopolizing, planet destroying, brain atrophying monstrosities and never again taking seriously the opinions of anyone who ever defended them.
The real issue here is “Intellectual Property”, the infinite supply of human thoughts and ideas.
Do humans inherently ‘own’ their thoughts as an economic ‘Property Right’ that should be enforced by government ??
Until one can rationally answer this question, productive intellectual analysis of Copyright, Patents, Trademarks, etc is impossible.
Re:
No. Ideas can’t be copyrighted—only fixed expressions of ideas. For example: “Man travels through a post-apocalyptic world” is a broad idea, which means the expression thereof can range from the Mad Max franchise to the MST3K-worthy Warrior of the Lost World.
Re: Re:
According to the laws as written, sure. In practice, courts do uphold copyright on ideas. For example, someone using any attribute of Sherlock Holmes or Mickey Mouse from after the public-domain stuff can expect to get sued. The “unofficial sequel” to Catcher in the Rye was banned by a U.S. court on copyright grounds, despite copying none of the fixed expressions.
Re: Re: Re:
The argument in such cases is that the elements of a character not yet present in public domain works are part of the fixed expression of that character and are therefore off-limits so long as those elements remain in works covered by copyright. I don’t think it’s the best argument, but it’s the one that’s held up by law.
The creation of derivative works based on existing copyrighted material generally requires the permission of the copyright holder. Exceptions do exist—parody, for example—and sometimes derivative works are ignored because they help generate interest in the original work (e.g., fan art and fanfiction). But if you’re trying to publish an “unofficial sequel” to a book still covered by copyright, you’d do well to file off all the serial numbers before you pull the trigger. I mean, Fifty Shades of Grey started off as Twilight fanfiction before E. L. James rewrote it as an original piece.
Re: Re: Re:2
That’s a bit like when software patents are not allowed, so people patent a general-purpose computer running software that implements their idea. Which in practice is the exact same thing, but lawyers bullshit the courts into believing there’s a distinction.
U.S. law says: “A ‘derivative work’ is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted” (17 U.S.C. § 101).
A “sequel” is usually based only on the pure ideas of a copyrighted work, using none or almost none of the content. It’s therefore not a “derived work”, or at least isn’t supposed to be. But, again, lawyers are good bullshitters…
Re: Re: Re:3
Like I said: I don’t think it’s the best argument, but it’s the one that’s held up by law.
If a given work is a sequel to an existing work, chances are it’s using at least one of the characters from the existing work. Mad Max: Fury Road has no other characters from the other Mad Max films besides the titular Road Warrior himself, but it’s still a sequel to the other films. A sequel to Catcher in the Rye would necessarily have to involve the use of at least one character from that book, and chances are the sequel would reference events from that book. That would absolutely make it a direct derivative work because it is referencing a direct expression of ideas. Remember, copyright is about the expression of ideas, not the ideas themselves—and if you’re going to copy or build upon someone else’s expression of ideas, you should either have permission or have an argument for Fair Use (e.g., parody) ready to deploy in case you need it in a court of law.
Re: Re:
no, owned PROPERTY can only be something physical.
but politicians instead decided that some types of intangible human ideas were especially valuable to society … and could readily be encouraged by an artificial legal ‘grant’ of private-property status … in defiance of basic economics.
It’s been a huge mess ever since, loaded with inefficiency, corruption & severe negative consequences to society.
Does this mean that when a corporation infringes enough copyright to bankrupt the entire country with a DMCA style lawsuit, that it’s ok because they successfully torrented all that data?
Re:
While I have no love for “AI”, and have issues with seemingly thoughtful people who cannot see that “AI” is not going to get better and that nothing “AI” does can ever justify its costs, I can make no sense out of what you wrote.
Re: Re:
Fair. Let me try again.
Saying that it only matters when copyright holders think they’re owed money means that the AI companies think that they should be allowed to have free access to all data in the world. They’re wasting the bandwidth of everybody else to get access to all these petabytes of data that they feel like they shouldn’t have to pay for, but somebody else should give them even though it has real costs for the providers.
Meta torrented tons of data and will likely see no repercussions, while people have been bankrupted by the RIAA for much, much less.
Lines like this show that AI sympathizers seem to think that because it’s AI they should be allowed to break the laws the rest of us have to live by. “You only want money. We only downloaded a little data but we’re doing something good with it, so it’s ok. We promise.”
Re:
No, that’s not what it means.
It means that corporations only become concerned with copyright issues when there’s money to be made, or rather, when there’s money to be missed out on making.
Re:
The weasel term “infringement” was invented to conceal the fact that using someone else’s intangible ideas is NOT property THEFT under any previous concept of Common Law or statute law.
Hmmm, stupid meets vapor…
Anything in, garbage out.
The old Garbage In, Garbage Out maxim is not a useful expression here because it implies a converse (Truth In, Truth Out) that is not correct. Even fed 100% accurate content, LLMs are capable of hallucinations because they are just probability models and have no concept of either truth or garbage.
It’s also worth noting that the UK is ploughing its own furrow on this and is allowing copyright on machine-created work. Which is spectacularly dumb but it will be interesting to watch as a case study.
Re:
How’s about “DIGO”: Data in, garbage out?
Re: Re:
How about DInGO?
“DInGO ate my work.”
(Your regular reminder that the blurb at the bottom is a euphemism for lobbying, evaluate accordingly)
I’m not sure that’s true at all? There’s plenty of artists whose only art is posted on something like deviantart, who get annoyed at their art being stolen. Some are just thrilled to be noticed, but plenty will stand on principle. Especially if it’s uncredited. Financial success definitely gives more of an incentive and means (and perhaps most importantly, garners attention. No one notices when a small artist makes a fuss), sure, but creatives often have pretty strong feelings on how their work is used, one way or another.
(Similarly, people did try to use copyright to kill crawling- they just failed. Not for lack of trying.)
Can’t say this series has been terribly reassuring on this front. So far the main argument seems to be not to worry, because the human element is irreplaceable. Honestly not even clear what’s left to protect via the law, if that hope were true.
This is a non-sequitur. The amount of data consumed has no relevance to copyright law. Libraries do not become infringement centers if you read “too many” books out of them. Everything that follows this is absurd. AI has no interaction with Copyright law. It doesn’t make creative choices (it follows a repeatable algorithm) and it isn’t a person. It has all the copyright status of a road striper.
Re:
Don’t people who put stripes on roads deserve some kind of protection? What’s the point of even doing it if anyone (including A.I.) can copy their work for free?
Re: Re:
That’s what the hi-vis vests and hard hats are providing.
lose lose
Seems like AI is a loser on both sides. First they are liable for copyright infringement for the training, and then anything they create cannot be copyrighted so it can be taken without compensation.
I have no sympathy for any AI training which ignores robots.txt, and no sympathy for any website which objects to being crawled but doesn’t have a robots.txt forbidding it.
This comment has been flagged by the community.
haha her last name is cock
A new authority is needed to investigate this.