AI Training: What Creators Need To Know About Copyright, Tokens, And Data Winter

from the beware-the-data-winter dept

This is the final piece in a series of posts that explores how we can rethink the intersection of AI, creativity, and policy. From examining outdated regulatory metaphors to questioning copyright norms and highlighting the risks of stifling innovation, each post addresses a different piece of the AI puzzle. Together, they advocate for a more balanced, forward-thinking approach that acknowledges the potential of technological evolution while safeguarding the rights of creators and ensuring AI’s development serves the broader interests of society. You can read the first, second, third, fourth, fifth, and sixth posts in the series.

As the conversation about AI’s impact on creative industries continues, there’s a common misconception that AI models are “stealing” content by absorbing it for free. But if we take a closer look at how AI training works, it becomes clear that this isn’t the case at all. AI models don’t simply replicate or repackage creative works—they break them down into something much more abstract: tokens. These tokens are tiny, fragmented pieces of data that no longer represent the creative expression of an idea. And here’s where the distinction lies: copyright is meant to protect expression, not individual words, phrases, or patterns that make up those works.

The Lego Analogy: Breaking Down Creative Works into Tokens

Imagine you’re a creator, and your work is like a detailed Lego model of the Star Wars Millennium Falcon. It’s intricate, with every piece perfectly assembled to create something unique and valuable. Now imagine that an AI system comes along—not to take your Millennium Falcon and display it as its own creation, but to break it down into individual Lego blocks. These blocks are then scattered among millions of others from different sources, and the AI uses them to build entirely new structures—things that look nothing like the Millennium Falcon.

In this analogy, the Lego blocks are the tokens that AI models use. These tokens are fragments of data—tiny bits of information stripped of the original context and creative expression. Just like Lego pieces, tokens are abstract and can be recombined in an infinite number of ways to create something entirely new. The AI doesn’t copy your Falcon; it takes the building blocks (tokens) and uses them to create something that’s not a replica of the original but something completely different, like a castle or a spaceship you’ve never seen before.

This is the key distinction: AI models aren’t absorbing entire creative works and reproducing them as their own. They’re learning patterns from vast datasets and using those patterns to generate new content. The tokens no longer reflect the expression of the original work, and thus, they don’t infringe on the creative essence that copyright law is designed to protect.
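
For readers curious what “tokens” look like in practice, here is a minimal, purely illustrative sketch in Python. It is not how any production model tokenizes text (real systems learn subword vocabularies, such as byte-pair encodings, rather than this toy word-level scheme), but it shows the basic step the article describes: text is converted into sequences of integer IDs before a model ever trains on it.

```python
# Purely illustrative: real systems learn subword vocabularies (e.g. byte-pair
# encoding); this toy word-level scheme just shows text becoming integer IDs.

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every distinct lowercase word in the corpus."""
    vocab: dict[str, int] = {}
    for text in corpus:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map a sentence to the integer IDs a model would actually train on."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

corpus = [
    "the millennium falcon made the kessel run",
    "the falcon is a freighter",
]
vocab = build_vocab(corpus)
print(tokenize("the falcon made the run", vocab))  # [0, 2, 3, 0, 5]
```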

Why Recent Content Matters: AI Needs to Reflect Modern Language and Values

There’s another critical point that often gets overlooked: AI models need access to recent, contemporary content to be useful, relevant, and ethical. Let’s imagine for a moment what would happen if AI models were restricted to learning only from public domain works, many of which are decades or even centuries old.

While public domain works are valuable, they often reflect the social norms and biases of their time. If AI models are trained primarily on outdated texts, there’s a serious risk that they could “speak” in a way that’s misogynistic, biased, anti-LGBTQ+, or even outright racist. Many public domain works contain language and ideas that are no longer acceptable in today’s society, and if AI is limited to these sources, it may inadvertently propagate harmful, antiquated views.

To ensure that AI reflects current values, inclusive language, and modern social norms, it needs access to recent content. This means analyzing and learning from today’s books, articles, speeches, and other forms of communication. If creators and copyright holders opt out of allowing their content to be used for AI training, we risk creating models that don’t reflect the diversity, progress, and inclusivity of modern society.

For example, language evolves quickly—just look at the increased use of gender-neutral pronouns or terms like intersectionality in recent years. If AI is cut off from these contemporary linguistic trends, it will struggle to understand and engage with the world as it is today. It would be like asking an AI trained exclusively on Shakespearean English to have a conversation with a 21st-century teenager—it simply wouldn’t work.

Article 4 of the EU Directive: Opting Out of Text and Data Mining

Let’s bring the EU Directive on Copyright in the Digital Single Market (DSM) into the picture. The Directive includes provisions (Article 4) allowing copyright holders to opt out of having their content used in text and data mining (TDM). TDM is crucial for training AI models, as it allows them to analyze and learn from large datasets. The opt-out mechanism gives creators and copyright holders the ability to expressly reserve their works from being used for TDM.
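
The Directive does not prescribe a single technical format for that reservation; in practice, one common machine-readable approach is a robots.txt rule aimed at AI crawlers. The sketch below, using only the Python standard library and a hypothetical crawler name and URLs, shows how a compliant TDM crawler might check for such a reservation before scraping a page. It is a simplified assumption about how the opt-out can be expressed, not a statement of what the law requires.

```python
# Illustrative sketch: the Directive doesn't mandate one technical format for an
# Article 4 reservation; robots.txt rules aimed at AI crawlers are one common
# machine-readable approach. The crawler name and URLs below are hypothetical.
from urllib import robotparser

AI_CRAWLER_UA = "ExampleAIBot"  # hypothetical TDM crawler user-agent
page = "https://example.com/articles/some-post"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(AI_CRAWLER_UA, page):
    print("No reservation found for this crawler; TDM may proceed on this page.")
else:
    print("Rights reservation detected; a compliant TDM crawler should skip it.")
```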

However, it’s important to remember that this opt-out applies to all AI models, not just generative AI systems like ChatGPT. This means that by opting out in a broad, blanket manner, creators could inadvertently limit the potential of AI models that have nothing to do with creative industries—tools that are critical for advancements in healthcare, education, and even in day-to-day conveniences that many of us benefit from.

The Risk of a Data Winter: Why Broad Opt-Outs Could Harm Innovation

What happens if creators and copyright holders across Europe start opting out of TDM on a large scale? The answer is something AI researchers dread: a data winter. Without access to a diverse and rich array of data, AI models will struggle to evolve. This could slow innovation not just in the creative industries, but across the entire economy.

AI needs high-quality data to function properly. The principle of Garbage In, Garbage Out applies here: if AI models are starved of diverse input, their output will be flawed, biased, and of lower quality. And while this may not seem like an issue for some industries, it has a ripple effect. Every AI tool we rely on—from smart assistants to medical research applications—depends on robust training data. Restricting access to this data doesn’t just hinder progress in AI innovation; it stifles public interest tools that have far-reaching benefits for society.

Think about it: many creators themselves probably use AI-driven tools in their daily lives—whether it’s for streamlining workflows, generating new ideas, or even just organizing information. By opting out of TDM, they could inadvertently be damaging the very tools that enhance their own creative processes.

The Way Forward: Balance Between Protection and Innovation

While copyright is crucial for protecting creators and ensuring fair compensation, it’s equally important not to over-regulate in a way that stifles innovation. AI models aren’t absorbing entire works for free; they’re breaking them down into unrecognizable tokens that enable transformative uses. Rather than opting out of TDM as a knee-jerk reaction, creators should consider the long-term consequences of limiting AI’s potential to innovate and enhance their own industries.

A balance needs to be struck. Copyright protection should ensure that creators are fairly compensated, but it shouldn’t be wielded as a tool to restrict the very data that drives AI innovation. Creators and policymakers must recognize that AI isn’t the enemy—it’s a collaborator. And if we’re not careful, we might find ourselves facing a data winter, where the tools we rely on for both convenience and advancement are weakened due to short-sighted decisions.

Caroline De Cock is a communications and policy expert, author, and entrepreneur. She serves as Managing Director of N-square Consulting and Square-up Agency, and Head of Research at Information Labs. Caroline specializes in digital rights, policy advocacy, and strategic innovation, driven by her commitment to fostering global connectivity and positive change.



Comments on “AI Training: What Creators Need To Know About Copyright, Tokens, And Data Winter”

Who Cares (profile) says:

Don’t blame the content creators for a possible data winter; blame the owners of the various advanced chatbots (and that is what they are, even though everyone dresses them up in buzz & hype words so as to get money to burn).

It is a combination of:
* the crawlers absolutely hammering websites with requests. And don’t you dare impede that in any way, as the result is an even worse flood in an attempt to bypass the restrictions.
* the search engines damaging content creators’ income by AI-slopping said creators’ websites into a blurb, stopping people from actually visiting those sites (even if, due to the mixing of multiple sites, the chance that what comes out is correct is worse than betting on a coin toss if the subject in question is not settled).
* what is being produced by those bots is slop; it takes at least as long to check that the output is usable and fix what is wrong as doing the work yourself.

In aggregate, a data winter would be a boon for content creators. If the bots keel over, that means lower costs to maintain a server, more income from actual people visiting the website (hopefully doing other actions that boost visibility/income), and about the same time to do the work as before the bots existed.

Bloof (profile) says:

‘Hey creators, you don’t understand AI! Don’t use your legal rights to opt out of allowing the industry I am paid to lobby for to use your creative works for free in any way you please, you might cause scary data winter! Imagine a world where google can’t stripmine the work produced by the websites they are actively starving of revenue! That would be bad because it might affect the AI systems that aren’t being produced by selfish d-heads in an attempt to dominate and throttle the internet entirely. You might lose the good things if you don’t let the world’s richest corporations steal whatever they like, don’t you want nice science? Innovation? They neeeeeed to bombard your personal website with tens of thousands of requests every few minutes, ignore Robots.TXT and cycle through every avenue of attack to scrape every word and every last pixel and give you nothing in return in the name of innovation.’

AI companies and industry lobbyists do not care about innovation; they care about profit and put that ahead of human lives. I understand that the people running the site love tech toys and hate copyright, but it’s hard not to lose respect when they’re knowingly publishing puff pieces. I am glad it’s the last article in this series; I hope we never see its like again.

Stephen T. Stone (profile) says:

AI models aren’t absorbing entire works for free; they’re breaking them down into unrecognizable tokens that enable transformative uses

…by first absorbing entire works for free. You can’t have it both ways: Either AI models are or aren’t absorbing entire works regardless of what the models do with all those works.

I’m no fan of copyright and even I think all these pieces have been way too deferential to AI companies.

Anonymous Coward says:

Re:

AI models aren’t absorbing entire works for free; they’re breaking them down into unrecognizable tokens that enable transformative uses

…by first absorbing entire works for free.

[Citation needed] What evidence do you have that the copyrighted material used isn’t either paid for or else used under license? Remember, extraordinary claims require extraordinary evidence.

Stephen T. Stone (profile) says:

Re: Re:

What evidence do you have that the copyrighted material used isn’t either paid for or else used under license?

I know that trying to license out thousands upon thousands of books for an LLM is likely cost-prohibitive for even a multinational corporation. Sure, the people building the AI model will hoover up the public domain stuff, but if they want to train their model on anything even remotely modern, there are only two ways to get the necessary materials: buy thousands upon thousands of books or grab them via illicit filesharing. As far as any evidence proving as much, I don’t have it myself, but I’m pretty sure the authors who filed suit against Microsoft for using their works to train an AI model might have the goods.

Arianity (profile) says:

Re: Re:

What evidence do you have that the copyrighted material used isn’t either paid for or else used under license? Remember, extraordinary claims require extraordinary evidence.

This isn’t really an extraordinary claim. If you want the best evidence, you can look to recent lawsuits (both the evidence and also the defenses presented, which are about fair use, not that the material was paid for or licensed). See for instance here, with internal memos stating LibGen is “a dataset we know to be pirated.” Or here: Anthropic pirated ~7 million books. There are other cases from companies like Suno/Udio, where they admit to pirating, etc.

But even before all that, it was well known. Between the scraping/torrenting and the models regurgitating/leaking training material, we know there are copyrighted materials in the commonly used datasets. And just technologically, we know that the biggest models need more data than they would be able to get by paying or licensing.

This isn’t even an open secret, it’s just open, at this point.

Cathay (profile) says:

Sadly, this piece comes over as propaganda

In my field – software – there have been cases of recognisable chunks of source code from training data being output. This isn’t especially surprising if the code is specialised with few examples in the training data.

Since they were from confidential repositories on the Microsoft-owned GitHub, it became impossible to trust any assurances about confidentiality.

This piece is pure propaganda. None of it can be trusted. LLM “AI” has become a field like cryptocurrency: an entirely predatory business.

Arianity (profile) says:

(Your final reminder that the blurb at the bottom is a euphemism for lobbying, evaluate accordingly)

The tokens no longer reflect the expression of the original work, and thus, they don’t infringe on the creative essence that copyright law is designed to protect.

This is not how AI models work. As pointed out in previous posts, AI models can and do spit out training data largely verbatim. If you ask the AI model for a Millennium Falcon in just the right way, it will give you essentially exactly that Millennium Falcon. Not only can models do this, it happens inadvertently often enough that it’s an active area of research to make them do it less.

This is why, for instance, if you train on a dataset of only Millennium Falcons, your model will spit out Millennium Falcon derivatives. The model doesn’t just learn about the individual pieces; it uses the input data to learn how they connect. (You even literally acknowledge this in your next section: if you break up old public works into Legos, you can in fact speak modernly, unless you’re learning from and limited to the structure of those works. I can build a Millennium Falcon from Lego pieces that didn’t come from the Millennium Falcon kit. You’ve also acknowledged it in previous pieces, re: Ariana Grande.) This is also why synthetic data isn’t trivial.

(That is, of course, to say nothing of the competitive problem of now having a tool that can spit out infinite variations of Millennium Birds to compete with the original. Which would still concern creators even if they were immaculate Legos.)

Creators and policymakers must recognize that AI isn’t the enemy—it’s a collaborator

It’s both. Because it’s a tool. The hammer can be used to both build, and smash.

creators should consider the long-term consequences of limiting AI’s potential to innovate and enhance their own industries.

They are. If only politicians and lobbyists would do the same, and offer something more than “as always, the creative industries will continue to thrive” as those industries are actively being put into the woodchipper. So much for the promised copyright laws that protect human creativity and labour.

Anonymous Coward says:

Re:

As pointed out in previous posts, AI models can and do, do spit out training data works largely verbatim.

As in this article? As pointed out in the article I linked, the AI has to be prompted to spit out works verbatim before it will do so, meaning any infringement is committed by the prompter, not the AI. Anything else you want to get badly wrong, NYT shill?

Stephen T. Stone (profile) says:

Re: Re:

the AI has to be prompted to spit out works verbatim before it will do so, meaning any infringement is committed by the prompter

And yet, the AI model has the exact works verbatim sitting within it. Why wouldn’t that model (and the people who made it) be liable for infringement since it basically distributes an entire copyrighted work without the permission of the original author?

Arianity (profile) says:

Re: Re: Re:2

Ehh, the line starts to get very blurry. It’s not a database as we would normally think of it, but “a bunch of data about that source material in the form of probabilities” means that when it contains the right data, you can pull the source material back out. When that happens, you’ve essentially encoded a copy into the weights. In some sense, that is a type of database: the information is there, just encoded in a very strange and suboptimal (for this purpose) way.

There are a lot of tricks people use to get the model not to do that: some before or during training, some afterward during things like fine-tuning, and some via prompt protection. It still happens pretty regularly right now, though.

Arianity (profile) says:

Re: Re:

As in this article?

No, I’m referring to a previous article in this series.

the AI has to be prompted to spit out works verbatim before it will do so, meaning any infringement is committed by the prompter, not the AI.

That is not how copyright law works, nor does the article you cite say it works that way (at least in the U.S.; I can’t say much about the EU, which is what the OP is concerned with, but I doubt it varies significantly in this regard). You may be confusing it with cases where the LLM pulls new data from the internet (they can web search now; that is largely what the TD story you linked is talking about). That isn’t how it works for training data. (Indeed, this is why the recent Alsup ruling explicitly did not consider prompting at all.)

That said, even in the case where it’s pulling from the internet, the article does not make that claim, either. So I don’t know where you think you got that from.

Sidenote: To be clear, this doesn’t have to be a specific prompt to intentionally cause it, either. All you have to do is land in a spot in the model that’s close to training data. It’s just far, far easier to do with a specific one, for obvious reasons. Here is OpenAI admitting that regurgitation is a “rare bug.” How to keep it from happening is an active area of research. See for instance this very famous paper.

Anonymous Coward says:

This is the key distinction: AI models aren’t absorbing entire creative works and reproducing them as their own.

Bullshit as usual from a bullshit artist.

“AI” works by taking works without permission and trying to guess what a prompt looks like.

It’s plagiarism and will never be “innovative.” This has been known and proven again and again.

That Techdirt still allows this propaganda swill on their site makes me question just how long it’ll be before they’re suddenly seeing nothing wrong with what they’ve previously called “misuses of technology.”

Anonymous Coward says:

Creators and policymakers must recognize that AI isn’t the enemy—it’s a collaborator

Creators need to do no such thing, regardless of how bad their understanding and arguments are.

AI is not a collaborator. It isn’t even its own thing. It is a tool created by corporatists and weird tech-religion believers for their own ends.

And it is always going to be garbage out.

Politicians and ISPs complain about Netflix and whatever. How about AI? How about abusing the War Games Law one last time to initiate a Data Snowball Earth? (Perhaps before AI adds its brutal finishing touches on accelerating climate change.)

Arianity (profile) says:

Re:

FWIW, part of this is because the OP is looking at it specifically from a European viewpoint (they are European). Europe doesn’t use the same four-factor test, and it already has other opt-outs (e.g. TDM, mentioned briefly in a previous post in the series). And their government isn’t so sclerotic that new changes can’t be made this century.

Honestly, it’s a shame we didn’t get more discussion of how Europe is handling it differently; it would’ve been a really nice insight.
