EU Parliament Fails To Understand That The Right To Read Is The Right To Train

from the reading-is-fundamental-(to-AI) dept

Walled Culture recently wrote about an unrealistic French legislative proposal that would require the listing of all the authors of material used for training generative AI systems. Unfortunately, the European Parliament has inserted a similarly impossible idea in its text for the upcoming Artificial Intelligence (AI) Act. The DisCo blog explains that MEPs added new copyright requirements to the Commission’s original proposal:

These requirements would oblige AI developers to disclose a summary of all copyrighted material used to train their AI systems. Burdensome and impractical are the right words to describe the proposed rules.

In some cases it would basically come down to providing a summary of half the internet.

Leaving aside the impossibly large volume of material that might need to be summarized, another issue is that it is by no means clear when something is under copyright, making compliance even more infeasible. In any case, as the DisCo post rightly points out, the EU Copyright Directive already provides a legal framework that addresses the issue of training AI systems:

The existing European copyright rules are very simple: developers can copy and analyse vast quantities of data from the internet, as long as the data is publicly available and rights holders do not object to this kind of use. So, rights holders already have the power to decide whether AI developers can use their content or not.
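In practice, that reservation is usually expressed in machine-readable form, most commonly via a site's robots.txt file. As a rough sketch of how a compliant AI crawler would honor such an opt-out (the robots.txt contents and URL here are illustrative assumptions — GPTBot and CCBot are real AI crawler user agents, but Article 4 of the Copyright Directive permits other machine-readable reservation mechanisms as well), Python's standard `urllib.robotparser` can check the rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a site reserves its rights against
# two AI crawlers while staying open to ordinary readers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that honors the file must skip the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
# ...while ordinary user agents remain free to read it.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

The asymmetry is the point: the same page stays readable by humans while the rights holder's objection to machine reuse is visible to any crawler that checks.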

This is a classic case of the copyright industry always wanting more, no matter how much it gets. When the EU Copyright Directive was under discussion, many argued that an EU-wide copyright exception for text and data mining (TDM) and AI in the form of machine learning would be hugely beneficial for the economy and society. But as usual, the copyright world insisted on its right to double dip, and to be paid again if copyright materials were used for mining or machine learning, even if a license had already been obtained to access the material.

As I wrote in a column five years ago, that’s ridiculous, because the right to read is the right to mine. Updated for our AI world, that can be rephrased as “the right to read is the right to train”. By failing to recognize that, the European Parliament has sabotaged its own AI Act. Its amendment to the text will make it far harder for AI companies to thrive in the EU, which will inevitably encourage them to set up shop elsewhere.

If the final text of the AI Act still has this requirement to provide a summary of all copyright material that is used for training, I predict that the EU will become a backwater for AI. That would be a huge loss for the region, because generative AI is widely expected to be one of the most dynamic and important new tech sectors. If that happens, backward-looking copyright dogma will once again have throttled a promising digital future, just as it has done so often in the recent past.

Follow me @glynmoody on Mastodon. Originally posted to WalledCulture.



Comments on “EU Parliament Fails To Understand That The Right To Read Is The Right To Train”

36 Comments
Anonymous Coward says:

The existing European copyright rules are very simple: developers can copy and analyse vast quantities of data from the internet, as long as the data is publicly available and rights holders do not object to this kind of use. So, rights holders already have the power to decide whether AI developers can use their content or not.

How does this power actually work out in practice today in the EU? Like if say a French author were to find out that an AI company trained on the Books3 data set and that a pirated version of their book was in that data set, how would the rights holder exercise this power under EU law?

Anonymous Coward says:

I think there are some major problems with your analysis.

First, you ignore that quite a bit of this scraping was done in the dead of night. You are arguing that because copyright holders didn’t tell the AI companies not to sneak into their house and read their books, it was all on the up and up. What actually happened is that the big AI companies scraped this data without asking permission and without telling anyone they were going to do it, and if I remember right, at least one has admitted to ignoring websites’ requests not to scrape them. So no: never giving anyone the ability to tell them no, and knowingly ignoring them when they did, cannot be defined as copyright holders giving them permission.

Second, there is quite a bit of information “publicly” available that is stolen or hacked. I’m sure your credit card information, and mine, is out there somewhere already. So are license plates, location data, private photos and videos, and medical information. “Publicly available” covers a lot of things that we normally recognize as either illegal to use without express permission or illegal to use commercially without express permission. Hell, let’s go down the deep, dark route: who needs warrants when AI can just take all the information?

Third, we already recognize that scale matters. I cannot take music or video I bought and play it to a large audience. We apply, or don’t apply, laws to companies based on scale. Trying to take scale and efficiency out of the equation, as if they do not matter, is like arguing that a nuke and a firecracker, or a biplane and a military jet, or a gas tanker and a bicycle, should be regulated the same.

Fourth, there is the power imbalance: for the most part, a few big companies used their already existing power and influence to go and take advantage of what all the smaller players created.

Mamba (profile) says:

Re:

This is all, of course, complete bullshit.

Nothing was done ‘in the dead of night’; it was all done in full public view. You might not like it, but fair use and the first sale doctrine determine what can be used and how… not the writers.

Also, there’s been no evidence that anything has been ‘stolen’ or hacked. Not that copyright violation constitutes a theft.

And no, public performance isn’t about scale, it’s about venue. Any public performance of a copyrighted song is a violation, regardless of the audience size; an audience of just one still violates the copyright protections.

Finally, remember that the small companies got there first with AI. OpenAI didn’t exist before 2015. Google, Meta and others were left scrambling. Microsoft had to buy into the business; it didn’t even create it on its own.

I’m not a lawyer, but virtually every legal scholar I’ve found is certain that training AI will be fair use.

TKnarr (profile) says:

I think you misunderstand the scope of the “right to read”. It doesn’t extend to a right to do anything you want with what you’ve read without consequence. Go ask any editor, or any author, exactly why they don’t read unsolicited manuscripts or, in the case of authors, any fan works based on the books they’ve written. They don’t do those things because, even if they have a right to read them, doing so opens them up to claims of copyright infringement if they subsequently produce works too close to what they’d read (and “too close” doesn’t have to be very close at all, as more than one case has shown).

Anonymous Coward says:

Re:

You’re not creating art. You’re the equivalent of David Zaslav, telling something you don’t understand to make you something you couldn’t make yourself, while exhibiting profound contempt for the actual labor that was necessary to create the conditions under which your instructions could be carried out. Though at least WBDiscovery pays its serfs something.

Crafty Coyote says:

Re: Re:

You may have a point in that I’m not making art. But inspiring others to do what I could not do, and financially supporting them to do that, all while they maintain their innocence is a noble endeavor. Ideas, in spite of being called “property”, aren’t really property at all; and if they are, then their owners can call the police if they wish to complain.

Anonymous Coward says:

Re: Re: Re:

But inspiring others to do what I could not do, and financially supporting them to do that, all while they maintain their innocence is a noble endeavor.

So, asking artists for permission to use their work to train AI and compensating them fairly is an unconscionable infringement of whichever right you’ve hallucinated, but paying a “””prompt engineer””” to have an AI puke up a mangled version of stolen work is noble and pure?

You’re deranged.

Crafty Coyote says:

Re: Re: Re:2

A computer can only puke up a mangled work because computers can’t improvise as humans do, but the only valid complaint of theft applies to someone physically stealing the picture you’re working on. The AI is only a passive tool or instrument; a real artist should be able to create with or without it.

Copyright, on the other hand, is a detriment to artists, because it limits what actual humans can make and criminalizes the act of being inspired by what they’ve seen in life. The AI, being amoral, has access to an infinite amount of pre-existing art and has no concept of guilt or innocence, yet it also lacks human ingenuity.

Anonymous Coward says:

Re: Re: Re:2

So, asking artists for permission to use their work to train AI and compensating

Would earn them nothing: the training sets contain thousands of works, and trying to distribute any payments fairly would cost more than the licensing fees. Alternatively, it would work like the music collection societies, taxing all the little-known artists to pay the famous artists a bit more money.

David Wilson says:

The Right to Read is the Right to Train

What hooey. I should be able to create a copyrighted work and release it with an EULA that clearly stipulates that the work is for use by humans only and that it may not be incorporated into any AI or artifact without my explicit written permission. It is merely a question of will. There is absolutely nothing inevitable about copyrighted works being used as training data for LLMs or other data repositories.

Flakbait (profile) says:

Natural Intelligence

So, AI has to have a bibliography/references section to show where it learned things.

Should human (Natural Intelligence/NI) authors who are so up in arms over having their work used to teach AI now have to do the same thing, citing everything they’ve ever read?

For instance, John Grisham’s latest thriller needs to cite everything he’s ever read, from Dr. Seuss on up to every legal brief.

Tim Burr, Professional Lumberjack (profile) says:

Happy Medium, Scale is Relevant

I think there’s got to be a happy medium somewhere. It’s a false equivalence to conflate LLM ‘reading’ with human ‘reading’. For starters, while LLMs do use a neural network that purports to mimic actual brains, the way an LLM ‘remembers’ its training data isn’t actually similar to a human brain’s memory, because the model is inherently static in CUDA memory until more training is done to update it. It doesn’t adapt in real time to the emotional impact of what it has ‘read’ the way a human does after spending time emotionally processing and digesting it.

This is inherently different from human brains, which have immense plasticity, and which change every time they read and absorb written language, then process and store pieces of that content based on the individual’s emotional reaction to it. The human brain is simply far too imperfect at literary pattern matching compared to LLMs: it absorbs and retains primarily the emotional impact of what it reads, and the general meaning and summary of the contents are kept according to the inherent importance the brain assigns to them.

That is why it’s absurd to claim there is, in any way, any similarity between human brains and LLM training datasets. For all their flaws, LLMs retain raw information far more perfectly than human brains, even if it isn’t necessarily a photocopy, and they aren’t enabled, inspired, or restrained by any emotion whatsoever. A few months back, I could go to chatgpt and prompt it to perfectly reproduce the opening pages of The Hobbit. Sometimes a few words would be slightly different. I’m sure it could do that with tens of thousands of literary works. While some humans memorize entire Shakespeare plays or The Odyssey, even the most acute human memory can’t retain that much data with that much accuracy. Even people who memorize plays and epics lose them without regularly refreshing them and devoting huge amounts of their lives to keeping them fresh. And even then, they will call ‘line’ during rehearsal, or forget little parts and improvise. That’s with, at best, a few dozen literary works having been memorized.

Scale matters. That’s why academic use of copyrighted data is often considered fair use: it happens at a very small scale compared to commercial or bootleg reproduction of content. At some point, even our corrupt asshat legislatures and their lackey judges all agreed that small-scale reproduction and use of copyrighted material is acceptable for certain use cases, even without permission from or compensation to the original rightsholders.

Yes, humans shouldn’t own ideas, but if I spend millions of dollars paying humans to train an LLM, and I know I use copyrighted works during the ingestion process, and I know those copyrighted works will be tokenized and retained to a significant degree in these LLMs, and then the LLMs will be monetized and made available to massive portions of the public, the rightsholders should at least have the right to opt out. There is an inherent matter of scale at work here, and generative AI isn’t reading and absorbing works the same way a human would. The AI isn’t just a very smart human; it’s far more powerful at data reproduction than the average human, because the model has memorized the patterns of the words of entire libraries, allowing it to reproduce significant portions of them nearly perfectly, or at least with far more acuity and permanence than the best human minds.

So I say again, going forward, rightsholders should be able to opt out. Maybe what should happen is the LLM companies should be required to pay for a copy of every work they used for every human that trained the LLMs. That seems like a very reasonable solution. If these datasets were trained on tens of thousands of books and had hundreds of humans training them, that’s not a small amount of book sales. Maybe they should be paid again for the same amount when every new public model and training set is released. Or something like that. You can’t tell me a Microsoft subsidiary can’t afford it. And if it’s an academic work, the cost should be a lot cheaper/different.

That being said, you can’t put the toothpaste back into the tube. And moreover, the LLMs aren’t perfect photocopies of books, and rightsholders need to fucking research the technology and stop claiming/treating it like it is.

What this all really proves is that copyright laws are still broken, designed to enrich a very small number of lucky successful people, and enable them to create multigenerational nobilities off 150-year royalty schedules. Put it back to 25 years or whatever. 50 years, fine. Tolkien is dead and his works should be in the public domain soon if not already.

Anonymous Coward says:

Re:

A few months back, I could go to chatgpt and prompt it to perfectly reproduce the opening pages of The Hobbit.

Note that since you prompted it to get a specific output, it is you who is committing copyright infringement, not the tool you used to get that output. All the waffle about AI being different does not matter when a person directs it to a specific result; it is the human who decides what they want and how to get there.

Tim Burr, Professional Lumberjack (profile) says:

Re: Re:

Completely agree. The point isn’t whether the tool itself is actively reproducing copyrighted material; my point was to illustrate that calling this “learning” in the same way that humans “learn” is completely ignorant at best and disingenuous at worst. I have no idea how access to digital libraries is handled in these sorts of cases, but it seems there should be some schema whereby corporate access to copyrighted data by hundreds or thousands of workers, even if accessed through the abstraction of LLM training, should require compensation commensurate with the number of people working on it. Moreover, if I am still the rightsholder and I am uncomfortable with the material being used by LLMs, I should have the ability to completely opt out.

If I am the adult descendant of a rightsholder and it was my grandparent who created the work 50 years ago, my inherited rights should fall off a cliff, and the work should be public domain and freely available to everyone.

Additionally, it is my opinion that (Carthage must be destroyed) copyright terms should be amended to place works back in the public domain within 30 to 50 years of publication.

Anonymous Coward says:

Re: Re:

If the text of a book is being perfectly reproduced with prompts (which aren’t the text of the book), it means that the AI already has a copy of that text. Hence, copying without permission; hence, infringement by the AI’s makers/trainers, unless it falls under fair use (which it might, but that’s a question for another day).

Anonymous Coward says:

Re: Re: Re:

If the text of a book is being perfectly reproduced with prompts (which aren’t the text of the book), it means that the AI already has a copy of that text. Hence, copying without permission; hence, infringement by the AI’s makers/trainers, unless it falls under fair use (which it might, but that’s a question for another day).

Someone doesn’t actually understand how LLMs work, or copyright for that matter. You can’t get an LLM to reproduce a book verbatim from a prompt, unless you provide it with one whopping long prompt after much trial and error in an effort to steer it into creating a perfect copy, just as you could get a person to create a perfect reproduction by telling them what to write and constantly correcting them.

On the other hand, I know of savants who can reproduce a book verbatim if you just tell them the title.

Crafty Coyote says:

Getting around these asinine laws requires a creativity that knows no bounds. When self-expression borrowing media that hasn’t hit an arbitrary number of years yet is illegal, find other innocent people (or, in this case, amoral AI that can’t possibly know right from wrong), perhaps in a foreign country, to bring those ideas to market.

More self-sacrifice than creativity, really, but it will get the job done.
