EU Parliament Fails To Understand That The Right To Read Is The Right To Train

from the reading-is-fundamental-(to-AI) dept

Walled Culture recently wrote about an unrealistic French legislative proposal that would require the listing of all the authors of material used for training generative AI systems. Unfortunately, the European Parliament has inserted a similarly impossible idea in its text for the upcoming Artificial Intelligence (AI) Act. The DisCo blog explains that MEPs added new copyright requirements to the Commission’s original proposal:

These requirements would oblige AI developers to disclose a summary of all copyrighted material used to train their AI systems. Burdensome and impractical are the right words to describe the proposed rules.

In some cases it would basically come down to providing a summary of half the internet.

Leaving aside the impossibly large volume of material that might need to be summarized, another issue is that it is by no means clear when something is under copyright, making compliance even more infeasible. In any case, as the DisCo post rightly points out, the EU Copyright Directive already provides a legal framework that addresses the issue of training AI systems:

The existing European copyright rules are very simple: developers can copy and analyse vast quantities of data from the internet, as long as the data is publicly available and rights holders do not object to this kind of use. So, rights holders already have the power to decide whether AI developers can use their content or not.
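In practice, that reservation is usually expressed in machine-readable form, most commonly via a site's robots.txt file. As a rough sketch of how a compliant AI crawler would honor such an opt-out (the robots.txt contents and URL here are illustrative assumptions — GPTBot and CCBot are real AI crawler user agents, but Article 4 of the Copyright Directive permits other machine-readable reservation mechanisms as well), Python's standard `urllib.robotparser` can check the rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a site reserves its rights against
# two AI crawlers while staying open to ordinary readers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that honors the file must skip the whole site...
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
# ...while ordinary user agents remain free to read it.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

The asymmetry is the point: the same page stays readable by humans while the rights holder's objection to machine reuse is visible to any crawler that checks.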

This is a classic case of the copyright industry always wanting more, no matter how much it gets. When the EU Copyright Directive was under discussion, many argued that an EU-wide copyright exception for text and data mining (TDM) and AI in the form of machine learning would be hugely beneficial for the economy and society. But as usual, the copyright world insisted on its right to double dip, and to be paid again if copyright materials were used for mining or machine learning, even if a license had already been obtained to access the material.

As I wrote in a column five years ago, that’s ridiculous, because the right to read is the right to mine. Updated for our AI world, that can be rephrased as “the right to read is the right to train”. By failing to recognize that, the European Parliament has sabotaged its own AI Act. Its amendment to the text will make it far harder for AI companies to thrive in the EU, which will inevitably encourage them to set up shop elsewhere.

If the final text of the AI Act still has this requirement to provide a summary of all copyright material that is used for training, I predict that the EU will become a backwater for AI. That would be a huge loss for the region, because generative AI is widely expected to be one of the most dynamic and important new tech sectors. If that happens, backward-looking copyright dogma will once again have throttled a promising digital future, just as it has done so often in the recent past.

Follow me @glynmoody on Mastodon. Originally posted to WalledCulture.



Comments on “EU Parliament Fails To Understand That The Right To Read Is The Right To Train”

36 Comments
Anonymous Coward says:

The existing European copyright rules are very simple: developers can copy and analyse vast quantities of data from the internet, as long as the data is publicly available and rights holders do not object to this kind of use. So, rights holders already have the power to decide whether AI developers can use their content or not.

How does this power actually work out in practice today in the EU? Like if say a French author were to find out that an AI company trained on the Books3 data set and that a pirated version of their book was in that data set, how would the rights holder exercise this power under EU law?

Anonymous Coward says:

I think there are some major problems with your analysis.

First, you ignore that quite a bit of this scraping was done in the dead of night. You are arguing that because copyright holders didn’t tell the AI companies not to sneak into their house and read their books, it was all on the up and up. What actually happened is that the big AI companies scraped this data without asking permission and without telling anyone they were going to do it, and if I remember right, at least one has admitted to ignoring websites’ requests not to scrape them. So no: never giving anyone the ability to tell them no, and knowingly ignoring them when they did, cannot be defined as copyright holders giving them permission.

Second, there is quite a bit of information “publicly” available that is stolen or hacked. I’m sure your credit card information, and mine, is out there somewhere already. So are license plates, location data, private photos and videos, and medical information. “Publicly available” covers a lot of things that we normally recognize as either illegal to use without express permission or illegal to use commercially without express permission. Hell, let’s go down the deep, dark route: who needs warrants when AI can just take all the information?

Third, we already recognize that scale matters. I cannot take music or video I bought and play it to a large audience. We apply, or don’t apply, laws to companies based on scale. Trying to take scale and efficiency out of the equation, as if they do not matter, is like arguing that a nuke and a firecracker, or a biplane and a military jet, or a gas tanker and a bicycle, should be regulated the same.

Fourth, there is the power imbalance: for the most part, a few big companies used their already existing power and influence to go and take advantage of what all the smaller players created.

Mamba (profile) says:

Re:

This is all, of course, complete bullshit.

Nothing was done ‘in the dead of night’; it was all done in full public view. You might not like it, but fair use and the first sale doctrine determine what can be used and how… not the writers.

Also, there’s been no evidence that anything has been ‘stolen’ or hacked. Not that copyright violation constitutes a theft.

And no, public performance isn’t about scale, it’s about venue. Any public performance of a copyrighted song is a violation, regardless of the audience size; an audience of just one still violates the copyright protections.

Finally, remember that the small companies got there first with AI. OpenAI didn’t exist before 2015. Google, Meta and others were left scrambling. Microsoft had to buy into the business; it didn’t even create it on its own.

I’m not a lawyer, but virtually every legal scholar I’ve found is certain that training AI will be fair use.

TKnarr (profile) says:

I think you misunderstand the scope of the “right to read”. It doesn’t extend to a right to do anything you want with what you’ve read without consequence. Go ask any editor, or any author, exactly why they don’t read unsolicited manuscripts or, in the case of authors, any fan works based on the books they’ve written. They don’t do those things because, even if they have a right to read them, doing so opens them up to claims of copyright infringement if they subsequently produce works too close to what they’d read (and “too close” doesn’t have to be very close at all, as more than one case has shown).

Anonymous Coward says:

Re:

You’re not creating art. You’re the equivalent of David Zaslav, telling something you don’t understand to make you something you couldn’t make yourself, while exhibiting profound contempt for the actual labor that was necessary to create the conditions under which your instructions could be carried out. Though at least WBDiscovery pays its serfs something.

Crafty Coyote says:

Re: Re:

You may have a point in that I’m not making art. But inspiring others to do what I could not do, and financially supporting them to do that, all while they maintain their innocence is a noble endeavor. Ideas, in spite of being called “property”, aren’t really property at all; and if they are, then their owners can call the police if they wish to complain.

Anonymous Coward says:

Re: Re: Re:

But inspiring others to do what I could not do, and financially supporting them to do that, all while they maintain their innocence is a noble endeavor.

So, asking artists for permission to use their work to train AI and compensating them fairly is an unconscionable infringement of whichever right you’ve hallucinated, but paying a “””prompt engineer””” to have an AI puke up a mangled version of stolen work is noble and pure?

You’re deranged.

Crafty Coyote says:

Re: Re: Re:2

A computer can only puke up a mangled work because computers can’t improvise as humans do, but the only valid complaint of theft applies to someone physically stealing the picture you’re working on. The AI is only a passive tool or instrument; a real artist should be able to create with or without it.

Copyright, on the other hand, is a detriment to artists, because it limits what actual humans can make and criminalizes the act of being inspired by what they’ve seen in life. The AI, being amoral, has access to an infinite amount of pre-existing art and has no concept of guilt or innocence, yet it also lacks human ingenuity.

Anonymous Coward says:

Re: Re: Re:2

So, asking artists for permission to use their work to train AI and compensating

Would earn them nothing: the training sets contain thousands of works, and trying to distribute any payments fairly would cost more than the licensing fees. Alternatively, it would work like the music collection societies, taxing all the little-known artists to pay the famous artists a bit more money.

David Wilson says:

The Right to Read is the Right to Train

What hooey. I should be able to create a copyrighted work and release it with an EULA that clearly stipulates that the work is for use by humans only and that it may not be incorporated into any AI or artifact without my explicit written permission. It is merely a question of will. There is absolutely nothing inevitable about copyrighted works being used as training data for LLMs or other data repositories.

Flakbait (profile) says:

Natural Intelligence

So, AI has to have a bibliography/references section to show where it learned things.

Should human (Natural Intelligence/NI) authors who are so up in arms over having their work used to teach AI now have to do the same thing, citing everything they’ve ever read?

For instance, John Grisham’s latest thriller needs to cite everything he’s ever read, from Dr. Seuss on up to every legal brief.

Tim Burr, Professional Lumberjack (profile) says:

Happy Medium, Scale is Relevant

I think there’s got to be a happy medium somewhere. It’s a false equivalence to conflate LLM ‘reading’ with human ‘reading’. For starters, while LLMs do use a neural network that purports to mimic actual brains, the way an LLM ‘remembers’ its training data isn’t actually similar to a human brain’s memory, because the model is inherently static in CUDA memory until more training is done to update it. It doesn’t adapt in real time to the emotional impact of what it has ‘read’ the way a human does after spending time emotionally processing and digesting it.

This is inherently different from human brains, which have immense plasticity, and which change every time they read and absorb written language, then process and store pieces of that content based on the individual’s emotional reaction to it. The human brain is simply far too imperfect at literary pattern matching compared to LLMs: it absorbs and retains primarily the emotional impact of what it reads, and the general meaning and summary of the contents are kept according to the inherent importance the brain assigns to them.

That is why it’s absurd to claim there is, in any way, any similarity between human brains and LLM training datasets. For all their flaws, LLMs retain raw information far more perfectly than human brains, even if it isn’t necessarily a photocopy, and they aren’t enabled, inspired, or restrained by any emotion whatsoever. A few months back, I could go to chatgpt and prompt it to perfectly reproduce the opening pages of The Hobbit. Sometimes a few words would be slightly different. I’m sure it could do that with tens of thousands of literary works. While some humans memorize entire Shakespeare plays or The Odyssey, even the most acute human memory can’t retain that much data with that much accuracy. Even people who memorize plays and epics lose them without regularly refreshing them and devoting huge amounts of their lives to keeping them fresh. And even then, they will call ‘line’ during rehearsal, or forget little parts and improvise. That’s with, at best, a few dozen literary works having been memorized.

Scale matters. That’s why academic use of copyrighted data is often considered fair use: it happens at a very small scale compared to commercial or bootleg reproduction of content. At some point, even our corrupt asshat legislatures and their lackey judges all agreed that small-scale reproduction and use of copyrighted material is acceptable for certain use cases, even without permission from or compensation to the original rightsholders.

Yes, humans shouldn’t own ideas, but if I spend millions of dollars paying humans to train an LLM, and I know I use copyrighted works during the ingestion process, and I know those copyrighted works will be tokenized and retained to a significant degree in these LLMs, and then the LLMs will be monetized and made available to massive portions of the public, the rightsholders should at least have the right to opt out. There is an inherent matter of scale at work here, and generative AI isn’t reading and absorbing works the same way a human would. The AI isn’t just a very smart human; it’s far more powerful at data reproduction than the average human, because the model has memorized the patterns of the words of entire libraries, allowing it to reproduce significant portions of them nearly perfectly, or at least with far more acuity and permanence than the best human minds.

So I say again, going forward, rightsholders should be able to opt out. Maybe what should happen is the LLM companies should be required to pay for a copy of every work they used for every human that trained the LLMs. That seems like a very reasonable solution. If these datasets were trained on tens of thousands of books and had hundreds of humans training them, that’s not a small amount of book sales. Maybe they should be paid again for the same amount when every new public model and training set is released. Or something like that. You can’t tell me a Microsoft subsidiary can’t afford it. And if it’s an academic work, the cost should be a lot cheaper/different.

That being said, you can’t put the toothpaste back into the tube. And moreover, the LLMs aren’t perfect photocopies of books, and rightsholders need to fucking research the technology and stop claiming/treating it like it is.

What this all really proves is that copyright laws are still broken, designed to enrich a very small number of lucky successful people, and enable them to create multigenerational nobilities off 150-year royalty schedules. Put it back to 25 years or whatever. 50 years, fine. Tolkien is dead and his works should be in the public domain soon if not already.

Anonymous Coward says:

Re:

A few months back, I could go to chatgpt and prompt it to perfectly reproduce the opening pages of The Hobbit.

Note that since you prompted it to get a specific output, it is you who is committing copyright infringement, not the tool you used to get that output. All the waffle about AI being different does not matter when a person directs it to a specific result; it is the human who decides what they want and how to get there.

Tim Burr, Professional Lumberjack (profile) says:

Re: Re:

Completely agree. The point isn’t whether the tool itself is actively reproducing copyrighted material; my point was to illustrate that calling this “learning” in the same way that humans “learn” is completely ignorant at best and disingenuous at worst. I have no idea how access to digital libraries is handled in these sorts of cases, but it seems there should be some schema whereby corporate access to copyrighted data by hundreds or thousands of workers, even if accessed through the abstraction of LLM training, should require compensation commensurate with the number of people working on it. Moreover, if I am still the rightsholder and I am uncomfortable with the material being used by LLMs, I should have the ability to completely opt out.

If I am the adult descendant of a rightsholder and it was my grandparent who created the work 50 years ago, my inherited rights should fall off a cliff, and the work should be public domain and freely available to everyone.

Additionally, it is my opinion that (Carthage must be destroyed) copyright terms should be amended to place works back in the public domain within 30 to 50 years of publication.

Anonymous Coward says:

Re: Re:

If the text of a book is being perfectly reproduced with prompts (which aren’t the text of the book), it means that the AI already has a copy of that text. Hence, copying without permission; hence, infringement by the AI’s makers/trainers, unless it falls under fair use (which it might, but that’s a question for another day).

Anonymous Coward says:

Re: Re: Re:

If the text of a book is being perfectly reproduced with prompts (which aren’t the text of the book), it means that the AI already has a copy of that text. Hence, copying without permission; hence, infringement by the AI’s makers/trainers, unless it falls under fair use (which it might, but that’s a question for another day).

Someone doesn’t actually understand how LLMs work, or copyright for that matter. You can’t get an LLM to reproduce a book verbatim from a prompt, unless you provide it with one whopping long prompt after much trial and error in an effort to steer it into creating a perfect copy, just as you could get a person to create a perfect reproduction by telling them what to write and constantly correcting them.

On the other hand, I know of savants who can reproduce a book verbatim if you just tell them the title.

Crafty Coyote says:

Getting around these asinine laws requires a creativity that knows no bounds. When self-expression borrowing media that hasn’t hit an arbitrary number of years yet is illegal, find other innocent people (or, in this case, amoral AI that can’t possibly know right from wrong), perhaps in a foreign country, to bring those ideas to market.

More self-sacrifice than creativity, really, but it will get the job done.
