Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard

from the public-domain-impossibility-theorem dept

In response to a number of copyright lawsuits about AI training datasets, we are starting to see efforts to build ‘non-infringing’ collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US and therefore inherently ‘non-infringing’, I think these efforts to build ‘safe’ or ‘clean’ data sets (or whatever other word one might use) are quite interesting. One reason they are interesting is that they can help illustrate why building such a data set at scale is such a challenge.

That’s why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate over 37 million “public domain and CC0 images integrated from dozens of libraries and museums.” That’s far fewer than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.

However, it didn’t take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.

The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

[Image: photograph of a library reading room full of patrons, shot from above]

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.

Clicking through to the Wikimedia page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 18, 2016.

Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but does state:

  • You cannot sell or distribute Content (either in digital or physical form) on a Standalone basis. Standalone means where no creative effort has been applied to the Content and it remains in substantially the same form as it exists on our website.
  • If Content contains any recognisable trademarks, logos or brands, you cannot use that Content for commercial purposes in relation to goods and services. In particular, you cannot print that Content on merchandise or other physical products for sale.
  • You cannot use Content in any immoral or illegal way, especially Content which features recognisable people.
  • You cannot use Content in a misleading or deceptive way.
  • You cannot use any of the Content as part of a trade-mark, design-mark, trade-name, business name or service mark.

Which is to say, not CC0.

However, further investigation into the Pixabay Wikipedia page suggests that images uploaded to Pixabay before January 9, 2019 are actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image’s Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image’s Wikimedia page).

All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!
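The check walked through above (aggregator claim, then original source, then upload date against the license cutoff) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ImageRecord` fields and the `check_cc0_claim` function are hypothetical names, not anything Source.Plus, Wikimedia, or Pixabay actually exposes, and the only rule encoded is the one described in this post (Pixabay uploads before January 9, 2019 are CC0).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Cutoff as described in the post: Pixabay uploads before this date are CC0.
PIXABAY_CC0_CUTOFF = date(2019, 1, 9)


@dataclass
class ImageRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    claimed_license: str                 # license asserted by the aggregator
    original_source: str                 # site the image first came from
    source_upload_date: Optional[date]   # upload date on that site, if known


def check_cc0_claim(record: ImageRecord) -> str:
    """Return 'consistent', 'conflict', or 'unverified' for a CC0 claim."""
    if record.claimed_license != "CC0":
        # This sketch only knows how to evaluate CC0 claims.
        return "unverified"
    if record.original_source.lower() != "pixabay":
        # No rule encoded for other sources; a human would have to check.
        return "unverified"
    if record.source_upload_date is None:
        return "unverified"
    if record.source_upload_date < PIXABAY_CC0_CUTOFF:
        return "consistent"
    # Uploaded on or after the cutoff: the Pixabay Content License applies.
    return "conflict"


# The image discussed in this post: uploaded to Pixabay on May 17, 2016.
library_photo = ImageRecord("CC0", "Pixabay", date(2016, 5, 17))
print(check_cc0_claim(library_photo))
```

Note that the sketch still trusts the upload date and the source attribution it is given, which is exactly the part that a third-party uploader could get wrong.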

At the same time, the accuracy of that status feels a bit fragile. That fragility is tolerable in the context of Wikipedia, or if you are looking for a handful of openly licensed images. Is it likely to hold up at training set scale across tens of millions of images? Maybe? What does it mean to be ‘good enough’ in this case? If trainers do require permission from rightsholders to train, and one relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?

Michael Weinberg is the Executive Director of NYU’s Engelberg Center for Innovation Law and Policy. This post is republished from his blog under its CC BY-SA 4.0 license. Hero Image: Interieur van de Bodleian Library te Oxford


Comments on “Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard”

28 Comments
Anonymous Coward says:

Re:

For real property, the law recognized long ago that ownership has a huge problem: how do I know who, if anyone, owns the land on which I’d like to build a driveway (for example)? In almost any part of the world, some government keeps track of that and can answer the question. And if the owner is dead or defunct, or appears to have abandoned the land, there’s a process to deal with it.

With copyright, we seem to have none of that. We just kind of accept that maybe someone will come along out of nowhere, make claims we can’t really verify, and sue us or just intercept our advertising revenue.

Crafty Coyote says:

Re: Re:

Life is too short to live enslaved by fear. Check through government records for the absence of an official copyright confirmation, and then have fun. Dead or defunct owners of this property should be treated like mines that might still have a vein of gems or precious metals in them. I’ll definitely go wildcatting around some forgotten music or images and see if I’m not genius enough to strike it rich.

Anonymous Coward says:

AI training is not fair use

And that’s because it’s not a derivative work: it’s a copy. All of these “AI” systems are just massive exercises in linear algebra incorporating all the data they’ve ingested. They’re not “learning”, they’re just adjusting the values of the weights in the models until they produce the output desired by their makers. There’s no intelligence, no understanding, no comprehension in them. As the (justifiably) famous paper observes, they’re stochastic parrots.

The only difference between these systems and a system which ingested the same content and spit it all back in the same order that it was read is that this one does it when prompted.

Anonymous Coward says:

Re:

“AI training is not fair use. And that’s because it’s not a derivative work: it’s a copy”

This does not follow. A full and exact copy can indeed be fair use. Libraries, for example, have been storing tiny copies of major newspapers for a hundred years. A CD rip of a disc you own is totally fine too, as much as the copyright maximalists hate it.

A US Copyright Office document was linked in comments on a recent story, confirming that short sentences are “de minimis” and not eligible for copyright. If that’s all that the “AI” models spit out, it might be fine, but I fully expect that’s gonna be disputed and eventually end up at SCOTUS.

Anonymous Coward says:

That’s what is strange about the billions of images on the internet: they are free to access but not free to use.
How many sunsets are on the internet? Judging by how many people I’ve seen photographing them with their phones every time there is one, a lot, but only a tiny fraction are really free to use.
And that is certainly where Google and Facebook can shine, by playing dirty and using non-free (like the pictures on Google Maps) or even private pictures until their AI becomes good enough to be retrained on free content (if that ever really matters), while open-source models and data will struggle.
So maybe these copyright “issues” around generative AI content should be, in some way, ignored, so that everyone can get access to enough content (the “whole” internet) to make the technology affordable, and not only for the biggest companies.

Mamba (profile) says:

Re:

How is Google using the photos they took to train an AI ‘playing dirty’? Or are you talking about the photos people upload to Maps, where you’re required to grant them a license to use the image for derivative works?

I also maintain that AI doesn’t even need to rely on fair use. If Google, for example, purchases a digital copy of a book and trains the AI on it…it’s just use.

Tom B says:

Using Unreliable Training Data

I doubt that anyone is going to agree with me here, but so be it.
You shouldn’t be using the internet (or internet-accessible) public or private data to train your AI! The only solution that prevents the cost of lawsuits and guarantees control over the training process is to come up with your own training data, 100%. What’s that, you say? Too hard? Takes too long? Costs too much? My heart bleeds for you. Trying to make the big bucks without putting in the (admittedly) hard work is the corporate equivalent of school students cheating off each other’s test papers. Don’t complain if your student gets caught hallucinating the wrong answers! There’s a very good reason we’re about 50-100 years away from true, reliable Artificial Intelligence. There are no shortcuts, fair use or no fair use.

Anonymous Coward says:

Re:

come up with your own inspiration, 100%. What’s that, you say? Too hard? Takes too long? Costs too much? My heart bleeds for you.

come up with your own 3D modeling software, 100%. What’s that, you say? Too hard? Takes too long? Costs too much? My heart bleeds for you.

come up with your own literary genre, 100%. What’s that, you say? Too hard? Takes too long? Costs too much? My heart bleeds for you.

Yeah… I don’t think that statement works out the way you want it to.

Anonymous Coward says:

I know of a number of images that have been uploaded to Wikimedia with a CC0 attribution that, if put into a Google Image Search, turn out to be virtually identical to images with restrictive copyright licensing.

Possibly what this project could do, if they really wanted to ensure license compliance, is also do a reverse image search on every image in the collection, and prune anything for which others have claimed copyright on similar images.

This would in most cases be silly, as many corporations do exactly the opposite and claim copyright over images first published with fewer restrictions. But at least the training corpus would have a solid backing of good provenance and good-faith pruning. Copyrighted works will still sneak in, but are less likely to skew the output in any meaningful way.

Anonymous Coward says:

It’s cute that you only want the system you fucking brunchlords to change only because it stands in the way of your fucking stock options.

SORRY NOT SORRY you lot are finally facing the problems of content creators writ large.

And I am enjoying every single fucking planck unit of it.

Enjoy the hell you created. We hope you’ll fix it, but you’ll only try to carve out exemptions for your ingroup, so from the rest of us in the hell you created:

WELCOME.

terop (profile) says:

For small entities like programmers or artists, it’s better to drop AI completely than to start building your own non-infringing AI databases. The amount of data is just too large for the operation to be suitable for small teams like that. Dropping AI (like I have done) is clearly the right solution. Going with the small data storage is a significantly better solution.
