Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard
from the public-domain-impossibility-theorem dept
In response to a number of copyright lawsuits about AI training datasets, we are starting to see efforts to build ‘non-infringing’ collections of media for training AI. While I continue to believe that most AI training is covered by fair use in the US and therefore inherently ‘non-infringing’, I think these efforts to build ‘safe’ or ‘clean’ data sets (or whatever other word one might use) are quite interesting. One reason they are interesting is that they can help illustrate why building such a data set at scale is such a challenge.
That’s why I was excited to read about Source.Plus (via a post from Open Future). Source.Plus is a tool from Spawning that purports to aggregate over 37 million “public domain and CC0 images integrated from dozens of libraries and museums.” That’s far fewer than are used to train current generative models, but still a lot of images that could be used for all sorts of useful things.
However, it didn’t take too much poking around on the site to find an illustration of why accurately aggregating nominally openly licensed images at scale can be such a challenge.
The site has plenty of OpenGLAM images that are clearly old enough to be in the public domain. It also has a number of newer images (like photographs) that are said to be licensed under CC0. Curious, I clicked on the first photograph I found on the Source.Plus home page:

According to the image page on Source.Plus, the image was from Wikimedia Commons and licensed under a CC0 public domain dedication. It listed the creator as Pixabay and the uploader (to Wikimedia) as Philipslearning.
Clicking through to the Wikimedia page reveals that the original source for the image was Pixabay, and that it was uploaded on March 9, 2023 by Philipslearning (an account that appears to no longer exist, for whatever that is worth). The file metadata says that the image itself was taken on May 18, 2016.
Clicking through to the Pixabay page for the image reveals that the image is available under the Pixabay Content License. That license is fairly permissive, but does state:
- You cannot sell or distribute Content (either in digital or physical form) on a Standalone basis. Standalone means where no creative effort has been applied to the Content and it remains in substantially the same form as it exists on our website.
- If Content contains any recognisable trademarks, logos or brands, you cannot use that Content for commercial purposes in relation to goods and services. In particular, you cannot print that Content on merchandise or other physical products for sale.
- You cannot use Content in any immoral or illegal way, especially Content which features recognisable people.
- You cannot use Content in a misleading or deceptive way.
- You cannot use any of the Content as part of a trade-mark, design-mark, trade-name, business name or service mark.
Which is to say, not CC0.
However, further investigation into the Pixabay Wikipedia page suggests that images uploaded to Pixabay before January 9, 2019 are actually released under CC0. Section 4 of the Pixabay terms confirms that. The additional information on the image’s Pixabay page confirms that it was uploaded on May 17, 2016 (which matches the metadata added by the unknown Philipslearning on the image’s Wikimedia page).
All of which means that this image is, in all likelihood, available under a CC0 public domain dedication. Which is great! Everything was right!
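The date check in that chain of reasoning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the license labels are made up for the example, and the cutoff is the January 9, 2019 date described above. It is not how Pixabay, Wikimedia, or Source.Plus actually determine a license.

```python
from datetime import date

# Cutoff as described in the post: Pixabay uploads before this date
# are reported to have been released under CC0.
PIXABAY_CC0_CUTOFF = date(2019, 1, 9)


def pixabay_license_for(upload_date: date) -> str:
    """Return an illustrative license label for a Pixabay upload date.

    Hypothetical helper: it encodes only the date rule discussed in the
    post and ignores every other factor that could affect an image's status.
    """
    if upload_date < PIXABAY_CC0_CUTOFF:
        return "CC0"
    return "Pixabay Content License"


# The image discussed above was uploaded to Pixabay on May 17, 2016.
print(pixabay_license_for(date(2016, 5, 17)))
```

Note that the check is only as good as its input: it takes the upload date on trust, which is exactly the piece of information that had to be traced by hand through three different sites.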
At the same time, the accuracy of that status feels a bit fragile. This fragility works in the context of Wikipedia, or if you are looking for a handful of openly-licensed images. Is it likely to hold up at training set scale across tens of millions of images? Maybe? What does it mean to be ‘good enough’ in this case? If trainers do require permission from rightsholders to train, and one relied on Source.Plus/Wikimedia for the CC0 status of a work, and that status turned out to be incorrect, should the fact that they thought they were using a CC0 image be relevant to their liability?
Michael Weinberg is the Executive Director of NYU’s Engelberg Center for Innovation Law and Policy. This post is republished from his blog under its CC BY-SA 4.0 license. Hero Image: Interieur van de Bodleian Library te Oxford
Filed Under: ai, copyright, public domain, training data


Comments on “Clearing Rights For A ‘Non-Infringing’ Collection Of AI Training Media Is Hard”
Almost like a handful of rich assholes owning all the imaginary property for 100+ years at a time is unworkable.
Re:
And when people realize that the property is indeed imaginary, they’ll be ready to “steal” that property and get away with it. And those rich assholes will realize that when people are willing to be arrested to expose the lie, we’ll win.
Re:
For real property, the law recognized long ago that ownership has a huge problem: how do I know who, if anyone, owns the land on which I’d like to build a driveway (for example)? In almost any part of the world, some government keeps track of that and can answer the question. And if the owner is dead or defunct, or appears to have abandoned the land, there’s a process to deal with it.
With copyright, we seem to have none of that. We just kind of accept that maybe someone will come along out of nowhere, make claims we can’t really verify, and sue us or just intercept our advertising revenue.
Re: Re:
Life is too short to live enslaved by fear. Check through government records- the absence of an official copyright confirmation- and then have fun. Dead or defunct owners of this property should be treated like mines that might still have a vein of gems or precious metals in them. I’ll definitely go wildcatting around some forgotten music or images and see if I’m not genius enough to strike it rich.
AI training is not fair use
And that’s because it’s not a derivative work: it’s a copy. All of these “AI” systems are just massive exercises in linear algebra incorporating all the data they’ve ingested. They’re not “learning”, they’re just adjusting the values of the weights in the models until they produce the output desired by their makers. There’s no intelligence, no understanding, no comprehension in them. As the (justifiably) famous paper observes, they’re stochastic parrots.
The only difference between these systems and a system which ingested the same content and spit it all back in the same order that it was read is that this one does it when prompted.
Re:
The trained model is the work, clown shoe.
Re:
Lol, no. There is no copy.
If you saw a picture once, even stared at it for an hour, or studied it for elements or styles you’d use in something you might create in the future, is there a copy? (Spoiler: No.)
Re:
This does not follow. A full and exact copy can indeed be fair use. Libraries, for example, have been storing tiny copies of major newspapers for a hundred years. A CD rip of a disc you own is totally fine too, as much as the copyright maximalists hate it.
A US Copyright Office document was linked in comments on a recent story, confirming that short sentences are “de minimis” and not eligible for copyright. If that’s all that the “AI” models spit out, it might be fine, but I fully expect that’s gonna be disputed and eventually end up at SCOTUS.
Re:
And that it doesn’t actually store the data. And that it’s not always a perfect copy. And that they have nothing in common.
Re:
So if I read a bunch of research papers on a range of subjects, and then use most or all of the same words in a different order in a research paper of my own on a completely different subject, that’s a copy? How so?
That’s what is strange about the billions of images on the internet: they are free to access but not free to use.
How many sunsets are on the internet? Judging by how many people I’ve seen photographing one with their phones every time there is one to see, a lot, but only a tiny fraction is really free to use.
And that’s certainly where Google and Facebook can shine, by playing dirty and using non-free pictures (like those on Google Maps), even private ones, until their AI becomes good enough to be retrained on free content (if that really matters one day), and where open-source models and data will struggle.
So maybe these copyright “issues” around AI-generated content should be, in some way, ignored, so that everyone gets access to enough content (known as the “whole” internet) to make the technology affordable for more than just the biggest companies.
Re:
How is Google using the photos they took to train an AI ‘playing dirty’? Or are you talking about the photos people upload to maps, where you’re required to grant them a license to use the image for derivative works?
I also maintain that AI doesn’t even need to rely on fair use. If Google, for example, purchases a digital copy of a book and trains the AI on it…it’s just use.
Using Unreliable Training Data
I doubt that anyone is going to agree with me here, but so be it.
You shouldn’t be using the internet (or internet-accessible) public or private data to train your AI! The only solution that prevents the cost of lawsuits and guarantees control over the training process is to come up with your own training data, 100%. What’s that, you say? Too hard? Takes too long? Costs too much? My heart bleeds for you. Trying to make the big bucks without putting in the (admittedly) hard work is the corporate equivalent of school students cheating off each other’s test papers. Don’t complain if your student gets caught hallucinating the wrong answers! There’s a very good reason we’re about 50-100 years away from true, reliable Artificial Intelligence. There are no shortcuts, fair use or no fair use.
Re:
We’d all still be working from an abacus we had to build ourselves if you had your way.
Re:
Yeah… I don’t think that statement works out the way you want it to.
Re:
I will note YOU have been trained off other people’s works, and you are still using other people’s languages.
I know of a number of images that have been uploaded to Wikimedia with a CC0 attribution that, if put into a Google Image Search, will be virtually identical to images with restrictive copyright licensing.
Possibly what this project could do, if they really wanted to ensure license compliance, is also do a reverse image search on every image in the collection, and prune anything for which others have claimed copyright on similar images.
This would in most cases be silly, as many corporations do exactly the opposite and claim copyright over images first published with less restrictions. But at least the training corpus would have a solid backing of good provenance and good-faith pruning. Copyrighted works will still sneak in, but are less likely to skew the output in any meaningful way.
Re:
One thing that could really help with the CC0 attribution would be for the AI to find the provenance of every image that it was trained with. That could really help clear the underbrush. Unfortunately you can’t really know that it didn’t just hallucinate the provenance. Oops.
A good faith defense seems reasonable. Especially if there’s a process to remove it from the database.
It’s cute that you only want the system you fucking brunchlords to change only because it stands in the way of your fucking stock options.
SORRY NOT SORRY you lot are finally facing the problems of content creators writ large.
And I am enjoying every single fucking planck unit of it.
Enjoy the hell you created. We hope you’ll fix it, but you’ll only try to carve out exemptions for your ingroup, so from the rest of us in the hell you created:
WELCOME.
Re:
lolwut
My god, this is literally an article about nothing
“determining what is actually open license is hard. Why are AI companies so bad at this?”
Hilarious.
The absolute state of this shitty site.
Re:
I imagine you’re so hard right now because you just made yet another point-free criticism.
Re:
Yet you keep coming back.
It’s almost as if you have nothing better to do with your life.
Re:
“This food I’ve eaten for the twentieth time still tastes like shit!”
You’re just insulting yourself when you comment here. If you don’t like it but continue to return, either you’re actually getting something out of it and thus lying or a very dumb masochist. Either way isn’t a good look.
Re: Re:
i always go back to the same awful restaurant* so i can keep writing terrible reviews. Doesn’t everyone? What? Don’t they?
[*Really just a restaurant which serves cuisine that i don’t care for. It just makes me angry that other cuisines exist.]
AI is free to use my content as they see fit.
For small entities like programmers or artists, it’s better to drop AI completely than start building your own non-infringing AI databases. The amount of data is just too large for the operation to be suitable for small teams like that. Dropping AI (like I have done) is clearly the right solution. Going with the small data storage is a significantly better solution.