An Open Training Set For AI Goes Global
from the fair-trade-ai-training-data dept
As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs) massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work out in legal terms is not yet clear. Although a few court cases involving the use of copyright material for training have been decided, many have not, and the detailed contours of the legal landscape remain uncertain.
However, there is an alternative to this “grab it all” approach. It involves using materials that are either in the public domain or released under a “permissive” license that allows LLMs to be trained on them without any problems. There’s plenty of such material online, but its scattered nature puts it at a serious disadvantage compared to downloading everything without worrying about licensing issues. To address that, the Common Corpus was created and released just over a year ago by the French startup Pleias. A press release from the AI Alliance explains the key characteristics of the Common Corpus:
Truly Open: contains only data that is permissively licensed and provenance is documented
Multilingual: mostly representing English and French data, but contains at least 1 billion tokens for over 30 languages
Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
Extensively Curated: spelling and formatting have been corrected in digitized texts, harmful and toxic content has been removed, and content with low educational value has also been removed.
There are five main categories of material: OpenGovernment, OpenCulture, OpenScience, OpenWeb, and OpenSource:
OpenGovernment contains Finance Commons, a dataset of financial documents from a range of governmental and regulatory bodies. Finance Commons is a multimodal dataset, including both text and PDF corpora. OpenGovernment also contains Legal Commons, a dataset of legal and administrative texts. OpenCulture contains cultural heritage data like books and newspapers. Many of these texts come from the 18th and 19th centuries, or even earlier.
OpenScience data primarily comes from publicly available academic and scientific publications, which are most often released as PDFs. OpenWeb contains datasets from YouTube Commons, a dataset of transcripts from public domain YouTube videos, and websites like Stack Exchange. Finally, OpenSource comprises code collected from GitHub repositories that were permissively licensed.
The initial release contained over 2 trillion tokens – the usual way of measuring the volume of training material, where a token can be a whole word or part of a word. A significant recent update of the corpus has taken that to over 2.267 trillion tokens. Just as important as the greater size is the wider reach: there are major additions of material from China, Japan, Korea, Brazil, India, Africa and South-East Asia. Specifically, the latest release contains data for eight languages with more than 10 billion tokens (English, French, German, Spanish, Italian, Polish, Greek, Latin) and 33 languages with more than 1 billion tokens. Because of the way the dataset has been selected and curated, it is possible to train LLMs on fully open data, which leads to auditable models. Moreover, as the original press release explains:
By providing clear provenance and using permissively licensed data, Common Corpus exceeds the requirements of even the strictest regulations on AI training data, such as the EU AI Act. Pleias has also taken extensive steps to ensure GDPR compliance, by developing custom procedures to enable personally identifiable information (PII) removal for multilingual data. This makes Common Corpus an ideal foundation for secure, enterprise-grade models. Models trained on Common Corpus will be resilient to an increasingly regulated industry.
Another advantage for many users is that material with high “toxicity scores” has already been removed, thus ensuring that any LLMs trained on the Common Corpus will have fewer problems in this regard.
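Since corpora like this are measured in tokens rather than words, a minimal sketch of subword tokenization may help show why the two differ. The vocabulary below is hypothetical and the greedy longest-match scheme is only illustrative; real tokenizers (BPE, WordPiece and similar) learn their vocabularies from data.

```python
# Toy greedy longest-match subword tokenizer, WordPiece-style.
# VOCAB is a made-up vocabulary, just to show that a "token"
# can be a whole word or only part of one.
VOCAB = {"token", "tok", "en", "iz", "ation", "multi", "lingual"}

def tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

print(tokenize("tokenization", VOCAB))  # ['token', 'iz', 'ation']
print(tokenize("multilingual", VOCAB))  # ['multi', 'lingual']
```

So "tokenization" counts as three tokens under this toy vocabulary, which is why token counts for a corpus run higher than its word counts.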
The Common Corpus is a great demonstration of the power of openness and permissive copyright licensing, and how they bring benefits that other approaches can’t match. For example: “Common Corpus makes it possible to train models compatible with the Open Source Initiative’s definition of open-source AI, which includes openness of use, meaning use is permitted for ‘any purpose and without having to ask for permission’. ” That fact, along with the multilingual nature of the Common Corpus, would make the latest version a great fit for any EU move to create “public AI” systems, something advocated on this blog a few months back. The French government is already backing the project, as are other organizations supporting openness:
The Corpus was built up with the support and concerted efforts of the AI Alliance, the French Ministry of Culture as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
This dataset was also made in partnership with Wikimedia Enterprise and Wikidata/Wikimedia Germany. We’re also thankful to our partner Libraries Without Borders for continuous assistance on extending low resource language support.
The corpus was stored and processed with the generous support of the AI Alliance, Jean Zay (Eviden, Idris), Tracto AI, Mozilla.
The unique advantages of the Common Corpus mean that more governments should be supporting it as an alternative to proprietary systems, which generally remain black boxes in terms of where their training data comes from. Publishers, too, would be wise to fund it, since it offers a powerful resource explicitly designed to avoid some of the thorniest copyright issues plaguing the generative AI field today.
Follow me @glynmoody on Mastodon and on Bluesky. Originally published to Walled Culture.
Filed Under: ai, ai training, common corpus, copyright, open licensing, public domain, training data
Companies: common corpus, pleias


Comments on “An Open Training Set For AI Goes Global”
Wow, imagine if they just made those materials available to humans all in one place.
It definitely sounds like an ethical dataset for LLMs, though. Maybe there is some French company that can fix the stack of remaining pressing problems.
The idea is neat, but 1B tokens is very small: GPT-3 was trained on about 500B tokens, and Meta’s Llama-3 was trained on 15 trillion tokens.
But for very small LLMs or for fine-tuning, it could be useful (more tokens or parameters don’t always give more relevant answers).
Re:
It’s 2.25 trillion tokens, so much larger than the GPT-3 corpus. Llama-3 was also trained over multiple epochs, so its unique token count was likely more like 4-5 trillion.
This is like insisting you’re only going to learn how to read by browsing AOOO because you can’t trust the copyright of books you read at the library.
The goal of the publishers in this context is to leverage the thorny copyright issues into some form of income. Pointing to this as an example to the jury of how AI companies could have done it “correctly” but chose not to might be wise; funding it is rather antithetical.
Source-checked, diverse, de-toxified and educational? Is there any chance of building a repository like that for human learning too?
Auditable?
I have been experimenting quite extensively with AI recently and auditability of the results is probably the most fundamental problem I see with current AI models.
How does this data set contribute to auditability, or is this auditability in another sense than what I am thinking?
Re:
It is “auditable” in a vague hand-waving sense that’s convenient for the hypesters trying to shove AI into everything. It’s not “auditable” in any rigorous sense that those of us who do serious work in the field would recognize.
For those who aren’t in the field: true auditability requires the ability to trace the flow of information from input(s) to output(s) in all cases. For example, if an LLM emits the sentence “Brazil is a country in South America” then it must be possible to trace backwards from that sentence to ALL inputs that were used, in any way, to construct it. This works the other way as well: it must be possible to trace every piece of input data through to every output that resulted, in any way, from the use of that piece of data.
Note that this doesn’t guarantee repeatability or reliability, two related qualities. In other words, it doesn’t tell us whether or not the same linkages will exist when we try this tomorrow (because some additional input may have arrived or because some different processing may happen in the model) and it doesn’t tell us whether the output is correct (because there may be bad input data or the model may be broken). What auditability gets us is a method for tackling those two problems. Without it, we have no idea what the model is doing or why, not just today but tomorrow; even if it appears to be working well today, we have no idea whether or not it will fail catastrophically tomorrow.
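The input-to-output tracing described above can be made concrete with a toy sketch. This is a hypothetical retrieval-style pipeline, not how an LLM works internally (which is exactly why LLMs are hard to audit in this rigorous sense): every document carries an ID, and every output records the IDs of all inputs used to construct it.

```python
# Minimal sketch of input-to-output provenance tracking.
# The corpus, IDs, and query mechanism are all made up for illustration.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str

@dataclass
class Answer:
    text: str
    provenance: set = field(default_factory=set)  # IDs of contributing inputs

def answer_from(documents, query):
    """Build an answer from every document mentioning the query term,
    recording exactly which inputs were used."""
    used = [d for d in documents if query.lower() in d.text.lower()]
    return Answer(
        text=" ".join(d.text for d in used),
        provenance={d.doc_id for d in used},
    )

corpus = [
    Document("gov-001", "Brazil is a country in South America."),
    Document("sci-042", "The Amazon basin lies largely within Brazil."),
    Document("web-777", "Mount Fuji is in Japan."),
]

ans = answer_from(corpus, "Brazil")
print(ans.provenance)  # the complete set of inputs behind this output
```

In a system like this, tracing works in both directions: from any answer back to its inputs, and from any input forward to the answers it influenced. An LLM, by contrast, dissolves its training data into weights, so no such mapping survives.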
This is so fucking cool I love it