The problem is non-textual data. Images, for example, are not in CommonCrawl.
Because hosting only URLs, rather than the content itself, gives much better protection against copyright and other infringement claims, basically everybody hosts collections that are just lists of URLs, forcing every user to re-download everything. COYO, for example, takes that approach, and many of the images are unreachable: https://github.com/kakaobrain/coyo-dataset/tree/main/download##missing-images
I enjoyed your post, but I agree with others that it was a bit long and repetitive; it would have benefited from being more succinct. Still, I agree with most of the gist of it.