Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India

from the sci-hub-to-the-rescue-again dept

Carl Malamud is one of Techdirt’s heroes. We’ve been writing about his campaign to liberate US government documents and information for over ten years now. The journal Nature has a report on a new project of his, which is in quite a different field: academic knowledge. The idea will be familiar to readers of this site: to carry out text and data mining (TDM) on millions of academic articles, in order to discover new knowledge. It’s a proven technique with huge potential to produce important discoveries. That raises the obvious question: if large-scale TDM of academic papers is so powerful, why hasn’t it been done before? The answer, as is so often the case, is that copyright gets in the way. Academic publishers use it to control and impede how researchers can help humanity:

[Malamud’s] unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control — and often limit — the speed and scope of such projects, which typically confine themselves to abstracts, not full text.

Malamud’s project gets around the limitations imposed by copyright and publishers thanks to two unique features. First, Malamud “had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub”. Drawing on Sci-Hub’s huge holdings means his project doesn’t need to go begging to publishers in order to obtain full texts to be mined. Second, Malamud is basing his project in India:

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.

India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud’s contention is that this allows him to mine academic material in India without the permission of publishers. But he also believes that his TDM project would be legal in the US:

The data mining, he says, is non-consumptive: a technical term meaning that researchers don’t read or display large portions of the works they are analysing. “You cannot punch in a DOI [article identifier] and pull out the article,” he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.

The fact that TDM is “non-consumptive” means that the unhelpful attitude of academic publishers is even more unjustified than usual. They lose nothing from the analytical process, which is merely extracting knowledge. But from a sense of entitlement publishers still demand to be paid for unrestricted computer access to texts that have already been licensed by academic institutions anyway. That selfish and obstructive attitude to TDM may be about to backfire spectacularly. The Nature article notes:

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.
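To make "pulling out insights without reading the text" a little more concrete, here is a minimal sketch of one kind of non-consumptive analysis: counting how often pairs of terms appear in the same document. It is an illustration only, not a description of how the JNU project works. The function name `cooccurrence_counts`, the term list, and the three sample sentences are all invented toy values.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_counts(documents, vocabulary):
    """Count how many documents mention each pair of vocabulary terms together.

    Only aggregate pair counts are returned; the document text itself
    is never stored or exposed to the caller.
    """
    vocab = {term.lower() for term in vocabulary}
    pair_counts = Counter()
    for text in documents:
        # Crude tokenizer (an assumption): lowercase alphanumeric runs.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        present = sorted(vocab & tokens)
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy stand-ins for full-text articles (hypothetical, not real corpus data).
papers = [
    "BRCA1 mutations are associated with breast cancer risk.",
    "We study BRCA1 expression in ovarian cancer cell lines.",
    "Insulin resistance is a hallmark of type 2 diabetes.",
]
terms = ["BRCA1", "cancer", "insulin", "diabetes"]

counts = cooccurrence_counts(papers, terms)
for pair, n in counts.most_common():
    print(pair, n)
```

The output is a table of term pairs and counts, which is the sort of derived data a researcher could take away while the underlying articles stay on the server.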

The thing is, if anyone were by any chance interested in reading the full text, there’s an obvious place to turn to. After all, the mining is carried out using papers held by Sci-Hub, so...

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.

Companies: sci-hub


Comments on “Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India”

Anonymous Coward says:

Re: Re:

The journals will still argue that piracy can enable this only because the journals created the publishing infrastructure in the first place and published the articles. And they’d further add that the quality of the contents of these papers is entirely due to their own hard work as middlemen/brokers.

They’d further argue that Sci-Hub is costing them revenue that is preventing them from doing wonderful things to progress science and the useful arts.

Of course, they’d argue that the lack of unicorns is due to piracy if they felt that could protect their publishing racket.

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

arXiv and other pre-print repositories, on the other hand, WOULD exist anyway. And they’d probably be richer centers of knowledge if the likes of Elsevier didn’t exist.

Interestingly, Elsevier no longer bills themselves as a journal publisher:

"Elsevier is a Dutch information and analytics company…."

bob says:

Re: Re: Re:

And if the journals didn’t exist first you wouldn’t need sci-hub. Instead you would have some other repository to deal with. Depending on how that repository is managed you might still get a sci-hub option.

The journals were very important in the beginning, but now, with the internet, they are not as necessary. At this point it is just a matter of time until they adjust their business operations or die.

Anonymous Coward says:

Re: Re: Re:

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

That’s like saying doctors would have next to nothing if infectious disease didn’t exist.

How the hell is the existence of access-restrictive assholes/infectious disease supposed to be the better scenario?

Anonymous Coward says:

Re: Re: Re:

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

More a case of Sci-Hub would not exist if the journals did not price gouge those who produce the papers in the journals and make co-operation in any field horribly expensive if the researchers play by the rules.

Anonymous Coward says:

Researchers need to get published in the journals to get peer review and to get promoted.
People’s career paths are based on the research they publish in certain journals. Libraries have to pay to subscribe to those journals so their students and professors can keep up with advances in research and science. Not all libraries can afford to pay for all the scientific journals.
We need to move to an open, free publishing platform.
Most scientific research is funded by the taxpayer, so set up a web site like GitHub, e.g. open science.org.
All research funded by the government or the taxpayer must be published there.
Any scientist or professor can register for free to publish research papers there, whether they are based in America, Canada or Europe.

At the moment the taxpayer pays for the research, then publicly funded universities have to pay for it again.

rayashcraft says:

Copyright has nothing to do with it. A visiting professor (MIT Media Laboratory) simply founded Public.Resource.Org and that’s it. Elsevier, one of the best information analytics systems, for instance, created their own version of Wikipedia. Research paper writers can search ScienceDirect Topics and get Topic pages that are generated automatically for academic paper writing.
Research topics https://domyhomeworkonline.net
