Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India

from the sci-hub-to-the-rescue-again dept

Fri, Jul 19th 2019 01:33pm - Glyn Moody

Carl Malamud is one of Techdirt’s heroes. We’ve been writing about his campaign to liberate US government documents and information for over ten years now. The journal Nature has a report on a new project of his, which is in quite a different field: academic knowledge. The idea will be familiar to readers of this site: to carry out text and data mining (TDM) on millions of academic articles, in order to discover new knowledge. It’s a proven technique with huge potential to produce important discoveries. That raises the obvious question: if large-scale TDM of academic papers is so powerful, why hasn’t it been done before? The answer, as is so often the case, is that copyright gets in the way. Academic publishers use it to control and impede how researchers can help humanity:

[Malamud’s] unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control — and often limit — the speed and scope of such projects, which typically confine themselves to abstracts, not full text.

Malamud’s project gets around the limitations imposed by copyright and publishers thanks to two unique features. First, Malamud “had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub”. Drawing on Sci-Hub’s huge holdings means his project doesn’t need to go begging to publishers in order to obtain full texts to be mined. Secondly, Malamud is basing his project in India:

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.

India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud’s contention is that this allows him to mine academic material in India without the permission of publishers. But he also believes that his TDM project would be legal in the US:

The data mining, he says, is non-consumptive: a technical term meaning that researchers don’t read or display large portions of the works they are analysing. “You cannot punch in a DOI [article identifier] and pull out the article,” he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.

The fact that TDM is “non-consumptive” means that the unhelpful attitude of academic publishers is even more unjustified than usual. They lose nothing from the analytical process, which is merely extracting knowledge. But from a sense of entitlement publishers still demand to be paid for unrestricted computer access to texts that have already been licensed by academic institutions anyway. That selfish and obstructive attitude to TDM may be about to backfire spectacularly. The Nature article notes:

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

The thing is, if anyone were by any chance interested in reading the full text, there’s an obvious place to turn to. After all, the mining is carried out using papers held by Sci-Hub, so?

Comments on “Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India”

Ninja (profile)

July 19, 2019 at 2:17 pm

So basically instead of copyright promoting the progress of science and useful arts it’s piracy that’s doing so. Despite all the copyright.

Anonymous Coward

July 19, 2019 at 2:44 pm

Re: Re:

The journals will still argue that the only way piracy can enable this is that the journals created the publishing infrastructure in the first place and published the articles. And they’d further add that the quality of the contents of these papers is entirely due to their own hard work as middlemen/brokers.

They’d further argue that Sci-Hub is costing them revenue that is preventing them from doing wonderful things to progress science and the useful arts.

Of course, they’d argue that the lack of unicorns is due to piracy if they felt that could protect their publishing racket.

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

arXiv and other pre-print repositories, on the other hand, WOULD exist anyway. And they’d probably be richer centers of knowledge if the likes of Elsevier didn’t exist.

Interestingly, Elsevier no longer bills themselves as a journal publisher:

"Elsevier is a Dutch information and analytics company…."

bob

July 19, 2019 at 5:42 pm

Re: Re: Re:

And if the journals didn’t exist first you wouldn’t need sci-hub. Instead you would have some other repository to deal with. Depending on how that repository is managed you might still get a sci-hub option.

The journals were very important in the beginning, but now with the internet they are not as necessary. At this point it is just a matter of time until they adjust their business operations or die.

Anonymous Coward

July 20, 2019 at 12:48 am

Re: Re: Re:

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

That’s like saying doctors would have next to nothing if infectious disease didn’t exist.

How the hell is the existence of access-restrictive assholes/infectious disease supposed to be the better scenario?

Anonymous Coward

July 20, 2019 at 4:08 am

Re: Re: Re:

Meanwhile, Sci-Hub would have next to nothing if these journals didn’t exist, so they do have a point.

More a case of Sci-Hub would not exist if the journals did not price gouge those who produce the papers in the journals and make co-operation in any field horribly expensive if the researchers play by the rules.

Anonymous Coward

July 21, 2019 at 6:59 am

Re: Re:

Abolish copyright.

Anonymous Coward

July 22, 2019 at 9:14 am

Re: Re:

So basically instead of copyright promoting the progress of science and useful arts it’s piracy that’s doing so.

Sure, if you accept "piracy" to mean "things copyright holders object to". But if you believe copyright is holding us back, it’s unfair to call Malamud a pirate.

Anonymous Coward

July 20, 2019 at 5:01 am

Researchers need to get published in the journals to get peer review, and they need to get published to get promoted.
People’s career paths are based on the research they publish in certain journals. Libraries have to pay to subscribe to those journals so their students and professors can keep up with advances in research and science. Not all libraries can afford to pay for all the scientific journals.
We need to move to an open, free publishing platform.
Most scientific research is funded by the taxpayer.
Set up a web site like GitHub,
e.g. open science.org.
All research funded by the government or the taxpayer must be published there.
Any scientist or professor can register for free to publish research papers there,
whether they are based in America, Canada or Europe.

At the moment the taxpayer pays for the research, then publicly funded universities have to pay for it.

Anonymous Coward

July 22, 2019 at 1:21 am

"democratizing access to all scientific literature", from Nature

It seems that Carl Malamud is taking on the challenge that cost us the life of Aaron Swartz.

Good luck to him.

rayashcraft

October 19, 2021 at 12:46 am

Copyright has nothing to do with it. A visiting professor (MIT Media Laboratory) simply founded Public.Resource.Org and that’s it. Elsevier, one of the best information analytics systems, for instance, created their own version of Wikipedia. Research paper writers can search ScienceDirect Topics and get Topic pages that are generated automatically for academic paper writing.
Research topics https://domyhomeworkonline.net
