Elsevier Says Downloading And Content-Mining Licensed Copies Of Research Papers 'Could Be Considered' Stealing
from the gotta-protect-that-39%-profit-margin dept
Elsevier has pretty much established itself as the most hated company in the world of academic publishing, a fact demonstrated most recently when all the editors and editorial board resigned from one of its top journals to set up their own, open access rival. A blog post by the statistician Chris H.J. Hartgerink shows that Elsevier is still an innovator when it comes to making life hard for academics. Hartgerink's work at Tilburg University in the Netherlands concerns detecting potentially problematic research that might involve data fabrication -- obviously an important issue for the academic world. A key technique he is employing is content mining -- essentially bringing together large bodies of text and data in order to extract interesting facts from them:
I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started 'bulk' downloading research papers from, for instance, [Elsevier's] Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.
He spread out the downloads over ten days so as not to hammer Elsevier's servers -- which in any case are doubtless pretty beefy given the 39% profit margin the company enjoys:
I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 35KB/s, 0.0021GB/min, 0.125GB/h, 3GB/day.
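Those figures are easy to check. The snippet below is a minimal sketch, not anything from Hartgerink's post: it assumes decimal units (1 GB = 1,000,000 KB) and treats the approximate 30GB and 10 days as exact. The function name `average_load` is made up for illustration.

```python
# Back-of-the-envelope check of the quoted download figures.
# Assumption: decimal units, i.e. 1 GB = 1,000,000 KB.

TOTAL_GB = 30  # approximate total downloaded, per the quote
DAYS = 10      # approximate duration, per the quote


def average_load(total_gb: float, days: float) -> dict:
    """Return the average transfer rate at several time scales."""
    gb_per_day = total_gb / days
    gb_per_hour = gb_per_day / 24
    gb_per_min = gb_per_hour / 60
    kb_per_sec = gb_per_min * 1_000_000 / 60
    return {
        "GB/day": gb_per_day,
        "GB/h": gb_per_hour,
        "GB/min": gb_per_min,
        "KB/s": kb_per_sec,
    }


load = average_load(TOTAL_GB, DAYS)
for unit, value in load.items():
    print(f"{unit}: {value:.4g}")
```

Under those assumptions the numbers come out at roughly 3 GB/day, 0.125 GB/h, 0.0021 GB/min and 35 KB/s, matching the figures in the quote.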
Elsevier's response to this super-considerate researcher is a classic:
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
There are clear parallels with the situation that Aaron Swartz found himself in, but with a key difference. Elsevier is not only stopping Hartgerink from carrying out his research, but threatening to cut off all access to the company's journals and books for everyone working at Tilburg University if he tries to continue. Alicia Wise, Elsevier's Director of Access & Policy, added the following comment on Hartgerink's blog post:
We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping.
When she was asked why it was necessary to use the API, rather than simply downloading articles, she replied:
The reason that we require miners to use the API is so that we can meet their needs AND ALSO the needs of our human users who can continue to read, search and download articles and not have their service interrupted in any way.
But that doesn't make any sense when Hartgerink had taken such pains to avoid any such adverse effects. Moreover, another commenter noted that Elsevier’s API often fails to work, rendering it useless for content mining. Even when it does work:
In many cases the API returns only metadata in the XML, compared to the fulltext PDF I can access on the website. Simply downloading the paper via the normal web service for readers is easy -- much easier than using the API.
What is really at stake here is control. Elsevier wants to be acknowledged as the undisputed gatekeeper for all possible uses of the research it publishes -- most of which was paid for by the public through taxes. And as far as the company is concerned, daring to use that knowledge in new ways without additional permission is simply "stealing."