Seven Years Ago, CERN Gave Open Access A Huge Boost; Now It's Doing The Same For Open Data

from the tim-berners-lee-would-be-proud dept

Techdirt readers will be very familiar with CERN, the European Council for Nuclear Research (the acronym comes from the French version: Conseil Européen pour la Recherche Nucléaire). It’s best known for two things: being the birthplace of the World Wide Web, and being home to the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator. Over 12,000 scientists of 110 nationalities, from institutes in more than 70 countries, work at CERN. Between them, they produce a huge quantity of scientific papers. That made CERN’s decision in 2013 to release nearly all of its published articles as open access one of the most important milestones in the field of academic publishing. Since 2014, CERN has published 40,000 open access articles.

But as Techdirt has noted, open access is just the start. As well as the final reports on academic work, what is also needed is the underlying data. Making that data freely available allows others to check the analysis, and to use it for further investigation, for example by combining it with data from elsewhere. The push for open data has been underway for a while, and has just received a big boost from CERN:

The four main LHC collaborations (ALICE, ATLAS, CMS and LHCb) have unanimously endorsed a new open data policy for scientific experiments at the Large Hadron Collider (LHC), which was presented to the CERN Council today. The policy commits to publicly releasing so-called level 3 scientific data, the type required to make scientific studies, collected by the LHC experiments. Data will start to be released approximately five years after collection, and the aim is for the full dataset to be publicly available by the close of the experiment concerned. The policy addresses the growing movement of open science, which aims to make scientific research more reproducible, accessible, and collaborative.

The level 3 data released can contribute to scientific research in particle physics, as well as research in the field of scientific computing, for example to improve reconstruction or analysis methods based on machine learning techniques, an approach that requires rich data sets for training and validation.

CERN’s open data portal already contains 2 petabytes of data, a figure that is likely to rise rapidly, since LHC experiments typically generate massive quantities of data. However, the raw data will not in general be released. The open data policy document (pdf) explains why:

This is due to the complexity of the data, metadata and software, the required knowledge of the detector itself and the methods of reconstruction, the extensive computing resources necessary and the access issues for the enormous volume of data stored in archival media. It should be noted that, for these reasons, general direct access to the raw data is not even available to individuals within the collaboration, and that instead the production of reconstructed data (i.e. Level-3 data) is performed centrally. Access to representative subsets of raw data — useful for example for studies in the machine learning domain and beyond — can be released together with Level-3 formats, at the discretion of each experiment.

There will also be Level 2 data, “provided in simplified, portable and self-contained formats suitable for educational and public understanding purposes”. CERN says that it may create “lightweight” environments to allow such data to be explored more easily. Virtual computing environments for the Level 3 data will be made available to aid the re-use of this primary research material. Although the data is being released using a Creative Commons CC0 waiver, acknowledgements of the data’s origin are required, and any new publications that result must be clearly distinguishable from those written by the original CERN teams.

As with the move to open access in 2013, the new open data policy is unlikely to have much of a direct impact for people outside the high energy physics community. But it does represent an extremely strong and important signal that CERN believes open data must and will become the norm.

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.


Comments on “Seven Years Ago, CERN Gave Open Access A Huge Boost; Now It's Doing The Same For Open Data”

This comment has been deemed funny by the community.
Anonymous Coward says:

This is a disaster! How will scientists be motivated to collect data if their great-grandchildren can’t cash in on the copyrights? How will they pay for their supercolliders, supercomputers, and vacation homes? How can they keep individuals from inferior races from doing science also?

And just imagine, some of the data might be chanted by a rapper without attribution. Or used to remote-control a John Deere tractor.

Stand up and stop the madness! Send your anti-proton to CERN now!

Christenson says:

More! More!

I saw two issues:
a) It’s reasonable for CERN to not want random people with no qualifications implying they are associated with them, just as Techdirt wouldn’t want just anyone implying they do work for Techdirt — but it should be framed as a Trademark issue over confusion, not "must attribute this data".
b) Releasing a reasonable quantity of samples of the basic data from the sensors should be required. This allows important independent checks of the data reduction algorithms to happen. Anyone else remember an ozone hole that was made invisible by certain satellite data reduction algorithms assuming what was seen was a sensor problem?

Given the huge volume of raw data, CERN would be really smart to collocate and possibly allow guests to run their own data reduction at the time the data is taken.
