from the tim-berners-lee-would-be-proud dept
Techdirt readers will be very familiar with CERN, the European Organization for Nuclear Research (the acronym comes from the French name of its predecessor body: Conseil Européen pour la Recherche Nucléaire). It’s best known for two things: being the birthplace of the World Wide Web, and home to the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator. Over 12,000 scientists of 110 nationalities, from institutes in more than 70 countries, work at CERN. Between them, they produce a huge quantity of scientific papers. That made CERN’s decision in 2013 to release nearly all of its published articles as open access one of the most important milestones in the field of academic publishing. Since 2014, CERN has published 40,000 open access articles.
But as Techdirt has noted, open access is just the start. As well as the final reports on academic work, what is also needed is the underlying data. Making that data freely available allows others to check the analysis, and to use it for further investigation, for example by combining it with data from elsewhere. The push for open data has been underway for a while, and has just received a big boost from CERN:
The four main LHC collaborations (ALICE, ATLAS, CMS and LHCb) have unanimously endorsed a new open data policy for scientific experiments at the Large Hadron Collider (LHC), which was presented to the CERN Council today. The policy commits to publicly releasing so-called level 3 scientific data, the type required to make scientific studies, collected by the LHC experiments. Data will start to be released approximately five years after collection, and the aim is for the full dataset to be publicly available by the close of the experiment concerned. The policy addresses the growing movement of open science, which aims to make scientific research more reproducible, accessible, and collaborative.
The level 3 data released can contribute to scientific research in particle physics, as well as research in the field of scientific computing, for example to improve reconstruction or analysis methods based on machine learning techniques, an approach that requires rich data sets for training and validation.
CERN’s open data portal already contains 2 petabytes of data, a figure that is likely to rise rapidly, since LHC experiments typically generate massive quantities of data. However, the raw data will not in general be released. The open data policy document (pdf) explains why:
This is due to the complexity of the data, metadata and software, the required knowledge of the detector itself and the methods of reconstruction, the extensive computing resources necessary and the access issues for the enormous volume of data stored in archival media. It should be noted that, for these reasons, general direct access to the raw data is not even available to individuals within the collaboration, and that instead the production of reconstructed data (i.e. Level-3 data) is performed centrally. Access to representative subsets of raw data — useful for example for studies in the machine learning domain and beyond — can be released together with Level-3 formats, at the discretion of each experiment.
There will also be Level 2 data, “provided in simplified, portable and self-contained formats suitable for educational and public understanding purposes”. CERN says that it may create “lightweight” environments to allow such data to be explored more easily. Virtual computing environments for the Level 3 data will be made available to aid the re-use of this primary research material. Although the data is being released using a Creative Commons CC0 waiver, acknowledgements of the data’s origin are required, and any new publications that result must be clearly distinguishable from those written by the original CERN teams.
As with the move to open access in 2013, the new open data policy is unlikely to have much of a direct impact on people outside the high energy physics community. But it does represent an extremely strong and important signal that CERN believes open data must and will become the norm.