Free software

28/05/2019

Guix 1.0, one step closer to scientific reproducibility in HPC

Reproducing scientific experiments that rely heavily on computers has become practically impossible because software, versions and features evolve constantly. Version 1.0 of the free software Guix, released in May 2019, tackles this challenge head-on, particularly for high performance computing (HPC), by enabling users to “go back in time” and retrieve previous versions of programs. To find out more, we spoke to Ludovic Courtès from Inria Bordeaux, one of the main figures behind this group project.

A lack of reproducibility hinders scientific research 

Let’s take a closer look at what the term “reproducibility” actually means in a scientific context. It refers to the ability to carry out an experiment in a different location or at a different time under identical experimental conditions. This is child’s play if all you are doing is mixing three chemical products together on a lab bench, but it becomes far more complex when an experiment involves tens or even hundreds of different programs, which is commonplace, and when HPC becomes necessary, with computer clusters linking hundreds of machines.

“HPC is becoming the standard in fields such as computer simulation, genomics or bioinformatics, and the reproducibility of experiments has become a controversial issue. In scientific research, it is essential that we are able to verify published results and to evaluate their sensitivity to certain parameters. Otherwise, our work is hindered”, explains Ludovic Courtès.

The limitations of containers and package managers

Two types of tools are currently used to tackle this issue. The first, package managers, describe the application-related aspects of the experiment: libraries, compilers, source code, and so on. However, since researchers have no control over which equipment will be used, they do not know which software versions will be in use. “This is problematic, because each program has an impact on the behaviour of the others, as well as on the overall result”, explains Ludovic Courtès.
Containers, the second available tool, are more comprehensive. A container houses both the application and all of the programs on which it depends, including part of the operating system. Reproducibility is total, but rigid. This “black box” does not spell out its specifications in detail, so its parameters cannot be changed in order to test other hypotheses. This leaves researchers feeling frustrated.

A comprehensive, reliable and modifiable description of the experiment

So where does Guix 1.0 fit into all of this? “This is a tool that takes a package manager's approach, but with all the benefits of a container”, explains Ludovic Courtès. Like a package manager, it describes the different aspects of an experiment, which allows changes to be made. Like a container, it is comprehensive and faithful to the original experiment. You can even “go back in time” to find a previous version of a program, for example the one used at the start, should you so require.
This impressive undertaking came about as a result of a partnership with Software Heritage, the collaborative, universal and permanent software archiving project launched in 2016 with support from Inria and Unesco. “If the URL for the site hosting the code is no longer valid or if the development platform has closed, Guix acts transparently to retrieve the original source code from our archives”, explains Roberto Di Cosmo, CEO of Software Heritage. “This means there is no need to change the package descriptions.”
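The fallback described above can be pictured with a small Python sketch. It is not Guix's actual implementation (Guix is not written in Python); it only illustrates the general idea of trying the original location first, then an archive, and accepting whichever copy matches a hash recorded in the package description. All names here (`fetch_with_fallback`, `dead_upstream`, `archive`) and the sample data are hypothetical.

```python
import hashlib
from typing import Callable, Iterable

# A fetcher is any zero-argument callable that returns the source bytes
# or raises OSError when the location is unreachable (hypothetical interface).
Fetcher = Callable[[], bytes]


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fetch_with_fallback(expected_sha256: str, fetchers: Iterable[Fetcher]) -> bytes:
    """Try each fetcher in order; return the first result whose hash matches."""
    failures = 0
    for fetch in fetchers:
        try:
            data = fetch()
        except OSError:
            failures += 1  # location is gone; move on to the next one
            continue
        if sha256_hex(data) == expected_sha256:
            return data
        failures += 1  # content differs from what the description recorded
    raise LookupError(f"no location provided the expected source ({failures} failed)")


# Toy data standing in for a source archive and its recorded hash.
SOURCE = b"print('hello')\n"
EXPECTED = sha256_hex(SOURCE)


def dead_upstream() -> bytes:
    # Simulates a hosting site whose URL is no longer valid.
    raise OSError("upstream URL no longer valid")


def archive() -> bytes:
    # Simulates a long-term archive that still holds the original bytes.
    return SOURCE


recovered = fetch_with_fallback(EXPECTED, [dead_upstream, archive])
```

Because the check is on the content hash rather than on the URL, the package description itself does not need to change when the original host disappears, which matches the point made in the quote.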

10,000 free programs available

Since 2012, Guix has been developed by a community of more than 250 contributors, whose proposals and code are reviewed in turn by their peers. It currently provides somewhere in the region of 10,000 free programs, including a section dedicated to genomics and bioinformatics. It is the fruit of two years spent working with laboratories overseas*, who played an active role in the development of this open 1.0 version for HPC.
These teams have really been able to reap the benefits of this. “Increasingly, our experiments involve the use of programs hosted on third-party sites”, explains Altuna Akalin, a researcher at the Max Delbrück Center for Molecular Medicine (Berlin). “Guix enables us to accurately reconstruct their software environment and takes the painstaking job of verification off our hands.”
The German researchers used the tool to reproduce four current DNA sequencing experiments, using different machines and generating the code more than 300 times. In 97.8% of cases, the generated code was identical right down to the last bit, and many of the residual bugs are easy to correct. We don't yet have full reproducibility, but Guix 1.0 is a major step towards making that dream a reality.
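To make “identical right down to the last bit” concrete, here is a minimal Python sketch of one way such a rate could be measured: hash a reference build, hash each rebuild, and count the matches. This is an assumption about the general technique, not the researchers' actual procedure, and the byte strings below are toy data rather than results from the study.

```python
import hashlib
from typing import Sequence


def digest(artifact: bytes) -> str:
    """Return the SHA-256 digest of a build output as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def identical_fraction(reference: bytes, rebuilds: Sequence[bytes]) -> float:
    """Fraction of rebuilds that are bit-for-bit identical to the reference."""
    if not rebuilds:
        raise ValueError("need at least one rebuild to compare")
    expected = digest(reference)
    matches = sum(1 for artifact in rebuilds if digest(artifact) == expected)
    return matches / len(rebuilds)


# Toy data: three identical rebuilds and one that differs
# (for instance because a timestamp leaked into the output).
reference_build = b"toy-binary"
rebuilds = [reference_build, reference_build, reference_build, b"toy-binary-with-timestamp"]

rate = identical_fraction(reference_build, rebuilds)
print(f"{rate:.1%} of rebuilds are bit-for-bit identical")  # prints 75.0% ...
```

A single differing byte changes the digest, so this comparison is as strict as the bit-level claim in the text.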

* Max Delbrück Center for Molecular Medicine (Germany) and Utrecht Bioinformatics Center (the Netherlands)  
