The 2019 Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize: scikit-learn, a success story for free machine learning software

Date: changed on 16/01/2020
With 1,400 contributors across the globe, 42 million visits in 2018, and a ranking as the third most used free machine learning software in the world, scikit-learn has been a huge success, as shown by the consortium of user companies set up to fund its development. The five Inria researchers who form the heart of the team and who have been running it for years have been awarded the Inria-French Academy of Sciences-Dassault Systèmes Innovation Prize.

Machine learning, or statistical learning, is used in artificial intelligence to enable computers to “learn” from the data they analyse in order to devise predictive laws, without those laws having to be programmed in advance. It has a varied range of applications, including product recommendations on Amazon, medical diagnosis, neuroscience, insurance and driverless vehicles.

An alternative to a data scientist shortage

The scikit-learn free software, which has been supported by Inria for ten years, is specifically dedicated to statistical learning. Its design puts it within reach of a vast range of users, not only experts in the field or data scientists. “It is written in the Python programming language, which is used universally on the web. It is very simple to start using, very well documented and illustrated with hundreds of examples. It can be used on any subject”, explains Gaël Varoquaux, project manager. “In other words, it democratises statistical learning.”

The 38-year-old researcher first became involved in the scikit-learn adventure in 2009, joining as a community coordinator and developer. He currently oversees the work of the seven-company consortium (which includes Microsoft, Intel, AXA and BNP Paribas) that was set up in 2018 to fund development of the software. His name and face are known to thousands of data scientists (he appears in online videos), to such an extent that he is sometimes cornered on the metro by people seeking answers to highly specialised technical questions.

Scikit-learn in five dates

  • 2007 - David Cournapeau, a French computing student, publishes his PhD work as a machine learning project, which he christens scikits.learn.
  • 2009 - Inria decides to change the name of the project to scikit-learn.
  • 2011 - The scikit-learn community is made up of several hundred members. Some of these members get together in Spain for a first software sprint (a development meeting).
  • 2015 - scikit-learn ramps up in research and industry, with more than 100,000 monthly users.
  • 2018 - Inria and seven backer companies set up a consortium in order to fund the maintenance and development of scikit-learn. 42 million visits are logged over the course of the year.

Four teammates and joint winners of the prize

Such dizzying success hasn’t gone to his head, however.

“Project manager - OK. Hero - absolutely not. As is the case with all free software, scikit-learn is a collaborative endeavour. It benefits from contributions made by a community of 1,400 contributors, from the constant support of Inria and from the work of my colleagues.”

Varoquaux has four colleagues, each of whom is a joint winner of the prize. 

© Inria / Photo G. Scagnelli
Bertrand Thirion is director of the Parietal laboratory, of which Gaël Varoquaux is a member: “He played a key role in our development, promoting a strategic vision and creating a framework that enabled the team to grow. He encouraged us to focus on the scientific vision, his priority being to create tools for science.”
© Inria / Photo G. Scagnelli
Loïc Estève’s contribution to the project involved meticulously tackling bugs. Since the consortium was set up, he has assisted the four engineers employed full-time to work on maintenance and software development. His work also involves looking for ways to make scikit-learn even more accessible and educational in order to expand its use.
© Inria / Photo E. Invernizzi
Olivier Grisel, in charge of day-to-day code development, is committed to the convergence of his algorithms: “Scikit-learn’s main strength is that it offers a single programming interface capable of running predictive models which are highly varied from a mathematical perspective and which can be deployed in a range of scientific, commercial or industrial applications.”
© Inria / Photo G. Scagnelli
Alexandre Gramfort has been involved in the project from its very beginnings, even if he did leave Parietal for a few years to join Télécom ParisTech's research laboratory. His current contribution involves setting goals for scikit-learn, in addition to providing technical advice on algorithms and scientific computing.

One interface - many models
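
The single programming interface that Olivier Grisel describes can be illustrated with a minimal sketch. The dataset (the small iris dataset bundled with scikit-learn) and the three models are arbitrary choices made for illustration, not ones named in this article. Each model is trained and evaluated through the same fit and score calls, even though the underlying mathematics differs.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, chosen only for illustration (an assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three mathematically different models behind the same interface.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # Same two calls for every model: learn from the training data...
    model.fit(X_train, y_train)
    # ...then measure mean accuracy on data the model has not seen.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Swapping one model for another means changing a single line in the dictionary; the training and evaluation loop stays the same.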

Gaël Varoquaux is also keen to stress the role played by the companies which make up the consortium, even if they were not included in the prize.

“They are backers, not clients. They believe that if scikit-learn delivers more added value to the whole of society, then they will benefit from that. They share their feedback, ask questions which lead to research projects and suggest possible changes. They don’t place orders and they don’t demand anything.”

Building bridges with the corporate world

For scikit-learn’s project manager, the 2019 Innovation Prize is a real achievement: “Researchers whom I admire and who are working on subjects of vital importance have praised our work, which is centred around technology and applications. They too believe it is essential to build bridges with the corporate world and to let societal challenges guide our work.”

Although he has since become a key figure within scikit-learn, Gaël Varoquaux began his career in an entirely different field of research, with a PhD in quantum physics. This required the ability to handle complex data, and Varoquaux quickly discovered how interesting this could be. So much so, in fact, that, in 2008, he made a huge career change, joining Inria’s Parietal team, which specialises in using data taken from MRI images of the human brain.

© Inria / Photo G. Scagnelli

Gaël Varoquaux: a brief biography

Holder of a Master’s degree in quantum physics from the École Normale Supérieure and a PhD in quantum physics from the University of Orsay, Gaël Varoquaux developed a passion for IT and data processing during his studies. In 2008, he decided to change course and joined the Parietal team at the Inria Saclay-Ile-de-France Research Centre, specialists in brain modelling for use in neuroscience. Varoquaux used scikit-learn in his research and was active in coordinating the community of developers. In 2018, he became project manager for the scikit-learn consortium.

New: statistical learning based on “dirty” data

Within the Parietal team, Varoquaux has been a scikit-learn contributor, a community coordinator and project manager. Since the creation of the consortium, his role has become less operational and now takes up roughly 20% of his time.

Gaël Varoquaux, an insatiable researcher with a weakness for a risky bet - such as switching from quantum physics to computing - is making the most of this opportunity to explore a new field: statistical learning based on so-called “dirty” data, termed this way because it is not taken from randomised tests or standardised databases.

“My aim is to devise predictive models based on sources such as questionnaires given to elderly patients on their memory issues or data on hepatitis B vaccinations, the living standards of certain groups within society or the prevalence of liver cancer. The goal here is to supply public health policies with reliable, ready-to-use data.”

Testimonies

Marcin Detyniecki, Director of R&D at AXA

“Scikit-learn is the Swiss army knife of machine learning”

“AXA has somewhere in the region of 300 data scientists, but no doubt several thousand internal users of scikit-learn. This unique tool opens up additional risk prediction techniques to our actuaries, helping us to speed up the compensation process for car accidents or to detect insurance fraud. It’s the Swiss army knife of machine learning! What’s more, it’s open source, and developed by a public science body. What this means is that we’re not at the mercy of a software publisher and benefit from total impartiality. Given this, we felt it was essential for us to participate in the consortium. We couldn’t continue to use scikit-learn without ever contributing to its development.”

Léo Dreyfus-Schmidt, Director of Research at Dataiku

“Exceptional quality documentation”

“Our start-up was launched in 2013 as a collaborative data science platform offering services ranging from data acquisition to the deployment of predictive models in production. As far as machine learning was concerned, we opted for scikit-learn instead of developing our own solution. We were already part of its community of users, and this choice also ensured a level of transparency about the algorithms employed, which our clients really appreciate because it allows them to understand how the tools they are using actually work. Another advantage of the software is that it is so well made that a novice can train in machine learning more quickly and more effectively than would be possible with a conventional course.”

Sébastien Conort, Chief Data Scientist for BNP Paribas Cardif

“Scikit-learn is our reference tool when it comes to machine learning”

“We are proud to be a member of the scikit-learn community and to support this leading machine learning software library. Widely used by our teams of data scientists both in France and in a dozen or so countries worldwide, this reference tool ensures a high level of reliability for the predictive models designed using it. Scikit-learn helps us to create innovative services, including the automated and accelerated processing of supporting documents in the event of a loss. It also improves internal processes, such as dispatching incoming mail or risk monitoring. Our goal is to automate 80% of all our processes between now and 2022.”