Pierre Michaud - 27/03/2014

Scikit-Learn enhances the intelligence of our systems

Scikit-Learn is a machine learning library written in the Python programming language. This open source tool, created on the initiative of the Parietal team, has been a complete success.

Scikit-Learn went down a storm at the last Strata conference in February 2014. The enthusiastic participants praised both its innovative character and its ease of use. However, before it was unveiled to the public, much work had gone into making this software solution operational. Initially launched in 2007 by a few members of the Python scientific community, the Scikit-Learn project did not properly take off until three years later, spearheaded by Gaël Varoquaux. As part of his work on functional brain imaging within the Parietal team, the researcher required a predictive modeling tool compatible with the Python ecosystem. In 2010, he organised an open, participative development workshop with the aim of making statistical data analysis methods available in open source. Two years later, a stable version went online.

"At the beginning of the project, we set a certain number of objectives", explained Olivier Grisel, expert engineer in the Parietal team. "Firstly, in order to guarantee that the library could be easily installed on a variety of platforms, we worked to ensure that it was neatly packaged. But we also decided to compile extensive documentation on the use of the tool, with practical examples. Finally, to guarantee its correct functioning over the long term, we ensured that all implemented methods were covered by a series of automatic tests. This also allowed us to check that changes to the code base did not introduce bugs."

A tool praised by the entire community

Examples of data classification (raw data on the left) produced by Scikit-Learn

The project soon rallied external contributors from around the world. "Most contributors were researchers or PhD students working on automatic text analysis, computer vision, genetic data, etc.", added Olivier Grisel. "In fact, at present, most contributions come from outside of Inria. An entire community has built up around Scikit-Learn": a community of contributors and users alike, not all of whom have academic backgrounds. Many internet start-ups use Scikit-Learn to predict the consumer behaviour of users, offer product recommendations or detect abusive behaviour (fraud, spam, etc.). This use in industrial applications has been facilitated by the choice of a very permissive open source licence (BSD): companies can not only use the software but also incorporate it into their own products without specific authorisation. Furthermore, the Python ecosystem lends itself particularly well to an iterative approach to data modeling: the user can type one or more lines of script into an interactive shell, run them immediately and view the evaluation result, for example in the form of graphical representations. "This has proven very valuable at the exploratory phase of data analysis", concluded Olivier Grisel.
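
To give a rough idea of what such an interactive session can look like, here is a minimal sketch. It is only an illustration, not code taken from the Parietal team: the choice of dataset (iris), of model (k-nearest neighbours) and of parameters is an assumption made for the example, and the import paths are those of recent Scikit-Learn releases, which may differ from the versions available in 2014.

```python
# Illustrative sketch only: dataset, model and parameters are assumptions
# chosen for the example, not choices described in the article.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset shipped with the library.
X, y = load_iris(return_X_y=True)

# Keep a quarter of the samples aside to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier on the training samples.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Evaluate immediately on the held-out samples and display the result.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Each step can be typed and run on its own in an interactive shell, so a different model or parameter can be tried and evaluated again within seconds.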

Keywords: Machine learning, Development, Parietal team, Python, Open source, Olivier Grisel