Learning from health data
In the health sector, and in digital health in particular, multicentre studies usually require data from several establishments or countries to be brought together on a single server, for example to identify risk factors for dementia due to Alzheimer's disease or to study the functioning of certain bacteria. However, assembling health data in a single location raises confidentiality issues.
An alternative would be to process each institution's data on-site, without having to anonymise or transfer it. This is one of the topics that the 25 members of the Magnet (Machine Learning in Information Networks) project team, which includes researchers, engineers, and post-doctoral students, have been working on at the Inria Centre at the University of Lille since 2016. Their aim is to facilitate a more ethical and protective use of personal data and of new artificial intelligence and machine learning algorithms, and thereby make these algorithms suitable for use in areas where confidentiality is critical.
Centralised learning has its limits
Machine learning is highly prized by the major players in digital technology, who hold huge databases containing the usage data from their services. It is used in an increasing number of complex processes, of which classic examples include natural language text analysis, voice recognition, and the representation of certain data in the form of graphs. It is based on algorithms that are trained using vast amounts of centralised data.
However, such centralisation is not well suited to the most sensitive sectors, such as healthcare. The first problem concerns the transfer itself:
“It is difficult to transfer certain sensitive and protected data, particularly hospital data, to a third party”, says Aurélien Bellet, a researcher on the project team.
The second is that “when the transferred data is anonymised, it risks losing some of its inherent value. In addition, anonymisation techniques are not perfect and there is always a risk that the learned model could allow the sensitive data used in the training to be traced”, adds the researcher.
Towards federated learning
With federated machine learning, “the innovation lies in the creation of learning algorithms capable of operating with data stored on the network without having to transfer it to a single location”, explains Marc Tommasi, Magnet Team Leader.
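The idea can be illustrated with a minimal sketch of federated averaging, one common federated learning scheme. The article does not say which algorithms Magnet uses, so this is only an illustration under stated assumptions: the three "sites", their synthetic data, the simple linear model, and all function names are hypothetical. Each site trains on its own data and sends back only model parameters, which a coordinator averages.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one site's own data.

    Only the resulting parameters (w, b) leave the site, never the records.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(updates, sizes):
    """Average the sites' parameters, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_site_data(n, x_low, x_high, rng):
    """Hypothetical synthetic data: y = 2x + 1 plus a little noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(x_low, x_high)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data

def train(sites, rounds=200):
    """Alternate local training at every site with central averaging."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in sites]
        sizes = [len(data) for data in sites]
        global_weights = federated_average(updates, sizes)
    return global_weights

rng = random.Random(0)
# Three sites whose inputs cover different ranges (heterogeneous data).
sites = [
    make_site_data(40, 0.0, 1.0, rng),
    make_site_data(60, 0.5, 1.5, rng),
    make_site_data(30, 1.0, 2.0, rng),
]
w, b = train(sites)
print(f"learned slope={w:.2f}, intercept={b:.2f} (data generated with 2 and 1)")
```

In this sketch the raw `(x, y)` pairs never leave `local_update`; only the pair `(w, b)` is exchanged. Exchanging parameters does not by itself guarantee confidentiality, so a real deployment would need additional protections.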
The project team won a European project with REACT-EU and, last year, won a call for proposals by the CNIL (French National Commission for Information Technology and Civil Liberties) to “deploy federated calculation algorithms within a network of university hospitals in the context of decentralised multicentre clinical trials”, a clear sign of the interest in these new methods. This work follows on from an exploratory action called Flamed (Federated Learning and Analytics on Medical Data), which aims to explore a decentralised approach to artificial intelligence applied to the health sector.
Conserving the sovereignty of health data
Why is this important? “The hospitals that Magnet works with, which form the G4 Health Cooperation Group (the university hospitals of Amiens, Caen, Lille, and Rouen), want to avoid centralisation and retain control over the data they collect through their activities”, explains Marc Tommasi.
The researcher adds:
“From a legal point of view, given the legal requirements that must be respected when processing health data outside of the establishment, federated learning should also facilitate the organisation of multicentre studies (which are conducted using data belonging to several hospitals).”
The recent work with the CNIL has helped the Magnet researchers to assess the risks concerning the protection of personal data and the measures to be implemented in order to comply with the regulations in force, such as the GDPR (General Data Protection Regulation).
Formalising reusable methods
One of the research challenges facing Magnet's researchers is the design of algorithms that protect the confidentiality of the data being processed and take into account its heterogeneity. The characteristics of the data can vary from one institution to another depending on their habits, specialities or care policies. In addition, a certain amount of fairly complex engineering work has to be carried out with each of the participating hospitals, “in particular so that the federated learning algorithms can be run within their own information systems (and therefore pass through firewalls) and then communicate with the learning models trained in other places”, explains Marc Tommasi.
The aim of the projects conducted with the university hospitals and the CNIL is “to develop methods based on real use cases in order to carry out multicentre medical studies without having to transfer the data”, explains Aurélien Bellet. “The formalisation of the approaches used in the pilot studies should serve as a basis for other studies of the same type in the future, in health or other fields”. For Magnet, the aim is also to propose an open-source infrastructure and a “federated learning library” that can be used by other university hospitals, public organisations or companies that want to work with decentralised data. There are a great number of potential users.
Marc Tommasi is a professor of Computer Science at the University of Lille. He has led the Magnet project team, where he mainly works on machine learning, since 2016. Magnet is a joint team between the Inria Centre at the University of Lille and the CRIStAL laboratory (Lille Research Centre for Computer Science, Signal and Automation, CNRS UMR 9189).
Aurélien Bellet is a researcher in the Inria Magnet project team who specialises in machine learning theory and algorithms. He is particularly interested in the design of privacy-friendly algorithms in a federated and decentralised learning context.
Find out more
- Le « bac à sable » données personnelles de la CNIL, 2022 (in French)
- Aurélien Bellet et Marc Tommasi : l’apprentissage fédéré, un « nouveau paradigme pour l’apprentissage machine », Laboratoire d’innovation numérique de la Cnil (LINC), 3/4/2022 (in French)
- Chacun chez soi et les données seront bien gardées : l’apprentissage fédéré, Laboratoire d’innovation numérique de la Cnil (LINC), 4/4/2022 (in French)
- Aurélien Bellet - Decentralized and Privacy-Preserving Machine Learning, Conférence IA - Institut Henri Poincaré, 16 and 17/11/2021
- Apprentissage fédéré pour les données médicales, Gilles Wainrib, Collège de France, 14/3/2018 (in French)
- Apprentissage fédéré : une nouvelle approche de l’apprentissage machine, Yann Bocchi, Haute École spécialisée de Suisse occidentale, 11/08/2021 (in French)