Speech Recognition and Privacy

Changed on 13/04/2021

Emmanuel Vincent, a Senior Research Scientist at Inria Nancy – Grand Est, is the coordinator of the European project COMPRISE, which aims to develop privacy-preserving speech recognition algorithms. Interview with a scientist who is dreaming up original ways of ensuring that our voice assistants don’t unwittingly turn into spies.

Portrait of Emmanuel Vincent surrounded by loudspeakers. © Emmanuel Vincent

He speaks softly, clearly, at a measured pace, using words intended to educate. Perhaps this is because Emmanuel Vincent is well placed to understand the importance of intelligible speech. To listen to this researcher is to discover how audio signal modelling combined with artificial intelligence makes it possible to develop voice recognition apps. Emmanuel Vincent is contributing to this endeavour and brings to it the expertise that he developed before joining Inria Nancy – Grand Est.

A combination of mathematical and musical education

Having studied mathematics in college, he was drawn to the challenges posed by audio signals and earned a DEA[*] in acoustics, signal processing and computer science applied to music. This degree was obtained in 2001 at IRCAM[*] and prepared him for his doctoral work on the separation of musical sources. What could be more natural for this scientist, a former student of the Conservatoire and an accomplished musician, than to take an interest in this field? “Sound source separation allows you to identify the contributions of different instruments to a musical recording and to restore the atmosphere of a concert by working on sound distribution in that space,” explains the researcher, whose musical knowledge supplements his scientific expertise. “I came across the topic when I was doing a post-doctorate at Queen Mary University of London, where I also designed a new method for coding musical audio at a very low bit rate.”

Emmanuel is pursuing his research work at Inria, which he joined in 2006, initially at the Rennes – Bretagne Atlantique Centre. While on the METISS[*] team, which works on audio signal processing, he became particularly interested in speech. After six years in Brittany, he moved to Lorraine and joined MULTISPEECH[*]. The work of this team, directed by Denis Jouvet, explores the many aspects of voice and sound processing, with particular attention to the challenges of “multi-source” speech (processing conversations between multiple individuals), “multilingual” speech (learning foreign languages), and “multimodal” speech (enabling lip-syncing). Their research has applications to the design of hearing aids and hands-free voice assistants such as those found in smart speakers, and to the identification of sound events for security cameras.

A cybersecurity challenge

Emmanuel Vincent’s special interest lies in combining mathematical models and artificial intelligence algorithms, particularly deep learning. “To be effective, these algorithms require a large amount of data in the learning phase. And speech data convey sensitive information about ourselves, our preferences, our friends and family, etc.,” says the researcher. Malicious algorithms (deepfakes) are even capable of counterfeiting our voices, which raises serious issues regarding our individual and collective security and freedom.

Privacy-preserving algorithms

“Our research focuses on developing machine learning methods that protect personal data, for example by deleting from a conversation any word that reveals the person's identity and keeping only what’s really useful for the algorithm, or by distorting the voice to mask the speaker's identity,” says Emmanuel Vincent. The scientific challenge is therefore to develop algorithms that can learn as much as possible from as little data as possible – and that can defeat the latest biometric methods which are able to recognise modified voices!
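The first idea mentioned above, removing identifying words from a conversation before it is used for training, can be illustrated with a very small sketch. This is not COMPRISE's actual method: the function name `redact`, the hand-written `SENSITIVE_TERMS` table and the placeholder labels are all assumptions made for illustration, and a real system would have to detect sensitive words automatically rather than rely on a fixed list.

```python
import re

# Hypothetical table of identifying words and the placeholders that replace
# them. Keys must be lowercase. A real system would detect such words
# automatically instead of hard-coding them.
SENSITIVE_TERMS = {
    "emma": "[NAME]",
    "nancy": "[CITY]",
}


def redact(transcript: str, terms: dict[str, str] = SENSITIVE_TERMS) -> str:
    """Replace each whole-word occurrence of a sensitive term with its placeholder."""
    if not terms:
        return transcript
    # One case-insensitive pattern matching any listed term as a whole word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda match: terms[match.group(0).lower()], transcript)


print(redact("Call Emma and book a train to Nancy for Friday"))
# prints: Call [NAME] and book a train to [CITY] for Friday
```

The placeholders keep the structure of the sentence, so what remains (a request to call someone and book a train) is still useful for training, while the words that point to a particular person or place are gone.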

These are the scientific and technical challenges faced by the European project COMPRISE, which Emmanuel is coordinating. With a €3 million budget over three years (2018-2021), the project gathers some 30 researchers and engineers from MULTISPEECH and MAGNET[*], Saarland University (Germany) and four European firms specialising in software development and legal compliance for data processing. The project, which has a strong application focus, aims to develop voice assistants that are of interest to mobile app developers, online retailers, and the medical sector.

“We wouldn’t have been able to get this project off the ground without the support and guidance of Inria’s European Partnerships Office, and I could not have run it effectively without the help of our project manager Zaineb Chelly, who carries out the coordination, management and communication tasks required by such a project,” says Emmanuel before concluding: “As coordinator, I am fortunate to be able to guide the research in the direction I had imagined.” A direction that reveals an ethical dimension to the work of this researcher, who has only one thing to say: it is possible to design algorithms that are both effective and respectful of users’ rights.

[*] DEA: Diplôme d’études approfondies, a post-graduate degree required for admission to a Ph.D. programme.

IRCAM: Institut de recherche et coordination acoustique/musique (Institute for Research and Coordination in Acoustics/Music).

METISS: Modélisation et expérimentation pour le traitement des informations et des signaux sonores (Modelling and Experimentation for the Processing of Information and Audio Signals) was a joint team from Université de Rennes 1, Inria Rennes – Bretagne Atlantique, and the CNRS, directed by Frédéric Bimbot.

MULTISPEECH: Speech Modeling for Facilitating Oral-Based Communication is a joint project team of Université de Lorraine, Inria Nancy – Grand Est, and the CNRS.

MAGNET: Machine Learning in Information Networks is a joint team from Inria Lille – Nord Europe and Université de Lille, directed by Marc Tommasi.