
Isabelle Kling - 17/09/2019

Inria researcher co-organized international challenge on vocal cybersecurity

A scientist from Inria Nancy-Grand Est co-organized the largest challenge to date on the detection of faked voice signals. It attracted more than 150 participants from 30 countries, with the aim of making vocal access systems more secure, improving detection technologies, and learning about the most effective faking technologies.

“Access denied, please repeat”… As vocal access systems spread, this sentence is becoming increasingly familiar. Its impact, however, can range from “mildly annoying” when you are ordering a Big Mac at a drive-through, to “very stressful” when you are trying to access an online banking platform from your home computer. The ability of vocal access systems to reliably distinguish faked (or spoofed) voices from genuine ones, and to grant access accordingly, is crucial to ensuring consumers’ trust, and thus to supporting the development of these technologies. As a consequence, the problem mobilises an increasing number of developers and scientists around the world, like MD Sahidullah, from Inria Nancy-Grand Est.

Sahidullah, a researcher from the Multispeech team, co-organised, together with researchers from the international ASVSpoof consortium, the largest challenge to date on the detection of faked audio signals: the Automatic Speaker Verification Spoofing challenge (or ASVSpoof). Its results are being published today, during the 2019 Interspeech conference, and they are expected to bring a significant leap forward in the effectiveness of detection technologies.

Three main spoofing technologies

Vocal access systems can let you into two broad categories of spaces: physical spaces, such as an office building or a gym, and virtual spaces, such as an online banking platform or the ordering service of a fast-food restaurant. The access system for each type of space can be fooled by specific spoofing technologies: playback of recorded audio for physical spaces, and artificial speech synthesis or voice conversion for virtual spaces. The playback of recorded voices is the most primitive voice-faking method, but recent improvements in recording and replaying technologies are making it increasingly difficult to detect. The other two spoofing methods stem from more recent technologies, such as deep neural networks. Artificial speech synthesis, also called “text-to-speech synthesis”, relies on the ability of a computer to automatically generate an artificial voice that “reads” a given text; with voice conversion methods, on the other hand, hackers can transform the real voice of one person into someone else’s voice.

The development of neural networks has given text-to-speech synthesis and voice conversion technologies an impressive boost: they now produce very realistic sound that is hard to detect. Beyond the obvious security threat to virtual spaces, such technologies represent a wider risk of mass manipulation: they are at the core of deepfake videos, which distort real speeches by politicians and celebrities to make them appear to say things they never said.

Participants in the ASVSpoof challenge were given a mix of genuine audio recordings and spoofed audio recordings produced with the three technologies described above. For the first part of the challenge, they were told which was which, so that they could develop and calibrate their detection tools. Those newly optimized detection tools were then used in the second part of the challenge to evaluate whether other, unknown recordings were genuine or not.

The voice, the next biometric marker?

“We expect the voice to become a biometric marker, like fingerprints,” explains MD Sahidullah. “It is very important to develop reliable tools to detect fakes in all circumstances. Challenges such as ASVSpoof gather more than 150 teams around the analysis of the same dataset and allow for very fast technical improvement that would not be possible otherwise.”

The most effective detection technologies that competed in the ASVSpoof challenge use a combination of techniques: using only one technique did not allow participants to score well. Although the details of these combinations are not published, the most effective detection tools for protecting access to virtual spaces all use deep learning technologies, such as convolutional neural networks.

A new metric to define detection efficiency

Beyond the development boost generated by the format of the challenge, its organisers also introduced a new detection-efficiency metric that could become the new norm: the tandem decision cost function (t-DCF). Traditionally, the efficiency of detection systems is measured as the percentage of fake voices accepted and the percentage of genuine voices refused: the lower the percentages, the more efficient the system. However, that metric takes neither the user experience nor the cost of errors into consideration, and does not really fit “real-world situations”.
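To make the traditional measure concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a detector outputs a score between 0 and 1 for each recording (higher meaning "more likely genuine") and that a recording is accepted when its score reaches a threshold. The function name `error_rates` and the scores are hypothetical and are not taken from the challenge.

```python
def error_rates(genuine_scores, spoof_scores, threshold):
    """Return (false acceptance rate, false rejection rate) in percent.

    Assumption: a recording is accepted when its score is >= threshold.
    """
    false_accepts = sum(1 for s in spoof_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = 100.0 * false_accepts / len(spoof_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr


# Hypothetical detector scores, invented for this example.
genuine = [0.91, 0.85, 0.78, 0.96, 0.60]
spoofed = [0.10, 0.35, 0.72, 0.20, 0.55]

far, frr = error_rates(genuine, spoofed, threshold=0.7)
print(f"fake voices accepted: {far:.0f}%, genuine voices refused: {frr:.0f}%")
# prints "fake voices accepted: 20%, genuine voices refused: 20%"
```

Both percentages treat every error as equally serious, which is precisely the limitation described above.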

For example, protecting access to an online banking platform is extremely important. The cost associated with illegal access justifies refusing access at even the slightest doubt about the person's identity. A bank will calibrate its security system so that it would rather refuse access to genuine customers (whose voice might sound a little different because of a slight cold) than grant access to someone whose identity is not 100% certain. So, in addition to its detection system, its access system would be set to very high standards.

The ordering service of a fast-food restaurant, on the other hand, may not require such high security as online banking. The cost of illegal access (and of a fake order for burgers) can be lower than the cost of refusing a genuine client, who will feel frustrated and will not come back. In that situation, the detection system might be the same as for the bank, but the access system will be set to lower standards and accept everybody whose voice is evaluated as being “only” 95% genuine, for example.
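The bank and fast-food examples can be sketched as the same detector operated with different error costs. The sketch below is illustrative only: the scores, the candidate thresholds, and the cost values are invented, and the helper names `expected_cost` and `best_threshold` are hypothetical.

```python
def expected_cost(genuine_scores, spoof_scores, threshold,
                  cost_false_accept, cost_false_reject):
    """Average cost per recording when accepting scores >= threshold."""
    false_accepts = sum(1 for s in spoof_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    n_trials = len(genuine_scores) + len(spoof_scores)
    return (cost_false_accept * false_accepts
            + cost_false_reject * false_rejects) / n_trials


def best_threshold(genuine_scores, spoof_scores,
                   cost_false_accept, cost_false_reject, candidates):
    """Pick the candidate threshold with the lowest expected cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(genuine_scores, spoof_scores, t,
                                    cost_false_accept, cost_false_reject),
    )


# Hypothetical scores from one and the same detection system.
genuine = [0.99, 0.97, 0.96, 0.94, 0.90]
spoofed = [0.10, 0.30, 0.60, 0.92, 0.955]
candidates = [0.50, 0.85, 0.95, 0.965]

# Bank: an illegal access is assumed to cost far more than a refused customer.
print(best_threshold(genuine, spoofed, 100, 1, candidates))  # prints 0.965

# Fast food: a lost customer is assumed to cost more than a fake order.
print(best_threshold(genuine, spoofed, 1, 5, candidates))    # prints 0.85
```

With identical scores, the bank ends up with the strictest threshold and the restaurant with a more lenient one, simply because the assumed costs of the two kinds of error differ.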

The originality of the new metric developed for this challenge is that it combines the measure of the efficiency of the detection system with that of the access system: it is thus closer to real-life conditions. The metric is a cost function: the lower the value, the more efficient the overall system. For the 2019 edition of the ASVSpoof challenge, the access system was fixed, so participants did not need any expertise in access systems. The next edition, likely in 2021, might introduce more sophisticated attacks and conditions even closer to real life.
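The idea of scoring the detection system and the access system together can be illustrated with a deliberately simplified sketch. This is not the official t-DCF formula, which is more detailed. It only assumes that a recording is let in when both systems accept it, that the access system's threshold is fixed, and that the two kinds of error carry different costs. All names, scores, and cost values are hypothetical.

```python
ACCESS_THRESHOLD = 0.5  # assumed fixed, as the access system was in 2019


def tandem_cost(trials, detection_threshold,
                cost_miss=1.0, cost_false_accept=10.0):
    """Average cost per trial of the combined system (simplified).

    trials: list of (kind, detection_score, access_score),
            with kind being "genuine" or "spoof".
    """
    total = 0.0
    for kind, detection_score, access_score in trials:
        accepted = (detection_score >= detection_threshold
                    and access_score >= ACCESS_THRESHOLD)
        if kind == "genuine" and not accepted:
            total += cost_miss            # genuine user turned away
        elif kind == "spoof" and accepted:
            total += cost_false_accept    # faked voice let in
    return total / len(trials)


# Hypothetical trials, invented for this example.
trials = [
    ("genuine", 0.90, 0.80),
    ("genuine", 0.40, 0.90),
    ("spoof", 0.70, 0.90),
    ("spoof", 0.20, 0.95),
]

print(tandem_cost(trials, detection_threshold=0.5))   # prints 2.75
print(tandem_cost(trials, detection_threshold=0.75))  # prints 0.25
```

In this toy setting, a participant can only change the detection threshold (or the detector itself), and the lower the resulting cost, the better the overall system.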