Adversarial Attacks on Speech Separation Systems

Document Type

Conference Proceeding

Publication Title

Proceedings - 21st IEEE International Conference on Machine Learning and Applications, ICMLA 2022




Speech separation is a special form of blind source separation in which the objective is to decouple two or more sources so that they are distinct. The need for this ability grows as the use of speech-activated devices increases in everyday life. These systems, however, are susceptible to malicious actors. In this work, we repurpose proven adversarial attacks and leverage them against a combined speech separation and speech recognition system. The attack adds adversarial noise to a mixture of two voices such that the two outputs of the speech separation system are transcribed similarly by the speech recognition system, despite clearly audible differences in the speech. Against ConvTasNet, separation quality degrades by only 0.34 decibels, so the speech recognition system can still operate. When tested against automatic speech recognition, the attack achieves a 64.07% word error rate (WER) against Wav2Vec2, compared to 4.22% for unmodified samples. Against Speech2Text, the WER is 84.55%, compared to 10% for unmodified samples. Measured against the target transcript, the attack achieves a 24.77% character error rate (CER), reduced from 113% CER. This indicates relatively high similarity between the target transcription and the resulting transcription.
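
For readers unfamiliar with the metrics above, the following is a minimal sketch of how WER and CER are commonly computed from an edit distance. It is not the evaluation code used in this work; the function names `edit_distance`, `wer`, and `cer` are illustrative, and the details of text normalization (casing, punctuation) are assumptions left out here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the edit count is divided by the length of the reference, either rate can exceed 100% when the hypothesis is much longer than the reference or shares almost nothing with it; for example, `cer("abc", "abcdefg")` is 4/3. This is one way a baseline figure such as 113% CER can arise.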


machine learning, speech recognition, speech synthesis


Computer Science