Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Amith Kamath Belman
Second Advisor
Fabio Di Troia
Third Advisor
Wilson Tang
Keywords
Voice Authentication, Data Poisoning, Biometrics, SVM, Machine Learning, Automated Speaker Verification, HiFi GAN
Abstract
Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems such as banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, "Voice Pops", aims to distinguish an individual’s unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The first iteration of this attack exploits the feature extraction process of VA+VoicePop by poisoning 20% of training samples labeled "spoof": a 90 Hz wave is embedded throughout the audio in order to confuse the model’s notion of what counts as legitimate audio. This attack proved successful, as system accuracy dropped by 55%, underscoring the need for more robust training algorithms. However, the method faces two main obstacles, the human test and automated detection: because the attack is applied so broadly, it leaves many unnatural peaks in the audio and is easily detectable. This led to the development of SyntheticPop+FS2, which leverages the temporal stability of FS2 and introduces SyntheticPops at phoneme locations. Our testing shows that, while it does not cause significant model collapse as SyntheticPop does, SyntheticPop+FS2 still yields a respectable 11.86% decrease in accuracy under the same total poison rate (20%). In addition, the strategic placement of SyntheticPops produces an audio signal with fewer recognizable patterns, and the addition of a small amount of Gaussian noise helps mask energy signatures that could indicate the use of FS2.
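As a rough illustration of the first-iteration poisoning step described in the abstract, the sketch below mixes a 90 Hz sine tone across the full length of a randomly chosen 20% of the clips labeled "spoof". It is not the author's implementation. The function names (`add_tone`, `poison_spoof_subset`), the tone amplitude, the 16 kHz sample rate, the toy noise clips, and the reading of "20%" as a fraction of the spoof-labeled clips are all assumptions made for illustration.

```python
import numpy as np

def add_tone(audio, sample_rate, freq_hz=90.0, amplitude=0.05):
    """Return a copy of `audio` with a constant sine tone mixed in over its full length.

    The amplitude is an assumed value; the abstract specifies only the 90 Hz frequency.
    """
    t = np.arange(audio.shape[0]) / sample_rate
    tone = amplitude * np.sin(2.0 * np.pi * freq_hz * t)
    # Keep the result inside the usual [-1, 1] floating-point audio range.
    return np.clip(audio + tone, -1.0, 1.0)

def poison_spoof_subset(clips, labels, sample_rate, fraction=0.2, seed=0):
    """Mix the tone into a random `fraction` of the clips labeled "spoof".

    Returns the new list of clips and the sorted indices that were modified.
    """
    rng = np.random.default_rng(seed)
    spoof_idx = [i for i, lab in enumerate(labels) if lab == "spoof"]
    n_poison = int(round(fraction * len(spoof_idx)))
    chosen = sorted(rng.choice(spoof_idx, size=n_poison, replace=False).tolist())
    poisoned = list(clips)  # shallow copy; untouched clips are reused as-is
    for i in chosen:
        poisoned[i] = add_tone(clips[i], sample_rate)
    return poisoned, chosen

# Toy data standing in for a training set: one-second noise clips (hypothetical).
sample_rate = 16000
data_rng = np.random.default_rng(1)
clips = [0.1 * data_rng.standard_normal(sample_rate) for _ in range(10)]
labels = ["spoof"] * 5 + ["genuine"] * 5

poisoned, chosen = poison_spoof_subset(clips, labels, sample_rate)
```

Because the tone is applied to the whole clip at a fixed frequency, it shows up as a single strong spectral peak, which is consistent with the abstract's remark that this broad variant is easy to detect.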
Recommended Citation
Jamdar, Eshaq, "SyntheticPop: An Investigation Into Poisoning Automated Speaker Verification Systems" (2025). Master's Projects. 1618.
DOI: https://doi.org/10.31979/etd.zrgc-3rzy
https://scholarworks.sjsu.edu/etd_projects/1618