Author

Eshaq Jamdar

Publication Date

Fall 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Amith Kamath Belman

Second Advisor

Fabio Di Troia

Third Advisor

Wilson Tang

Keywords

Voice Authentication, Data Poisoning, Biometrics, SVM, Machine Learning, Automated Speaker Verification, HiFi GAN

Abstract

Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems such as banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, “Voice Pops”, aims to distinguish an individual’s unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The first iteration of this attack exploits the feature extraction process of VA+VoicePop by poisoning 20% of training samples labeled "spoof", embedding a 90 Hz wave throughout the audio in order to confuse the model’s notion of what is considered legitimate audio. This attack proved successful: system accuracy dropped by 55%, underscoring the need for more robust training algorithms. However, the method faces two main obstacles: it struggles to pass the human test, and it is easily detected because the attack is very broad and introduces many unnatural peaks in the audio. This led to the development of SyntheticPop+FS2, which leverages the temporal stability of FS2 and introduces SyntheticPops at phoneme locations. Our testing shows that, while it does not cause significant model collapse like SyntheticPop, SyntheticPop+FS2 still produces a respectable 11.86% decrease in accuracy at the same total poisoning rate (20%).
In addition, the strategic placement of SyntheticPops yields an audio signal with less recognizable patterns, and the addition of a small amount of Gaussian noise helps mask energy signatures that could indicate the use of FS2.
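
As a rough illustration of the first SyntheticPop iteration described above, the sketch below mixes a 90 Hz sine wave into a random 20% of the clips labeled "spoof". It is not the project's implementation: the function names, the amplitude of 0.05, the representation of audio as lists of floats, and the reading of "20%" as a fraction of the spoof-labeled clips are all assumptions made for illustration.

```python
import math
import random


def embed_tone(samples, sample_rate, freq_hz=90.0, amplitude=0.05):
    """Return a copy of `samples` with a sine tone mixed in over the whole clip.

    The 90 Hz default follows the abstract; the amplitude is an assumed value.
    """
    return [
        s + amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n, s in enumerate(samples)
    ]


def poison_subset(dataset, labels, sample_rate,
                  target_label="spoof", fraction=0.20, seed=0):
    """Embed the tone in a random `fraction` of the clips carrying `target_label`.

    Returns the new dataset and the sorted indices of the poisoned clips.
    Clips that are not selected are copied unchanged.
    """
    rng = random.Random(seed)
    candidates = [i for i, lab in enumerate(labels) if lab == target_label]
    n_poison = round(fraction * len(candidates))
    chosen = set(rng.sample(candidates, n_poison))
    poisoned = [
        embed_tone(clip, sample_rate) if i in chosen else list(clip)
        for i, clip in enumerate(dataset)
    ]
    return poisoned, sorted(chosen)
```

Because the tone spans the entire clip, a perturbation of this kind is broad and easy to spot, which is the limitation that motivates the phoneme-localized SyntheticPop+FS2 variant.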

Available for download on Saturday, December 19, 2026
