Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Teng Moh
Second Advisor
Melody Moh
Third Advisor
Amith Kamath Belman
Keywords
Data Poisoning, Mislabeling, Injection, Recurrent Neural Network, Support Vector Machine, Resource Scheme Multilinear Regression
Abstract
Data poisoning occurs in various datasets; however, it is more challenging to detect poisoning in textual datasets than in image datasets. This paper focuses on how to detect poisoning in textual datasets. We consider four poisoning attacks: mislabeling, injection, targeted, and non-targeted. Recurrent Neural Networks (RNN), Support Vector Machine (SVM), and Resource Scheme Multilinear Regression (RSMLR) are used to detect poisoning. A custom RNN class containing an encoder and a decoder was created for the RNN. For the SVM, 10% of the dataset was used to determine whether the rest of the dataset was poisoned. For the RSMLR, the target column was processed separately from the rest of the dataset. To improve each model, a threshold equation was used to determine which poisoned data needed to be flagged. Using the best parameter values, the models were then applied in Federated Learning (FL) with multiple passes and shuffling. Based on the experimental results, using the RNN and SVM together with shuffling yields the best results for poisoning attacks. The RSMLR had the poorest performance overall but performed well when detecting poisoning in shuffled datasets. In the model shuffling experiment, the models yield average accuracies of 41% for mislabeling datasets, 92% for injection datasets, and 68% for targeted datasets. For non-targeted attacks, both the RNN and the SVM yield accuracies of 100%.
Recommended Citation
Kotturu, Ajeet, "Detection and Mitigation for Poisoned Textual Datasets" (2025). Master's Projects. 1607.
DOI: https://doi.org/10.31979/etd.uara-bjzm
https://scholarworks.sjsu.edu/etd_projects/1607