Publication Date
Spring 2019
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Melody Moh
Second Advisor
Robert Chun
Third Advisor
Teng Moh
Keywords
hate speech detection, social networks
Abstract
Although current state-of-the-art hate speech detection models achieve praiseworthy results, these models have shown themselves to be vulnerable to attack. Easy-to-execute lexical manipulations, such as the removal of whitespace from a given text, create significant issues for word-based hate speech detection models. In this paper, we reproduce the results of five cutting-edge models as well as four significant evasion schemes from prior work. Only a limited number of evasion schemes that also maintain readability exist, and this works to our advantage in the recreation of the original data. Furthermore, we demonstrate that each lexical attack or evasion scheme can be overcome with our new defense mechanisms, some of which reduce the effectiveness of the scheme to 1%. We also propose a new evasion scheme that outperforms those in previous work, along with a corresponding defense. Using our results as a foundation, we contend that hate speech detection models can be defended against lexically attacked data without the need for significant retraining.
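The whitespace-removal attack mentioned above can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not any of the models or defenses from this project: the lexicon `FLAGGED_WORDS`, the scoring function `word_based_score`, and the helper `remove_whitespace` are hypothetical names introduced only for illustration, and the placeholder words stand in for real abusive terms.

```python
# Toy illustration only: a hypothetical word-level detector,
# not one of the models reproduced in this project.
FLAGGED_WORDS = {"awful", "terrible"}  # placeholder lexicon (assumption)


def word_based_score(text: str) -> float:
    """Return the fraction of whitespace-delimited tokens found in the lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in FLAGGED_WORDS for tok in tokens) / len(tokens)


def remove_whitespace(text: str) -> str:
    """Lexical attack: delete all whitespace so word boundaries disappear."""
    return "".join(text.split())


original = "you are awful and terrible"
attacked = remove_whitespace(original)

print(word_based_score(original))  # two of five tokens match the lexicon
print(word_based_score(attacked))  # the single merged token matches nothing
```

Because the detector only sees whitespace-delimited tokens, the attacked string collapses into one unknown token and its score drops to zero, even though a human reader can still recover the original words.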
Recommended Citation
Khieu, Brian Tuan, "TSAR : A System for Defending Hate Speech Detection Models Against Adversaries" (2019). Master's Projects. 740.
DOI: https://doi.org/10.31979/etd.6tsk-redu
https://scholarworks.sjsu.edu/etd_projects/740