Master of Science (MS)
Although current state-of-the-art hate speech detection models achieve praiseworthy results, these models have shown themselves to be vulnerable to attack. Easy-to-execute lexical manipulations, such as the removal of whitespace from a given text, create significant issues for word-based hate speech detection models. In this paper, we reproduce the results of five cutting-edge models as well as four significant evasion schemes from prior work. Only a limited number of evasion schemes that also maintain readability exist, and this works to our advantage in the recreation of the original data. Furthermore, we demonstrate that each lexical attack or evasion scheme can be overcome with our new defense mechanisms, with some defenses reducing the effectiveness of the scheme to 1%. We also propose a new evasion scheme that outperforms those in previous work, along with a corresponding defense. Using our results as a foundation, we contend that hate speech detection models can be defended against lexically attacked data without the need for significant retraining.
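The whitespace-removal evasion mentioned above can be illustrated with a minimal sketch. The detector, blocklist, and inputs below are hypothetical toy stand-ins, not the models or datasets studied in the paper; they only show why a word-based matcher fails once token boundaries disappear.

```python
import re

# Toy word-based detector: flags text containing any blocklisted token.
# (Illustrative assumption only; the paper's models are learned classifiers.)
BLOCKLIST = {"badword"}

def detect(text: str) -> bool:
    # Tokenize on alphabetic runs, as a simple word-based model might.
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)

original = "this is a badword example"
evaded = original.replace(" ", "")  # whitespace-removal evasion

print(detect(original))  # True: "badword" appears as a token
print(detect(evaded))    # False: the text collapses into one unseen token
```

Because the evaded string tokenizes to the single unknown word "thisisabadwordexample", the match fails even though the offending content is still readable to a human, which is what makes such low-effort lexical attacks effective against word-based models.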
Khieu, Brian Tuan, "TSAR : A System for Defending Hate Speech Detection Models Against Adversaries" (2019). Master's Projects. 740.