No "Love" Lost: Defending Hate Speech Detection Models Against Adversaries
Publication Date
1-1-2020
Document Type
Conference Proceeding
Publication Title
2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM)
DOI
10.1109/IMCOM48794.2020.9001767
Abstract
Although current state-of-the-art hate speech detection models achieve praiseworthy results, these models have shown themselves to be vulnerable to attacks. Easy-to-execute lexical evasion schemes, such as removing the whitespace from a given text, create significant issues for word-based hate speech detection models. In this paper, we reproduce the results of five cutting-edge models as well as four significant evasion schemes from prior work. These schemes must preserve readability, which enables us to recover the original data. We present several new defenses that leverage this need for preserved meaning and readability, and these defenses perform on par with or exceed the results of adversarial retraining. Furthermore, we demonstrate that each lexical attack or evasion scheme can be overcome with our new defense mechanisms, some of which reduce the effectiveness of the scheme to a mere 0.1 to 0.01 drop in F1 score. We also propose a new evasion scheme that outperforms those in previous work, along with a corresponding defense. Using our results as a foundation, we contend that hate speech detection models can be defended against lexically morphed data without the need for significant retraining. Our work suggests that by utilizing the requirement for preserved meaning, one can create a suitable defense against evasion schemes with a high reversal rate.
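To make the abstract's core idea concrete, the following is a minimal sketch, not the authors' implementation, of one attack/defense pair it mentions: whitespace removal as an evasion, and its reversal by dictionary-based word segmentation (dynamic programming that prefers the fewest words). The vocabulary `VOCAB` and the helper names `remove_whitespace` and `segment` are hypothetical; a practical defense would need a large lexicon, ideally with word frequencies, to resolve ambiguous splits.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

# Hypothetical toy vocabulary (assumption); a real defense needs a large lexicon.
VOCAB = {"i", "love", "hate", "you", "all", "of"}


def remove_whitespace(text: str) -> str:
    """Evasion sketch: join all words so a word-based model sees one unknown token."""
    return "".join(text.split())


def segment(blob: str) -> Optional[List[str]]:
    """Defense sketch: split blob back into VOCAB words, preferring the fewest words.

    Returns None if no full segmentation exists.
    """
    n = len(blob)

    @lru_cache(maxsize=None)
    def best(i: int) -> Optional[Tuple[str, ...]]:
        # Shortest tuple of vocabulary words covering blob[i:], or None.
        if i == n:
            return ()
        result: Optional[Tuple[str, ...]] = None
        for j in range(i + 1, n + 1):
            word = blob[i:j]
            if word in VOCAB:
                rest = best(j)
                if rest is not None and (result is None or 1 + len(rest) < len(result)):
                    result = (word,) + rest
        return result

    out = best(0)
    return list(out) if out is not None else None


if __name__ == "__main__":
    evaded = remove_whitespace("i hate you all")
    print(evaded)                      # ihateyouall
    print(" ".join(segment(evaded)))   # i hate you all
```

Because the evasion must keep the text readable, the original words are still present in order, which is what lets a reversal step like this restore input the detector was trained on.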
Funding Number
2018-2023
Keywords
adversarial attacks, deep learning, lexical attacks, machine learning, social media
Department
Computer Science
Recommended Citation
Melody Moh, Teng Sheng Moh, and Brian Khieu. "No 'Love' Lost: Defending Hate Speech Detection Models Against Adversaries" 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM) (2020). https://doi.org/10.1109/IMCOM48794.2020.9001767