No "Love" Lost: Defending Hate Speech Detection Models Against Adversaries

Publication Date

1-1-2020

Document Type

Conference Proceeding

Publication Title

2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM)

DOI

10.1109/IMCOM48794.2020.9001767

Abstract

Although current state-of-the-art hate speech detection models achieve praiseworthy results, these models have shown themselves to be vulnerable to attacks. Easy-to-execute lexical evasion schemes, such as removing the whitespace from a given text, create significant issues for word-based hate speech detection models. In this paper, we reproduce the results of five cutting-edge models as well as four significant evasion schemes from prior work. These schemes must maintain readability, which enables us to recreate the original data. We present several new defenses that leverage this need for preserved meaning and readability, and these defenses perform on par with or exceed the results of adversarial retraining. Furthermore, we demonstrate that each lexical attack or evasion scheme can be overcome with our new defense mechanisms, with some reducing the effectiveness of the scheme to a mere 0.1 to 0.01 drop in F1 score. We also propose a new evasion scheme that outperforms those in previous work, along with a corresponding defense. Using our results as a foundation, we contend that hate speech detection models can be defended against lexically morphed data without the need for significant retraining. Our work suggests that by exploiting the requirement for preserved meaning, one can create a suitable defense against evasion schemes with a high reversal rate.
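
To illustrate the whitespace-removal evasion and the kind of reversal defense the abstract describes, here is a minimal Python sketch. It is not the paper's implementation; the vocabulary, function names, and greedy longest-match segmentation strategy are illustrative assumptions.

    # Sketch of a whitespace-removal evasion and a dictionary-based
    # reversal defense. The vocabulary and the greedy longest-match
    # strategy are assumptions, not the paper's actual method.

    def remove_whitespace(text: str) -> str:
        """Evasion: strip spaces so word-based models see unknown tokens."""
        return text.replace(" ", "")

    def segment(text: str, vocab: set) -> list:
        """Defense: greedily recover a plausible word segmentation.

        Takes the longest vocabulary word at each position and falls back
        to a single character when nothing matches, so it always terminates.
        """
        words, i = [], 0
        while i < len(text):
            match = next(
                (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
                text[i],  # fallback: emit one character and move on
            )
            words.append(match)
            i += len(match)
        return words

    if __name__ == "__main__":
        vocab = {"no", "love", "lost", "for", "you"}
        evaded = remove_whitespace("no love lost for you")  # "nolovelostforyou"
        print(" ".join(segment(evaded, vocab)))  # "no love lost for you"

Because a successful evasion must remain readable to humans, the original token boundaries stay recoverable, which is the property a segmentation defense like this exploits.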

Funding Number

2018-2023

Keywords

adversarial attacks, deep learning, lexical attacks, machine learning, social media

Department

Computer Science
