HealthLies: Dataset and Machine Learning Models for Detecting Fake Health News

Publication Date

1-1-2022

Document Type

Conference Proceeding

Publication Title

Proceedings - IEEE 8th International Conference on Big Data Computing Service and Applications, BigDataService 2022

DOI

10.1109/BigDataService55688.2022.00008

First Page

1

Last Page

8

Abstract

Datasets and models for identifying fake health news are few and far between, and most focus on COVID-19. In this paper, we introduce a new health-news-specific dataset called HealthLies, which includes 11,001 facts and myths about diseases such as COVID-19, cancer, polio, Zika, HIV/AIDS, SARS, and Ebola, collected from a wide range of sources. We train several machine learning models, including KNN, SVM, Logistic Regression, Naive Bayes, an MLP Classifier, and a deep learning model based on the state-of-the-art Natural Language Processing (NLP) model BERT, which we name BERT-HealthLies. We find that BERT-HealthLies typically achieves the highest accuracy of all the models, though other models may be preferable in some real-time applications because their prediction and training times are orders of magnitude faster. In addition, ensembling BERT-HealthLies with the other models performs up to 12% better than BERT-HealthLies alone when identifying fake news about a new disease for which we do not yet have training data.
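
The abstract does not say how the ensemble combines the individual models. As a rough illustration only, the sketch below assumes hard majority voting over each model's predicted label, with ties resolved in favor of the earliest-listed model. The function name `majority_vote` and the labels `"fake"` and `"real"` are hypothetical and do not come from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by hard majority vote.

    predictions: list of lists, one inner list per model, all the same
    length. Ties go to the earliest-listed model among the tied labels
    (an assumption made for this sketch, not the paper's rule).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("every model must predict the same number of items")

    ensembled = []
    # zip(*predictions) yields, for each item, the tuple of all models' votes.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top_count = max(counts.values())
        # Walk votes in model order so the earliest model wins any tie.
        winner = next(v for v in votes if counts[v] == top_count)
        ensembled.append(winner)
    return ensembled


# Hypothetical predictions from three models on three statements.
model_outputs = [
    ["fake", "real", "fake"],  # e.g. a BERT-based model, listed first
    ["fake", "fake", "real"],  # e.g. logistic regression
    ["real", "fake", "real"],  # e.g. naive Bayes
]
combined = majority_vote(model_outputs)
```

Listing the strongest model first lets it decide ties, which is one simple way to lean on it while still letting the other models outvote it when they agree.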

Keywords

Fake Health News, Fake News, HealthLies, Machine Learning, NLP

Department

Computer Engineering
