HealthLies: Dataset and Machine Learning Models for Detecting Fake Health News

Publication Date

1-1-2022

Document Type

Conference Proceeding

Publication Title

Proceedings - IEEE 8th International Conference on Big Data Computing Service and Applications, BigDataService 2022

DOI

10.1109/BigDataService55688.2022.00008

First Page

1

Last Page

8

Abstract

Datasets and models for identifying fake health news are scarce, and most of those that exist focus on COVID-19. In this paper, we introduce a new health-news-specific dataset called HealthLies, which includes 11,001 facts and myths about diseases such as COVID-19, cancer, polio, Zika, HIV/AIDS, SARS, and Ebola, collected from a wide range of sources. We train several machine learning models, including KNN, SVM, Logistic Regression, Naive Bayes, and an MLP classifier, as well as a deep learning model based on the state-of-the-art BERT natural language processing (NLP) model, which we name BERT-HealthLies. We find that BERT-HealthLies typically achieves the highest accuracy among the models, though other models may be preferable in some real-time applications because their prediction and training times are orders of magnitude faster. In addition, ensembling BERT-HealthLies with the other models performs up to 12% better than BERT-HealthLies alone when identifying fake news about a new disease for which no training data is yet available.
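
The abstract does not say how the ensemble combines its member models. As an illustration only, the sketch below assumes soft voting: each model outputs a probability that a claim is fake, and the ensemble averages those probabilities before thresholding. The function name `soft_vote`, the model keys, the 0.5 threshold, and all scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def soft_vote(probabilities, threshold=0.5):
    """Average per-model P(fake) scores and threshold the result.

    `probabilities` maps a model name to that model's estimated
    probability that the claim is fake. Soft voting is an assumed
    combination rule, not necessarily the one used in the paper.
    """
    if not probabilities:
        raise ValueError("need at least one model score")
    avg = mean(probabilities.values())
    label = "fake" if avg >= threshold else "real"
    return label, avg


# Hypothetical scores for a single claim; the values are made up.
scores = {"bert": 0.40, "svm": 0.70, "logreg": 0.65, "naive_bayes": 0.80}
label, avg = soft_vote(scores)
print(label, round(avg, 4))
```

In this made-up case the BERT-style score alone falls below the threshold, while the averaged score does not. That is one way an ensemble could differ from a single model on a disease absent from the training data, though the paper's actual mechanism may be different.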

Keywords

Fake Health News, Fake News, HealthLies, Machine Learning, NLP

Department

Computer Engineering
