HealthLies: Dataset and Machine Learning Models for Detecting Fake Health News
Proceedings - IEEE 8th International Conference on Big Data Computing Service and Applications, BigDataService 2022
Datasets and models for identifying fake health news are scarce, and most focus on COVID-19. In this paper, we introduce HealthLies, a new health-news dataset of 11,001 facts and myths about diseases such as COVID-19, cancer, polio, Zika, HIV/AIDS, SARS, and Ebola, collected from a wide range of sources. We train several machine learning models, including KNN, SVM, Logistic Regression, Naive Bayes, and an MLP classifier, as well as a deep learning model based on the state-of-the-art Natural Language Processing (NLP) model BERT, which we name BERT-HealthLies. We find that BERT-HealthLies typically achieves the highest accuracy of these models, though other models may be preferable in some real-time applications because their prediction and training times are orders of magnitude faster. In addition, ensembling BERT-HealthLies with the other models performs up to 12% better than BERT-HealthLies alone when identifying fake news about a new disease for which we do not yet have training data.
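The abstract does not say how the ensemble combines its member models. As an illustration only, the sketch below assumes soft voting: each model outputs a probability that a claim is fake, and the ensemble averages those probabilities and thresholds the mean. The function name `soft_vote`, the model keys, and all scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def soft_vote(probabilities: Dict[str, List[float]], threshold: float = 0.5) -> List[int]:
    """Average each model's P(fake) per claim and threshold the mean.

    Assumption: soft voting is one common ensembling scheme; the paper's
    actual combination rule may differ.
    """
    model_outputs = list(probabilities.values())
    if not model_outputs:
        raise ValueError("need at least one model")
    n_claims = len(model_outputs[0])
    if any(len(p) != n_claims for p in model_outputs):
        raise ValueError("all models must score the same number of claims")

    labels = []
    for i in range(n_claims):
        # Mean probability of "fake" across all models for claim i.
        mean_p = sum(p[i] for p in model_outputs) / len(model_outputs)
        labels.append(1 if mean_p >= threshold else 0)  # 1 = fake, 0 = real
    return labels


# Hypothetical per-model scores for three claims (not real results).
scores = {
    "bert": [0.9, 0.4, 0.2],
    "svm": [0.8, 0.7, 0.1],
    "naive_bayes": [0.7, 0.6, 0.3],
}
predictions = soft_vote(scores)
```

In this toy input, the second claim is one the hypothetical BERT score alone would label as real (0.4), while the averaged score crosses the threshold. That shows how an ensemble's decision can differ from the decision of its strongest single member.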
Fake Health News, Fake News, HealthLies, Machine Learning, NLP
Garima Chaphekar and Jorjeta G. Jetcheva. "HealthLies: Dataset and Machine Learning Models for Detecting Fake Health News." Proceedings - IEEE 8th International Conference on Big Data Computing Service and Applications, BigDataService 2022 (2022): 1-8. https://doi.org/10.1109/BigDataService55688.2022.00008