An Advanced Malware Detection System Based on NLP to Generate Genetic Markers
Publication Date
1-1-2024
Document Type
Conference Proceeding
Publication Title
Digest of Technical Papers - IEEE International Conference on Consumer Electronics
DOI
10.1109/ICCE59016.2024.10444433
Abstract
Malware researchers are increasingly employing machine learning to counter sophisticated obfuscation techniques in malware activities. However, relying solely on machine learning models, especially with limited malware samples, is suboptimal. To enhance model effectiveness, careful data preprocessing is essential. To aid machine learning models in their learning process, we disassemble malware binary files into mnemonic opcode sequences and leverage insights from Natural Language Processing, specifically Word2Vec and Doc2Vec. This methodology generates 'genetic markers' proficient at capturing inherent malware attributes, effectively bypassing opcode obfuscation. Using various machine learning models (Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest, Multi-layer Perceptron, and Convolutional biLSTM), we achieve promising results in distinguishing among 20 discrete malware families. Our findings reveal that employing customized Word2Vec and Doc2Vec models for distinct malware family datasets enhances the precision of the 'genetic markers.' This refinement boosts the learning capabilities of the machine learning models, resulting in a robust detection rate of 99%.
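The preprocessing idea in the abstract (reduce a disassembled binary to its mnemonic opcode sequence, then treat that sequence like a sentence) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the authors' pipeline: the tab-separated, objdump-style listing format, the helper names `extract_opcodes` and `skipgram_pairs`, and the context window size. The abstract does not state which disassembler or which Word2Vec/Doc2Vec settings were used.

```python
# Illustrative sketch only: the listing format, names, and window size are assumptions.

# Hypothetical objdump-style listing: "<address>:<TAB><hex bytes><TAB><mnemonic> <operands>"
SAMPLE = (
    "  401000:\t55\tpush   %ebp\n"
    "  401001:\t89 e5\tmov    %esp,%ebp\n"
    "  401003:\t31 c0\txor    %eax,%eax\n"
    "  401005:\tc3\tret\n"
)


def extract_opcodes(disassembly: str) -> list[str]:
    """Return the first token of each instruction field, dropping operands."""
    opcodes = []
    for line in disassembly.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not an instruction line in the assumed format
        fields = parts[2].split()
        if fields:
            opcodes.append(fields[0])
    return opcodes


def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Build (target, context) pairs, the kind of input skip-gram training consumes."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


if __name__ == "__main__":
    opcodes = extract_opcodes(SAMPLE)
    print(opcodes)
    print(skipgram_pairs(opcodes, window=1))
```

Because operands such as registers and addresses are discarded, two samples that differ only in those details yield the same opcode "sentence". In practice an embedding library would learn the vectors from such sequences; the pair generation is shown here only to make the analogy with word contexts concrete.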
Keywords
Embedding Vectors, Machine Learning, Malware Detection, Natural Language Processing
Department
Computer Science
Recommended Citation
Quang Duy Tran, Jaehyun Lim, and Fabio Di Troia. "An Advanced Malware Detection System Based on NLP to Generate Genetic Markers" Digest of Technical Papers - IEEE International Conference on Consumer Electronics (2024). https://doi.org/10.1109/ICCE59016.2024.10444433