An Advanced Malware Detection System Based on NLP to Generate Genetic Markers

Publication Date

1-1-2024

Document Type

Conference Proceeding

Publication Title

Digest of Technical Papers - IEEE International Conference on Consumer Electronics

DOI

10.1109/ICCE59016.2024.10444433

Abstract

Malware researchers are increasingly employing machine learning to counter sophisticated obfuscation techniques in malware activities. However, relying solely on machine learning models, especially with limited malware samples, is suboptimal. To enhance model effectiveness, careful data preprocessing is essential. To aid machine learning models in their learning process, we disassemble malware binary files into mnemonic opcode sequences and apply insights from Natural Language Processing, specifically Word2Vec and Doc2Vec. This methodology generates 'genetic markers' that capture inherent malware attributes, effectively bypassing opcode obfuscation. Using various machine learning models (Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest, Multi-layer Perceptron, and Convolutional biLSTM), we achieve promising results in distinguishing among 20 discrete malware families. Our findings reveal that employing customized Word2Vec and Doc2Vec models for distinct malware family datasets enhances the precision of the 'genetic markers.' This refinement boosts the learning capabilities of the machine learning models, resulting in a robust detection rate of 99%.
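
The pipeline described in the abstract (opcode sequences, then embedding vectors, then a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. To stay dependency-free, a count-based co-occurrence embedding stands in for Word2Vec, averaging opcode vectors stands in for Doc2Vec, and a nearest-centroid rule stands in for the classifiers listed above. The opcode sequences, family names, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy opcode sequences; real input would come from a disassembler.
SAMPLES = {
    "family_a": [
        ["push", "mov", "call", "pop", "ret"],
        ["push", "mov", "call", "call", "pop", "ret"],
    ],
    "family_b": [
        ["xor", "add", "jmp", "xor", "add", "jmp"],
        ["xor", "add", "add", "jmp", "xor", "jmp"],
    ],
}


def cooccurrence_vectors(sequences, window=2):
    """Map each opcode to counts of its neighbours within `window` positions.

    Assumption: this count-based embedding is a simple stand-in for Word2Vec.
    """
    vocab = sorted({op for seq in sequences for op in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[op][seq[j]] += 1
    return {op: [counts[op][other] for other in vocab] for op in vocab}


def marker(sequence, vectors):
    """Average the opcode vectors of one sample into a fixed-length vector.

    Assumption: averaging is a simple stand-in for a Doc2Vec document vector.
    """
    dim = len(next(iter(vectors.values())))
    known = [vectors[op] for op in sequence if op in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(samples):
    """Build opcode vectors from all samples and one centroid per family."""
    all_sequences = [seq for seqs in samples.values() for seq in seqs]
    vectors = cooccurrence_vectors(all_sequences)
    centroids = {}
    for family, seqs in samples.items():
        markers = [marker(seq, vectors) for seq in seqs]
        centroids[family] = [sum(col) / len(markers) for col in zip(*markers)]
    return vectors, centroids


def classify(sequence, vectors, centroids):
    """Return the family whose centroid is most similar to the sample marker."""
    m = marker(sequence, vectors)
    return max(centroids, key=lambda family: cosine(m, centroids[family]))


VECTORS, CENTROIDS = train(SAMPLES)
```

Calling `classify(["push", "mov", "call", "ret"], VECTORS, CENTROIDS)` assigns an unseen sequence to the family with the closest centroid. The sketch only shows the shape of the data flow. It says nothing about robustness to obfuscation or about the reported detection rate, both of which depend on the trained embedding models and the dataset used in the paper.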

Keywords

Embedding Vectors, Machine Learning, Malware Detection, Natural Language Processing

Department

Computer Science
