Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo
Journal of Computer Virology and Hacking Techniques
Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte n-grams, among many others. In this research, we consider opcode features and apply word embedding techniques (specifically, Word2Vec, HMM2Vec, BERT, and ELMo) as a feature engineering step. The resulting embedding vectors are then used as features for classification algorithms: support vector machines (SVM), k-nearest neighbors (kNN), random forests (RF), and convolutional neural networks (CNN). We conduct substantial experiments involving seven malware families, extending beyond previous related work in this field. We show that we can obtain slightly better performance than comparable previous work, with significantly faster model training times.
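The two-stage pipeline described in the abstract (embed opcodes, then classify on the embedding vectors) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: a simple co-occurrence-count embedding stands in for Word2Vec, HMM2Vec, BERT, and ELMo, a cosine-similarity kNN stands in for the classifiers, and the opcode sequences, the family labels `family_a` and `family_b`, and all function names are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_embeddings(sequences, window=2):
    """Map each opcode to a vector of co-occurrence counts within a context window.

    Stand-in for a learned word embedding such as Word2Vec (assumption).
    """
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0.0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[op][index[seq[j]]] += 1.0
    return vectors


def sample_vector(seq, vectors):
    """Average the embeddings of a sample's opcodes; unknown opcodes are skipped."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[op] for op in seq if op in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vecs, train_labels, k=1):
    """Majority vote among the k training samples most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up families with distinct opcode patterns.
train_sequences = [
    ["mov", "push", "call", "mov", "push", "call"],
    ["push", "call", "mov", "push", "call", "mov"],
    ["xor", "add", "jmp", "xor", "add", "jmp"],
    ["add", "jmp", "xor", "add", "jmp", "xor"],
]
train_labels = ["family_a", "family_a", "family_b", "family_b"]

embeddings = cooccurrence_embeddings(train_sequences)
train_vecs = [sample_vector(seq, embeddings) for seq in train_sequences]

query_vec = sample_vector(["mov", "push", "call", "push"], embeddings)
predicted = knn_predict(query_vec, train_vecs, train_labels, k=1)
print(predicted)
```

Separating the embedding step from the classifier keeps the two stages independent, so a different embedding or a different classifier could be substituted without changing the rest of the sketch.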
Aparna Sunil Kale, Vinay Pandya, Fabio Di Troia, and Mark Stamp. "Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo." Journal of Computer Virology and Hacking Techniques (2023): 1-16. https://doi.org/10.1007/s11416-022-00424-3