Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo

Publication Date

3-1-2023

Document Type

Article

Publication Title

Journal of Computer Virology and Hacking Techniques

Volume

19

Issue

1

DOI

10.1007/s11416-022-00424-3

First Page

1

Last Page

16

Abstract

Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte n-grams, among many others. In this research, we consider opcode features and apply word embedding techniques (specifically Word2Vec, HMM2Vec, BERT, and ELMo) as a feature engineering step. The resulting embedding vectors are then used as features for classification algorithms: support vector machines (SVM), k-nearest neighbors (kNN), random forests (RF), and convolutional neural networks (CNN). We conduct substantial experiments involving seven malware families, extending beyond previous related work in this field. We show that we can obtain slightly better performance than in comparable previous work, with significantly faster model training times.
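
The pipeline described in the abstract (opcode sequences, then per-opcode embedding vectors, then a classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy opcode sequences and family labels are invented, a simple co-occurrence count embedding stands in for Word2Vec, HMM2Vec, BERT, or ELMo, and averaging opcode vectors into one feature vector per sample is an assumed pooling step, since the abstract does not say how sample-level features are built. The classifier is a plain 1-nearest-neighbor vote with cosine similarity, not the authors' tuned kNN.

```python
from collections import Counter
import math

# Toy opcode sequences with family labels (hypothetical data, not from the paper).
SAMPLES = [
    (["push", "mov", "call", "pop", "ret", "push", "mov", "call"], "familyA"),
    (["push", "mov", "call", "pop", "push", "mov", "call", "ret"], "familyA"),
    (["xor", "add", "jmp", "xor", "add", "cmp", "jmp", "xor"], "familyB"),
    (["xor", "add", "cmp", "jmp", "xor", "add", "jmp", "add"], "familyB"),
]


def build_embeddings(sequences, window=2):
    """Embed each opcode as its co-occurrence counts within a sliding window.

    This is a stand-in for a learned embedding such as Word2Vec.
    """
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    counts = {op: [0.0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[op][index[seq[j]]] += 1.0
    return counts


def sample_vector(seq, embeddings):
    """Average the opcode vectors of one sample (assumed pooling step)."""
    known = [embeddings[op] for op in seq if op in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(seq, train, embeddings, k=1):
    """Label a sequence by majority vote among its k most similar training samples."""
    query = sample_vector(seq, embeddings)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, sample_vector(item[0], embeddings)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


embeddings = build_embeddings([seq for seq, _ in SAMPLES])
prediction = knn_predict(["mov", "call", "pop", "ret"], SAMPLES, embeddings)
print(prediction)
```

In a realistic setting the embedding step would be replaced by a trained model and the classifier by a library implementation evaluated on held-out data; the sketch only shows how the embedding vectors feed the classification stage.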

Department

Computer Science
