BERT for Malware Classification
Publication Date
1-1-2022
Document Type
Contribution to a Book
Publication Title
Advances in Information Security
Volume
54
DOI
10.1007/978-3-030-97087-1_7
First Page
161
Last Page
181
Abstract
In this paper, we accomplish malware classification using word embeddings. Specifically, we train machine learning models on word embeddings generated by BERT, extracting the "words" directly from the malware samples to achieve multi-class classification. The attention mechanism of a pre-trained BERT model can be used in malware classification by capturing information about the relation between each opcode and every other opcode belonging to a specific malware family. As a means of comparison, we repeat the same experiments with Word2Vec. Unlike BERT, Word2Vec generates word embeddings in which words with similar context are closer together, enabling classification of malware samples based on similarity. As classification algorithms, we use and compare Support Vector Machines (SVM), Logistic Regression, Random Forests, and Multi-Layer Perceptrons (MLP). We find that the word embeddings generated by BERT are effective for detecting malware samples and yield higher classification accuracy than those created by Word2Vec.
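The similarity-based classification the abstract attributes to Word2Vec-style embeddings can be sketched in a minimal, self-contained way. The sketch below is illustrative only and does not reproduce the chapter's models: the embedding vectors, family names, and the nearest-centroid rule are all hypothetical stand-ins for real opcode embeddings (whether produced by BERT or Word2Vec) and the SVM/Logistic Regression/Random Forest/MLP classifiers actually evaluated.

```python
import math

# Toy sketch: classify malware samples represented as fixed-length
# embedding vectors (e.g., averaged opcode embeddings) by cosine
# similarity to per-family centroids. All vectors and family names
# below are invented for illustration.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(sample, family_centroids):
    """Assign the sample to the family whose centroid is most similar."""
    return max(family_centroids,
               key=lambda fam: cosine(sample, family_centroids[fam]))

# Hypothetical training embeddings for two made-up malware families.
train = {
    "family_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "family_b": [[0.1, 0.9, 0.8], [0.0, 0.8, 0.9]],
}
centroids = {fam: centroid(vecs) for fam, vecs in train.items()}

print(classify([0.85, 0.15, 0.05], centroids))  # prints "family_a"
```

In practice, the embeddings would come from a pre-trained model applied to opcode sequences, and a trained classifier would replace the nearest-centroid rule; this sketch only illustrates the "similar context means closer vectors" intuition described in the abstract.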
Department
Computer Science
Recommended Citation
Joel Alvares and Fabio Di Troia. "BERT for Malware Classification." Advances in Information Security (2022): 161-181. https://doi.org/10.1007/978-3-030-97087-1_7