Malware Detection through Contextualized Vector Embeddings
Publication Date
1-1-2023
Document Type
Conference Proceeding
Publication Title
2023 Silicon Valley Cybersecurity Conference, SVCC 2023
DOI
10.1109/SVCC56964.2023.10165170
Abstract
Detecting malware is an integral part of system security. In recent years, machine learning models have been applied successfully to this challenging problem. The aim of this research is to apply context-dependent word embeddings to malware classification. We extract opcodes from the malware samples and use them to generate the embeddings on which the classifiers are trained. The transformer is a novel architecture that uses self-attention to handle long-range dependencies. Different transformer architectures, namely BERT, DistilBERT, ALBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. In addition to transformer models, we also experimented with ELMo, a bidirectional language model that can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models to classify samples from different malware families. We compared our contextualized results with context-free embeddings generated by the Word2Vec and HMM2Vec algorithms. The classifiers trained on our embeddings are a ResNet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NN).
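The difference between context-free and context-dependent opcode embeddings can be illustrated with a small sketch. This is not the authors' implementation and it uses none of the models named in the abstract: the opcode vocabulary, the randomly initialized vectors, the dimension, and the single round of unparameterized self-attention are all assumptions chosen to keep the example dependency-free. It only shows that a fixed lookup table assigns an opcode the same vector everywhere, whereas an attention-based encoder yields a vector that depends on the surrounding opcodes.

```python
import math
import random

# Hypothetical toy vocabulary; real opcodes would come from disassembled samples.
VOCAB = ["mov", "push", "pop", "call", "add", "jmp"]
DIM = 4  # illustrative embedding size, not taken from the paper

rng = random.Random(0)
# Context-free table: one fixed vector per opcode (the role Word2Vec or HMM2Vec plays).
# The vectors are random stand-ins here, not trained embeddings.
STATIC = {op: [rng.uniform(-1, 1) for _ in range(DIM)] for op in VOCAB}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_embeddings(opcodes):
    """One round of unparameterized scaled dot-product self-attention.

    Each output vector is a weighted average of every static vector in the
    sequence, so it changes when the neighbouring opcodes change.
    """
    vecs = [STATIC[op] for op in opcodes]
    out = []
    for q in vecs:
        weights = softmax([dot(q, k) / math.sqrt(DIM) for k in vecs])
        out.append([sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(DIM)])
    return out


def sequence_embedding(opcodes):
    """Mean-pool the per-opcode vectors into one fixed-length feature vector."""
    ctx = contextual_embeddings(opcodes)
    return [sum(v[i] for v in ctx) / len(ctx) for i in range(DIM)]


# The same opcode ("mov", at index 1) in two different opcode sequences.
seq_a = ["push", "mov", "call", "pop"]
seq_b = ["add", "mov", "jmp", "jmp"]
mov_a = contextual_embeddings(seq_a)[1]
mov_b = contextual_embeddings(seq_b)[1]
features_a = sequence_embedding(seq_a)
```

Here `STATIC["mov"]` is identical in both sequences, while `mov_a` and `mov_b` differ because each is mixed with its neighbours. A pooled vector such as `features_a` is the kind of fixed-length input that a classifier like Random Forest, an SVM, or k-NN could consume; how the paper pools or reshapes its embeddings is not stated in the abstract.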
Keywords
ALBERT, BERT, ELMo, Malware detection, RoBERTa, Transformer
Department
Computer Science
Recommended Citation
Vinay Pandya and Fabio Di Troia. "Malware Detection through Contextualized Vector Embeddings" 2023 Silicon Valley Cybersecurity Conference, SVCC 2023 (2023). https://doi.org/10.1109/SVCC56964.2023.10165170