Malware Detection through Contextualized Vector Embeddings

Publication Date

1-1-2023

Document Type

Conference Proceeding

Publication Title

2023 Silicon Valley Cybersecurity Conference, SVCC 2023

DOI

10.1109/SVCC56964.2023.10165170

Abstract

Detecting malware is an integral part of system security. In recent years, machine learning models have been applied successfully to this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from malware samples and use them to generate the embeddings that train the classifiers. Transformers are a neural network architecture that uses self-attention to handle long-range dependencies. Several transformer architectures, namely BERT, DistilBERT, ALBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. In addition to the transformer models, we also experiment with ELMo, a bidirectional language model that can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models to classify samples from different malware families. We compare our contextualized results with context-free embeddings generated by the Word2Vec and HMM2Vec algorithms. The classifiers trained on our embeddings include a ResNet-18 CNN, Random Forests, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).
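
The pipeline described above can be illustrated with a minimal sketch: treat each sample's opcode sequence as text, obtain contextualized embeddings from a pretrained transformer, and train a downstream classifier on the pooled vectors. The opcode sequences, family labels, model name ("bert-base-uncased"), and mean-pooling step below are illustrative assumptions, not the authors' exact preprocessing, model variants, or training setup.

```python
# Hypothetical sketch: contextualized opcode embeddings fed to an SVM.
# Assumes the `transformers`, `torch`, and `scikit-learn` packages.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy opcode sequences (space-separated mnemonics) with placeholder family labels.
samples = [
    "mov push call pop ret",
    "xor jmp cmp jne add",
    "push mov sub call ret",
    "lea xor test jz mov",
]
labels = [0, 0, 1, 1]

# A generic pretrained BERT stands in for the BERT / DistilBERT / ALBERT /
# RoBERTa variants compared in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(opcode_seq: str) -> torch.Tensor:
    """Mean-pool the last hidden layer to get one vector per sample."""
    inputs = tokenizer(opcode_seq, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Embed all samples, then train and evaluate a simple SVM classifier.
X = torch.stack([embed(s) for s in samples]).numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The same pooled vectors could equally be passed to the other classifiers mentioned in the abstract (Random Forest, k-NN, or a ResNet-18 CNN on reshaped embeddings); the SVM here is just one of those options.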

Keywords

ALBERT, BERT, ELMo, Malware detection, RoBERTa, Transformer

Department

Computer Science
