Publication Date
Spring 2022
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Fabio Di Troia
Second Advisor
Mark Stamp
Third Advisor
Katerina Potika
Keywords
Contextualized Embeddings, Transformer models, BERT, Bidirectional Language Models, ELMo, GloVe, Word2vec, fastText, Optuna
Abstract
Malware classification, the task of distinguishing between different types of malware, is an integral part of system security. The aim of this project is to use context-dependent word embeddings to classify malware. The Transformer is a novel architecture that uses self-attention to handle long-range dependencies. Transformers are particularly effective in many complex natural language processing tasks, such as Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Different transformer architectures, namely BERT, DistilBERT, ALBERT, and RoBERTa, are used to generate context-dependent word embeddings. These embeddings help in classifying malware samples based on their similarity and context.
Apart from transformer models, we also experimented with bidirectional language models such as ELMo, which can generate contextualized opcode embeddings. This project also discusses algorithms for generating embeddings for byte-level N-grams. We use the Word2vec, GloVe, and fastText algorithms to generate context-free embeddings. The classification algorithms employed in this project are a ResNet-101 CNN, Random Forest, Support Vector Machines (SVM), and k-nearest neighbours. Transformer models sometimes act as black boxes, which makes it difficult to understand their decisions. Various interpretable models are used to explain the inner workings of the transformer models and to improve our understanding of their results.
Recommended Citation
Pandya, Vinay, "Contextualized Vector Embeddings for Malware Detection" (2022). Master's Projects. 1083.
DOI: https://doi.org/10.31979/etd.rjra-9c8m
https://scholarworks.sjsu.edu/etd_projects/1083